From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 8259 invoked by alias); 9 Aug 2008 14:52:02 -0000
Received: (qmail 8247 invoked by uid 22791); 9 Aug 2008 14:52:01 -0000
X-Spam-Check-By: sourceware.org
Received: from mail.codesourcery.com (HELO mail.codesourcery.com) (65.74.133.4) by sourceware.org (qpsmtpd/0.31) with ESMTP; Sat, 09 Aug 2008 14:51:16 +0000
Received: (qmail 30571 invoked from network); 9 Aug 2008 14:51:14 -0000
Received: from unknown (HELO orlando.local) (pedro@127.0.0.2) by mail.codesourcery.com with ESMTPA; 9 Aug 2008 14:51:14 -0000
From: Pedro Alves
To: "John David Anglin"
Subject: Re: ttrace: Protocal error
Date: Sat, 09 Aug 2008 14:52:00 -0000
User-Agent: KMail/1.9.9
Cc: gdb-patches@sourceware.org
References: <20080808201457.3064B4EBE@hiauly1.hia.nrc.ca>
In-Reply-To: <20080808201457.3064B4EBE@hiauly1.hia.nrc.ca>
MIME-Version: 1.0
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
Message-Id: <200808091551.17013.pedro@codesourcery.com>
X-IsSubscribed: yes
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id:
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: gdb-patches-owner@sourceware.org
X-SW-Source: 2008-08/txt/msg00239.txt.bz2

On Friday 08 August 2008 21:14:56, John David Anglin wrote:
> > Hmmm, the thread seems to have exited but state_ is still 1.
>
> The patch below seems to fix the problem.  I was finally able
> to catch an abort in vla6.f90.
>
> Ok?
>
> 	* inf-ttrace.c (inf_ttrace_resume_callback): Don't resume dying
> 	thread.
>
> Index: inf-ttrace.c
> ===================================================================
> RCS file: /cvs/src/src/gdb/inf-ttrace.c,v
> retrieving revision 1.30
> diff -u -3 -p -r1.30 inf-ttrace.c
> --- inf-ttrace.c	9 Jul 2008 22:23:05 -0000	1.30
> +++ inf-ttrace.c	8 Aug 2008 19:57:13 -0000
> @@ -787,7 +804,9 @@ inf_ttrace_kill (void)
>  static int
>  inf_ttrace_resume_callback (struct thread_info *info, void *arg)
>  {
> -  if (!ptid_equal (info->ptid, inferior_ptid))
> +  if (!ptid_equal (info->ptid, inferior_ptid)
> +      && !((struct inf_ttrace_private_thread_info *)info->private)->dying
> +      && !is_exited (info->ptid))
>      {
>        pid_t pid = ptid_get_pid (info->ptid);
>        lwpid_t lwpid = ptid_get_lwp (info->ptid);

From your original backtrace, I'm still puzzled about how you got here.
When we tag the thread as dying, we return TARGET_WAITKIND_SPURIOUS,
which triggers a resume.  That first resume should already have removed
the thread from the thread list.  If you were really in a
TARGET_WAITKIND_SYSCALL_ENTRY, this would be at least the second resume
after the lwp exit.  Maybe I am reading the backtrace wrong, though.  If
you have the patience, showing what GDB outputs on a "run" after
"set debug infrun 1" would help.

Looking again at the TTEVT_LWP_EXIT event, which is what tags the thread
as dying: when we get it, the lwp is not dead yet:

"TTEVT_LWP_EXIT
This event flag indicates that the debugger wants to be notified when a
thread is exiting via the lwp_exit() system call.  The thread stops upon
entry to the system call."

That's why we resume it immediately here, in inf_ttrace_wait ():
      case TTEVT_LWP_EXIT:
	if (print_thread_events)
	  printf_unfiltered (_("[%s exited]\n"), target_pid_to_str (ptid));
	ti = find_thread_pid (ptid);
	gdb_assert (ti != NULL);
	((struct inf_ttrace_private_thread_info *)ti->private)->dying = 1;
	inf_ttrace_num_lwps--;
(1)	ttrace (TT_LWP_CONTINUE, ptid_get_pid (ptid),
		ptid_get_lwp (ptid), TT_NOPC, 0, 0);
	/* If we don't return -1 here, core GDB will re-add
	   the thread.  */
	ptid = minus_one_ptid;
	break;
...
  /* Make sure all threads within the process are stopped.  */
(2)  if (ttrace (TT_PROC_STOP, tts.tts_pid, 0, 0, 0, 0) == -1)
    perror_with_name (("ttrace"));

  return ptid;
}

It seems to me that, for some reason, in most cases the inferior was
slow enough that by the time we reach (2), the dying thread hadn't
exited yet.  The TT_PROC_STOP call stops all lwps of the process, the
dying one included, I would think.  In that case, you still need the
resume on the dying thread in inf_ttrace_wait.  Otherwise, you *may* get
this bug back, depending on how the OS wakes waiting processes:

http://www.cygwin.com/ml/gdb-patches/2007-09/msg00238.html

"So, if the dying thread is stopped, it should be resumed one last time.
Otherwise, any other thread waiting for its death in pthread_join would
be blocked forever (e.g., in the attachment, a simple program which
freezes when it is run under GDB)."

That would explain why you trip on this bug while Jerome didn't, I
guess.  Maybe your testcase triggers the race more reliably.

The way this is solved currently is also racy.  We resume all threads,
dying threads included, assume they will die shortly, and so immediately
delete dying threads after resuming:

  if (ptid_equal (ptid, minus_one_ptid))
    {
      /* Let all the other threads run too.  */
      iterate_over_threads (inf_ttrace_resume_callback, NULL);
****  iterate_over_threads (inf_ttrace_delete_dying_threads_callback, NULL);
    }

It could happen, under stress, that GDB handles another event and stops
all lwps *before* the dying lwp has had a chance to exit.
The symptom would again be the one Jerome was fixing.  Having no idea
how long the thread still executes after a TTEVT_LWP_EXIT event, I
can't tell how likely this is to happen, say, causing spurious testsuite
failures.

The ttrace API doesn't have an "lwp is really gone from the OS tables"
event (sigh), but it does have a TTEVT_LWP_CREATE event.  This means we
can detect at that point whether the OS is reusing an lwpid.  So, to
minimise the possible race, how about:

- still try to resume a dying lwp, and ignore the errno you were
originally seeing in that case (only).

- on resume failure, delete the thread from GDB's thread table.

- if by any chance the lwp exits, the inferior spawns a new lwp, and the
OS reuses the lwpid of the lwp we knew was dying, we delete the dying
lwp and add the new one.  If the OS is reusing the id, the original lwp
has to be gone.  This is just an add_thread call, as the replacement is
already handled by it internally (*).

- if the thread is still alive but dying, let that show in
"info threads".  The linux pthread support implementation also does
this.

I think the race below still exists, but it happens to be innocuous:

- we knew an lwp was dying, but we didn't detect its exit; after we
resumed it, it really exited.
- another thread hits a breakpoint.
- yet another thread spawns a new thread, which reuses the dying
thread's lwpid.
- GDB is informed of the breakpoint hit.
- since events are handled sequentially, we will only notice the new
thread event after handling the breakpoint hit.
- the user resumes the inferior after inspecting what happened at the
breakpoint.
- GDB tries to resume the dying lwpid --- actually, this id belongs to a
new thread by now.  Since we were handling a breakpoint, we had stopped
the whole process, so this resume succeeds.  We still consider the lwp
to be dying.
- GDB now handles the TTEVT_LWP_CREATE event and adds the new thread to
GDB's thread list, which automatically gets rid of the dying lwp, due to
the PTID collision.
Also, when we detect a TTEVT_LWP_CREATE, TTEVT_LWP_EXIT or
TTEVT_LWP_TERMINATE, the ttrace docs indicate that only one lwp is
stopped.  There's no reason to stop all lwps and return
TARGET_WAITKIND_SPURIOUS, only to resume all lwps again.  This just adds
overhead and messes with the scheduling of the inferior more than
needed.  We could just resume the stopped lwp and return
TARGET_WAITKIND_IGNORE.  But then we wouldn't detect that the thread is
really gone until the next whole-inferior resume.

I guess we could also detect whether dying threads have already exited
after stopping the whole process, by sending them signal 0.  We can also
do that in inf_ttrace_thread_alive (like inf_ptrace_thread_alive does),
so "info threads" also gets rid of dying threads that have really died
by then.

Sorry, that came out longer than I was expecting.  Am I making any
sense?  I can work up a patch for this.

If I'm analysing the race correctly, an alternative, simpler solution
would be to just ignore a failed resume on a dying thread, but still try
to resume it.  Not resuming dying threads may be a problem.

-- 
Pedro Alves