From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 102402 invoked by alias); 24 Jun 2016 21:52:26 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 102385 invoked by uid 89); 24 Jun 2016 21:52:25 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.0 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy= X-HELO: relay1.mentorg.com Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Fri, 24 Jun 2016 21:52:20 +0000 Received: from svr-orw-fem-06.mgc.mentorg.com ([147.34.97.120]) by relay1.mentorg.com with esmtp id 1bGZ1N-0004CR-Vw from Don_Breazeal@mentor.com ; Fri, 24 Jun 2016 14:52:18 -0700 Received: from NA-MBX-02.mgc.mentorg.com ([169.254.2.247]) by SVR-ORW-FEM-06.mgc.mentorg.com ([147.34.97.120]) with mapi id 14.03.0224.002; Fri, 24 Jun 2016 14:52:17 -0700 From: "Breazeal, Don" To: Joel Brobecker CC: "gdb-patches@sourceware.org" Subject: RE: RFH: failed assert debugging threaded+fork program over gdbserver Date: Fri, 24 Jun 2016 21:52:00 -0000 Message-ID: References: <20160512171650.GC26324@adacore.com> <5734C06C.8040008@codesourcery.com> <20160623225935.GC3295@adacore.com> In-Reply-To: <20160623225935.GC3295@adacore.com> Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 X-SW-Source: 2016-06/txt/msg00410.txt.bz2 > -----Original Message----- > From: Joel Brobecker [mailto:brobecker@adacore.com] > Sent: Thursday, June 23, 2016 4:00 PM > To: Breazeal, Don > Cc: gdb-patches@sourceware.org > Subject: Re: RFH: failed assert debugging threaded+fork program over > gdbserver >=20 > Hello, >=20 > I finally had some time to look into this again. I think I have a > better understanding. >=20 > On Thu, May 12, 2016 at 10:42:04AM -0700, Don Breazeal wrote: > > On 5/12/2016 10:16 AM, Joel Brobecker wrote: > > > Hello, > > > > > > I have noticed the following problem, when debugging a program > which > > > uses both threads and fork. The program is attached in copy, and > it > > > was compiled by simply doing: > > > > > > % gnatmake -g a_test > > > > > > The issue appears only randomly, but it seems to show up fairly > > > reliably when using certain versions of GNU/Linux such as RHES7, > or > > > WRSLinux. I also see it on Ubuntu, but less reliably. Here is what > I > > > have found, debugging on WRSLinux (we set it up as a cross, but it > > > should be the same with native GNU/Linux distros): > > > > > > % gdb a_test > > > (gdb) break a_test.adb:30 > > > (gdb) break a_test.adb:39 > > > (gdb) target remote my_board:4444 > > > (gdb) continue > > > Continuing. > > > [...] > > > [New Thread 866.868] > > > [New Thread 866.869] > > > [New Thread 870.870] > > > /[...]/gdb/thread.c:89: internal-error: thread_info* > inferior_thread(): Assertion `tp' failed. > > > A problem internal to GDB has been detected, > > > further debugging may prove unreliable. > > > Quit this debugging session? (y or n) > > > > > > The error happens because GDBserver returns a list of threads to > GDB > > > where a new thread as a different PID (870 in the case above, > > > instead of 866). > > > > This is not supposed to happen. In > remote.c:remote_update_thread_list > > there is a call to remove_new_fork_children that is explicitly > > supposed to prevent this scenario. The new fork child thread should > > be deleted from the thread list ("context") before we call > remote_notice_new_inferior. > > > > We don't want the remote to report new threads related to the fork > > child until after we have handled the fork using > infrun.c:follow_fork. (Note: > > it looks like the function comment for remove_new_fork_children is > > stale, since there are scenarios other than being stopped at a fork > > catchpoint where this can occur.) > > > > > Any advice on how I should be fixing the issue? > > > > It looks like there is a case that remove_new_fork_children isn't > > handling correctly with your test+target(s). I'd start there to find > > out how the new thread is getting through. >=20 > I started by looking at remove_new_fork_children, which, in essence, > remove threads whose target_waitstatus->kind is fork or vforked. > And for that to happen, I think the remote (gdbserver) has to send GDB > a stop:fork packet, something like this: >=20 > Packet received: > T05fork:p4bc4.4bc4;06:0100000000000000;07:d0ddffffff7f0000;10:6a4f83f7 > ff7f0000;thread:p4ba5.4ba5;core:5; >=20 > However, this is not what happens, in our case. When continuing, we > just receive a SIGTRAP event, at which point GDB asks for the thread > list, and by then, the thread for the forked process is in the list, > leading to the situation described above. >=20 > Deciphering the gdbserver logs, I think I understand why this is > happening. After we're done single-stepping out of our breakpoint, and > we have resumed our program we go into a wait loop in > linux_wait_for_event_filtered, which, as I understand it, pulls event > from the kernel, but without doing a blocking wait (WNOHANG), and as > long as events are found, we keep waiting for more: >=20 > /* Always pull all events out of the kernel. We'll randomly select > an event LWP out of all that have events, to prevent > starvation. */ > while (event_child =3D=3D NULL) > { > [...] > ret =3D my_waitpid (-1, wstatp, options | WNOHANG); > if (ret > 0) > { > [...] > continue; > } >=20 > After resuming, we go through a couple of loops where I think we > discover the new threads that are getting started. Each time, we > resume the threads via: >=20 > /* Now that we've pulled all events out of the kernel, resume > LWPs that don't have an interesting event to report. */ > if (stopping_threads =3D=3D NOT_STOPPING_THREADS) > for_each_inferior (&all_threads, resume_stopped_resumed_lwps); >=20 > But shortly after, we get the following batch of events: >=20 > | leader_pid=3D2178, leader_lp!=3DNULL=3D1, num_lwps=3D3, zombie=3D0 > | sigsuspend'ing sigchld_handler my_waitpid (-1, 0x40000001) > my_waitpid > | (-1, 0x40000001): status(137f), 2190 > | LWFE: waitpid(-1, ...) returned 2190, ERRNO-OK > | LLW: waitpid 2190 received Stopped (signal) (stopped) my_waitpid (- > 1, > | 0x40000001) my_waitpid (-1, 0x40000001): status(57f), 2189 > | LWFE: waitpid(-1, ...) returned 2189, ERRNO-OK > | LLW: waitpid 2189 received Trace/breakpoint trap (stopped) pc is > | 0x1000336c > | CSBB: LWP 2178.2189 stopped by software breakpoint my_waitpid (-1, > | 0x40000001) my_waitpid (-1, 0x40000001): status(57f), 2188 > | LWFE: waitpid(-1, ...) returned 2188, ERRNO-OK > | LLW: waitpid 2188 received Trace/breakpoint trap (stopped) pc is > | 0x10003218 > | CSBB: LWP 2178.2188 stopped by software breakpoint my_waitpid (-1, > | 0x40000001) my_waitpid (-1, 0x40000001): status(1057f), 2178 > | LWFE: waitpid(-1, ...) returned 2178, ERRNO-OK > | LLW: waitpid 2178 received Trace/breakpoint trap (stopped) pc is > | 0xfe87158 > | HEW: Got fork event from LWP 2178, new child is 2190 <<<<--- fork! > | > | pc is 0xfe87158 > | my_waitpid (-1, 0x40000001) > | my_waitpid (-1, 0x40000001): status(ffffffff), 0 > | LWFE: waitpid(-1, ...) returned 0, ERRNO-OK >=20 > This tells me that we get 3 SIGTRAP events from our program's threads > hitting a breakpoint, *and* the fork event. The fork event is > confirmed by some logs printed right after: >=20 > | LWP 2178: extended event with waitstatus status->kind =3D forked Hit a > | non-gdbserver trap event. >=20 > Now that we've have interesting stuff to report, we decide to stop all > LWPS: >=20 > | >>>> entering stop_all_lwps > | stop_all_lwps (stop, except=3Dnone) > | wait_for_sigstop: pulling events > | my_waitpid (-1, 0x40000001) > | my_waitpid (-1, 0x40000001): status(ffffffff), 0 > | LWFE: waitpid(-1, ...) returned 0, ERRNO-OK leader_pid=3D2178, > | leader_lp!=3DNULL=3D1, num_lwps=3D3, zombie=3D0 leader_pid=3D2190, > | leader_lp!=3DNULL=3D1, num_lwps=3D1, zombie=3D0 > | LLW: exit (no unwaited-for LWP) > | stop_all_lwps done, setting stopping_threads back to !stopping <<<< > | exiting stop_all_lwps >=20 > Because all our threads had hit a breakpoint, we can see above that > none of the lwps got sent a sigstop signal, and therefore waitpid > returns zero as expected - confirmed by "(no unwaited-for LWP)". >=20 > And this is where I think things go wrong - instead of reporting the > fork event to GDB, we look at all our threads, find that none of them > was being single-stepped, and so select one of the threads that got a > SIGTRAP at random, reporting the event of that thread: >=20 > | SEL: Found 3 SIGTRAP events, selecting #1 Checking whether LWP 2178 > | needs to move out of the jump pad...no Checking whether LWP 2190 > needs > | to move out of the jump pad...no > | linux_wait_1 ret =3D LWP 2178.2188, 1, 5 <<<< exiting linux_wait_1 > | Writing resume reply for LWP 2178.2188:1 putpkt > | ("$T05swbreak:;01:482009e0;40:10003218;thread:p882.88c;core:0;#75"); > | [noack mode] >=20 > This is in linux-low.c::select_event_lwp >=20 > I'm thinking the right thing to do, here is to enhance > select_event_lwp to look for threads that received a fork/vfork event > first, and report that event to gdb ahead of all the other kinds of > events. >=20 > But I ran out of time to try this today - so I decided to send my > notes about this so far. I hope I am on the right track! >=20 > -- > Joel Hi Joel, Thanks for following up on this. Your diagnosis makes sense to me. As for= modifications to select_event_lwp, Pedro would be better suited to comment= on potential side effects. Thanks --Don