From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-132266-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 102402 invoked by alias); 24 Jun 2016 21:52:26 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Received: (qmail 102385 invoked by uid 89); 24 Jun 2016 21:52:25 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=-1.0 required=5.0 tests=AWL,BAYES_00,RCVD_IN_DNSWL_NONE,SPF_PASS autolearn=ham version=3.3.2 spammy=
X-HELO: relay1.mentorg.com
Received: from relay1.mentorg.com (HELO relay1.mentorg.com) (192.94.38.131) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-GCM-SHA384 encrypted) ESMTPS; Fri, 24 Jun 2016 21:52:20 +0000
Received: from svr-orw-fem-06.mgc.mentorg.com ([147.34.97.120])	by relay1.mentorg.com with esmtp 	id 1bGZ1N-0004CR-Vw from Don_Breazeal@mentor.com ; Fri, 24 Jun 2016 14:52:18 -0700
Received: from NA-MBX-02.mgc.mentorg.com ([169.254.2.247]) by SVR-ORW-FEM-06.mgc.mentorg.com ([147.34.97.120]) with mapi id 14.03.0224.002; Fri, 24 Jun 2016 14:52:17 -0700
From: "Breazeal, Don" <Don_Breazeal@mentor.com>
To: Joel Brobecker <brobecker@adacore.com>
CC: "gdb-patches@sourceware.org" <gdb-patches@sourceware.org>
Subject: RE: RFH: failed assert debugging threaded+fork program over gdbserver
Date: Fri, 24 Jun 2016 21:52:00 -0000
Message-ID: <DA279C53C4A5884A907135DFCD7A059A37ED2DBF@NA-MBX-02.mgc.mentorg.com>
References: <20160512171650.GC26324@adacore.com> <5734C06C.8040008@codesourcery.com> <20160623225935.GC3295@adacore.com>
In-Reply-To: <20160623225935.GC3295@adacore.com>
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0
X-SW-Source: 2016-06/txt/msg00410.txt.bz2


> -----Original Message-----
> From: Joel Brobecker [mailto:brobecker@adacore.com]
> Sent: Thursday, June 23, 2016 4:00 PM
> To: Breazeal, Don
> Cc: gdb-patches@sourceware.org
> Subject: Re: RFH: failed assert debugging threaded+fork program over
> gdbserver
>=20
> Hello,
>=20
> I finally had some time to look into this again. I think I have a
> better understanding.
>=20
> On Thu, May 12, 2016 at 10:42:04AM -0700, Don Breazeal wrote:
> > On 5/12/2016 10:16 AM, Joel Brobecker wrote:
> > > Hello,
> > >
> > > I have noticed the following problem, when debugging a program
> which
> > > uses both threads and fork. The program is attached in copy, and
> it
> > > was compiled by simply doing:
> > >
> > >     % gnatmake -g a_test
> > >
> > > The issue appears only randomly, but it seems to show up fairly
> > > reliably when using certain versions of GNU/Linux such as RHES7,
> or
> > > WRSLinux. I also see it on Ubuntu, but less reliably. Here is what
> I
> > > have found, debugging on WRSLinux (we set it up as a cross, but it
> > > should be the same with native GNU/Linux distros):
> > >
> > >     % gdb a_test
> > >     (gdb) break a_test.adb:30
> > >     (gdb) break a_test.adb:39
> > >     (gdb) target remote my_board:4444
> > >     (gdb) continue
> > >     Continuing.
> > >     [...]
> > >     [New Thread 866.868]
> > >     [New Thread 866.869]
> > >     [New Thread 870.870]
> > >     /[...]/gdb/thread.c:89: internal-error: thread_info*
> inferior_thread(): Assertion `tp' failed.
> > >     A problem internal to GDB has been detected,
> > >     further debugging may prove unreliable.
> > >     Quit this debugging session? (y or n)
> > >
> > > The error happens because GDBserver returns a list of threads to
> GDB
> > > where a new thread as a different PID (870 in the case above,
> > > instead of 866).
> >
> > This is not supposed to happen.  In
> remote.c:remote_update_thread_list
> > there is a call to remove_new_fork_children that is explicitly
> > supposed to prevent this scenario.  The new fork child thread should
> > be deleted from the thread list ("context") before we call
> remote_notice_new_inferior.
> >
> > We don't want the remote to report new threads related to the fork
> > child until after we have handled the fork using
> infrun.c:follow_fork. (Note:
> > it looks like the function comment for remove_new_fork_children is
> > stale, since there are scenarios other than being stopped at a fork
> > catchpoint where this can occur.)
> >
> > > Any advice on how I should be fixing the issue?
> >
> > It looks like there is a case that remove_new_fork_children isn't
> > handling correctly with your test+target(s). I'd start there to find
> > out how the new thread is getting through.
>=20
> I started by looking at remove_new_fork_children, which, in essence,
> remove threads whose target_waitstatus->kind is fork or vforked.
> And for that to happen, I think the remote (gdbserver) has to send GDB
> a stop:fork packet, something like this:
>=20
>     Packet received:
> T05fork:p4bc4.4bc4;06:0100000000000000;07:d0ddffffff7f0000;10:6a4f83f7
> ff7f0000;thread:p4ba5.4ba5;core:5;
>=20
> However, this is not what happens, in our case. When continuing, we
> just receive a SIGTRAP event, at which point GDB asks for the thread
> list, and by then, the thread for the forked process is in the list,
> leading to the situation described above.
>=20
> Deciphering the gdbserver logs, I think I understand why this is
> happening. After we're done single-stepping out of our breakpoint, and
> we have resumed our program we go into a wait loop in
> linux_wait_for_event_filtered, which, as I understand it, pulls event
> from the kernel, but without doing a blocking wait (WNOHANG), and as
> long as events are found, we keep waiting for more:
>=20
>   /* Always pull all events out of the kernel.  We'll randomly select
>      an event LWP out of all that have events, to prevent
>      starvation.  */
>   while (event_child =3D=3D NULL)
>     {
>       [...]
>       ret =3D my_waitpid (-1, wstatp, options | WNOHANG);
>       if (ret > 0)
>         {
>           [...]
>           continue;
>         }
>=20
> After resuming, we go through a couple of loops where I think we
> discover the new threads that are getting started. Each time, we
> resume the threads via:
>=20
>       /* Now that we've pulled all events out of the kernel, resume
>          LWPs that don't have an interesting event to report.  */
>       if (stopping_threads =3D=3D NOT_STOPPING_THREADS)
>         for_each_inferior (&all_threads, resume_stopped_resumed_lwps);
>=20
> But shortly after, we get the following batch of events:
>=20
> | leader_pid=3D2178, leader_lp!=3DNULL=3D1, num_lwps=3D3, zombie=3D0
> | sigsuspend'ing sigchld_handler my_waitpid (-1, 0x40000001)
> my_waitpid
> | (-1, 0x40000001): status(137f), 2190
> | LWFE: waitpid(-1, ...) returned 2190, ERRNO-OK
> | LLW: waitpid 2190 received Stopped (signal) (stopped) my_waitpid (-
> 1,
> | 0x40000001) my_waitpid (-1, 0x40000001): status(57f), 2189
> | LWFE: waitpid(-1, ...) returned 2189, ERRNO-OK
> | LLW: waitpid 2189 received Trace/breakpoint trap (stopped) pc is
> | 0x1000336c
> | CSBB: LWP 2178.2189 stopped by software breakpoint my_waitpid (-1,
> | 0x40000001) my_waitpid (-1, 0x40000001): status(57f), 2188
> | LWFE: waitpid(-1, ...) returned 2188, ERRNO-OK
> | LLW: waitpid 2188 received Trace/breakpoint trap (stopped) pc is
> | 0x10003218
> | CSBB: LWP 2178.2188 stopped by software breakpoint my_waitpid (-1,
> | 0x40000001) my_waitpid (-1, 0x40000001): status(1057f), 2178
> | LWFE: waitpid(-1, ...) returned 2178, ERRNO-OK
> | LLW: waitpid 2178 received Trace/breakpoint trap (stopped) pc is
> | 0xfe87158
> | HEW: Got fork event from LWP 2178, new child is 2190  <<<<--- fork!
> |
> | pc is 0xfe87158
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(ffffffff), 0
> | LWFE: waitpid(-1, ...) returned 0, ERRNO-OK
>=20
> This tells me that we get 3 SIGTRAP events from our program's threads
> hitting a breakpoint, *and* the fork event. The fork event is
> confirmed by some logs printed right after:
>=20
> | LWP 2178: extended event with waitstatus status->kind =3D forked Hit a
> | non-gdbserver trap event.
>=20
> Now that we've have interesting stuff to report, we decide to stop all
> LWPS:
>=20
> | >>>> entering stop_all_lwps
> | stop_all_lwps (stop, except=3Dnone)
> | wait_for_sigstop: pulling events
> | my_waitpid (-1, 0x40000001)
> | my_waitpid (-1, 0x40000001): status(ffffffff), 0
> | LWFE: waitpid(-1, ...) returned 0, ERRNO-OK leader_pid=3D2178,
> | leader_lp!=3DNULL=3D1, num_lwps=3D3, zombie=3D0 leader_pid=3D2190,
> | leader_lp!=3DNULL=3D1, num_lwps=3D1, zombie=3D0
> | LLW: exit (no unwaited-for LWP)
> | stop_all_lwps done, setting stopping_threads back to !stopping <<<<
> | exiting stop_all_lwps
>=20
> Because all our threads had hit a breakpoint, we can see above that
> none of the lwps got sent a sigstop signal, and therefore waitpid
> returns zero as expected - confirmed by "(no unwaited-for LWP)".
>=20
> And this is where I think things go wrong - instead of reporting the
> fork event to GDB, we look at all our threads, find that none of them
> was being single-stepped, and so select one of the threads that got a
> SIGTRAP at random, reporting the event of that thread:
>=20
> | SEL: Found 3 SIGTRAP events, selecting #1 Checking whether LWP 2178
> | needs to move out of the jump pad...no Checking whether LWP 2190
> needs
> | to move out of the jump pad...no
> | linux_wait_1 ret =3D LWP 2178.2188, 1, 5 <<<< exiting linux_wait_1
> | Writing resume reply for LWP 2178.2188:1 putpkt
> | ("$T05swbreak:;01:482009e0;40:10003218;thread:p882.88c;core:0;#75");
> | [noack mode]
>=20
> This is in linux-low.c::select_event_lwp
>=20
> I'm thinking the right thing to do, here is to enhance
> select_event_lwp to look for threads that received a fork/vfork event
> first, and report that event to gdb ahead of all the other kinds of
> events.
>=20
> But I ran out of time to try this today - so I decided to send my
> notes about this so far. I hope I am on the right track!
>=20
> --
> Joel

Hi Joel,
Thanks for following up on this.  Your diagnosis makes sense to me.  As for=
 modifications to select_event_lwp, Pedro would be better suited to comment=
 on potential side effects.
Thanks
--Don