From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-bounces+public-inbox=simark.ca@sourceware.org>
Received: from simark.ca
	by simark.ca with LMTP
	id sewtGGg2R2KfYAAAWB0awg
	(envelope-from <gdb-patches-bounces+public-inbox=simark.ca@sourceware.org>)
	for <public-inbox@simark.ca>; Fri, 01 Apr 2022 13:29:12 -0400
Received: by simark.ca (Postfix, from userid 112)
	id 52C171F1F4; Fri,  1 Apr 2022 13:29:12 -0400 (EDT)
X-Spam-Checker-Version: SpamAssassin 3.4.2 (2018-09-13) on simark.ca
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=5.0 tests=BAYES_00,DKIM_SIGNED,
	DKIM_VALID,DKIM_VALID_AU,MAILING_LIST_MULTI,NICE_REPLY_A,RDNS_DYNAMIC,
	URIBL_BLOCKED autolearn=ham autolearn_force=no version=3.4.2
Received: from sourceware.org (ip-8-43-85-97.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
	(No client certificate requested)
	by simark.ca (Postfix) with ESMTPS id 507321E787
	for <public-inbox@simark.ca>; Fri,  1 Apr 2022 13:29:11 -0400 (EDT)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id D72F83858C55
	for <public-inbox@simark.ca>; Fri,  1 Apr 2022 17:29:10 +0000 (GMT)
DKIM-Filter: OpenDKIM Filter v2.11.0 sourceware.org D72F83858C55
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=sourceware.org;
	s=default; t=1648834150;
	bh=CvWQz2/5ZgSj+UHuvPaNc+X2fao1SFiECwDytnuTMLc=;
	h=Date:Subject:To:References:In-Reply-To:List-Id:List-Unsubscribe:
	 List-Archive:List-Post:List-Help:List-Subscribe:From:Reply-To:Cc:
	 From;
	b=vYbwh9sxt712mVnUDejC1S4MevG7BJqHeOrrzSeRH7k4ImNrisnVRSixKxFnwLxwg
	 1/LPv3putN73OmCX/3yivFbZa2Glf1u2IEWnKVNhRSCCv7JnV/p/GvAGiM4VaU3RWA
	 H09hD9oWm643oY93stgg68oQmWv3wAXgFfAtlWO4=
Received: from smtp.polymtl.ca (smtp.polymtl.ca [132.207.4.11])
 by sourceware.org (Postfix) with ESMTPS id 4345B3858C52
 for <gdb-patches@sourceware.org>; Fri,  1 Apr 2022 17:28:27 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.1 sourceware.org 4345B3858C52
Received: from simark.ca (simark.ca [158.69.221.121]) (authenticated bits=0)
 by smtp.polymtl.ca (8.14.7/8.14.7) with ESMTP id 231HS7nd022771
 (version=TLSv1/SSLv3 cipher=ECDHE-RSA-AES256-GCM-SHA384 bits=256 verify=NOT); 
 Fri, 1 Apr 2022 13:28:12 -0400
DKIM-Filter: OpenDKIM Filter v2.11.0 smtp.polymtl.ca 231HS7nd022771
Received: from [10.0.0.11] (192-222-157-6.qc.cable.ebox.net [192.222.157.6])
 (using TLSv1.3 with cipher TLS_AES_128_GCM_SHA256 (128/128 bits)
 key-exchange X25519 server-signature RSA-PSS (2048 bits) server-digest SHA256)
 (No client certificate requested)
 by simark.ca (Postfix) with ESMTPSA id 2696C1E787;
 Fri,  1 Apr 2022 13:28:07 -0400 (EDT)
Message-ID: <2bf14e68-7f2f-81cc-8c40-e847a6aca1a7@polymtl.ca>
Date: Fri, 1 Apr 2022 13:28:06 -0400
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101
 Thunderbird/91.7.0
Subject: Re: [PATCH 6/8] gdb: fix handling of vfork by multi-threaded program
 (follow-fork-mode=parent, detach-on-fork=on)
Content-Language: en-US
To: Pedro Alves <pedro@palves.net>, gdb-patches@sourceware.org
References: <20220117162742.524350-1-simon.marchi@polymtl.ca>
 <20220117162742.524350-7-simon.marchi@polymtl.ca>
 <97adba8a-5007-c017-7730-18805d9dccc7@palves.net>
In-Reply-To: <97adba8a-5007-c017-7730-18805d9dccc7@palves.net>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Poly-FromMTA: (simark.ca [158.69.221.121]) at Fri, 1 Apr 2022 17:28:07 +0000
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
From: Simon Marchi via Gdb-patches <gdb-patches@sourceware.org>
Reply-To: Simon Marchi <simon.marchi@polymtl.ca>
Cc: Simon Marchi <simon.marchi@efficios.com>
Errors-To: gdb-patches-bounces+public-inbox=simark.ca@sourceware.org
Sender: "Gdb-patches"
 <gdb-patches-bounces+public-inbox=simark.ca@sourceware.org>

> Thinking through this, one concern I had is that if the vfork child gets stuck for some reason
> before execing or exiting (hey we're a debugger, bugs happen!), then pausing all threads makes
> it so that you can end up stuck in gdb and can't get back the prompt (if the child ignores
> SIGINT, otherwise it just dies), or if you continue in the background, you
> can't interrupt the parent threads.
> 
> E.g., with your series, and:
> 
> diff --git c/gdb/testsuite/gdb.threads/vfork-multi-thread.c w/gdb/testsuite/gdb.threads/vfork-multi-thread.c
> index cd3dfcc1e65..6178de91dca 100644
> --- c/gdb/testsuite/gdb.threads/vfork-multi-thread.c
> +++ w/gdb/testsuite/gdb.threads/vfork-multi-thread.c
> @@ -25,6 +25,8 @@ static void *vforker(void *arg)
>      /* A vfork child is not supposed to mess with the state of the program,
>         but it is helpful for the purpose of this test.  */
>      release_main = 1;
> +    while (1)
> +      sleep (1);
>      _exit(7);
>    }
>  
> @@ -41,7 +43,7 @@ static void should_break_here(void) {}
>  
>  int main()
>  {
> -
> +  signal (SIGINT, SIG_IGN);
>    pthread_t thread;
>    int ret = pthread_create(&thread, NULL, vforker, NULL);
>    assert(ret == 0);
> (END)
>  
> 
> We get:
> 
> (gdb) r
> Starting program: /home/pedro/gdb/binutils-gdb/build/gdb/testsuite/outputs/gdb.threads/vfork-multi-thread/vfork-multi-thread 
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7ffff7d8a700 (LWP 1713229)]
> [Detaching after vfork from child process 1713230]
> ^C^C^C^C^C^C
> (hung forever)
> 
> Or:
> 
> $ gdb -ex "set non-stop on" ...
> (gdb) r&
> Starting program: /home/pedro/gdb/binutils-gdb/build/gdb/testsuite/outputs/gdb.threads/vfork-multi-thread/vfork-multi-thread 
> (gdb) [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> [New Thread 0x7ffff7d8a700 (LWP 1345571)]
> [Detaching after vfork from child process 1345572]
> info threads 
>   Id   Target Id                                             Frame 
> * 1    Thread 0x7ffff7d8b740 (LWP 1345567) "vfork-multi-thr" (running)
>   2    Thread 0x7ffff7d8a700 (LWP 1345571) "vfork-multi-thr" (running)
> (gdb) interrupt 
> *nothing*
> (gdb) interrupt 
> *nothing*
> (gdb) info threads 
>   Id   Target Id                                             Frame 
> * 1    Thread 0x7ffff7d8b740 (LWP 1345567) "vfork-multi-thr" (running)
>   2    Thread 0x7ffff7d8a700 (LWP 1345571) "vfork-multi-thr" (running)
> (gdb) 
> 
> 
> And then you can't quit gdb:
> 
> (gdb) q
> A debugging session is active.
> 
>         Inferior 1 [process 1345567] will be killed.
> 
> Quit anyway? (y or n) y
> 
> ^C^C^C^C
> (hung forever)

Ah yeah... that sucks.

We can find a ton of cases where we can deadlock when dealing with stuck
vfork children.  You can run the above program on the shell and try
attaching the parent, for example.

> I guess the ^C issue will be fixed by my series reworking ctrl-c, leaving
> gdb in control of the terminal at all times..

What will happen then?  GDB receives the SIGINT, then sends a SIGSTOP to
the vfork parent thread.  Do you know if a vfork parent thread can be
stopped at that point (while the vfork child thread is stuck in an
infinite loop)?

> Bah, I guess that this might be a meh/schrug, since if the parent is
> single-threaded, then you can't interrupt it either.  Though at least GDB
> doesn't get you get into the stuck state.

What do you mean "doesn't get you into the stuck state"?  I don't see
what's different here between the single and multi thread cases.

> I wonder whether a better solution would be something completely different.
> Like, say, make it so that gdb always remains attached to the child until it
> execs/exits, in which case we would no longer need to remove breakpoints from the
> parent/child, and thus we could let all the parent's threads run as well.
> 
> We actually do that for follow-fork parent / detach-on-fork-on and
> vfork -- see inferior::pending_detach.

You mean "follow-fork child" I think.  In that case, we keep the parent
attached until the child exec or exits.

I think it would be possible to do what you describe.  If we did this, I
think we could make it so we never resume a vfork parent thread, to
avoid getting into the situation where we want to interrupt / stop a
stuck vfork parent thread.  In my experience that's not possible, this
thread will not report a stop until the vfork child has exited/execed.
Although I haven't tried with SIGSTOP.


> Anyhow, onward.  With your currently solution at least the common scenario
> works, and we have a new testcase that helps whatever redesign in future.

Yeah, I think that pragamatically it's a step forward, assuming a
"collaborating" vfork children that doesn't hang.

> Despite these concerns, I'm actually happy to get this into the tree.  I think
> it's still better than what we have today.

Cool, thanks.

>> +
>> +  /* The rest of the function assumes non-stop==off and
>> +     target-non-stop==off.
>> +
>> +     If a thread is waiting for a vfork-done event, it means breakpoints are out
>> +     for this inferior (well, program space in fact).  We don't want to resume
>> +     any thread other than the one waiting for vfork done, otherwise these other
>> +     threads could miss breakpoints.  So if a thread in the resumption set is
>> +     waiting for a vfork-done event, resume only that thread.
>> +
>> +     The resumption set width depends on whether schedule-multiple is on or off.
>> +
>> +     Note that if the target_resume interface was more flexible, we could be
>> +     smarter here when schedule-multiple is on.  For example, imagine 3
>> +     inferiors with 2 threads each (1.1, 1.2, 2.1, 2.2, 3.1 and 3.2).  Threads
>> +     2.1 and 3.2 are both waiting for a vfork-done event.  Then we could ask the
>> +     target(s) to resume:
>> +
>> +      - All threads of inferior 1
>> +      - Thread 2.1
>> +      - Thread 3.2
>> +
>> +     Since we don't have that flexibility (we can only pass one ptid), just
>> +     resume the first thread waiting for a vfork-done event we find (e.g. thread
>> +     2.1).  */
> 
> We actually have the flexibility, but we don't use it.  I mean, we could use
> the commit_resumed mechanism in all-stop as well, which would let us prepare such
> a resume with multiple target_resume calls and final commit.  For a rainy day.

Ah, yeah.  We could make internal_resume_ptid return a vector of ptids,
and have do_target_resume call target_resume multiple times.  Even
without commit_resumed... that should work?  I think?

>> +  if (sched_multi)
>> +    {
>> +      for (inferior *inf : all_non_exited_inferiors ())
>> +	if (inf->thread_waiting_for_vfork_done != nullptr)
>> +	  return inf->thread_waiting_for_vfork_done->ptid;
>> +    }
>> +  else if (current_inferior ()->thread_waiting_for_vfork_done != nullptr)
>> +    return current_inferior ()->thread_waiting_for_vfork_done->ptid;
>> +
>> +  return user_visible_resume_ptid (user_step);
>>  }
>>  
>>  /* Wrapper for target_resume, that handles infrun-specific
>> @@ -3258,6 +3360,19 @@ proceed (CORE_ADDR addr, enum gdb_signal siggnal)
>>  		continue;
> 
> 
>> diff --git a/gdb/testsuite/gdb.threads/vfork-multi-thread.c b/gdb/testsuite/gdb.threads/vfork-multi-thread.c
>> new file mode 100644
>> index 000000000000..cd3dfcc1e653
>> --- /dev/null
>> +++ b/gdb/testsuite/gdb.threads/vfork-multi-thread.c
>> @@ -0,0 +1,74 @@
>> +#include <assert.h>
> 
> Missing copyright header.

Done.

>> +#include <pthread.h>
>> +#include <stdio.h>
>> +#include <unistd.h>
>> +#include <sys/wait.h>
>> +
>> +// Start with:
>> +//
>> +//   ./gdb -nx -q --data-directory=data-directory repro -ex "tb break_here_first" -ex r -ex "b should_break_here"
>> +//
>> +// Then do "continue".
>> +//
>> +// The main thread will likely cross should_break_here while we are handling
>> +// the vfork and breakpoints are removed, therefore missing the breakpoint.
> 
> The whole file needs a pass to adjust it to GNU conventions/formatting.

Yeah, I did that after Lancelot's comments on the v2.

>> +
>> +static volatile int release_vfork = 0;
>> +static volatile int release_main = 0;
>> +
>> +static void *vforker(void *arg)
>> +{
>> +  while (!release_vfork);
> 
> I'd add some usleep(1) here, to avoid hogging the cpu.

Ok.

>> +
>> +  pid_t pid = vfork ();
>> +  if (pid == 0) {
>> +    /* A vfork child is not supposed to mess with the state of the program,
>> +       but it is helpful for the purpose of this test.  */
>> +    release_main = 1;
>> +    _exit(7);
>> +  }
>> +
>> +  int stat;
>> +  int ret = waitpid (pid, &stat, 0);
>> +  assert (ret == pid);
>> +  assert (WIFEXITED (stat));
>> +  assert (WEXITSTATUS (stat) == 7);
>> +
>> +  return NULL;
>> +}
>> +
>> +static void should_break_here(void) {}
>> +
>> +int main()
>> +{
>> +
>> +  pthread_t thread;
>> +  int ret = pthread_create(&thread, NULL, vforker, NULL);
>> +  assert(ret == 0);
>> +
>> +  /* We break here first, while the thread is stuck on `!release_fork`.  */
>> +  release_vfork = 1;
>> +
>> +  /* We set a breakpoint on should_break_here.
>> +
>> +     We then set "release_fork" from the debugger and continue.  The main
>> +     thread hangs on `!release_main` while the non-main thread vforks.  During
>> +     the window of time where the two processes have a shared address space
>> +     (after vfork, before _exit), GDB removes the breakpoints from the address
>> +     space.  During that window, only the vfork-ing thread (the non-main
>> +     thread) is frozen by the kernel.  The main thread is free to execute.  The
>> +     child process sets `release_main`, releasing the main thread. A buggy GDB
>> +     would let the main thread execute during that window, leading to the
>> +     breakpoint on should_break_here being missed.  A fixed GDB does not resume
>> +     the threads of the vforking process other than the vforking thread.  When
>> +     the vfork child exits, the fixed GDB resumes the main thread, after
>> +     breakpoints are reinserted, so the breakpoint is not missed.  */
>> +
>> +  while (!release_main);
> 
> Ditto, usleep.

I don't know about this one, if it would affect the ability of the test
to reproduce the failure.  For the failure to be reproduced, we want
that main thread to exit that loop as fast as possible once release_main
is set and dash through "should_break_here".  If we make it sleep, it
increases the chances that the vfork child calls exit before the main
thread crosses should_break_here, therefore not exposing the problem.

I gave it a try, and with a usleep I see some failures (when removing
the fix), meaning the test still does its job, so I can add it.  If I
try with a "sleep (1)" though, the failures disappear.

Simon