Mirror of the gdb-patches mailing list
 help / color / mirror / Atom feed
From: Pedro Alves <pedro@palves.net>
To: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Cc: gdb-patches@sourceware.org
Subject: Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
Date: Wed, 17 Apr 2024 16:32:56 +0100	[thread overview]
Message-ID: <9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net> (raw)
In-Reply-To: <87msptgbey.fsf@linaro.org>

On 2024-04-16 05:48, Thiago Jung Bauermann wrote:
> 
> Hello Pedro,
> 
> [ I just now noticed that I replied to you in private last time. Sorry
>   about that, I thought I had cc'd the list. ]

I hadn't noticed.  :-)

> 
> Pedro Alves <pedro@palves.net> writes:
> 
>> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>>> When GDB attaches to a multi-threaded process, it calls
>>> linux_proc_attach_tgid_threads () to go through all threads found in
>>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
>>> them.  If it does that twice without the callback reporting that a new
>>> thread was found, then it considers that all inferior threads have been
>>> found and returns.
>>>
>>> The problem is that the callback considers any thread that it hasn't
>>> attached to yet as new.  This causes problems if the process has one or
>>> more zombie threads, because GDB can't attach to it and the loop will
>>> always "find" a new thread (the zombie one), and get stuck in an
>>> infinite loop.
>>>
>>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
>>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
>>> its test program constantly creates and finishes joinable threads so the
>>> chance of having zombie threads is high.
>>
>> Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
>> a zombie lwp (and silently fails)?
>>
>> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:
>>
>>       if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
>> 	{
>> 	  int err = errno;
>>
>> 	  /* Be quiet if we simply raced with the thread exiting.
>> 	     EPERM is returned if the thread's task still exists, and
>> 	     is marked as exited or zombie, as well as other
>> 	     conditions, so in that case, confirm the status in
>> 	     /proc/PID/status.  */
>> 	  if (err == ESRCH
>> 	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
>> 	    {
>> 	      linux_nat_debug_printf
>> 		("Cannot attach to lwp %d: thread is gone (%d: %s)",
>> 		 lwpid, err, safe_strerror (err));
>>
>>               return 0;        <<<< NEW RETURN
>> 	    }
>>
>> Similar thing for gdbserver, of course.
> 
> Thank you for the suggestion. I tried doing that, and in fact I attached
> a patch with that change in comment #17 of PR 31312 when I was
> investigating a fix¹. I called it a workaround because I also had to
> increase the number of iterations in linux_proc_attach_tgid_threads from
> 2 to 100, otherwise GDB gives up on waiting for new inferior threads too
> early and the inferior dies with a SIGTRAP because some new unnoticed
> thread tripped into the breakpoint.
> 
> Because of the need to increase the number of iterations, I didn't
> consider it a good solution and went with the approach in this patch
> series instead. Now I finally understand why I had to increase the
> number of iterations (though I still don't see a way around it other
> than what this patch series does):
> 
> With the approach in this patch series, even if a new thread becomes
> zombie by the time GDB tries to attach to it, it still causes
> new_threads_found to be set the first time GDB notices it, and the loop
> in linux_proc_attach_tgid_threads starts over.
> 
> With the approach above, a new thread that becomes zombie before GDB has
> a chance to attach to it never causes the loop to start over, and so it
> exits earlier.
> 
> I think it's a matter of opinion whether one approach or the other can
> be considered the better one.
> 
> Even with this patch series, it's not guaranteed that two iterations
> without finding new threads is enough to ensure that GDB has found all
> threads in the inferior. I left the test running in a loop overnight
> with the patch series applied and it failed after about 2500 runs.

Thanks for all the investigations.  I hadn't seen the bugzilla before,
I didn't notice there was one identified in the patch.

Let me just start by saying "bleh"...

This is all of course bad ptrace/kernel design...  The only way to plug this race
completely is with kernel help, I believe.

The only way that I know of today that we can make the kernel pause all threads for
us, is with a group-stop.  I.e., pass a SIGSTOP to the inferior, and the kernel stops
_all_ threads with SIGSTOP.

I prototyped it today, and while far from perfect (would need to handle a corner
scenarios like attaching and getting back something != SIGSTOP, and I still see
some spurious SIGSTOPs), it kind of works ...

... except for what I think is a deal breaker that I don't know how to work around:

If you attach to a process that is running on a shell, you will see that the group-stop
makes it go to the background.  GDB resumes the process, but it won't go back to the
foreground without an explicit "fg" in the shell...  Like:

 $ ./test_program
 [1]+  Stopped                 ./test_program
 $

I'm pasting the incomplete patch below for the archives, but I'm not going to
pursue this further.  

Another idea I had was to look for the thread_db thread creation event breakpoint
address, and instead of putting a breakpoint there, put a "jmp $pc" instruction there.
That would halt creation of new pthreads.  But it wouldn't halt raw clones...

Maybe someone has some better idea.  Really the kernel should have a nice interface
for this.  

Thankfully this is a contrived test, no real program should be spawning threads
like this.  One would hope.

I will take a new look at your series.

From 9dad08eff6fc534964e457de7f99127354dfae44 Mon Sep 17 00:00:00 2001
From: Pedro Alves <pedro@palves.net>
Date: Wed, 17 Apr 2024 12:47:25 +0100
Subject: [PATCH] Force group-stop when attaching

Change-Id: I95bc82d2347af87c74a11ee9ddfc65b851f3e6c7
---
 gdb/linux-nat.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 2602e1f240d..c95b4370d1b 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -1033,6 +1033,14 @@ linux_nat_post_attach_wait (ptid_t ptid, int *signalled)
 			      status_to_str (status).c_str ());
     }
 
+  /* Force the inferior into group-stop, so all threads are frozen by
+     the kernel, so we can attach to all of them, race-free.  */
+  ptrace (PTRACE_CONT, pid, 0, SIGSTOP);
+
+  /* */
+  pid_t pid2 = my_waitpid (pid, &status, __WALL);
+  gdb_assert (pid == pid2);
+
   return status;
 }
 
@@ -1231,6 +1239,31 @@ linux_nat_target::attach (const char *args, int from_tty)
       throw;
     }
 
+  /* Consume the group-stop SIGSTOP for every thread, and move them
+     into ptrace-stopped.  */
+  for (lwp_info *lwp : all_lwps_safe ())
+    {
+      if (lwp->ptid.pid () == lwp->ptid.lwp ())
+	continue;
+
+      int lwp_status;
+      int lwpid = lwp->ptid.lwp ();
+      pid_t ret = my_waitpid (lwpid, &lwp_status, __WALL);
+      gdb_assert (ret == lwpid);
+
+      gdb_assert (WIFSTOPPED (lwp_status));
+      gdb_assert (WSTOPSIG (lwp_status) == SIGSTOP);
+
+      /* If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely
+	 a group-stop.  */
+      siginfo_t siginfo;
+      gdb_assert (!linux_nat_get_siginfo (lwp->ptid, &siginfo));
+      gdb_assert (errno == EINVAL);
+
+      kill_lwp (lwpid, SIGSTOP);
+      ptrace (PTRACE_CONT, lwpid, 0, 0);
+    }
+
   /* Add all the LWPs to gdb's thread list.  */
   iterate_over_lwps (ptid_t (ptid.pid ()),
 		     [] (struct lwp_info *lwp) -> int

base-commit: 2d4c39a885d4d12325d0a9be9e014e75a295fb25
-- 
2.43.2


  reply	other threads:[~2024-04-17 15:33 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-21 23:11 [RFC PATCH 0/3] " Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
2024-03-22 17:33   ` Luis Machado
2024-04-17 15:55     ` Pedro Alves
2024-04-20  5:15       ` Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
2024-03-22 16:12   ` Luis Machado
2024-04-17 16:06   ` Pedro Alves
2024-04-20  5:16     ` Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
2024-03-22 16:19   ` Luis Machado
2024-03-22 16:52   ` Pedro Alves
2024-04-16  4:48     ` Thiago Jung Bauermann
2024-04-17 15:32       ` Pedro Alves [this message]
2024-04-20  5:00         ` Thiago Jung Bauermann
2024-04-26 15:35           ` Pedro Alves
2024-04-17 16:28   ` Pedro Alves
2024-04-20  5:28     ` Thiago Jung Bauermann
2024-03-22 10:17 ` [RFC PATCH 0/3] " Christophe Lyon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net \
    --to=pedro@palves.net \
    --cc=gdb-patches@sourceware.org \
    --cc=thiago.bauermann@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox