From: Pedro Alves <pedro@palves.net>
To: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Cc: gdb-patches@sourceware.org
Subject: Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
Date: Wed, 17 Apr 2024 16:32:56 +0100 [thread overview]
Message-ID: <9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net> (raw)
In-Reply-To: <87msptgbey.fsf@linaro.org>
On 2024-04-16 05:48, Thiago Jung Bauermann wrote:
>
> Hello Pedro,
>
> [ I just now noticed that I replied to you in private last time. Sorry
> about that, I thought I had cc'd the list. ]
I hadn't noticed. :-)
>
> Pedro Alves <pedro@palves.net> writes:
>
>> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>>> When GDB attaches to a multi-threaded process, it calls
>>> linux_proc_attach_tgid_threads () to go through all threads found in
>>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
>>> them. If it does that twice without the callback reporting that a new
>>> thread was found, then it considers that all inferior threads have been
>>> found and returns.
>>>
>>> The problem is that the callback considers any thread that it hasn't
>>> attached to yet as new. This causes problems if the process has one or
>>> more zombie threads, because GDB can't attach to it and the loop will
>>> always "find" a new thread (the zombie one), and get stuck in an
>>> infinite loop.
>>>
>>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
>>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
>>> its test program constantly creates and finishes joinable threads so the
>>> chance of having zombie threads is high.
>>
>> Hmm. How about simply not restarting the loop if attach_lwp tries to attach to
>> a zombie lwp (and silently fails)?
>>
>> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:
>>
>> if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
>> {
>> int err = errno;
>>
>> /* Be quiet if we simply raced with the thread exiting.
>> EPERM is returned if the thread's task still exists, and
>> is marked as exited or zombie, as well as other
>> conditions, so in that case, confirm the status in
>> /proc/PID/status. */
>> if (err == ESRCH
>> || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
>> {
>> linux_nat_debug_printf
>> ("Cannot attach to lwp %d: thread is gone (%d: %s)",
>> lwpid, err, safe_strerror (err));
>>
>> return 0; <<<< NEW RETURN
>> }
>>
>> Similar thing for gdbserver, of course.
>
> Thank you for the suggestion. I tried doing that, and in fact I attached
> a patch with that change in comment #17 of PR 31312 when I was
> investigating a fix¹. I called it a workaround because I also had to
> increase the number of iterations in linux_proc_attach_tgid_threads from
> 2 to 100, otherwise GDB gives up on waiting for new inferior threads too
> early and the inferior dies with a SIGTRAP because some new unnoticed
> thread tripped into the breakpoint.
>
> Because of the need to increase the number of iterations, I didn't
> consider it a good solution and went with the approach in this patch
> series instead. Now I finally understand why I had to increase the
> number of iterations (though I still don't see a way around it other
> than what this patch series does):
>
> With the approach in this patch series, even if a new thread becomes
> zombie by the time GDB tries to attach to it, it still causes
> new_threads_found to be set the first time GDB notices it, and the loop
> in linux_proc_attach_tgid_threads starts over.
>
> With the approach above, a new thread that becomes zombie before GDB has
> a chance to attach to it never causes the loop to start over, and so it
> exits earlier.
>
> I think it's a matter of opinion whether one approach or the other can
> be considered the better one.
>
> Even with this patch series, it's not guaranteed that two iterations
> without finding new threads is enough to ensure that GDB has found all
> threads in the inferior. I left the test running in a loop overnight
> with the patch series applied and it failed after about 2500 runs.
Thanks for all the investigations. I hadn't seen the bugzilla before,
I didn't notice there was one identified in the patch.
Let me just start by saying "bleh"...
This is all of course bad ptrace/kernel design... The only way to plug this race
completely is with kernel help, I believe.
The only way that I know of today that we can make the kernel pause all threads for
us, is with a group-stop. I.e., pass a SIGSTOP to the inferior, and the kernel stops
_all_ threads with SIGSTOP.
I prototyped it today, and while far from perfect (would need to handle a corner
scenarios like attaching and getting back something != SIGSTOP, and I still see
some spurious SIGSTOPs), it kind of works ...
... except for what I think is a deal breaker that I don't know how to work around:
If you attach to a process that is running on a shell, you will see that the group-stop
makes it go to the background. GDB resumes the process, but it won't go back to the
foreground without an explicit "fg" in the shell... Like:
$ ./test_program
[1]+ Stopped ./test_program
$
I'm pasting the incomplete patch below for the archives, but I'm not going to
pursue this further.
Another idea I had was to look for the thread_db thread creation event breakpoint
address, and instead of putting a breakpoint there, put a "jmp $pc" instruction there.
That would halt creation of new pthreads. But it wouldn't halt raw clones...
Maybe someone has some better idea. Really the kernel should have a nice interface
for this.
Thankfully this is a contrived test, no real program should be spawning threads
like this. One would hope.
I will take a new look at your series.
From 9dad08eff6fc534964e457de7f99127354dfae44 Mon Sep 17 00:00:00 2001
From: Pedro Alves <pedro@palves.net>
Date: Wed, 17 Apr 2024 12:47:25 +0100
Subject: [PATCH] Force group-stop when attaching
Change-Id: I95bc82d2347af87c74a11ee9ddfc65b851f3e6c7
---
gdb/linux-nat.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 2602e1f240d..c95b4370d1b 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -1033,6 +1033,14 @@ linux_nat_post_attach_wait (ptid_t ptid, int *signalled)
status_to_str (status).c_str ());
}
+ /* Force the inferior into group-stop, so all threads are frozen by
+ the kernel, so we can attach to all of them, race-free. */
+ ptrace (PTRACE_CONT, pid, 0, SIGSTOP);
+
+ /* */
+ pid_t pid2 = my_waitpid (pid, &status, __WALL);
+ gdb_assert (pid == pid2);
+
return status;
}
@@ -1231,6 +1239,31 @@ linux_nat_target::attach (const char *args, int from_tty)
throw;
}
+ /* Consume the group-stop SIGSTOP for every thread, and move them
+ into ptrace-stopped. */
+ for (lwp_info *lwp : all_lwps_safe ())
+ {
+ if (lwp->ptid.pid () == lwp->ptid.lwp ())
+ continue;
+
+ int lwp_status;
+ int lwpid = lwp->ptid.lwp ();
+ pid_t ret = my_waitpid (lwpid, &lwp_status, __WALL);
+ gdb_assert (ret == lwpid);
+
+ gdb_assert (WIFSTOPPED (lwp_status));
+ gdb_assert (WSTOPSIG (lwp_status) == SIGSTOP);
+
+ /* If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely
+ a group-stop. */
+ siginfo_t siginfo;
+ gdb_assert (!linux_nat_get_siginfo (lwp->ptid, &siginfo));
+ gdb_assert (errno == EINVAL);
+
+ kill_lwp (lwpid, SIGSTOP);
+ ptrace (PTRACE_CONT, lwpid, 0, 0);
+ }
+
/* Add all the LWPs to gdb's thread list. */
iterate_over_lwps (ptid_t (ptid.pid ()),
[] (struct lwp_info *lwp) -> int
base-commit: 2d4c39a885d4d12325d0a9be9e014e75a295fb25
--
2.43.2
next prev parent reply other threads:[~2024-04-17 15:33 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-21 23:11 [RFC PATCH 0/3] " Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
2024-03-22 17:33 ` Luis Machado
2024-04-17 15:55 ` Pedro Alves
2024-04-20 5:15 ` Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
2024-03-22 16:12 ` Luis Machado
2024-04-17 16:06 ` Pedro Alves
2024-04-20 5:16 ` Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
2024-03-22 16:19 ` Luis Machado
2024-03-22 16:52 ` Pedro Alves
2024-04-16 4:48 ` Thiago Jung Bauermann
2024-04-17 15:32 ` Pedro Alves [this message]
2024-04-20 5:00 ` Thiago Jung Bauermann
2024-04-26 15:35 ` Pedro Alves
2024-04-17 16:28 ` Pedro Alves
2024-04-20 5:28 ` Thiago Jung Bauermann
2024-03-22 10:17 ` [RFC PATCH 0/3] " Christophe Lyon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net \
--to=pedro@palves.net \
--cc=gdb-patches@sourceware.org \
--cc=thiago.bauermann@linaro.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox