From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-bounces+public-inbox=simark.ca@sourceware.org>
Received: from simark.ca
	by simark.ca with LMTP
	id mAeANc/rH2bcoDIAWB0awg
	(envelope-from <gdb-patches-bounces+public-inbox=simark.ca@sourceware.org>)
	for <public-inbox@simark.ca>; Wed, 17 Apr 2024 11:33:35 -0400
Received: by simark.ca (Postfix, from userid 112)
	id D5E7F1E0C0; Wed, 17 Apr 2024 11:33:35 -0400 (EDT)
Received: from server2.sourceware.org (server2.sourceware.org [8.43.85.97])
	(using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits)
	 key-exchange X25519 server-signature ECDSA (prime256v1) server-digest SHA256)
	(No client certificate requested)
	by simark.ca (Postfix) with ESMTPS id AA82C1E092
	for <public-inbox@simark.ca>; Wed, 17 Apr 2024 11:33:33 -0400 (EDT)
Received: from server2.sourceware.org (localhost [IPv6:::1])
	by sourceware.org (Postfix) with ESMTP id 1A19F3858424
	for <public-inbox@simark.ca>; Wed, 17 Apr 2024 15:33:33 +0000 (GMT)
Received: from mail-lj1-f169.google.com (mail-lj1-f169.google.com
 [209.85.208.169])
 by sourceware.org (Postfix) with ESMTPS id 252363858D20
 for <gdb-patches@sourceware.org>; Wed, 17 Apr 2024 15:33:01 +0000 (GMT)
DMARC-Filter: OpenDMARC Filter v1.4.2 sourceware.org 252363858D20
Authentication-Results: sourceware.org;
 dmarc=none (p=none dis=none) header.from=palves.net
Authentication-Results: sourceware.org; spf=pass smtp.mailfrom=gmail.com
ARC-Filter: OpenARC Filter v1.0.0 sourceware.org 252363858D20
Authentication-Results: server2.sourceware.org;
 arc=none smtp.remote-ip=209.85.208.169
ARC-Seal: i=1; a=rsa-sha256; d=sourceware.org; s=key; t=1713367992; cv=none;
 b=Nv6TjFXFfSwKcf5vnKkmy9KcrYpmuOrteVjxljWUHDGzmqcEy7CVdKSlW6oiHsY9tyHECDXH0bPaFQ7UIw0Euk5tNH5P9lL3oWNKTY9CDhhg0W9E/hJ9X89mncUCCM1zQIxGrNRY8haIFiJ5wL2GF7Ic6Uh46r9hR+Q1aPnAmoI=
ARC-Message-Signature: i=1; a=rsa-sha256; d=sourceware.org; s=key;
 t=1713367992; c=relaxed/simple;
 bh=TNPx9IxP+sLDj5VXJYkWagQW3baefphOatnW1MwpF3w=;
 h=Message-ID:Date:MIME-Version:From:Subject:To;
 b=fivxhoM7uv3OjHny3YrTcH2USrve7DVWJtBZ/2v1JFuGjpzjGBZAa1K03yHzDQVS2AG5GSnJMU/21/CP5sgMiNvzTinr+EK2uxicgIXzFbnLh17qzEBIYgph6TQdOqlsHR508ocnzWV/t7FcNRxVLkosn7exnAor3fJmurStTQ4=
ARC-Authentication-Results: i=1; server2.sourceware.org
Received: by mail-lj1-f169.google.com with SMTP id
 38308e7fff4ca-2d8a24f8a3cso65020971fa.1
 for <gdb-patches@sourceware.org>; Wed, 17 Apr 2024 08:33:01 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1713367980; x=1713972780;
 h=content-transfer-encoding:in-reply-to:content-language:references
 :cc:to:subject:from:user-agent:mime-version:date:message-id
 :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
 bh=jYA/sCS5OXZIdJxtMkQVZt8ATBxMgUZZI8ZuD/wUp14=;
 b=L43n08UUiQ1GGLwbDU536jc4BOWDchNim1C45l+YNn6HVEEi+HYUuYcFw5BDmEqyD5
 756WQ7D6t6Mbc9UrCGcDxk1Q2hxp4V//aNv+jyilYnrDcIjNniWKjl4Vc91Lfy9LYK1F
 Gj1b6xGdC+shSGH6/tc13srnqFBQFkz2vzCxQF5Qvik/YrFQyXa+gYkcsm87/86WLacP
 AsxxN54MzDtVTzZNLa1BR+NXUVVklSer2ljF6S3EOZp8TQTEz9tptbS1tZq3cmEoyDZI
 gCSxLODY+hN++0dwfuHON/mHHXksbIgOehtZ4frWycvQuo4xm9tAh8RBR2obX+RP1Ume
 ubRg==
X-Gm-Message-State: AOJu0Yw/wcOuGJIa1b8R3YI8DJPZhMrq8P+jaJJdsRIWgRIRv734T3xa
 epgr1jH9xF8+i7plBICOnQbdGQxdcpmQ46cs4ZdHV4FiibhdRwHvb270QDiW
X-Google-Smtp-Source: AGHT+IH4GXwgwhcoHh+OJHyFrUCCt6NtkIpPJF/3i2C+tmJ0wgzCOuGiiCSO7cBS/Qga4cVC1ZacfA==
X-Received: by 2002:a2e:84c4:0:b0:2da:8b96:7470 with SMTP id
 q4-20020a2e84c4000000b002da8b967470mr5515402ljh.33.1713367979303; 
 Wed, 17 Apr 2024 08:32:59 -0700 (PDT)
Received: from ?IPV6:2001:8a0:f93d:b900:e643:3adf:640c:b3ad?
 ([2001:8a0:f93d:b900:e643:3adf:640c:b3ad])
 by smtp.gmail.com with ESMTPSA id
 d5-20020adfc805000000b00349bd105089sm1928834wrh.47.2024.04.17.08.32.58
 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128);
 Wed, 17 Apr 2024 08:32:58 -0700 (PDT)
Message-ID: <9680e3cf-b8ad-4329-a51c-2aafb98d9476@palves.net>
Date: Wed, 17 Apr 2024 16:32:56 +0100
MIME-Version: 1.0
User-Agent: Mozilla Thunderbird
From: Pedro Alves <pedro@palves.net>
Subject: Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it
 has zombie threads
To: Thiago Jung Bauermann <thiago.bauermann@linaro.org>
Cc: gdb-patches@sourceware.org
References: <20240321231149.519549-1-thiago.bauermann@linaro.org>
 <20240321231149.519549-4-thiago.bauermann@linaro.org>
 <f7dac3e3-a9c5-43c6-8696-3ea683beddff@palves.net> <87msptgbey.fsf@linaro.org>
Content-Language: en-US
In-Reply-To: <87msptgbey.fsf@linaro.org>
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Spam-Status: No, score=-11.0 required=5.0 tests=BAYES_00,
 FREEMAIL_FORGED_FROMDOMAIN, FREEMAIL_FROM, GIT_PATCH_0,
 HEADER_FROM_DIFFERENT_DOMAINS, KAM_DMARC_STATUS, RCVD_IN_DNSWL_NONE,
 RCVD_IN_MSPIKE_H2, SPF_HELO_NONE, SPF_PASS,
 TXREP autolearn=ham autolearn_force=no version=3.4.6
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
 server2.sourceware.org
X-BeenThere: gdb-patches@sourceware.org
X-Mailman-Version: 2.1.30
Precedence: list
List-Id: Gdb-patches mailing list <gdb-patches.sourceware.org>
List-Unsubscribe: <https://sourceware.org/mailman/options/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=unsubscribe>
List-Archive: <https://sourceware.org/pipermail/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-request@sourceware.org?subject=help>
List-Subscribe: <https://sourceware.org/mailman/listinfo/gdb-patches>,
 <mailto:gdb-patches-request@sourceware.org?subject=subscribe>
Errors-To: gdb-patches-bounces+public-inbox=simark.ca@sourceware.org

On 2024-04-16 05:48, Thiago Jung Bauermann wrote:
> 
> Hello Pedro,
> 
> [ I just now noticed that I replied to you in private last time. Sorry
>   about that, I thought I had cc'd the list. ]

I hadn't noticed.  :-)

> 
> Pedro Alves <pedro@palves.net> writes:
> 
>> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>>> When GDB attaches to a multi-threaded process, it calls
>>> linux_proc_attach_tgid_threads () to go through all threads found in
>>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
>>> them.  If it does that twice without the callback reporting that a new
>>> thread was found, then it considers that all inferior threads have been
>>> found and returns.
>>>
>>> The problem is that the callback considers any thread that it hasn't
>>> attached to yet as new.  This causes problems if the process has one or
>>> more zombie threads, because GDB can't attach to it and the loop will
>>> always "find" a new thread (the zombie one), and get stuck in an
>>> infinite loop.
>>>
>>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
>>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
>>> its test program constantly creates and finishes joinable threads so the
>>> chance of having zombie threads is high.
>>
>> Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
>> a zombie lwp (and silently fails)?
>>
>> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:
>>
>>       if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
>> 	{
>> 	  int err = errno;
>>
>> 	  /* Be quiet if we simply raced with the thread exiting.
>> 	     EPERM is returned if the thread's task still exists, and
>> 	     is marked as exited or zombie, as well as other
>> 	     conditions, so in that case, confirm the status in
>> 	     /proc/PID/status.  */
>> 	  if (err == ESRCH
>> 	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
>> 	    {
>> 	      linux_nat_debug_printf
>> 		("Cannot attach to lwp %d: thread is gone (%d: %s)",
>> 		 lwpid, err, safe_strerror (err));
>>
>>               return 0;        <<<< NEW RETURN
>> 	    }
>>
>> Similar thing for gdbserver, of course.
> 
> Thank you for the suggestion. I tried doing that, and in fact I attached
> a patch with that change in comment #17 of PR 31312 when I was
> investigating a fix¹. I called it a workaround because I also had to
> increase the number of iterations in linux_proc_attach_tgid_threads from
> 2 to 100, otherwise GDB gives up on waiting for new inferior threads too
> early and the inferior dies with a SIGTRAP because some new unnoticed
> thread tripped into the breakpoint.
> 
> Because of the need to increase the number of iterations, I didn't
> consider it a good solution and went with the approach in this patch
> series instead. Now I finally understand why I had to increase the
> number of iterations (though I still don't see a way around it other
> than what this patch series does):
> 
> With the approach in this patch series, even if a new thread becomes
> zombie by the time GDB tries to attach to it, it still causes
> new_threads_found to be set the first time GDB notices it, and the loop
> in linux_proc_attach_tgid_threads starts over.
> 
> With the approach above, a new thread that becomes zombie before GDB has
> a chance to attach to it never causes the loop to start over, and so it
> exits earlier.
> 
> I think it's a matter of opinion whether one approach or the other can
> be considered the better one.
> 
> Even with this patch series, it's not guaranteed that two iterations
> without finding new threads is enough to ensure that GDB has found all
> threads in the inferior. I left the test running in a loop overnight
> with the patch series applied and it failed after about 2500 runs.

Thanks for all the investigations.  I hadn't seen the bugzilla before,
I didn't notice there was one identified in the patch.

Let me just start by saying "bleh"...

This is all of course bad ptrace/kernel design...  The only way to plug this race
completely is with kernel help, I believe.

The only way that I know of today that we can make the kernel pause all threads for
us, is with a group-stop.  I.e., pass a SIGSTOP to the inferior, and the kernel stops
_all_ threads with SIGSTOP.

I prototyped it today, and while far from perfect (would need to handle a corner
scenarios like attaching and getting back something != SIGSTOP, and I still see
some spurious SIGSTOPs), it kind of works ...

... except for what I think is a deal breaker that I don't know how to work around:

If you attach to a process that is running on a shell, you will see that the group-stop
makes it go to the background.  GDB resumes the process, but it won't go back to the
foreground without an explicit "fg" in the shell...  Like:

 $ ./test_program
 [1]+  Stopped                 ./test_program
 $

I'm pasting the incomplete patch below for the archives, but I'm not going to
pursue this further.  

Another idea I had was to look for the thread_db thread creation event breakpoint
address, and instead of putting a breakpoint there, put a "jmp $pc" instruction there.
That would halt creation of new pthreads.  But it wouldn't halt raw clones...

Maybe someone has some better idea.  Really the kernel should have a nice interface
for this.  

Thankfully this is a contrived test, no real program should be spawning threads
like this.  One would hope.

I will take a new look at your series.

>From 9dad08eff6fc534964e457de7f99127354dfae44 Mon Sep 17 00:00:00 2001
From: Pedro Alves <pedro@palves.net>
Date: Wed, 17 Apr 2024 12:47:25 +0100
Subject: [PATCH] Force group-stop when attaching

Change-Id: I95bc82d2347af87c74a11ee9ddfc65b851f3e6c7
---
 gdb/linux-nat.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 2602e1f240d..c95b4370d1b 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -1033,6 +1033,14 @@ linux_nat_post_attach_wait (ptid_t ptid, int *signalled)
 			      status_to_str (status).c_str ());
     }
 
+  /* Force the inferior into group-stop, so all threads are frozen by
+     the kernel, so we can attach to all of them, race-free.  */
+  ptrace (PTRACE_CONT, pid, 0, SIGSTOP);
+
+  /* */
+  pid_t pid2 = my_waitpid (pid, &status, __WALL);
+  gdb_assert (pid == pid2);
+
   return status;
 }
 
@@ -1231,6 +1239,31 @@ linux_nat_target::attach (const char *args, int from_tty)
       throw;
     }
 
+  /* Consume the group-stop SIGSTOP for every thread, and move them
+     into ptrace-stopped.  */
+  for (lwp_info *lwp : all_lwps_safe ())
+    {
+      if (lwp->ptid.pid () == lwp->ptid.lwp ())
+	continue;
+
+      int lwp_status;
+      int lwpid = lwp->ptid.lwp ();
+      pid_t ret = my_waitpid (lwpid, &lwp_status, __WALL);
+      gdb_assert (ret == lwpid);
+
+      gdb_assert (WIFSTOPPED (lwp_status));
+      gdb_assert (WSTOPSIG (lwp_status) == SIGSTOP);
+
+      /* If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely
+	 a group-stop.  */
+      siginfo_t siginfo;
+      gdb_assert (!linux_nat_get_siginfo (lwp->ptid, &siginfo));
+      gdb_assert (errno == EINVAL);
+
+      kill_lwp (lwpid, SIGSTOP);
+      ptrace (PTRACE_CONT, lwpid, 0, 0);
+    }
+
   /* Add all the LWPs to gdb's thread list.  */
   iterate_over_lwps (ptid_t (ptid.pid ()),
 		     [] (struct lwp_info *lwp) -> int

base-commit: 2d4c39a885d4d12325d0a9be9e014e75a295fb25
-- 
2.43.2