Mirror of the gdb-patches mailing list
* [RFC PATCH 0/3] Fix attaching to process when it has zombie threads
@ 2024-03-21 23:11 Thiago Jung Bauermann
  2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-03-21 23:11 UTC (permalink / raw)
  To: gdb-patches

Hello,

This patch series fixes a GDB hang that happens often (but not always) when
attaching to a multi-threaded inferior on aarch64-linux and
powerpc64le-linux, as described in PR 31312.  See patch 3 for a detailed
description of the problem.

Patches 1 and 2 are preparatory patches because I want to use existing code
to parse the /proc/PID/stat file to get the thread's starttime value, so
that GDB and gdbserver aren't fooled by PID reuse.
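
For illustration, the field-extraction technique used on /proc/PID/stat can
be sketched as standalone C++.  This is only a hedged sketch of the parsing
approach, not the actual GDB code, and `find_stat_field` is an invented
name.  The key subtlety is that field 2 (comm) may itself contain spaces
and parentheses, so fields are counted from the last ')':

```cpp
#include <cassert>
#include <optional>
#include <string>

/* Extract FIELD (1-based, as numbered in the proc(5) man page) from a
   /proc/PID/stat line.  Fields 1 and 2 (pid and comm) are skipped by
   searching for the last ')', because comm may contain spaces and
   parentheses.  Illustrative sketch only.  */
static std::optional<std::string>
find_stat_field (const std::string &content, int field)
{
  /* Fields 1 and 2 are not handled by this scheme.  */
  if (field < 3)
    return {};

  /* Everything after the last ')' is fields 3 onwards.  */
  std::string::size_type pos = content.find_last_of (')');
  if (pos == std::string::npos)
    return {};

  /* Skip separators until POS points at the start of FIELD.  */
  for (int i = 3; i <= field; ++i)
    {
      pos = content.find_first_of (' ', pos);
      if (pos == std::string::npos)
        return {};
      ++pos;
    }

  std::string::size_type end = content.find_first_of (' ', pos);
  if (end == std::string::npos)
    return content.substr (pos);
  return content.substr (pos, end - pos);
}
```

With input "42 (a b) S 100 200", field 3 is "S" and field 5 is "200",
even though the comm field contains a space.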

This patch series was tested on native and extended-remote aarch64-linux
and armv8l-linux-gnueabihf and no regressions were found, except for the
following:

When running gdb.threads/detach-step-over.exp on armv8l-linux-gnueabihf
extended-remote, sometimes GDBserver dies with:

  builtin_spawn /home/thiago.bauermann/.cache/builds/gdb-native-aarch32/gdb/testsuite/outputs/gdb.threads/detach-step-over/detach-step-over
  Remote debugging from host 127.0.0.1, port 56624
  Process /home/thiago.bauermann/.cache/builds/gdb-native-aarch32/gdb/testsuite/outputs/gdb.threads/detach-step-over/detach-step-over created; pid = 840876
  Attached; pid = 840821
  Detaching from process 840821
  Attached; pid = 840821
  /home/thiago.bauermann/src/binutils-gdb/gdbserver/linux-low.cc:1956: A problem internal to GDBserver has been detected.
  unsuspend LWP 840821, suspended=-1

The assertion triggered is this one:

  /* Decrement LWP's suspend count.  */

  static void
  lwp_suspended_decr (struct lwp_info *lwp)
  {
    lwp->suspended--;

    if (lwp->suspended < 0)
      {
        struct thread_info *thread = get_lwp_thread (lwp);

        internal_error ("unsuspend LWP %ld, suspended=%d\n", lwpid_of (thread),
  		      lwp->suspended);
      }
  }

Unfortunately for the moment I don't have time to further debug this
problem and I didn't want to keep sitting on these patches until I can come
back to this issue.

Note that of all the testcases in the GDB testsuite, only
detach-step-over.exp triggers the GDBserver internal error, so it's a
localized problem.

This is why I'm posting the patch series as an RFC. Considering that it
fixes a problem that is causing instability in the testsuite results for
aarch64-linux and powerpc64le-linux, does it make sense to commit it as is,
and then investigate the GDBserver internal error on armv8l-linux-gnueabihf
later?

Thiago Jung Bauermann (3):
  gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread
  gdb/nat: Factor linux_find_proc_stat_field out of
    linux_common_core_of_thread
  gdb/nat/linux: Fix attaching to process when it has zombie threads

 gdb/nat/linux-osdata.c | 65 +++++++++++++++++++++++++++++++++---------
 gdb/nat/linux-osdata.h |  7 +++++
 gdb/nat/linux-procfs.c | 19 ++++++++++++
 3 files changed, 77 insertions(+), 14 deletions(-)


base-commit: b42aa684f6ff2bce9b8bc58aa89574723f17f1ce

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread
  2024-03-21 23:11 [RFC PATCH 0/3] Fix attaching to process when it has zombie threads Thiago Jung Bauermann
@ 2024-03-21 23:11 ` Thiago Jung Bauermann
  2024-03-22 17:33   ` Luis Machado
  2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-03-21 23:11 UTC (permalink / raw)
  To: gdb-patches

The code and comment reference stat fields by made-up indexes.  The
procfs(5) man page, which describes the /proc/PID/stat file, has a numbered
list of these fields so it's more convenient to use those numbers instead.

This is currently an implementation detail inside the function, so it
doesn't really matter for the code as-is, but a future patch will do some
refactoring which will make the index more prominent.

Therefore, make this change in a separate patch so that it's simpler to
review.
---
 gdb/nat/linux-osdata.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
index c9192940f236..172fea5cea85 100644
--- a/gdb/nat/linux-osdata.c
+++ b/gdb/nat/linux-osdata.c
@@ -75,10 +75,10 @@ linux_common_core_of_thread (ptid_t ptid)
   if (pos == std::string::npos)
     return -1;
 
-  /* If the first field after program name has index 0, then core number is
-     the field with index 36 (so, the 37th).  There's no constant for that
-     anywhere.  */
-  for (int i = 0; i < 37; ++i)
+  /* If the first field after program name has index 3, then core number is
+     the field with index 39.  These are the indexes shown in the procfs(5)
+     man page.  */
+  for (int i = 3; i <= 39; ++i)
     {
       /* Find separator.  */
       pos = content->find_first_of (' ', pos);


* [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread
  2024-03-21 23:11 [RFC PATCH 0/3] Fix attaching to process when it has zombie threads Thiago Jung Bauermann
  2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
@ 2024-03-21 23:11 ` Thiago Jung Bauermann
  2024-03-22 16:12   ` Luis Machado
  2024-04-17 16:06   ` Pedro Alves
  2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
  2024-03-22 10:17 ` [RFC PATCH 0/3] " Christophe Lyon
  3 siblings, 2 replies; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-03-21 23:11 UTC (permalink / raw)
  To: gdb-patches

The new function will be used in a subsequent patch to read a different
stat field.

The new code is believed to be equivalent to the old code, so there should
be no change in GDB behaviour.

Also, take the opportunity to move the function's documentation comment to
the header file, to conform with GDB practice.
---
 gdb/nat/linux-osdata.c | 43 ++++++++++++++++++++++++++++--------------
 gdb/nat/linux-osdata.h |  3 +++
 2 files changed, 32 insertions(+), 14 deletions(-)

diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
index 172fea5cea85..c254f2e4f05b 100644
--- a/gdb/nat/linux-osdata.c
+++ b/gdb/nat/linux-osdata.c
@@ -53,32 +53,31 @@ typedef long long TIME_T;
 
 #define MAX_PID_T_STRLEN  (sizeof ("-9223372036854775808") - 1)
 
-/* Returns the CPU core that thread PTID is currently running on.  */
+/* Returns FIELD (as numbered in procfs(5) man page) of
+   /proc/PID/task/LWP/stat file.  */
 
-/* Compute and return the processor core of a given thread.  */
-
-int
-linux_common_core_of_thread (ptid_t ptid)
+static std::optional<std::string>
+linux_find_proc_stat_field (ptid_t ptid, int field)
 {
+  /* We never need to read PID from the stat file, and there's
+     command_from_pid to read the comm field.  */
+  gdb_assert (field >= 3);
+
   char filename[sizeof ("/proc//task//stat") + 2 * MAX_PID_T_STRLEN];
-  int core;
 
   sprintf (filename, "/proc/%lld/task/%lld/stat",
 	   (PID_T) ptid.pid (), (PID_T) ptid.lwp ());
 
   std::optional<std::string> content = read_text_file_to_string (filename);
   if (!content.has_value ())
-    return -1;
+    return {};
 
   /* ps command also relies on no trailing fields ever contain ')'.  */
   std::string::size_type pos = content->find_last_of (')');
   if (pos == std::string::npos)
-    return -1;
+    return {};
 
-  /* If the first field after program name has index 3, then core number is
-     the field with index 39.  These are the indexes shown in the procfs(5)
-     man page.  */
-  for (int i = 3; i <= 39; ++i)
+  for (int i = 3; i <= field; ++i)
     {
       /* Find separator.  */
       pos = content->find_first_of (' ', pos);
@@ -91,8 +90,24 @@ linux_common_core_of_thread (ptid_t ptid)
 	return {};
     }
 
-  if (sscanf (&(*content)[pos], "%d", &core) == 0)
-    core = -1;
+  /* Find end of field.  */
+  std::string::size_type end_pos = content->find_first_of (' ', pos);
+  if (end_pos == std::string::npos)
+    return content->substr (pos);
+  else
+    return content->substr (pos, end_pos - pos);
+}
+
+/* See linux-osdata.h.  */
+
+int
+linux_common_core_of_thread (ptid_t ptid)
+{
+  std::optional<std::string> field = linux_find_proc_stat_field (ptid, 39);
+  int core;
+
+  if (!field.has_value () || sscanf (field->c_str (), "%d", &core) == 0)
+    return -1;
 
   return core;
 }
diff --git a/gdb/nat/linux-osdata.h b/gdb/nat/linux-osdata.h
index 833915cdb2fd..a82fb08b998e 100644
--- a/gdb/nat/linux-osdata.h
+++ b/gdb/nat/linux-osdata.h
@@ -20,7 +20,10 @@
 #ifndef NAT_LINUX_OSDATA_H
 #define NAT_LINUX_OSDATA_H
 
+/* Returns the CPU core that thread PTID is currently running on.  */
+
 extern int linux_common_core_of_thread (ptid_t ptid);
+
 extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf,
 					 ULONGEST offset, ULONGEST len);
 


* [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-03-21 23:11 [RFC PATCH 0/3] Fix attaching to process when it has zombie threads Thiago Jung Bauermann
  2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
  2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
@ 2024-03-21 23:11 ` Thiago Jung Bauermann
  2024-03-22 16:19   ` Luis Machado
                     ` (2 more replies)
  2024-03-22 10:17 ` [RFC PATCH 0/3] " Christophe Lyon
  3 siblings, 3 replies; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-03-21 23:11 UTC (permalink / raw)
  To: gdb-patches

When GDB attaches to a multi-threaded process, it calls
linux_proc_attach_tgid_threads () to go through all threads found in
/proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
them.  If it does that twice without the callback reporting that a new
thread was found, then it considers that all inferior threads have been
found and returns.

The problem is that the callback considers any thread that it hasn't
attached to yet as new.  This causes problems if the process has one or
more zombie threads, because GDB can't attach to them, so the loop will
always "find" a new thread (the zombie one) and get stuck in an
infinite loop.
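
To see why the termination condition livelocks, the loop can be modeled as
a small standalone simulation.  This is a hedged sketch that only mimics
the control flow described above, not GDB's real attach code; `lwp`,
`count_passes` and `skip_seen` are invented names, and `skip_seen` models
the visited-LWP fix:

```cpp
#include <cassert>
#include <set>
#include <vector>

struct lwp { int id; bool zombie; };

/* Mimics linux_proc_attach_tgid_threads's termination condition: stop
   after two consecutive passes over the task list without finding a
   new thread.  With the buggy callback, a zombie LWP (which can never
   be attached) is reported as "new" on every pass, so only the
   MAX_PASSES cap ends the loop.  SKIP_SEEN models the fix: remember
   visited LWPs and skip them on later passes.  Illustrative only.  */
static int
count_passes (const std::vector<lwp> &tasks, bool skip_seen, int max_passes)
{
  std::set<int> attached, visited;
  int passes = 0;

  for (int quiet = 0; quiet < 2 && passes < max_passes; )
    {
      passes++;
      bool new_found = false;

      for (const lwp &l : tasks)
        {
          if (skip_seen && !visited.insert (l.id).second)
            continue;  /* Already visited on an earlier pass: skip.  */

          if (l.zombie)
            {
              /* Attach fails, but the buggy callback still reports a
                 new thread because it was never attached.  */
              new_found = true;
              continue;
            }

          if (attached.insert (l.id).second)
            new_found = true;
        }

      quiet = new_found ? 0 : quiet + 1;
    }

  return passes;
}
```

With one live thread and one zombie, the unfixed variant only stops when
the pass cap is reached, while the visited-LWP variant settles after three
passes.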

This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
its test program constantly creates and finishes joinable threads so the
chance of having zombie threads is high.

This problem causes the following failures:

FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: no new threads (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted on (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break break_fn (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 1 (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 2 (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 3 (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: reset timer in the inferior (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: print seconds_left (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: detach (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted off (timeout)
FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints (timeout)
ERROR: breakpoints not deleted

The iteration number is random, and all tests in the subsequent iterations
fail too, because GDB is stuck in the attach command at the beginning of
the iteration.

The solution is to make linux_proc_attach_tgid_threads () remember which
LWPs it has already processed and skip them in subsequent iterations.
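
As a side note on the starttime key: field 22 is an unsigned decimal value
that should be parsed defensively.  A hedged sketch of such parsing using
standard `strtoull` (the actual patch uses GDB's `strtoulst` helper
instead, and `parse_starttime` is an invented name):

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <optional>
#include <string>

/* Parse an unsigned decimal field, rejecting overflow, empty input and
   trailing junk.  Sketch using standard strtoull; the actual patch uses
   GDB's strtoulst helper.  */
static std::optional<unsigned long long>
parse_starttime (const std::string &field)
{
  errno = 0;
  char *end = nullptr;
  unsigned long long value = strtoull (field.c_str (), &end, 10);

  if (errno == ERANGE)
    return {};  /* Value out of range.  */
  if (end == field.c_str () || *end != '\0')
    return {};  /* No digits, or unexpected trailing characters.  */

  return value;
}
```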

PR testsuite/31312
Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31312
---
 gdb/nat/linux-osdata.c | 22 ++++++++++++++++++++++
 gdb/nat/linux-osdata.h |  4 ++++
 gdb/nat/linux-procfs.c | 19 +++++++++++++++++++
 3 files changed, 45 insertions(+)

diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
index c254f2e4f05b..998279377433 100644
--- a/gdb/nat/linux-osdata.c
+++ b/gdb/nat/linux-osdata.c
@@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid)
   return core;
 }
 
+/* See linux-osdata.h.  */
+
+std::optional<ULONGEST>
+linux_get_starttime (ptid_t ptid)
+{
+  std::optional<std::string> field = linux_find_proc_stat_field (ptid, 22);
+
+  if (!field.has_value ())
+    return {};
+
+  errno = 0;
+  const char *trailer;
+  ULONGEST starttime = strtoulst (field->c_str (), &trailer, 10);
+  if (starttime == ULONGEST_MAX && errno == ERANGE)
+    return {};
+  else if (*trailer != '\0')
+    /* There were unexpected characters.  */
+    return {};
+
+  return starttime;
+}
+
 /* Finds the command-line of process PID and copies it into COMMAND.
    At most MAXLEN characters are copied.  If the command-line cannot
    be found, PID is copied into command in text-form.  */
diff --git a/gdb/nat/linux-osdata.h b/gdb/nat/linux-osdata.h
index a82fb08b998e..1cdc687aa9cf 100644
--- a/gdb/nat/linux-osdata.h
+++ b/gdb/nat/linux-osdata.h
@@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid);
 extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf,
 					 ULONGEST offset, ULONGEST len);
 
+/* Get the start time of thread PTID.  */
+
+extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid);
+
 #endif /* NAT_LINUX_OSDATA_H */
diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
index b17e3120792e..b01bf36c0b53 100644
--- a/gdb/nat/linux-procfs.c
+++ b/gdb/nat/linux-procfs.c
@@ -17,10 +17,13 @@
    along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
 
 #include "gdbsupport/common-defs.h"
+#include "linux-osdata.h"
 #include "linux-procfs.h"
 #include "gdbsupport/filestuff.h"
 #include <dirent.h>
 #include <sys/stat.h>
+#include <set>
+#include <utility>
 
 /* Return the TGID of LWPID from /proc/pid/status.  Returns -1 if not
    found.  */
@@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid,
       return;
     }
 
+  /* Keeps track of the LWPs we have already visited in /proc,
+     identified by their PID and starttime to detect PID reuse.  */
+  std::set<std::pair<unsigned long,ULONGEST>> visited_lwps;
+
   /* Scan the task list for existing threads.  While we go through the
      threads, new threads may be spawned.  Cycle through the list of
      threads until we have done two iterations without finding new
@@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid,
 	  if (lwp != 0)
 	    {
 	      ptid_t ptid = ptid_t (pid, lwp);
+	      std::optional<ULONGEST> starttime = linux_get_starttime (ptid);
+
+	      if (starttime.has_value ())
+		{
+		  std::pair<unsigned long,ULONGEST> key (lwp, *starttime);
+
+		  /* If we already visited this LWP, skip it this time.  */
+		  if (visited_lwps.find (key) != visited_lwps.cend ())
+		    continue;
+
+		  visited_lwps.insert (key);
+		}
 
 	      if (attach_lwp (ptid))
 		new_threads_found = 1;


* Re: [RFC PATCH 0/3] Fix attaching to process when it has zombie threads
  2024-03-21 23:11 [RFC PATCH 0/3] Fix attaching to process when it has zombie threads Thiago Jung Bauermann
                   ` (2 preceding siblings ...)
  2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
@ 2024-03-22 10:17 ` Christophe Lyon
  3 siblings, 0 replies; 19+ messages in thread
From: Christophe Lyon @ 2024-03-22 10:17 UTC (permalink / raw)
  To: Thiago Jung Bauermann; +Cc: gdb-patches

Hi!


On Fri, 22 Mar 2024 at 00:12, Thiago Jung Bauermann
<thiago.bauermann@linaro.org> wrote:
>
> Hello,
>
> This patch series fixes a GDB hang that happens often (but not always) when
> attaching to a multi-threaded inferior on aarch64-linux and
> powerpc64le-linux, as described in PR 31312.  See patch 3 for a detailed
> description of the problem.
>
> Patches 1 and 2 are preparatory patches because I want to use existing code
> to parse the /proc/PID/stat file to get the thread's starttime value, so
> that GDB and gdbserver aren't fooled by PID reuse.
>
> This patch series was tested on native and extended-remote aarch64-linux
> and armv8l-linux-gnueabihf and no regressions were found, except for the
> following:
>
> When running gdb.threads/detach-step-over.exp on armv8l-linux-gnueabihf
> extended-remote, sometimes GDBserver dies with:
>
>   builtin_spawn /home/thiago.bauermann/.cache/builds/gdb-native-aarch32/gdb/testsuite/outputs/gdb.threads/detach-step-over/detach-step-over
>   Remote debugging from host 127.0.0.1, port 56624
>   Process /home/thiago.bauermann/.cache/builds/gdb-native-aarch32/gdb/testsuite/outputs/gdb.threads/detach-step-over/detach-step-over created; pid = 840876
>   Attached; pid = 840821
>   Detaching from process 840821
>   Attached; pid = 840821
>   /home/thiago.bauermann/src/binutils-gdb/gdbserver/linux-low.cc:1956: A problem internal to GDBserver has been detected.
>   unsuspend LWP 840821, suspended=-1
>
> The assertion triggered is this one:
>
>   /* Decrement LWP's suspend count.  */
>
>   static void
>   lwp_suspended_decr (struct lwp_info *lwp)
>   {
>     lwp->suspended--;
>
>     if (lwp->suspended < 0)
>       {
>         struct thread_info *thread = get_lwp_thread (lwp);
>
>         internal_error ("unsuspend LWP %ld, suspended=%d\n", lwpid_of (thread),
>                       lwp->suspended);
>       }
>   }
>
> Unfortunately for the moment I don't have time to further debug this
> problem and I didn't want to keep sitting on these patches until I can come
> back to this issue.
>
> Note that of all the testcases in the GDB testsuite, only
> detach-step-over.exp triggers the GDBserver internal error, so it's a
> localized problem.
>
> This is why I'm posting the patch series as an RFC. Considering that it
> fixes a problem that is causing instability in the testsuite results for
> aarch64-linux and powerpc64le-linux, does it make sense to commit it as is,
> and then investigate the GDBserver internal error on armv8l-linux-gnueabihf
> later?

I quickly looked at the series, patches 1 and 2 LGTM, I would say
patch 3 too, but it seems to be causing the random failures you
mention :-(
However, I think your rationale is OK, trading many failures for a
single, localized one.
But of course, I'm not a maintainer :-)

Thanks,

Christophe

>
> Thiago Jung Bauermann (3):
>   gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread
>   gdb/nat: Factor linux_find_proc_stat_field out of
>     linux_common_core_of_thread
>   gdb/nat/linux: Fix attaching to process when it has zombie threads
>
>  gdb/nat/linux-osdata.c | 65 +++++++++++++++++++++++++++++++++---------
>  gdb/nat/linux-osdata.h |  7 +++++
>  gdb/nat/linux-procfs.c | 19 ++++++++++++
>  3 files changed, 77 insertions(+), 14 deletions(-)
>
>
> base-commit: b42aa684f6ff2bce9b8bc58aa89574723f17f1ce


* Re: [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread
  2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
@ 2024-03-22 16:12   ` Luis Machado
  2024-04-17 16:06   ` Pedro Alves
  1 sibling, 0 replies; 19+ messages in thread
From: Luis Machado @ 2024-03-22 16:12 UTC (permalink / raw)
  To: Thiago Jung Bauermann, gdb-patches

On 3/21/24 23:11, Thiago Jung Bauermann wrote:
> The new function will be used in a subsequent patch to read a different
> stat field.
> 
> The new code is believed to be equivalent to the old code, so there should
> be no change in GDB behaviour.
> 
> Also, take the opportunity to move the function's documentation comment to
> the header file, to conform with GDB practice.
> ---
>  gdb/nat/linux-osdata.c | 43 ++++++++++++++++++++++++++++--------------
>  gdb/nat/linux-osdata.h |  3 +++
>  2 files changed, 32 insertions(+), 14 deletions(-)
> 
> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
> index 172fea5cea85..c254f2e4f05b 100644
> --- a/gdb/nat/linux-osdata.c
> +++ b/gdb/nat/linux-osdata.c
> @@ -53,32 +53,31 @@ typedef long long TIME_T;
>  
>  #define MAX_PID_T_STRLEN  (sizeof ("-9223372036854775808") - 1)
>  
> -/* Returns the CPU core that thread PTID is currently running on.  */
> +/* Returns FIELD (as numbered in procfs(5) man page) of
> +   /proc/PID/task/LWP/stat file.  */
>  
> -/* Compute and return the processor core of a given thread.  */
> -
> -int
> -linux_common_core_of_thread (ptid_t ptid)
> +static std::optional<std::string>
> +linux_find_proc_stat_field (ptid_t ptid, int field)
>  {
> +  /* We never need to read PID from the stat file, and there's
> +     command_from_pid to read the comm field.  */
> +  gdb_assert (field >= 3);
> +
>    char filename[sizeof ("/proc//task//stat") + 2 * MAX_PID_T_STRLEN];
> -  int core;
>  
>    sprintf (filename, "/proc/%lld/task/%lld/stat",
>  	   (PID_T) ptid.pid (), (PID_T) ptid.lwp ());
>  
>    std::optional<std::string> content = read_text_file_to_string (filename);
>    if (!content.has_value ())
> -    return -1;
> +    return {};
>  
>    /* ps command also relies on no trailing fields ever contain ')'.  */
>    std::string::size_type pos = content->find_last_of (')');
>    if (pos == std::string::npos)
> -    return -1;
> +    return {};
>  
> -  /* If the first field after program name has index 3, then core number is
> -     the field with index 39.  These are the indexes shown in the procfs(5)
> -     man page.  */
> -  for (int i = 3; i <= 39; ++i)
> +  for (int i = 3; i <= field; ++i)
>      {
>        /* Find separator.  */
>        pos = content->find_first_of (' ', pos);
> @@ -91,8 +90,24 @@ linux_common_core_of_thread (ptid_t ptid)
>  	return {};
>      }
>  
> -  if (sscanf (&(*content)[pos], "%d", &core) == 0)
> -    core = -1;
> +  /* Find end of field.  */
> +  std::string::size_type end_pos = content->find_first_of (' ', pos);
> +  if (end_pos == std::string::npos)
> +    return content->substr (pos);
> +  else
> +    return content->substr (pos, end_pos - pos);
> +}
> +
> +/* See linux-osdata.h.  */
> +
> +int
> +linux_common_core_of_thread (ptid_t ptid)
> +{
> +  std::optional<std::string> field = linux_find_proc_stat_field (ptid, 39);
> +  int core;
> +
> +  if (!field.has_value () || sscanf (field->c_str (), "%d", &core) == 0)
> +    return -1;
>  
>    return core;
>  }
> diff --git a/gdb/nat/linux-osdata.h b/gdb/nat/linux-osdata.h
> index 833915cdb2fd..a82fb08b998e 100644
> --- a/gdb/nat/linux-osdata.h
> +++ b/gdb/nat/linux-osdata.h
> @@ -20,7 +20,10 @@
>  #ifndef NAT_LINUX_OSDATA_H
>  #define NAT_LINUX_OSDATA_H
>  
> +/* Returns the CPU core that thread PTID is currently running on.  */
> +
>  extern int linux_common_core_of_thread (ptid_t ptid);
> +
>  extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf,
>  					 ULONGEST offset, ULONGEST len);
>  

Looks OK overall. I'd just attempt to replace the constants (3 and 39) with something more
explicit (or an enum, not sure if we need to go that far though)

Reviewed-By: Luis Machado <luis.machado@arm.com>


* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
@ 2024-03-22 16:19   ` Luis Machado
  2024-03-22 16:52   ` Pedro Alves
  2024-04-17 16:28   ` Pedro Alves
  2 siblings, 0 replies; 19+ messages in thread
From: Luis Machado @ 2024-03-22 16:19 UTC (permalink / raw)
  To: Thiago Jung Bauermann, gdb-patches

Thanks for the patch.

On 3/21/24 23:11, Thiago Jung Bauermann wrote:
> When GDB attaches to a multi-threaded process, it calls
> linux_proc_attach_tgid_threads () to go through all threads found in
> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
> them.  If it does that twice without the callback reporting that a new
> thread was found, then it considers that all inferior threads have been
> found and returns.
> 
> The problem is that the callback considers any thread that it hasn't
> attached to yet as new.  This causes problems if the process has one or
> more zombie threads, because GDB can't attach to them, so the loop will
> always "find" a new thread (the zombie one) and get stuck in an
> infinite loop.
> 
> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
> its test program constantly creates and finishes joinable threads so the
> chance of having zombie threads is high.
> 
> This problem causes the following failures:
> 
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: no new threads (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted on (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break break_fn (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 1 (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 2 (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 3 (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: reset timer in the inferior (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: print seconds_left (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: detach (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted off (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints (timeout)
> ERROR: breakpoints not deleted
> 
> The iteration number is random, and all tests in the subsequent iterations
> fail too, because GDB is stuck in the attach command at the beginning of
> the iteration.
> 
> The solution is to make linux_proc_attach_tgid_threads () remember which
> LWPs it has already processed and skip them in subsequent iterations.
> 
> PR testsuite/31312
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31312
> ---
>  gdb/nat/linux-osdata.c | 22 ++++++++++++++++++++++
>  gdb/nat/linux-osdata.h |  4 ++++
>  gdb/nat/linux-procfs.c | 19 +++++++++++++++++++
>  3 files changed, 45 insertions(+)
> 
> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
> index c254f2e4f05b..998279377433 100644
> --- a/gdb/nat/linux-osdata.c
> +++ b/gdb/nat/linux-osdata.c
> @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid)
>    return core;
>  }
>  
> +/* See linux-osdata.h.  */
> +
> +std::optional<ULONGEST>
> +linux_get_starttime (ptid_t ptid)
> +{
> +  std::optional<std::string> field = linux_find_proc_stat_field (ptid, 22);

Same idea about turning the explicit index into a named constant.

> +
> +  if (!field.has_value ())
> +    return {};
> +
> +  errno = 0;
> +  const char *trailer;
> +  ULONGEST starttime = strtoulst (field->c_str (), &trailer, 10);
> +  if (starttime == ULONGEST_MAX && errno == ERANGE)
> +    return {};
> +  else if (*trailer != '\0')
> +    /* There were unexpected characters.  */
> +    return {};
> +
> +  return starttime;
> +}
> +
>  /* Finds the command-line of process PID and copies it into COMMAND.
>     At most MAXLEN characters are copied.  If the command-line cannot
>     be found, PID is copied into command in text-form.  */
> diff --git a/gdb/nat/linux-osdata.h b/gdb/nat/linux-osdata.h
> index a82fb08b998e..1cdc687aa9cf 100644
> --- a/gdb/nat/linux-osdata.h
> +++ b/gdb/nat/linux-osdata.h
> @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid);
>  extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf,
>  					 ULONGEST offset, ULONGEST len);
>  
> +/* Get the start time of thread PTID.  */
> +
> +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid);
> +
>  #endif /* NAT_LINUX_OSDATA_H */
> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
> index b17e3120792e..b01bf36c0b53 100644
> --- a/gdb/nat/linux-procfs.c
> +++ b/gdb/nat/linux-procfs.c
> @@ -17,10 +17,13 @@
>     along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
>  
>  #include "gdbsupport/common-defs.h"
> +#include "linux-osdata.h"
>  #include "linux-procfs.h"
>  #include "gdbsupport/filestuff.h"
>  #include <dirent.h>
>  #include <sys/stat.h>
> +#include <set>
> +#include <utility>
>  
>  /* Return the TGID of LWPID from /proc/pid/status.  Returns -1 if not
>     found.  */
> @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid,
>        return;
>      }
>  
> +  /* Keeps track of the LWPs we have already visited in /proc,
> +     identified by their PID and starttime to detect PID reuse.  */
> +  std::set<std::pair<unsigned long,ULONGEST>> visited_lwps;
> +
>    /* Scan the task list for existing threads.  While we go through the
>       threads, new threads may be spawned.  Cycle through the list of
>       threads until we have done two iterations without finding new
> @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid,
>  	  if (lwp != 0)
>  	    {
>  	      ptid_t ptid = ptid_t (pid, lwp);
> +	      std::optional<ULONGEST> starttime = linux_get_starttime (ptid);
> +
> +	      if (starttime.has_value ())
> +		{
> +		  std::pair<unsigned long,ULONGEST> key (lwp, *starttime);
> +
> +		  /* If we already visited this LWP, skip it this time.  */
> +		  if (visited_lwps.find (key) != visited_lwps.cend ())
> +		    continue;
> +
> +		  visited_lwps.insert (key);
> +		}
>  
>  	      if (attach_lwp (ptid))
>  		new_threads_found = 1;

Looks OK to me overall as well, pending the constant comment.

I'd like to give others a chance to comment first.

Reviewed-By: Luis Machado <luis.machado@arm.com>


* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
  2024-03-22 16:19   ` Luis Machado
@ 2024-03-22 16:52   ` Pedro Alves
  2024-04-16  4:48     ` Thiago Jung Bauermann
  2024-04-17 16:28   ` Pedro Alves
  2 siblings, 1 reply; 19+ messages in thread
From: Pedro Alves @ 2024-03-22 16:52 UTC (permalink / raw)
  To: Thiago Jung Bauermann, gdb-patches

On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
> When GDB attaches to a multi-threaded process, it calls
> linux_proc_attach_tgid_threads () to go through all threads found in
> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
> them.  If it does that twice without the callback reporting that a new
> thread was found, then it considers that all inferior threads have been
> found and returns.
> 
> The problem is that the callback considers any thread that it hasn't
> attached to yet as new.  This causes problems if the process has one or
> more zombie threads, because GDB can't attach to it and the loop will
> always "find" a new thread (the zombie one), and get stuck in an
> infinite loop.
> 
> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
> its test program constantly creates and finishes joinable threads so the
> chance of having zombie threads is high.

Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
a zombie lwp (and silently fails)?

I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:

      if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
	{
	  int err = errno;

	  /* Be quiet if we simply raced with the thread exiting.
	     EPERM is returned if the thread's task still exists, and
	     is marked as exited or zombie, as well as other
	     conditions, so in that case, confirm the status in
	     /proc/PID/status.  */
	  if (err == ESRCH
	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
	    {
	      linux_nat_debug_printf
		("Cannot attach to lwp %d: thread is gone (%d: %s)",
		 lwpid, err, safe_strerror (err));

              return 0;        <<<< NEW RETURN
	    }

Similar thing for gdbserver, of course.

Pedro Alves



* Re: [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread
  2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
@ 2024-03-22 17:33   ` Luis Machado
  2024-04-17 15:55     ` Pedro Alves
  0 siblings, 1 reply; 19+ messages in thread
From: Luis Machado @ 2024-03-22 17:33 UTC (permalink / raw)
  To: Thiago Jung Bauermann, gdb-patches

On 3/21/24 23:11, Thiago Jung Bauermann wrote:
> The code and comment reference stat fields by made-up indexes.  The
> procfs(5) man page, which describes the /proc/PID/stat file, has a numbered
> list of these fields so it's more convenient to use those numbers instead.
> 
> This is currently an implementation detail inside the function so it's
> not really relevant with the code as-is, but a future patch will do some
> refactoring which will make the index more prominent.
> 
> Therefore, make this change in a separate patch so that it's simpler to
> review.
> ---
>  gdb/nat/linux-osdata.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
> index c9192940f236..172fea5cea85 100644
> --- a/gdb/nat/linux-osdata.c
> +++ b/gdb/nat/linux-osdata.c
> @@ -75,10 +75,10 @@ linux_common_core_of_thread (ptid_t ptid)
>    if (pos == std::string::npos)
>      return -1;
>  
> -  /* If the first field after program name has index 0, then core number is
> -     the field with index 36 (so, the 37th).  There's no constant for that
> -     anywhere.  */
> -  for (int i = 0; i < 37; ++i)
> +  /* If the first field after program name has index 3, then core number is
> +     the field with index 39.  These are the indexes shown in the procfs(5)
> +     man page.  */
> +  for (int i = 3; i <= 39; ++i)
>      {
>        /* Find separator.  */
>        pos = content->find_first_of (' ', pos);

Looks ok to me.

See the comment about turning the numeric constants into named constants.

Reviewed-By: Luis Machado <luis.machado@arm.com>


* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-03-22 16:52   ` Pedro Alves
@ 2024-04-16  4:48     ` Thiago Jung Bauermann
  2024-04-17 15:32       ` Pedro Alves
  0 siblings, 1 reply; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-04-16  4:48 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb-patches


Hello Pedro,

[ I just now noticed that I replied to you in private last time. Sorry
  about that, I thought I had cc'd the list. ]

Pedro Alves <pedro@palves.net> writes:

> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>> When GDB attaches to a multi-threaded process, it calls
>> linux_proc_attach_tgid_threads () to go through all threads found in
>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
>> them.  If it does that twice without the callback reporting that a new
>> thread was found, then it considers that all inferior threads have been
>> found and returns.
>>
>> The problem is that the callback considers any thread that it hasn't
>> attached to yet as new.  This causes problems if the process has one or
>> more zombie threads, because GDB can't attach to it and the loop will
>> always "find" a new thread (the zombie one), and get stuck in an
>> infinite loop.
>>
>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
>> its test program constantly creates and finishes joinable threads so the
>> chance of having zombie threads is high.
>
> Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
> a zombie lwp (and silently fails)?
>
> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:
>
>       if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
> 	{
> 	  int err = errno;
>
> 	  /* Be quiet if we simply raced with the thread exiting.
> 	     EPERM is returned if the thread's task still exists, and
> 	     is marked as exited or zombie, as well as other
> 	     conditions, so in that case, confirm the status in
> 	     /proc/PID/status.  */
> 	  if (err == ESRCH
> 	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
> 	    {
> 	      linux_nat_debug_printf
> 		("Cannot attach to lwp %d: thread is gone (%d: %s)",
> 		 lwpid, err, safe_strerror (err));
>
>               return 0;        <<<< NEW RETURN
> 	    }
>
> Similar thing for gdbserver, of course.

Thank you for the suggestion. I tried doing that, and in fact I attached
a patch with that change in comment #17 of PR 31312 when I was
investigating a fix¹. I called it a workaround because I also had to
increase the number of iterations in linux_proc_attach_tgid_threads from
2 to 100, otherwise GDB gives up on waiting for new inferior threads too
early and the inferior dies with a SIGTRAP because some new unnoticed
thread tripped into the breakpoint.

Because of the need to increase the number of iterations, I didn't
consider it a good solution and went with the approach in this patch
series instead. Now I finally understand why I had to increase the
number of iterations (though I still don't see a way around it other
than what this patch series does):

With the approach in this patch series, even if a new thread becomes
zombie by the time GDB tries to attach to it, it still causes
new_threads_found to be set the first time GDB notices it, and the loop
in linux_proc_attach_tgid_threads starts over.

With the approach above, a new thread that becomes zombie before GDB has
a chance to attach to it never causes the loop to start over, and so it
exits earlier.

I think it's a matter of opinion whether one approach or the other can
be considered the better one.

Even with this patch series, it's not guaranteed that two iterations
without finding new threads is enough to ensure that GDB has found all
threads in the inferior. I left the test running in a loop overnight
with the patch series applied and it failed after about 2500 runs.

--
Thiago

¹ https://sourceware.org/bugzilla/attachment.cgi?id=15405&action=diff


* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-04-16  4:48     ` Thiago Jung Bauermann
@ 2024-04-17 15:32       ` Pedro Alves
  2024-04-20  5:00         ` Thiago Jung Bauermann
  0 siblings, 1 reply; 19+ messages in thread
From: Pedro Alves @ 2024-04-17 15:32 UTC (permalink / raw)
  To: Thiago Jung Bauermann; +Cc: gdb-patches

On 2024-04-16 05:48, Thiago Jung Bauermann wrote:
> 
> Hello Pedro,
> 
> [ I just now noticed that I replied to you in private last time. Sorry
>   about that, I thought I had cc'd the list. ]

I hadn't noticed.  :-)

> 
> Pedro Alves <pedro@palves.net> writes:
> 
>> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>>> When GDB attaches to a multi-threaded process, it calls
>>> linux_proc_attach_tgid_threads () to go through all threads found in
>>> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
>>> them.  If it does that twice without the callback reporting that a new
>>> thread was found, then it considers that all inferior threads have been
>>> found and returns.
>>>
>>> The problem is that the callback considers any thread that it hasn't
>>> attached to yet as new.  This causes problems if the process has one or
>>> more zombie threads, because GDB can't attach to it and the loop will
>>> always "find" a new thread (the zombie one), and get stuck in an
>>> infinite loop.
>>>
>>> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
>>> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
>>> its test program constantly creates and finishes joinable threads so the
>>> chance of having zombie threads is high.
>>
>> Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
>> a zombie lwp (and silently fails)?
>>
>> I.e., in gdb, make attach_proc_task_lwp_callback return false/0 here:
>>
>>       if (ptrace (PTRACE_ATTACH, lwpid, 0, 0) < 0)
>> 	{
>> 	  int err = errno;
>>
>> 	  /* Be quiet if we simply raced with the thread exiting.
>> 	     EPERM is returned if the thread's task still exists, and
>> 	     is marked as exited or zombie, as well as other
>> 	     conditions, so in that case, confirm the status in
>> 	     /proc/PID/status.  */
>> 	  if (err == ESRCH
>> 	      || (err == EPERM && linux_proc_pid_is_gone (lwpid)))
>> 	    {
>> 	      linux_nat_debug_printf
>> 		("Cannot attach to lwp %d: thread is gone (%d: %s)",
>> 		 lwpid, err, safe_strerror (err));
>>
>>               return 0;        <<<< NEW RETURN
>> 	    }
>>
>> Similar thing for gdbserver, of course.
> 
> Thank you for the suggestion. I tried doing that, and in fact I attached
> a patch with that change in comment #17 of PR 31312 when I was
> investigating a fix¹. I called it a workaround because I also had to
> increase the number of iterations in linux_proc_attach_tgid_threads from
> 2 to 100, otherwise GDB gives up on waiting for new inferior threads too
> early and the inferior dies with a SIGTRAP because some new unnoticed
> thread tripped into the breakpoint.
> 
> Because of the need to increase the number of iterations, I didn't
> consider it a good solution and went with the approach in this patch
> series instead. Now I finally understand why I had to increase the
> number of iterations (though I still don't see a way around it other
> than what this patch series does):
> 
> With the approach in this patch series, even if a new thread becomes
> zombie by the time GDB tries to attach to it, it still causes
> new_threads_found to be set the first time GDB notices it, and the loop
> in linux_proc_attach_tgid_threads starts over.
> 
> With the approach above, a new thread that becomes zombie before GDB has
> a chance to attach to it never causes the loop to start over, and so it
> exits earlier.
> 
> I think it's a matter of opinion whether one approach or the other can
> be considered the better one.
> 
> Even with this patch series, it's not guaranteed that two iterations
> without finding new threads is enough to ensure that GDB has found all
> threads in the inferior. I left the test running in a loop overnight
> with the patch series applied and it failed after about 2500 runs.

Thanks for all the investigations.  I hadn't seen the bugzilla before;
I didn't notice there was one identified in the patch.

Let me just start by saying "bleh"...

This is all of course bad ptrace/kernel design...  The only way to plug this race
completely is with kernel help, I believe.

The only way that I know of today that we can make the kernel pause all threads for
us, is with a group-stop.  I.e., pass a SIGSTOP to the inferior, and the kernel stops
_all_ threads with SIGSTOP.

I prototyped it today, and while far from perfect (would need to handle corner
scenarios like attaching and getting back something != SIGSTOP, and I still see
some spurious SIGSTOPs), it kind of works ...

... except for what I think is a deal breaker that I don't know how to work around:

If you attach to a process that is running on a shell, you will see that the group-stop
makes it go to the background.  GDB resumes the process, but it won't go back to the
foreground without an explicit "fg" in the shell...  Like:

 $ ./test_program
 [1]+  Stopped                 ./test_program
 $

I'm pasting the incomplete patch below for the archives, but I'm not going to
pursue this further.  

Another idea I had was to look for the thread_db thread creation event breakpoint
address, and instead of putting a breakpoint there, put a "jmp $pc" instruction there.
That would halt creation of new pthreads.  But it wouldn't halt raw clones...

Maybe someone has some better idea.  Really the kernel should have a nice interface
for this.  

Thankfully this is a contrived test, no real program should be spawning threads
like this.  One would hope.

I will take a new look at your series.

From 9dad08eff6fc534964e457de7f99127354dfae44 Mon Sep 17 00:00:00 2001
From: Pedro Alves <pedro@palves.net>
Date: Wed, 17 Apr 2024 12:47:25 +0100
Subject: [PATCH] Force group-stop when attaching

Change-Id: I95bc82d2347af87c74a11ee9ddfc65b851f3e6c7
---
 gdb/linux-nat.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index 2602e1f240d..c95b4370d1b 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -1033,6 +1033,14 @@ linux_nat_post_attach_wait (ptid_t ptid, int *signalled)
 			      status_to_str (status).c_str ());
     }
 
+  /* Force the inferior into group-stop, so all threads are frozen by
+     the kernel, so we can attach to all of them, race-free.  */
+  ptrace (PTRACE_CONT, pid, 0, SIGSTOP);
+
+  /* */
+  pid_t pid2 = my_waitpid (pid, &status, __WALL);
+  gdb_assert (pid == pid2);
+
   return status;
 }
 
@@ -1231,6 +1239,31 @@ linux_nat_target::attach (const char *args, int from_tty)
       throw;
     }
 
+  /* Consume the group-stop SIGSTOP for every thread, and move them
+     into ptrace-stopped.  */
+  for (lwp_info *lwp : all_lwps_safe ())
+    {
+      if (lwp->ptid.pid () == lwp->ptid.lwp ())
+	continue;
+
+      int lwp_status;
+      int lwpid = lwp->ptid.lwp ();
+      pid_t ret = my_waitpid (lwpid, &lwp_status, __WALL);
+      gdb_assert (ret == lwpid);
+
+      gdb_assert (WIFSTOPPED (lwp_status));
+      gdb_assert (WSTOPSIG (lwp_status) == SIGSTOP);
+
+      /* If PTRACE_GETSIGINFO fails with EINVAL, then it is definitely
+	 a group-stop.  */
+      siginfo_t siginfo;
+      gdb_assert (!linux_nat_get_siginfo (lwp->ptid, &siginfo));
+      gdb_assert (errno == EINVAL);
+
+      kill_lwp (lwpid, SIGSTOP);
+      ptrace (PTRACE_CONT, lwpid, 0, 0);
+    }
+
   /* Add all the LWPs to gdb's thread list.  */
   iterate_over_lwps (ptid_t (ptid.pid ()),
 		     [] (struct lwp_info *lwp) -> int

base-commit: 2d4c39a885d4d12325d0a9be9e014e75a295fb25
-- 
2.43.2



* Re: [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread
  2024-03-22 17:33   ` Luis Machado
@ 2024-04-17 15:55     ` Pedro Alves
  2024-04-20  5:15       ` Thiago Jung Bauermann
  0 siblings, 1 reply; 19+ messages in thread
From: Pedro Alves @ 2024-04-17 15:55 UTC (permalink / raw)
  To: Luis Machado, Thiago Jung Bauermann, gdb-patches

On 2024-03-22 17:33, Luis Machado wrote:

>> ---
>>  gdb/nat/linux-osdata.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
>> index c9192940f236..172fea5cea85 100644
>> --- a/gdb/nat/linux-osdata.c
>> +++ b/gdb/nat/linux-osdata.c
>> @@ -75,10 +75,10 @@ linux_common_core_of_thread (ptid_t ptid)
>>    if (pos == std::string::npos)
>>      return -1;
>>  
>> -  /* If the first field after program name has index 0, then core number is
>> -     the field with index 36 (so, the 37th).  There's no constant for that
>> -     anywhere.  */
>> -  for (int i = 0; i < 37; ++i)
>> +  /* If the first field after program name has index 3, then core number is
>> +     the field with index 39.  These are the indexes shown in the procfs(5)
>> +     man page.  */
>> +  for (int i = 3; i <= 39; ++i)
>>      {
>>        /* Find separator.  */
>>        pos = content->find_first_of (' ', pos);
> 
> Looks ok to me.
> 
> See the comment about turning the numeric constants into named constants.

IMHO, macros here would obfuscate more than help.



* Re: [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread
  2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
  2024-03-22 16:12   ` Luis Machado
@ 2024-04-17 16:06   ` Pedro Alves
  2024-04-20  5:16     ` Thiago Jung Bauermann
  1 sibling, 1 reply; 19+ messages in thread
From: Pedro Alves @ 2024-04-17 16:06 UTC (permalink / raw)
  To: Thiago Jung Bauermann, gdb-patches

On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
> The new function will be used in a subsequent patch to read a different
> stat field.
> 
> The new code is believed to be equivalent to the old code, so there should
> be no change in GDB behaviour.
> 
> Also, take the opportunity to move the function's documentation comment to
> the header file, to conform with GDB practice.

IMO, it'd be even better to move the new function to linux-proc.c, with a
linux_proc prefix.



* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
  2024-03-22 16:19   ` Luis Machado
  2024-03-22 16:52   ` Pedro Alves
@ 2024-04-17 16:28   ` Pedro Alves
  2024-04-20  5:28     ` Thiago Jung Bauermann
  2 siblings, 1 reply; 19+ messages in thread
From: Pedro Alves @ 2024-04-17 16:28 UTC (permalink / raw)
  To: Thiago Jung Bauermann, gdb-patches

On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
> When GDB attaches to a multi-threaded process, it calls
> linux_proc_attach_tgid_threads () to go through all threads found in
> /proc/PID/task/ and call attach_proc_task_lwp_callback () on each of
> them.  If it does that twice without the callback reporting that a new
> thread was found, then it considers that all inferior threads have been
> found and returns.
> 
> The problem is that the callback considers any thread that it hasn't
> attached to yet as new.  This causes problems if the process has one or
> more zombie threads, because GDB can't attach to it and the loop will
> always "find" a new thread (the zombie one), and get stuck in an
> infinite loop.
> 
> This is easy to trigger (at least on aarch64-linux and powerpc64le-linux)
> with the gdb.threads/attach-many-short-lived-threads.exp testcase, because
> its test program constantly creates and finishes joinable threads so the
> chance of having zombie threads is high.
> 
> This problem causes the following failures:
> 
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: attach (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: no new threads (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted on (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break break_fn (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 1 (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 2 (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: break at break_fn: 3 (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: reset timer in the inferior (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: print seconds_left (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: detach (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: set breakpoint always-inserted off (timeout)
> FAIL: gdb.threads/attach-many-short-lived-threads.exp: iter 8: delete all breakpoints, watchpoints, tracepoints, and catchpoints in delete_breakpoints (timeout)
> ERROR: breakpoints not deleted
> 
> The iteration number is random, and all tests in the subsequent iterations
> fail too, because GDB is stuck in the attach command at the beginning of
> the iteration.
> 
> The solution is to make linux_proc_attach_tgid_threads () remember when it
> has already processed a given LWP and skip it in the subsequent iterations.
> 
> PR testsuite/31312
> Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=31312
> ---
>  gdb/nat/linux-osdata.c | 22 ++++++++++++++++++++++
>  gdb/nat/linux-osdata.h |  4 ++++
>  gdb/nat/linux-procfs.c | 19 +++++++++++++++++++
>  3 files changed, 45 insertions(+)
> 
> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
> index c254f2e4f05b..998279377433 100644
> --- a/gdb/nat/linux-osdata.c
> +++ b/gdb/nat/linux-osdata.c
> @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid)
>    return core;
>  }
>  
> +/* See linux-osdata.h.  */
> +
> +std::optional<ULONGEST>
> +linux_get_starttime (ptid_t ptid)

Ditto re. moving this to linux-procfs.  This has nothing to do with "info osdata".

> index a82fb08b998e..1cdc687aa9cf 100644
> --- a/gdb/nat/linux-osdata.h
> +++ b/gdb/nat/linux-osdata.h
> @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid);
>  extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf,
>  					 ULONGEST offset, ULONGEST len);
>  
> +/* Get the start time of thread PTID.  */
> +
> +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid);
> +
>  #endif /* NAT_LINUX_OSDATA_H */
> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
> index b17e3120792e..b01bf36c0b53 100644
> --- a/gdb/nat/linux-procfs.c
> +++ b/gdb/nat/linux-procfs.c
> @@ -17,10 +17,13 @@
>     along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
>  
>  #include "gdbsupport/common-defs.h"
> +#include "linux-osdata.h"
>  #include "linux-procfs.h"

linux-procfs.h is the main header of this source file, so it should be first.
But then again, I think this shouldn't be consuming things from linux-osdata, it
should be the other way around.

>  #include "gdbsupport/filestuff.h"
>  #include <dirent.h>
>  #include <sys/stat.h>
> +#include <set>
> +#include <utility>
>  
>  /* Return the TGID of LWPID from /proc/pid/status.  Returns -1 if not
>     found.  */
> @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid,
>        return;
>      }
>  
> +  /* Keeps track of the LWPs we have already visited in /proc,
> +     identified by their PID and starttime to detect PID reuse.  */
> +  std::set<std::pair<unsigned long,ULONGEST>> visited_lwps;

Missing space before ULONGEST.

AFAICT, you don't rely on order, so this could be an unordered_set?


> +
>    /* Scan the task list for existing threads.  While we go through the
>       threads, new threads may be spawned.  Cycle through the list of
>       threads until we have done two iterations without finding new
> @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid,
>  	  if (lwp != 0)
>  	    {
>  	      ptid_t ptid = ptid_t (pid, lwp);
> +	      std::optional<ULONGEST> starttime = linux_get_starttime (ptid);
> +
> +	      if (starttime.has_value ())
> +		{
> +		  std::pair<unsigned long,ULONGEST> key (lwp, *starttime);

Space before ULONGEST.

> +
> +		  /* If we already visited this LWP, skip it this time.  */
> +		  if (visited_lwps.find (key) != visited_lwps.cend ())
> +		    continue;
> +
> +		  visited_lwps.insert (key);
> +		}
>  
>  	      if (attach_lwp (ptid))
>  		new_threads_found = 1;



* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-04-17 15:32       ` Pedro Alves
@ 2024-04-20  5:00         ` Thiago Jung Bauermann
  2024-04-26 15:35           ` Pedro Alves
  0 siblings, 1 reply; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-04-20  5:00 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb-patches


Hello Pedro,

Pedro Alves <pedro@palves.net> writes:

> On 2024-04-16 05:48, Thiago Jung Bauermann wrote:
>>
>> Pedro Alves <pedro@palves.net> writes:
>>
>>> Hmm.  How about simply not restarting the loop if attach_lwp tries to attach to
>>> a zombie lwp (and silently fails)?

<snip>

>>> Similar thing for gdbserver, of course.
>>
>> Thank you for the suggestion. I tried doing that, and in fact I attached
>> a patch with that change in comment #17 of PR 31312 when I was
>> investigating a fix¹. I called it a workaround because I also had to
>> increase the number of iterations in linux_proc_attach_tgid_threads from
>> 2 to 100, otherwise GDB gives up on waiting for new inferior threads too
>> early and the inferior dies with a SIGTRAP because some new unnoticed
>> thread tripped into the breakpoint.
>>
>> Because of the need to increase the number of iterations, I didn't
>> consider it a good solution and went with the approach in this patch
>> series instead. Now I finally understand why I had to increase the
>> number of iterations (though I still don't see a way around it other
>> than what this patch series does):
>>
>> With the approach in this patch series, even if a new thread becomes
>> zombie by the time GDB tries to attach to it, it still causes
>> new_threads_found to be set the first time GDB notices it, and the loop
>> in linux_proc_attach_tgid_threads starts over.
>>
>> With the approach above, a new thread that becomes zombie before GDB has
>> a chance to attach to it never causes the loop to start over, and so it
>> exits earlier.
>>
>> I think it's a matter of opinion whether one approach or the other can
>> be considered the better one.
>>
>> Even with this patch series, it's not guaranteed that two iterations
>> without finding new threads is enough to ensure that GDB has found all
>> threads in the inferior. I left the test running in a loop overnight
>> with the patch series applied and it failed after about 2500 runs.
>
> Thanks for all the investigations.  I hadn't seen the bugzilla before,
> I didn't notice there was one identified in the patch.
>
> Let me just start by saying "bleh"...

Agreed.

> This is all of course bad ptrace/kernel design...  The only way to plug this race
> completely is with kernel help, I believe.

Indeed. Having a ptrace request to tell the kernel to stop all threads in
the process, and having a "way to use ptrace without signals or wait"
(as mentioned in the LinuxKernelWishList wiki page) would be a big
improvement. In the past few years the kernel has been introducing more
and more syscalls that accept a pidfd¹. Perhaps the time is ripe for a
pidfd_ptrace syscall?

> The only way that I know of today that we can make the kernel pause all threads for
> us, is with a group-stop.  I.e., pass a SIGSTOP to the inferior, and the kernel stops
> _all_ threads with SIGSTOP.
>
> I prototyped it today, and while far from perfect (would need to handle a corner
> scenarios like attaching and getting back something != SIGSTOP, and I still see
> some spurious SIGSTOPs), it kind of works ...

This is amazing. Thank you for this exploration and the patch.

> ... except for what I think is a deal breaker that I don't know how to work around:
>
> If you attach to a process that is running on a shell, you will see that the group-stop
> makes it go to the background.  GDB resumes the process, but it won't go back to the
> foreground without an explicit "fg" in the shell...  Like:
>
>  $ ./test_program
>  [1]+  Stopped                 ./test_program
>  $

Couldn't this be a shell behaviour rather than a kernel one? I don't see
anything that I could associate with this effect mentioned in the ptrace
man page (though I could have easily missed it). If it is, then the fix
would be in the shell too.

> Thankfully this is a contrived test, no real program should be spawning threads
> like this.  One would hope.

Yes, it's a particularly egregious program. :-)

> I will take a new look at your series.

Thank you!

--
Thiago

¹ https://lwn.net/Articles/794707/


* Re: [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread
  2024-04-17 15:55     ` Pedro Alves
@ 2024-04-20  5:15       ` Thiago Jung Bauermann
  0 siblings, 0 replies; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-04-20  5:15 UTC (permalink / raw)
  To: Pedro Alves; +Cc: Luis Machado, gdb-patches

Pedro Alves <pedro@palves.net> writes:

> On 2024-03-22 17:33, Luis Machado wrote:
>
>>> ---
>>>  gdb/nat/linux-osdata.c | 8 ++++----
>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
>>> index c9192940f236..172fea5cea85 100644
>>> --- a/gdb/nat/linux-osdata.c
>>> +++ b/gdb/nat/linux-osdata.c
>>> @@ -75,10 +75,10 @@ linux_common_core_of_thread (ptid_t ptid)
>>>    if (pos == std::string::npos)
>>>      return -1;
>>>
>>> -  /* If the first field after program name has index 0, then core number is
>>> -     the field with index 36 (so, the 37th).  There's no constant for that
>>> -     anywhere.  */
>>> -  for (int i = 0; i < 37; ++i)
>>> +  /* If the first field after program name has index 3, then core number is
>>> +     the field with index 39.  These are the indexes shown in the procfs(5)
>>> +     man page.  */
>>> +  for (int i = 3; i <= 39; ++i)
>>>      {
>>>        /* Find separator.  */
>>>        pos = content->find_first_of (' ', pos);
>>
>> Looks ok to me.
>>
>> See the comment about turning the numeric constants into named constants.
>
> IMHO, macros here would obfuscate more than help.

In v2, I'm sending with the macros so that it's easy to compare and
decide. I can send a v3 without macros again if needed.

--
Thiago


* Re: [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread
  2024-04-17 16:06   ` Pedro Alves
@ 2024-04-20  5:16     ` Thiago Jung Bauermann
  0 siblings, 0 replies; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-04-20  5:16 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb-patches

Pedro Alves <pedro@palves.net> writes:

> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>> The new function will be used in a subsequent patch to read a different
>> stat field.
>>
>> The new code is believed to be equivalent to the old code, so there should
>> be no change in GDB behaviour.
>>
>> Also, take the opportunity to move the function's documentation comment to
>> the header file, to conform with GDB practice.
>
> IMO, it'd be even better to move the new function to linux-proc.c, with a
> linux_proc prefix.

Done in v2.

--
Thiago


* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-04-17 16:28   ` Pedro Alves
@ 2024-04-20  5:28     ` Thiago Jung Bauermann
  0 siblings, 0 replies; 19+ messages in thread
From: Thiago Jung Bauermann @ 2024-04-20  5:28 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb-patches


Pedro Alves <pedro@palves.net> writes:

> On 2024-03-21 23:11, Thiago Jung Bauermann wrote:
>> diff --git a/gdb/nat/linux-osdata.c b/gdb/nat/linux-osdata.c
>> index c254f2e4f05b..998279377433 100644
>> --- a/gdb/nat/linux-osdata.c
>> +++ b/gdb/nat/linux-osdata.c
>> @@ -112,6 +112,28 @@ linux_common_core_of_thread (ptid_t ptid)
>>    return core;
>>  }
>>
>> +/* See linux-osdata.h.  */
>> +
>> +std::optional<ULONGEST>
>> +linux_get_starttime (ptid_t ptid)
>
> Ditto re. moving this to linux-procfs.  This has nothing to do with "info osdata".

Done in v2.

>> index a82fb08b998e..1cdc687aa9cf 100644
>> --- a/gdb/nat/linux-osdata.h
>> +++ b/gdb/nat/linux-osdata.h
>> @@ -27,4 +27,8 @@ extern int linux_common_core_of_thread (ptid_t ptid);
>>  extern LONGEST linux_common_xfer_osdata (const char *annex, gdb_byte *readbuf,
>>  					 ULONGEST offset, ULONGEST len);
>>
>> +/* Get the start time of thread PTID.  */
>> +
>> +extern std::optional<ULONGEST> linux_get_starttime (ptid_t ptid);
>> +
>>  #endif /* NAT_LINUX_OSDATA_H */
>> diff --git a/gdb/nat/linux-procfs.c b/gdb/nat/linux-procfs.c
>> index b17e3120792e..b01bf36c0b53 100644
>> --- a/gdb/nat/linux-procfs.c
>> +++ b/gdb/nat/linux-procfs.c
>> @@ -17,10 +17,13 @@
>>     along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
>>
>>  #include "gdbsupport/common-defs.h"
>> +#include "linux-osdata.h"
>>  #include "linux-procfs.h"
>
> linux-procfs.h is the main header of this source file, so it should be first.
> But then again, I think this shouldn't be consuming things from linux-osdata, it
> should be the other way around.

Right. v2 doesn't need to include "linux-osdata.h" anymore.

>>  #include "gdbsupport/filestuff.h"
>>  #include <dirent.h>
>>  #include <sys/stat.h>
>> +#include <set>
>> +#include <utility>
>>
>>  /* Return the TGID of LWPID from /proc/pid/status.  Returns -1 if not
>>     found.  */
>> @@ -290,6 +293,10 @@ linux_proc_attach_tgid_threads (pid_t pid,
>>        return;
>>      }
>>
>> +  /* Keeps track of the LWPs we have already visited in /proc,
>> +     identified by their PID and starttime to detect PID reuse.  */
>> +  std::set<std::pair<unsigned long,ULONGEST>> visited_lwps;
>
> Missing space before ULONGEST.

Ah, indeed. Fixed in v2.

> AFAICT, you don't rely on order, so this could be an unordered_set?

True. Done in v2, but I had to add a hash function for
std::pair<unsigned long, ULONGEST>. I don't know anything about
constructing hashes, so I just went with what I saw elsewhere in GDB
(index_key_hasher in dwarf2/index-write.c) and XORed the hashes of each
element of the pair.

>> +
>>    /* Scan the task list for existing threads.  While we go through the
>>       threads, new threads may be spawned.  Cycle through the list of
>>       threads until we have done two iterations without finding new
>> @@ -308,6 +315,18 @@ linux_proc_attach_tgid_threads (pid_t pid,
>>  	  if (lwp != 0)
>>  	    {
>>  	      ptid_t ptid = ptid_t (pid, lwp);
>> +	      std::optional<ULONGEST> starttime = linux_get_starttime (ptid);
>> +
>> +	      if (starttime.has_value ())
>> +		{
>> +		  std::pair<unsigned long,ULONGEST> key (lwp, *starttime);
>
> Space before ULONGEST.

Fixed in v2.

>> +
>> +		  /* If we already visited this LWP, skip it this time.  */
>> +		  if (visited_lwps.find (key) != visited_lwps.cend ())
>> +		    continue;
>> +
>> +		  visited_lwps.insert (key);
>> +		}
>>
>>  	      if (attach_lwp (ptid))
>>  		new_threads_found = 1;

Thank you for the patch reviews.

--
Thiago


* Re: [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads
  2024-04-20  5:00         ` Thiago Jung Bauermann
@ 2024-04-26 15:35           ` Pedro Alves
  0 siblings, 0 replies; 19+ messages in thread
From: Pedro Alves @ 2024-04-26 15:35 UTC (permalink / raw)
  To: Thiago Jung Bauermann; +Cc: gdb-patches

Hi!

Going back to give some closure to this subthread...

On 2024-04-20 06:00, Thiago Jung Bauermann wrote:
> 
> Pedro Alves <pedro@palves.net> writes:

>> The only way that I know of today that we can make the kernel pause all threads for
>> us, is with a group-stop.  I.e., pass a SIGSTOP to the inferior, and the kernel stops
>> _all_ threads with SIGSTOP.
>>
>> I prototyped it today, and while far from perfect (would need to handle a corner
>> scenarios like attaching and getting back something != SIGSTOP, and I still see
>> some spurious SIGSTOPs), it kind of works ...
> 
> This is amazing. Thank you for this exploration and the patch.
> 
>> ... except for what I think is a deal breaker that I don't know how to work around:
>>
>> If you attach to a process that is running on a shell, you will see that the group-stop
>> makes it go to the background.  GDB resumes the process, but it won't go back to the
>> foreground without an explicit "fg" in the shell...  Like:
>>
>>  $ ./test_program
>>  [1]+  Stopped                 ./test_program
>>  $
> 
> Couldn't this be a shell behaviour rather than a kernel one? I don't see
> anything that I could associate with this effect mentioned in the ptrace
> man page (though I could have easily missed it). If it is, then the fix
> would be in the shell too.

I don't see how a shell could behave differently.  A shell will be waiting for its
children with waitpid WUNTRACED, which makes it see the job stop and report it
to the user.  A subsequent PTRACE_CONT just leaves the child running in
the background.  Ideally we would want something like group-stop that doesn't actually
cause a job stop in the first place.

Pedro Alves



end of thread, other threads:[~2024-04-26 15:36 UTC | newest]

Thread overview: 19+ messages
2024-03-21 23:11 [RFC PATCH 0/3] Fix attaching to process when it has zombie threads Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 1/3] gdb/nat: Use procfs(5) indexes in linux_common_core_of_thread Thiago Jung Bauermann
2024-03-22 17:33   ` Luis Machado
2024-04-17 15:55     ` Pedro Alves
2024-04-20  5:15       ` Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 2/3] gdb/nat: Factor linux_find_proc_stat_field out of linux_common_core_of_thread Thiago Jung Bauermann
2024-03-22 16:12   ` Luis Machado
2024-04-17 16:06   ` Pedro Alves
2024-04-20  5:16     ` Thiago Jung Bauermann
2024-03-21 23:11 ` [RFC PATCH 3/3] gdb/nat/linux: Fix attaching to process when it has zombie threads Thiago Jung Bauermann
2024-03-22 16:19   ` Luis Machado
2024-03-22 16:52   ` Pedro Alves
2024-04-16  4:48     ` Thiago Jung Bauermann
2024-04-17 15:32       ` Pedro Alves
2024-04-20  5:00         ` Thiago Jung Bauermann
2024-04-26 15:35           ` Pedro Alves
2024-04-17 16:28   ` Pedro Alves
2024-04-20  5:28     ` Thiago Jung Bauermann
2024-03-22 10:17 ` [RFC PATCH 0/3] " Christophe Lyon
