[RFA] Fix internal error in wait_lwp (interrupted system call)

Mirror of the gdb-patches mailing list
 help / color / mirror / Atom feed

* [RFA] Fix internal error in wait_lwp (interrupted system call)
@ 2005-05-12 19:18 Ulrich Weigand
  2005-05-12 20:32 ` Daniel Jacobowitz
  0 siblings, 1 reply; 8+ messages in thread
From: Ulrich Weigand @ 2005-05-12 19:18 UTC (permalink / raw)
  To: gdb-patches

Hello,

we've had reports from our JVM/JIT development group that for them,
gdb 6.3 frequently fails with internal errors like:
linux-nat.c:1152: internal-error: wait_lwp: Assertion `pid == GET_LWP (lp->ptid)' failed.

It turned out that this happens when a SIGCHLD arrives during
execution of the waitpid call.  This causes the signal handler
to be executed, and subsequently the system call returns with
errno equal to EINTR.

Now, looking through the linux-nat.c file, it would appear that this
type of problem has been addressed at various places in different
ways.  In linux_handle_extended_wait, the waitpid call is wrapped
into an explicit do { } while (ret == -1 && errno == EINTR) loop.
In linux_test_for_tracefork, this very loop is abstracted into a
my_waitpid routine.  In child_wait and linux_nat_wait, there are
larger loops that will handle this situation as well.  Finally,
in lin_lwp_attach_lwp, SIGCHLD is actually blocked during the
execution of the waitpid call.

However, there remain some places where waitpid is called without
any such precaution, and wait_lwp is one of these.  When debugging
a process making very heavy use of threads, as the JVM, this can
lead to the error shown above.

Now, as far as I can see, there is really *no* place where GDB
actually *wants* a system call to be interrupted by the SIGCHLD
signal handler.  Thus, I'd propose to fix the problem at its 
root by simply installing the handler with the SA_RESTART flag,
causing any interrupted system call to be automatically restarted.

The patch below does this, and fixes all problems for the JVM team.
It also passes regression testing on s390-ibm-linux and s390x-ibm-linux.

OK to commit?

Bye,
Ulrich

ChangeLog:

	* linux-nat.c (_initialize_linux_nat): Install SIGCHLD handler
	using the SA_RESTART flag.

Index: gdb/linux-nat.c
===================================================================
RCS file: /cvs/src/src/gdb/linux-nat.c,v
retrieving revision 1.27
diff -c -p -r1.27 linux-nat.c
*** gdb/linux-nat.c	6 Mar 2005 16:42:20 -0000	1.27
--- gdb/linux-nat.c	12 May 2005 18:50:42 -0000
*************** Specify any of the following keywords fo
*** 3095,3101 ****

    action.sa_handler = sigchld_handler;
    sigemptyset (&action.sa_mask);
!   action.sa_flags = 0;
    sigaction (SIGCHLD, &action, NULL);

    /* Make sure we don't block SIGCHLD during a sigsuspend.  */
--- 3095,3101 ----

    action.sa_handler = sigchld_handler;
    sigemptyset (&action.sa_mask);
!   action.sa_flags = SA_RESTART;
    sigaction (SIGCHLD, &action, NULL);

    /* Make sure we don't block SIGCHLD during a sigsuspend.  */
-- 
  Dr. Ulrich Weigand
  Linux on zSeries Development
  Ulrich.Weigand@de.ibm.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-12 19:18 [RFA] Fix internal error in wait_lwp (interrupted system call) Ulrich Weigand
@ 2005-05-12 20:32 ` Daniel Jacobowitz
  2005-05-12 21:14   ` Ulrich Weigand
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Jacobowitz @ 2005-05-12 20:32 UTC (permalink / raw)
  To: Ulrich Weigand; +Cc: gdb-patches

On Thu, May 12, 2005 at 09:06:38PM +0200, Ulrich Weigand wrote:
> Hello,
> 
> we've had reports from our JVM/JIT development group that for them,
> gdb 6.3 frequently fails with internal errors like:
> linux-nat.c:1152: internal-error: wait_lwp: Assertion `pid == GET_LWP (lp->ptid)' failed.
> 
> It turned out that this happens when a SIGCHLD arrives during
> execution of the waitpid call.  This causes the signal handler
> to be executed, and subsequently the system call returns with
> errno equal to EINTR.
> 
> Now, looking through the linux-nat.c file, it would appear that this
> type of problem has been addressed at various places in different
> ways.  In linux_handle_extended_wait, the waitpid call is wrapped
> into an explicit do { } while (ret == -1 && errno == EINTR) loop.
> In linux_test_for_tracefork, this very loop is abstracted into a
> my_waitpid routine.  In child_wait and linux_nat_wait, there are
> larger loops that will handle this situation as well.  Finally,
> in lin_lwp_attach_lwp, SIGCHLD is actually blocked during the
> execution of the waitpid call.
> 
> However, there remain some places where waitpid is called without
> any such precaution, and wait_lwp is one of these.  When debugging
> a process making very heavy use of threads, as the JVM, this can
> lead to the error shown above.
> 
> Now, as far as I can see, there is really *no* place where GDB
> actually *wants* a system call to be interrupted by the SIGCHLD
> signal handler.  Thus, I'd propose to fix the problem at its 
> root by simply installing the handler with the SA_RESTART flag,
> causing any interrupted system call to be automatically restarted.
> 
> The patch below does this, and fixes all problems for the JVM team.
> It also passes regression testing on s390-ibm-linux and s390x-ibm-linux.
> 
> OK to commit?

On the one hand, this is very clever.  On the other hand, it's not very
robust.  This is not the only signal that could arrive.  Shouldn't
wait_lwp be looping on EINTR anyway, probably by using my_waitpid
(which is a recent addition)?


-- 
Daniel Jacobowitz
CodeSourcery, LLC


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-12 20:32 ` Daniel Jacobowitz
@ 2005-05-12 21:14   ` Ulrich Weigand
  2005-05-12 21:35     ` Daniel Jacobowitz
  0 siblings, 1 reply; 8+ messages in thread
From: Ulrich Weigand @ 2005-05-12 21:14 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Ulrich Weigand, gdb-patches

Daniel Jacobowitz wrote:

> On the one hand, this is very clever.  On the other hand, it's not very
> robust.  This is not the only signal that could arrive.  Shouldn't
> wait_lwp be looping on EINTR anyway, probably by using my_waitpid
> (which is a recent addition)?

Well, this isn't very robust either as waitpid isn't the only system
call that could be interrupted -- it just typically has the biggest
race window as it tends to sleep ...

Using SA_RESTART has the advantage that it handles *all* (well,
nearly all) system calls automatically, without having to add
EINTR loops all over the place.

If other signals are in fact a problem, too, I'd rather install
them with SA_RESTART too.  (However, I haven't ever seen the
problem with any other signal.)

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  Linux on zSeries Development
  Ulrich.Weigand@de.ibm.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-12 21:14   ` Ulrich Weigand
@ 2005-05-12 21:35     ` Daniel Jacobowitz
  2005-05-12 23:14       ` Andreas Schwab
  2005-05-13 13:43       ` Ulrich Weigand
  0 siblings, 2 replies; 8+ messages in thread
From: Daniel Jacobowitz @ 2005-05-12 21:35 UTC (permalink / raw)
  To: Ulrich Weigand; +Cc: gdb-patches

On Thu, May 12, 2005 at 11:02:55PM +0200, Ulrich Weigand wrote:
> Daniel Jacobowitz wrote:
> 
> > On the one hand, this is very clever.  On the other hand, it's not very
> > robust.  This is not the only signal that could arrive.  Shouldn't
> > wait_lwp be looping on EINTR anyway, probably by using my_waitpid
> > (which is a recent addition)?
> 
> Well, this isn't very robust either as waitpid isn't the only system
> call that could be interrupted -- it just typically has the biggest
> race window as it tends to sleep ...
> 
> Using SA_RESTART has the advantage that it handles *all* (well,
> nearly all) system calls automatically, without having to add
> EINTR loops all over the place.
> 
> If other signals are in fact a problem, too, I'd rather install
> them with SA_RESTART too.  (However, I haven't ever seen the
> problem with any other signal.)

I think that's because GDB mostly uses signal().  That automatically
has SA_RESTART behavior, right?  I'm just worried that this will
require non-local fixes; the SIGCHLD handler was in the same file, but
arbitrary other signal handlers could be at arbitrary places in GDB.
Not handling EINTR from a library call which is allowed to return EINTR
is, pedantically, always a bug.

Perhaps we should make both changes.

-- 
Daniel Jacobowitz
CodeSourcery, LLC


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-12 21:35     ` Daniel Jacobowitz
@ 2005-05-12 23:14       ` Andreas Schwab
  2005-05-13 13:43       ` Ulrich Weigand
  1 sibling, 0 replies; 8+ messages in thread
From: Andreas Schwab @ 2005-05-12 23:14 UTC (permalink / raw)
  To: Ulrich Weigand; +Cc: gdb-patches

Daniel Jacobowitz <drow@false.org> writes:

> I think that's because GDB mostly uses signal().  That automatically
> has SA_RESTART behavior, right?

Depends on whether signal() has BSD or SYSV semantics.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, MaxfeldstraÃŸe 5, 90409 NÃ¼rnberg, Germany
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-12 21:35     ` Daniel Jacobowitz
  2005-05-12 23:14       ` Andreas Schwab
@ 2005-05-13 13:43       ` Ulrich Weigand
  2005-05-15 17:12         ` Daniel Jacobowitz
  1 sibling, 1 reply; 8+ messages in thread
From: Ulrich Weigand @ 2005-05-13 13:43 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Ulrich Weigand, gdb-patches

Daniel Jacobowitz wrote:
> I think that's because GDB mostly uses signal().  That automatically
> has SA_RESTART behavior, right?

Andreas Schwab wrote:
> Depends on whether signal() has BSD or SYSV semantics.

I wasn't completely certain how to verify this at the source code level,
so I simply checked via 'strace': and indeed the signal() library calls
performed by gdb are transformed to sigaction() system calls *with*
SA_RESTART set on my Linux system.

This is IMO yet another argument for using SA_RESTART with the explicit
sigaction calls in linux-nat.c as well.

Daniel Jacobowitz wrote:
> I'm just worried that this will
> require non-local fixes; the SIGCHLD handler was in the same file, but
> arbitrary other signal handlers could be at arbitrary places in GDB.
> Not handling EINTR from a library call which is allowed to return EINTR
> is, pedantically, always a bug.
> 
> Perhaps we should make both changes.

As verifying correct EINTR handling everywhere appears to be a major
effort, and the simple SA_RESTART patch both fixes a serious problem,
and appears to be correct in any case, I'd argue for applying that
patch now, and maybe later on adding EINTR handling as required.

Bye,
Ulrich

-- 
  Dr. Ulrich Weigand
  Linux on zSeries Development
  Ulrich.Weigand@de.ibm.com

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-13 13:43       ` Ulrich Weigand
@ 2005-05-15 17:12         ` Daniel Jacobowitz
  2005-05-16 11:21           ` Ulrich Weigand
  0 siblings, 1 reply; 8+ messages in thread
From: Daniel Jacobowitz @ 2005-05-15 17:12 UTC (permalink / raw)
  To: Ulrich Weigand; +Cc: gdb-patches

On Fri, May 13, 2005 at 02:36:55PM +0200, Ulrich Weigand wrote:
> As verifying correct EINTR handling everywhere appears to be a major
> effort, and the simple SA_RESTART patch both fixes a serious problem,
> and appears to be correct in any case, I'd argue for applying that
> patch now, and maybe later on adding EINTR handling as required.

I'll do both.  I didn't audit every single restartable syscall, just
the other uses of waitpid in this file; that will do for now.  I know
that most of these are not robust against EINTR and should not have to
be.  There was also another call to sigaction; only relevant on
LinuxThreads with older kernels.

Committed the below after testing on i686-pc-linux-gnu.  Thank you for
discovering this problem.

-- 
Daniel Jacobowitz
CodeSourcery, LLC

2005-05-15  Daniel Jacobowitz  <dan@codesourcery.com>

	* linux-nat.c (child_follow_fork, linux_handle_extended_wait)
	(lin_lwp_attach_lwp, linux_nat_attach, wait_lwp, child_wait)
	(linux_nat_wait, kill_wait_callback): Use my_waitpid.
	(_initialize_linux_nat, lin_thread_get_thread_signals): Use
	SA_RESTART.

Index: linux-nat.c
===================================================================
RCS file: /cvs/src/src/gdb/linux-nat.c,v
retrieving revision 1.27
diff -u -p -r1.27 linux-nat.c
--- linux-nat.c	6 Mar 2005 16:42:20 -0000	1.27
+++ linux-nat.c	15 May 2005 14:11:22 -0000
@@ -377,7 +377,7 @@ child_follow_fork (int follow_child)
 	      int status;
 
 	      ptrace (PTRACE_CONT, parent_pid, 0, 0);
-	      waitpid (parent_pid, &status, __WALL);
+	      my_waitpid (parent_pid, &status, __WALL);
 	      if ((status >> 16) != PTRACE_EVENT_VFORK_DONE)
 		warning (_("Unexpected waitpid result %06x when waiting for "
 			 "vfork-done"), status);
@@ -494,10 +494,8 @@ linux_handle_extended_wait (int pid, int
 	{
 	  /* The new child has a pending SIGSTOP.  We can't affect it until it
 	     hits the SIGSTOP, but we're already attached.  */
-	  do {
-	    ret = waitpid (new_pid, &status,
-			   (event == PTRACE_EVENT_CLONE) ? __WCLONE : 0);
-	  } while (ret == -1 && errno == EINTR);
+	  ret = my_waitpid (new_pid, &status,
+			    (event == PTRACE_EVENT_CLONE) ? __WCLONE : 0);
 	  if (ret == -1)
 	    perror_with_name (_("waiting for new child"));
 	  else if (ret != new_pid)
@@ -868,11 +866,11 @@ lin_lwp_attach_lwp (ptid_t ptid, int ver
 			    "LLAL: PTRACE_ATTACH %s, 0, 0 (OK)\n",
 			    target_pid_to_str (ptid));
 
-      pid = waitpid (GET_LWP (ptid), &status, 0);
+      pid = my_waitpid (GET_LWP (ptid), &status, 0);
       if (pid == -1 && errno == ECHILD)
 	{
 	  /* Try again with __WCLONE to check cloned processes.  */
-	  pid = waitpid (GET_LWP (ptid), &status, __WCLONE);
+	  pid = my_waitpid (GET_LWP (ptid), &status, __WCLONE);
 	  lp->cloned = 1;
 	}
 
@@ -920,13 +918,13 @@ linux_nat_attach (char *args, int from_t
   /* Make sure the initial process is stopped.  The user-level threads
      layer might want to poke around in the inferior, and that won't
      work if things haven't stabilized yet.  */
-  pid = waitpid (GET_PID (inferior_ptid), &status, 0);
+  pid = my_waitpid (GET_PID (inferior_ptid), &status, 0);
   if (pid == -1 && errno == ECHILD)
     {
       warning (_("%s is a cloned process"), target_pid_to_str (inferior_ptid));
 
       /* Try again with __WCLONE to check cloned processes.  */
-      pid = waitpid (GET_PID (inferior_ptid), &status, __WCLONE);
+      pid = my_waitpid (GET_PID (inferior_ptid), &status, __WCLONE);
       lp->cloned = 1;
     }
 
@@ -1191,10 +1189,10 @@ wait_lwp (struct lwp_info *lp)
   gdb_assert (!lp->stopped);
   gdb_assert (lp->status == 0);
 
-  pid = waitpid (GET_LWP (lp->ptid), &status, 0);
+  pid = my_waitpid (GET_LWP (lp->ptid), &status, 0);
   if (pid == -1 && errno == ECHILD)
     {
-      pid = waitpid (GET_LWP (lp->ptid), &status, __WCLONE);
+      pid = my_waitpid (GET_LWP (lp->ptid), &status, __WCLONE);
       if (pid == -1 && errno == ECHILD)
 	{
 	  /* The thread has previously exited.  We need to delete it
@@ -1706,10 +1704,10 @@ child_wait (ptid_t ptid, struct target_w
 				   attached process.  */
       set_sigio_trap ();
 
-      pid = waitpid (GET_PID (ptid), &status, 0);
+      pid = my_waitpid (GET_PID (ptid), &status, 0);
       if (pid == -1 && errno == ECHILD)
 	/* Try again with __WCLONE to check cloned processes.  */
-	pid = waitpid (GET_PID (ptid), &status, __WCLONE);
+	pid = my_waitpid (GET_PID (ptid), &status, __WCLONE);
 
       if (debug_linux_nat)
 	{
@@ -1920,7 +1918,7 @@ retry:
     {
       pid_t lwpid;
 
-      lwpid = waitpid (pid, &status, options);
+      lwpid = my_waitpid (pid, &status, options);
       if (lwpid > 0)
 	{
 	  gdb_assert (pid == -1 || lwpid == pid);
@@ -2264,7 +2262,7 @@ kill_wait_callback (struct lwp_info *lp,
     {
       do
 	{
-	  pid = waitpid (GET_LWP (lp->ptid), NULL, __WCLONE);
+	  pid = my_waitpid (GET_LWP (lp->ptid), NULL, __WCLONE);
 	  if (pid != (pid_t) -1 && debug_linux_nat)
 	    {
 	      fprintf_unfiltered (gdb_stdlog,
@@ -2279,7 +2277,7 @@ kill_wait_callback (struct lwp_info *lp,
 
   do
     {
-      pid = waitpid (GET_LWP (lp->ptid), NULL, 0);
+      pid = my_waitpid (GET_LWP (lp->ptid), NULL, 0);
       if (pid != (pid_t) -1 && debug_linux_nat)
 	{
 	  fprintf_unfiltered (gdb_stdlog,
@@ -3095,7 +3093,7 @@ Specify any of the following keywords fo
 
   action.sa_handler = sigchld_handler;
   sigemptyset (&action.sa_mask);
-  action.sa_flags = 0;
+  action.sa_flags = SA_RESTART;
   sigaction (SIGCHLD, &action, NULL);
 
   /* Make sure we don't block SIGCHLD during a sigsuspend.  */
@@ -3168,7 +3166,7 @@ lin_thread_get_thread_signals (sigset_t 
 
   action.sa_handler = sigchld_handler;
   sigemptyset (&action.sa_mask);
-  action.sa_flags = 0;
+  action.sa_flags = SA_RESTART;
   sigaction (cancel, &action, NULL);
 
   /* We block the "cancel" signal throughout this code ...  */


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
  2005-05-15 17:12         ` Daniel Jacobowitz
@ 2005-05-16 11:21           ` Ulrich Weigand
  0 siblings, 0 replies; 8+ messages in thread
From: Ulrich Weigand @ 2005-05-16 11:21 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: gdb-patches

Daniel Jacobowitz <drow@false.org> wrote on 15.05.2005 19:02:49:

> I'll do both.  I didn't audit every single restartable syscall, just
> the other uses of waitpid in this file; that will do for now.  I know
> that most of these are not robust against EINTR and should not have to
> be.  There was also another call to sigaction; only relevant on
> LinuxThreads with older kernels.
>
> Committed the below after testing on i686-pc-linux-gnu.  Thank you for
> discovering this problem.

Many thanks for taking care of this!


Mit freundlichen Gruessen / Best Regards

Ulrich Weigand

--
  Dr. Ulrich Weigand
  Linux for S/390 Design & Development
  IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen
  Phone: +49-7031/16-3727   ---   Email: Ulrich.Weigand@de.ibm.com


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2005-05-16 10:21 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-05-12 19:18 [RFA] Fix internal error in wait_lwp (interrupted system call) Ulrich Weigand
2005-05-12 20:32 ` Daniel Jacobowitz
2005-05-12 21:14   ` Ulrich Weigand
2005-05-12 21:35     ` Daniel Jacobowitz
2005-05-12 23:14       ` Andreas Schwab
2005-05-13 13:43       ` Ulrich Weigand
2005-05-15 17:12         ` Daniel Jacobowitz
2005-05-16 11:21           ` Ulrich Weigand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox