* [RFA] Fix internal error in wait_lwp (interrupted system call)
@ 2005-05-12 19:18 Ulrich Weigand
2005-05-12 20:32 ` Daniel Jacobowitz
0 siblings, 1 reply; 8+ messages in thread
From: Ulrich Weigand @ 2005-05-12 19:18 UTC (permalink / raw)
To: gdb-patches
Hello,
we've had reports from our JVM/JIT development group that for them,
gdb 6.3 frequently fails with internal errors like:
linux-nat.c:1152: internal-error: wait_lwp: Assertion `pid == GET_LWP (lp->ptid)' failed.
It turned out that this happens when a SIGCHLD arrives during
execution of the waitpid call. This causes the signal handler
to be executed, and subsequently the system call returns with
errno equal to EINTR.
Now, looking through the linux-nat.c file, it would appear that this
type of problem has been addressed at various places in different
ways. In linux_handle_extended_wait, the waitpid call is wrapped
into an explicit do { } while (ret == -1 && errno == EINTR) loop.
In linux_test_for_tracefork, this very loop is abstracted into a
my_waitpid routine. In child_wait and linux_nat_wait, there are
larger loops that will handle this situation as well. Finally,
in lin_lwp_attach_lwp, SIGCHLD is actually blocked during the
execution of the waitpid call.
However, there remain some places where waitpid is called without
any such precaution, and wait_lwp is one of these. When debugging
a process making very heavy use of threads, as the JVM, this can
lead to the error shown above.
Now, as far as I can see, there is really *no* place where GDB
actually *wants* a system call to be interrupted by the SIGCHLD
signal handler. Thus, I'd propose to fix the problem at its
root by simply installing the handler with the SA_RESTART flag,
causing any interrupted system call to be automatically restarted.
The patch below does this, and fixes all problems for the JVM team.
It also passes regression testing on s390-ibm-linux and s390x-ibm-linux.
OK to commit?
Bye,
Ulrich
ChangeLog:
* linux-nat.c (_initialize_linux_nat): Install SIGCHLD handler
using the SA_RESTART flag.
Index: gdb/linux-nat.c
===================================================================
RCS file: /cvs/src/src/gdb/linux-nat.c,v
retrieving revision 1.27
diff -c -p -r1.27 linux-nat.c
*** gdb/linux-nat.c 6 Mar 2005 16:42:20 -0000 1.27
--- gdb/linux-nat.c 12 May 2005 18:50:42 -0000
*************** Specify any of the following keywords fo
*** 3095,3101 ****
action.sa_handler = sigchld_handler;
sigemptyset (&action.sa_mask);
! action.sa_flags = 0;
sigaction (SIGCHLD, &action, NULL);
/* Make sure we don't block SIGCHLD during a sigsuspend. */
--- 3095,3101 ----
action.sa_handler = sigchld_handler;
sigemptyset (&action.sa_mask);
! action.sa_flags = SA_RESTART;
sigaction (SIGCHLD, &action, NULL);
/* Make sure we don't block SIGCHLD during a sigsuspend. */
--
Dr. Ulrich Weigand
Linux on zSeries Development
Ulrich.Weigand@de.ibm.com
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-12 19:18 [RFA] Fix internal error in wait_lwp (interrupted system call) Ulrich Weigand @ 2005-05-12 20:32 ` Daniel Jacobowitz 2005-05-12 21:14 ` Ulrich Weigand 0 siblings, 1 reply; 8+ messages in thread From: Daniel Jacobowitz @ 2005-05-12 20:32 UTC (permalink / raw) To: Ulrich Weigand; +Cc: gdb-patches On Thu, May 12, 2005 at 09:06:38PM +0200, Ulrich Weigand wrote: > Hello, > > we've had reports from our JVM/JIT development group that for them, > gdb 6.3 frequently fails with internal errors like: > linux-nat.c:1152: internal-error: wait_lwp: Assertion `pid == GET_LWP (lp->ptid)' failed. > > It turned out that this happens when a SIGCHLD arrives during > execution of the waitpid call. This causes the signal handler > to be executed, and subsequently the system call returns with > errno equal to EINTR. > > Now, looking through the linux-nat.c file, it would appear that this > type of problem has been addressed at various places in different > ways. In linux_handle_extended_wait, the waitpid call is wrapped > into an explicit do { } while (ret == -1 && errno == EINTR) loop. > In linux_test_for_tracefork, this very loop is abstracted into a > my_waitpid routine. In child_wait and linux_nat_wait, there are > larger loops that will handle this situation as well. Finally, > in lin_lwp_attach_lwp, SIGCHLD is actually blocked during the > execution of the waitpid call. > > However, there remain some places where waitpid is called without > any such precaution, and wait_lwp is one of these. When debugging > a process making very heavy use of threads, as the JVM, this can > lead to the error shown above. > > Now, as far as I can see, there is really *no* place where GDB > actually *wants* a system call to be interrupted by the SIGCHLD > signal handler. Thus, I'd propose to fix the problem at its > root by simply installing the handler with the SA_RESTART flag, > causing any interrupted system call to be automatically restarted. > > The patch below does this, and fixes all problems for the JVM team. > It also passes regression testing on s390-ibm-linux and s390x-ibm-linux. > > OK to commit? On the one hand, this is very clever. On the other hand, it's not very robust. This is not the only signal that could arrive. Shouldn't wait_lwp be looping on EINTR anyway, probably by using my_waitpid (which is a recent addition)? -- Daniel Jacobowitz CodeSourcery, LLC ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-12 20:32 ` Daniel Jacobowitz @ 2005-05-12 21:14 ` Ulrich Weigand 2005-05-12 21:35 ` Daniel Jacobowitz 0 siblings, 1 reply; 8+ messages in thread From: Ulrich Weigand @ 2005-05-12 21:14 UTC (permalink / raw) To: Daniel Jacobowitz; +Cc: Ulrich Weigand, gdb-patches Daniel Jacobowitz wrote: > On the one hand, this is very clever. On the other hand, it's not very > robust. This is not the only signal that could arrive. Shouldn't > wait_lwp be looping on EINTR anyway, probably by using my_waitpid > (which is a recent addition)? Well, this isn't very robust either as waitpid isn't the only system call that could be interrupted -- it just typically has the biggest race window as it tends to sleep ... Using SA_RESTART has the advantage that it handles *all* (well, nearly all) system calls automatically, without having to add EINTR loops all over the place. If other signals are in fact a problem, too, I'd rather install them with SA_RESTART too. (However, I haven't ever seen the problem with any other signal.) Bye, Ulrich -- Dr. Ulrich Weigand Linux on zSeries Development Ulrich.Weigand@de.ibm.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-12 21:14 ` Ulrich Weigand @ 2005-05-12 21:35 ` Daniel Jacobowitz 2005-05-12 23:14 ` Andreas Schwab 2005-05-13 13:43 ` Ulrich Weigand 0 siblings, 2 replies; 8+ messages in thread From: Daniel Jacobowitz @ 2005-05-12 21:35 UTC (permalink / raw) To: Ulrich Weigand; +Cc: gdb-patches On Thu, May 12, 2005 at 11:02:55PM +0200, Ulrich Weigand wrote: > Daniel Jacobowitz wrote: > > > On the one hand, this is very clever. On the other hand, it's not very > > robust. This is not the only signal that could arrive. Shouldn't > > wait_lwp be looping on EINTR anyway, probably by using my_waitpid > > (which is a recent addition)? > > Well, this isn't very robust either as waitpid isn't the only system > call that could be interrupted -- it just typically has the biggest > race window as it tends to sleep ... > > Using SA_RESTART has the advantage that it handles *all* (well, > nearly all) system calls automatically, without having to add > EINTR loops all over the place. > > If other signals are in fact a problem, too, I'd rather install > them with SA_RESTART too. (However, I haven't ever seen the > problem with any other signal.) I think that's because GDB mostly uses signal(). That automatically has SA_RESTART behavior, right? I'm just worried that this will require non-local fixes; the SIGCHLD handler was in the same file, but arbitrary other signal handlers could be at arbitrary places in GDB. Not handling EINTR from a library call which is allowed to return EINTR is, pedantically, always a bug. Perhaps we should make both changes. -- Daniel Jacobowitz CodeSourcery, LLC ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-12 21:35 ` Daniel Jacobowitz @ 2005-05-12 23:14 ` Andreas Schwab 2005-05-13 13:43 ` Ulrich Weigand 1 sibling, 0 replies; 8+ messages in thread From: Andreas Schwab @ 2005-05-12 23:14 UTC (permalink / raw) To: Ulrich Weigand; +Cc: gdb-patches Daniel Jacobowitz <drow@false.org> writes: > I think that's because GDB mostly uses signal(). That automatically > has SA_RESTART behavior, right? Depends on whether signal() has BSD or SYSV semantics. Andreas. -- Andreas Schwab, SuSE Labs, schwab@suse.de SuSE Linux Products GmbH, MaxfeldstraÃe 5, 90409 Nürnberg, Germany Key fingerprint = 58CA 54C7 6D53 942B 1756 01D3 44D5 214B 8276 4ED5 "And now for something completely different." ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-12 21:35 ` Daniel Jacobowitz 2005-05-12 23:14 ` Andreas Schwab @ 2005-05-13 13:43 ` Ulrich Weigand 2005-05-15 17:12 ` Daniel Jacobowitz 1 sibling, 1 reply; 8+ messages in thread From: Ulrich Weigand @ 2005-05-13 13:43 UTC (permalink / raw) To: Daniel Jacobowitz; +Cc: Ulrich Weigand, gdb-patches Daniel Jacobowitz wrote: > I think that's because GDB mostly uses signal(). That automatically > has SA_RESTART behavior, right? Andreas Schwab wrote: > Depends on whether signal() has BSD or SYSV semantics. I wasn't completely certain how to verify this at the source code level, so I simply checked via 'strace': and indeed the signal() library calls performed by gdb are transformed to sigaction() system calls *with* SA_RESTART set on my Linux system. This is IMO yet another argument for using SA_RESTART with the explicit sigaction calls in linux-nat.c as well. Daniel Jacobowitz wrote: > I'm just worried that this will > require non-local fixes; the SIGCHLD handler was in the same file, but > arbitrary other signal handlers could be at arbitrary places in GDB. > Not handling EINTR from a library call which is allowed to return EINTR > is, pedantically, always a bug. > > Perhaps we should make both changes. As verifying correct EINTR handling everywhere appears to be a major effort, and the simple SA_RESTART patch both fixes a serious problem, and appears to be correct in any case, I'd argue for applying that patch now, and maybe later on adding EINTR handling as required. Bye, Ulrich -- Dr. Ulrich Weigand Linux on zSeries Development Ulrich.Weigand@de.ibm.com ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-13 13:43 ` Ulrich Weigand @ 2005-05-15 17:12 ` Daniel Jacobowitz 2005-05-16 11:21 ` Ulrich Weigand 0 siblings, 1 reply; 8+ messages in thread From: Daniel Jacobowitz @ 2005-05-15 17:12 UTC (permalink / raw) To: Ulrich Weigand; +Cc: gdb-patches On Fri, May 13, 2005 at 02:36:55PM +0200, Ulrich Weigand wrote: > As verifying correct EINTR handling everywhere appears to be a major > effort, and the simple SA_RESTART patch both fixes a serious problem, > and appears to be correct in any case, I'd argue for applying that > patch now, and maybe later on adding EINTR handling as required. I'll do both. I didn't audit every single restartable syscall, just the other uses of waitpid in this file; that will do for now. I know that most of these are not robust against EINTR and should not have to be. There was also another call to sigaction; only relevant on LinuxThreads with older kernels. Committed the below after testing on i686-pc-linux-gnu. Thank you for discovering this problem. -- Daniel Jacobowitz CodeSourcery, LLC 2005-05-15 Daniel Jacobowitz <dan@codesourcery.com> * linux-nat.c (child_follow_fork, linux_handle_extended_wait) (lin_lwp_attach_lwp, linux_nat_attach, wait_lwp, child_wait) (linux_nat_wait, kill_wait_callback): Use my_waitpid. (_initialize_linux_nat, lin_thread_get_thread_signals): Use SA_RESTART. Index: linux-nat.c =================================================================== RCS file: /cvs/src/src/gdb/linux-nat.c,v retrieving revision 1.27 diff -u -p -r1.27 linux-nat.c --- linux-nat.c 6 Mar 2005 16:42:20 -0000 1.27 +++ linux-nat.c 15 May 2005 14:11:22 -0000 @@ -377,7 +377,7 @@ child_follow_fork (int follow_child) int status; ptrace (PTRACE_CONT, parent_pid, 0, 0); - waitpid (parent_pid, &status, __WALL); + my_waitpid (parent_pid, &status, __WALL); if ((status >> 16) != PTRACE_EVENT_VFORK_DONE) warning (_("Unexpected waitpid result %06x when waiting for " "vfork-done"), status); @@ -494,10 +494,8 @@ linux_handle_extended_wait (int pid, int { /* The new child has a pending SIGSTOP. We can't affect it until it hits the SIGSTOP, but we're already attached. */ - do { - ret = waitpid (new_pid, &status, - (event == PTRACE_EVENT_CLONE) ? __WCLONE : 0); - } while (ret == -1 && errno == EINTR); + ret = my_waitpid (new_pid, &status, + (event == PTRACE_EVENT_CLONE) ? __WCLONE : 0); if (ret == -1) perror_with_name (_("waiting for new child")); else if (ret != new_pid) @@ -868,11 +866,11 @@ lin_lwp_attach_lwp (ptid_t ptid, int ver "LLAL: PTRACE_ATTACH %s, 0, 0 (OK)\n", target_pid_to_str (ptid)); - pid = waitpid (GET_LWP (ptid), &status, 0); + pid = my_waitpid (GET_LWP (ptid), &status, 0); if (pid == -1 && errno == ECHILD) { /* Try again with __WCLONE to check cloned processes. */ - pid = waitpid (GET_LWP (ptid), &status, __WCLONE); + pid = my_waitpid (GET_LWP (ptid), &status, __WCLONE); lp->cloned = 1; } @@ -920,13 +918,13 @@ linux_nat_attach (char *args, int from_t /* Make sure the initial process is stopped. The user-level threads layer might want to poke around in the inferior, and that won't work if things haven't stabilized yet. */ - pid = waitpid (GET_PID (inferior_ptid), &status, 0); + pid = my_waitpid (GET_PID (inferior_ptid), &status, 0); if (pid == -1 && errno == ECHILD) { warning (_("%s is a cloned process"), target_pid_to_str (inferior_ptid)); /* Try again with __WCLONE to check cloned processes. */ - pid = waitpid (GET_PID (inferior_ptid), &status, __WCLONE); + pid = my_waitpid (GET_PID (inferior_ptid), &status, __WCLONE); lp->cloned = 1; } @@ -1191,10 +1189,10 @@ wait_lwp (struct lwp_info *lp) gdb_assert (!lp->stopped); gdb_assert (lp->status == 0); - pid = waitpid (GET_LWP (lp->ptid), &status, 0); + pid = my_waitpid (GET_LWP (lp->ptid), &status, 0); if (pid == -1 && errno == ECHILD) { - pid = waitpid (GET_LWP (lp->ptid), &status, __WCLONE); + pid = my_waitpid (GET_LWP (lp->ptid), &status, __WCLONE); if (pid == -1 && errno == ECHILD) { /* The thread has previously exited. We need to delete it @@ -1706,10 +1704,10 @@ child_wait (ptid_t ptid, struct target_w attached process. */ set_sigio_trap (); - pid = waitpid (GET_PID (ptid), &status, 0); + pid = my_waitpid (GET_PID (ptid), &status, 0); if (pid == -1 && errno == ECHILD) /* Try again with __WCLONE to check cloned processes. */ - pid = waitpid (GET_PID (ptid), &status, __WCLONE); + pid = my_waitpid (GET_PID (ptid), &status, __WCLONE); if (debug_linux_nat) { @@ -1920,7 +1918,7 @@ retry: { pid_t lwpid; - lwpid = waitpid (pid, &status, options); + lwpid = my_waitpid (pid, &status, options); if (lwpid > 0) { gdb_assert (pid == -1 || lwpid == pid); @@ -2264,7 +2262,7 @@ kill_wait_callback (struct lwp_info *lp, { do { - pid = waitpid (GET_LWP (lp->ptid), NULL, __WCLONE); + pid = my_waitpid (GET_LWP (lp->ptid), NULL, __WCLONE); if (pid != (pid_t) -1 && debug_linux_nat) { fprintf_unfiltered (gdb_stdlog, @@ -2279,7 +2277,7 @@ kill_wait_callback (struct lwp_info *lp, do { - pid = waitpid (GET_LWP (lp->ptid), NULL, 0); + pid = my_waitpid (GET_LWP (lp->ptid), NULL, 0); if (pid != (pid_t) -1 && debug_linux_nat) { fprintf_unfiltered (gdb_stdlog, @@ -3095,7 +3093,7 @@ Specify any of the following keywords fo action.sa_handler = sigchld_handler; sigemptyset (&action.sa_mask); - action.sa_flags = 0; + action.sa_flags = SA_RESTART; sigaction (SIGCHLD, &action, NULL); /* Make sure we don't block SIGCHLD during a sigsuspend. */ @@ -3168,7 +3166,7 @@ lin_thread_get_thread_signals (sigset_t action.sa_handler = sigchld_handler; sigemptyset (&action.sa_mask); - action.sa_flags = 0; + action.sa_flags = SA_RESTART; sigaction (cancel, &action, NULL); /* We block the "cancel" signal throughout this code ... */ ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFA] Fix internal error in wait_lwp (interrupted system call) 2005-05-15 17:12 ` Daniel Jacobowitz @ 2005-05-16 11:21 ` Ulrich Weigand 0 siblings, 0 replies; 8+ messages in thread From: Ulrich Weigand @ 2005-05-16 11:21 UTC (permalink / raw) To: Daniel Jacobowitz; +Cc: gdb-patches Daniel Jacobowitz <drow@false.org> wrote on 15.05.2005 19:02:49: > I'll do both. I didn't audit every single restartable syscall, just > the other uses of waitpid in this file; that will do for now. I know > that most of these are not robust against EINTR and should not have to > be. There was also another call to sigaction; only relevant on > LinuxThreads with older kernels. > > Committed the below after testing on i686-pc-linux-gnu. Thank you for > discovering this problem. Many thanks for taking care of this! Mit freundlichen Gruessen / Best Regards Ulrich Weigand -- Dr. Ulrich Weigand Linux for S/390 Design & Development IBM Deutschland Entwicklung GmbH, Schoenaicher Str. 220, 71032 Boeblingen Phone: +49-7031/16-3727 --- Email: Ulrich.Weigand@de.ibm.com ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2005-05-16 10:21 UTC | newest] Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2005-05-12 19:18 [RFA] Fix internal error in wait_lwp (interrupted system call) Ulrich Weigand 2005-05-12 20:32 ` Daniel Jacobowitz 2005-05-12 21:14 ` Ulrich Weigand 2005-05-12 21:35 ` Daniel Jacobowitz 2005-05-12 23:14 ` Andreas Schwab 2005-05-13 13:43 ` Ulrich Weigand 2005-05-15 17:12 ` Daniel Jacobowitz 2005-05-16 11:21 ` Ulrich Weigand
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox