Date: Thu, 12 May 2005 20:32:00 -0000
From: Daniel Jacobowitz
To: Ulrich Weigand
Cc: gdb-patches@sources.redhat.com
Subject: Re: [RFA] Fix internal error in wait_lwp (interrupted system call)
Message-ID: <20050512191849.GA10326@nevyn.them.org>
References: <200505121906.j4CJ6cSW012897@53v30g15.boeblingen.de.ibm.com>
In-Reply-To: <200505121906.j4CJ6cSW012897@53v30g15.boeblingen.de.ibm.com>

On Thu, May 12, 2005 at 09:06:38PM +0200, Ulrich Weigand wrote:
> Hello,
>
> we've had reports from our JVM/JIT development group that for them,
> gdb 6.3 frequently fails with internal errors like:
> linux-nat.c:1152: internal-error: wait_lwp: Assertion `pid == GET_LWP (lp->ptid)' failed.
>
> It turned out that this happens when a SIGCHLD arrives during
> execution of the waitpid call.  This causes the signal handler
> to be executed, and subsequently the system call returns with
> errno equal to EINTR.
>
> Now, looking through the linux-nat.c file, it would appear that this
> type of problem has been addressed at various places in different
> ways.
> In linux_handle_extended_wait, the waitpid call is wrapped
> into an explicit do { } while (ret == -1 && errno == EINTR) loop.
> In linux_test_for_tracefork, this very loop is abstracted into a
> my_waitpid routine.  In child_wait and linux_nat_wait, there are
> larger loops that will handle this situation as well.  Finally,
> in lin_lwp_attach_lwp, SIGCHLD is actually blocked during the
> execution of the waitpid call.
>
> However, there remain some places where waitpid is called without
> any such precaution, and wait_lwp is one of these.  When debugging
> a process making very heavy use of threads, as the JVM, this can
> lead to the error shown above.
>
> Now, as far as I can see, there is really *no* place where GDB
> actually *wants* a system call to be interrupted by the SIGCHLD
> signal handler.  Thus, I'd propose to fix the problem at its
> root by simply installing the handler with the SA_RESTART flag,
> causing any interrupted system call to be automatically restarted.
>
> The patch below does this, and fixes all problems for the JVM team.
> It also passes regression testing on s390-ibm-linux and s390x-ibm-linux.
>
> OK to commit?

On the one hand, this is very clever.  On the other hand, it's not very
robust.  This is not the only signal that could arrive.  Shouldn't
wait_lwp be looping on EINTR anyway, probably by using my_waitpid (which
is a recent addition)?

--
Daniel Jacobowitz
CodeSourcery, LLC