From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 6954 invoked by alias); 8 Feb 2010 15:54:44 -0000 Received: (qmail 6932 invoked by uid 22791); 8 Feb 2010 15:54:41 -0000 X-SWARE-Spam-Status: No, hits=-2.5 required=5.0 tests=AWL,BAYES_00,SPF_PASS X-Spam-Check-By: sourceware.org Received: from mail.codesourcery.com (HELO mail.codesourcery.com) (38.113.113.100) by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Mon, 08 Feb 2010 15:54:36 +0000 Received: (qmail 6908 invoked from network); 8 Feb 2010 15:54:33 -0000 Received: from unknown (HELO orlando) (pedro@127.0.0.2) by mail.codesourcery.com with ESMTPA; 8 Feb 2010 15:54:33 -0000 From: Pedro Alves To: gdb-patches@sourceware.org Subject: Re: Fix a couple of multiexec races Date: Mon, 08 Feb 2010 15:54:00 -0000 User-Agent: KMail/1.9.10 Cc: Vladimir Prus References: <201001131811.30403.vladimir@codesourcery.com> In-Reply-To: <201001131811.30403.vladimir@codesourcery.com> MIME-Version: 1.0 Content-Type: text/plain; charset="utf-8" Content-Transfer-Encoding: 7bit Content-Disposition: inline Message-Id: <201002081554.31654.pedro@codesourcery.com> X-IsSubscribed: yes Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org X-SW-Source: 2010-02/txt/msg00210.txt.bz2 On Wednesday 13 January 2010 15:11:30, Vladimir Prus wrote: > > The attached patch fixes a couple of glitches preventing (upcoming) > '-exec-run --all' from working. > > First, consider this code inside linux_nat_wait_1: > > if (lp > && ptid_is_pid (ptid) > && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid)) > { > > if (debug_linux_nat) > fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n", > ptid_get_lwp (lp->ptid), status); > > if (WIFSTOPPED (lp->status)) > { > if (WSTOPSIG (lp->status) != SIGSTOP) > { > stop_callback (lp, NULL); > > /* Resume in order to collect the sigstop. */ > ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); > > stop_wait_callback (lp, NULL); > } > > Because lp->status is naturally not NULL when calling stop_callback, and because > stop_callback has: > > gdb_assert (lp->status == 0); > > if we ever enter the inner "if", GDB will crash. Offlist, Pedro suggested the inner if be > just removed. I thought about this some more, and I think this deserves more extensive fixing. See patch (and comments in it) below, which I've just applied. I don't like relying on the moribund breakpoints heuristic if we can, because it _can_ fail in practice. In this case, I think if we -exec-run --all enough inferiors, it might happens that we have too many events handled in the new processes (due to target_wait(PID) before this pending breakpoint is finally pulled to infrun (target_wait(-1)), possibly too late. In the progress on working on this patch, I noticed that I had done something weird to the lwp->resumed flag in non-stop mode a while ago. We certainly shouldn't be always tagging _all_ threads as resumed just because we resumed one thread in linux_nat_resume! > > Second change is to address the issue I have reported earlier. I reproduce the original > email below. > > > I have read the comment inside cancel_breakpoints_callback, but I am not sure > I fully understand it. The situation being handled is when GDB expects an > event in thread A, and thread B hits a breakpoint. For all-stop mode, this > makes sense -- after all we don't support reporting more than one event, and > when user examines event in A he might indeed delete a breakpoint. And if he > does not deletes a breakpoint, 'continue' resumes all threads by default, > so we'll hit breakpoint in B again. > > But is this reasonable for non-stop. In non-stop, we get can further stop > events while user examines event in A. Further, 'continue' resumes only > the current thread. So, if we don't report event in B, user does not have > any idea it has to be resumed. Aaaaahhh, you meant _not_ reasonable for non-stop. That changes the whole perspective compared to what I thought you were saying last time I read this. :-) Yes, I agree. > > What I observe, specifically, is when I use '-exec-run --all' (a local patch), > one inferior is not resumed. Here's the relevant bit of 'debug lin-lwp' output: > > *running,thread-id="2" > &"linux_nat_wait: [process 4659]\n" > &"LLW: waitpid 4656 received Trace/breakpoint trap (stopped)\n" > LWP 4656 got an event 00057f, leaving pending. > &"LLW: waitpid 4659 received Trace/breakpoint trap (stopped)\n" > &"LLW: Candidate event Trace/breakpoint trap (stopped) in process 4659.\n" > &"CB: Push back breakpoint for process 4656\n" I've confirmed, using your pending MI patch (which I'll review next), that `-exec-run --all' in non-stop, that this problem is now gone. > > And there's no further mention of this pid in the log; it remains stopped. The > attached patch appears to improve things. Pedro, what do you think? > Note that this patch also comments out the code that tries to stop a thread if > it got != SIGSTOP. As discussed offlist, that code fires assertion inside > stop_callback. Yeah, that bit was obviously busted. -- Pedro Alves 2010-02-08 Pedro Alves gdb/ * linux-nat.c (linux_nat_resume): In non-stop, also only tag resumed LWPs as resumed. (linux_nat_wait_1): If there's no resumed LWP in the set of LWPs we're waiting for, bail out with TARGET_WAITKIND_IGNORE, instead of throwing an internal error. If an LWP of a process we're not waiting for reports a signal, don't force collecting a SIGSTOP, and if it was breakpoint hit in non-stop mode, cancel it. Don't go through all LWPs cancelling breakpoints in non-stop mode. (resume_stopped_resumed_lwps): New. (linux_nat_wait): Use it. --- gdb/linux-nat.c | 126 ++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 105 insertions(+), 21 deletions(-) Index: src/gdb/linux-nat.c =================================================================== --- src.orig/gdb/linux-nat.c 2010-02-08 15:13:45.000000000 +0000 +++ src/gdb/linux-nat.c 2010-02-08 15:16:42.000000000 +0000 @@ -1912,14 +1912,8 @@ linux_nat_resume (struct target_ops *ops resume_many = (ptid_equal (minus_one_ptid, ptid) || ptid_is_pid (ptid)); - if (!non_stop) - { - /* Mark the lwps we're resuming as resumed. */ - iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL); - iterate_over_lwps (ptid, resume_set_callback, NULL); - } - else - iterate_over_lwps (minus_one_ptid, resume_set_callback, NULL); + /* Mark the lwps we're resuming as resumed. */ + iterate_over_lwps (ptid, resume_set_callback, NULL); /* See if it's the current inferior that should be handled specially. */ @@ -3262,8 +3256,20 @@ retry: lp = NULL; status = 0; - /* Make sure there is at least one LWP that has been resumed. */ - gdb_assert (iterate_over_lwps (ptid, resumed_callback, NULL)); + /* Make sure that of those LWPs we want to get an event from, there + is at least one LWP that has been resumed. If there's none, just + bail out. The core may just be flushing asynchronously all + events. */ + if (iterate_over_lwps (ptid, resumed_callback, NULL) == NULL) + { + ourstatus->kind = TARGET_WAITKIND_IGNORE; + + if (debug_linux_nat_async) + fprintf_unfiltered (gdb_stdlog, "LLW: exit (no resumed LWP)\n"); + + restore_child_signals_mask (&prev_mask); + return minus_one_ptid; + } /* First check if there is a LWP with a wait status pending. */ if (pid == -1) @@ -3394,6 +3400,8 @@ retry: && ptid_is_pid (ptid) && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid)) { + gdb_assert (lp->resumed); + if (debug_linux_nat) fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n", ptid_get_lwp (lp->ptid), status); @@ -3402,12 +3410,30 @@ retry: { if (WSTOPSIG (lp->status) != SIGSTOP) { - stop_callback (lp, NULL); - - /* Resume in order to collect the sigstop. */ - ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); - - stop_wait_callback (lp, NULL); + /* Cancel breakpoint hits. The breakpoint may + be removed before we fetch events from this + process to report to the core. It is best + not to assume the moribund breakpoints + heuristic always handles these cases --- it + could be too many events go through to the + core before this one is handled. All-stop + always cancels breakpoint hits in all + threads. */ + if (non_stop + && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE + && WSTOPSIG (lp->status) == SIGTRAP + && cancel_breakpoint (lp)) + { + /* Throw away the SIGTRAP. */ + lp->status = 0; + + if (debug_linux_nat) + fprintf (stderr, + "LLW: LWP %ld hit a breakpoint while waiting " + "for another process; cancelled it\n", + ptid_get_lwp (lp->ptid)); + } + lp->stopped = 1; } else { @@ -3597,12 +3623,19 @@ retry: starvation. */ if (pid == -1) select_event_lwp (ptid, &lp, &status); - } - /* Now that we've selected our final event LWP, cancel any - breakpoints in other LWPs that have hit a GDB breakpoint. See - the comment in cancel_breakpoints_callback to find out why. */ - iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp); + /* Now that we've selected our final event LWP, cancel any + breakpoints in other LWPs that have hit a GDB breakpoint. + See the comment in cancel_breakpoints_callback to find out + why. */ + iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp); + + /* In all-stop, from the core's perspective, all LWPs are now + stopped until a new resume action is sent over. */ + iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL); + } + else + lp->resumed = 0; if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP) { @@ -3628,6 +3661,47 @@ retry: return lp->ptid; } +/* Resume LWPs that are currently stopped without any pending status + to report, but are resumed from the core's perspective. */ + +static int +resume_stopped_resumed_lwps (struct lwp_info *lp, void *data) +{ + ptid_t *wait_ptid_p = data; + + if (lp->stopped + && lp->resumed + && lp->status == 0 + && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE) + { + gdb_assert (is_executing (lp->ptid)); + + /* Don't bother if there's a breakpoint at PC that we'd hit + immediately, and we're not waiting for this LWP. */ + if (!ptid_match (lp->ptid, *wait_ptid_p)) + { + struct regcache *regcache = get_thread_regcache (lp->ptid); + CORE_ADDR pc = regcache_read_pc (regcache); + + if (breakpoint_inserted_here_p (get_regcache_aspace (regcache), pc)) + return 0; + } + + if (debug_linux_nat) + fprintf_unfiltered (gdb_stdlog, + "RSRL: resuming stopped-resumed LWP %s\n", + target_pid_to_str (lp->ptid)); + + linux_ops->to_resume (linux_ops, pid_to_ptid (GET_LWP (lp->ptid)), + lp->step, TARGET_SIGNAL_0); + lp->stopped = 0; + memset (&lp->siginfo, 0, sizeof (lp->siginfo)); + lp->stopped_by_watchpoint = 0; + } + + return 0; +} + static ptid_t linux_nat_wait (struct target_ops *ops, ptid_t ptid, struct target_waitstatus *ourstatus, @@ -3642,6 +3716,16 @@ linux_nat_wait (struct target_ops *ops, if (target_can_async_p ()) async_file_flush (); + /* Resume LWPs that are currently stopped without any pending status + to report, but are resumed from the core's perspective. LWPs get + in this state if we find them stopping at a time we're not + interested in reporting the event (target_wait(specific_process), + for example, see linux_nat_wait_1), and meanwhile the event + became uninteresting. Don't bother resuming LWPs we're not going + to wait for if they'd stop immediately. */ + if (non_stop) + iterate_over_lwps (minus_one_ptid, resume_stopped_resumed_lwps, &ptid); + event_ptid = linux_nat_wait_1 (ops, ptid, ourstatus, target_options); /* If we requested any event, and something came out, assume there