From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-70051-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 6954 invoked by alias); 8 Feb 2010 15:54:44 -0000
Received: (qmail 6932 invoked by uid 22791); 8 Feb 2010 15:54:41 -0000
X-SWARE-Spam-Status: No, hits=-2.5 required=5.0 	tests=AWL,BAYES_00,SPF_PASS
X-Spam-Check-By: sourceware.org
Received: from mail.codesourcery.com (HELO mail.codesourcery.com) (38.113.113.100)     by sourceware.org (qpsmtpd/0.43rc1) with ESMTP; Mon, 08 Feb 2010 15:54:36 +0000
Received: (qmail 6908 invoked from network); 8 Feb 2010 15:54:33 -0000
Received: from unknown (HELO orlando) (pedro@127.0.0.2)   by mail.codesourcery.com with ESMTPA; 8 Feb 2010 15:54:33 -0000
From: Pedro Alves <pedro@codesourcery.com>
To: gdb-patches@sourceware.org
Subject: Re: Fix a couple of multiexec races
Date: Mon, 08 Feb 2010 15:54:00 -0000
User-Agent: KMail/1.9.10
Cc: Vladimir Prus <vladimir@codesourcery.com>
References: <201001131811.30403.vladimir@codesourcery.com>
In-Reply-To: <201001131811.30403.vladimir@codesourcery.com>
MIME-Version: 1.0
Content-Type: text/plain;   charset="utf-8"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <201002081554.31654.pedro@codesourcery.com>
X-IsSubscribed: yes
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
X-SW-Source: 2010-02/txt/msg00210.txt.bz2

On Wednesday 13 January 2010 15:11:30, Vladimir Prus wrote:
> 
> The attached patch fixes a couple of glitches preventing (upcoming)
> '-exec-run --all' from working.
> 
> First, consider this code inside linux_nat_wait_1:
> 
> 	  if (lp
> 	      && ptid_is_pid (ptid)
> 	      && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
> 	    {
> 
> 	      if (debug_linux_nat)
> 			fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
> 					 ptid_get_lwp (lp->ptid), status);
> 
>               if (WIFSTOPPED (lp->status))
>                 {
>                  if (WSTOPSIG (lp->status) != SIGSTOP)
>                    {
>                      stop_callback (lp, NULL);
> 
>                      /* Resume in order to collect the sigstop.  */
>                      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> 
>                      stop_wait_callback (lp, NULL);
>                    }
> 
> Because lp->status is naturally not NULL when calling stop_callback, and because
> stop_callback has:
> 
> 	 gdb_assert (lp->status == 0);
> 
> if we ever enter the inner "if", GDB will crash. Offlist, Pedro suggested the inner if be
> just removed.

I thought about this some more, and I think this deserves more
extensive fixing.  See patch (and comments in it) below, which
I've just applied.  I don't like relying on the moribund breakpoints
heuristic if we can, because it _can_ fail in practice.  In this
case, I think if we -exec-run --all enough inferiors, it might
happens that we have too many events handled in the new
processes (due to target_wait(PID) before this pending breakpoint
is finally pulled to infrun (target_wait(-1)), possibly too late.

In the progress on working on this patch, I noticed that I had
done something weird to the lwp->resumed flag in non-stop
mode a while ago.  We certainly shouldn't be always
tagging _all_ threads as resumed just because we resumed one
thread in linux_nat_resume!

> 
> Second change is to address the issue I have reported earlier. I reproduce the original 
> email below.
> 
> 
> I have read the comment inside cancel_breakpoints_callback, but I am not sure
> I fully understand it. The situation being handled is when GDB expects an
> event in thread A, and thread B hits a breakpoint. For all-stop mode, this
> makes sense -- after all we don't support reporting more than one event, and
> when user examines event in A he might indeed delete a breakpoint. And if he
> does not deletes a breakpoint, 'continue' resumes all threads by default,
> so we'll hit breakpoint in B again.
> 
> But is this reasonable for non-stop. In non-stop, we get can further stop
> events while user examines event in A. Further, 'continue' resumes only
> the current thread. So, if we don't report event in B, user does not have
> any idea it has to be resumed.

Aaaaahhh, you meant _not_ reasonable for non-stop.  That changes the
whole perspective compared to what I thought you were saying last
time I read this.  :-)  Yes, I agree.

> 
> What I observe, specifically, is when I use '-exec-run --all' (a local patch),
> one inferior is not resumed. Here's the relevant bit of 'debug lin-lwp' output:
> 
>         *running,thread-id="2"
>         &"linux_nat_wait: [process 4659]\n"
>         &"LLW: waitpid 4656 received Trace/breakpoint trap (stopped)\n"
>         LWP 4656 got an event 00057f, leaving pending.
>         &"LLW: waitpid 4659 received Trace/breakpoint trap (stopped)\n"
>         &"LLW: Candidate event Trace/breakpoint trap (stopped) in process 4659.\n"
>         &"CB: Push back breakpoint for process 4656\n"

I've confirmed, using your pending MI patch (which I'll review next),
that `-exec-run --all' in non-stop, that this problem is now gone.

> 
> And there's no further mention of this pid in the log; it remains stopped. The
> attached patch appears to improve things. Pedro, what do you think?
> Note that this patch also comments out the code that tries to stop a thread if
> it got != SIGSTOP. As discussed offlist, that code fires assertion inside
> stop_callback.

Yeah, that bit was obviously busted.

-- 
Pedro Alves

2010-02-08  Pedro Alves  <pedro@codesourcery.com>

	gdb/
	* linux-nat.c (linux_nat_resume): In non-stop, also only tag
	resumed LWPs as resumed.
	(linux_nat_wait_1): If there's no resumed LWP in the set of LWPs
	we're waiting for, bail out with TARGET_WAITKIND_IGNORE, instead
	of throwing an internal error.  If an LWP of a process we're not
	waiting for reports a signal, don't force collecting a SIGSTOP,
	and if it was breakpoint hit in non-stop mode, cancel it.  Don't
	go through all LWPs cancelling breakpoints in non-stop mode.
	(resume_stopped_resumed_lwps): New.
	(linux_nat_wait): Use it.

---
 gdb/linux-nat.c |  126 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 105 insertions(+), 21 deletions(-)

Index: src/gdb/linux-nat.c
===================================================================
--- src.orig/gdb/linux-nat.c	2010-02-08 15:13:45.000000000 +0000
+++ src/gdb/linux-nat.c	2010-02-08 15:16:42.000000000 +0000
@@ -1912,14 +1912,8 @@ linux_nat_resume (struct target_ops *ops
   resume_many = (ptid_equal (minus_one_ptid, ptid)
 		 || ptid_is_pid (ptid));
 
-  if (!non_stop)
-    {
-      /* Mark the lwps we're resuming as resumed.  */
-      iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL);
-      iterate_over_lwps (ptid, resume_set_callback, NULL);
-    }
-  else
-    iterate_over_lwps (minus_one_ptid, resume_set_callback, NULL);
+  /* Mark the lwps we're resuming as resumed.  */
+  iterate_over_lwps (ptid, resume_set_callback, NULL);
 
   /* See if it's the current inferior that should be handled
      specially.  */
@@ -3262,8 +3256,20 @@ retry:
   lp = NULL;
   status = 0;
 
-  /* Make sure there is at least one LWP that has been resumed.  */
-  gdb_assert (iterate_over_lwps (ptid, resumed_callback, NULL));
+  /* Make sure that of those LWPs we want to get an event from, there
+     is at least one LWP that has been resumed.  If there's none, just
+     bail out.  The core may just be flushing asynchronously all
+     events.  */
+  if (iterate_over_lwps (ptid, resumed_callback, NULL) == NULL)
+    {
+      ourstatus->kind = TARGET_WAITKIND_IGNORE;
+
+      if (debug_linux_nat_async)
+	fprintf_unfiltered (gdb_stdlog, "LLW: exit (no resumed LWP)\n");
+
+      restore_child_signals_mask (&prev_mask);
+      return minus_one_ptid;
+    }
 
   /* First check if there is a LWP with a wait status pending.  */
   if (pid == -1)
@@ -3394,6 +3400,8 @@ retry:
 	      && ptid_is_pid (ptid)
 	      && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
 	    {
+	      gdb_assert (lp->resumed);
+
 	      if (debug_linux_nat)
 		fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
 			 ptid_get_lwp (lp->ptid), status);
@@ -3402,12 +3410,30 @@ retry:
 		{
 		  if (WSTOPSIG (lp->status) != SIGSTOP)
 		    {
-		      stop_callback (lp, NULL);
-
-		      /* Resume in order to collect the sigstop.  */
-		      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
-
-		      stop_wait_callback (lp, NULL);
+		      /* Cancel breakpoint hits.  The breakpoint may
+			 be removed before we fetch events from this
+			 process to report to the core.  It is best
+			 not to assume the moribund breakpoints
+			 heuristic always handles these cases --- it
+			 could be too many events go through to the
+			 core before this one is handled.  All-stop
+			 always cancels breakpoint hits in all
+			 threads.  */
+		      if (non_stop
+			  && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE
+			  && WSTOPSIG (lp->status) == SIGTRAP
+			  && cancel_breakpoint (lp))
+			{
+			  /* Throw away the SIGTRAP.  */
+			  lp->status = 0;
+
+			  if (debug_linux_nat)
+			    fprintf (stderr,
+				     "LLW: LWP %ld hit a breakpoint while waiting "
+				     "for another process; cancelled it\n",
+				     ptid_get_lwp (lp->ptid));
+			}
+		      lp->stopped = 1;
 		    }
 		  else
 		    {
@@ -3597,12 +3623,19 @@ retry:
 	 starvation.  */
       if (pid == -1)
 	select_event_lwp (ptid, &lp, &status);
-    }
 
-  /* Now that we've selected our final event LWP, cancel any
-     breakpoints in other LWPs that have hit a GDB breakpoint.  See
-     the comment in cancel_breakpoints_callback to find out why.  */
-  iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+      /* Now that we've selected our final event LWP, cancel any
+	 breakpoints in other LWPs that have hit a GDB breakpoint.
+	 See the comment in cancel_breakpoints_callback to find out
+	 why.  */
+      iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+
+      /* In all-stop, from the core's perspective, all LWPs are now
+	 stopped until a new resume action is sent over.  */
+      iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL);
+    }
+  else
+    lp->resumed = 0;
 
   if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
     {
@@ -3628,6 +3661,47 @@ retry:
   return lp->ptid;
 }
 
+/* Resume LWPs that are currently stopped without any pending status
+   to report, but are resumed from the core's perspective.  */
+
+static int
+resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
+{
+  ptid_t *wait_ptid_p = data;
+
+  if (lp->stopped
+      && lp->resumed
+      && lp->status == 0
+      && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE)
+    {
+      gdb_assert (is_executing (lp->ptid));
+
+      /* Don't bother if there's a breakpoint at PC that we'd hit
+	 immediately, and we're not waiting for this LWP.  */
+      if (!ptid_match (lp->ptid, *wait_ptid_p))
+	{
+	  struct regcache *regcache = get_thread_regcache (lp->ptid);
+	  CORE_ADDR pc = regcache_read_pc (regcache);
+
+	  if (breakpoint_inserted_here_p (get_regcache_aspace (regcache), pc))
+	    return 0;
+	}
+
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: resuming stopped-resumed LWP %s\n",
+			    target_pid_to_str (lp->ptid));
+
+      linux_ops->to_resume (linux_ops, pid_to_ptid (GET_LWP (lp->ptid)),
+			    lp->step, TARGET_SIGNAL_0);
+      lp->stopped = 0;
+      memset (&lp->siginfo, 0, sizeof (lp->siginfo));
+      lp->stopped_by_watchpoint = 0;
+    }
+
+  return 0;
+}
+
 static ptid_t
 linux_nat_wait (struct target_ops *ops,
 		ptid_t ptid, struct target_waitstatus *ourstatus,
@@ -3642,6 +3716,16 @@ linux_nat_wait (struct target_ops *ops,
   if (target_can_async_p ())
     async_file_flush ();
 
+  /* Resume LWPs that are currently stopped without any pending status
+     to report, but are resumed from the core's perspective.  LWPs get
+     in this state if we find them stopping at a time we're not
+     interested in reporting the event (target_wait(specific_process),
+     for example, see linux_nat_wait_1), and meanwhile the event
+     became uninteresting.  Don't bother resuming LWPs we're not going
+     to wait for if they'd stop immediately.  */
+  if (non_stop)
+    iterate_over_lwps (minus_one_ptid, resume_stopped_resumed_lwps, &ptid);
+
   event_ptid = linux_nat_wait_1 (ops, ptid, ourstatus, target_options);
 
   /* If we requested any event, and something came out, assume there