Fix a couple of multiexec races

Mirror of the gdb-patches mailing list
 help / color / mirror / Atom feed

* Fix a couple of multiexec races
@ 2010-01-13 15:11 Vladimir Prus
  2010-01-13 15:50 ` Pedro Alves
  2010-02-08 15:54 ` Pedro Alves
  0 siblings, 2 replies; 4+ messages in thread
From: Vladimir Prus @ 2010-01-13 15:11 UTC (permalink / raw)
  To: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 2782 bytes --]

The attached patch fixes a couple of glitches preventing (upcoming)
'-exec-run --all' from working.

First, consider this code inside linux_nat_wait_1:

	  if (lp
	      && ptid_is_pid (ptid)
	      && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
	    {

	      if (debug_linux_nat)
			fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
					 ptid_get_lwp (lp->ptid), status);

              if (WIFSTOPPED (lp->status))
                {
                 if (WSTOPSIG (lp->status) != SIGSTOP)
                   {
                     stop_callback (lp, NULL);

                     /* Resume in order to collect the sigstop.  */
                     ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);

                     stop_wait_callback (lp, NULL);
                   }

Because lp->status is naturally not NULL when calling stop_callback, and because
stop_callback has:

	 gdb_assert (lp->status == 0);

if we ever enter the inner "if", GDB will crash. Offlist, Pedro suggested the inner if be
just removed.

Second change is to address the issue I have reported earlier. I reproduce the original 
email below.

I have read the comment inside cancel_breakpoints_callback, but I am not sure
I fully understand it. The situation being handled is when GDB expects an
event in thread A, and thread B hits a breakpoint. For all-stop mode, this
makes sense -- after all we don't support reporting more than one event, and
when user examines event in A he might indeed delete a breakpoint. And if he
does not deletes a breakpoint, 'continue' resumes all threads by default,
so we'll hit breakpoint in B again.

But is this reasonable for non-stop. In non-stop, we get can further stop
events while user examines event in A. Further, 'continue' resumes only
the current thread. So, if we don't report event in B, user does not have
any idea it has to be resumed.

What I observe, specifically, is when I use '-exec-run --all' (a local patch),
one inferior is not resumed. Here's the relevant bit of 'debug lin-lwp' output:

        *running,thread-id="2"
        &"linux_nat_wait: [process 4659]\n"
        &"LLW: waitpid 4656 received Trace/breakpoint trap (stopped)\n"
        LWP 4656 got an event 00057f, leaving pending.
        &"LLW: waitpid 4659 received Trace/breakpoint trap (stopped)\n"
        &"LLW: Candidate event Trace/breakpoint trap (stopped) in process 4659.\n"
        &"CB: Push back breakpoint for process 4656\n"

And there's no further mention of this pid in the log; it remains stopped. The
attached patch appears to improve things. Pedro, what do you think?

Note that this patch also comments out the code that tries to stop a thread if
it got != SIGSTOP. As discussed offlist, that code fires assertion inside
stop_callback.

- Volodya

[-- Attachment #2: races.diff --]
[-- Type: text/x-patch, Size: 1532 bytes --]

commit 12fb288b309af44de0b6709d80d172be852368ad
Author: Vladimir Prus <vladimir@codesourcery.com>
Date:   Fri Jan 8 17:07:44 2010 +0300

    Fix issues preventing '-exec-run --all' from work.

    	* linux-nat.c (linux_nat_wait_1): When getting event in undesired
    	inferior, do not try to resume it and get SIGSTOP. In non-stop, do
    	not try to cancel breakpoints.

diff --git a/gdb/linux-nat.c b/gdb/linux-nat.c
index bf0a5f1..682949d 100644
--- a/gdb/linux-nat.c
+++ b/gdb/linux-nat.c
@@ -3419,20 +3419,9 @@ retry:

 	      if (WIFSTOPPED (lp->status))
 		{
-		  if (WSTOPSIG (lp->status) != SIGSTOP)
-		    {
-		      stop_callback (lp, NULL);
-
-		      /* Resume in order to collect the sigstop.  */
-		      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
-
-		      stop_wait_callback (lp, NULL);
-		    }
-		  else
-		    {
-		      lp->stopped = 1;
-		      lp->signalled = 0;
-		    }
+		  lp->stopped = 1;
+		  if (WSTOPSIG (lp->status) == SIGSTOP)
+		    lp->signalled = 0;
 		}
 	      else if (WIFEXITED (status) || WIFSIGNALED (status))
 		{
@@ -3621,7 +3610,8 @@ retry:
   /* Now that we've selected our final event LWP, cancel any
      breakpoints in other LWPs that have hit a GDB breakpoint.  See
      the comment in cancel_breakpoints_callback to find out why.  */
-  iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+  if (!non_stop)
+    iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);

   if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
     {

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Fix a couple of multiexec races
  2010-01-13 15:11 Fix a couple of multiexec races Vladimir Prus
@ 2010-01-13 15:50 ` Pedro Alves
  2010-02-08 15:54 ` Pedro Alves
  1 sibling, 0 replies; 4+ messages in thread
From: Pedro Alves @ 2010-01-13 15:50 UTC (permalink / raw)
  To: gdb-patches; +Cc: Vladimir Prus

On Wednesday 13 January 2010 15:11:30, Vladimir Prus wrote:

> Pedro, what do you think?

I need to think a bit about this.  "I'll be back".

-- 
Pedro Alves


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Fix a couple of multiexec races
  2010-01-13 15:11 Fix a couple of multiexec races Vladimir Prus
  2010-01-13 15:50 ` Pedro Alves
@ 2010-02-08 15:54 ` Pedro Alves
  2010-02-15  9:11   ` Vladimir Prus
  1 sibling, 1 reply; 4+ messages in thread
From: Pedro Alves @ 2010-02-08 15:54 UTC (permalink / raw)
  To: gdb-patches; +Cc: Vladimir Prus

On Wednesday 13 January 2010 15:11:30, Vladimir Prus wrote:
> 
> The attached patch fixes a couple of glitches preventing (upcoming)
> '-exec-run --all' from working.
> 
> First, consider this code inside linux_nat_wait_1:
> 
> 	  if (lp
> 	      && ptid_is_pid (ptid)
> 	      && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
> 	    {
> 
> 	      if (debug_linux_nat)
> 			fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
> 					 ptid_get_lwp (lp->ptid), status);
> 
>               if (WIFSTOPPED (lp->status))
>                 {
>                  if (WSTOPSIG (lp->status) != SIGSTOP)
>                    {
>                      stop_callback (lp, NULL);
> 
>                      /* Resume in order to collect the sigstop.  */
>                      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> 
>                      stop_wait_callback (lp, NULL);
>                    }
> 
> Because lp->status is naturally not NULL when calling stop_callback, and because
> stop_callback has:
> 
> 	 gdb_assert (lp->status == 0);
> 
> if we ever enter the inner "if", GDB will crash. Offlist, Pedro suggested the inner if be
> just removed.

I thought about this some more, and I think this deserves more
extensive fixing.  See patch (and comments in it) below, which
I've just applied.  I don't like relying on the moribund breakpoints
heuristic if we can, because it _can_ fail in practice.  In this
case, I think if we -exec-run --all enough inferiors, it might
happens that we have too many events handled in the new
processes (due to target_wait(PID) before this pending breakpoint
is finally pulled to infrun (target_wait(-1)), possibly too late.

In the progress on working on this patch, I noticed that I had
done something weird to the lwp->resumed flag in non-stop
mode a while ago.  We certainly shouldn't be always
tagging _all_ threads as resumed just because we resumed one
thread in linux_nat_resume!

> 
> Second change is to address the issue I have reported earlier. I reproduce the original 
> email below.
> 
> 
> I have read the comment inside cancel_breakpoints_callback, but I am not sure
> I fully understand it. The situation being handled is when GDB expects an
> event in thread A, and thread B hits a breakpoint. For all-stop mode, this
> makes sense -- after all we don't support reporting more than one event, and
> when user examines event in A he might indeed delete a breakpoint. And if he
> does not deletes a breakpoint, 'continue' resumes all threads by default,
> so we'll hit breakpoint in B again.
> 
> But is this reasonable for non-stop. In non-stop, we get can further stop
> events while user examines event in A. Further, 'continue' resumes only
> the current thread. So, if we don't report event in B, user does not have
> any idea it has to be resumed.

Aaaaahhh, you meant _not_ reasonable for non-stop.  That changes the
whole perspective compared to what I thought you were saying last
time I read this.  :-)  Yes, I agree.

> 
> What I observe, specifically, is when I use '-exec-run --all' (a local patch),
> one inferior is not resumed. Here's the relevant bit of 'debug lin-lwp' output:
> 
>         *running,thread-id="2"
>         &"linux_nat_wait: [process 4659]\n"
>         &"LLW: waitpid 4656 received Trace/breakpoint trap (stopped)\n"
>         LWP 4656 got an event 00057f, leaving pending.
>         &"LLW: waitpid 4659 received Trace/breakpoint trap (stopped)\n"
>         &"LLW: Candidate event Trace/breakpoint trap (stopped) in process 4659.\n"
>         &"CB: Push back breakpoint for process 4656\n"

I've confirmed, using your pending MI patch (which I'll review next),
that `-exec-run --all' in non-stop, that this problem is now gone.

> 
> And there's no further mention of this pid in the log; it remains stopped. The
> attached patch appears to improve things. Pedro, what do you think?
> Note that this patch also comments out the code that tries to stop a thread if
> it got != SIGSTOP. As discussed offlist, that code fires assertion inside
> stop_callback.

Yeah, that bit was obviously busted.

-- 
Pedro Alves

2010-02-08  Pedro Alves  <pedro@codesourcery.com>

	gdb/
	* linux-nat.c (linux_nat_resume): In non-stop, also only tag
	resumed LWPs as resumed.
	(linux_nat_wait_1): If there's no resumed LWP in the set of LWPs
	we're waiting for, bail out with TARGET_WAITKIND_IGNORE, instead
	of throwing an internal error.  If an LWP of a process we're not
	waiting for reports a signal, don't force collecting a SIGSTOP,
	and if it was breakpoint hit in non-stop mode, cancel it.  Don't
	go through all LWPs cancelling breakpoints in non-stop mode.
	(resume_stopped_resumed_lwps): New.
	(linux_nat_wait): Use it.

---
 gdb/linux-nat.c |  126 ++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 105 insertions(+), 21 deletions(-)

Index: src/gdb/linux-nat.c
===================================================================
--- src.orig/gdb/linux-nat.c	2010-02-08 15:13:45.000000000 +0000
+++ src/gdb/linux-nat.c	2010-02-08 15:16:42.000000000 +0000
@@ -1912,14 +1912,8 @@ linux_nat_resume (struct target_ops *ops
   resume_many = (ptid_equal (minus_one_ptid, ptid)
 		 || ptid_is_pid (ptid));
 
-  if (!non_stop)
-    {
-      /* Mark the lwps we're resuming as resumed.  */
-      iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL);
-      iterate_over_lwps (ptid, resume_set_callback, NULL);
-    }
-  else
-    iterate_over_lwps (minus_one_ptid, resume_set_callback, NULL);
+  /* Mark the lwps we're resuming as resumed.  */
+  iterate_over_lwps (ptid, resume_set_callback, NULL);
 
   /* See if it's the current inferior that should be handled
      specially.  */
@@ -3262,8 +3256,20 @@ retry:
   lp = NULL;
   status = 0;
 
-  /* Make sure there is at least one LWP that has been resumed.  */
-  gdb_assert (iterate_over_lwps (ptid, resumed_callback, NULL));
+  /* Make sure that of those LWPs we want to get an event from, there
+     is at least one LWP that has been resumed.  If there's none, just
+     bail out.  The core may just be flushing asynchronously all
+     events.  */
+  if (iterate_over_lwps (ptid, resumed_callback, NULL) == NULL)
+    {
+      ourstatus->kind = TARGET_WAITKIND_IGNORE;
+
+      if (debug_linux_nat_async)
+	fprintf_unfiltered (gdb_stdlog, "LLW: exit (no resumed LWP)\n");
+
+      restore_child_signals_mask (&prev_mask);
+      return minus_one_ptid;
+    }
 
   /* First check if there is a LWP with a wait status pending.  */
   if (pid == -1)
@@ -3394,6 +3400,8 @@ retry:
 	      && ptid_is_pid (ptid)
 	      && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
 	    {
+	      gdb_assert (lp->resumed);
+
 	      if (debug_linux_nat)
 		fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
 			 ptid_get_lwp (lp->ptid), status);
@@ -3402,12 +3410,30 @@ retry:
 		{
 		  if (WSTOPSIG (lp->status) != SIGSTOP)
 		    {
-		      stop_callback (lp, NULL);
-
-		      /* Resume in order to collect the sigstop.  */
-		      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
-
-		      stop_wait_callback (lp, NULL);
+		      /* Cancel breakpoint hits.  The breakpoint may
+			 be removed before we fetch events from this
+			 process to report to the core.  It is best
+			 not to assume the moribund breakpoints
+			 heuristic always handles these cases --- it
+			 could be too many events go through to the
+			 core before this one is handled.  All-stop
+			 always cancels breakpoint hits in all
+			 threads.  */
+		      if (non_stop
+			  && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE
+			  && WSTOPSIG (lp->status) == SIGTRAP
+			  && cancel_breakpoint (lp))
+			{
+			  /* Throw away the SIGTRAP.  */
+			  lp->status = 0;
+
+			  if (debug_linux_nat)
+			    fprintf (stderr,
+				     "LLW: LWP %ld hit a breakpoint while waiting "
+				     "for another process; cancelled it\n",
+				     ptid_get_lwp (lp->ptid));
+			}
+		      lp->stopped = 1;
 		    }
 		  else
 		    {
@@ -3597,12 +3623,19 @@ retry:
 	 starvation.  */
       if (pid == -1)
 	select_event_lwp (ptid, &lp, &status);
-    }
 
-  /* Now that we've selected our final event LWP, cancel any
-     breakpoints in other LWPs that have hit a GDB breakpoint.  See
-     the comment in cancel_breakpoints_callback to find out why.  */
-  iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+      /* Now that we've selected our final event LWP, cancel any
+	 breakpoints in other LWPs that have hit a GDB breakpoint.
+	 See the comment in cancel_breakpoints_callback to find out
+	 why.  */
+      iterate_over_lwps (minus_one_ptid, cancel_breakpoints_callback, lp);
+
+      /* In all-stop, from the core's perspective, all LWPs are now
+	 stopped until a new resume action is sent over.  */
+      iterate_over_lwps (minus_one_ptid, resume_clear_callback, NULL);
+    }
+  else
+    lp->resumed = 0;
 
   if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
     {
@@ -3628,6 +3661,47 @@ retry:
   return lp->ptid;
 }
 
+/* Resume LWPs that are currently stopped without any pending status
+   to report, but are resumed from the core's perspective.  */
+
+static int
+resume_stopped_resumed_lwps (struct lwp_info *lp, void *data)
+{
+  ptid_t *wait_ptid_p = data;
+
+  if (lp->stopped
+      && lp->resumed
+      && lp->status == 0
+      && lp->waitstatus.kind == TARGET_WAITKIND_IGNORE)
+    {
+      gdb_assert (is_executing (lp->ptid));
+
+      /* Don't bother if there's a breakpoint at PC that we'd hit
+	 immediately, and we're not waiting for this LWP.  */
+      if (!ptid_match (lp->ptid, *wait_ptid_p))
+	{
+	  struct regcache *regcache = get_thread_regcache (lp->ptid);
+	  CORE_ADDR pc = regcache_read_pc (regcache);
+
+	  if (breakpoint_inserted_here_p (get_regcache_aspace (regcache), pc))
+	    return 0;
+	}
+
+      if (debug_linux_nat)
+	fprintf_unfiltered (gdb_stdlog,
+			    "RSRL: resuming stopped-resumed LWP %s\n",
+			    target_pid_to_str (lp->ptid));
+
+      linux_ops->to_resume (linux_ops, pid_to_ptid (GET_LWP (lp->ptid)),
+			    lp->step, TARGET_SIGNAL_0);
+      lp->stopped = 0;
+      memset (&lp->siginfo, 0, sizeof (lp->siginfo));
+      lp->stopped_by_watchpoint = 0;
+    }
+
+  return 0;
+}
+
 static ptid_t
 linux_nat_wait (struct target_ops *ops,
 		ptid_t ptid, struct target_waitstatus *ourstatus,
@@ -3642,6 +3716,16 @@ linux_nat_wait (struct target_ops *ops,
   if (target_can_async_p ())
     async_file_flush ();
 
+  /* Resume LWPs that are currently stopped without any pending status
+     to report, but are resumed from the core's perspective.  LWPs get
+     in this state if we find them stopping at a time we're not
+     interested in reporting the event (target_wait(specific_process),
+     for example, see linux_nat_wait_1), and meanwhile the event
+     became uninteresting.  Don't bother resuming LWPs we're not going
+     to wait for if they'd stop immediately.  */
+  if (non_stop)
+    iterate_over_lwps (minus_one_ptid, resume_stopped_resumed_lwps, &ptid);
+
   event_ptid = linux_nat_wait_1 (ops, ptid, ourstatus, target_options);
 
   /* If we requested any event, and something came out, assume there


^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: Fix a couple of multiexec races
  2010-02-08 15:54 ` Pedro Alves
@ 2010-02-15  9:11   ` Vladimir Prus
  0 siblings, 0 replies; 4+ messages in thread
From: Vladimir Prus @ 2010-02-15  9:11 UTC (permalink / raw)
  To: Pedro Alves, gdb-patches

On Monday 08 February 2010 18:54:31 you wrote:

> On Wednesday 13 January 2010 15:11:30, Vladimir Prus wrote:
> > 
> > The attached patch fixes a couple of glitches preventing (upcoming)
> > '-exec-run --all' from working.
> > 
> > First, consider this code inside linux_nat_wait_1:
> > 
> > 	  if (lp
> > 	      && ptid_is_pid (ptid)
> > 	      && ptid_get_pid (lp->ptid) != ptid_get_pid (ptid))
> > 	    {
> > 
> > 	      if (debug_linux_nat)
> > 			fprintf (stderr, "LWP %ld got an event %06x, leaving pending.\n",
> > 					 ptid_get_lwp (lp->ptid), status);
> > 
> >               if (WIFSTOPPED (lp->status))
> >                 {
> >                  if (WSTOPSIG (lp->status) != SIGSTOP)
> >                    {
> >                      stop_callback (lp, NULL);
> > 
> >                      /* Resume in order to collect the sigstop.  */
> >                      ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
> > 
> >                      stop_wait_callback (lp, NULL);
> >                    }
> > 
> > Because lp->status is naturally not NULL when calling stop_callback, and because
> > stop_callback has:
> > 
> > 	 gdb_assert (lp->status == 0);
> > 
> > if we ever enter the inner "if", GDB will crash. Offlist, Pedro suggested the inner if be
> > just removed.
> 
> I thought about this some more, and I think this deserves more
> extensive fixing.  See patch (and comments in it) below, which
> I've just applied. 

Thanks for taking care of this!

-- 
Vladimir Prus
CodeSourcery
vladimir@codesourcery.com
(650) 331-3385 x722


^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2010-02-15  9:11 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-01-13 15:11 Fix a couple of multiexec races Vladimir Prus
2010-01-13 15:50 ` Pedro Alves
2010-02-08 15:54 ` Pedro Alves
2010-02-15  9:11   ` Vladimir Prus

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox