[RFC]: Ugly thread step situation

Mirror of the gdb-patches mailing list
 help / color / mirror / Atom feed

* [RFC]: Ugly thread step situation
@ 2004-09-14 21:44 jjohnstn
  2004-09-14 22:42 ` Andrew Cagney
  2004-09-14 23:25 ` Daniel Jacobowitz
  0 siblings, 2 replies; 7+ messages in thread
From: jjohnstn @ 2004-09-14 21:44 UTC (permalink / raw)
  To: gdb-patches

[-- Attachment #1: Type: TEXT/PLAIN, Size: 1713 bytes --]

I recently tracked down a problem with gdb on RHEL3 Linux regarding 
stepping threads.  What happens is that in some instances, lin-lwp.c is 
asked to step the thread of interest.  We then wait on all threads.  Due 
to some form of race condition, the wait does not get back the trap from 
the stepped thread.  If we have a number of waiting events (e.g. thread 
create events, other breakpoints), lin-lwp picks one of them.

Now it gets interesting.  Infrun.c thinks the current thread is being 
stepped and isn't ready for a breakpoint coming back.  On x86, it makes a 
miscalculation of the pc value (for a breakpoint it should back up 1, for 
a step it doesn't have to).  We end up pointing at an invalid pc (we 
didn't back up 1) and everything falls apart from there.

To fix this quickly, I added the accompanying patch to lin-lwp.c.  What it 
does is ensure that we wait on any currently stepping lwp.  In truth, this 
isn't as bad as it sounds.  The lin-lwp code later on is set up to pick 
the stepping lwp over all other events.  This just keeps the scenario 
above from occurring.

Obviously, this doesn't solve everything.  Perhaps the decrement of the pc 
needs to be done once we have established whether the thread has changed 
underneath us.  We also could use a hook to run the lwp list and find out 
if the current lwp was stepping or encountered a breakpoint.

Anyway, if the consensus is that the patch is helpful in the short-term, I 
am more than happy to check it in.

-- Jeff J.

2004-09-14  Jeff Johnston  <jjohnstn@redhat.com>

	* lin-lwp.c (find_singlestep_lwp_callback): New static function.
	(lin_lwp_wait): Change code to specifically wait on any LWP
	that is currently stepping.

[-- Attachment #2: Type: TEXT/PLAIN, Size: 1848 bytes --]

--- gdb+dejagnu-20040607/gdb/lin-lwp.c.fix	Thu Sep  2 11:20:07 2004
+++ gdb+dejagnu-20040607/gdb/lin-lwp.c	Thu Sep  2 11:27:32 2004
@@ -1004,7 +1004,18 @@ count_events_callback (struct lwp_info *
   return 0;
 }

-/* Select the LWP (if any) that is currently being single-stepped.  */
+/* Find an LWP (if any) that is currently being single-stepped.  */
+
+static int
+find_singlestep_lwp_callback (struct lwp_info *lp, void *data)
+{
+  if (lp->step)
+    return 1;
+  else
+    return 0;
+}
+
+/* Select the LWP with an event (if any) that is currently being single-stepped.  */

 static int
 select_singlestep_lwp_callback (struct lwp_info *lp, void *data)
@@ -1289,7 +1300,24 @@ retry:
      least if there are any LWPs at all.  */
   gdb_assert (num_lwps == 0 || iterate_over_lwps (resumed_callback, NULL));

-  /* First check if there is a LWP with a wait status pending.  */
+  /* Check if there is any LWP that is being single-stepped.  We need to
+     wait specifically on such an LWP because the higher-level code is
+     expecting a step operation to find an event on the stepped LWP.
+     It is possible for other events to occur before the step operation
+     gets the expected trap so we don't want to wait on any LWP.
+     This has ramifications when adjustment of the PC is required which can be 
+     different after a breakpoint vs a step (e.g. x86).  */
+  lp = iterate_over_lwps (find_singlestep_lwp_callback, NULL);
+  if (lp) {
+    if (debug_lin_lwp)
+      fprintf_unfiltered (gdb_stdlog,
+ 			  "LLW: Found step lwp %s.\n",
+			  target_pid_to_str (lp->ptid));
+    ptid = lp->ptid;
+    pid = PIDGET (ptid);
+  }
+
+  /* If any pid, check if there is a LWP with a wait status pending.  */
   if (pid == -1)
     {
       /* Any LWP that's been resumed will do.  */

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Ugly thread step situation
  2004-09-14 21:44 [RFC]: Ugly thread step situation jjohnstn
@ 2004-09-14 22:42 ` Andrew Cagney
  2004-09-14 23:25 ` Daniel Jacobowitz
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Cagney @ 2004-09-14 22:42 UTC (permalink / raw)
  To: jjohnstn; +Cc: gdb-patches

> Anyway, if the consensus is that the patch is helpful in the short-term, I 
> am more than happy to check it in.

What's involved in adding some sort of testcase?

If this were committed, the testcase will act as our insturance ensuring 
that the s/short/long/ term fix, or even an unrelated change, do not 
re-inroduce this bug.  Even without this, the testcase acts as a 
red-flag identifying a breakage that any long term fix needs to handle.

Andrew

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Ugly thread step situation
  2004-09-14 21:44 [RFC]: Ugly thread step situation jjohnstn
  2004-09-14 22:42 ` Andrew Cagney
@ 2004-09-14 23:25 ` Daniel Jacobowitz
  2004-09-15 17:36   ` Jeff Johnston
  1 sibling, 1 reply; 7+ messages in thread
From: Daniel Jacobowitz @ 2004-09-14 23:25 UTC (permalink / raw)
  To: jjohnstn; +Cc: gdb-patches

On Tue, Sep 14, 2004 at 05:44:47PM -0400, jjohnstn wrote:
> I recently tracked down a problem with gdb on RHEL3 Linux regarding 
> stepping threads.  What happens is that in some instances, lin-lwp.c is 
> asked to step the thread of interest.  We then wait on all threads.  Due 
> to some form of race condition, the wait does not get back the trap from 
> the stepped thread.  If we have a number of waiting events (e.g. thread 
> create events, other breakpoints), lin-lwp picks one of them.

Could you explain this bit a little more?  What comes back instead for
the thread that was stepping?  Do we stop it with a SIGSTOP?

Is there a testcase?

> Now it gets interesting.  Infrun.c thinks the current thread is being 
> stepped and isn't ready for a breakpoint coming back.  On x86, it makes a 
> miscalculation of the pc value (for a breakpoint it should back up 1, for 
> a step it doesn't have to).  We end up pointing at an invalid pc (we 
> didn't back up 1) and everything falls apart from there.
> 
> To fix this quickly, I added the accompanying patch to lin-lwp.c.  What it 
> does is ensure that we wait on any currently stepping lwp.  In truth, this 
> isn't as bad as it sounds.  The lin-lwp code later on is set up to pick 
> the stepping lwp over all other events.  This just keeps the scenario 
> above from occurring.
> 
> Obviously, this doesn't solve everything.  Perhaps the decrement of the pc 
> needs to be done once we have established whether the thread has changed 
> underneath us.  We also could use a hook to run the lwp list and find out 
> if the current lwp was stepping or encountered a breakpoint.
> 
> Anyway, if the consensus is that the patch is helpful in the short-term, I 
> am more than happy to check it in.
> 
> -- Jeff J.
> 
> 2004-09-14  Jeff Johnston  <jjohnstn@redhat.com>
> 
> 	* lin-lwp.c (find_singlestep_lwp_callback): New static function.
> 	(lin_lwp_wait): Change code to specifically wait on any LWP
> 	that is currently stepping.

This sounds sort of like a problem I debugged on MIPS and hppa, but
never managed to reproduce.  I had tabled the patch until I had more
time to look at it - always a mistake.

The same patch may help here.  Could you tell me what resume_ptid is
before the call to target_resume, in resume?  The call in which we
request the single-step, I mean.


-- 
Daniel Jacobowitz


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Ugly thread step situation
  2004-09-14 23:25 ` Daniel Jacobowitz
@ 2004-09-15 17:36   ` Jeff Johnston
  2004-09-15 17:53     ` Andrew Cagney
  2004-09-15 18:21     ` Daniel Jacobowitz
  0 siblings, 2 replies; 7+ messages in thread
From: Jeff Johnston @ 2004-09-15 17:36 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: gdb-patches

[-- Attachment #1: Type: text/plain, Size: 5909 bytes --]

Daniel Jacobowitz wrote:
> On Tue, Sep 14, 2004 at 05:44:47PM -0400, jjohnstn wrote:
> 
>>I recently tracked down a problem with gdb on RHEL3 Linux regarding 
>>stepping threads.  What happens is that in some instances, lin-lwp.c is 
>>asked to step the thread of interest.  We then wait on all threads.  Due 
>>to some form of race condition, the wait does not get back the trap from 
>>the stepped thread.  If we have a number of waiting events (e.g. thread 
>>create events, other breakpoints), lin-lwp picks one of them.
> 
> 
> Could you explain this bit a little more?  What comes back instead for
> the thread that was stepping?  Do we stop it with a SIGSTOP?
> 
> Is there a testcase?
> 

Attached.  This was the test-case given for Red Hat Bugzilla bug 130896.

Basically, you break at a thread function and attempt to do a mix of nexts and 
continues.  It doesn't seem to occur deterministically but it will eventually occur.

The following is the excerpt of the lin-lwp trace that I put in the Bugzilla 
bug.  It shows where the error occurs.  We get an event on another thread, stop 
the stepping thread and see it stopped.  At that point we have a choice of 
breakpoints to choose from.

The program needs to be run with an argument of 4

set debug lin-lwp 1
b synchronize
run 4
next
...

Breakpoint 1, synchronize (tid=3074096048) at gdbtest.C:18
18	    pthread_mutex_lock(&mutex);
(gdb) n
LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread)
LLW: waitpid 10066 received Trace/breakpoint trap (stopped)
LLTA: PTRACE_PEEKUSER LWP 10066, 0, 0 (OK)
LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10066.
SEL: Select single-step LWP 10066
LLW: trap_ptid is LWP 10066.
RC:  PTRACE_CONT LWP 10068, 0, 0 (resume sibling)
RC:  PTRACE_CONT LWP 10067, 0, 0 (resume sibling)
RC:  PTRACE_CONT LWP 10062, 0, 0 (resume sibling)
LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread)
LLW: waitpid 10067 received Trace/breakpoint trap (stopped)
LLTA: PTRACE_PEEKUSER LWP 10067, 0, 0 (OK)
LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10067.
SC:  kill LWP 10068 **<SIGSTOP>**
SC:  lwp kill 0 ERRNO-OK
SC:  kill LWP 10066 **<SIGSTOP>**
SC:  lwp kill 0 ERRNO-OK
SC:  kill LWP 10062 **<SIGSTOP>**
SC:  lwp kill 0 ERRNO-OK
WL: waitpid LWP 10068 received Stopped (signal) (stopped)
WL: waitpid LWP 10066 received Trace/breakpoint trap (stopped)
PTRACE_CONT LWP 10066, 0, 0 (OK)
SWC: Candidate SIGTRAP event in LWP 10066
WL: waitpid LWP 10066 received Stopped (signal) (stopped)
WL: waitpid LWP 10062 received Trace/breakpoint trap (stopped)
PTRACE_CONT LWP 10062, 0, 0 (OK)
SWC: Candidate SIGTRAP event in LWP 10062
WL: waitpid LWP 10062 received Stopped (signal) (stopped)
FC: LP has pending status 00057f
FC: LP has pending status 00057f
SEL: Select single-step LWP 10066
CBC: Push back breakpoint for LWP 10067
CBC: Push back breakpoint for LWP 10062
LLW: trap_ptid is LWP 10066.
RC:  PTRACE_CONT LWP 10068, 0, 0 (resume sibling)
RC:  PTRACE_CONT LWP 10067, 0, 0 (resume sibling)
RC:  PTRACE_CONT LWP 10062, 0, 0 (resume sibling)
LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread)
LLW: waitpid 10062 received Trace/breakpoint trap (stopped)
LLTA: PTRACE_PEEKUSER LWP 10062, 0, 0 (OK)
LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10062.
SC:  kill LWP 10068 **<SIGSTOP>**
SC:  lwp kill 0 ERRNO-OK
SC:  kill LWP 10067 **<SIGSTOP>**
SC:  lwp kill 0 ERRNO-OK
SC:  kill LWP 10066 **<SIGSTOP>**
SC:  lwp kill 0 ERRNO-OK
WL: waitpid LWP 10068 received Stopped (signal) (stopped)
WL: waitpid LWP 10067 received Trace/breakpoint trap (stopped)
PTRACE_CONT LWP 10067, 0, 0 (OK)
SWC: Candidate SIGTRAP event in LWP 10067
WL: waitpid LWP 10067 received Stopped (signal) (stopped)
WL: waitpid LWP 10066 received Stopped (signal) (stopped)
FC: LP has pending status 00057f
SEL: Found 2 SIGTRAP events, selecting #0  <=== should not happen
CBC: Push back breakpoint for LWP 10062
LLW: trap_ptid is LWP 10067.

Program received signal SIGTRAP, Trace/breakpoint trap.
[Switching to Thread -1231361104 (LWP 10067)]
0x080489cb in synchronize (tid=3063606192) at gdbtest.C:18
18	    pthread_mutex_lock(&mutex);

> 
>>Now it gets interesting.  Infrun.c thinks the current thread is being 
>>stepped and isn't ready for a breakpoint coming back.  On x86, it makes a 
>>miscalculation of the pc value (for a breakpoint it should back up 1, for 
>>a step it doesn't have to).  We end up pointing at an invalid pc (we 
>>didn't back up 1) and everything falls apart from there.
>>
>>To fix this quickly, I added the accompanying patch to lin-lwp.c.  What it 
>>does is ensure that we wait on any currently stepping lwp.  In truth, this 
>>isn't as bad as it sounds.  The lin-lwp code later on is set up to pick 
>>the stepping lwp over all other events.  This just keeps the scenario 
>>above from occurring.
>>
>>Obviously, this doesn't solve everything.  Perhaps the decrement of the pc 
>>needs to be done once we have established whether the thread has changed 
>>underneath us.  We also could use a hook to run the lwp list and find out 
>>if the current lwp was stepping or encountered a breakpoint.
>>
>>Anyway, if the consensus is that the patch is helpful in the short-term, I 
>>am more than happy to check it in.
>>
>>-- Jeff J.
>>
>>2004-09-14  Jeff Johnston  <jjohnstn@redhat.com>
>>
>>	* lin-lwp.c (find_singlestep_lwp_callback): New static function.
>>	(lin_lwp_wait): Change code to specifically wait on any LWP
>>	that is currently stepping.
> 
> 
> This sounds sort of like a problem I debugged on MIPS and hppa, but
> never managed to reproduce.  I had tabled the patch until I had more
> time to look at it - always a mistake.
> 
> The same patch may help here.  Could you tell me what resume_ptid is
> before the call to target_resume, in resume?  The call in which we
> request the single-step, I mean.
>

I'll have to look into it.

-- Jeff J.

> 

[-- Attachment #2: gdbtest.C --]
[-- Type: text/plain, Size: 1322 bytes --]

#ifdef __sun
#include <stream.h>
#else
#include <iostream>
using namespace std;
#endif

#include <pthread.h>
#include <stdlib.h>
#include <sys/time.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;

// getitimer, setitimer
void synchronize(pthread_t tid)
{
    pthread_mutex_lock(&mutex);
    cout << "In synchronize() for thread " << tid << endl;
    pthread_mutex_unlock(&mutex);
    return;
}

void pthreadMain()
{
    try
    {
	while ( 1 )
	{
	    pthread_t tid = pthread_self();
 	    //cout << "Thread " << tid << " about to call synchronize()..." << endl;
	    synchronize(tid);
 	    //cout << "Thread " << tid << " done..." << endl;
	}
    }
    catch ( ... )
    {
	cout << "--> EXCEPTION ???" << endl;
    }
}

int main(int argc, char **argv)
{
    int numThreads;

    if ( argc == 2 )
    {
	numThreads = atoi(argv[1]);
    }
    else
    {
	cout << "Usage: server <num threads>" << '\n';
	return(1);
    }

    pthread_t *tid = new pthread_t[numThreads];
    int idx;

    for ( idx = 1; idx < numThreads; idx++ )
    {
	if ( pthread_create(tid+idx, NULL,(void*(*)(void*))pthreadMain, NULL) != 0 )
	    cout << "--> FAILED TO CREATE THREAD!!!" << endl;
    }

    for ( idx = 1; idx < numThreads; idx++ )
	pthread_join(tid[idx], NULL);

    cout << "--> EXECUTION COMPLETED" << endl;
    return(0);
}

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Ugly thread step situation
  2004-09-15 17:36   ` Jeff Johnston
@ 2004-09-15 17:53     ` Andrew Cagney
  2004-09-15 17:59       ` Jeff Johnston
  2004-09-15 18:21     ` Daniel Jacobowitz
  1 sibling, 1 reply; 7+ messages in thread
From: Andrew Cagney @ 2004-09-15 17:53 UTC (permalink / raw)
  To: Jeff Johnston; +Cc: Daniel Jacobowitz, gdb-patches

> Attached.  This was the test-case given for Red Hat Bugzilla bug 130896.

The FSF can't accept this, or integrate it into the testsuite/

Andrew



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Ugly thread step situation
  2004-09-15 17:53     ` Andrew Cagney
@ 2004-09-15 17:59       ` Jeff Johnston
  0 siblings, 0 replies; 7+ messages in thread
From: Jeff Johnston @ 2004-09-15 17:59 UTC (permalink / raw)
  To: Andrew Cagney; +Cc: Daniel Jacobowitz, gdb-patches

Andrew Cagney wrote:
>> Attached.  This was the test-case given for Red Hat Bugzilla bug 130896.
> 
> 
> The FSF can't accept this, or integrate it into the testsuite/
> 
> Andrew
>

Yes, I should have stated that.  It is just to allow Daniel (and others) to 
recreate the problem.

-- Jeff J.

> 
> 


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC]: Ugly thread step situation
  2004-09-15 17:36   ` Jeff Johnston
  2004-09-15 17:53     ` Andrew Cagney
@ 2004-09-15 18:21     ` Daniel Jacobowitz
  1 sibling, 0 replies; 7+ messages in thread
From: Daniel Jacobowitz @ 2004-09-15 18:21 UTC (permalink / raw)
  To: Jeff Johnston; +Cc: gdb-patches

On Wed, Sep 15, 2004 at 01:36:13PM -0400, Jeff Johnston wrote:
> Daniel Jacobowitz wrote:
> >On Tue, Sep 14, 2004 at 05:44:47PM -0400, jjohnstn wrote:
> >
> >>I recently tracked down a problem with gdb on RHEL3 Linux regarding 
> >>stepping threads.  What happens is that in some instances, lin-lwp.c is 
> >>asked to step the thread of interest.  We then wait on all threads.  Due 
> >>to some form of race condition, the wait does not get back the trap from 
> >>the stepped thread.  If we have a number of waiting events (e.g. thread 
> >>create events, other breakpoints), lin-lwp picks one of them.
> >
> >
> >Could you explain this bit a little more?  What comes back instead for
> >the thread that was stepping?  Do we stop it with a SIGSTOP?
> >
> >Is there a testcase?
> >
> 
> Attached.  This was the test-case given for Red Hat Bugzilla bug 130896.

Given Andrew's caveat, I will just look at the trace instead.  I see
that bug isn't publicly visible.

> LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread)
> LLW: waitpid 10062 received Trace/breakpoint trap (stopped)
> LLTA: PTRACE_PEEKUSER LWP 10062, 0, 0 (OK)
> LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10062.
> SC:  kill LWP 10068 **<SIGSTOP>**
> SC:  lwp kill 0 ERRNO-OK
> SC:  kill LWP 10067 **<SIGSTOP>**
> SC:  lwp kill 0 ERRNO-OK
> SC:  kill LWP 10066 **<SIGSTOP>**
> SC:  lwp kill 0 ERRNO-OK
> WL: waitpid LWP 10068 received Stopped (signal) (stopped)
> WL: waitpid LWP 10067 received Trace/breakpoint trap (stopped)
> PTRACE_CONT LWP 10067, 0, 0 (OK)
> SWC: Candidate SIGTRAP event in LWP 10067
> WL: waitpid LWP 10067 received Stopped (signal) (stopped)
> WL: waitpid LWP 10066 received Stopped (signal) (stopped)
> FC: LP has pending status 00057f
> SEL: Found 2 SIGTRAP events, selecting #0  <=== should not happen
> CBC: Push back breakpoint for LWP 10062
> LLW: trap_ptid is LWP 10067.
> 
> Program received signal SIGTRAP, Trace/breakpoint trap.
> [Switching to Thread -1231361104 (LWP 10067)]
> 0x080489cb in synchronize (tid=3063606192) at gdbtest.C:18
> 18	    pthread_mutex_lock(&mutex);

In this trace, we single-step process 10066 [we resumed all the other
threads first - I snipped too much].  Then we wait.  We receive an
event from 10062.  We stop all other threads.  We receive another event
from 10067, and a SIGSTOP from 10066 and 10068.  So the thread had not
yet been scheduled when we sent it the SIGSTOP, and it hasn't done the
single step yet.

If I am interpreting your patch correctly, it will cause us to use
waitpid without __WNOHANG on LWP 10066.  If 10066 is stepping over a
blocking syscall, and another thread hits a breakpoint, we want to
display the breakpoint.  Otherwise we'll get the wrong behavior in the
non-race-condition situation where one thread goes to sleep.

Infrun needs to verify that the thread which had an event was the
thread which stepped.  If it isn't, then hopefully the single-step has
not happened - this will be true using x86 and PTRACE_SINGLESTEP, but
may require some other changes for software singlestep that I think I
have queued somewhere.

-- 
Daniel Jacobowitz

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2004-09-15 18:21 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2004-09-14 21:44 [RFC]: Ugly thread step situation jjohnstn
2004-09-14 22:42 ` Andrew Cagney
2004-09-14 23:25 ` Daniel Jacobowitz
2004-09-15 17:36   ` Jeff Johnston
2004-09-15 17:53     ` Andrew Cagney
2004-09-15 17:59       ` Jeff Johnston
2004-09-15 18:21     ` Daniel Jacobowitz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox