* [RFC]: Ugly thread step situation
@ 2004-09-14 21:44 jjohnstn
2004-09-14 22:42 ` Andrew Cagney
2004-09-14 23:25 ` Daniel Jacobowitz
0 siblings, 2 replies; 7+ messages in thread
From: jjohnstn @ 2004-09-14 21:44 UTC (permalink / raw)
To: gdb-patches
[-- Attachment #1: Type: TEXT/PLAIN, Size: 1713 bytes --]
I recently tracked down a problem with gdb on RHEL3 Linux regarding
stepping threads. What happens is that in some instances, lin-lwp.c is
asked to step the thread of interest. We then wait on all threads. Due
to some form of race condition, the wait does not get back the trap from
the stepped thread. If we have a number of waiting events (e.g. thread
create events, other breakpoints), lin-lwp picks one of them.
Now it gets interesting. Infrun.c thinks the current thread is being
stepped and isn't ready for a breakpoint coming back. On x86, it makes a
miscalculation of the pc value (for a breakpoint it should back up 1, for
a step it doesn't have to). We end up pointing at an invalid pc (we
didn't back up 1) and everything falls apart from there.
To fix this quickly, I added the accompanying patch to lin-lwp.c. What it
does is ensure that we wait on any currently stepping lwp. In truth, this
isn't as bad as it sounds. The lin-lwp code later on is set up to pick
the stepping lwp over all other events. This just keeps the scenario
above from occurring.
Obviously, this doesn't solve everything. Perhaps the decrement of the pc
needs to be done once we have established whether the thread has changed
underneath us. We also could use a hook to run the lwp list and find out
if the current lwp was stepping or encountered a breakpoint.
Anyway, if the consensus is that the patch is helpful in the short-term, I
am more than happy to check it in.
-- Jeff J.
2004-09-14 Jeff Johnston <jjohnstn@redhat.com>
* lin-lwp.c (find_singlestep_lwp_callback): New static function.
(lin_lwp_wait): Change code to specifically wait on any LWP
that is currently stepping.
[-- Attachment #2: Type: TEXT/PLAIN, Size: 1848 bytes --]
--- gdb+dejagnu-20040607/gdb/lin-lwp.c.fix Thu Sep 2 11:20:07 2004
+++ gdb+dejagnu-20040607/gdb/lin-lwp.c Thu Sep 2 11:27:32 2004
@@ -1004,7 +1004,18 @@ count_events_callback (struct lwp_info *
return 0;
}
-/* Select the LWP (if any) that is currently being single-stepped. */
+/* Find an LWP (if any) that is currently being single-stepped. */
+
+static int
+find_singlestep_lwp_callback (struct lwp_info *lp, void *data)
+{
+ if (lp->step)
+ return 1;
+ else
+ return 0;
+}
+
+/* Select the LWP with an event (if any) that is currently being single-stepped. */
static int
select_singlestep_lwp_callback (struct lwp_info *lp, void *data)
@@ -1289,7 +1300,24 @@ retry:
least if there are any LWPs at all. */
gdb_assert (num_lwps == 0 || iterate_over_lwps (resumed_callback, NULL));
- /* First check if there is a LWP with a wait status pending. */
+ /* Check if there is any LWP that is being single-stepped. We need to
+ wait specifically on such an LWP because the higher-level code is
+ expecting a step operation to find an event on the stepped LWP.
+ It is possible for other events to occur before the step operation
+ gets the expected trap so we don't want to wait on any LWP.
+ This has ramifications when adjustment of the PC is required which can be
+ different after a breakpoint vs a step (e.g. x86). */
+ lp = iterate_over_lwps (find_singlestep_lwp_callback, NULL);
+ if (lp) {
+ if (debug_lin_lwp)
+ fprintf_unfiltered (gdb_stdlog,
+ "LLW: Found step lwp %s.\n",
+ target_pid_to_str (lp->ptid));
+ ptid = lp->ptid;
+ pid = PIDGET (ptid);
+ }
+
+ /* If any pid, check if there is a LWP with a wait status pending. */
if (pid == -1)
{
/* Any LWP that's been resumed will do. */
^ permalink raw reply [flat|nested] 7+ messages in thread* Re: [RFC]: Ugly thread step situation 2004-09-14 21:44 [RFC]: Ugly thread step situation jjohnstn @ 2004-09-14 22:42 ` Andrew Cagney 2004-09-14 23:25 ` Daniel Jacobowitz 1 sibling, 0 replies; 7+ messages in thread From: Andrew Cagney @ 2004-09-14 22:42 UTC (permalink / raw) To: jjohnstn; +Cc: gdb-patches > Anyway, if the consensus is that the patch is helpful in the short-term, I > am more than happy to check it in. What's involved in adding some sort of testcase? If this were committed, the testcase will act as our insturance ensuring that the s/short/long/ term fix, or even an unrelated change, do not re-inroduce this bug. Even without this, the testcase acts as a red-flag identifying a breakage that any long term fix needs to handle. Andrew ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC]: Ugly thread step situation 2004-09-14 21:44 [RFC]: Ugly thread step situation jjohnstn 2004-09-14 22:42 ` Andrew Cagney @ 2004-09-14 23:25 ` Daniel Jacobowitz 2004-09-15 17:36 ` Jeff Johnston 1 sibling, 1 reply; 7+ messages in thread From: Daniel Jacobowitz @ 2004-09-14 23:25 UTC (permalink / raw) To: jjohnstn; +Cc: gdb-patches On Tue, Sep 14, 2004 at 05:44:47PM -0400, jjohnstn wrote: > I recently tracked down a problem with gdb on RHEL3 Linux regarding > stepping threads. What happens is that in some instances, lin-lwp.c is > asked to step the thread of interest. We then wait on all threads. Due > to some form of race condition, the wait does not get back the trap from > the stepped thread. If we have a number of waiting events (e.g. thread > create events, other breakpoints), lin-lwp picks one of them. Could you explain this bit a little more? What comes back instead for the thread that was stepping? Do we stop it with a SIGSTOP? Is there a testcase? > Now it gets interesting. Infrun.c thinks the current thread is being > stepped and isn't ready for a breakpoint coming back. On x86, it makes a > miscalculation of the pc value (for a breakpoint it should back up 1, for > a step it doesn't have to). We end up pointing at an invalid pc (we > didn't back up 1) and everything falls apart from there. > > To fix this quickly, I added the accompanying patch to lin-lwp.c. What it > does is ensure that we wait on any currently stepping lwp. In truth, this > isn't as bad as it sounds. The lin-lwp code later on is set up to pick > the stepping lwp over all other events. This just keeps the scenario > above from occurring. > > Obviously, this doesn't solve everything. Perhaps the decrement of the pc > needs to be done once we have established whether the thread has changed > underneath us. We also could use a hook to run the lwp list and find out > if the current lwp was stepping or encountered a breakpoint. > > Anyway, if the consensus is that the patch is helpful in the short-term, I > am more than happy to check it in. > > -- Jeff J. > > 2004-09-14 Jeff Johnston <jjohnstn@redhat.com> > > * lin-lwp.c (find_singlestep_lwp_callback): New static function. > (lin_lwp_wait): Change code to specifically wait on any LWP > that is currently stepping. This sounds sort of like a problem I debugged on MIPS and hppa, but never managed to reproduce. I had tabled the patch until I had more time to look at it - always a mistake. The same patch may help here. Could you tell me what resume_ptid is before the call to target_resume, in resume? The call in which we request the single-step, I mean. -- Daniel Jacobowitz ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC]: Ugly thread step situation 2004-09-14 23:25 ` Daniel Jacobowitz @ 2004-09-15 17:36 ` Jeff Johnston 2004-09-15 17:53 ` Andrew Cagney 2004-09-15 18:21 ` Daniel Jacobowitz 0 siblings, 2 replies; 7+ messages in thread From: Jeff Johnston @ 2004-09-15 17:36 UTC (permalink / raw) To: Daniel Jacobowitz; +Cc: gdb-patches [-- Attachment #1: Type: text/plain, Size: 5909 bytes --] Daniel Jacobowitz wrote: > On Tue, Sep 14, 2004 at 05:44:47PM -0400, jjohnstn wrote: > >>I recently tracked down a problem with gdb on RHEL3 Linux regarding >>stepping threads. What happens is that in some instances, lin-lwp.c is >>asked to step the thread of interest. We then wait on all threads. Due >>to some form of race condition, the wait does not get back the trap from >>the stepped thread. If we have a number of waiting events (e.g. thread >>create events, other breakpoints), lin-lwp picks one of them. > > > Could you explain this bit a little more? What comes back instead for > the thread that was stepping? Do we stop it with a SIGSTOP? > > Is there a testcase? > Attached. This was the test-case given for Red Hat Bugzilla bug 130896. Basically, you break at a thread function and attempt to do a mix of nexts and continues. It doesn't seem to occur deterministically but it will eventually occur. The following is the excerpt of the lin-lwp trace that I put in the Bugzilla bug. It shows where the error occurs. We get an event on another thread, stop the stepping thread and see it stopped. At that point we have a choice of breakpoints to choose from. The program needs to be run with an argument of 4 set debug lin-lwp 1 b synchronize run 4 next ... Breakpoint 1, synchronize (tid=3074096048) at gdbtest.C:18 18 pthread_mutex_lock(&mutex); (gdb) n LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread) LLW: waitpid 10066 received Trace/breakpoint trap (stopped) LLTA: PTRACE_PEEKUSER LWP 10066, 0, 0 (OK) LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10066. SEL: Select single-step LWP 10066 LLW: trap_ptid is LWP 10066. RC: PTRACE_CONT LWP 10068, 0, 0 (resume sibling) RC: PTRACE_CONT LWP 10067, 0, 0 (resume sibling) RC: PTRACE_CONT LWP 10062, 0, 0 (resume sibling) LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread) LLW: waitpid 10067 received Trace/breakpoint trap (stopped) LLTA: PTRACE_PEEKUSER LWP 10067, 0, 0 (OK) LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10067. SC: kill LWP 10068 **<SIGSTOP>** SC: lwp kill 0 ERRNO-OK SC: kill LWP 10066 **<SIGSTOP>** SC: lwp kill 0 ERRNO-OK SC: kill LWP 10062 **<SIGSTOP>** SC: lwp kill 0 ERRNO-OK WL: waitpid LWP 10068 received Stopped (signal) (stopped) WL: waitpid LWP 10066 received Trace/breakpoint trap (stopped) PTRACE_CONT LWP 10066, 0, 0 (OK) SWC: Candidate SIGTRAP event in LWP 10066 WL: waitpid LWP 10066 received Stopped (signal) (stopped) WL: waitpid LWP 10062 received Trace/breakpoint trap (stopped) PTRACE_CONT LWP 10062, 0, 0 (OK) SWC: Candidate SIGTRAP event in LWP 10062 WL: waitpid LWP 10062 received Stopped (signal) (stopped) FC: LP has pending status 00057f FC: LP has pending status 00057f SEL: Select single-step LWP 10066 CBC: Push back breakpoint for LWP 10067 CBC: Push back breakpoint for LWP 10062 LLW: trap_ptid is LWP 10066. RC: PTRACE_CONT LWP 10068, 0, 0 (resume sibling) RC: PTRACE_CONT LWP 10067, 0, 0 (resume sibling) RC: PTRACE_CONT LWP 10062, 0, 0 (resume sibling) LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread) LLW: waitpid 10062 received Trace/breakpoint trap (stopped) LLTA: PTRACE_PEEKUSER LWP 10062, 0, 0 (OK) LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10062. SC: kill LWP 10068 **<SIGSTOP>** SC: lwp kill 0 ERRNO-OK SC: kill LWP 10067 **<SIGSTOP>** SC: lwp kill 0 ERRNO-OK SC: kill LWP 10066 **<SIGSTOP>** SC: lwp kill 0 ERRNO-OK WL: waitpid LWP 10068 received Stopped (signal) (stopped) WL: waitpid LWP 10067 received Trace/breakpoint trap (stopped) PTRACE_CONT LWP 10067, 0, 0 (OK) SWC: Candidate SIGTRAP event in LWP 10067 WL: waitpid LWP 10067 received Stopped (signal) (stopped) WL: waitpid LWP 10066 received Stopped (signal) (stopped) FC: LP has pending status 00057f SEL: Found 2 SIGTRAP events, selecting #0 <=== should not happen CBC: Push back breakpoint for LWP 10062 LLW: trap_ptid is LWP 10067. Program received signal SIGTRAP, Trace/breakpoint trap. [Switching to Thread -1231361104 (LWP 10067)] 0x080489cb in synchronize (tid=3063606192) at gdbtest.C:18 18 pthread_mutex_lock(&mutex); > >>Now it gets interesting. Infrun.c thinks the current thread is being >>stepped and isn't ready for a breakpoint coming back. On x86, it makes a >>miscalculation of the pc value (for a breakpoint it should back up 1, for >>a step it doesn't have to). We end up pointing at an invalid pc (we >>didn't back up 1) and everything falls apart from there. >> >>To fix this quickly, I added the accompanying patch to lin-lwp.c. What it >>does is ensure that we wait on any currently stepping lwp. In truth, this >>isn't as bad as it sounds. The lin-lwp code later on is set up to pick >>the stepping lwp over all other events. This just keeps the scenario >>above from occurring. >> >>Obviously, this doesn't solve everything. Perhaps the decrement of the pc >>needs to be done once we have established whether the thread has changed >>underneath us. We also could use a hook to run the lwp list and find out >>if the current lwp was stepping or encountered a breakpoint. >> >>Anyway, if the consensus is that the patch is helpful in the short-term, I >>am more than happy to check it in. >> >>-- Jeff J. >> >>2004-09-14 Jeff Johnston <jjohnstn@redhat.com> >> >> * lin-lwp.c (find_singlestep_lwp_callback): New static function. >> (lin_lwp_wait): Change code to specifically wait on any LWP >> that is currently stepping. > > > This sounds sort of like a problem I debugged on MIPS and hppa, but > never managed to reproduce. I had tabled the patch until I had more > time to look at it - always a mistake. > > The same patch may help here. Could you tell me what resume_ptid is > before the call to target_resume, in resume? The call in which we > request the single-step, I mean. > I'll have to look into it. -- Jeff J. > [-- Attachment #2: gdbtest.C --] [-- Type: text/plain, Size: 1322 bytes --] #ifdef __sun #include <stream.h> #else #include <iostream> using namespace std; #endif #include <pthread.h> #include <stdlib.h> #include <sys/time.h> pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER; // getitimer, setitimer void synchronize(pthread_t tid) { pthread_mutex_lock(&mutex); cout << "In synchronize() for thread " << tid << endl; pthread_mutex_unlock(&mutex); return; } void pthreadMain() { try { while ( 1 ) { pthread_t tid = pthread_self(); //cout << "Thread " << tid << " about to call synchronize()..." << endl; synchronize(tid); //cout << "Thread " << tid << " done..." << endl; } } catch ( ... ) { cout << "--> EXCEPTION ???" << endl; } } int main(int argc, char **argv) { int numThreads; if ( argc == 2 ) { numThreads = atoi(argv[1]); } else { cout << "Usage: server <num threads>" << '\n'; return(1); } pthread_t *tid = new pthread_t[numThreads]; int idx; for ( idx = 1; idx < numThreads; idx++ ) { if ( pthread_create(tid+idx, NULL,(void*(*)(void*))pthreadMain, NULL) != 0 ) cout << "--> FAILED TO CREATE THREAD!!!" << endl; } for ( idx = 1; idx < numThreads; idx++ ) pthread_join(tid[idx], NULL); cout << "--> EXECUTION COMPLETED" << endl; return(0); } ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC]: Ugly thread step situation 2004-09-15 17:36 ` Jeff Johnston @ 2004-09-15 17:53 ` Andrew Cagney 2004-09-15 17:59 ` Jeff Johnston 2004-09-15 18:21 ` Daniel Jacobowitz 1 sibling, 1 reply; 7+ messages in thread From: Andrew Cagney @ 2004-09-15 17:53 UTC (permalink / raw) To: Jeff Johnston; +Cc: Daniel Jacobowitz, gdb-patches > Attached. This was the test-case given for Red Hat Bugzilla bug 130896. The FSF can't accept this, or integrate it into the testsuite/ Andrew ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC]: Ugly thread step situation 2004-09-15 17:53 ` Andrew Cagney @ 2004-09-15 17:59 ` Jeff Johnston 0 siblings, 0 replies; 7+ messages in thread From: Jeff Johnston @ 2004-09-15 17:59 UTC (permalink / raw) To: Andrew Cagney; +Cc: Daniel Jacobowitz, gdb-patches Andrew Cagney wrote: >> Attached. This was the test-case given for Red Hat Bugzilla bug 130896. > > > The FSF can't accept this, or integrate it into the testsuite/ > > Andrew > Yes, I should have stated that. It is just to allow Daniel (and others) to recreate the problem. -- Jeff J. > > ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC]: Ugly thread step situation 2004-09-15 17:36 ` Jeff Johnston 2004-09-15 17:53 ` Andrew Cagney @ 2004-09-15 18:21 ` Daniel Jacobowitz 1 sibling, 0 replies; 7+ messages in thread From: Daniel Jacobowitz @ 2004-09-15 18:21 UTC (permalink / raw) To: Jeff Johnston; +Cc: gdb-patches On Wed, Sep 15, 2004 at 01:36:13PM -0400, Jeff Johnston wrote: > Daniel Jacobowitz wrote: > >On Tue, Sep 14, 2004 at 05:44:47PM -0400, jjohnstn wrote: > > > >>I recently tracked down a problem with gdb on RHEL3 Linux regarding > >>stepping threads. What happens is that in some instances, lin-lwp.c is > >>asked to step the thread of interest. We then wait on all threads. Due > >>to some form of race condition, the wait does not get back the trap from > >>the stepped thread. If we have a number of waiting events (e.g. thread > >>create events, other breakpoints), lin-lwp picks one of them. > > > > > >Could you explain this bit a little more? What comes back instead for > >the thread that was stepping? Do we stop it with a SIGSTOP? > > > >Is there a testcase? > > > > Attached. This was the test-case given for Red Hat Bugzilla bug 130896. Given Andrew's caveat, I will just look at the trace instead. I see that bug isn't publicly visible. > LLR: PTRACE_SINGLESTEP process 10066, 0 (resume event thread) > LLW: waitpid 10062 received Trace/breakpoint trap (stopped) > LLTA: PTRACE_PEEKUSER LWP 10062, 0, 0 (OK) > LLW: Candidate event Trace/breakpoint trap (stopped) in LWP 10062. > SC: kill LWP 10068 **<SIGSTOP>** > SC: lwp kill 0 ERRNO-OK > SC: kill LWP 10067 **<SIGSTOP>** > SC: lwp kill 0 ERRNO-OK > SC: kill LWP 10066 **<SIGSTOP>** > SC: lwp kill 0 ERRNO-OK > WL: waitpid LWP 10068 received Stopped (signal) (stopped) > WL: waitpid LWP 10067 received Trace/breakpoint trap (stopped) > PTRACE_CONT LWP 10067, 0, 0 (OK) > SWC: Candidate SIGTRAP event in LWP 10067 > WL: waitpid LWP 10067 received Stopped (signal) (stopped) > WL: waitpid LWP 10066 received Stopped (signal) (stopped) > FC: LP has pending status 00057f > SEL: Found 2 SIGTRAP events, selecting #0 <=== should not happen > CBC: Push back breakpoint for LWP 10062 > LLW: trap_ptid is LWP 10067. > > Program received signal SIGTRAP, Trace/breakpoint trap. > [Switching to Thread -1231361104 (LWP 10067)] > 0x080489cb in synchronize (tid=3063606192) at gdbtest.C:18 > 18 pthread_mutex_lock(&mutex); In this trace, we single-step process 10066 [we resumed all the other threads first - I snipped too much]. Then we wait. We receive an event from 10062. We stop all other threads. We receive another event from 10067, and a SIGSTOP from 10066 and 10068. So the thread had not yet been scheduled when we sent it the SIGSTOP, and it hasn't done the single step yet. If I am interpreting your patch correctly, it will cause us to use waitpid without __WNOHANG on LWP 10066. If 10066 is stepping over a blocking syscall, and another thread hits a breakpoint, we want to display the breakpoint. Otherwise we'll get the wrong behavior in the non-race-condition situation where one thread goes to sleep. Infrun needs to verify that the thread which had an event was the thread which stepped. If it isn't, then hopefully the single-step has not happened - this will be true using x86 and PTRACE_SINGLESTEP, but may require some other changes for software singlestep that I think I have queued somewhere. -- Daniel Jacobowitz ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2004-09-15 18:21 UTC | newest] Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2004-09-14 21:44 [RFC]: Ugly thread step situation jjohnstn 2004-09-14 22:42 ` Andrew Cagney 2004-09-14 23:25 ` Daniel Jacobowitz 2004-09-15 17:36 ` Jeff Johnston 2004-09-15 17:53 ` Andrew Cagney 2004-09-15 17:59 ` Jeff Johnston 2004-09-15 18:21 ` Daniel Jacobowitz
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox