From: Michael Snyder <msnyder@cygnus.com>
To: gdb-patches@sources.redhat.com
Cc: kettenis@wins.uva.nl
Subject: [PATCH RFC] linux threads: prevent starvation
Date: Tue, 12 Jun 2001 18:07:00 -0000 [thread overview]
Message-ID: <3B26BCE9.AA6731EC@cygnus.com> (raw)
OK, this deserves some explanation.
Problem: Debugging with GDB can cause some threads to starve
in a multi-threaded Linux program.
Testcase: gdb/testsuite/gdb.threads/pthreads.exp
FAIL: gdb.threads/pthreads.exp: Some threads didn't run.
Cause: In lin_lwp_wait, we call waitpid (-1, __WCLONE).
If more than one LWP is currently stopped at a breakpoint,
the highest-numbered one will be returned. All the others
will have their breakpoints pushed back. If we then continue,
and the highest-numbered LWP has time to hit a breakpoint
again, no one else will have a chance to run.
Solution: Instead of always accepting the event for the
first LWP returned by waitpid, use a monte carlo method
to select among all LWPs that have breakpoint events.
Implementation: I've made use of the lwp list and saved status
that Mark has already implemented. I've saved the SIGTRAP events
along with any other events accidentally fetched by stop_wait_callback,
and I've added the event fetched by lin_lwp_wait to the same queue.
Then I've added code that examines the list of fetched SIGTRAP events
and chooses one at random, making that LWP the event LWP and returning
its event to gdb.
This passes the testsuite (where the old method fails), and I've
done extensive hand testing as well. In fact, this has exposed
a number of bugs in core gdb's thread handling, some of which I've
already submitted patches for. Others are still to come. With
all of these patches in place, this code seems to be quite robust.
I can put a breakpoint in code that is shared by several threads,
and keep them all stepping and nexting indefinitely.
2001-06-12 Michael Snyder <msnyder@redhat.com>
* lin-lwp.c: Prevent thread starvation by using a monte carlo
method to choose which of several event threads to handle next.
(resume_callback): Disable code that attempts to handle
step_resume breakpoints. Let core gdb handle this.
(stop_wait_callback): Defer pushback of breakpoint events until
later; add SIGTRAP events to the queue of unhandled events.
Keep calling waitpid until SIGSTOP retrieved. If more than one
non-SIGSTOP event is retrieved, push them back onto the process
queue using kill.
(count_events_callback, select_singlestep_lwp_callback,
select_event_lwp_callback, cancel_breakpoints_callback,
select_event_lwp): New functions. Implement monte carlo method
for selecting which of several SIGTRAP threads to handle next.
Push back the breakpoint event for all threads other than the
selected one.
(lin_lwp_wait): Call select_event_lwp to decide which of several
sigtrapped lwps to handle next.
Index: lin-lwp.c
===================================================================
RCS file: /cvs/src/src/gdb/lin-lwp.c,v
retrieving revision 1.24
diff -c -3 -p -r1.24 lin-lwp.c
*** lin-lwp.c 2001/06/07 19:31:10 1.24
--- lin-lwp.c 2001/06/13 00:50:29
*************** resume_callback (struct lwp_info *lp, vo
*** 458,464 ****
{
struct thread_info *tp;
! #if 1
/* FIXME: kettenis/2000-08-26: This should really be handled
properly by core GDB. */
--- 458,464 ----
{
struct thread_info *tp;
! #if 0
/* FIXME: kettenis/2000-08-26: This should really be handled
properly by core GDB. */
*************** stop_wait_callback (struct lwp_info *lp,
*** 585,591 ****
pid_t pid;
int status;
- get_another_event:
gdb_assert (lp->status == 0);
pid = waitpid (GET_LWP (lp->ptid), &status,
--- 585,590 ----
*************** stop_wait_callback (struct lwp_info *lp,
*** 619,631 ****
}
gdb_assert (WIFSTOPPED (status));
- lp->stopped = 1;
if (WSTOPSIG (status) != SIGSTOP)
{
! if (WSTOPSIG (status) == SIGTRAP
! && breakpoint_inserted_here_p (read_pc_pid (pid_to_ptid (pid))
! - DECR_PC_AFTER_BREAK))
{
/* If a LWP other than the LWP that we're reporting an
event for has hit a GDB breakpoint (as opposed to
--- 618,627 ----
}
gdb_assert (WIFSTOPPED (status));
if (WSTOPSIG (status) != SIGSTOP)
{
! if (WSTOPSIG (status) == SIGTRAP)
{
/* If a LWP other than the LWP that we're reporting an
event for has hit a GDB breakpoint (as opposed to
*************** stop_wait_callback (struct lwp_info *lp,
*** 640,660 ****
user will delete or disable the breakpoint, but the
thread will have already tripped on it. */
- if (debug_lin_lwp)
- fprintf_unfiltered (gdb_stdlog,
- "Tripped breakpoint at %lx in LWP %d"
- " while waiting for SIGSTOP.\n",
- (long) read_pc_pid (lp->ptid), pid);
-
- /* Set the PC to before the trap. */
- if (DECR_PC_AFTER_BREAK)
- write_pc_pid (read_pc_pid (pid_to_ptid (pid))
- - DECR_PC_AFTER_BREAK,
- pid_to_ptid (pid));
-
/* Now resume this LWP and get the SIGSTOP event. */
ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! goto get_another_event;
}
else if (WSTOPSIG (status) == SIGINT &&
signal_pass_state (SIGINT) == 0)
--- 636,657 ----
user will delete or disable the breakpoint, but the
thread will have already tripped on it. */
/* Now resume this LWP and get the SIGSTOP event. */
ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! if (debug_lin_lwp)
! {
! fprintf_unfiltered (gdb_stderr,
! "SWC: Candidate SIGTRAP event in %ld\n",
! GET_LWP (lp->ptid));
! }
! /* Hold the SIGTRAP for handling by lin_lwp_wait. */
! stop_wait_callback (lp, data);
! /* If there's another event, throw it back into the queue. */
! if (lp->status)
! kill (GET_LWP (lp->ptid), WSTOPSIG (lp->status));
! /* Save the sigtrap event. */
! lp->status = status;
! return 0;
}
else if (WSTOPSIG (status) == SIGINT &&
signal_pass_state (SIGINT) == 0)
*************** stop_wait_callback (struct lwp_info *lp,
*** 664,690 ****
just ignore all SIGINT events from all lwp's except for
the one that was caught by lin_lwp_wait. */
! /* Now resume this LWP and get the SIGSTP event. */
ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! goto get_another_event;
}
else
{
if (debug_lin_lwp)
! fprintf_unfiltered (gdb_stdlog,
! "Received %s in LWP %d while waiting for SIGSTOP.\n",
! strsignal (WSTOPSIG (status)), pid);
! /* The thread was stopped with a signal other than
! SIGSTOP, and didn't accidentally trip a breakpoint.
! Record the wait status. */
! lp->status = status;
}
}
else
{
/* We caught the SIGSTOP that we intended to catch, so
there's no SIGSTOP pending. */
lp->signalled = 0;
}
}
--- 661,702 ----
just ignore all SIGINT events from all lwp's except for
the one that was caught by lin_lwp_wait. */
! /* Now resume this LWP and get the SIGSTOP event. */
ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! return stop_wait_callback (lp, data);
}
else
{
+ /* The thread was stopped with a signal other than
+ SIGSTOP, and didn't accidentally trip a breakpoint. */
+
if (debug_lin_lwp)
! {
! fprintf_unfiltered (gdb_stderr,
! "SWC: Pending event %d in %ld\n",
! WSTOPSIG (status), GET_LWP (lp->ptid));
! }
! /* Now resume this LWP and get the SIGSTOP event. */
! ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0);
! /* Hold this event/waitstatus while we check to see if
! there are any more (we still want to get that SIGSTOP). */
! stop_wait_callback (lp, data);
! /* If the lp->status field is still empty, use it to hold
! this event. If not, then this event must be returned
! to the event queue of the LWP. */
! if (lp->status == 0)
! lp->status = status;
! else
! kill (GET_LWP (lp->ptid), WSTOPSIG (status));
! return 0;
}
}
else
{
/* We caught the SIGSTOP that we intended to catch, so
there's no SIGSTOP pending. */
+ lp->stopped = 1;
lp->signalled = 0;
}
}
*************** running_callback (struct lwp_info *lp, v
*** 710,715 ****
--- 722,862 ----
return (lp->stopped == 0);
}
+ /* Count the LWP's that have had events. */
+
+ static int
+ count_events_callback (struct lwp_info *lp, void *data)
+ {
+ int *count = data;
+
+ /* Count only threads that have a SIGTRAP pending. */
+ if (lp->status != 0 &&
+ WIFSTOPPED (lp->status) &&
+ WSTOPSIG (lp->status) == SIGTRAP &&
+ count != NULL) /* paranoia */
+ (*count)++;
+
+ return 0;
+ }
+
+ /* Select the LWP (if any) that is currently being single-stepped. */
+
+ static int
+ select_singlestep_lwp_callback (struct lwp_info *lp, void *data)
+ {
+ if (lp->step)
+ return 1;
+ else
+ return 0;
+ }
+
+ /* Select the Nth LWP that has had a SIGTRAP event. */
+
+ static int
+ select_event_lwp_callback (struct lwp_info *lp, void *data)
+ {
+ int *selector = data;
+
+ /* Select only threads that have a SIGTRAP event pending. */
+ if (lp->status != 0 &&
+ WIFSTOPPED (lp->status) &&
+ WSTOPSIG (lp->status) == SIGTRAP &&
+ selector != NULL) /* paranoia */
+ if ((*selector)-- == 0)
+ return 1;
+
+ return 0;
+ }
+
+ static int
+ cancel_breakpoints_callback (struct lwp_info *lp, void *data)
+ {
+ struct lwp_info *event_lp = data;
+
+ if (lp != event_lp &&
+ lp->status != 0 &&
+ WIFSTOPPED (lp->status) &&
+ WSTOPSIG (lp->status) == SIGTRAP &&
+ breakpoint_inserted_here_p (read_pc_pid (lp->ptid) -
+ DECR_PC_AFTER_BREAK))
+ {
+ if (debug_lin_lwp)
+ {
+ fprintf_unfiltered (gdb_stdlog,
+ "Push back BP for %ld\n",
+ GET_LWP (lp->ptid));
+ }
+ /* For each lp except the event lp, if there was a trap,
+ set the PC to before the trap. */
+ if (DECR_PC_AFTER_BREAK)
+ {
+ write_pc_pid (read_pc_pid (lp->ptid) - DECR_PC_AFTER_BREAK,
+ lp->ptid);
+ }
+ lp->status = 0;
+ }
+ return 0;
+ }
+
+ /* Select one LWP out of those that have events to be the event thread. */
+
+ static void
+ select_event_lwp (struct lwp_info **orig_lp, int *status)
+ {
+ int num_events = 0;
+ int random_selector;
+ struct lwp_info *event_lp;
+
+ /* Give preference to any LWP that is being single-stepped. */
+ event_lp = iterate_over_lwps (select_singlestep_lwp_callback, NULL);
+ if (event_lp != NULL)
+ {
+ if (debug_lin_lwp)
+ fprintf_unfiltered (gdb_stdlog,
+ "Select singlestep lwp %ld\n",
+ GET_LWP (event_lp->ptid));
+ }
+ else
+ {
+ /* No single-stepping LWP. Select one at random, out of those
+ which have had SIGTRAP events. */
+
+ /* First see how many SIGTRAP events we have. */
+ iterate_over_lwps (count_events_callback, (void *) &num_events);
+
+ /* OK, now randomly pick the Nth LWP of those that have had SIGTRAP. */
+ random_selector = (int)
+ ((num_events * (double) rand ()) / (RAND_MAX + 1.0));
+
+ if (debug_lin_lwp)
+ {
+ if (num_events > 1)
+ fprintf_unfiltered (gdb_stdlog,
+ "Found %d SIGTRAP events, selecting #%d\n",
+ num_events, random_selector);
+ else if (num_events <= 0)
+ fprintf_unfiltered (gdb_stdlog,
+ "ERROR select_event_lwp %d events!\n",
+ num_events);
+ }
+
+ event_lp = iterate_over_lwps (select_event_lwp_callback,
+ (void *) &random_selector);
+ }
+
+ if (event_lp != NULL)
+ {
+ /* "event_lp" is now the selected event thread.
+ If any other threads have had SIGTRAP events, these events
+ must now be cancelled, so that the respective thread will
+ trip the breakpoint again once it is resumed. */
+ iterate_over_lwps (cancel_breakpoints_callback, (void *) event_lp);
+ *orig_lp = event_lp;
+ *status = event_lp->status;
+ event_lp->status = 0;
+ }
+ }
+
/* Return non-zero if LP has been resumed. */
static int
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 741,757 ****
/* First check if there is a LWP with a wait status pending. */
if (pid == -1)
{
! /* Any LWP will do. */
lp = iterate_over_lwps (status_callback, NULL);
if (lp)
{
! if (debug_lin_lwp)
fprintf_unfiltered (gdb_stdlog,
! "Using pending wait status for LWP %ld.\n",
GET_LWP (lp->ptid));
- status = lp->status;
- lp->status = 0;
}
/* But if we don't fine one, we'll have to wait, and check both
--- 888,907 ----
/* First check if there is a LWP with a wait status pending. */
if (pid == -1)
{
! /* Any LWP that's been resumed will do. */
lp = iterate_over_lwps (status_callback, NULL);
if (lp)
{
! status = lp->status;
! lp->status = 0;
! if (debug_lin_lwp && status)
fprintf_unfiltered (gdb_stdlog,
! "Using pending wait status %d for LWP %ld.\n",
! WIFSTOPPED (status) ? WSTOPSIG (status) :
! WIFSIGNALED (status) ? WTERMSIG (status) :
! WEXITSTATUS (status),
GET_LWP (lp->ptid));
}
/* But if we don't fine one, we'll have to wait, and check both
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 772,782 ****
status = lp->status;
lp->status = 0;
! if (debug_lin_lwp)
! if (status)
! fprintf_unfiltered (gdb_stdlog,
! "Using pending wait status for LWP %ld.\n",
! GET_LWP (lp->ptid));
/* If we have to wait, take into account whether PID is a cloned
process or not. And we have to convert it to something that
--- 922,934 ----
status = lp->status;
lp->status = 0;
! if (debug_lin_lwp && status)
! fprintf_unfiltered (gdb_stdlog,
! "Using pending wait status %d for LWP %ld.\n",
! WIFSTOPPED (status) ? WSTOPSIG (status) :
! WIFSIGNALED (status) ? WTERMSIG (status) :
! WEXITSTATUS (status),
! GET_LWP (lp->ptid));
/* If we have to wait, take into account whether PID is a cloned
process or not. And we have to convert it to something that
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 947,952 ****
--- 1099,1111 ----
/* This LWP is stopped now. */
lp->stopped = 1;
+ /* MVS: so save its event_status. */
+ lp->status = status;
+ if (debug_lin_lwp)
+ fprintf_unfiltered (gdb_stdlog,
+ "LLW: Candidate event %d in %ld\n",
+ WSTOPSIG (status), GET_LWP (lp->ptid));
+
/* Now stop all other LWP's ... */
iterate_over_lwps (stop_callback, NULL);
*************** lin_lwp_wait (ptid_t ptid, struct target
*** 954,964 ****
longer running. */
iterate_over_lwps (stop_wait_callback, NULL);
/* If we're not running in "threaded" mode, we'll report the bare
process id. */
if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
! trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid)));
else
trap_ptid = null_ptid;
--- 1113,1135 ----
longer running. */
iterate_over_lwps (stop_wait_callback, NULL);
+ /* MVS Now choose an event thread from among those that
+ have had events. Giving equal priority to all threads
+ that have had events helps prevent starvation. */
+
+ select_event_lwp (&lp, &status);
+
/* If we're not running in "threaded" mode, we'll report the bare
process id. */
if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP)
! {
! trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid)));
! if (debug_lin_lwp)
! fprintf_unfiltered (gdb_stdlog,
! "LLW: trap_ptid is %ld\n",
! GET_LWP (trap_ptid));
! }
else
trap_ptid = null_ptid;
From dan@cgsoftware.com Tue Jun 12 18:41:00 2001
From: Daniel Berlin <dan@cgsoftware.com>
To: Stan Shebs <shebs@apple.com>
Cc: Daniel Berlin <dan@cgsoftware.com>, Jim Blandy <jimb@zwingli.cygnus.com>, gdb-patches@sources.redhat.com
Subject: Re: Rewriting the type system
Date: Tue, 12 Jun 2001 18:41:00 -0000
Message-id: <87ofrtf97f.fsf@cgsoftware.com>
References: <m24rtrwkfd.fsf@kelso.bothner.com> <3B212FC8.6010801@cygnus.com> <m2pucevgf6.fsf@kelso.bothner.com> <87vgm6snjf.fsf@cgsoftware.com> <npelsq28sh.fsf_-_@zwingli.cygnus.com> <87d78aluw6.fsf@cgsoftware.com> <npofrty5wv.fsf@zwingli.cygnus.com> <87wv6hh9uc.fsf@cgsoftware.com> <3B268089.D5839A84@apple.com>
X-SW-Source: 2001-06/msg00251.html
Content-length: 5190
Stan Shebs <shebs@apple.com> writes:
> Daniel Berlin wrote:
>>
>> Jim Blandy <jimb@zwingli.cygnus.com> writes:
>>
>> > It's because, for whatever reason, you don't take the time to make
>> > your changes correct.
>>
>> Now that's simply bullshit.
>
> OK OK. For openers, let's agree not to do personal accusations and
> profanity on the list. Flame each other in private please, it will
> feel just as good and not waste everybody else's time.
>
>> > Here we have, in the space of less than a dozen lines of code:
>> >
>> > - host == target assumptions (why are you applying `*' to
>> > target-format data?)
>> >
>> > - sizeof (foo) assumptions (what is 8? what is 12?)
>>
>> Neat, but that code was written 2 years ago, when i was first starting
>> gdb development.
>> It was introduced into value_rtti_type, and copied in the gnuv3 rtti
>> type because gnuv3-abi.c was based on gnuv2-abi.c, which was based on
>> all that code.
>
> Dan, if it's mistaken, it's mistaken; excuses and history aren't really
> that interesting. We all have things we're not that proud of, myself
> probably more than anybody. Let's fix them and go on to the next
> thing.
Um, it's not exactly broken and in need of fixing, it only exists in a
patch where it was specifically disabled, because I knew it didn't work.
So Jim's just taking something from a patch I submitted to actually
get some long overdue work *started*, where in fact, that code wasn't
even enabled. It was never intended to *stay
that way* for very long, since i had a patch that does it the right
way . But I couldn't very well submit patches that depend on other
patches, without having those other patches approved first.
So, basically, Jim's bitching about something that's not even in the
gdb source, and even if it had been, would have been disabled.
>
>> > why should I bother to read your patches?
>>
>> Because as maintainer, it's your job?
>
> You're right here, Dan. Everybody here talks about maintenance as
> if it's some kind of signal honor, but no, it's just a responsibility,
> like peer-reviewing scientific papers. You don't get to read the
> title and say "Oh, Dr. Luser again, straight to the trash it goes."
> Gotta read the whole thing and provide meaningful feedback, and
> worse, you're expected to do it within a certain time period. The
> worse the patch, the more work it will be to review.
>
>> > You have an extensive history of reverted changes:
>
> Jim, everybody who's worked extensively on GCC or GDB has a history
> of reverted changes. It we waited until we were sure that each change
> was perfect, progress would be appallingly slow. Every day the GCC
> mainline gets patches that raise the hairs on the back of my neck,
> and I don't relax until I've confirm they don't break the target for
> which I'm responsible. Some of the patches do break GCC; the patch
> gets fixed or reverted, and everybody moves on.
>
>> [...on and on...]
>> It's nice of you to try to imply that most of my patches are wrong, when they
>> aren't.
>
> Both of you are in the wrong here. Jim, you know as well as anyone
> that the C++ symbol handling parts of GDB are not so neatly partitioned
> that you can review symbol patches without knowing more than a bit about
> C++, and Dan, you should be able to admit mistakes in your patches, fix
> them and resubmit, rather than flaming the reviewer of the patch.
I didn't flame him for reviewing that patch, I thanked him when he did
it. He's pulling the code
from a patch that was never committed, and code that wasn't even enabled by
default in that patch (unless by mistake. At least in the tree that
he's pulling that code from, i have a comment above gnuv3_rtti_type,
and the abi_ops initialization, to redo gnuv3_rtti_type. So if it is
in the actual patch he's got, he's got an old patch).
A work in progress, if you will. In fact, it had to be. The v3 abi was
still a moving target at the time it was originally written.
I'm happy to admit mistakes in my patch.
I'm happy to say that the code he quoted is broken as all get out.
It's an ugly, amazing, piece of crap.
That's why
1. If you look, it's not in the gdb sources.
2. Even in my sources, I noted it was a broken piece of crap, and
needed to be rewritten.
I'm also happy to admit that I needed, and asked for, help on
getting the v3 abi stuff done once the base abstraction was done.
The ABI documents are somewhat complex to follow, as most C++ ABI's are.
Hopefully, whoever takes over C++ will have an easier job, and
hopefully, I helped to make that possible. Apparently, I haven't.
Oh well, it won't be the first thing i've failed at, and it won't be
the last.
>
> There is plenty of room to criticize both maintainer responsiveness
> and patch quality without turning it into a spitting match, and I'm
> disappointed to see the development process sink to this level.
So am I, but i'm not going to let someone try to smear me in public.
Sorry.
--Dan
>
> Stan
--
"I went to the hardware store and bought some used paint. It was
in the shape of a house. I also bought some batteries, but they
weren't included. So I had to buy them again.
"-Steven Wright
next reply other threads:[~2001-06-12 18:07 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2001-06-12 18:07 Michael Snyder [this message]
2001-06-13 13:48 ` Andrew Cagney
2001-06-13 15:33 ` Christopher Faylor
2001-06-26 19:08 ` Michael Snyder
2001-07-06 12:15 ` Michael Snyder
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3B26BCE9.AA6731EC@cygnus.com \
--to=msnyder@cygnus.com \
--cc=gdb-patches@sources.redhat.com \
--cc=kettenis@wins.uva.nl \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox