From mboxrd@z Thu Jan 1 00:00:00 1970 From: Michael Snyder To: gdb-patches@sources.redhat.com Subject: Re: [PATCH RFC] linux threads: prevent starvation Date: Fri, 06 Jul 2001 12:15:00 -0000 Message-id: <3B460DA3.F995B302@cygnus.com> References: <3B26BCE9.AA6731EC@cygnus.com> X-SW-Source: 2001-07/msg00160.html Committed. Mark Kettenis, thanks for your review. Michael Snyder wrote: > > OK, this deserves some explanation. > > Problem: Debugging with GDB can cause some threads to starve > in a multi-threaded Linux program. > > Testcase: gdb/testsuite/gdb.threads/pthreads.exp > FAIL: gdb.threads/pthreads.exp: Some threads didn't run. > > Cause: In lin_lwp_wait, we call waitpid (-1, __WCLONE). > If more than one LWP is currently stopped at a breakpoint, > the highest-numbered one will be returned. All the others > will have their breakpoints pushed back. If we then continue, > and the highest-numbered LWP has time to hit a breakpoint > again, no one else will have a chance to run. > > Solution: Instead of always accepting the event for the > first LWP returned by waitpid, use a monte carlo method > to select among all LWPs that have breakpoint events. > > Implementation: I've made use of the lwp list and saved status > that Mark has already implemented. I've saved the SIGTRAP events > along with any other events accidentally fetched by stop_wait_callback, > and I've added the event fetched by lin_lwp_wait to the same queue. > Then I've added code that examines the list of fetched SIGTRAP events > and chooses one at random, making that LWP the event LWP and returning > its event to gdb. > > This passes the testsuite (where the old method fails), and I've > done extensive hand testing as well. In fact, this has exposed > a number of bugs in core gdb's thread handling, some of which I've > already submitted patches for. Others are still to come. With > all of these patches in place, this code seems to be quite robust. > I can put a breakpoint in code that is shared by several threads, > and keep them all stepping and nexting indefinitely. > > ---------------------------------------------------------------------------------------------------- > 2001-06-12 Michael Snyder > > * lin-lwp.c: Prevent thread starvation by using a monte carlo > method to choose which of several event threads to handle next. > > (resume_callback): Disable code that attempts to handle > step_resume breakpoints. Let core gdb handle this. > (stop_wait_callback): Defer pushback of breakpoint events until > later; add SIGTRAP events to the queue of unhandled events. > Keep calling waitpid until SIGSTOP retrieved. If more than one > non-SIGSTOP event is retrieved, push them back onto the process > queue using kill. > (count_events_callback, select_singlestep_lwp_callback, > select_event_lwp_callback, cancel_breakpoints_callback, > select_event_lwp): New functions. Implement monte carlo method > for selecting which of several SIGTRAP threads to handle next. > Push back the breakpoint event for all threads other than the > selected one. > (lin_lwp_wait): Call select_event_lwp to decide which of several > sigtrapped lwps to handle next. > > Index: lin-lwp.c > =================================================================== > RCS file: /cvs/src/src/gdb/lin-lwp.c,v > retrieving revision 1.24 > diff -c -3 -p -r1.24 lin-lwp.c > *** lin-lwp.c 2001/06/07 19:31:10 1.24 > --- lin-lwp.c 2001/06/13 00:50:29 > *************** resume_callback (struct lwp_info *lp, vo > *** 458,464 **** > { > struct thread_info *tp; > > ! #if 1 > /* FIXME: kettenis/2000-08-26: This should really be handled > properly by core GDB. */ > > --- 458,464 ---- > { > struct thread_info *tp; > > ! #if 0 > /* FIXME: kettenis/2000-08-26: This should really be handled > properly by core GDB. */ > > *************** stop_wait_callback (struct lwp_info *lp, > *** 585,591 **** > pid_t pid; > int status; > > - get_another_event: > gdb_assert (lp->status == 0); > > pid = waitpid (GET_LWP (lp->ptid), &status, > --- 585,590 ---- > *************** stop_wait_callback (struct lwp_info *lp, > *** 619,631 **** > } > > gdb_assert (WIFSTOPPED (status)); > - lp->stopped = 1; > > if (WSTOPSIG (status) != SIGSTOP) > { > ! if (WSTOPSIG (status) == SIGTRAP > ! && breakpoint_inserted_here_p (read_pc_pid (pid_to_ptid (pid)) > ! - DECR_PC_AFTER_BREAK)) > { > /* If a LWP other than the LWP that we're reporting an > event for has hit a GDB breakpoint (as opposed to > --- 618,627 ---- > } > > gdb_assert (WIFSTOPPED (status)); > > if (WSTOPSIG (status) != SIGSTOP) > { > ! if (WSTOPSIG (status) == SIGTRAP) > { > /* If a LWP other than the LWP that we're reporting an > event for has hit a GDB breakpoint (as opposed to > *************** stop_wait_callback (struct lwp_info *lp, > *** 640,660 **** > user will delete or disable the breakpoint, but the > thread will have already tripped on it. */ > > - if (debug_lin_lwp) > - fprintf_unfiltered (gdb_stdlog, > - "Tripped breakpoint at %lx in LWP %d" > - " while waiting for SIGSTOP.\n", > - (long) read_pc_pid (lp->ptid), pid); > - > - /* Set the PC to before the trap. */ > - if (DECR_PC_AFTER_BREAK) > - write_pc_pid (read_pc_pid (pid_to_ptid (pid)) > - - DECR_PC_AFTER_BREAK, > - pid_to_ptid (pid)); > - > /* Now resume this LWP and get the SIGSTOP event. */ > ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); > ! goto get_another_event; > } > else if (WSTOPSIG (status) == SIGINT && > signal_pass_state (SIGINT) == 0) > --- 636,657 ---- > user will delete or disable the breakpoint, but the > thread will have already tripped on it. */ > > /* Now resume this LWP and get the SIGSTOP event. */ > ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); > ! if (debug_lin_lwp) > ! { > ! fprintf_unfiltered (gdb_stderr, > ! "SWC: Candidate SIGTRAP event in %ld\n", > ! GET_LWP (lp->ptid)); > ! } > ! /* Hold the SIGTRAP for handling by lin_lwp_wait. */ > ! stop_wait_callback (lp, data); > ! /* If there's another event, throw it back into the queue. */ > ! if (lp->status) > ! kill (GET_LWP (lp->ptid), WSTOPSIG (lp->status)); > ! /* Save the sigtrap event. */ > ! lp->status = status; > ! return 0; > } > else if (WSTOPSIG (status) == SIGINT && > signal_pass_state (SIGINT) == 0) > *************** stop_wait_callback (struct lwp_info *lp, > *** 664,690 **** > just ignore all SIGINT events from all lwp's except for > the one that was caught by lin_lwp_wait. */ > > ! /* Now resume this LWP and get the SIGSTP event. */ > ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); > ! goto get_another_event; > } > else > { > if (debug_lin_lwp) > ! fprintf_unfiltered (gdb_stdlog, > ! "Received %s in LWP %d while waiting for SIGSTOP.\n", > ! strsignal (WSTOPSIG (status)), pid); > > ! /* The thread was stopped with a signal other than > ! SIGSTOP, and didn't accidentally trip a breakpoint. > ! Record the wait status. */ > ! lp->status = status; > } > } > else > { > /* We caught the SIGSTOP that we intended to catch, so > there's no SIGSTOP pending. */ > lp->signalled = 0; > } > } > --- 661,702 ---- > just ignore all SIGINT events from all lwp's except for > the one that was caught by lin_lwp_wait. */ > > ! /* Now resume this LWP and get the SIGSTOP event. */ > ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); > ! return stop_wait_callback (lp, data); > } > else > { > + /* The thread was stopped with a signal other than > + SIGSTOP, and didn't accidentally trip a breakpoint. */ > + > if (debug_lin_lwp) > ! { > ! fprintf_unfiltered (gdb_stderr, > ! "SWC: Pending event %d in %ld\n", > ! WSTOPSIG (status), GET_LWP (lp->ptid)); > ! } > ! /* Now resume this LWP and get the SIGSTOP event. */ > ! ptrace (PTRACE_CONT, GET_LWP (lp->ptid), 0, 0); > > ! /* Hold this event/waitstatus while we check to see if > ! there are any more (we still want to get that SIGSTOP). */ > ! stop_wait_callback (lp, data); > ! /* If the lp->status field is still empty, use it to hold > ! this event. If not, then this event must be returned > ! to the event queue of the LWP. */ > ! if (lp->status == 0) > ! lp->status = status; > ! else > ! kill (GET_LWP (lp->ptid), WSTOPSIG (status)); > ! return 0; > } > } > else > { > /* We caught the SIGSTOP that we intended to catch, so > there's no SIGSTOP pending. */ > + lp->stopped = 1; > lp->signalled = 0; > } > } > *************** running_callback (struct lwp_info *lp, v > *** 710,715 **** > --- 722,862 ---- > return (lp->stopped == 0); > } > > + /* Count the LWP's that have had events. */ > + > + static int > + count_events_callback (struct lwp_info *lp, void *data) > + { > + int *count = data; > + > + /* Count only threads that have a SIGTRAP pending. */ > + if (lp->status != 0 && > + WIFSTOPPED (lp->status) && > + WSTOPSIG (lp->status) == SIGTRAP && > + count != NULL) /* paranoia */ > + (*count)++; > + > + return 0; > + } > + > + /* Select the LWP (if any) that is currently being single-stepped. */ > + > + static int > + select_singlestep_lwp_callback (struct lwp_info *lp, void *data) > + { > + if (lp->step) > + return 1; > + else > + return 0; > + } > + > + /* Select the Nth LWP that has had a SIGTRAP event. */ > + > + static int > + select_event_lwp_callback (struct lwp_info *lp, void *data) > + { > + int *selector = data; > + > + /* Select only threads that have a SIGTRAP event pending. */ > + if (lp->status != 0 && > + WIFSTOPPED (lp->status) && > + WSTOPSIG (lp->status) == SIGTRAP && > + selector != NULL) /* paranoia */ > + if ((*selector)-- == 0) > + return 1; > + > + return 0; > + } > + > + static int > + cancel_breakpoints_callback (struct lwp_info *lp, void *data) > + { > + struct lwp_info *event_lp = data; > + > + if (lp != event_lp && > + lp->status != 0 && > + WIFSTOPPED (lp->status) && > + WSTOPSIG (lp->status) == SIGTRAP && > + breakpoint_inserted_here_p (read_pc_pid (lp->ptid) - > + DECR_PC_AFTER_BREAK)) > + { > + if (debug_lin_lwp) > + { > + fprintf_unfiltered (gdb_stdlog, > + "Push back BP for %ld\n", > + GET_LWP (lp->ptid)); > + } > + /* For each lp except the event lp, if there was a trap, > + set the PC to before the trap. */ > + if (DECR_PC_AFTER_BREAK) > + { > + write_pc_pid (read_pc_pid (lp->ptid) - DECR_PC_AFTER_BREAK, > + lp->ptid); > + } > + lp->status = 0; > + } > + return 0; > + } > + > + /* Select one LWP out of those that have events to be the event thread. */ > + > + static void > + select_event_lwp (struct lwp_info **orig_lp, int *status) > + { > + int num_events = 0; > + int random_selector; > + struct lwp_info *event_lp; > + > + /* Give preference to any LWP that is being single-stepped. */ > + event_lp = iterate_over_lwps (select_singlestep_lwp_callback, NULL); > + if (event_lp != NULL) > + { > + if (debug_lin_lwp) > + fprintf_unfiltered (gdb_stdlog, > + "Select singlestep lwp %ld\n", > + GET_LWP (event_lp->ptid)); > + } > + else > + { > + /* No single-stepping LWP. Select one at random, out of those > + which have had SIGTRAP events. */ > + > + /* First see how many SIGTRAP events we have. */ > + iterate_over_lwps (count_events_callback, (void *) &num_events); > + > + /* OK, now randomly pick the Nth LWP of those that have had SIGTRAP. */ > + random_selector = (int) > + ((num_events * (double) rand ()) / (RAND_MAX + 1.0)); > + > + if (debug_lin_lwp) > + { > + if (num_events > 1) > + fprintf_unfiltered (gdb_stdlog, > + "Found %d SIGTRAP events, selecting #%d\n", > + num_events, random_selector); > + else if (num_events <= 0) > + fprintf_unfiltered (gdb_stdlog, > + "ERROR select_event_lwp %d events!\n", > + num_events); > + } > + > + event_lp = iterate_over_lwps (select_event_lwp_callback, > + (void *) &random_selector); > + } > + > + if (event_lp != NULL) > + { > + /* "event_lp" is now the selected event thread. > + If any other threads have had SIGTRAP events, these events > + must now be cancelled, so that the respective thread will > + trip the breakpoint again once it is resumed. */ > + iterate_over_lwps (cancel_breakpoints_callback, (void *) event_lp); > + *orig_lp = event_lp; > + *status = event_lp->status; > + event_lp->status = 0; > + } > + } > + > /* Return non-zero if LP has been resumed. */ > > static int > *************** lin_lwp_wait (ptid_t ptid, struct target > *** 741,757 **** > /* First check if there is a LWP with a wait status pending. */ > if (pid == -1) > { > ! /* Any LWP will do. */ > lp = iterate_over_lwps (status_callback, NULL); > if (lp) > { > ! if (debug_lin_lwp) > fprintf_unfiltered (gdb_stdlog, > ! "Using pending wait status for LWP %ld.\n", > GET_LWP (lp->ptid)); > > - status = lp->status; > - lp->status = 0; > } > > /* But if we don't fine one, we'll have to wait, and check both > --- 888,907 ---- > /* First check if there is a LWP with a wait status pending. */ > if (pid == -1) > { > ! /* Any LWP that's been resumed will do. */ > lp = iterate_over_lwps (status_callback, NULL); > if (lp) > { > ! status = lp->status; > ! lp->status = 0; > ! if (debug_lin_lwp && status) > fprintf_unfiltered (gdb_stdlog, > ! "Using pending wait status %d for LWP %ld.\n", > ! WIFSTOPPED (status) ? WSTOPSIG (status) : > ! WIFSIGNALED (status) ? WTERMSIG (status) : > ! WEXITSTATUS (status), > GET_LWP (lp->ptid)); > > } > > /* But if we don't fine one, we'll have to wait, and check both > *************** lin_lwp_wait (ptid_t ptid, struct target > *** 772,782 **** > status = lp->status; > lp->status = 0; > > ! if (debug_lin_lwp) > ! if (status) > ! fprintf_unfiltered (gdb_stdlog, > ! "Using pending wait status for LWP %ld.\n", > ! GET_LWP (lp->ptid)); > > /* If we have to wait, take into account whether PID is a cloned > process or not. And we have to convert it to something that > --- 922,934 ---- > status = lp->status; > lp->status = 0; > > ! if (debug_lin_lwp && status) > ! fprintf_unfiltered (gdb_stdlog, > ! "Using pending wait status %d for LWP %ld.\n", > ! WIFSTOPPED (status) ? WSTOPSIG (status) : > ! WIFSIGNALED (status) ? WTERMSIG (status) : > ! WEXITSTATUS (status), > ! GET_LWP (lp->ptid)); > > /* If we have to wait, take into account whether PID is a cloned > process or not. And we have to convert it to something that > *************** lin_lwp_wait (ptid_t ptid, struct target > *** 947,952 **** > --- 1099,1111 ---- > /* This LWP is stopped now. */ > lp->stopped = 1; > > + /* MVS: so save its event_status. */ > + lp->status = status; > + if (debug_lin_lwp) > + fprintf_unfiltered (gdb_stdlog, > + "LLW: Candidate event %d in %ld\n", > + WSTOPSIG (status), GET_LWP (lp->ptid)); > + > /* Now stop all other LWP's ... */ > iterate_over_lwps (stop_callback, NULL); > > *************** lin_lwp_wait (ptid_t ptid, struct target > *** 954,964 **** > longer running. */ > iterate_over_lwps (stop_wait_callback, NULL); > > /* If we're not running in "threaded" mode, we'll report the bare > process id. */ > > if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP) > ! trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid))); > else > trap_ptid = null_ptid; > > --- 1113,1135 ---- > longer running. */ > iterate_over_lwps (stop_wait_callback, NULL); > > + /* MVS Now choose an event thread from among those that > + have had events. Giving equal priority to all threads > + that have had events helps prevent starvation. */ > + > + select_event_lwp (&lp, &status); > + > /* If we're not running in "threaded" mode, we'll report the bare > process id. */ > > if (WIFSTOPPED (status) && WSTOPSIG (status) == SIGTRAP) > ! { > ! trap_ptid = (threaded ? lp->ptid : pid_to_ptid (GET_LWP (lp->ptid))); > ! if (debug_lin_lwp) > ! fprintf_unfiltered (gdb_stdlog, > ! "LLW: trap_ptid is %ld\n", > ! GET_LWP (trap_ptid)); > ! } > else > trap_ptid = null_ptid; >