On 09/04/2015 05:54 PM, Sandra Loosemore wrote:
> While running GDB tests on nios2-linux-gnu with gdbserver and "target 
> remote", I've been seeing random failures in 
> gdb.threads/non-stop-fair-events.exp.  E.g. in one test run I got
> 
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=6: thread 1 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=6: thread 2 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=6: thread 3 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=7: thread 1 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=10: thread 1 
> broke out of loop (timeout)
> FAIL: gdb.threads/non-stop-fair-events.exp: signal_thread=10: thread 2 
> broke out of loop (timeout)
> 
> and in other test runs I got a different ones.  The pattern seemed to be 
> that sometimes it took an extra long time for the first thread to break 
> out of the loop, but once that happened they would all stop correctly 
> and send the expected replies even though GDB had given up on waiting 
> for the first few already.

Yeah, I've seen this before with a local series that implements
software single-stepping on x86, which I use for debugging software
single-step issues.  I just re-tried it now after rebasing that series
onto current mainline, and I still see the timeouts against gdbserver.

AFAICS, nios2 is a software single-step target that does not implement
displaced stepping either.  I have a patch for this that I never
posted.  See attached.

> 
> I've come up with the attached patch to factor the timeout for the 
> failing tests by the number of threads still running, which seems to 
> take care of the problem.  Does this seem reasonable?

I'd rather avoid doing that unconditionally; it's just 10 threads, and
if the target supports displaced stepping and starvation avoidance in
gdb is working correctly, the test should complete quickly.  It takes a
couple of seconds on my getting-old x86-64 laptop.

> 
> I'm somewhat confused because, in spite of it sometimes taking at least 
> 3 times the normal timeout for the first stop message to appear, the 
> alarm in the test case (which is tied to the normal timeout) was never 
> triggering.  My best theory on that is that the slowness is not in the 
> test case, but rather in gdbserver.  IOW, all the threads are already 
> stopped by the time the alarm would expire, but gdb and gdbserver 
> haven't finished all the notifications and requests to print a stop 
> message for any of the threads yet.  Is that plausible?  Should the 
> timeout for the alarm be factored by the number of threads, too, just to 
> be safe?

Or maybe the alarm did fire, and the SIGALRM never manages to be
processed by gdb and passed down to the inferior.

> 
> I'm also not entirely sure what this test case is supposed to test. 
>  From the original commit message and comments in the .exp file it seems 
> like timeouts were supposed to be a sign of a broken kernel with thread 
> starvation problems, not bugs in gdb or gdbserver.  

On the kernel side, "waitpid(-1, ...)" just walks the task list linearly
looking for the first task that had an event.  Say you have two threads, A
and B, which are constantly hitting events/breakpoints.  If A is quick enough,
"waitpid(-1, ...)" returns the event for thread A over and over, and thread B
is starved.  The Linux backends in both gdb and gdbserver have code in place
that picks an event LWP at random out of all that have had events.  A similar
problem exists as soon as events are queued out of the target backends into
gdb's core run control -- events can end up pending for processing later in
gdb's core data structures too, and so if gdb just picked those events by
walking its own thread list looking for the first thread that has an event
pending, it'd starve some threads.  So again infrun.c has similar
randomization code to avoid starvation (random_pending_event_thread).
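
To make the starvation argument concrete, here is a toy simulation (a
rough sketch, not gdb or kernel code).  It assumes every thread hits a
new event immediately after being serviced, so all threads always have
an event pending.  The names pick_first, pick_random and simulate are
made up for illustration; only the general idea of choosing at random
among pending threads comes from the description above.

```python
import random


def pick_first(pending):
    """Linear walk: report the first thread that has an event pending."""
    for tid, has_event in enumerate(pending):
        if has_event:
            return tid
    return None


def pick_random(pending, rng):
    """Report a thread chosen uniformly among those with an event pending."""
    candidates = [tid for tid, has_event in enumerate(pending) if has_event]
    return rng.choice(candidates) if candidates else None


def simulate(picker, nthreads=2, rounds=1000):
    """Count how often each thread's event gets reported.

    Assumption: a serviced thread hits another event right away, so the
    pending flags never clear.
    """
    pending = [True] * nthreads
    served = [0] * nthreads
    for _ in range(rounds):
        tid = picker(pending)
        served[tid] += 1
    return served


rng = random.Random(0)  # fixed seed so runs are repeatable
first_counts = simulate(pick_first)
random_counts = simulate(lambda pending: pick_random(pending, rng))
print("first-match:", first_counts)
print("random pick:", random_counts)
```

With the first-match picker, thread 0 is reported every round and
thread 1 never is; with the random picker both threads get serviced.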

> But, don't we 
> normally just skip tests that the target doesn't support or can't run 
> properly, rather than report them as FAILs?  And, I don't know how to 
> distinguish timeouts that mean the kernel is broken from timeouts that 
> mean the target is just slow and you need to set a bigger value in the 
> test harness.

Pedro Alves