* Debugging return.exp on ARM
@ 2016-05-26 15:15 Simon Marchi
2016-05-26 19:11 ` Pedro Alves
2016-06-01 15:17 ` Yao Qi
From: Simon Marchi @ 2016-05-26 15:15 UTC
To: gdb; +Cc: Yao Qi
Hi everyone,
In an attempt to fix flaky tests on ARM, I started looking at gdb.base/return.exp.
The last test, which tests the "return" command on a function that returns a double,
fails randomly on our ODroid XU-4 board. We have another board, a Firefly RK3288,
which fails the same way (and even more frequently). I have the feeling that there's
a race somewhere in the kernel/cache/memory/something.
I isolated a minimal reproducer from the test case; it goes like this:
double
func3 ()
{
  return -5.0;
}

double tmp3;

int main ()
{
  tmp3 = func3 ();
  return 0;
}
Built with:
$ arm-linux-gnueabihf-gcc -g3 -O0 return.c -o return
And here is the gdb script to run:
file ~/return
b func3
run
return 2.0
n
print tmp3
quit tmp3 != 2
I simply run gdb like this:
$ ./gdb -nx -batch -x run.gdb
What the test does is run to the beginning of func3, then issues the command
"return 2.0", which makes the function artificially return with the value 2.0.
It then does a "next" to complete the assignment to tmp3, and then prints the
value of tmp3. Most of the time, we see the expected value, 2.0. Once in a
while, we get 0.
When doing the return, GDB writes 2.0 to the d0 register, which is where
a return value of type "double" belongs (it also writes other registers, including pc and
sp, to actually pop the stack frame). I added debug traces to confirm that GDB writes the
right value to d0 through ptrace (even in failure cases). So when we
resume the thread (when doing the "next" command), it should have the right value in
its d0 register. When doing the next, these are the exact instructions it executes (also
confirmed by infrun debug):
83e4: eeb0 7b40 vmov.f64 d7, d0
83e8: f241 0330 movw r3, #4144 ; 0x1030
83ec: f2c0 0301 movt r3, #1
83f0: ed83 7b00 vstr d7, [r3]
In other words, move d0 to d7 and then store it to tmp3's address (0x11030). I
don't see anything that can go wrong with these instructions... if d0 contains
the right value at the time the thread is resumed, then tmp3 should contain the
right value at the end. However, as I said earlier, we get the wrong value once
in a while. So it sounds like either the value somehow didn't make it to the d0
register in time when the thread was resumed, or GDB reads the value of tmp3 before
the effect of the vstr is visible...
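The address arithmetic in the movw/movt pair can be checked with a small C sketch. The helper names below are made up for illustration; they only model the architectural effect of the two instructions (movw writes the low 16 bits and zeroes the upper half, movt replaces the upper 16 bits and keeps the low half).

```c
#include <assert.h>
#include <stdint.h>

/* Models "movw rX, #imm16": low half = imm16, upper half cleared. */
uint32_t movw(uint16_t imm16)
{
    return (uint32_t)imm16;
}

/* Models "movt rX, #imm16": upper half = imm16, low half preserved. */
uint32_t movt(uint32_t reg, uint16_t imm16)
{
    return (reg & 0xffffu) | ((uint32_t)imm16 << 16);
}

/* Replays the two instructions from the disassembly to rebuild r3. */
uint32_t tmp3_address(void)
{
    uint32_t r3 = movw(4144);   /* movw r3, #4144 ; 0x1030 */
    assert(r3 == 0x1030);       /* sanity check of the intermediate value */
    r3 = movt(r3, 1);           /* movt r3, #1 */
    return r3;
}
```

The result matches the address of tmp3 quoted above, so the vstr does target the variable that GDB later prints.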
Given that we give the right input to the kernel, even in the cases that
fail, I assume that the problem must be something like wrong cache invalidation
or memory barrier/sequencing.
I ran this test in a loop and got these results:
ODroid XU-4:
  263 fails
  737 successes
Firefly RK3288:
  336 fails
  163 successes
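For anyone who wants to repeat the measurement, here is a minimal sketch of such a loop in C. It relies on the script ending with "quit tmp3 != 2", which makes gdb exit with a non-zero status when tmp3 holds the wrong value. The function name count_failures is hypothetical, and the sketch assumes a POSIX system where system() runs the command through the shell.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/* Run `cmd` through the shell `runs` times and return how many runs
   did not terminate normally with exit status 0. */
int count_failures(const char *cmd, int runs)
{
    assert(cmd != NULL && runs >= 0);

    int failures = 0;
    for (int i = 0; i < runs; i++) {
        int status = system(cmd);
        if (status == -1 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}
```

For example, count_failures("./gdb -nx -batch -x run.gdb", 1000) would run the reproducer 1000 times. With the counts above, the failure rate is 263/1000 = 26.3% on the ODroid and 336/499, about 67.3%, on the Firefly.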
First, is anybody able to reproduce the problem on other boards? Then, does anybody
have an idea what could cause this?
Thanks!
Simon
* Re: Debugging return.exp on ARM
From: Pedro Alves @ 2016-05-26 19:11 UTC
To: Simon Marchi, gdb; +Cc: Yao Qi
On 05/26/2016 04:15 PM, Simon Marchi wrote:
> Given that we give the right input to the kernel, even in the cases that
> fail, I assume that the problem must be something like wrong cache invalidation
> or memory barrier/sequencing.
>
> I ran this test in a loop and got these results:
>
> ODroid XU-4:
> 263 fails
> 737 successes
>
> Firefly RK3288:
> 336 fails
> 163 successes
>
> First, is anybody able to reproduce the problem on other boards? Then, does anybody
> have an idea what could cause this?
- I'd suspect something odd with caches / barriers too.
Did you try sprinkling in memory barrier instructions, and
see whether it makes a difference?
- I'd also try "si" + "info regs" instead of "next" after the return,
and see if a register with a bad value pops up always at some
specific instruction.
- I'd try to see if pinning the thread to a core makes a difference.
- Might help to show the kernel version.
Thanks,
Pedro Alves
* Re: Debugging return.exp on ARM
From: Simon Marchi @ 2016-05-27 13:35 UTC
To: Pedro Alves, gdb; +Cc: Yao Qi
On 16-05-26 03:11 PM, Pedro Alves wrote:
Thanks for the suggestions.
> - I'd suspect something odd with caches / barriers too.
> Did you try sprinkling in memory barrier instructions, and
> see whether it makes a difference?
I tried putting dmb instructions in various places; it didn't help.
> - I'd also try "si" + "info regs" instead of "next" after the return,
> and see if a register with a bad value pops up always at some
> specific instruction.
Good point.
If I replace next with si, only the vmov.f64 d7, d0 gets executed. So if everything
goes well, I should have the "right" value in both d0 and d7. I made a more
focused reproducer, see below.
> - I'd try to see if pinning the thread to a core makes a difference.
Indeed, pinning GDB to a single CPU makes it work (as in, the result is right) every time.
As far as I can tell, pinning the inferior has no effect (I am not sure it worked, but I
used "set exec-wrapper taskset 0xffffffff" to reset the affinity).
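For reference, here is a minimal sketch of what pinning to one CPU looks like at the system call level on Linux with glibc, roughly what taskset does before it executes the target program. The three function names are hypothetical helpers; sched_setaffinity, sched_getaffinity and the CPU_* macros are the glibc interfaces and need _GNU_SOURCE defined before any include.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling process may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Restrict the calling process to a single CPU. Returns 0 on success. */
int pin_self_to_cpu(int cpu)
{
    assert(cpu >= 0 && cpu < CPU_SETSIZE);

    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of CPUs the calling process may currently run on, or -1. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

A wrapper would call pin_self_to_cpu(first_allowed_cpu()) and then exec gdb. The affinity mask is inherited across fork and preserved across execve, so an inferior started by a pinned GDB starts out pinned as well, which is why its affinity has to be reset explicitly to test the two cases separately.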
> - Might help to show the kernel version.
ODroid: Linux odroid 3.10.96+ #5 SMP PREEMPT Thu May 26 15:03:58 EDT 2016 armv7l armv7l armv7l GNU/Linux
Firefly: Linux firefly 3.10.0 #40 SMP PREEMPT Tue Jan 27 16:12:04 CST 2015 armv7l armv7l armv7l GNU/Linux
I also reproduced it on my Raspberry Pi 2, which has:
Linux alarmpi 4.4.8-2-ARCH #1 SMP Tue Apr 26 19:14:58 MDT 2016 armv7l GNU/Linux
So here's another case that reproduces the problem, but without a memory read, so
it isolates the problem a bit more. It verifies whether the thread sees our register
write or not.
test.S:

    .global _start
_start:
    vldr.64 d0, constante
    vldr.64 d1, constante
break_here:
    vcmp.f64 d0, d1
    vmrs APSR_nzcv, fpscr

# Exit code
    moveq r0, #1
    movne r0, #0

# Exit syscall
    mov r7, #1
    svc 0

    .align 8
constante:
    .word 0xc8b43958
    .word 0x40594676
Built with:
$ gcc -g3 -O0 -o test test.S -nostdlib
And the gdb script test.gdb:
file test
b break_here
run
p $d0 = 4.0
c
The test is run with:
$ ./gdb -nx -x test.gdb -batch
The test loads the same constant in d0 and d1. It then does a comparison between
them and exits with 1 (failure) if they are the same, 0 (success) if they are different.
The GDB script breaks at "break_here", tries to change the value of d0 to some other
constant (4.0) and lets the program continue and exit. If our register write succeeded,
the program should exit with 0 (values are different). If our register write failed, the
program will exit with 1 (values are still the same).
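The constant and the exit-code logic can be modelled in C as a rough sketch. It assumes a little-endian layout (the first .word is the low half of the double, as on armv7l) and IEEE 754 binary64 doubles; the function names are made up for illustration and are not part of the test.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build a double from two 32-bit words stored low word first,
   the way the two .word directives lay out "constante". */
double double_from_words(uint32_t low, uint32_t high)
{
    uint64_t bits = ((uint64_t)high << 32) | low;
    double value;

    assert(sizeof(value) == sizeof(bits));
    memcpy(&value, &bits, sizeof(value));
    return value;
}

/* Mirrors the moveq/movne pair in test.S:
   1 when d0 and d1 compare equal, 0 when they differ. */
int exit_code(double d0, double d1)
{
    return d0 == d1 ? 1 : 0;
}
```

With both registers holding the constant, exit_code returns 1; with d0 replaced by 4.0, as the GDB script attempts, it returns 0. A run that exits with 1 therefore means the thread never saw the write to d0.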
The result is that I randomly see both cases, hinting that the race is really between the
register write through ptrace and the kernel restoring the thread's vfp registers. Again,
pinning GDB to a single core seems to hide/bypass the bug.
Simon
* Re: Debugging return.exp on ARM
From: Yao Qi @ 2016-06-01 15:17 UTC
To: Simon Marchi; +Cc: GDB
On Thu, May 26, 2016 at 4:15 PM, Simon Marchi <simon.marchi@ericsson.com> wrote:
>
> First, is anybody able to reproduce the problem on other boards? Then, does anybody
> have an idea what could cause this?
I saw this problem on my arm boards too. I triaged this problem before,
but didn't get anything useful.
--
Yao (齐尧)
* Re: Debugging return.exp on ARM
From: Simon Marchi @ 2016-06-01 15:22 UTC
To: Yao Qi; +Cc: Simon Marchi, GDB
On 2016-06-01 11:17, Yao Qi wrote:
> On Thu, May 26, 2016 at 4:15 PM, Simon Marchi
> <simon.marchi@ericsson.com> wrote:
>>
>> First, is anybody able to reproduce the problem on other boards?
>> Then, does anybody
>> have an idea what could cause this?
>
> I saw this problem on my arm boards too. I triaged this problem
> before, but didn't
> get anything useful.
I brought this problem to the ARM kernel mailing list, and it seems the
problem really was in ptrace:
http://lists.infradead.org/pipermail/linux-arm-kernel/2016-May/431962.html
Does the suggested fix work for you as well?