From: Simon Marchi
Subject: Debugging return.exp on ARM
CC: Yao Qi
Date: Thu, 26 May 2016 15:15:00 -0000
Message-ID: <574712FC.5090409@ericsson.com>

Hi everyone,

In an attempt to fix flaky tests on ARM, I started looking at gdb.base/return.exp. The last test, which tests the "return" command on a function that returns a double, fails randomly on our ODroid XU-4 board. We have another board, a Firefly RK3288, which fails the same way (and even more frequently). I have the feeling that there's a race somewhere in the kernel/cache/memory/something.

I isolated a minimal reproducer from the test case, which goes like this:

    double func3 ()
    {
      return -5.0;
    }

    double tmp3;

    int main ()
    {
      tmp3 = func3 ();
      return 0;
    }

Built with:

    $ arm-linux-gnueabihf-gcc -g3 -O0 return.c -o return

And here is the gdb script to run:

    file ~/return
    b func3
    run
    return 2.0
    n
    print tmp3
    quit tmp3 != 2

I simply run gdb like this:

    $ ./gdb -nx -batch -x run.gdb

What the test does is run to the beginning of func3, then issue the command "return 2.0", which makes the function artificially return with the value 2.0. It then does a "next" to complete the assignment to tmp3, and then prints the value of tmp3. Most of the time, we see the expected value, 2.0. Once in a while, we get 0.

When doing the return, GDB writes 2.0 in the d0 register, which is the place where a return value of type "double" should be (and writes other registers, including pc and sp, to actually pop the stack frame). I added debug traces to confirm that the right value is written in d0 through ptrace by GDB (even in failure cases). So when we resume the thread (when doing the "next" command), it should have the right value in its d0 register.

When doing the next, these are the exact instructions it executes (also confirmed by infrun debug):

    83e4: eeb0 7b40   vmov.f64 d7, d0
    83e8: f241 0330   movw r3, #4144 ; 0x1030
    83ec: f2c0 0301   movt r3, #1
    83f0: ed83 7b00   vstr d7, [r3]

In other words, move d0 to d7 and then store it to tmp3's address (0x11030). I don't see anything that can go wrong with these instructions... if d0 contains the right value at the time the thread is resumed, then tmp3 should contain the right value at the end. However, as I said earlier, we get the wrong value once in a while. So it sounds like either the value somehow didn't make it to the d0 register in time when the thread was resumed, or GDB reads the value of tmp3 before the effect of the vstr is visible...

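As a quick check of the store address quoted above, the movw/movt pair can be recombined by hand: movw sets the low 16 bits of r3 and movt sets the high 16 bits. The snippet below is only an illustrative sketch; the variable names are made up for the example and are not part of the test case.

```shell
# Immediates taken from the disassembly above.
movw_imm=4144   # "movw r3, #4144" (0x1030): low 16 bits of r3
movt_imm=1      # "movt r3, #1": high 16 bits of r3

# Recombine the two halves into the full 32-bit address.
addr=$(( (movt_imm << 16) | movw_imm ))

printf '0x%x\n' "$addr"   # prints 0x11030
```

This matches the 0x11030 used by the vstr, so the store does target tmp3.
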
Given that we give the right input to the kernel, even in the cases that fail, I assume that the problem must be something like wrong cache invalidation or memory barrier/sequencing.

I ran this test in a loop and got these results:

    ODroid XU-4:    263 fails, 737 successes
    Firefly RK3288: 336 fails, 163 successes

First, is anybody able to reproduce the problem on other boards? Then, does anybody have an idea what could cause this?

Thanks!

Simon
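P.S. For anyone who wants to gather similar counts on their own board, a loop along these lines should do. This is a minimal sketch and not necessarily the exact harness used for the numbers above: the count_runs helper is a hypothetical name, and it assumes the gdb script ends with "quit tmp3 != 2", so that gdb exits with a non-zero status whenever the value is wrong.

```shell
# count_runs N CMD [ARGS...]
# Run CMD N times and count every non-zero exit status as a failure.
# (Hypothetical helper, written for illustration only.)
count_runs() {
    n=$1
    shift
    fails=0
    passes=0
    i=0
    while [ "$i" -lt "$n" ]; do
        if "$@" >/dev/null 2>&1; then
            passes=$((passes + 1))
        else
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails fails, $passes successes"
}

# Example invocation, assuming the script above is saved as run.gdb:
#   count_runs 1000 ./gdb -nx -batch -x run.gdb
```

Note that any other reason for gdb to exit with an error (for example, a missing binary) would also be counted as a failure, so it is worth checking a single run by hand first.
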