From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 47619 invoked by alias); 4 Aug 2015 18:07:53 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 47605 invoked by uid 89); 4 Aug 2015 18:07:52 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-1.6 required=5.0 tests=AWL,BAYES_00,KAM_LAZY_DOMAIN_SECURITY,RCVD_IN_DNSWL_LOW autolearn=no version=3.3.2 X-HELO: rock.gnat.com Received: from rock.gnat.com (HELO rock.gnat.com) (205.232.38.15) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with (AES256-SHA encrypted) ESMTPS; Tue, 04 Aug 2015 18:07:49 +0000 Received: from localhost (localhost.localdomain [127.0.0.1]) by filtered-rock.gnat.com (Postfix) with ESMTP id A8AE62863B for ; Tue, 4 Aug 2015 14:07:47 -0400 (EDT) Received: from rock.gnat.com ([127.0.0.1]) by localhost (rock.gnat.com [127.0.0.1]) (amavisd-new, port 10024) with LMTP id bGWP-255FOh0 for ; Tue, 4 Aug 2015 14:07:47 -0400 (EDT) Received: from joel.gnat.com (localhost.localdomain [127.0.0.1]) by rock.gnat.com (Postfix) with ESMTP id 716A1285C3 for ; Tue, 4 Aug 2015 14:07:47 -0400 (EDT) Received: by joel.gnat.com (Postfix, from userid 1000) id D2B58472E5; Tue, 4 Aug 2015 11:07:45 -0700 (PDT) Date: Tue, 04 Aug 2015 18:07:00 -0000 From: Joel Brobecker To: gdb-patches@sourceware.org Subject: sig != GDB_SIGNAL_0 failed assertion stepping program on GNU/Linux Message-ID: <20150804180745.GA13984@adacore.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline User-Agent: Mutt/1.5.23 (2014-03-12) X-SW-Source: 2015-08/txt/msg00097.txt.bz2 Hello, I have an issue debugging a program which uses threads, and which only reproduces on SuSE 10; I'm not sure whether this is thanks to pure luck, or whether something is specific to that particular distro. The test is also sufficiently racy that it does not happen all the time. Unfortunately, I can't share the code, and attempts at creating another reproducer I can share have failed miserably. I think I understand what's happening, but I am not sure what GDB ought to be doing in that scenario. In our scenario, we are trying to next/step our program after having stopped at the start of its main subprogram: % gdb -q .obj/gprof/main (gdb) start (gdb) n (gdb) step [...]/infrun.c:2391: internal-error: resume: Assertion `sig != GDB_SIGNAL_0' failed. The summary: The issue is related to the use of the _Unwind_DebugHook breakpoint. More precisely, the "next" happens to stop the inferior at an address which is the return address from the _Unwind_DebugHook, and then the "step" operation triggers the _Unwind_DebugHook breakpoint for another thread, before the initial thread had a chance to advance at all. The handling of the _Unwind_DebugHook breakpoint causes GDB to insert a "bp_step_resume" breakpoint at the function's return address again, which happens to be the same as in the previous _Unwind_DebugHook call. So, when GDB tries to step the initial thread again, it finds that it is stepping from a location that has a breakpoint with SIG being GDB_SIGNAL_0. The detailed analysis so far: At the end of the "start" command, we are stopped at the start of function Main in main.adb. (gdb) start Temporary breakpoint 1 at 0x805451e: file /[...]/main.adb, line 57. Starting program: /[...]/main [New Thread 0xb7e5eba0 (LWP 28377)] [New Thread 0xb7c5aba0 (LWP 28378)] [New Thread 0xb7a56ba0 (LWP 28379)] Temporary breakpoint 1, main () at /[...]/main.adb:57 [...] There are 4 threads in total, and we are in the main thread (which is thread 1): (gdb) info thread Id Target Id Frame 4 Thread 0xb7a56ba0 (LWP 28379) 0xffffe410 in __kernel_vsyscall () 3 Thread 0xb7c5aba0 (LWP 28378) 0xffffe410 in __kernel_vsyscall () 2 Thread 0xb7e5eba0 (LWP 28377) 0xffffe410 in __kernel_vsyscall () * 1 Thread 0xb7ea18c0 (LWP 28370) main () at /glinos.a/users/brobecke/ex/O701-042/Pre_Post_Sub/main.adb:57 All the logs below reference Thread ID/LWP, but I think it'll be easier to talk about the the thread by thread number. For instance, thread 1 is LWP 28370 while thread 3 is LWP 28378. So, I will translate in my explanations the LWPs into thread numbers. Back to our program. At this point, we attempt a "next" (from thread 1), and here is what happens: (gdb) n infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379)) infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378)) infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377)) infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370)) infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT) infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x805451e infrun: target_wait (-1.0.0, status) = infrun: 28370.28370.0 [Thread 0xb7ea18c0 (LWP 28370)], infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP infrun: TARGET_WAITKIND_STOPPED infrun: stop_pc = 0x8054523 We've resumed thread 1 (LWP 28370), and received in return a signal that the same thread stopped slightly further. It's still in the range of instructions for the line of source we started the "next" from, as evidenced by the following trace... infrun: stepping inside range [0x805451e-0x8054531] ... and thus, we decide to continue stepping the same thread: infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523 infrun: prepare_to_wait That's when we get an event from a different thread (thread 3): infrun: target_wait (-1.0.0, status) = infrun: 28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)], infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP infrun: TARGET_WAITKIND_STOPPED infrun: stop_pc = 0x80782d0 infrun: context switch infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378) ... which we find to be at the address where we set a breakpoint on "the unwinder debug hook" (namely "_Unwind_DebugHook"). That's why GDB reports for this event that this is... infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME ... and anlyzing the arguments passed to this function, it finds that the return address is 0x80542a2. That's what the next trace means, when it says: infrun: exception resume at 80542a2 The trace above also indicates that GDB created an internal "bp_step_resume" breakpoint at that address. The address seems innocent right now, but it'll become important later on, during the step. And now that the event has been analyzed and we've determmined that we stopped at an exception breakpoint, GDB needs to step thread 3 past the breakpoint it just hit. Thus, it temporarily disables the exception breakpoint, and requests a step of that thread: infrun: skipping breakpoint: stepping past insn at: 0x80782d0 infrun: skipping breakpoint: stepping past insn at: 0x80782d0 infrun: skipping breakpoint: stepping past insn at: 0x80782d0 infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80782d0 infrun: prepare_to_wait We then get a notification, still from thread 3, that it's now past that breakpoint... infrun: prepare_to_wait infrun: target_wait (-1.0.0, status) = infrun: 28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)], infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP infrun: TARGET_WAITKIND_STOPPED infrun: stop_pc = 0x8078424 ... so we can resume what we were doing before, which is single-stepping thread 1 until we get to a new line of code: infrun: switching back to stepped thread infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7ea18c0 (LWP 28370) infrun: expected thread still hasn't advanced infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7ea18c0 (LWP 28370)] at 0x8054523 The "resume" log above shows that we're resuming thread 1 from where we left off (0x8054523). We get one more stop at 0x8054529, which is still inside our stepping range so we go again. That's when we get the following event, from thread 3: infrun: prepare_to_wait infrun: target_wait (-1.0.0, status) = infrun: 28370.28378.0 [Thread 0xb7c5aba0 (LWP 28378)], infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP infrun: TARGET_WAITKIND_STOPPED infrun: stop_pc = 0x80542a2 Now the stop_pc adddres is interesting, because it's the address of "exception resume" breakpoint. When GDB sees this, it knows infrun: context switch infrun: Switching context from Thread 0xb7ea18c0 (LWP 28370) to Thread 0xb7c5aba0 (LWP 28378) infrun: BPSTAT_WHAT_CLEAR_LONGJMP_RESUME ... and since the location is at a different line of code, this is where it decide the "next" operation should stop: infrun: stop_waiting [Switching to Thread 0xb7c5aba0 (LWP 28378)] 0x080542a2 in inte_tache_rt.ttache_rt ( <_task>=0x80968ec ) at /[...]/inte_tache_rt.adb:54 54 end loop; So far, things seem to be all normal. But the important element to consider, here, is the fact that we stopped at an address that also happens to be the address of the "exception resume" breakpoint. Now that the plot has been set, we can try to "step", and see what happens when things do not go well. First, GDB resumes the current thread, which is now thread 3: (gdb) step infrun: clear_proceed_status_thread (Thread 0xb7a56ba0 (LWP 28379)) infrun: clear_proceed_status_thread (Thread 0xb7c5aba0 (LWP 28378)) infrun: clear_proceed_status_thread (Thread 0xb7e5eba0 (LWP 28377)) infrun: clear_proceed_status_thread (Thread 0xb7ea18c0 (LWP 28370)) infrun: proceed (addr=0xffffffff, signal=GDB_SIGNAL_DEFAULT) infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80542a2 Enter thread 4, whom we haven't heard of yet, who now triggers our exception breakpoint at _Unwind_DebugHook... infrun: target_wait (-1.0.0, status) = infrun: 28370.28379.0 [Thread 0xb7a56ba0 (LWP 28379)], infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP infrun: TARGET_WAITKIND_STOPPED infrun: stop_pc = 0x80782d0 infrun: context switch infrun: Switching context from Thread 0xb7c5aba0 (LWP 28378) to Thread 0xb7a56ba0 (LWP 28379) infrun: BPSTAT_WHAT_SET_LONGJMP_RESUME ... and once more, GDB finds that the "exception resume" breakpoint should be set at 0x80542a2: infrun: exception resume at 80542a2 Same dance as before, we need to single-step that thread out of the breakpoint, so: infrun: skipping breakpoint: stepping past insn at: 0x80782d0 infrun: skipping breakpoint: stepping past insn at: 0x80782d0 infrun: skipping breakpoint: stepping past insn at: 0x80782d0 infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0xb7a56ba0 (LWP 28379)] at 0x80782d0 infrun: prepare_to_wait infrun: target_wait (-1.0.0, status) = infrun: 28370.28379.0 [Thread 0xb7a56ba0 (LWP 28379)], infrun: status->kind = stopped, signal = GDB_SIGNAL_TRAP infrun: TARGET_WAITKIND_STOPPED infrun: stop_pc = 0x8078424 ... at which point we get back to where we left off again, which is stepping thread 3 because it hasn't "moved" yet: infrun: switching back to stepped thread infrun: Switching context from Thread 0xb7a56ba0 (LWP 28379) to Thread 0xb7c5aba0 (LWP 28378) infrun: expected thread still hasn't advanced infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0xb7c5aba0 (LWP 28378)] at 0x80542a2 The error happens during the resume, because we're trying to step from a location which has a breakpoint... if (execution_direction != EXEC_REVERSE && step && breakpoint_inserted_here_p (aspace, pc)) ... and the code has a very long explanation of why this can only happen if we have a signal to deliver (hence the assertion). Note that the "exception resume" breakpoint has an additional condition that limits its effectiveness to thread 4 only, which is the thread that hit the _Unwind_DebugHook breakpoint. But, at the target level, breakpoints apply to all thread, and thread-specific handling is done by GDB, by automatically restarting a program when a non-matching thread hits the thread-specific breakpoint. I am not really sure what to think of that, whether the assertion is just too strict, or whether we need special handling of that situation. In particular, I'm wondering whether the code in the "if" requires the assertion to be true in order to be correct. It does: tp->stepped_breakpoint = 1; /* Most targets can step a breakpoint instruction, thus executing it normally. But if this one cannot, just continue and we will hit it anyway. */ if (gdbarch_cannot_step_breakpoint (gdbarch)) step = 0; Just for kicks, I tried commenting out the assertion, and after a couple of tries, I got at the end of the step... (gdb) step /[...]/infrun.c:4865: internal-error: process_event_stop_test: Assertion `ecs->event_thread->control.exception_resume_breakpoint != NULL' failed. Any suggestion? Maybe catch this specific case where we're trying to single-step a thread from the exception_resume_breakpoint by adding some code that would single-stepp the one thread alone out of that address, before resuming a more normal single-step operation? Thanks! -- Joel