* "finish" command leads to SIGTRAP
From: David Griffiths @ 2019-02-21 11:21 UTC
To: gdb

I have a strange situation where issuing the "finish" command always leads
to a SIGTRAP (this is gdb 8.1 on Ubuntu 16.04). Once this SIGTRAP occurs
every continue also produces SIGTRAP. Completely reproducible. In the run
up to the finish I'm doing single steps from a previous breakpoint:

=====
(gdb) display/i $pc
1: x/i $pc
=> 0x7fffe1923b84: movabs $0x7ffff6d33b00,%r10
(gdb) si
0x00007fffe1923b8e in ?? ()
1: x/i $pc
=> 0x7fffe1923b8e: callq *%r10
(gdb)
0x00007ffff6d33b00 in os::javaTimeMillis() () from /mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
1: x/i $pc
=> 0x7ffff6d33b00 <_ZN2os14javaTimeMillisEv>: push %rbp
(gdb) finish
Run till exit from #0 0x00007ffff6d33b00 in os::javaTimeMillis() () from /mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so

Thread 2 "java" received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff6d33b01 in os::javaTimeMillis() () from /mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
1: x/i $pc
=> 0x7ffff6d33b01 <_ZN2os14javaTimeMillisEv+1>: xor %esi,%esi
(gdb) c
Continuing.

Thread 2 "java" received signal SIGTRAP, Trace/breakpoint trap.
0x00007ffff6d33b03 in os::javaTimeMillis() () from /mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
1: x/i $pc
=> 0x7ffff6d33b03 <_ZN2os14javaTimeMillisEv+3>: mov %rsp,%rbp
=====

Even more strangely I can execute finish on that function in general, e.g.
if I set a breakpoint on it:

=====
(gdb) br os::javaTimeMillis
Breakpoint 1 at 0x7ffff6d33b00
(gdb) c
Continuing.
[Switching to Thread 0x7ffff7fd8700 (LWP 12502)]

Thread 2 "java" hit Breakpoint 1, 0x00007ffff6d33b00 in os::javaTimeMillis() () from /mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
(gdb) finish
Run till exit from #0 0x00007ffff6d33b00 in os::javaTimeMillis() () from /mnt/hgfs/david/jdk8u/build/linux-x86_64-normal-server-release/jdk/lib/amd64/server/libjvm.so
0x00007fffe1b4f75c in ?? ()
(gdb)
=====

So there must be something about the environment when it occurs but I don't
know what. And by the way the code runs fine without the finish/single
steps/etc.

I need it to work because I'm trying to automate something via gdb/MI. Any
suggestions as to how to debug this would be very welcome.

Thanks,

David

--
David Griffiths, Senior Software Engineer
Undo <https://undo.io> | Resolve even the most challenging software defects
with software flight recorder technology
Software reliability report: optimizing the software supplier and customer
relationship
<https://info.undo.io/software-reliability-report-optimizing-supplier-and-customer-relationship>

* Re: "finish" command leads to SIGTRAP
From: Pedro Alves @ 2019-02-21 11:24 UTC
To: David Griffiths, gdb

On 02/21/2019 11:21 AM, David Griffiths wrote:
>
> I need it to work because I'm trying to automate something via gdb/MI. Any
> suggestions as to how to debug this would be very welcome.

Start with "set debug infrun 1".

And then "set debug lin-lwp 1" if debugging natively, or
"set debug remote 1" if using the remote serial protocol.

Thanks,
Pedro Alves

* Re: "finish" command leads to SIGTRAP
From: David Griffiths @ 2019-02-21 12:13 UTC
To: Pedro Alves; +Cc: gdb

Ok thanks, did that. If I compare the output for the bad case with the good
case, this seems to be the main difference:

< infrun: proceed: resuming Thread 0x7ffff7fd8700 (LWP 12901)
< infrun: resume (step=0, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0x7ffff7fd8700 (LWP 12901)] at 0x7ffff6d33b00
< LLR: Preparing to resume Thread 0x7ffff7fd8700 (LWP 12901), 0, inferior_ptid Thread 0x7ffff7fd8700 (LWP 12901)
< LLR: PTRACE_CONT Thread 0x7ffff7fd8700 (LWP 12901), 0 (resume event thread)
---
> infrun: step-over queue now empty
> infrun: resuming [Thread 0x7ffff7fd8700 (LWP 12901)] for step-over
> infrun: resume (step=1, signal=GDB_SIGNAL_0), trap_expected=1, current thread [Thread 0x7ffff7fd8700 (LWP 12901)] at 0x7ffff6d33b00
> LLR: Preparing to step Thread 0x7ffff7fd8700 (LWP 12901), 0, inferior_ptid Thread 0x7ffff7fd8700 (LWP 12901)
> LLR: PTRACE_SINGLESTEP Thread 0x7ffff7fd8700 (LWP 12901), 0 (resume event thread)
10a11
> infrun: proceed: [Thread 0x7ffff7fd8700 (LWP 12901)] resumed
27c28,60
< infrun: random signal (GDB_SIGNAL_TRAP)
---
> infrun: no stepping, continue
> infrun: resume (step=0, signal=GDB_SIGNAL_0), trap_expected=0, current thread [Thread 0x7ffff7fd8700 (LWP 12901)] at 0x7ffff6d33b01

Cheers,

David

* Re: "finish" command leads to SIGTRAP
From: David Griffiths @ 2019-02-21 12:17 UTC
To: Pedro Alves; +Cc: gdb

Oh, I should add a bit extra to the end because in the good case it is also
doing the PTRACE_CONT:

> LLR: Preparing to resume Thread 0x7ffff7fd8700 (LWP 12901), 0, inferior_ptid Thread 0x7ffff7fd8700 (LWP 12901)
> LLR: PTRACE_CONT Thread 0x7ffff7fd8700 (LWP 12901), 0 (resume event thread)
> infrun: prepare_to_wait
> linux_nat_wait: [process -1], [TARGET_WNOHANG]
> RSRL: NOT resuming LWP Thread 0x7ffff7fd8700 (LWP 12901), not stopped
> LLW: enter
> LNW: waitpid(-1, ...) returned 0, ERRNO-OK
> RSRL: NOT resuming LWP Thread 0x7ffff7fd8700 (LWP 12901), not stopped
> LLW: exit (ignore)

etc

* Re: "finish" command leads to SIGTRAP
From: Pedro Alves @ 2019-02-21 13:12 UTC
To: David Griffiths; +Cc: gdb

Might be unrelated, but ISTR that there used to be a kernel bug
that would lead to the cpu's trace flag getting stuck set
when you step in a signal handler. That would result in
SIGTRAP happening at every step from that point on. Could
that be the case here?

I'd look at "set debug displaced on" too. Otherwise, it's a matter
of staring at the logs, and trying to understand what is happening.
Basically, "finish" sets a breakpoint at the caller and runs to it.
But all sorts of other things can happen behind the scenes.

Thanks,
Pedro Alves

* Re: "finish" command leads to SIGTRAP
From: David Griffiths @ 2019-02-21 15:55 UTC
To: Pedro Alves; +Cc: gdb

It's something to do with the nature of single stepping through a "popfq"
instruction. Given the following instructions:

   0x7fffe104638f: add $0x8,%rsp
   0x7fffe1046393: popfq
   0x7fffe1046394: pop %rbp
   0x7fffe1046395: jmpq *%rax

If I set a breakpoint at the first of that set and single step through, I
end up with:

eflags 0x346 [ PF ZF TF IF ]

but if I set a breakpoint on the last instruction and avoid single stepping
I get:

eflags 0x246 [ PF ZF IF ]

and I think it's that TF that is causing the SIGTRAP?

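For readers following along, the two eflags values above differ only in bit 8, the trap flag. The following is a minimal sketch that decodes just the four flags shown in the gdb output; the bit positions are the architectural x86 ones, while `format_eflags` and `trap_flag_set` are hypothetical helper names made up for illustration, not GDB functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Architectural x86 EFLAGS bit positions for the flags seen in the thread. */
#define X86_EFLAGS_PF (UINT32_C(1) << 2)  /* parity flag */
#define X86_EFLAGS_ZF (UINT32_C(1) << 6)  /* zero flag */
#define X86_EFLAGS_TF (UINT32_C(1) << 8)  /* trap (single-step) flag */
#define X86_EFLAGS_IF (UINT32_C(1) << 9)  /* interrupt enable flag */

/* Hypothetical helper: render PF/ZF/TF/IF in the style of gdb's
   "info registers eflags" output.  Other flags are ignored here.
   Returns the snprintf result.  */
int format_eflags(uint32_t eflags, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "[%s%s%s%s ]",
                    (eflags & X86_EFLAGS_PF) ? " PF" : "",
                    (eflags & X86_EFLAGS_ZF) ? " ZF" : "",
                    (eflags & X86_EFLAGS_TF) ? " TF" : "",
                    (eflags & X86_EFLAGS_IF) ? " IF" : "");
}

/* Hypothetical helper: non-zero if the trap flag is set. */
int trap_flag_set(uint32_t eflags)
{
    return (eflags & X86_EFLAGS_TF) != 0;
}
```

Formatting 0x346 and 0x246 with this helper gives the two flag lists quoted in the message, and 0x346 ^ 0x246 is 0x100, i.e. exactly TF.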
* Re: "finish" command leads to SIGTRAP
From: Pedro Alves @ 2019-02-21 17:50 UTC
To: David Griffiths; +Cc: gdb

On 02/21/2019 03:54 PM, David Griffiths wrote:
> It's something to do with the nature of single stepping through a "popfq"
> instruction. Given the following instructions:

I assume you have a pushf somewhere earlier?

>
> 0x7fffe104638f: add $0x8,%rsp
> 0x7fffe1046393: popfq
> 0x7fffe1046394: pop %rbp
> 0x7fffe1046395: jmpq *%rax
>
> If I set a breakpoint at the first of that set and single step through, I
> end up with:
>
> eflags 0x346 [ PF ZF TF IF ]
>
> but if I set a breakpoint on the last instruction and avoid single stepping
> I get:
>
> eflags 0x246 [ PF ZF IF ]
>
> and I think it's that TF that is causing the SIGTRAP?

Same as <https://sourceware.org/bugzilla/show_bug.cgi?id=13508> ?

I can reproduce that here, on Fedora 27 / Linux 4.17.17-100.fc27.x86_64.

Sounds like PTRACE_SINGLESTEP enables TF, which then causes pushf to push
the state with TF set. And then popf restores that TF-enabled state.

I'd think this is a kernel bug, in the same vein as the signal issue
I mentioned below (in which TF would get stuck when you stepped into
a signal handler, or something like that). The kernel could have special
handling for pushf, emulating it instead of actually single-stepping it?

Maybe newer Linux kernels do something else. Haven't tried.

I wonder what other kernels, like e.g., FreeBSD do here?

Guess if GDB is to work around this, it'll have to either add
special treatment for this instruction (emulate, step over with a software
breakpoint, something like that), or clear TF manually after
single-stepping. :-/

Thanks,
Pedro Alves

(please avoid top posting)

>
>
> On Thu, 21 Feb 2019 at 13:12, Pedro Alves <palves@redhat.com> wrote:
>
>> Might be unrelated, but ISTR that there used to be a kernel bug
>> that would lead to the cpu's trace flag getting stuck set
>> when you step in a signal handler. That would result in
>> SIGTRAP happening at every step from that point on. Could
>> that be the case here?
>>
>> I'd look at "set debug displaced on" too. Otherwise, it's a matter
>> at staring at the logs, and trying to understand what is happening.
>> Basically, "finish" sets a breakpoint at the caller and runs to it.
>> But all sorts of other things can happen behind the scenes.
>>

* Re: "finish" command leads to SIGTRAP
From: Pedro Alves @ 2019-02-21 18:03 UTC
To: David Griffiths; +Cc: gdb

On 02/21/2019 05:50 PM, Pedro Alves wrote:
> (...) the signal issue
> I mentioned below (in which TF would get stuck when you stepped into
> a signal handler, or something like that). The kernel could have special
> handling for pushf, emulating it instead of actually single-stepping it?

FYI, the signals + TF kernel bug:

  https://bugzilla.kernel.org/show_bug.cgi?id=16061

Thanks,
Pedro Alves

* Re: "finish" command leads to SIGTRAP
From: David Griffiths @ 2019-02-21 18:22 UTC
To: Pedro Alves; +Cc: gdb

On Thu, 21 Feb 2019 at 17:50, Pedro Alves <palves@redhat.com> wrote:

>
> Same as <https://sourceware.org/bugzilla/show_bug.cgi?id=13508> ?
>

Yes, that's exactly it! I'd just written a simple test also to reproduce:

#include <stdio.h>

int main(void)
{
    /* Save and restore the flags register.  Single-stepping over the
       pushf is what leaves TF set once the popf has run.  */
    __asm__ volatile("pushf\n\t"
                     "popf\n\t"
                     : : : "cc", "memory");
    printf("after popfq\n");
    return 0;
}

> (please avoid top posting)

Sorry, gmail default!

Cheers,

David

* Re: "finish" command leads to SIGTRAP
From: John Baldwin @ 2019-02-21 18:50 UTC
To: Pedro Alves, David Griffiths; +Cc: gdb

On 2/21/19 9:50 AM, Pedro Alves wrote:
> I wonder what other kernels, like e.g., FreeBSD do here?

FreeBSD also fails (and in the last year we had a set of changes to rework
TF handling in the kernel to boot). This doesn't look trivial to solve.
To get the exception you have to have TF set in %rflags/%eflags, but that
means it is set when the pushf writes to the stack. I think what would
have to happen (ugh) is that the kernel needs to recognize that the DB#
fault is due to a pushf instruction and that if the TF was a "shadow" TF
due to ptrace it needs to clear TF from the value written on the stack as
part of the fault handler.

> Guess if GDB is to workaround this, it'll have to either add
> special treatment for this instruction (emulate, step over with a software
> breakpoints, something like that), or clear TF manually after
> single-stepping. :-/

I suspect it will be common for kernels to have this bug because the CPU
will always write a value onto the stack with TF set as part of
executing the instruction. A workaround in GDB would be much like what I
described above with the advantage that GDB actually knows it is stepping a
pushf before it steps it, so it can know to rewrite the value on the
stack after it gets the SIGTRAP for the single step over the pushf.

This may actually be hard for a kernel to get right as at the time of the
fault we don't get anything that says how long the faulting instruction was,
etc. Thus, just looking at the byte before the current eip/rip in a DB#
fault handler for the pushf opcode (I believe it's a single byte) can get
false positives because you might have stepped over a mov instruction with
an immediate whose last byte happens to be the opcode, etc.

--
John Baldwin

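The false positive John describes can be shown with a few bytes of machine code. This is only an illustrative sketch: 0x9c is the one-byte pushf/pushfq opcode and `b8 00 00 00 9c` encodes `mov $0x9c000000,%eax`, but the function names are hypothetical and instruction prefixes (such as 0x66) are deliberately ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OPCODE_PUSHF 0x9c  /* pushf / pushfq, a single-byte opcode */

/* pushfq on its own. */
const uint8_t code_pushf[] = { 0x9c };

/* mov $0x9c000000,%eax: opcode 0xb8 followed by a little-endian imm32
   whose last byte happens to equal the pushf opcode.  */
const uint8_t code_mov_imm[] = { 0xb8, 0x00, 0x00, 0x00, 0x9c };

/* Naive check available to a fault handler that only knows the PC after
   the step: does the byte just before that PC look like pushf?
   pc_offset is the offset of the post-step PC within code[].  */
int prev_byte_looks_like_pushf(const uint8_t *code, size_t pc_offset)
{
    assert(code != NULL && pc_offset >= 1);
    return code[pc_offset - 1] == OPCODE_PUSHF;
}

/* Check available to a debugger that knows where the instruction starts
   before it steps (prefix bytes are not handled in this sketch).  */
int starts_with_pushf(const uint8_t *code, size_t insn_offset)
{
    assert(code != NULL);
    return code[insn_offset] == OPCODE_PUSHF;
}
```

After stepping the five-byte mov, the backward-looking check reports a pushf that was never there, whereas the forward-looking check on the instruction's first byte does not. That asymmetry is the advantage John attributes to doing the workaround in GDB.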
* Re: "finish" command leads to SIGTRAP
From: Pedro Alves @ 2019-02-21 19:34 UTC
To: John Baldwin, David Griffiths; +Cc: gdb

Hi John,

Thanks for stepping in.

On 02/21/2019 06:49 PM, John Baldwin wrote:
> On 2/21/19 9:50 AM, Pedro Alves wrote:

>> I wonder what other kernels, like e.g., FreeBSD do here?
>
> FreeBSD also fails (and in the last year we had a set of changes to rework
> TF handling in the kernel to boot). This doesn't look trivial to solve.
> To get the exception you have to have TF set in %rflags/%eflags, but that
> means it is set when the pushf writes to the stack. I think what would
> have to happen (ugh) is that the kernel needs to recognize that the DB#
> fault is due to a pushf instruction and that if the TF was a "shadow" TF
> due to ptrace it needs to clear TF from the value written on the stack as
> part of the fault handler.
>
>> Guess if GDB is to workaround this, it'll have to either add
>> special treatment for this instruction (emulate, step over with a software
>> breakpoints, something like that), or clear TF manually after
>> single-stepping. :-/
>
> I suspect it will be common for kernels to have this bug because the CPU
> will always write a value onto the stack with TF set as part of
> executing the instruction. A workaround in GDB would be much like what I
> described above with the advantage that GDB actually knows it is stepping a
> pushf before it steps it, so it can know to rewrite the value on the
> stack after it gets the SIGTRAP for the single step over the pushf.
>
> This may actually be hard for a kernel to get right as at the time of the
> fault we don't get anything that says how long the faulting instruction was,
> etc. Thus, just looking at the byte before the current eip/rip in a DB#
> fault handler for the pushf opcode (I believe it's a single byte) can get
> false positives because you might have stepped over a mov instruction with
> an immediate whose last byte happens to be the opcode, etc.

I can think of other workarounds potentially possible:

#1 - emulate the instruction: i.e., if you know you're stepping a
     pushf instruction, you could instead push the flags state on the
     stack yourself manually, advance the PC, and then raise a
     fake trap. Could be done by the kernel, or gdb. Fixing it on
     the kernel side should be more efficient, and fixes it for
     all debuggers. While fixing it on the debugger side fixes
     it for all kernels...

#2 - if you know you're stepping a pushf instruction, set a breakpoint
     after it, and PTRACE_CONT instead of stepping. (that's the software
     single-step workaround mentioned earlier).

#3 - have gdb always clear TF after a single-step. This is the
     easiest, even if the "less technically cool" solution. This
     would mean that it'd be impossible to debug a program that
     sets the trace flag manually. I actually once co-wrote
     an in-process x86 debug stub, and in that use case
     preserving TF mattered, made it possible to debug that
     stub... Quite a niche use case, though, and it'd have been
     trivial for me to hack gdb for that special use case, of course.

In order for GDB to know whether it is stepping a pushf instruction,
it needs to read the memory at PC, which has a cost, but maybe it's
negligible if we already end up reading memory anyway (because of the
code cache), but I'm not sure we already do. This can have a more
noticeable effect with remote debugging (which should weigh on whether
to do the workaround at the infrun.c level, or in the target backend (thus
in gdbserver when remote)).

Solution #3 would require extra ptrace commands anyway (read-modify-write
the flags), so it may end up being less performant, if #1 and #2 already
hit the code cache.

There are some extra complications around #1 and #2 for gdbserver,
because we need to consider the cases when gdbserver handles
single-stepping without roundtripping to gdb:

 - range-stepping
 - stepping over breakpoints/tracepoints

Thanks,
Pedro Alves

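To make the mechanism and workaround #1 concrete, here is a toy model in C. It is only a sketch under loud assumptions: `struct toy_inferior` and all `toy_*` functions are hypothetical stand-ins for state a debugger would really read and write through ptrace or the remote protocol, the stack is a small array indexed in 8-byte slots, and pushfq/popfq are each modelled as one byte long. None of this is GDB or kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define X86_EFLAGS_TF UINT64_C(0x100)  /* trap flag, bit 8 */
#define TOY_STACK_SLOTS 8

/* Hypothetical, simplified inferior state. */
struct toy_inferior {
    uint64_t rflags;
    uint64_t pc;
    size_t sp;                        /* next free slot is sp - 1 */
    uint64_t stack[TOY_STACK_SLOTS];
};

/* What the CPU does for pushfq: store rflags as-is, advance the PC. */
void toy_cpu_pushfq(struct toy_inferior *inf)
{
    assert(inf != NULL && inf->sp > 0);
    inf->stack[--inf->sp] = inf->rflags;
    inf->pc += 1;
}

/* What the CPU does for popfq: reload rflags from the stack. */
void toy_cpu_popfq(struct toy_inferior *inf)
{
    assert(inf != NULL && inf->sp < TOY_STACK_SLOTS);
    inf->rflags = inf->stack[inf->sp++];
    inf->pc += 1;
}

/* Hardware single-step over pushfq: TF is live in rflags while the
   instruction executes, so it leaks into the saved copy on the stack.
   TF is then removed from the register unless the program itself had
   it set (an assumption of this model).  */
void toy_hw_step_pushfq(struct toy_inferior *inf)
{
    uint64_t user_tf = inf->rflags & X86_EFLAGS_TF;
    inf->rflags |= X86_EFLAGS_TF;
    toy_cpu_pushfq(inf);
    inf->rflags = (inf->rflags & ~X86_EFLAGS_TF) | user_tf;
}

/* Workaround #1: emulate the push without ever setting TF for the
   step, so the saved copy holds only the program's own flags.  A real
   implementation would then report a fake trap to the caller.  */
void toy_emulate_pushfq(struct toy_inferior *inf)
{
    toy_cpu_pushfq(inf);
}
```

Starting from rflags 0x246, `toy_hw_step_pushfq` followed by a free-running `toy_cpu_popfq` leaves rflags at 0x346, the stuck-TF state reported earlier in the thread; `toy_emulate_pushfq` followed by the same popfq leaves 0x246.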
* Re: "finish" command leads to SIGTRAP
From: John Baldwin @ 2019-02-21 20:50 UTC
To: Pedro Alves, David Griffiths; +Cc: gdb

On 2/21/19 11:34 AM, Pedro Alves wrote:
> I can think of other workarounds potentially possible:
>
> #1 - emulate the instruction: i.e., if you know you're stepping a
>      pushf instruction, you could instead push the flags state on the
>      stack yourself manually, advance the PC, and then raise a
>      fake trap. Could be done by the kernel, or gdb. Fixing it on
>      the kernel side should be more efficient, and fixes it for
>      all debuggers. While fixing it on the debugger side fixes
>      it for all kernels...

Actually, yes, PTRACE_SINGLESTEP/PT_STEP can notice the pushf in the
kernel before executing it. That is not too bad then I guess.

> #2 - if you know you're stepping a pushf instruction, set a breakpoint
>      after it, and PTRACE_CONT instead of stepping. (that's the software
>      single-step workaround mentioned earlier).

I prefer that to my suggestion above, and if we chose to do it in GDB my
guess is that #2 is a simpler / smaller patch to implement than #1?

> #3 - have gdb always clear TF after a single-step. This is the
>      easiest, even if the "less technically cool" solution. This
>      would mean that it'd be impossible to debug a program that
>      sets the trace flag manually. I actually once co-wrote
>      an in-process x86 debug stub, and in that use case
>      preserving TF mattered, made it possible to debug that
>      stub... Quite a niche use case, though, and it'd have been
>      trivial for me to hack gdb for that special use case, of course.
>
> In order for GDB to know whether it is stepping a pushf instruction,
> it needs to read the memory at PC, which has a cost, but maybe it's
> negligible if we already end up reading memory anyway (because of the
> code cache), but I'm not sure we already do. This can have a more
> noticeable effect with remote debugging (which should weigh on whether
> to do the workaround at the infrun.c level, or in the target backend (thus
> in gdbserver when remote)).
>
> Solution #3 would require extra ptrace commands anyway (read-modify-write
> the flags), so it may end up being less performant, if #1 and #2 already
> hit the code cache.
>
> There are some extra complications around #1 and #2 for gdbserver,
> because we need to consider the cases when gdbserver handles
> single-stepping without roundtripping to gdb:
>
>  - range-stepping
>  - stepping over breakpoints/tracepoints

Hmmm, I will probably try to fix (or get someone else to fix) FreeBSD's
kernel regardless, probably using the approach in #1. For GDB itself, I
probably have a slight preference for #2 over #1, but I haven't yet worked
with gdbserver, so I'd defer to you on whether #3 is the best solution when
taking gdbserver into account. If the edge case of #3 matters (which might
matter for some other things like some language runtimes that set TF and use
SIGTRAP handlers that motivated FreeBSD's kernel changes last year), we
could perhaps provide a way for targets to override #3 if they know they
don't need it (e.g. a native target under a kernel known to work). Not
sure how that would work over remote (e.g. if you would want gdbserver to
internalize this behavior so that only it deals with it and hides it from
the remote debugger).

--
John Baldwin

* Re: "finish" command leads to SIGTRAP 2019-02-21 20:50 ` John Baldwin @ 2019-02-22 15:09 ` Pedro Alves 2019-02-22 16:42 ` John Baldwin 0 siblings, 1 reply; 15+ messages in thread From: Pedro Alves @ 2019-02-22 15:09 UTC (permalink / raw) To: John Baldwin, David Griffiths; +Cc: gdb On 02/21/2019 08:49 PM, John Baldwin wrote: > On 2/21/19 11:34 AM, Pedro Alves wrote: >> Hi John, >> >> Thanks for stepping in. >> >> On 02/21/2019 06:49 PM, John Baldwin wrote: >>> On 2/21/19 9:50 AM, Pedro Alves wrote: >> >>>> I wonder what other kernels, like e.g., FreeBSD do here? >>> >>> FreeBSD also fails (and in the last year we had a set of changes to rework >>> TF handling in the kernel to boot). This doesn't look trivial to solve. >>> To get the exception you have to have TF set in %rflags/%eflags, but that >>> means it is set when the pushf writes to the stack. I think what would >>> have to happen (ugh) is that the kernel needs to recognize that the DB# >>> fault is due to a pushf instruction and that if the TF was a "shadow" TF >>> due to ptrace it needs to clear TF from the value written on the stack as >>> part of the fault handler. >>> >>>> Guess if GDB is to workaround this, it'll have to either add >>>> special treatment for this instruction (emulate, step over with a software >>>> breakpoints, something like that), or clear TF manually after >>>> single-stepping. :-/ >>> >>> I suspect it will be common for kernels to have this bug because the CPU >>> will always write a value onto the stack with TF set as part of >>> executing the instruction. A workaround in GDB would be much like what I >>> described above with the advantage that GDB actually knows it is stepping a >>> pushf before it steps it, so it can know to rewrite the value on the >>> stack after it gets the SIGTRAP for the single step over the pushf. 
>>> >>> This may actually be hard for a kernel to get right as at the time of the >>> fault we don't get anything that says how long the faulting instruction was, >>> etc. Thus, just looking at the byte before the current eip/rip in a DB# >>> fault handler for the pushf opcode (I believe it's a single byte) can get >>> false positives because you might have stepped over a mov instruction with >>> an immediate whose last byte happens to be the opcode, etc. >> I can think of other workarounds potentially possible: >> >> #1 - emulate the instruction: i.e., if you know you're stepping a >> pushf instruction, you could instead push the flags state on the >> stack yourself manually, advance the PC, and then raise a >> fake trap. Could be done by the kernel, or gdb. Fixing it on >> the kernel side should be more efficient, and fixes it for >> all debuggers. While fixing it on the debugger side fixes >> it for all kernels... > > Actually, yes, the PTRACE_STEP/PT_STEP can notice the pushf before it > executes it in the kernel. That is not too bad then I guess. > >> #2 - if you know you're stepping a pushf instruction, set a breakpoint >> after it, and PTRACE_CONTINUE instead of stepping. (that's the software >> single-step workaround mentioned earlier). > > I prefer that to my suggestion above, and if we chose to do it in GDB my > guess is that #2 is simpler / smaller patch to implement than #1? Not 100% sure, #1 feels simpler in some aspects; #2 feels simpler in others. A detail that I'm thinking of right now, is that when we have a signal to deliver, we better deliver the signal first before emulating the instruction, because we don't know whether the signal will take up to a signal handler (which may siglongjmp and thus skip the pushf). IIRC, there's code in infrun.c to do something like that for other cases, so it shouldn't be too hard. 
#2 avoids this, because PTRACE_CONTINUE would just take us to the signal
handler as usual, but, both in-line and out-of-line stepping must be
considered. To me it feels like the kind of thing that would require
experimentation / prototyping to get a better feel and notice the corner
cases as one digs through the state machine code in infrun.c.

>
>> #3 - have gdb always clear TF after a single-step. This is the
>> easiest, even if the "less technically cool" solution. This
>> would mean that it'd be impossible to debug a program that
>> sets the trace flag manually. I've actually once co-wrote
>> an in-process x86 debug stub, and in that use case
>> preserving TF mattered, made it possible to debug that
>> stub... Quite a niche use case, though, and it'd have been
>> trivial for me for hack gdb for that special use case, of course.
>>
>> In order for GDB to know whether it is stepping a pushf instruction,
>> it needs to read the memory at PC, which has a cost, but maybe it's
>> negligible if we already end up reading memory anyway (because of the
>> code cache), but I'm not sure we already do. This can have a more
>> noticeable effect with remote debugging (which should weigh on whether
>> to do the workaround at the infrun.c level, or in the target backend (thus
>> in gdbserver when remote).
>>
>> Solution #3 would require extra ptrace commands anyway (read-modify-write
>> the flags), so it may end up being less performant, if #1 and #2 already
>> hit the code cache.
>>
>> There are some extra complications around #1 and #2 for gdbserver,
>> because we need to consider the cases when gdbserver handles
>> single-stepping without roundtripping to gdb:
>>
>> - range-stepping
>> - stepping over breakpoints/tracepoints
>
> Hmmm, I will probably try to fix (or get someone else to fix) FreeBSD's
> kernel regardless probably using the approach in #1.
> For GDB itself, I
> probably have a slight preference for #2 over #1, but I haven't yet worked
> with gdbserver, so I'd defer to you on if #3 is the best solution when
> taking gdbserver into account. If the edge case of #3 matters, (which might
> matter for some other things like some language runtimes that set TF and use
> SIGTRAP handlers that motivated FreeBSD's kernel changes last year), we
> could perhaps provide a way for targets to override #3 if they know they
> don't need it (e.g. a native target under a kernel known to work). Not
> sure how that would work over remote (e.g. if you would want gdbserver to
> internalize this behavior so that only it deals with it and hides it from
> the remote debugger).

I'd prefer #1 or #2 over #3. As for gdbserver, the thing is that whatever
solution we implement in gdb isn't going to fix gdbserver; gdbserver
needs fixing as well. gdbserver has its own run control loop that does
single-stepping behind gdb's back. The most common case nowadays is
range-stepping. When you do "next", or "step", as an optimization, gdb
tells gdbserver to single-step as long as the PC is within an address range
(the contiguous address range that corresponds to the current line
that includes PC). gdbserver then continually single-steps, and only
reports back a stop to GDB once the PC leaves the range. This avoids
many roundtrips between gdb and gdbserver. This means that gdbserver
must have some workaround too. For this case alone, we could just
make gdbserver punt and report a stop to gdb if the next instruction is
a pushf (gdb continues stepping itself, which would trigger the workaround).
BUT, that wouldn't address the less frequent case -- tracepoints:
gdbserver needs to step over them without gdb involvement, and needs to
implement while-stepping actions. So here we can't punt to gdb, there
may not even be one connected! So we need a full workaround
in gdbserver.

Thanks,
Pedro Alves
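
For reference, the flag manipulation behind option #3 (read-modify-write of the flags register) and behind the "rewrite the value on the stack" idea quoted above can be sketched as follows. This is only a minimal model, not GDB or kernel code: the `bytearray` stands in for inferior memory, and the helper names `clear_tf` and `scrub_pushed_flags` are made up for illustration. The one hard fact used is that TF is bit 8 (0x100) of EFLAGS/RFLAGS.

```python
import struct

TF_MASK = 1 << 8  # x86 trap flag: bit 8 (0x100) of EFLAGS/RFLAGS


def clear_tf(flags: int) -> int:
    """Option #3 in miniature: return the flags value with TF cleared."""
    return flags & ~TF_MASK


def scrub_pushed_flags(stack: bytearray, offset: int) -> None:
    """Clear TF in the 8-byte little-endian flags image left by pushfq.

    `stack` is a hypothetical stand-in for inferior memory and `offset`
    for the stack pointer after the step; a real debugger would use its
    own memory read/write interface here.
    """
    (pushed,) = struct.unpack_from("<Q", stack, offset)
    struct.pack_into("<Q", stack, offset, clear_tf(pushed))


# Model a single-stepped pushfq: the saved flags image has TF set.
stack = bytearray(16)
struct.pack_into("<Q", stack, 8, 0x246 | TF_MASK)
scrub_pushed_flags(stack, 8)
print(hex(struct.unpack_from("<Q", stack, 8)[0]))  # prints 0x246
```

Clearing only the register copy would leave the TF bit in the saved image, which is the state a later popf restores.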
* Re: "finish" command leads to SIGTRAP
From: John Baldwin @ 2019-02-22 16:42 UTC
To: Pedro Alves, David Griffiths; +Cc: gdb

On 2/22/19 7:09 AM, Pedro Alves wrote:
> On 02/21/2019 08:49 PM, John Baldwin wrote:
>> On 2/21/19 11:34 AM, Pedro Alves wrote:
>>> #3 - have gdb always clear TF after a single-step. This is the
>>> easiest, even if the "less technically cool" solution. This
>>> would mean that it'd be impossible to debug a program that
>>> sets the trace flag manually. I've actually once co-wrote
>>> an in-process x86 debug stub, and in that use case
>>> preserving TF mattered, made it possible to debug that
>>> stub... Quite a niche use case, though, and it'd have been
>>> trivial for me for hack gdb for that special use case, of course.
>>>
>>> In order for GDB to know whether it is stepping a pushf instruction,
>>> it needs to read the memory at PC, which has a cost, but maybe it's
>>> negligible if we already end up reading memory anyway (because of the
>>> code cache), but I'm not sure we already do. This can have a more
>>> noticeable effect with remote debugging (which should weigh on whether
>>> to do the workaround at the infrun.c level, or in the target backend (thus
>>> in gdbserver when remote).
>>>
>>> Solution #3 would require extra ptrace commands anyway (read-modify-write
>>> the flags), so it may end up being less performant, if #1 and #2 already
>>> hit the code cache.
>>>
>>> There are some extra complications around #1 and #2 for gdbserver,
>>> because we need to consider the cases when gdbserver handles
>>> single-stepping without roundtripping to gdb:
>>>
>>> - range-stepping
>>> - stepping over breakpoints/tracepoints
>>
>> Hmmm, I will probably try to fix (or get someone else to fix) FreeBSD's
>> kernel regardless probably using the approach in #1.
>> For GDB itself, I
>> probably have a slight preference for #2 over #1, but I haven't yet worked
>> with gdbserver, so I'd defer to you on if #3 is the best solution when
>> taking gdbserver into account. If the edge case of #3 matters, (which might
>> matter for some other things like some language runtimes that set TF and use
>> SIGTRAP handlers that motivated FreeBSD's kernel changes last year), we
>> could perhaps provide a way for targets to override #3 if they know they
>> don't need it (e.g. a native target under a kernel known to work). Not
>> sure how that would work over remote (e.g. if you would want gdbserver to
>> internalize this behavior so that only it deals with it and hides it from
>> the remote debugger).
>
> I'd prefer #1 or #2 over #3. As for gdbserver, the thing is that whatever
> solution we implement in gdb isn't going to fix gdbserver, gdbserver
> needs fixing as well. gdbserver has its own run control loop that does
> single-stepping behind gdb's back. The most common case nowadays is
> range-stepping. When you do "next", or "step", as an optimization, gdb
> tells gdbserver to single-step as long the PC is within an address range
> (the continuous address range that corresponds to the current line
> that includes PC). gdbserver then continually single-steps, and only
> reports back a stop to GDB once the PC leaves the range. This avoids
> many roundtrips between gdb and gdbserver. This means that gdbserver
> must have some workaround too. For this case alone, we could just
> make gdbserver punt and report a stop to gdb if the next instruction is
> a pushf (gdb continues stepping itself, which would trigger the workaround).
> BUT, that wouldn't address the less frequent case -- tracepoints:
> gdbserver needs to step over them without gdb involvement, and needs to
> implement while-stepping actions. So here we can't punt to gdb, there
> may not even be one connected! So we need to a full workaround
> in gdbserver.

I thought of one more issue with #3: clearing TF after each step isn't
necessarily sufficient. The way I reproduced this when I ran the test
program was to si over the pushf, then do a continue. This meant that we
weren't stepping when the popf was executed, and the instruction after
popf then raised a spurious SIGTRAP. At that point, the thread's current
state isn't stepping. One way to handle this, perhaps, would be to
determine specifically that a SIGTRAP was a step and, if you get an
unexpected step trap, resume the thread anyway (possibly clearing TF as
part of the resume). This wouldn't be hard to do in individual native
targets where you have the siginfo for the SIGTRAP. It's harder to do at
a higher layer, I think.

One thing I've wondered about when adding the siginfo parsing for the
FreeBSD native target is that it feels like it would be nicer if a target
could return more fine-grained waitkinds, something like
TARGET_WAITKIND_STEPPED, TARGET_WAITKIND_SW_BREAKPOINT, etc., instead of
requiring the various methods like 'supports_stopped_by_sw_breakpoint'
and 'stopped_by_sw_breakpoint' and assuming that SIGTRAP is a step if the
current thread is stepping and none of the other 'stopped_by_foo' methods
return true. You could maybe still have a fallback for
TARGET_WAITKIND_STOPPED that would use the same heuristics, for targets
that don't parse siginfo, to infer the more detailed stop type. Having
that detail at a higher level would make it easier to recognize spurious
step traps in the core, I think. That's probably too big a change just to
work around this issue, but still a thought I've had for a while.

-- 
John Baldwin
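
The classification John describes (tell a step trap apart from other SIGTRAPs, and treat a step trap on a thread that is not stepping as spurious) might look roughly like the sketch below. Everything here is illustrative: `StopKind` merely stands in for the finer-grained waitkinds proposed above, which do not exist in GDB under these names, and the `si_code` constants are the values usually found in the Linux and FreeBSD signal headers, to be verified per target. Some kernels report software breakpoints with a different `si_code`, so the breakpoint branch in particular is an assumption.

```python
from enum import Enum, auto

# Assumed SIGTRAP si_code values; check the target's <signal.h>.
TRAP_BRKPT = 1
TRAP_TRACE = 2


class StopKind(Enum):
    STEPPED = auto()        # stand-in for the proposed TARGET_WAITKIND_STEPPED
    SW_BREAKPOINT = auto()  # stand-in for TARGET_WAITKIND_SW_BREAKPOINT
    SPURIOUS_STEP = auto()  # step trap on a thread that was not stepping
    STOPPED = auto()        # generic stop, nothing more specific is known


def classify_sigtrap(si_code, thread_is_stepping):
    """Map a SIGTRAP to a stop kind; si_code=None means no siginfo."""
    if si_code is None:
        # Fallback heuristic: assume a step if the thread was stepping.
        return StopKind.STEPPED if thread_is_stepping else StopKind.STOPPED
    if si_code == TRAP_TRACE:
        if thread_is_stepping:
            return StopKind.STEPPED
        return StopKind.SPURIOUS_STEP
    if si_code == TRAP_BRKPT:
        return StopKind.SW_BREAKPOINT
    return StopKind.STOPPED


def should_resume_silently(kind):
    """A spurious step trap is swallowed and the thread resumed."""
    return kind is StopKind.SPURIOUS_STEP


print(classify_sigtrap(TRAP_TRACE, thread_is_stepping=False))
```

With the reproduction above (si over the pushf, then continue), the trap after the popf would arrive as a trace trap while the thread is not stepping, and so would be classified as spurious.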
* Re: "finish" command leads to SIGTRAP
From: David Griffiths @ 2019-02-22 17:38 UTC
To: John Baldwin; +Cc: Pedro Alves, gdb

By the way, I've just been testing my workaround for this (setting a
breakpoint and continuing rather than single-stepping), and the problem
appears to affect both pushfq and popfq. Even after I fixed the pushfq
case the problem still occurred, because TF was set on the popfq even
though the stack value didn't contain TF.

Cheers,

David
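
A rough sketch of the breakpoint-and-continue decision described here, covering both pushfq and popfq, is given below. It assumes 64-bit mode and recognizes only the plain opcodes 0x9C (pushf) and 0x9D (popf), optionally preceded by a 0x66 operand-size prefix and a REX prefix; other prefixes are ignored. The function names and the returned tuples are hypothetical and not taken from any actual patch.

```python
PUSHF, POPF = 0x9C, 0x9D
OPSIZE_PREFIX = 0x66


def flags_insn_length(code: bytes) -> int:
    """Length of a pushf/popf at the start of `code`, or 0 if none.

    Decodes forward from the PC, so bytes inside another instruction
    are never mistaken for the opcode.
    """
    i = 0
    if i < len(code) and code[i] == OPSIZE_PREFIX:
        i += 1
    if i < len(code) and 0x40 <= code[i] <= 0x4F:  # REX prefix (64-bit mode)
        i += 1
    if i < len(code) and code[i] in (PUSHF, POPF):
        return i + 1
    return 0


def plan_resume(pc: int, code: bytes):
    """Pick how to get past the instruction at `pc` (hypothetical policy)."""
    length = flags_insn_length(code)
    if length:
        # Place a breakpoint just past the instruction and continue.
        return ("breakpoint-and-continue", pc + length)
    return ("single-step", None)


print(plan_resume(0x1000, bytes([PUSHF])))
print(plan_resume(0x1000, bytes([POPF])))

# mov $0x9c,%al encodes as B0 9C: its last byte equals the pushf opcode,
# which is the false positive a look-behind check would hit.
mov_al_imm = bytes([0xB0, 0x9C])
print(mov_al_imm[-1] == PUSHF, plan_resume(0x1000, mov_al_imm))
```

Decoding forward from the PC before resuming, rather than inspecting the byte before the PC after the trap, is what sidesteps the mov-immediate false positive mentioned earlier in the thread.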