Date: Tue, 01 Sep 2009 20:48:00 -0000
From: Joel Brobecker
To: gdb@sourceware.org
Subject: More info on PR/9711 (quadratic slowdown for deep stack traces)
Message-ID: <20090901204815.GK4379@adacore.com>

Hello,

I think I understand a little better what's causing the debugger to be
really slow in the testcase provided with the PR. I logged the following
entry in the PR yesterday:

> The slowdown comes from printing the frame corresponding to "main",
> because argv is stored in %rsi.
> The following transcript shows a more isolated way of reproducing
> the issue:
>
> (gdb) set print frame-arguments none
> (gdb) run 5000
> Starting program: /[...]/deep 5000
>
> Program received signal SIGABRT, Aborted.
> 0x00007ffff7a9efb5 in raise () from /lib/libc.so.6
> (gdb) frame 5004
> #5004 0x00000000004005a2 in main (argc=..., argv=...) at deep.c:34
> 34          return visit (&n);
> (gdb) p /x $rsi
> $1 = 0x748b
>
> The "p /x $rsi" takes a long time, while the rest is pretty much
> instantaneous. That's as much as I can do for today, but I will look
> at it some more tomorrow.

I reconstructed the part that takes time:

value_contents_all (...)
  => value_fetch_lazy
       while (VALUE_LVAL (new_val) == lval_register
              && value_lazy (new_val))
         {
           frame = frame_find_by_id (VALUE_FRAME_ID (new_val))
             for (frame = get_current_frame (); ; frame = prev_frame)
               if (frame_id_eq (id, this))
                 return frame
               prev_frame = get_prev_frame (frame)
                 if ([...] && inside_main_func (this_frame))
                   return NULL
                 if ([...] && inside_entry_func (this_frame))
                   return NULL
                 if ([...] && get_frame_pc (this_frame) == 0)
                   return NULL
                 return get_prev_frame_1 (this_frame)
                   if (this_frame->prev_p)
                     return this_frame->prev
               if (!prev_frame)
                 return NULL
               if (get_frame_type (frame) == NORMAL_FRAME
                   && !frame_id_inner ()
                   && !frame_id_inner ())
                 return NULL
           new_val = get_frame_register_value (frame, regnum)
             return frame_unwind_register_value (frame->next, regnum)
               value = frame->unwind->prev_register (frame, [...], regnum)
                 case DWARF2_FRAME_REG_UNSPECIFIED:
                   return frame_unwind_got_register (this_frame,
                                                     regnum, regnum)
                     return value_of_register_lazy (frame, new_regnum)
         }

It might be a little cryptic, but it gives a good overall idea of what
gets called and why.
The first interesting piece of information is that the time is spent
trying to get the contents of our lval_register value
(value_contents_all), and that it corresponds to a while loop. This
loop does two things:

  1. Find the value's frame, using the frame ID. This turns out to be
     a non-trivial operation!
       (a) we start at frame #0
       (b) for every frame:
           + if frame->id does not match:
               -> get the previous frame, which as you can see above
                  involves a lot of checking (main func, entry point,
                  null pc, etc)

  2. Call get_frame_register_value (frame, regnum). This, as far as I
     can tell from code inspection, should return pretty promptly a
     *lazy* register lval for the same register number, but with the
     next frame.

The while loop combined with frame_find_by_id is O(n^2), because they
both walk the entire frame chain. But what makes it really slow, I
think, is that each call to get_prev_frame is really expensive because
of the various checks.

One improvement I'm looking at is the fact that get_prev_frame does
redundant checks when the previous frame has already been computed. So
I'm currently testing the idea of adding:

    if (this_frame->prev_p)
      return this_frame->prev;

(before doing the various checks and then calling get_prev_frame_1).
Does anyone see why this would be a problem? It does not change the
fact that the current algorithm is still O(n^2), but it does make the
frame_find_by_id loop a lot lighter. With the example above and 10_000
iterations, my debug traces showed a reduction from 40+ secs down to
about 13 secs.

There might be some other micro-optimizations that we could do; I
haven't looked further. But to break the quadratic behavior, I suspect
we need to avoid the call to frame_find_by_id, which means that we
would need to have the prev_register routine return the frame in
addition to the value - I think that we would have to be very careful
with what we do with the frame.
Another option that might be more workable is to have the
prev_register routine always return a non-lazy value, going down the
call stack as far as necessary until finally getting the actual
register value.

Any other suggestions, perhaps?

Taking a step back, it does not feel like this is actually a super
common issue. For this to happen, we need a program such that we get a
really deep stack (with 1000 iterations, it takes 1.5 secs before my
change, about 0.5 sec after IIRC), and a register that does not get
saved throughout the entire stack. Since we still end up printing the
correct value at the end, is it worth enhancing the design to handle
this case?

-- 
Joel