Date: Tue, 01 Sep 2009 20:48:00 -0000
From: Joel Brobecker
To: gdb@sourceware.org
Subject: More info on PR/9711 (quadratic slowdown for deep stack traces)
Message-ID: <20090901204815.GK4379@adacore.com>

Hello,

I think I understand a little better what's causing the debugger to be
really slow in the testcase provided with the PR. I logged the following
entry in the PR yesterday:

> The slowdown comes from printing the frame corresponding to "main",
> because argv is stored in %rsi.
> The following transcript shows a more isolated way of reproducing
> the issue:
>
> (gdb) set print frame-arguments none
> (gdb) run 5000
> Starting program: /[...]/deep 5000
>
> Program received signal SIGABRT, Aborted.
> 0x00007ffff7a9efb5 in raise () from /lib/libc.so.6
> (gdb) frame 5004
> #5004 0x00000000004005a2 in main (argc=..., argv=...) at deep.c:34
> 34          return visit (&n);
> (gdb) p /x $rsi
> $1 = 0x748b
>
> The "p /x $rsi" takes a long time, while the rest is pretty much
> instantaneous. That's as much as I can do for today, but I will look
> at it some more tomorrow.

I reconstructed the part that takes time:

value_contents_all (...)
  => value_fetch_lazy
       while (VALUE_LVAL (new_val) == lval_register
              && value_lazy (new_val))
         {
           frame = frame_find_by_id (VALUE_FRAME_ID (new_val))
             for (frame = get_current_frame (); ; frame = prev_frame)
               if (frame_id_eq (id, this))
                 return frame
               prev_frame = get_prev_frame (frame)
                 if ([...] && inside_main_func (this_frame))
                   return NULL
                 if ([...] && inside_entry_func (this_frame))
                   return NULL
                 if ([...] && get_frame_pc (this_frame) == 0)
                   return NULL
                 return get_prev_frame_1 (this_frame)
                   if (this_frame->prev_p)
                     return this_frame->prev
               if (!prev_frame)
                 return NULL
               if (get_frame_type (frame) == NORMAL_FRAME
                   && !frame_id_inner ()
                   && !frame_id_inner ())
                 return NULL
           new_val = get_frame_register_value (frame, regnum)
             return frame_unwind_register_value (frame->next, regnum)
               value = frame->unwind->prev_register (frame, [...], regnum)
                 case DWARF2_FRAME_REG_UNSPECIFIED:
                   return frame_unwind_got_register (this_frame,
                                                     regnum, regnum)
                     return value_of_register_lazy (frame, new_regnum)
         }

It might be a little cryptic, but it gives a good overall idea of what
gets called and why.
The first interesting piece of information is that the time is spent
trying to get the contents of our lval_register value
(value_contents_all), and that it corresponds to a while loop. This
loop does two things:

  1. Find the value's frame, using the frame ID. This turns out to be
     a non-trivial operation!
       (a) we start at frame #0
       (b) for every frame:
           + if frame->id does not match:
               -> get the previous frame, which as you can see above
                  involves a lot of checking (main func, entry point,
                  null pc, etc)

  2. Call get_frame_register_value (frame, regnum). This, as far as I
     can tell from code inspection, should return pretty promptly a
     *lazy* register lval for the same register number, but with the
     next frame.

The while loop combined with frame_find_by_id is O(n^2), because they
both walk the entire frame chain. But what makes it really slow, I
think, is that each call to get_prev_frame is really expensive because
of the various checks.

One improvement I'm looking at is the fact that get_prev_frame does
redundant checks when the previous frame has already been computed. So
I'm currently testing the idea of adding:

    if (this_frame->prev_p)
      return this_frame->prev;

(before doing the various checks and then calling get_prev_frame_1).
Does anyone see why this would be a problem? It does not change the
fact that the current algorithm is still O(n^2), but it does make the
frame_find_by_id loop a lot lighter. With the example above and 10_000
iterations, my debug traces showed a reduction from 40+ secs down to
about 13 secs.

There might be some other micro-optimizations that we could do; I
haven't looked further. But to break the quadratic behavior, I suspect
we need to avoid the call to frame_find_by_id, which means that we
would need to have the prev_register routine return the frame in
addition to the value - I think that we would have to be very careful
with what we do with the frame.
Another option that might be more workable is to have the
prev_register routine always return a non-lazy value, going down the
call stack as far as necessary until finally getting the actual
register value.

Any other suggestions, perhaps?

Taking a step back, it does not feel like this is actually a super
common issue. For this to happen, we need a program such that we get a
really deep stack (with 1000 iterations, it takes 1.5 secs before my
change, about 0.5 sec after IIRC), and a register that does not get
saved throughout the entire stack. Since we still end up printing the
correct value at the end, is it worth enhancing the design to handle
this case?

-- 
Joel