suggestion for dictionary representation

Mirror of the gdb mailing list
 help / color / mirror / Atom feed

* suggestion for dictionary representation
@ 2002-09-22 19:59 Jim Blandy
  2002-09-22 20:11 ` Daniel Jacobowitz
  0 siblings, 1 reply; 28+ messages in thread
From: Jim Blandy @ 2002-09-22 19:59 UTC (permalink / raw)
  To: david carlton; +Cc: gdb

It seems to me that the `skip list' data structure simultaneously
meets a lot of the criteria that we're currently meeting by having
multiple representations.  Skip lists:
- provide (probabalistic) log n access time
- are low overhead ("They can easily be configured to require an
  average of 1 1/3 pointers per element (or even less).")
- are easy to build incrementally, and stay "balanced" automatically
- are obstack-friendly, since they don't involve realloc (as hash
  tables do)
- are an ordered structure, which would support completion nicely (and,
  by the way, make the `_Z' test for the C++ V3 ABI faster too)
- have a trivial iterator (walking the finest level of links)
- are pretty easy to understand

http://www.cs.umd.edu/~pugh points to a paper describing and analyzing
them.

Using skip lists, there'd be no need to distinguish `expandable' from
non-expandable blocks.  This one structure would scale to handle both
local blocks and the global environment (depending on how we handle
lazy symbol reading --- I'd like a more generic and descriptive term
than "partial symbol tables").

The only remaining special case would be function blocks, in which
parameter symbols must to appear in the order they appear in the
source.  I think it's pretty ugly to abuse the name array this way; it
introduces special cases in dumb places.  This kludge could be removed
by changing the `function' member of `struct block' to a pointer to
something like this:

   struct function_info {
       struct symbol *sym;
       int num_params;
       struct symbol **params;
   };

This would require extra space only for function blocks; non-function
blocks would remain the same size.  And this info would only be
consulted when we actually wanted to iterate over the parameters.
This would clean up a bunch of loops in GDB that currently have to
iterate over all the symbols in a function's block and do a switch on
each symbol's address class to find the arguments.  (And would this
also allow us to remove the arg/other distinction in enum
address_class?  Dunno.)

But if we were to remove function blocks as a special case, there
would only need to be a single structure for representing levels of
the namespace.

(Daniel Berlin pointed me at skip lists a long time ago --- thanks!)

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-22 19:59 suggestion for dictionary representation Jim Blandy
@ 2002-09-22 20:11 ` Daniel Jacobowitz
  2002-09-23 10:38   ` David Carlton
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2002-09-22 20:11 UTC (permalink / raw)
  To: Jim Blandy; +Cc: david carlton, gdb

On Sun, Sep 22, 2002 at 09:44:40PM -0500, Jim Blandy wrote:
> 
> It seems to me that the `skip list' data structure simultaneously
> meets a lot of the criteria that we're currently meeting by having
> multiple representations.  Skip lists:
> - provide (probabalistic) log n access time
> - are low overhead ("They can easily be configured to require an
>   average of 1 1/3 pointers per element (or even less).")
> - are easy to build incrementally, and stay "balanced" automatically
> - are obstack-friendly, since they don't involve realloc (as hash
>   tables do)
> - are an ordered structure, which would support completion nicely (and,
>   by the way, make the `_Z' test for the C++ V3 ABI faster too)
> - have a trivial iterator (walking the finest level of links)
> - are pretty easy to understand
> 
> http://www.cs.umd.edu/~pugh points to a paper describing and analyzing
> them.
> 
> Using skip lists, there'd be no need to distinguish `expandable' from
> non-expandable blocks.  This one structure would scale to handle both
> local blocks and the global environment (depending on how we handle
> lazy symbol reading --- I'd like a more generic and descriptive term
> than "partial symbol tables").

Hmm.  Lots of simplicity/cleanliness benefits, but the real question as
far as I'm concerned is whether the benefit to completion (the _Z thing
is done as we read in symbols right now, so it's a complete non-issue)
outweights going from O(1) to O(probabalistic log n) for symbol lookup.

I suspect it would; having faster completion [I can't really see how to
beat O(n) with the current hash tables, can anyone else?  But I think
it's slower than O(n) right now; I recall it being quadratic...] would
be nice.  O(~ log N) ought to be plenty fast, right?

> The only remaining special case would be function blocks, in which
> parameter symbols must to appear in the order they appear in the
> source.  I think it's pretty ugly to abuse the name array this way; it
> introduces special cases in dumb places.  This kludge could be removed
> by changing the `function' member of `struct block' to a pointer to
> something like this:
> 
>    struct function_info {
>        struct symbol *sym;
>        int num_params;
>        struct symbol **params;
>    };
> 
> This would require extra space only for function blocks; non-function
> blocks would remain the same size.  And this info would only be
> consulted when we actually wanted to iterate over the parameters.
> This would clean up a bunch of loops in GDB that currently have to
> iterate over all the symbols in a function's block and do a switch on
> each symbol's address class to find the arguments.  (And would this
> also allow us to remove the arg/other distinction in enum
> address_class?  Dunno.)
> 
> But if we were to remove function blocks as a special case, there
> would only need to be a single structure for representing levels of
> the namespace.

I'm tempted to whack the block special case for function arguments.  It
may make name lookup a little more complicated but I think it will make
everything clearer.  We could, of course, try this on the branch and
see if we like the results :)

David, what do you think?

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-22 20:11 ` Daniel Jacobowitz
@ 2002-09-23 10:38   ` David Carlton
  2002-09-23 17:34     ` Daniel Berlin
                       ` (3 more replies)
  0 siblings, 4 replies; 28+ messages in thread
From: David Carlton @ 2002-09-23 10:38 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Jim Blandy, gdb

On Sun, 22 Sep 2002 23:10:56 -0400, Daniel Jacobowitz <drow@mvista.com> said:
> On Sun, Sep 22, 2002 at 09:44:40PM -0500, Jim Blandy wrote:

>> It seems to me that the `skip list' data structure simultaneously
>> meets a lot of the criteria that we're currently meeting by having
>> multiple representations.

Interesting.

>> Skip lists:
>> - provide (probabalistic) log n access time
>> - are low overhead ("They can easily be configured to require an
>> average of 1 1/3 pointers per element (or even less).")
>> - are easy to build incrementally, and stay "balanced" automatically
>> - are obstack-friendly, since they don't involve realloc (as hash
>> tables do)

Well, the current use of hash tables knows the size of the hash table
when building it.  But of course that wouldn't be the case if we used
hash tables for global-like environments.

>> - are an ordered structure, which would support completion nicely (and,
>> by the way, make the `_Z' test for the C++ V3 ABI faster too)
>> - have a trivial iterator (walking the finest level of links)
>> - are pretty easy to understand
>> 
>> http://www.cs.umd.edu/~pugh points to a paper describing and analyzing
>> them.
>> 
>> Using skip lists, there'd be no need to distinguish `expandable' from
>> non-expandable blocks.

Personally, I like this: I'm not that big a fan of that distinction.
I can imagine situations where this could really help us when doing
namespace support, or support for its analogues in other programming
languages.

>> This one structure would scale to handle both local blocks and the
>> global environment (depending on how we handle lazy symbol reading
>> --- I'd like a more generic and descriptive term than "partial
>> symbol tables").

I'm not all that worried about having multiple implementations
depending on the needs of the situation.  But there's certainly no
need to go out of your way to have multiple implementations.

Also, when thinking about issues related to lazy symbol reading, I
think it's important to keep minimal symbols in mind, too.  Which, of
course, isn't really an example of lazy symbol reading, but it is
another sort of object that we currently are currently looking at in
lookup_symbol.

> Hmm.  Lots of simplicity/cleanliness benefits, but the real question
> as far as I'm concerned is whether the benefit to completion (the _Z
> thing is done as we read in symbols right now, so it's a complete
> non-issue) outweights going from O(1) to O(probabalistic log n) for
> symbol lookup.

> I suspect it would; having faster completion [I can't really see how
> to beat O(n) with the current hash tables, can anyone else?  But I
> think it's slower than O(n) right now; I recall it being
> quadratic...] would be nice.  O(~ log N) ought to be plenty fast,
> right?

Yeah, really.

I'm also curious about how it would affect the speed of reading in
symbols.  Right now, that should be O(n), where n is the number of
global symbols, right?  If we used expandable hash tables, then I
think it would be amortized O(n) and with the constant factor larger.
(But, I think, not larger in a way that would make a difference.)  I'm
curious about how often the "amortized" bit would lead to strange
hiccups, but I don't think that's a big deal.

But for skip lists, wouldn't it be something like O(n log n)?  If so,
that's an issue we have to consider.

With luck, I should be able to get the global symbol code into good
enough shape that can we can easily write the code both ways and then
just replace one call to dict_create_hashed_exandable with a call to
dict_create_skiplist, to test it.  We'll see.

>> The only remaining special case would be function blocks, in which
>> parameter symbols must to appear in the order they appear in the
>> source.  I think it's pretty ugly to abuse the name array this way; it
>> introduces special cases in dumb places.  This kludge could be removed
>> by changing the `function' member of `struct block' to a pointer to
>> something like this:
>> 
>> struct function_info {
>> struct symbol *sym;
>> int num_params;
>> struct symbol **params;
>> };
>> 
>> This would require extra space only for function blocks; non-function
>> blocks would remain the same size.  And this info would only be
>> consulted when we actually wanted to iterate over the parameters.
>> This would clean up a bunch of loops in GDB that currently have to
>> iterate over all the symbols in a function's block and do a switch on
>> each symbol's address class to find the arguments.  (And would this
>> also allow us to remove the arg/other distinction in enum
>> address_class?  Dunno.)
>> 
>> But if we were to remove function blocks as a special case, there
>> would only need to be a single structure for representing levels of
>> the namespace.

> I'm tempted to whack the block special case for function arguments.  It
> may make name lookup a little more complicated but I think it will make
> everything clearer.  We could, of course, try this on the branch and
> see if we like the results :)

Would it be reasonable to break up function blocks into two separate
blocks: a linear block that only defines the parameters for the
function and a non-linear block that contains the actual local
variables?  Not that I think Jim's scheme is a bad one - I agree that
it's better than the current scheme - but given the possibility of
local variables shadowing function parameters, it seems to me to be
conceptually cleaner to have two separate blocks appear anyways, and
it also solves this problem.

Also, for what it's worth, I'm still not ready to completely give up
on representing members of classes via a dictionary; that would
provide another place where a linear dictionary environment could be
useful.

David Carlton
carlton@math.stanford.edu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 10:38   ` David Carlton
@ 2002-09-23 17:34     ` Daniel Berlin
  2002-09-23 18:39       ` Daniel Jacobowitz
  2002-09-24  9:33       ` David Carlton
  2002-09-23 18:28     ` Daniel Jacobowitz
                       ` (2 subsequent siblings)
  3 siblings, 2 replies; 28+ messages in thread
From: Daniel Berlin @ 2002-09-23 17:34 UTC (permalink / raw)
  To: David Carlton; +Cc: Daniel Jacobowitz, Jim Blandy, gdb

> I'm also curious about how it would affect the speed of reading in
> symbols.  Right now, that should be O(n), where n is the number of
> global symbols, right?

>  If we used expandable hash tables, then I
> think it would be amortized O(n) and with the constant factor larger.

Nope.
Our string hash function is O(N) right now (as are most). Hash tables 
are only O(N) when the hash function is O(1).

Now, if you have it only hash the first x characters, you can make your 
hash table O(N) again, with the x as the constant. Of course, if only 
hashing the first x characters causes tons of hash conflicts, it's not 
going to make your hash table very fast.


> (But, I think, not larger in a way that would make a difference.)  I'm
> curious about how often the "amortized" bit would lead to strange
> hiccups, but I don't think that's a big deal.
>
> But for skip lists, wouldn't it be something like O(n log n)?  If so,
> that's an issue we have to consider.

Put it in perspective.
for 1 billion symbols, n is 29.89.
for 1 million symbols, n is 19.93.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 10:38   ` David Carlton
  2002-09-23 17:34     ` Daniel Berlin
@ 2002-09-23 18:28     ` Daniel Jacobowitz
  2002-09-24  3:51     ` Paul N. Hilfinger
  2002-09-24 19:52     ` Jim Blandy
  3 siblings, 0 replies; 28+ messages in thread
From: Daniel Jacobowitz @ 2002-09-23 18:28 UTC (permalink / raw)
  To: David Carlton; +Cc: Jim Blandy, gdb

On Mon, Sep 23, 2002 at 10:38:54AM -0700, David Carlton wrote:
> On Sun, 22 Sep 2002 23:10:56 -0400, Daniel Jacobowitz <drow@mvista.com> said:
> > I'm tempted to whack the block special case for function arguments.  It
> > may make name lookup a little more complicated but I think it will make
> > everything clearer.  We could, of course, try this on the branch and
> > see if we like the results :)
> 
> Would it be reasonable to break up function blocks into two separate
> blocks: a linear block that only defines the parameters for the
> function and a non-linear block that contains the actual local
> variables?  Not that I think Jim's scheme is a bad one - I agree that
> it's better than the current scheme - but given the possibility of
> local variables shadowing function parameters, it seems to me to be
> conceptually cleaner to have two separate blocks appear anyways, and
> it also solves this problem.

Absolutely, I think it would be reasonable.  In fact I think it's a
really good idea.

> Also, for what it's worth, I'm still not ready to completely give up
> on representing members of classes via a dictionary; that would
> provide another place where a linear dictionary environment could be
> useful.

I wouldn't mind having the function arguments be a dictionary,
either...

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 17:34     ` Daniel Berlin
@ 2002-09-23 18:39       ` Daniel Jacobowitz
  2002-09-23 21:28         ` Daniel Berlin
  2002-09-24  9:33       ` David Carlton
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2002-09-23 18:39 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: David Carlton, Jim Blandy, gdb

On Mon, Sep 23, 2002 at 08:34:50PM -0400, Daniel Berlin wrote:
> >I'm also curious about how it would affect the speed of reading in
> >symbols.  Right now, that should be O(n), where n is the number of
> >global symbols, right?
> 
> > If we used expandable hash tables, then I
> >think it would be amortized O(n) and with the constant factor larger.
> 
> Nope.
> Our string hash function is O(N) right now (as are most). Hash tables 
> are only O(N) when the hash function is O(1).
> 
> Now, if you have it only hash the first x characters, you can make your 
> hash table O(N) again, with the x as the constant. Of course, if only 
> hashing the first x characters causes tons of hash conflicts, it's not 
> going to make your hash table very fast.

Wait a second... aren't you switching N's on us?  N is the number of
global symbols.  We're talking about the total time of adding all
symbols to the table.  Hashing a string is "effectively" constant time,
because all string lengths are "small".

> >(But, I think, not larger in a way that would make a difference.)  I'm
> >curious about how often the "amortized" bit would lead to strange
> >hiccups, but I don't think that's a big deal.
> >
> >But for skip lists, wouldn't it be something like O(n log n)?  If so,
> >that's an issue we have to consider.
> 
> Put it in perspective.
> for 1 billion symbols, n is 29.89.
> for 1 million symbols, n is 19.93.

You mean "log n", of course.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 18:39       ` Daniel Jacobowitz
@ 2002-09-23 21:28         ` Daniel Berlin
  2002-09-23 21:38           ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Berlin @ 2002-09-23 21:28 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: David Carlton, Jim Blandy, gdb



On Mon, 23 Sep 2002, Daniel Jacobowitz wrote:

> On Mon, Sep 23, 2002 at 08:34:50PM -0400, Daniel Berlin wrote:
> > >I'm also curious about how it would affect the speed of reading in
> > >symbols.  Right now, that should be O(n), where n is the number of
> > >global symbols, right?
> > 
> > > If we used expandable hash tables, then I
> > >think it would be amortized O(n) and with the constant factor larger.
> > 
> > Nope.
> > Our string hash function is O(N) right now (as are most). Hash tables 
> > are only O(N) when the hash function is O(1).
> > 
> > Now, if you have it only hash the first x characters, you can make your 
> > hash table O(N) again, with the x as the constant. Of course, if only 
> > hashing the first x characters causes tons of hash conflicts, it's not 
> > going to make your hash table very fast.
> 
> Wait a second... aren't you switching N's on us?  
Yes, but this just makes it O(MN), where M is your string hash time, N is 
your number of global symbols.

Although M is not likely to approach N, it could happen on *very* long 
mangled names (i've seen some > 500 characters, in applications with 
around that many *global* symbols), or easily if you index on demangled 
names.

>  is the number of
> global symbols.  We're talking about the total time of adding all
> symbols to the table.
I'm quite aware.

>  Hashing a string is "effectively" constant time,
It's not.
> because all string lengths are "small".
Bullshit.

libstdc++, fer instance, has an average symbol length of 42 (mangled).
Longest is 131 characters.

Demangled, the average is 86. 
Longest is 1116.

> 
> > >(But, I think, not larger in a way that would make a difference.)  I'm
> > >curious about how often the "amortized" bit would lead to strange
> > >hiccups, but I don't think that's a big deal.
> > >
> > >But for skip lists, wouldn't it be something like O(n log n)?  If so,
> > >that's an issue we have to consider.
> > 
> > Put it in perspective.
> > for 1 billion symbols, n is 29.89.
> > for 1 million symbols, n is 19.93.
> 
> You mean "log n", of course.
yup.

> 
> 


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 21:28         ` Daniel Berlin
@ 2002-09-23 21:38           ` Eli Zaretskii
  2002-09-23 21:44             ` Daniel Berlin
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2002-09-23 21:38 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Daniel Jacobowitz, David Carlton, Jim Blandy, gdb


On Tue, 24 Sep 2002, Daniel Berlin wrote:

> > because all string lengths are "small".
> Bullshit.

Can you _please_ try to be polite here?  TIA


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 21:38           ` Eli Zaretskii
@ 2002-09-23 21:44             ` Daniel Berlin
  2002-09-23 21:47               ` Eli Zaretskii
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Berlin @ 2002-09-23 21:44 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Daniel Jacobowitz, David Carlton, Jim Blandy, gdb

On Tue, 24 Sep 2002, Eli Zaretskii wrote:

> 
> On Tue, 24 Sep 2002, Daniel Berlin wrote:
> 
> > > because all string lengths are "small".
> > Bullshit.
> 
> Can you _please_ try to be polite here?  TIA
Um, it was clearly directed at his assertion, not him.
Thus, there is nothing impolite about it (unless you are going by a rather 
odd definition i've only heard used once, but is still technically 
correct).

In the future, if you are only going to scold me for using crude language, 
i kindly request you do it in private. Otherwise, you are just 
contributing noise (and when one does searches, this message will pop up 
when it has nothing to do with dictionarys, string lengths, etc, making 
it harder to find useful information). 
TIA,
Dan

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 21:44             ` Daniel Berlin
@ 2002-09-23 21:47               ` Eli Zaretskii
  2002-09-23 21:54                 ` Daniel Berlin
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2002-09-23 21:47 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Daniel Jacobowitz, David Carlton, Jim Blandy, gdb


On Tue, 24 Sep 2002, Daniel Berlin wrote:

> In the future, if you are only going to scold me for using crude language, 
> i kindly request you do it in private.

Sorry, no.  In my view, rudeness in public forums should be replied in 
public.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 21:47               ` Eli Zaretskii
@ 2002-09-23 21:54                 ` Daniel Berlin
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Berlin @ 2002-09-23 21:54 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Daniel Jacobowitz, David Carlton, Jim Blandy, gdb



On Tue, 24 Sep 2002, Eli Zaretskii wrote:

> 
> On Tue, 24 Sep 2002, Daniel Berlin wrote:
> 
> > In the future, if you are only going to scold me for using crude language, 
> > i kindly request you do it in private.
> 
> Sorry, no.  In my view, rudeness in public forums should be replied in 
> public.
Then at least change the subject, please, so it doesn't occur in the 
archives under dictionary representations.
It is also not rudeness. Rudeness generally suggests intentional 
discourtesy towards someone.
There was none.

--Dan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 10:38   ` David Carlton
  2002-09-23 17:34     ` Daniel Berlin
  2002-09-23 18:28     ` Daniel Jacobowitz
@ 2002-09-24  3:51     ` Paul N. Hilfinger
  2002-09-24 19:52     ` Jim Blandy
  3 siblings, 0 replies; 28+ messages in thread
From: Paul N. Hilfinger @ 2002-09-24  3:51 UTC (permalink / raw)
  To: carlton; +Cc: drow, jimb, gdb

A random piece of empirical data, for what it's worth: 

Out of curiosity, I slapped together a little skip-list implementation
and ran all the stabs symbols in a large (67MByte) executable through
it---adding all of them.  There were a total of about 670000 adds, of
which about 150000 were distinct.  This required about 3 seconds
(including the time to allocate new list nodes, allocate copies of the
keys, and insert the new items) running on an 800MHz Athlon processor.
For comparison, the time to first prompt for a version of GDB (version
5.1.1 or so) on this same file was 4 seconds, or 47 seconds if started
with -r.  

Interpretation, I admit, is difficult, but I think this establishes
some sort of rough bound on possible performance hits from this
representation (the skip-list implementation, being hasty, was not
particularly optimized).

Paul Hilfinger

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 17:34     ` Daniel Berlin
  2002-09-23 18:39       ` Daniel Jacobowitz
@ 2002-09-24  9:33       ` David Carlton
  2002-09-24 10:42         ` Daniel Berlin
  2002-09-24 20:01         ` Jim Blandy
  1 sibling, 2 replies; 28+ messages in thread
From: David Carlton @ 2002-09-24  9:33 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Daniel Jacobowitz, Jim Blandy, gdb

On Mon, 23 Sep 2002 20:34:50 -0400, Daniel Berlin <dberlin@dberlin.org> said:

>> I'm also curious about how it would affect the speed of reading in
>> symbols.  Right now, that should be O(n), where n is the number of
>> global symbols, right?

>> If we used expandable hash tables, then I think it would be
>> amortized O(n) and with the constant factor larger.

> Our string hash function is O(N) right now (as are most). Hash
> tables are only O(N) when the hash function is O(1).

[ Here, of course, my 'n' is the number of global symbols, and
Daniel's 'N' is the maximum symbol length. ]

This is true, but I'm not sure that it's relevant to this sort of
theoretical analysis.  After all, skip lists depend on N, as well:
they put symbols in order, and the amount of time to do that depends
on the length of the symbols.

And it's entirely reasonable to think of 'N' as a constant.  Or
perhaps two constants: one for C programs with short names, one for
C++ programs with long names.  (And I'm not really sure that the C++
names will ultimately turn out to be that much longer: once the proper
namespace support has been added, then looking up a C++ name will
probably be a multistep process (looking up pieces of the demangled
name in turn), and for each those steps, we'll be looking at a name
that will be of the appropriate length for a C program.)

But even if we consider N to be a constant, your broader point stands:
the constant factors that different algorithms differ by are
important, and in practice large constants can have more of an affect
than logarithmic terms.  Fortunately, one of the advantages of the
refactoring that I'm doing right now is that it'll be easy to drop in
different dictionary implementations for testing purposes: it should
boil down to writing the code that we'd have to do to get skip list
support anyways, changing one or two function calls, and recompiling.

David Carlton
carlton@math.stanford.edu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24  9:33       ` David Carlton
@ 2002-09-24 10:42         ` Daniel Berlin
  2002-09-24 10:53           ` David Carlton
  2002-09-24 20:01         ` Jim Blandy
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Berlin @ 2002-09-24 10:42 UTC (permalink / raw)
  To: David Carlton; +Cc: Daniel Jacobowitz, Jim Blandy, gdb

On Tuesday, September 24, 2002, at 12:33  PM, David Carlton wrote:

> On Mon, 23 Sep 2002 20:34:50 -0400, Daniel Berlin 
> <dberlin@dberlin.org> said:
>
>>> I'm also curious about how it would affect the speed of reading in
>>> symbols.  Right now, that should be O(n), where n is the number of
>>> global symbols, right?
>
>>> If we used expandable hash tables, then I think it would be
>>> amortized O(n) and with the constant factor larger.
>
>> Our string hash function is O(N) right now (as are most). Hash
>> tables are only O(N) when the hash function is O(1).
>
> [ Here, of course, my 'n' is the number of global symbols, and
> Daniel's 'N' is the maximum symbol length. ]
Yup.
In the future i'll just choose different letters.
(Who the heck thought it would be smart to have a big Oh and little Oh, 
fer instance)
>
> This is true, but I'm not sure that it's relevant to this sort of
> theoretical analysis.  After all, skip lists depend on N, as well:
> they put symbols in order, and the amount of time to do that depends
> on the length of the symbols.
I was waiting for someone to bring this up.

You *also* have this N in hash tables in another form: Key comparisons.
It's not just the hash function.
Regardless of your method of hashing (chaining or probing), you have to 
compare the key.
The better the hash function, the less key comparisons you do (because 
the chain length is shorter or the number of probes is less).

If your average chain lengths go above log n (the number of keys we'd 
have to strcmp in the skiplist), which in fact, for large files, they 
were, then the skiplist will likely be better, because it'll perform 
less comparisons, *and* doesn't have do to the hash function 
calculation.

>
> And it's entirely reasonable to think of 'N' as a constant.
But it's really not.
You can't predict the lengths of symbol names over the next ten years.
I'd bet if one looked, you'd seem it's a generally increasing curve for 
C++ apps, as they become more complex, have more namespaces, etc.
For C apps, it's probably constant.

>  Or
> perhaps two constants: one for C programs with short names, one for
> C++ programs with long names.  (And I'm not really sure that the C++
> names will ultimately turn out to be that much longer: once the proper
> namespace support has been added, then looking up a C++ name will
> probably be a multistep process (looking up pieces of the demangled
> name in turn), and for each those steps, we'll be looking at a name
> that will be of the appropriate length for a C program.)
>
Maybe, maybe not.
Certainly, if it goes multistep, i'd wager that even in 10 years, the 
average symbol length of individual parts won't go up very much, even 
if the length of the entire symbol name goes up.
> But even if we consider N to be a constant, your broader point stands:
> the constant factors that different algorithms differ by are
> important, and in practice large constants can have more of an affect
> than logarithmic terms.  Fortunately, one of the advantages of the
> refactoring that I'm doing right now is that it'll be easy to drop in
> different dictionary implementations for testing purposes: it should
> boil down to writing the code that we'd have to do to get skip list
> support anyways, changing one or two function calls, and recompiling.
>
Works for me.

I would actually suggest neither hash tables or skiplists, but ternary 
search trees.
They compare *very* well to hashes, and beat out skip lists as well.
The algorithms are easy (and i already added them to libiberty), and 
they were engineered as structures for symbol tables.
They support the "fast completion" property of skiplists (since they 
are ordered), but the constant you pay to insert nodes is much lower.
They also support easy pattern matching searches.
The only problem is that i was too lazy to look up general n-way tree 
balancing algorithms and implement one, so the current implementation 
is unbalanced.
Thus, if you insert in sorted order, things will get bad.

See http://www.ddj.com/documents/s=921/ddj9804a/9804a.htm

Asymptotic analysis of them isn't as easy as one would think.
In fact, it's downright difficult.
But, it's been done.
If you can grasp all the math:
http://users.info.unicaen.fr/~clement/SODA/all1.html

But, as you said, this can all be hashed out later and a few function 
calls changed if it turns out to be better.

> David Carlton
> carlton@math.stanford.edu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24 10:42         ` Daniel Berlin
@ 2002-09-24 10:53           ` David Carlton
  0 siblings, 0 replies; 28+ messages in thread
From: David Carlton @ 2002-09-24 10:53 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Daniel Jacobowitz, Jim Blandy, gdb

On Tue, 24 Sep 2002 13:42:18 -0400, Daniel Berlin <dberlin@dberlin.org> said:

> (Who the heck thought it would be smart to have a big Oh and little
> Oh, fer instance)

:-)

> If your average chain lengths go above log n (the number of keys
> we'd have to strcmp in the skiplist), which in fact, for large
> files, they were, then the skiplist will likely be better, because
> it'll perform less comparisons, *and* doesn't have do to the hash
> function calculation.

I'm not entirely sure I understand your assumptions in this last
paragraph; are you assuming the number of buckets will remain
constant?  I think that, if we're going to use hash tables for global
symbols, we'll want to resize the hash tables as necessary.  My
apologies for not making that clear; that's why I said that hash
tables were amortized O(n) rather than just O(n).  (But this is, of
course, yet another way that the constant factor for hash tables could
be larger than one might like.)

> I would actually suggest neither hash tables or skiplists, but
> ternary search trees.

Thanks for the references; I'll give them a look.

David Carlton
carlton@math.stanford.edu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 10:38   ` David Carlton
                       ` (2 preceding siblings ...)
  2002-09-24  3:51     ` Paul N. Hilfinger
@ 2002-09-24 19:52     ` Jim Blandy
  2002-09-24 20:37       ` Elena Zannoni
  2002-09-24 20:53       ` Daniel Berlin
  3 siblings, 2 replies; 28+ messages in thread
From: Jim Blandy @ 2002-09-24 19:52 UTC (permalink / raw)
  To: David Carlton; +Cc: Daniel Jacobowitz, gdb

David Carlton <carlton@math.stanford.edu> writes:
> > I'm tempted to whack the block special case for function arguments.  It
> > may make name lookup a little more complicated but I think it will make
> > everything clearer.  We could, of course, try this on the branch and
> > see if we like the results :)
> 
> Would it be reasonable to break up function blocks into two separate
> blocks: a linear block that only defines the parameters for the
> function and a non-linear block that contains the actual local
> variables?  Not that I think Jim's scheme is a bad one - I agree that
> it's better than the current scheme - but given the possibility of
> local variables shadowing function parameters, it seems to me to be
> conceptually cleaner to have two separate blocks appear anyways, and
> it also solves this problem.

The issue is a bit more tangled than you think, I think.  Splitting
the function's body and its formals into two separate blocks is a good
idea, but it isn't going to get rid of all your duplicates.  A single
formal parameter can have two symbols in a function's block that
describe it.  Try this out on a Pentium.  (The `-O2' and `-gstabs+'
are required.)

  $ cat func.c
  #include <stdio.h>

  int
  main (int argc, char **argv)
  {
    static int local = 3;
    printf ("%d\n", argc * local);
  }
  $ gcc -O2 -gstabs+ func.c -o func

Then start up GDB on GDB on `func':

  (top-gdb) run
  The program being debugged has been started already.
  Start it from the beginning? (y or n) y

  Starting program: gdb -nw func
  GNU gdb 2002-09-16-cvs
  Copyright 2002 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB.  Type "show warranty" for details.
  This GDB was configured as "i686-pc-linux-gnu"...
  (gdb)

Set a breakpoint in main, just to get the symbols read:

  (gdb) break main
  Breakpoint 1 at 0x804834c: file func.c, line 7.
  (gdb)

Drop out to the enclosing GDB:

  (gdb) info
  (top-gdb)

It just so happens that `func.c' is the first compilation unit of the
first executable file in GDB's list:

  (top-gdb) print object_files->symtabs->filename
  $177 = 0x82fdbf8 "func.c"
  (top-gdb)

If that's not so for you, you'll need to walk `symtabs' to find the
right symtab.  Anyway, let's check out this symtab's blockvector.  I'm
just using [0] as a postfix dereferencing operator here:

  (top-gdb) print object_files->symtabs->blockvector[0]
  $178 = {nblocks = 3, block = {0x82f8b74}}
  (top-gdb)

The first and second blocks are the global and static blocks, so the
third one is probably for `main':

  (top-gdb) print object_files->symtabs->blockvector->block[2]
  $179 = (struct block *) 0x82f8ab4
  (top-gdb) p *$179
  $180 = {startaddr = 134513472, endaddr = 134513513, function = 0x82f8988, 
    superblock = 0x82f8ae4, gcc_compile_flag = 2 '\002', hashtable = 0 '\0', 
    nsyms = 4, sym = {0x82f89c4}}
  (top-gdb) p *$179->function
  $181 = {ginfo = {name = 0x82f89bc "main", value = {ivalue = 137333428, 
        block = 0x82f8ab4, 
        bytes = 0x82f8ab4 "@\203\004\bi\203\004\b\210\211/\bä\212/\b\002", 
        address = 137333428, chain = 0x82f8ab4}, language_specific = {
        cplus_specific = {demangled_name = 0x0}}, language = language_c, 
      section = 11, bfd_section = 0x82d4fc0}, type = 0x82faaa8, 
    namespace = VAR_NAMESPACE, aclass = LOC_BLOCK, line = 5, aux_value = {
      basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
  (top-gdb)

And it was!  Let's look at those four symbols:

  (top-gdb) p *$179->sym[0]
  $182 = {ginfo = {name = 0x82f89f8 "argc", value = {ivalue = 8, block = 0x8, 
        bytes = 0x8 <Address 0x8 out of bounds>, address = 8, chain = 0x8}, 
      language_specific = {cplus_specific = {demangled_name = 0x0}}, 
      language = language_c, section = 0, bfd_section = 0x0}, type = 0x82df828,
    namespace = VAR_NAMESPACE, aclass = LOC_ARG, line = 4, aux_value = {
      basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
  (top-gdb) p *$179->sym[1]
  $183 = {ginfo = {name = 0x82f8a34 "argv", value = {ivalue = 12, block = 0xc, 
        bytes = 0xc <Address 0xc out of bounds>, address = 12, chain = 0xc}, 
      language_specific = {cplus_specific = {demangled_name = 0x0}}, 
      language = language_c, section = 0, bfd_section = 0x0}, type = 0x82faaf4,
    namespace = VAR_NAMESPACE, aclass = LOC_ARG, line = 4, aux_value = {
      basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
  (top-gdb) p *$179->sym[2]
  $184 = {ginfo = {name = 0x82f8a70 "argc", value = {ivalue = 0, block = 0x0, 
        bytes = 0x0, address = 0, chain = 0x0}, language_specific = {
        cplus_specific = {demangled_name = 0x0}}, language = language_c, 
      section = 0, bfd_section = 0x0}, type = 0x82df828, 
    namespace = VAR_NAMESPACE, aclass = LOC_REGISTER, line = 4, aux_value = {
      basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
  (top-gdb) p *$179->sym[3]
  $185 = {ginfo = {name = 0x82f8aac "local", value = {ivalue = 134517720, 
        block = 0x80493d8, bytes = 0x80493d8 "É\f", address = 134517720, 
        chain = 0x80493d8}, language_specific = {cplus_specific = {
          demangled_name = 0x0}}, language = language_c, section = 14, 
      bfd_section = 0x0}, type = 0x82df828, namespace = VAR_NAMESPACE, 
    aclass = LOC_STATIC, line = 6, aux_value = {basereg = 0}, aliases = 0x0, 
    ranges = 0x0, hash_next = 0x0}
  (top-gdb) 

Hey!  Why are there two entries for argc?  (This is the extra tangle I
was referring to.  If you know all about this, you can stop reading
now.)

The two `argc' symbols have different address classes: one has an
address class that indicates it's an argument, and the other doesn't.
The argument symbol describes where the variable is passed on the
stack (eight bytes after %ebp), whereas the non-argument symbol
describes where the variable lives in the block of the function:
register zero, or %eax.

As a sanity check, let's look at the IA-32 code for main:

    (top-gdb) c
    Continuing.
    (gdb) disass main
    Dump of assembler code for function main:
    0x8048340 <main>:	push   %ebp
    0x8048341 <main+1>:	mov    %esp,%ebp
    0x8048343 <main+3>:	sub    $0x8,%esp
    0x8048346 <main+6>:	mov    0x8(%ebp),%eax
    0x8048349 <main+9>:	and    $0xfffffff0,%esp
    0x804834c <main+12>:	mov    0x80493d8,%edx
    0x8048352 <main+18>:	movl   $0x80483c8,(%esp,1)
    0x8048359 <main+25>:	imul   %edx,%eax
    0x804835c <main+28>:	mov    %eax,0x4(%esp,1)
    0x8048360 <main+32>:	call   0x8048268 <printf>
    0x8048365 <main+37>:	mov    %ebp,%esp
    0x8048367 <main+39>:	pop    %ebp
    0x8048368 <main+40>:	ret    
    End of assembler dump.
    (gdb) 

So, yes, the compiler did copy `argc' from the stack into %eax.
Check.

But *why* does GDB do this?  I have no idea.  It seems to me that,
with prologue skipping et al, simply having a single LOC_REGPARM would
be the Right Thing.  I don't really know when GDB will prefer the
argument entry, and when it'll prefer the non-argument entry.

I suspect it's historical.  If you look at the stabs spec, you'll see
that it actually emits two stabs for arguments that are passed in one
place, but get moved somewhere else:

  $ objdump --stabs func
  ...
  329    FUN    0      5      08048340 12145  main:F(0,1)
  330    PSYM   0      4      00000008 12157  argc:p(0,1)
  331    PSYM   0      4      0000000c 12169  argv:p(1,1)=*(7,36)
  332    SLINE  0      5      00000000 0      
  333    SLINE  0      7      0000000c 0      
  334    SLINE  0      8      00000025 0      
  335    RSYM   0      4      00000000 12189  argc:r(0,1)
  336    STSYM  0      6      080493d8 12201  local:V(0,1)
  337    LBRAC  0      0      0000000c 0      
  338    RBRAC  0      0      00000029 0      
  339    FUN    0      0      00000029 0      
  ...
  $ 

The PSYM accounts for the argument symbol, and the RSYM accounts for
the internal symbol.  A lot of GDB's data structures very closely
match what's provided in STABS.  (The partial symbol tables are a
good example of this: they correspond exactly to the EXCL links.)

But anyway, all this could be handled much better nowadays using Dwarf
2 CFA and location lists.  I've been saying that for years, but it
hasn't happened yet.  Andrew has the CFI done now (I think?), and
Daniel B. has submitted a patch for location expressions (but not
location lists, tho they would be easy to add), but it's awaiting
revision while he works on law school.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24  9:33       ` David Carlton
  2002-09-24 10:42         ` Daniel Berlin
@ 2002-09-24 20:01         ` Jim Blandy
  2002-09-24 20:50           ` Daniel Berlin
  1 sibling, 1 reply; 28+ messages in thread
From: Jim Blandy @ 2002-09-24 20:01 UTC (permalink / raw)
  To: David Carlton; +Cc: Daniel Berlin, Daniel Jacobowitz, gdb


David Carlton <carlton@math.stanford.edu> writes:
> And it's entirely reasonable to think of 'N' as a constant.  Or
> perhaps two constants: one for C programs with short names, one for
> C++ programs with long names.  (And I'm not really sure that the C++
> names will ultimately turn out to be that much longer: once the proper
> namespace support has been added, then looking up a C++ name will
> probably be a multistep process (looking up pieces of the demangled
> name in turn), and for each those steps, we'll be looking at a name
> that will be of the appropriate length for a C program.)

Actually, the insanely long symbol names I've seen have to do with
template abuse, not deeply nested namespaces.  I'm not free to post
the example I have in mind, but I'm pretty sure you'd be disgusted.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24 19:52     ` Jim Blandy
@ 2002-09-24 20:37       ` Elena Zannoni
  2002-09-24 20:53       ` Daniel Berlin
  1 sibling, 0 replies; 28+ messages in thread
From: Elena Zannoni @ 2002-09-24 20:37 UTC (permalink / raw)
  To: Jim Blandy; +Cc: David Carlton, Daniel Jacobowitz, gdb


On this topic, see the following thread. DanielJ and I had the same discussion
then.

http://sources.redhat.com/ml/gdb-patches/2001-11/msg00130.html

Elena


Jim Blandy writes:
 > 
 > David Carlton <carlton@math.stanford.edu> writes:
 > > > I'm tempted to whack the block special case for function arguments.  It
 > > > may make name lookup a little more complicated but I think it will make
 > > > everything clearer.  We could, of course, try this on the branch and
 > > > see if we like the results :)
 > > 
 > > Would it be reasonable to break up function blocks into two separate
 > > blocks: a linear block that only defines the parameters for the
 > > function and a non-linear block that contains the actual local
 > > variables?  Not that I think Jim's scheme is a bad one - I agree that
 > > it's better than the current scheme - but given the possibility of
 > > local variables shadowing function parameters, it seems to me to be
 > > conceptually cleaner to have two separate blocks appear anyways, and
 > > it also solves this problem.
 > 
 > The issue is a bit more tangled than you think, I think.  Splitting
 > the function's body and its formals into two separate blocks is a good
 > idea, but it isn't going to get rid of all your duplicates.  A single
 > formal parameter can have two symbols in a function's block that
 > describe it.  Try this out on a Pentium.  (The `-O2' and `-gstabs+'
 > are required.)
 > 
 >   $ cat func.c
 >   #include <stdio.h>
 > 
 >   int
 >   main (int argc, char **argv)
 >   {
 >     static int local = 3;
 >     printf ("%d\n", argc * local);
 >   }
 >   $ gcc -O2 -gstabs+ func.c -o func
 > 
 > Then start up GDB on GDB on `func':
 > 
 >   (top-gdb) run
 >   The program being debugged has been started already.
 >   Start it from the beginning? (y or n) y
 > 
 >   Starting program: gdb -nw func
 >   GNU gdb 2002-09-16-cvs
 >   Copyright 2002 Free Software Foundation, Inc.
 >   GDB is free software, covered by the GNU General Public License, and you are
 >   welcome to change it and/or distribute copies of it under certain conditions.
 >   Type "show copying" to see the conditions.
 >   There is absolutely no warranty for GDB.  Type "show warranty" for details.
 >   This GDB was configured as "i686-pc-linux-gnu"...
 >   (gdb)
 > 
 > Set a breakpoint in main, just to get the symbols read:
 > 
 >   (gdb) break main
 >   Breakpoint 1 at 0x804834c: file func.c, line 7.
 >   (gdb)
 > 
 > Drop out to the enclosing GDB:
 > 
 >   (gdb) info
 >   (top-gdb)
 > 
 > It just so happens that `func.c' is the first compilation unit of the
 > first executable file in GDB's list:
 > 
 >   (top-gdb) print object_files->symtabs->filename
 >   $177 = 0x82fdbf8 "func.c"
 >   (top-gdb)
 > 
 > If that's not so for you, you'll need to walk `symtabs' to find the
 > right symtab.  Anyway, let's check out this symtab's blockvector.  I'm
 > just using [0] as a postfix dereferencing operator here:
 > 
 >   (top-gdb) print object_files->symtabs->blockvector[0]
 >   $178 = {nblocks = 3, block = {0x82f8b74}}
 >   (top-gdb)
 > 
 > The first and second blocks are the global and static blocks, so the
 > third one is probably for `main':
 > 
 >   (top-gdb) print object_files->symtabs->blockvector->block[2]
 >   $179 = (struct block *) 0x82f8ab4
 >   (top-gdb) p *$179
 >   $180 = {startaddr = 134513472, endaddr = 134513513, function = 0x82f8988, 
 >     superblock = 0x82f8ae4, gcc_compile_flag = 2 '\002', hashtable = 0 '\0', 
 >     nsyms = 4, sym = {0x82f89c4}}
 >   (top-gdb) p *$179->function
 >   $181 = {ginfo = {name = 0x82f89bc "main", value = {ivalue = 137333428, 
 >         block = 0x82f8ab4, 
 >         bytes = 0x82f8ab4 "@\203\004\bi\203\004\b\210\211/\bä\212/\b\002", 
 >         address = 137333428, chain = 0x82f8ab4}, language_specific = {
 >         cplus_specific = {demangled_name = 0x0}}, language = language_c, 
 >       section = 11, bfd_section = 0x82d4fc0}, type = 0x82faaa8, 
 >     namespace = VAR_NAMESPACE, aclass = LOC_BLOCK, line = 5, aux_value = {
 >       basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
 >   (top-gdb)
 > 
 > And it was!  Let's look at those four symbols:
 > 
 >   (top-gdb) p *$179->sym[0]
 >   $182 = {ginfo = {name = 0x82f89f8 "argc", value = {ivalue = 8, block = 0x8, 
 >         bytes = 0x8 <Address 0x8 out of bounds>, address = 8, chain = 0x8}, 
 >       language_specific = {cplus_specific = {demangled_name = 0x0}}, 
 >       language = language_c, section = 0, bfd_section = 0x0}, type = 0x82df828,
 >     namespace = VAR_NAMESPACE, aclass = LOC_ARG, line = 4, aux_value = {
 >       basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
 >   (top-gdb) p *$179->sym[1]
 >   $183 = {ginfo = {name = 0x82f8a34 "argv", value = {ivalue = 12, block = 0xc, 
 >         bytes = 0xc <Address 0xc out of bounds>, address = 12, chain = 0xc}, 
 >       language_specific = {cplus_specific = {demangled_name = 0x0}}, 
 >       language = language_c, section = 0, bfd_section = 0x0}, type = 0x82faaf4,
 >     namespace = VAR_NAMESPACE, aclass = LOC_ARG, line = 4, aux_value = {
 >       basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
 >   (top-gdb) p *$179->sym[2]
 >   $184 = {ginfo = {name = 0x82f8a70 "argc", value = {ivalue = 0, block = 0x0, 
 >         bytes = 0x0, address = 0, chain = 0x0}, language_specific = {
 >         cplus_specific = {demangled_name = 0x0}}, language = language_c, 
 >       section = 0, bfd_section = 0x0}, type = 0x82df828, 
 >     namespace = VAR_NAMESPACE, aclass = LOC_REGISTER, line = 4, aux_value = {
 >       basereg = 0}, aliases = 0x0, ranges = 0x0, hash_next = 0x0}
 >   (top-gdb) p *$179->sym[3]
 >   $185 = {ginfo = {name = 0x82f8aac "local", value = {ivalue = 134517720, 
 >         block = 0x80493d8, bytes = 0x80493d8 "É\f", address = 134517720, 
 >         chain = 0x80493d8}, language_specific = {cplus_specific = {
 >           demangled_name = 0x0}}, language = language_c, section = 14, 
 >       bfd_section = 0x0}, type = 0x82df828, namespace = VAR_NAMESPACE, 
 >     aclass = LOC_STATIC, line = 6, aux_value = {basereg = 0}, aliases = 0x0, 
 >     ranges = 0x0, hash_next = 0x0}
 >   (top-gdb) 
 > 
 > Hey!  Why are there two entries for argc?  (This is the extra tangle I
 > was referring to.  If you know all about this, you can stop reading
 > now.)
 > 
 > The two `argc' symbols have different address classes: one has an
 > address class that indicates it's an argument, and the other doesn't.
 > The argument symbol describes where the variable is passed on the
 > stack (eight bytes after %ebp), whereas the non-argument symbol
 > describes where the variable lives in the block of the function:
 > register zero, or %eax.
 > 
 > As a sanity check, let's look at the IA-32 code for main:
 > 
 >     (top-gdb) c
 >     Continuing.
 >     (gdb) disass main
 >     Dump of assembler code for function main:
 >     0x8048340 <main>:	push   %ebp
 >     0x8048341 <main+1>:	mov    %esp,%ebp
 >     0x8048343 <main+3>:	sub    $0x8,%esp
 >     0x8048346 <main+6>:	mov    0x8(%ebp),%eax
 >     0x8048349 <main+9>:	and    $0xfffffff0,%esp
 >     0x804834c <main+12>:	mov    0x80493d8,%edx
 >     0x8048352 <main+18>:	movl   $0x80483c8,(%esp,1)
 >     0x8048359 <main+25>:	imul   %edx,%eax
 >     0x804835c <main+28>:	mov    %eax,0x4(%esp,1)
 >     0x8048360 <main+32>:	call   0x8048268 <printf>
 >     0x8048365 <main+37>:	mov    %ebp,%esp
 >     0x8048367 <main+39>:	pop    %ebp
 >     0x8048368 <main+40>:	ret    
 >     End of assembler dump.
 >     (gdb) 
 > 
 > So, yes, the compiler did copy `argc' from the stack into %eax.
 > Check.
 > 
 > But *why* does GDB do this?  I have no idea.  It seems to me that,
 > with prologue skipping et al, simply having a single LOC_REGPARM would
 > be the Right Thing.  I don't really know when GDB will prefer the
 > argument entry, and when it'll prefer the non-argument entry.
 > 
 > I suspect it's historical.  If you look at the stabs spec, you'll see
 > that it actually emits two stabs for arguments that are passed in one
 > place, but get moved somewhere else:
 > 
 >   $ objdump --stabs func
 >   ...
 >   329    FUN    0      5      08048340 12145  main:F(0,1)
 >   330    PSYM   0      4      00000008 12157  argc:p(0,1)
 >   331    PSYM   0      4      0000000c 12169  argv:p(1,1)=*(7,36)
 >   332    SLINE  0      5      00000000 0      
 >   333    SLINE  0      7      0000000c 0      
 >   334    SLINE  0      8      00000025 0      
 >   335    RSYM   0      4      00000000 12189  argc:r(0,1)
 >   336    STSYM  0      6      080493d8 12201  local:V(0,1)
 >   337    LBRAC  0      0      0000000c 0      
 >   338    RBRAC  0      0      00000029 0      
 >   339    FUN    0      0      00000029 0      
 >   ...
 >   $ 
 > 
 > The PSYM accounts for the argument symbol, and the RSYM accounts for
 > the internal symbol.  A lot of GDB's data structures very closely
 > match what's provided in STABS.  (The partial symbol tables are a
 > good example of this: they correspond exactly to the EXCL links.)
 > 
 > But anyway, all this could be handled much better nowadays using Dwarf
 > 2 CFA and location lists.  I've been saying that for years, but it
 > hasn't happened yet.  Andrew has the CFI done now (I think?), and
 > Daniel B. has submitted a patch for location expressions (but not
 > location lists, tho they would be easy to add), but it's awaiting
 > revision while he works on law school.


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24 20:01         ` Jim Blandy
@ 2002-09-24 20:50           ` Daniel Berlin
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Berlin @ 2002-09-24 20:50 UTC (permalink / raw)
  To: Jim Blandy; +Cc: David Carlton, Daniel Jacobowitz, gdb


On Tuesday, September 24, 2002, at 10:46  PM, Jim Blandy wrote:

>>
>
> Actually, the insanely long symbol names I've seen have to do with
> template abuse, not deeply nested namespaces.  I'm not free to post
> the example I have in mind, but I'm pretty sure you'd be disgusted.
>
I forgot about template metaprogramming.
I'm a bit scared to c++filt blitz++ and friends.
--Dan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24 19:52     ` Jim Blandy
  2002-09-24 20:37       ` Elena Zannoni
@ 2002-09-24 20:53       ` Daniel Berlin
  1 sibling, 0 replies; 28+ messages in thread
From: Daniel Berlin @ 2002-09-24 20:53 UTC (permalink / raw)
  To: Jim Blandy; +Cc: David Carlton, Daniel Jacobowitz, gdb


>
>
> But anyway, all this could be handled much better nowadays using Dwarf
> 2 CFA and location lists.  I've been saying that for years, but it
> hasn't happened yet.  Andrew has the CFI done now (I think?), and
> Daniel B. has submitted a patch for location expressions (but not
> location lists, tho they would be easy to add), but it's awaiting
> revision while he works on law school.
>
I actually have the revisions necessary to do location lists, but it 
makes work of the location expression lists anyway (We just store range 
list mapping ranges to locexpr batons). It's literally nothing but a 
layer over that code that picks out the right baton(s) for the given 
pc, and calls the location functions for each baton.
--Dan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-27 11:23       ` Jim Blandy
@ 2002-09-27 11:28         ` Daniel Jacobowitz
  0 siblings, 0 replies; 28+ messages in thread
From: Daniel Jacobowitz @ 2002-09-27 11:28 UTC (permalink / raw)
  To: Jim Blandy; +Cc: gdb

On Fri, Sep 27, 2002 at 01:07:57PM -0500, Jim Blandy wrote:
> Daniel Jacobowitz <drow@mvista.com> writes:
> 
> > On Tue, Sep 24, 2002 at 10:46:02PM -0500, Jim Blandy wrote:
> > > 
> > > Daniel Berlin <dberlin@dberlin.org> writes:
> > > 
> > > > On Tue, 24 Sep 2002, Jim Blandy wrote:
> > > > 
> > > > > 
> > > > > > Also, for what it's worth, I'm still not ready to completely give up
> > > > > > on representing members of classes via a dictionary; that would
> > > > > > provide another place where a linear dictionary environment could be
> > > > > > useful.
> > > > > 
> > > > > I agree, but it's worth noting that `struct symbol' is 52 bytes long
> > > > > on a Pentium, whereas `struct field' and `struct fn_field' are 16
> > > > > bytes long.  
> > > > > 
> > > > > Not that that necessarily matters.  We know GDB does have memory
> > > > > consumption problems, but I have never seen those problems really
> > > > > analyzed. 
> > > > 
> > > > Um, I have these statistics, but I need to know *exactly* what you want to 
> > > > know to be able to give them to you.
> > > 
> > > On large C++ programs, how much of a difference would it make if we
> > > used `struct symbol' objects (52 bytes long) to represent data members
> > > and member functions, instead of `struct field' and `struct fn_field'
> > > objects (both 16 bytes long)?
> > 
> > I'm not sure this is the way to go - we could have a dictionary of
> > something other than struct symbol, probably.
> 
> Well, for fields, I think you're right.  The field list is
> intrinsically an ordered thing; a dictionary is not.
> 
> But not for methods, either?  Wouldn't it simplify overload set
> computation to have methods and ordinary functions represented using
> the same structure?

Not really.  First of all, you never have to deal with both at the same
time:
void foo(int x)
{
  printf ("Int ver\n");
}
class bar {
public:
  void foo (double x) { printf ("Double member ver\n"); }
  bar() { foo (1);  foo(2.); }
};

int main()
{
  bar baz;
  return 0;
}

Both calls will get the member version.


Secondly, getting the type would probably be all of two lines of
special case; for the memory savings that sounds pretty good to me.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-25  5:54     ` Daniel Jacobowitz
@ 2002-09-27 11:23       ` Jim Blandy
  2002-09-27 11:28         ` Daniel Jacobowitz
  0 siblings, 1 reply; 28+ messages in thread
From: Jim Blandy @ 2002-09-27 11:23 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: gdb

Daniel Jacobowitz <drow@mvista.com> writes:

> On Tue, Sep 24, 2002 at 10:46:02PM -0500, Jim Blandy wrote:
> > 
> > Daniel Berlin <dberlin@dberlin.org> writes:
> > 
> > > On Tue, 24 Sep 2002, Jim Blandy wrote:
> > > 
> > > > 
> > > > > Also, for what it's worth, I'm still not ready to completely give up
> > > > > on representing members of classes via a dictionary; that would
> > > > > provide another place where a linear dictionary environment could be
> > > > > useful.
> > > > 
> > > > I agree, but it's worth noting that `struct symbol' is 52 bytes long
> > > > on a Pentium, whereas `struct field' and `struct fn_field' are 16
> > > > bytes long.  
> > > > 
> > > > Not that that necessarily matters.  We know GDB does have memory
> > > > consumption problems, but I have never seen those problems really
> > > > analyzed. 
> > > 
> > > Um, I have these statistics, but I need to know *exactly* what you want to 
> > > know to be able to give them to you.
> > 
> > On large C++ programs, how much of a difference would it make if we
> > used `struct symbol' objects (52 bytes long) to represent data members
> > and member functions, instead of `struct field' and `struct fn_field'
> > objects (both 16 bytes long)?
> 
> I'm not sure this is the way to go - we could have a dictionary of
> something other than struct symbol, probably.

Well, for fields, I think you're right.  The field list is
intrinsically an ordered thing; a dictionary is not.

But not for methods, either?  Wouldn't it simplify overload set
computation to have methods and ordinary functions represented using
the same structure?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24 21:01   ` Jim Blandy
@ 2002-09-25  5:54     ` Daniel Jacobowitz
  2002-09-27 11:23       ` Jim Blandy
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2002-09-25  5:54 UTC (permalink / raw)
  To: Jim Blandy; +Cc: Daniel Berlin, david carlton, gdb

On Tue, Sep 24, 2002 at 10:46:02PM -0500, Jim Blandy wrote:
> 
> Daniel Berlin <dberlin@dberlin.org> writes:
> 
> > On Tue, 24 Sep 2002, Jim Blandy wrote:
> > 
> > > 
> > > > Also, for what it's worth, I'm still not ready to completely give up
> > > > on representing members of classes via a dictionary; that would
> > > > provide another place where a linear dictionary environment could be
> > > > useful.
> > > 
> > > I agree, but it's worth noting that `struct symbol' is 52 bytes long
> > > on a Pentium, whereas `struct field' and `struct fn_field' are 16
> > > bytes long.  
> > > 
> > > Not that that necessarily matters.  We know GDB does have memory
> > > consumption problems, but I have never seen those problems really
> > > analyzed. 
> > 
> > Um, I have these statistics, but I need to know *exactly* what you want to 
> > know to be able to give them to you.
> 
> On large C++ programs, how much of a difference would it make if we
> used `struct symbol' objects (52 bytes long) to represent data members
> and member functions, instead of `struct field' and `struct fn_field'
> objects (both 16 bytes long)?

I'm not sure this is the way to go - we could have a dictionary of
something other than struct symbol, probably.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24  6:19 ` Daniel Berlin
  2002-09-24  7:06   ` Daniel Jacobowitz
@ 2002-09-24 21:01   ` Jim Blandy
  2002-09-25  5:54     ` Daniel Jacobowitz
  1 sibling, 1 reply; 28+ messages in thread
From: Jim Blandy @ 2002-09-24 21:01 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: david carlton, gdb


Daniel Berlin <dberlin@dberlin.org> writes:

> On Tue, 24 Sep 2002, Jim Blandy wrote:
> 
> > 
> > > Also, for what it's worth, I'm still not ready to completely give up
> > > on representing members of classes via a dictionary; that would
> > > provide another place where a linear dictionary environment could be
> > > useful.
> > 
> > I agree, but it's worth noting that `struct symbol' is 52 bytes long
> > on a Pentium, whereas `struct field' and `struct fn_field' are 16
> > bytes long.  
> > 
> > Not that that necessarily matters.  We know GDB does have memory
> > consumption problems, but I have never seen those problems really
> > analyzed. 
> 
> Um, I have these statistics, but I need to know *exactly* what you want to 
> know to be able to give them to you.

On large C++ programs, how much of a difference would it make if we
used `struct symbol' objects (52 bytes long) to represent data members
and member functions, instead of `struct field' and `struct fn_field'
objects (both 16 bytes long)?


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 23:50 Jim Blandy
  2002-09-24  6:19 ` Daniel Berlin
@ 2002-09-24  9:49 ` David Carlton
  1 sibling, 0 replies; 28+ messages in thread
From: David Carlton @ 2002-09-24  9:49 UTC (permalink / raw)
  To: Jim Blandy; +Cc: gdb

On Tue, 24 Sep 2002 01:35:30 -0500, Jim Blandy <jimb@redhat.com> said:

>> Also, for what it's worth, I'm still not ready to completely give up
>> on representing members of classes via a dictionary; that would
>> provide another place where a linear dictionary environment could be
>> useful.

> I agree, but it's worth noting that `struct symbol' is 52 bytes long
> on a Pentium, whereas `struct field' and `struct fn_field' are 16
> bytes long.  

> Not that that necessarily matters.  We know GDB does have memory
> consumption problems, but I have never seen those problems really
> analyzed.

Right.  It wouldn't be something to do lightly, but on the other hand
it's not something to rule out of hand without appropriate profiling.

One possible alternative would be to add a dictionary to each compound
data structure (or, as you suggest, only to certain compound data
structures) that knows how to test whether or not a name is in that
structure, but that always returns the same symbol for a match.  That
would let us get rid of the is_a_field_of_this argument to
lookup_symbol; C++ code could then handle that symbol as a possible
return value and would have enough information to do the right thing,
whereas non-C++ code could ignore the possibility (or gdb_assert that
it was never returned), and it would work automatically with a correct
C++ name lookup algorithm.

It would, of course, still increase GDB's memory usage, but it
wouldn't increase it for every single field.

David Carlton
carlton@math.stanford.edu

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-24  6:19 ` Daniel Berlin
@ 2002-09-24  7:06   ` Daniel Jacobowitz
  2002-09-24 21:01   ` Jim Blandy
  1 sibling, 0 replies; 28+ messages in thread
From: Daniel Jacobowitz @ 2002-09-24  7:06 UTC (permalink / raw)
  To: Daniel Berlin; +Cc: Jim Blandy, david carlton, gdb

On Tue, Sep 24, 2002 at 09:19:54AM -0400, Daniel Berlin wrote:
> 
> 
> On Tue, 24 Sep 2002, Jim Blandy wrote:
> 
> > 
> > > Also, for what it's worth, I'm still not ready to completely give up
> > > on representing members of classes via a dictionary; that would
> > > provide another place where a linear dictionary environment could be
> > > useful.
> > 
> > I agree, but it's worth noting that `struct symbol' is 52 bytes long
> > on a Pentium, whereas `struct field' and `struct fn_field' are 16
> > bytes long.  
> > 
> > Not that that necessarily matters.  We know GDB does have memory
> > consumption problems, but I have never seen those problems really
> > analyzed. 
> 
> Um, I have these statistics, but I need to know *exactly* what you want to 
> know to be able to give them to you.
> There's just too many combinations to make a nice table or something.
> Even something like STABS vs DWARF2 has too many combinations because 
> it changes with each GCC version, and even sometimes with binutils 
> versions.
> 
> > All the memory could be in physnames, for all we know. 
> In general, DWARF2 C++ apps, a lot of it is.

I'm actually making good progress on cutting down our physname use.  It
would help if someone commented on the testsuite framework I posted
last week... that's my next step.

-- 
Daniel Jacobowitz
MontaVista Software                         Debian GNU/Linux Developer


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
  2002-09-23 23:50 Jim Blandy
@ 2002-09-24  6:19 ` Daniel Berlin
  2002-09-24  7:06   ` Daniel Jacobowitz
  2002-09-24 21:01   ` Jim Blandy
  2002-09-24  9:49 ` David Carlton
  1 sibling, 2 replies; 28+ messages in thread
From: Daniel Berlin @ 2002-09-24  6:19 UTC (permalink / raw)
  To: Jim Blandy; +Cc: david carlton, gdb



On Tue, 24 Sep 2002, Jim Blandy wrote:

> 
> > Also, for what it's worth, I'm still not ready to completely give up
> > on representing members of classes via a dictionary; that would
> > provide another place where a linear dictionary environment could be
> > useful.
> 
> I agree, but it's worth noting that `struct symbol' is 52 bytes long
> on a Pentium, whereas `struct field' and `struct fn_field' are 16
> bytes long.  
> 
> Not that that necessarily matters.  We know GDB does have memory
> consumption problems, but I have never seen those problems really
> analyzed. 

Um, I have these statistics, but I need to know *exactly* what you want to 
know to be able to give them to you.
There's just too many combinations to make a nice table or something.
Even something like STABS vs DWARF2 has too many combinations because 
it changes with each GCC version, and even sometimes with binutils 
versions.

> All the memory could be in physnames, for all we know. 
In general, DWARF2 C++ apps, a lot of it is.

--Dan


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: suggestion for dictionary representation
@ 2002-09-23 23:50 Jim Blandy
  2002-09-24  6:19 ` Daniel Berlin
  2002-09-24  9:49 ` David Carlton
  0 siblings, 2 replies; 28+ messages in thread
From: Jim Blandy @ 2002-09-23 23:50 UTC (permalink / raw)
  To: david carlton; +Cc: gdb

> Also, for what it's worth, I'm still not ready to completely give up
> on representing members of classes via a dictionary; that would
> provide another place where a linear dictionary environment could be
> useful.

I agree, but it's worth noting that `struct symbol' is 52 bytes long
on a Pentium, whereas `struct field' and `struct fn_field' are 16
bytes long.  

Not that that necessarily matters.  We know GDB does have memory
consumption problems, but I have never seen those problems really
analyzed.  All the memory could be in physnames, for all we know.  But
I'd want to have some sense of the impact before I made the change.
(Perhaps a heavy C++ user could stick a `char foo[52 - 16]' at the end
of `struct field' and `struct fn_field', and tell us how that goes.)

An intermediate step would be to simply add a `struct dictionary *' to
`struct cplus_struct_type'.  We could use that right off the bat for
nested classes, typedefs, and enums.  We could then migrate other
stuff over there incrementally.

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2002-09-27 18:28 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2002-09-22 19:59 suggestion for dictionary representation Jim Blandy
2002-09-22 20:11 ` Daniel Jacobowitz
2002-09-23 10:38   ` David Carlton
2002-09-23 17:34     ` Daniel Berlin
2002-09-23 18:39       ` Daniel Jacobowitz
2002-09-23 21:28         ` Daniel Berlin
2002-09-23 21:38           ` Eli Zaretskii
2002-09-23 21:44             ` Daniel Berlin
2002-09-23 21:47               ` Eli Zaretskii
2002-09-23 21:54                 ` Daniel Berlin
2002-09-24  9:33       ` David Carlton
2002-09-24 10:42         ` Daniel Berlin
2002-09-24 10:53           ` David Carlton
2002-09-24 20:01         ` Jim Blandy
2002-09-24 20:50           ` Daniel Berlin
2002-09-23 18:28     ` Daniel Jacobowitz
2002-09-24  3:51     ` Paul N. Hilfinger
2002-09-24 19:52     ` Jim Blandy
2002-09-24 20:37       ` Elena Zannoni
2002-09-24 20:53       ` Daniel Berlin
2002-09-23 23:50 Jim Blandy
2002-09-24  6:19 ` Daniel Berlin
2002-09-24  7:06   ` Daniel Jacobowitz
2002-09-24 21:01   ` Jim Blandy
2002-09-25  5:54     ` Daniel Jacobowitz
2002-09-27 11:23       ` Jim Blandy
2002-09-27 11:28         ` Daniel Jacobowitz
2002-09-24  9:49 ` David Carlton

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox