From: Daniel Berlin <dberlin@dberlin.org>
To: David Carlton <carlton@math.stanford.edu>
Cc: Daniel Jacobowitz <drow@mvista.com>, Jim Blandy <jimb@redhat.com>,
gdb@sources.redhat.com
Subject: Re: suggestion for dictionary representation
Date: Tue, 24 Sep 2002 10:42:00 -0000 [thread overview]
Message-ID: <F8B4847E-CFE4-11D6-AF7A-000393575BCC@dberlin.org> (raw)
In-Reply-To: <ro1hegfqre1.fsf@jackfruit.Stanford.EDU>
On Tuesday, September 24, 2002, at 12:33 PM, David Carlton wrote:
> On Mon, 23 Sep 2002 20:34:50 -0400, Daniel Berlin
> <dberlin@dberlin.org> said:
>
>>> I'm also curious about how it would affect the speed of reading in
>>> symbols. Right now, that should be O(n), where n is the number of
>>> global symbols, right?
>
>>> If we used expandable hash tables, then I think it would be
>>> amortized O(n) and with the constant factor larger.
>
>> Our string hash function is O(N) right now (as are most). Hash
>> tables are only O(N) when the hash function is O(1).
>
> [ Here, of course, my 'n' is the number of global symbols, and
> Daniel's 'N' is the maximum symbol length. ]
Yup.
In the future i'll just choose different letters.
(Who the heck thought it would be smart to have a big Oh and little Oh,
fer instance)
>
> This is true, but I'm not sure that it's relevant to this sort of
> theoretical analysis. After all, skip lists depend on N, as well:
> they put symbols in order, and the amount of time to do that depends
> on the length of the symbols.
I was waiting for someone to bring this up.
You *also* have this N in hash tables in another form: Key comparisons.
It's not just the hash function.
Regardless of your method of hashing (chaining or probing), you have to
compare the key.
The better the hash function, the less key comparisons you do (because
the chain length is shorter or the number of probes is less).
If your average chain lengths go above log n (the number of keys we'd
have to strcmp in the skiplist), which in fact, for large files, they
were, then the skiplist will likely be better, because it'll perform
less comparisons, *and* doesn't have do to the hash function
calculation.
>
> And it's entirely reasonable to think of 'N' as a constant.
But it's really not.
You can't predict the lengths of symbol names over the next ten years.
I'd bet if one looked, you'd seem it's a generally increasing curve for
C++ apps, as they become more complex, have more namespaces, etc.
For C apps, it's probably constant.
> Or
> perhaps two constants: one for C programs with short names, one for
> C++ programs with long names. (And I'm not really sure that the C++
> names will ultimately turn out to be that much longer: once the proper
> namespace support has been added, then looking up a C++ name will
> probably be a multistep process (looking up pieces of the demangled
> name in turn), and for each those steps, we'll be looking at a name
> that will be of the appropriate length for a C program.)
>
Maybe, maybe not.
Certainly, if it goes multistep, i'd wager that even in 10 years, the
average symbol length of individual parts won't go up very much, even
if the length of the entire symbol name goes up.
> But even if we consider N to be a constant, your broader point stands:
> the constant factors that different algorithms differ by are
> important, and in practice large constants can have more of an affect
> than logarithmic terms. Fortunately, one of the advantages of the
> refactoring that I'm doing right now is that it'll be easy to drop in
> different dictionary implementations for testing purposes: it should
> boil down to writing the code that we'd have to do to get skip list
> support anyways, changing one or two function calls, and recompiling.
>
Works for me.
I would actually suggest neither hash tables or skiplists, but ternary
search trees.
They compare *very* well to hashes, and beat out skip lists as well.
The algorithms are easy (and i already added them to libiberty), and
they were engineered as structures for symbol tables.
They support the "fast completion" property of skiplists (since they
are ordered), but the constant you pay to insert nodes is much lower.
They also support easy pattern matching searches.
The only problem is that i was too lazy to look up general n-way tree
balancing algorithms and implement one, so the current implementation
is unbalanced.
Thus, if you insert in sorted order, things will get bad.
See http://www.ddj.com/documents/s=921/ddj9804a/9804a.htm
Asymptotic analysis of them isn't as easy as one would think.
In fact, it's downright difficult.
But, it's been done.
If you can grasp all the math:
http://users.info.unicaen.fr/~clement/SODA/all1.html
But, as you said, this can all be hashed out later and a few function
calls changed if it turns out to be better.
> David Carlton
> carlton@math.stanford.edu
next prev parent reply other threads:[~2002-09-24 17:42 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2002-09-22 19:59 Jim Blandy
2002-09-22 20:11 ` Daniel Jacobowitz
2002-09-23 10:38 ` David Carlton
2002-09-23 17:34 ` Daniel Berlin
2002-09-23 18:39 ` Daniel Jacobowitz
2002-09-23 21:28 ` Daniel Berlin
2002-09-23 21:38 ` Eli Zaretskii
2002-09-23 21:44 ` Daniel Berlin
2002-09-23 21:47 ` Eli Zaretskii
2002-09-23 21:54 ` Daniel Berlin
2002-09-24 9:33 ` David Carlton
2002-09-24 10:42 ` Daniel Berlin [this message]
2002-09-24 10:53 ` David Carlton
2002-09-24 20:01 ` Jim Blandy
2002-09-24 20:50 ` Daniel Berlin
2002-09-23 18:28 ` Daniel Jacobowitz
2002-09-24 3:51 ` Paul N. Hilfinger
2002-09-24 19:52 ` Jim Blandy
2002-09-24 20:37 ` Elena Zannoni
2002-09-24 20:53 ` Daniel Berlin
2002-09-23 23:50 Jim Blandy
2002-09-24 6:19 ` Daniel Berlin
2002-09-24 7:06 ` Daniel Jacobowitz
2002-09-24 21:01 ` Jim Blandy
2002-09-25 5:54 ` Daniel Jacobowitz
2002-09-27 11:23 ` Jim Blandy
2002-09-27 11:28 ` Daniel Jacobowitz
2002-09-24 9:49 ` David Carlton
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=F8B4847E-CFE4-11D6-AF7A-000393575BCC@dberlin.org \
--to=dberlin@dberlin.org \
--cc=carlton@math.stanford.edu \
--cc=drow@mvista.com \
--cc=gdb@sources.redhat.com \
--cc=jimb@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox