From: Pedro Alves <palves@redhat.com>
To: Dmitry Antipov <dantipov@nvidia.com>, gdb@sourceware.org
Subject: Re: Note on choosing string hash functions
Date: Fri, 17 Nov 2017 13:42:00 -0000 [thread overview]
Message-ID: <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com> (raw)
In-Reply-To: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com>
On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>
> r = r * 67 + c - 113
>
> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>
> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>
> Everyone assumes that 'tolower' is simple, fast, and usually implemented
> as a macro. But when >1M of mangled C++ symbols should be hashed, results
> may be somewhat surprising ('perf report'):
I've noticed tolower show up in perf too, along with isspace,
in strncmp_iw/strncmp_iw_with_mode and friends.
I think the tolower is there for source languages that
are case insensitive, like Ada and Fortran. (Or if you do
"set case-sensitive off", but I think that's a
misfeature...).
I haven't explicitly tried measuring perf with this patch:
https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html
but I think it should improve performance significantly,
since safe-ctype.h is locale independent.
(
Of course, this ignores non-ASCII code points, but I'm not
sure we really handle non-ASCII correctly everywhere else,
i.e., whether it makes a difference today. Certainly the
locale when the program was compiled generally has no relation
to the current user's locale, even though they frequently
match. If it does make a difference, one way to handle
it that may be still cheap is to hash all non-ASCII (and utf8
multi-byte sequences) to the same, while still doing proper
lowercase comparisons when actually comparing symbols that end
up in the bucket. Something like:
#define SYMBOL_HASH_NEXT(hash, c) \
((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)
I'd assume that most symbol names are wholly or at least mostly
ASCII and that that wouldn't result in too many collisions.
But then again, I don't really know how compilers for case-insensitive
languages handle non-ASCII symbol names in different cases, especially
considering multibyte sequences.
)
>
> Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3
> -march=native -mtune=native,
> "remote" (both gdb and gdbserver on the same system) debugging Firefox
> debug build, reports was generated
> with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /"
> -ex "set pagination off" \
> -ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'.
Thanks,
Pedro Alves
next prev parent reply other threads:[~2017-11-17 13:42 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-17 9:50 Dmitry Antipov
2017-11-17 13:42 ` Pedro Alves [this message]
2017-11-22 2:10 ` Pedro Alves
2017-11-22 2:25 ` Pedro Alves
2017-11-28 10:35 ` Dmitry Antipov
2017-11-28 12:00 ` Pedro Alves
2017-11-28 15:13 ` Dmitry Antipov
2017-11-28 15:23 ` Pedro Alves
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com \
--to=palves@redhat.com \
--cc=dantipov@nvidia.com \
--cc=gdb@sourceware.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox