Re: Note on choosing string hash functions

Mirror of the gdb mailing list
 help / color / mirror / Atom feed

From: Pedro Alves <palves@redhat.com>
To: Dmitry Antipov <dantipov@nvidia.com>, gdb@sourceware.org
Subject: Re: Note on choosing string hash functions
Date: Fri, 17 Nov 2017 13:42:00 -0000	[thread overview]
Message-ID: <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com> (raw)
In-Reply-To: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com>

On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
> 
> r = r * 67 + c - 113
> 
> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
> 
> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
> 
> Everyone assumes that 'tolower' is simple, fast, and usually implemented
> as a macro. But when >1M of mangled C++ symbols should be hashed, results
> may be somewhat surprising ('perf report'):

I've noticed tolower show up in perf too, along with isspace,
in strncmp_iw/strncmp_iw_with_mode and friends.

I think the tolower is there for source languages that
are case insensitive, like Ada and Fortran.  (Or if you do
"set case-sensitive off", but I think that's a
misfeature...).

I haven't explicitly tried measuring perf with this patch:

 https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html

but I think it should improve performance significantly,
since safe-ctype.h is locale independent.

(
Of course, this ignores non-ASCII code points, but I'm not
sure we really handle non-ASCII correctly everywhere else,
i.e., whether it makes a difference today.  Certainly the
locale when the program was compiled generally has no relation
to the current user's locale, even though they frequently
match.  If it does make a difference, one way to handle
it that may be still cheap is to hash all non-ASCII (and utf8
multi-byte sequences) to the same, while still doing proper
lowercase comparisons when actually comparing symbols that end
up in the bucket.  Something like:

#define SYMBOL_HASH_NEXT(hash, c)			\
  ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)

I'd assume that most symbol names are wholly or at least mostly
ASCII and that that wouldn't result in too many collisions.

But then again, I don't really know how compilers for case-insensitive
languages handle non-ASCII symbol names in different cases, especially
considering multibyte sequences.
)

> 
> Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3
> -march=native -mtune=native,
> "remote" (both gdb and gdbserver on the same system) debugging Firefox
> debug build, reports was generated
> with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /"
> -ex "set pagination off" \
> -ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'.

Thanks,
Pedro Alves

next prev parent reply	other threads:[~2017-11-17 13:42 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-11-17  9:50 Dmitry Antipov
2017-11-17 13:42 ` Pedro Alves [this message]
2017-11-22  2:10   ` Pedro Alves
2017-11-22  2:25     ` Pedro Alves
2017-11-28 10:35     ` Dmitry Antipov
2017-11-28 12:00       ` Pedro Alves
2017-11-28 15:13         ` Dmitry Antipov
2017-11-28 15:23           ` Pedro Alves

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com \
    --to=palves@redhat.com \
    --cc=dantipov@nvidia.com \
    --cc=gdb@sourceware.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox