From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 1616 invoked by alias); 17 Nov 2017 13:42:34 -0000 Mailing-List: contact gdb-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-owner@sourceware.org Received: (qmail 1483 invoked by uid 89); 17 Nov 2017 13:42:33 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=-0.2 required=5.0 tests=BAYES_00,KAM_STOCKGEN,KB_WAM_FROM_NAME_SINGLEWORD,RP_MATCHES_RCVD,SPF_HELO_PASS autolearn=no version=3.3.2 spammy=Everyone, bucket, Hx-languages-length:2424 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Fri, 17 Nov 2017 13:42:30 +0000 Received: from smtp.corp.redhat.com (int-mx01.intmail.prod.int.phx2.redhat.com [10.5.11.11]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id EEB452DB709; Fri, 17 Nov 2017 13:42:26 +0000 (UTC) Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4]) by smtp.corp.redhat.com (Postfix) with ESMTP id 2A19460562; Fri, 17 Nov 2017 13:42:25 +0000 (UTC) Subject: Re: Note on choosing string hash functions To: Dmitry Antipov , gdb@sourceware.org References: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com> From: Pedro Alves Message-ID: <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com> Date: Fri, 17 Nov 2017 13:42:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-SW-Source: 2017-11/txt/msg00010.txt.bz2 On 11/17/2017 09:48 AM, Dmitry Antipov wrote: > I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses: > > r = r * 67 + c - 113 > > but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with: > > ((hash) * 67 + tolower ((unsigned char) (c)) - 113) > > Everyone assumes that 'tolower' is simple, fast, and usually implemented > as a macro. But when >1M of mangled C++ symbols should be hashed, results > may be somewhat surprising ('perf report'): I've noticed tolower show up in perf too, along with isspace, in strncmp_iw/strncmp_iw_with_mode and friends. I think the tolower is there for source languages that are case insensitive, like Ada and Fortran. (Or if you do "set case-sensitive off", but I think that's a misfeature...). I haven't explicitly tried measuring perf with this patch: https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html but I think it should improve performance significantly, since safe-ctype.h is locale independent. ( Of course, this ignores non-ASCII code points, but I'm not sure we really handle non-ASCII correctly everywhere else, i.e., whether it makes a difference today. Certainly the locale when the program was compiled generally has no relation to the current user's locale, even though they frequently match. If it does make a difference, one way to handle it that may be still cheap is to hash all non-ASCII (and utf8 multi-byte sequences) to the same, while still doing proper lowercase comparisons when actually comparing symbols that end up in the bucket. Something like: #define SYMBOL_HASH_NEXT(hash, c) \ ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113) I'd assume that most symbol names are wholly or at least mostly ASCII and that that wouldn't result in too many collisions. But then again, I don't really know how compilers for case-insensitive languages handle non-ASCII symbol names in different cases, especially considering multibyte sequences. ) > > Observed on current binutils-gdb trunk, glibc 2.25, gcc 7.2.1, -g -O3 > -march=native -mtune=native, > "remote" (both gdb and gdbserver on the same system) debugging Firefox > debug build, reports was generated > with 'perf record -F 10000 gdb -batch --readnever -q -ex "set sysroot /" > -ex "set pagination off" \ > -ex "target extended-remote :5555" -ex "thread apply all bt" -ex "quit"'. Thanks, Pedro Alves