From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 48110 invoked by alias); 22 Nov 2017 02:10:46 -0000 Mailing-List: contact gdb-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-owner@sourceware.org Received: (qmail 47286 invoked by uid 89); 22 Nov 2017 02:10:39 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=0.8 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,KAM_STOCKGEN,KB_WAM_FROM_NAME_SINGLEWORD,SPF_HELO_PASS,T_RP_MATCHES_RCVD autolearn=no version=3.3.2 spammy=Everyone, measuring, Firefox's, Firefoxs X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 22 Nov 2017 02:10:37 +0000 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 4067F6E770; Wed, 22 Nov 2017 02:10:36 +0000 (UTC) Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4]) by smtp.corp.redhat.com (Postfix) with ESMTP id 6E3465C885; Wed, 22 Nov 2017 02:10:35 +0000 (UTC) Subject: Re: Note on choosing string hash functions To: Dmitry Antipov , gdb@sourceware.org References: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com> <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com> From: Pedro Alves Message-ID: Date: Wed, 22 Nov 2017 02:10:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-SW-Source: 2017-11/txt/msg00021.txt.bz2 On 11/17/2017 01:42 PM, Pedro Alves wrote: > On 11/17/2017 09:48 AM, Dmitry Antipov wrote: >> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses: >> >> r = r * 67 + c - 113 >> >> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with: >> >> ((hash) * 67 + tolower ((unsigned char) (c)) - 113) >> >> Everyone assumes that 'tolower' is simple, fast, and usually implemented >> as a macro. But when >1M of mangled C++ symbols should be hashed, results >> may be somewhat surprising ('perf report'): > > I've noticed tolower show up in perf too, along with isspace, > in strncmp_iw/strncmp_iw_with_mode and friends. > > I think the tolower is there for source languages that > are case insensitive, like Ada and Fortran. (Or if you do > "set case-sensitive off", but I think that's a > misfeature...). > > I haven't explicitly tried measuring perf with this patch: > > https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html > > but I think it should improve performance significantly, > since safe-ctype.h is locale independent. I looked at performance now, and while there's an improvement, it isn't nearly as significant as I'd hope. > ( > Of course, this ignores non-ASCII code points, but I'm not > sure we really handle non-ASCII correctly everywhere else, > i.e., whether it makes a difference today. Certainly the > locale when the program was compiled generally has no relation > to the current user's locale, even though they frequently > match. If it does make a difference, one way to handle > it that may be still cheap is to hash all non-ASCII (and utf8 > multi-byte sequences) to the same, while still doing proper > lowercase comparisons when actually comparing symbols that end > up in the bucket. Something like: > > #define SYMBOL_HASH_NEXT(hash, c) \ > ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113) > > I'd assume that most symbol names are wholly or at least mostly > ASCII and that that wouldn't result in too many collisions. > > But then again, I don't really know how compilers for case-insensitive > languages handle non-ASCII symbol names in different cases, especially > considering multibyte sequences. I'm looking at merging the rest of that series to master, and so I've been looking at this issue today, mostly to convince myself that the tolower -> TOLOWER change can't cause a regression. It could cause a difference for case-insensitive languages, if e.g., Latin-1 encoded symbol names could appear in the minimal symbol tables, _and_ GDB is running with a Latin-1 locale too. In that case, with tolower, we'd hash "funcá" / "funcÁ" to the same, but not with TOLOWER, which is strictly ASCII. I'm pretty convinced that scenario is not real. First, nowadays UTF-8 locales are pretty much the norm in most distros (over other ASCII-superset charsets), and tolower on the individual bytes does nothing on UTF-8. Then, I played with making Ada/gnat and both Latin-1 and UTF-8 sources files (the latter with "pragma Wide_Character_Encoding (UTF8)"), and what I discovered was that Ada's encoding/mangling guarantees that only ASCII characters end up in mangled names. From gcc/ada/namet.ads: ~~~ -- Identifiers Stored with upper case letters folded to lower case. -- Upper half (16#80# bit set) and wide characters are -- stored in an encoded form (Uhh for upper half char, -- Whhhh for wide characters, WWhhhhhhhh as provided by -- the routine Append_Encoded, where hh are hex -- digits for the character code using lower case a-f). -- Normally the use of U or W in other internal names is -- avoided, but these letters may be used in internal -- names (without this special meaning), if they appear -- as the last character of the name, or they are -- followed by an upper case letter (other than the WW -- sequence), or an underscore. ~~~ Funny enough, GDB doesn't grok this Uhh/WWhhhhhhhh encoding today. (I wrote a quick patch to teach GDB about it, to help convince myself, though as is, it only works when gdb's charset/locale is UTF-8.) Then I looked at Fortran, but I couldn't make gfortran use/grok non-ASCII identifiers. Looking at https://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html and https://github.com/jacobwilliams/json-fortran/wiki/Unicode-How-To I believe that's just not supported. So I moved on. Then there's C/C++. But even here both GCC and Clang only allow UTF-8 in identifiers (GCC via UCNs, and Clang because that's the only charset that it supports), and since GCC uses UTF-8 internally, I believe that that's the encoding it always uses for mangling. That's what I see in my experiments. I tried --input-charset=ISO-8859-1 --exec-charset=ISO-8859-1 with a Latin-1 source file, and I made sure I had a non-ASCII identifier as well as a Latin-1 string literal in there, and then confirmed by debugging the result with gdb that the string literal really was Latin-1. Still, the mangled identifier name is encoded in UTF-8 (confirmed with hexdump). AFAICT, I always see UTF-8 in strings in DWARF too (and the DWARF spec recommends this). So tolower/TOLOWER can't make a difference here either. That's good, it means the ABI/mangling doesn't depend on whatever's the host charset, and neither does the DWARF. It's possible other compilers/mangling schemes/ABIs do something else, but at this point I'm not aware of any, but I hope that that's nothing we'd need to support. I think Rust is UTF-8 always. I'd assume Go too, given its authorship/ancestry... In the performance aspect, there's a little bit of improvement. With this: $ cat ff.cmd set pagination off define ff attach 19018 # Firefox's PID. thread apply all bt detach file end set $count = 10 while $count != 0 ff set $count = $count - 1 end $ time $g -q --batch --readnever -x ff.cmd I observe between 3% / 10% time drop. (I used Joel's version of --readnever from here: https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html) So in sum, I'm pretty convinced the patch is safe as is. Thanks, Pedro Alves