From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-46168-listarch-gdb=sources.redhat.com@sourceware.org>
Received: (qmail 48110 invoked by alias); 22 Nov 2017 02:10:46 -0000
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb.sourceware.org>
List-Subscribe: <mailto:gdb-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb/>
List-Post: <mailto:gdb@sourceware.org>
List-Help: <mailto:gdb-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-owner@sourceware.org
Received: (qmail 47286 invoked by uid 89); 22 Nov 2017 02:10:39 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.8 required=5.0 tests=BAYES_00,KAM_LAZY_DOMAIN_SECURITY,KAM_SHORT,KAM_STOCKGEN,KB_WAM_FROM_NAME_SINGLEWORD,SPF_HELO_PASS,T_RP_MATCHES_RCVD autolearn=no version=3.3.2 spammy=Everyone, measuring, Firefox's, Firefoxs
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Wed, 22 Nov 2017 02:10:37 +0000
Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id 4067F6E770;	Wed, 22 Nov 2017 02:10:36 +0000 (UTC)
Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4])	by smtp.corp.redhat.com (Postfix) with ESMTP id 6E3465C885;	Wed, 22 Nov 2017 02:10:35 +0000 (UTC)
Subject: Re: Note on choosing string hash functions
To: Dmitry Antipov <dantipov@nvidia.com>, gdb@sourceware.org
References: <33c45098-17a4-4c8a-fb14-137e70c7bb3f@nvidia.com> <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com>
From: Pedro Alves <palves@redhat.com>
Message-ID: <f85c22af-a467-e4b9-611f-8c50ed10a9f6@redhat.com>
Date: Wed, 22 Nov 2017 02:10:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <4fc8cd33-a362-ddf5-9a7c-e69eab385587@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-11/txt/msg00021.txt.bz2

On 11/17/2017 01:42 PM, Pedro Alves wrote:
> On 11/17/2017 09:48 AM, Dmitry Antipov wrote:
>> I'm curious why 'htab_hash_string' (libiberty/hashtab.c) uses:
>>
>> r = r * 67 + c - 113
>>
>> but 'SYMBOL_HASH_NEXT' (gdb/minsyms.h) prefers an extra 'tolower' with:
>>
>> ((hash) * 67 + tolower ((unsigned char) (c)) - 113)
>>
>> Everyone assumes that 'tolower' is simple, fast, and usually implemented
>> as a macro. But when >1M of mangled C++ symbols should be hashed, results
>> may be somewhat surprising ('perf report'):
> 
> I've noticed tolower show up in perf too, along with isspace,
> in strncmp_iw/strncmp_iw_with_mode and friends.
> 
> I think the tolower is there for source languages that
> are case insensitive, like Ada and Fortran.  (Or if you do
> "set case-sensitive off", but I think that's a
> misfeature...).
> 
> I haven't explicitly tried measuring perf with this patch:
> 
>  https://sourceware.org/ml/gdb-patches/2017-06/msg00022.html
> 
> but I think it should improve performance significantly,
> since safe-ctype.h is locale independent.

I looked at performance now, and while there's an improvement,
it isn't nearly as significant as I'd hope.

> (
> Of course, this ignores non-ASCII code points, but I'm not
> sure we really handle non-ASCII correctly everywhere else,
> i.e., whether it makes a difference today.  Certainly the
> locale when the program was compiled generally has no relation
> to the current user's locale, even though they frequently
> match.  If it does make a difference, one way to handle
> it that may be still cheap is to hash all non-ASCII (and utf8
> multi-byte sequences) to the same, while still doing proper
> lowercase comparisons when actually comparing symbols that end
> up in the bucket.  Something like:
> 
> #define SYMBOL_HASH_NEXT(hash, c)			\
>   ((hash) * 67 + ((c & 0x80) ? 'z' : TOLOWER ((unsigned char) (c))) - 113)
> 
> I'd assume that most symbol names are wholly or at least mostly
> ASCII and that that wouldn't result in too many collisions.
> 
> But then again, I don't really know how compilers for case-insensitive
> languages handle non-ASCII symbol names in different cases, especially
> considering multibyte sequences.

I'm looking at merging the rest of that series to master, and
so I've been looking at this issue today, mostly to convince myself
that the tolower -> TOLOWER change can't cause a regression.

It could cause a difference for case-insensitive languages,
if e.g., Latin-1 encoded symbol names could appear in the minimal
symbol tables, _and_ GDB is running with a Latin-1 locale too.
In that case, with tolower, we'd hash "funcÃ¡" / "funcÃ" to the same,
but not with TOLOWER, which is strictly ASCII.
I'm pretty convinced that scenario is not real.

First, nowadays UTF-8 locales are pretty much the norm in most
distros (over other ASCII-superset charsets), and tolower on the
individual bytes does nothing on UTF-8.

Then, I played with making Ada/gnat and both Latin-1 and UTF-8 sources
files (the latter with "pragma Wide_Character_Encoding (UTF8)"), and
what I discovered was that Ada's encoding/mangling guarantees that only
ASCII characters end up in mangled names.  From gcc/ada/namet.ads:

~~~
--    Identifiers        Stored with upper case letters folded to lower case.
--                       Upper half (16#80# bit set) and wide characters are
--                       stored in an encoded form (Uhh for upper half char,
--                       Whhhh for wide characters, WWhhhhhhhh as provided by
--                       the routine Append_Encoded, where hh are hex
--                       digits for the character code using lower case a-f).
--                       Normally the use of U or W in other internal names is
--                       avoided, but these letters may be used in internal
--                       names (without this special meaning), if they appear
--                       as the last character of the name, or they are
--                       followed by an upper case letter (other than the WW
--                       sequence), or an underscore.
~~~

Funny enough, GDB doesn't grok this Uhh/WWhhhhhhhh encoding today.
(I wrote a quick patch to teach GDB about it, to help convince myself,
though as is, it only works when gdb's charset/locale is UTF-8.)

Then I looked at Fortran, but I couldn't make gfortran use/grok
non-ASCII identifiers.  Looking at 
  https://gcc.gnu.org/onlinedocs/gfortran/SELECTED_005fCHAR_005fKIND.html
and
  https://github.com/jacobwilliams/json-fortran/wiki/Unicode-How-To

I believe that's just not supported.  So I moved on.

Then there's C/C++.  But even here both GCC and Clang
only allow UTF-8 in identifiers (GCC via UCNs, and Clang because
that's the only charset that it supports), and since
GCC uses UTF-8 internally, I believe that that's the encoding it
always uses for mangling.  That's what I see in my experiments.
I tried --input-charset=ISO-8859-1 --exec-charset=ISO-8859-1
with a Latin-1 source file, and I made sure I had a non-ASCII
identifier as well as a Latin-1 string literal in there, and then confirmed
by debugging the result with gdb that the string literal really was
Latin-1.  Still, the mangled identifier name is encoded in UTF-8
(confirmed with hexdump).  AFAICT, I always see UTF-8 in strings in DWARF
too (and the DWARF spec recommends this).  So tolower/TOLOWER can't make
a difference here either.    That's good, it means the ABI/mangling doesn't
depend on whatever's the host charset, and neither does the DWARF.  It's
possible other compilers/mangling schemes/ABIs do something else, but
at this point I'm not aware of any, but I hope that that's nothing we'd
need to support.

I think Rust is UTF-8 always.  I'd assume Go too, given
its authorship/ancestry...

In the performance aspect, there's a little bit of
improvement.  With this:

$ cat ff.cmd 
set pagination off

define ff
  attach 19018 # Firefox's PID.
  thread apply all bt
  detach
  file
  end

set $count = 10
while $count != 0
  ff
  set $count = $count - 1
  end

$ time $g -q --batch --readnever -x ff.cmd

I observe between 3% / 10% time drop.

(I used Joel's version of --readnever from here:
 https://sourceware.org/ml/gdb-patches/2016-07/msg00103.html)

So in sum, I'm pretty convinced the patch is safe as is.

Thanks,
Pedro Alves