From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-143521-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 49155 invoked by alias); 20 Nov 2017 11:56:48 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Received: (qmail 48722 invoked by uid 89); 20 Nov 2017 11:56:48 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: No, score=0.3 required=5.0 tests=BAYES_00,KB_WAM_FROM_NAME_SINGLEWORD,LIKELY_SPAM_SUBJECT,SPF_HELO_PASS,T_RP_MATCHES_RCVD autolearn=no version=3.3.2 spammy=fund, 1
X-HELO: mx1.redhat.com
Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 20 Nov 2017 11:56:46 +0000
Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id 985B86A7F3;	Mon, 20 Nov 2017 11:56:45 +0000 (UTC)
Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4])	by smtp.corp.redhat.com (Postfix) with ESMTP id B1B185E1DB;	Mon, 20 Nov 2017 11:56:44 +0000 (UTC)
Subject: Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
To: Simon Marchi <simark@simark.ca>, gdb-patches@sourceware.org
References: <5d721d13-d886-0400-db6b-76485c545142@redhat.com> <1511138515-25996-1-git-send-email-palves@redhat.com> <087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca>
From: Pedro Alves <palves@redhat.com>
Message-ID: <db3aa595-1d35-51f5-bd34-1aa04707a6c8@redhat.com>
Date: Mon, 20 Nov 2017 11:56:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0
MIME-Version: 1.0
In-Reply-To: <087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2017-11/txt/msg00393.txt.bz2

On 11/20/2017 01:38 AM, Simon Marchi wrote:

> Everything you said makes sense to me, the patch looks good to me.  I noted
> one comment and a typo below.

Thanks, see below.

>> +/* Starting from a search name, return the string that finds the upper
>> +   bound of all strings that start with SEARCH_NAME in a sorted name
>> +   list.  Returns the empty string to indicate that the upper bound is
>> +   the end of the list.  */
>> +
>> +static std::string
>> +make_sort_after_prefix_name (const char *search_name)
>> +{
>> +  /* When looking to complete "func", we find the upper bound of all
>> +     symbols that start with "func" by looking for where we'd insert
>> +     "func"-with-last-character-incremented, i.e. "fund".  */
>> +  std::string after = search_name;
>> +
>> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
>> +     character sets (e.g. Latin1 'Ã¿'), and we can't rule out compilers
>> +     allowing it in identifiers.  If we run into it, increment the
>> +     previous character instead and shorten the string.  If the very
>> +     first character turns out to be 0xff, then the upper bound is the
>> +     end of the list.
> 
> It's a bit of a nit, but I think this explanation could be a bit more
> precise, and maybe simpler.  Maybe you could just say that you strip all
> trailing 0xff characters, and increment the last non-0xff character of
> the string.  If the string is composed only of 0xff characters, then the
> upper bound is the end of the list.

My problem with that is that it wouldn't explain _why_ we strip
the 0xffs.

> 
> The "If the very first character turns out to be 0xff" threw me off a bit,
> because if you have the string "\xffa\xff", the upper bound will be "\xffb",
> not the end of the list, despite the very first character being 0xff.

I like that example.  How about the following.  It's even longer, but
I think it's justified.

/* Starting from a search name, return the string that finds the upper
   bound of all strings that start with SEARCH_NAME in a sorted name
   list.  Returns the empty string to indicate that the upper bound is
   the end of the list.  */

static std::string
make_sort_after_prefix_name (const char *search_name)
{
  /* When looking to complete "func", we find the upper bound of all
     symbols that start with "func" by looking for where we'd insert
     the closest string that would follow "func" in lexicographical
     order.  Usually, that's "func"-with-last-character-incremented,
     i.e. "fund".  Mind non-ASCII characters, though.  Usually those
     will be UTF-8 multi-byte sequences, but we can't be certain.
     Especially mind the 0xff character, which is a valid character in
     non-UTF-8 source character sets (e.g. Latin1 'Ã¿'), and we can't
     rule out compilers allowing it in identifiers.  Note that
     conveniently, strcmp/strcasecmp are specified to compare
     characters interpreted as unsigned char.  So what we do is treat
     the whole string as a base 255 number composed of a sequence of
     base 255 "digits" and add 1 to it.  I.e., adding 1 to 0xff wraps
     to 0, and carries 1 to the following more-significant position.
     If the very first character carries/overflows, then the upper
     bound is the end of the list.  Also the string after the empty
     string is also the empty string.

     Some examples of this operation:

       SEARCH_NAME  => "+1" RESULT

       "abc"        => "abd"
       "ab\xff"     => "ac"
       "\xffa\xff"  => "\xffb"
       "\xff"       => ""
       "\xff\xff"   => ""
       ""           => ""

     Then, with these symbols for example:

      func
      func1
      fund

     completing "func" looks for symbols between "func" and
     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
     which finds "func" and "func1", but not "fund".

     And with:

      funcÃ¿     (Latin1 'Ã¿' [0xff])
      funcÃ¿1
      fund

     completing "funcÃ¿" looks for symbols between "funcÃ¿" and "fund"
     (exclusive), which finds "funcÃ¿" and "funcÃ¿1", but not "fund".

     And with:

      Ã¿Ã¿        (Latin1 'Ã¿' [0xff])
      Ã¿Ã¿1

     completing "Ã¿" or "Ã¿Ã¿" looks for symbols between between "Ã¿Ã¿" and
     the end of the list.
  */
  std::string after = search_name;
  while (!after.empty () && (unsigned char) after.back () == 0xff)
    after.pop_back ();
  if (!after.empty ())
    after.back () = (unsigned char) after.back () + 1;
  return after;
}

>> +     completing "funcÃ¿", look for symbols between "funcÃ¿" and "fund"
> 
> looks

Fixed above.

Thanks,
Pedro Alves