From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 49155 invoked by alias); 20 Nov 2017 11:56:48 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 48722 invoked by uid 89); 20 Nov 2017 11:56:48 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: No, score=0.3 required=5.0 tests=BAYES_00,KB_WAM_FROM_NAME_SINGLEWORD,LIKELY_SPAM_SUBJECT,SPF_HELO_PASS,T_RP_MATCHES_RCVD autolearn=no version=3.3.2 spammy=fund, 1 X-HELO: mx1.redhat.com Received: from mx1.redhat.com (HELO mx1.redhat.com) (209.132.183.28) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Mon, 20 Nov 2017 11:56:46 +0000 Received: from smtp.corp.redhat.com (int-mx06.intmail.prod.int.phx2.redhat.com [10.5.11.16]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 985B86A7F3; Mon, 20 Nov 2017 11:56:45 +0000 (UTC) Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4]) by smtp.corp.redhat.com (Postfix) with ESMTP id B1B185E1DB; Mon, 20 Nov 2017 11:56:44 +0000 (UTC) Subject: Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers To: Simon Marchi , gdb-patches@sourceware.org References: <5d721d13-d886-0400-db6b-76485c545142@redhat.com> <1511138515-25996-1-git-send-email-palves@redhat.com> <087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca> From: Pedro Alves Message-ID: Date: Mon, 20 Nov 2017 11:56:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.4.0 MIME-Version: 1.0 In-Reply-To: <087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-SW-Source: 2017-11/txt/msg00393.txt.bz2 On 11/20/2017 01:38 AM, Simon Marchi wrote: > Everything you said makes sense to me, the patch looks good to me. I noted > one comment and a typo below. Thanks, see below. >> +/* Starting from a search name, return the string that finds the upper >> + bound of all strings that start with SEARCH_NAME in a sorted name >> + list. Returns the empty string to indicate that the upper bound is >> + the end of the list. */ >> + >> +static std::string >> +make_sort_after_prefix_name (const char *search_name) >> +{ >> + /* When looking to complete "func", we find the upper bound of all >> + symbols that start with "func" by looking for where we'd insert >> + "func"-with-last-character-incremented, i.e. "fund". */ >> + std::string after = search_name; >> + >> + /* Mind 0xff though, which is a valid character in non-UTF-8 source >> + character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers >> + allowing it in identifiers. If we run into it, increment the >> + previous character instead and shorten the string. If the very >> + first character turns out to be 0xff, then the upper bound is the >> + end of the list. > > It's a bit of a nit, but I think this explanation could be a bit more > precise, and maybe simpler. Maybe you could just say that you strip all > trailing 0xff characters, and increment the last non-0xff character of > the string. If the string is composed only of 0xff characters, then the > upper bound is the end of the list. My problem with that is that it wouldn't explain _why_ we strip the 0xffs. > > The "If the very first character turns out to be 0xff" threw me off a bit, > because if you have the string "\xffa\xff", the upper bound will be "\xffb", > not the end of the list, despite the very first character being 0xff. I like that example. How about the following. It's even longer, but I think it's justified. /* Starting from a search name, return the string that finds the upper bound of all strings that start with SEARCH_NAME in a sorted name list. Returns the empty string to indicate that the upper bound is the end of the list. */ static std::string make_sort_after_prefix_name (const char *search_name) { /* When looking to complete "func", we find the upper bound of all symbols that start with "func" by looking for where we'd insert the closest string that would follow "func" in lexicographical order. Usually, that's "func"-with-last-character-incremented, i.e. "fund". Mind non-ASCII characters, though. Usually those will be UTF-8 multi-byte sequences, but we can't be certain. Especially mind the 0xff character, which is a valid character in non-UTF-8 source character sets (e.g. Latin1 'ÿ'), and we can't rule out compilers allowing it in identifiers. Note that conveniently, strcmp/strcasecmp are specified to compare characters interpreted as unsigned char. So what we do is treat the whole string as a base 255 number composed of a sequence of base 255 "digits" and add 1 to it. I.e., adding 1 to 0xff wraps to 0, and carries 1 to the following more-significant position. If the very first character carries/overflows, then the upper bound is the end of the list. Also the string after the empty string is also the empty string. Some examples of this operation: SEARCH_NAME => "+1" RESULT "abc" => "abd" "ab\xff" => "ac" "\xffa\xff" => "\xffb" "\xff" => "" "\xff\xff" => "" "" => "" Then, with these symbols for example: func func1 fund completing "func" looks for symbols between "func" and "func"-with-last-character-incremented, i.e. "fund" (exclusive), which finds "func" and "func1", but not "fund". And with: funcÿ (Latin1 'ÿ' [0xff]) funcÿ1 fund completing "funcÿ" looks for symbols between "funcÿ" and "fund" (exclusive), which finds "funcÿ" and "funcÿ1", but not "fund". And with: ÿÿ (Latin1 'ÿ' [0xff]) ÿÿ1 completing "ÿ" or "ÿÿ" looks for symbols between between "ÿÿ" and the end of the list. */ std::string after = search_name; while (!after.empty () && (unsigned char) after.back () == 0xff) after.pop_back (); if (!after.empty ()) after.back () = (unsigned char) after.back () + 1; return after; } >> + completing "funcÿ", look for symbols between "funcÿ" and "fund" > > looks Fixed above. Thanks, Pedro Alves