Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers

Mirror of the gdb-patches mailing list
 help / color / mirror / Atom feed

From: Simon Marchi <simark@simark.ca>
To: Pedro Alves <palves@redhat.com>, gdb-patches@sourceware.org
Subject: Re: [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers
Date: Mon, 20 Nov 2017 01:38:00 -0000	[thread overview]
Message-ID: <087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca> (raw)
In-Reply-To: <1511138515-25996-1-git-send-email-palves@redhat.com>

On 2017-11-19 07:41 PM, Pedro Alves wrote:
> The find-upper-bound-for-completion algorithm in the name components
> accelerator table in dwarf2read.c increments a char in a string, and
> asserts that it's not incrementing a 0xff char, but that's incorrect.
> 
> First, we shouldn't be calling gdb_assert on input.
> 
> Then, if "char" is signed, comparing a caracther with "0xff" will
> never yield true, which is caught by Clang with:
> 
>   error: comparison of constant 255 with expression of type '....' (aka 'char') is always true [-Werror,-Wtautological-constant-out-of-range-compare]
> 	    gdb_assert (after.back () != 0xff);
> 			~~~~~~~~~~~~~ ^  ~~~~
> 
> And then, 0xff is a valid character on non-UTF-8/ASCII character sets.
> E.g., it's 'Ã¿' in Latin1.  While GCC nor Clang support !ASCII &&
> !UTF-8 characters in identifiers (GCC supports UTF-8 characters only
> via UCNs, see https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html),
> but other compilers might (Visual Studio?), so it doesn't hurt to
> handle it correctly.  Testing is covered by extending the
> dw2_expand_symtabs_matching unit tests with relevant cases.
> 
> However, without further changes, the unit tests still fail...  The
> problem is that cp-name-parser.y assumes that identifiers are ASCII
> (via ISALPHA/ISALNUM).  This commit fixes that too, so that we can
> unit test the dwarf2read.c changes.  (The regular C/C++ lexer in
> c-lang.y needs a similar treatment, but I'm leaving that for another
> patch.)
> 
> While doing this, I noticed a thinko in the computation of the upper
> bound for completion in dw2_expand_symtabs_matching_symbol.  We're
> using std::upper_bound but we should use std::lower_bound.  I extended
> the unit test with a case that I thought would expose it, this one:
> 
>  +  /* These are used to check that the increment-last-char in the
>  +     matching algorithm for completion doesn't match "t1_fund" when
>  +     completing "t1_func".  */
>  +  "t1_func",
>  +  "t1_func1",
>  +  "t1_fund",
>  +  "t1_fund1",
> 
> The algorithm actually returns "t1_fund1" as lower bound, so "t1_fund"
> matches incorrectly.  But turns out the problem is masked because
> later here:
> 
>   for (;lower != upper; ++lower)
>     {
>       const char *qualified = index.symbol_name_at (lower->idx);
> 
>       if (!lookup_name_matcher.matches (qualified)
> 
> the lookup_name_matcher.matches check above filters out "t1_fund"
> because that doesn't start with "t1_func".
> 
> I'll fix the latent bug in follow up patches, after factoring things
> out a bit in a way that allows unit testing the relevant code more
> directly.

Everything you said makes sense to me, the patch looks good to me.  I noted
one comment and a typo below.

> gdb/ChangeLog:
> yyyy-mm-dd  Pedro Alves  <palves@redhat.com>
> 
> 	* cp-name-parser.y (cp_ident_is_alpha, cp_ident_is_alnum): New.
> 	(symbol_end): Use cp_ident_is_alnum.
> 	(yylex): Use cp_ident_is_alpha and cp_ident_is_alnum.
> 	* dwarf2read.c (make_sort_after_prefix_name): New function.
> 	(dw2_expand_symtabs_matching_symbol): Use it.
> 	(test_symbols): Add more symbols.
> 	(run_test): Add tests.
> ---
>  gdb/cp-name-parser.y |  28 ++++++++++--
>  gdb/dwarf2read.c     | 119 ++++++++++++++++++++++++++++++++++++++++++++-------
>  2 files changed, 129 insertions(+), 18 deletions(-)
> 
> diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
> index 33ecf13..fdfbf15 100644
> --- a/gdb/cp-name-parser.y
> +++ b/gdb/cp-name-parser.y
> @@ -1304,6 +1304,28 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
>  		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
>  }
>  
> +/* Like ISALPHA, but also returns true for the union of all UTF-8
> +   multi-byte sequence bytes and non-ASCII characters in
> +   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
> +   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
> +   identifiers, but we don't need to be pedantic so for simplicity we
> +   ignore that here.  Plus this avoids the complication of actually
> +   knowing what was the right encoding.  */
> +
> +static inline bool
> +cp_ident_is_alpha (unsigned char ch)
> +{
> +  return ISALPHA (ch) || ch >= 0x80;
> +}
> +
> +/* Similarly, but Like ISALNUM.  */
> +
> +static inline bool
> +cp_ident_is_alnum (unsigned char ch)
> +{
> +  return ISALNUM (ch) || ch >= 0x80;
> +}
> +
>  /* Find the end of a symbol name starting at LEXPTR.  */
>  
>  static const char *
> @@ -1311,7 +1333,7 @@ symbol_end (const char *lexptr)
>  {
>    const char *p = lexptr;
>  
> -  while (*p && (ISALNUM (*p) || *p == '_' || *p == '$' || *p == '.'))
> +  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
>      p++;
>  
>    return p;
> @@ -1791,7 +1813,7 @@ yylex (void)
>        return ERROR;
>      }
>  
> -  if (!(c == '_' || c == '$' || ISALPHA (c)))
> +  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
>      {
>        /* We must have come across a bad character (e.g. ';').  */
>        yyerror (_("invalid character"));
> @@ -1802,7 +1824,7 @@ yylex (void)
>    namelen = 0;
>    do
>      c = tokstart[++namelen];
> -  while (ISALNUM (c) || c == '_' || c == '$');
> +  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
>  
>    lexptr += namelen;
>  
> diff --git a/gdb/dwarf2read.c b/gdb/dwarf2read.c
> index 5437d21..b08e81c 100644
> --- a/gdb/dwarf2read.c
> +++ b/gdb/dwarf2read.c
> @@ -4188,6 +4188,60 @@ gdb_index_symbol_name_matcher::matches (const char *symbol_name)
>    return false;
>  }
>  
> +/* Starting from a search name, return the string that finds the upper
> +   bound of all strings that start with SEARCH_NAME in a sorted name
> +   list.  Returns the empty string to indicate that the upper bound is
> +   the end of the list.  */
> +
> +static std::string
> +make_sort_after_prefix_name (const char *search_name)
> +{
> +  /* When looking to complete "func", we find the upper bound of all
> +     symbols that start with "func" by looking for where we'd insert
> +     "func"-with-last-character-incremented, i.e. "fund".  */
> +  std::string after = search_name;
> +
> +  /* Mind 0xff though, which is a valid character in non-UTF-8 source
> +     character sets (e.g. Latin1 'Ã¿'), and we can't rule out compilers
> +     allowing it in identifiers.  If we run into it, increment the
> +     previous character instead and shorten the string.  If the very
> +     first character turns out to be 0xff, then the upper bound is the
> +     end of the list.

It's a bit of a nit, but I think this explanation could be a bit more
precise, and maybe simpler.  Maybe you could just say that you strip all
trailing 0xff characters, and increment the last non-0xff character of
the string.  If the string is composed only of 0xff characters, then the
upper bound is the end of the list.

The "If the very first character turns out to be 0xff" threw me off a bit,
because if you have the string "\xffa\xff", the upper bound will be "\xffb",
not the end of the list, despite the very first character being 0xff.

> +
> +     E.g., with these symbols:
> +
> +      func
> +      func1
> +      fund
> +
> +     completing "func" looks for symbols between "func" and
> +     "func"-with-last-character-incremented, i.e. "fund" (exclusive),
> +     which finds "func" and "func1", but not "fund".
> +
> +     And with:
> +
> +      funcÃ¿     (Latin1 'Ã¿' [0xff])
> +      funcÃ¿1
> +      fund
> +
> +     completing "funcÃ¿", look for symbols between "funcÃ¿" and "fund"

looks

Simon

next prev parent reply	other threads:[~2017-11-20  1:38 UTC|newest]

Thread overview: 182+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-06-02 12:22 [PATCH 00/40] C++ debugging improvements: breakpoints, TAB completion, more Pedro Alves
2017-06-02 12:22 ` [PATCH 03/40] Fix gdb.base/completion.exp with --target_board=dwarf4-gdb-index Pedro Alves
2017-07-13 20:28   ` Keith Seitz
2017-07-14 16:02     ` Pedro Alves
2017-06-02 12:22 ` [PATCH 01/40] Make gdb.base/dmsym.exp independent of "set language ada" Pedro Alves
2017-07-18 19:42   ` Simon Marchi
2017-07-20 17:00     ` Pedro Alves
2017-06-02 12:22 ` [PATCH 14/40] Introduce CP_OPERATOR_STR/CP_OPERATOR_LEN and use throughout Pedro Alves
2017-07-14 18:04   ` Keith Seitz
2017-07-17 14:55     ` Pedro Alves
2017-06-02 12:22 ` [PATCH 02/40] Eliminate make_cleanup_obstack_free, introduce auto_obstack Pedro Alves
2017-06-26 13:47   ` Yao Qi
2017-06-27 10:25     ` Pedro Alves
2017-06-28 10:36   ` Yao Qi
2017-06-28 14:39     ` Pedro Alves
2017-06-28 21:33       ` Yao Qi
2017-06-02 12:22 ` [PATCH 08/40] completion_list_add_name wrapper functions Pedro Alves
2017-06-27 12:56   ` Yao Qi
2017-06-27 15:35     ` Pedro Alves
2017-06-02 12:22 ` [PATCH 06/40] Expression completer should not match explicit location options Pedro Alves
2017-06-29  8:29   ` Yao Qi
2017-06-29 10:56     ` Pedro Alves
2017-06-29 11:08       ` Pedro Alves
2017-06-29 15:23         ` Pedro Alves
2017-06-29 11:24       ` Yao Qi
2017-06-29 15:25         ` Pedro Alves
2017-06-02 12:23 ` [PATCH 19/40] Fix cp_find_first_component_aux bug Pedro Alves
2017-07-17 19:17   ` Keith Seitz
2017-07-17 19:50     ` Pedro Alves
2017-07-17 21:38       ` Keith Seitz
2017-07-20 17:03         ` Pedro Alves
2017-06-02 12:23 ` [PATCH 40/40] Document breakpoints / linespec & co improvements (manual + NEWS) Pedro Alves
2017-06-02 13:01   ` Eli Zaretskii
2017-06-02 13:33     ` Pedro Alves
2017-06-21 15:50       ` Pedro Alves
2017-06-21 19:14         ` Pedro Alves
2017-06-22 19:45           ` Eli Zaretskii
2017-06-22 19:42         ` Eli Zaretskii
2017-06-21 13:32     ` Pedro Alves
2017-06-21 18:26       ` Eli Zaretskii
2017-06-21 19:01         ` Pedro Alves
2017-06-22 19:43           ` Eli Zaretskii
2017-06-02 12:23 ` [PATCH 28/40] lookup_name_info::make_ignore_params Pedro Alves
2017-08-08 20:55   ` Keith Seitz
2017-11-08 16:18     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 18/40] A smarter linespec completer Pedro Alves
2017-07-15  0:07   ` Keith Seitz
2017-07-17 18:21     ` Pedro Alves
2017-07-17 19:02       ` Keith Seitz
2017-07-17 19:33         ` Pedro Alves
2017-06-02 12:23 ` [PATCH 38/40] Use TOLOWER in SYMBOL_HASH_NEXT Pedro Alves
2017-08-09 19:25   ` Keith Seitz
2017-11-25  0:35     ` [pushed] " Pedro Alves
2017-06-02 12:23 ` [PATCH 13/40] Introduce strncmp_iw Pedro Alves
2017-06-29  8:42   ` Yao Qi
2017-07-17 19:16     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 09/40] Rename make_symbol_completion_list_fn -> symbol_completer Pedro Alves
2017-06-28 21:40   ` Yao Qi
2017-07-13 20:46   ` Keith Seitz
2017-07-17 11:00     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 11/40] Introduce class completion_tracker & rewrite completion<->readline interaction Pedro Alves
2017-07-14 17:23   ` Keith Seitz
2017-07-17 13:56     ` Pedro Alves
2017-07-18  8:23       ` Christophe Lyon
     [not found]         ` <845f435e-d3d5-b327-4e3a-ce9434bd6ffd@redhat.com>
2017-07-18 10:42           ` [pushed] Fix GDB builds that include the simulator (Re: [PATCH 11/40] Introduce class completion_tracker & rewrite completion<->readline interaction) Pedro Alves
2018-03-05 21:43   ` [PATCH 11/40] Introduce class completion_tracker & rewrite completion<->readline interaction Simon Marchi
2017-06-02 12:23 ` [PATCH 15/40] Rewrite/enhance explicit locations completer, parse left->right Pedro Alves
2017-07-14 20:55   ` Keith Seitz
2017-07-17 19:24     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 39/40] Breakpoints in symbols with ABI tags (PR c++/19436) Pedro Alves
2017-08-09 19:34   ` Keith Seitz
2017-11-27 17:14     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 35/40] Comprehensive C++ linespec/completer tests Pedro Alves
2017-08-09 17:30   ` Keith Seitz
2017-11-24 16:25     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 36/40] Add comprehensive C++ operator linespec/location/completion tests Pedro Alves
2017-08-09 17:59   ` Keith Seitz
2017-11-25  0:18     ` [pushed] " Pedro Alves
2017-11-30 15:43       ` Yao Qi
2017-11-30 16:06         ` Pedro Alves
2017-11-30 16:35           ` [pushed] Fix gdb.linespec/cpls-ops.exp on 32-bit (Re: [pushed] Re: [PATCH 36/40] Add comprehensive C++ operator linespec/location/completion tests) Pedro Alves
2017-06-02 12:23 ` [PATCH 37/40] Fix completing an empty string Pedro Alves
2017-08-09 18:01   ` Keith Seitz
2017-11-25  0:28     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 27/40] Make cp_remove_params return a unique_ptr Pedro Alves
2017-08-08 20:35   ` Keith Seitz
2017-10-09 15:13     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 10/40] Clean up "completer_handle_brkchars" callback handling Pedro Alves
2017-07-13 21:08   ` Keith Seitz
2017-07-17 11:14     ` Pedro Alves
2017-06-02 12:23 ` [PATCH 34/40] Make strcmp_iw NOT ignore whitespace in the middle of tokens Pedro Alves
2017-08-09 15:48   ` Keith Seitz
2017-11-24 23:38     ` [pushed] " Pedro Alves
2017-06-02 12:28 ` [PATCH 24/40] Per-language symbol name hashing algorithm Pedro Alves
2017-07-18 17:33   ` Keith Seitz
2017-07-20 18:53     ` Pedro Alves
2017-11-08 16:08       ` Pedro Alves
2017-06-02 12:29 ` [PATCH 12/40] "complete" command and completion word break characters Pedro Alves
2017-07-14 17:50   ` Keith Seitz
2017-07-17 14:36     ` Pedro Alves
2017-06-02 12:29 ` [PATCH 33/40] Make the linespec/location completer ignore data symbols Pedro Alves
2017-08-09 15:42   ` Keith Seitz
2017-11-08 16:22     ` Pedro Alves
2017-06-02 12:29 ` [PATCH 17/40] Linespec lexing and C++ operators Pedro Alves
2017-07-14 21:45   ` Keith Seitz
2017-07-17 19:34     ` Pedro Alves
2017-06-02 12:29 ` [PATCH 16/40] Explicit locations -label completer Pedro Alves
2017-07-14 21:32   ` Keith Seitz
2017-06-02 12:29 ` [PATCH 07/40] objfile_per_bfd_storage non-POD Pedro Alves
2017-06-27 12:00   ` Yao Qi
2017-06-27 15:30     ` Pedro Alves
2017-06-02 12:29 ` [PATCH 05/40] command.h: Include scoped_restore_command.h Pedro Alves
2017-06-27 11:30   ` Yao Qi
2017-06-27 11:45     ` Pedro Alves
2017-06-27 11:52       ` Pedro Alves
2017-06-27 12:03         ` Pedro Alves
2017-06-27 15:46           ` [PATCH 05/40] command.h: Include common/scoped_restore.h Pedro Alves
2017-06-28  7:54             ` Yao Qi
2017-06-28 14:20               ` Pedro Alves
2017-06-02 12:29 ` [PATCH 21/40] Use SYMBOL_MATCHES_SEARCH_NAME some more Pedro Alves
2017-07-17 21:39   ` Keith Seitz
2017-07-20 17:08     ` Pedro Alves
2017-06-02 12:30 ` [PATCH 20/40] Eliminate block_iter_name_* Pedro Alves
2017-07-17 19:47   ` Keith Seitz
2017-07-20 17:05     ` Pedro Alves
2017-06-02 12:30 ` [PATCH 32/40] Make "break foo" find "A::foo", A::B::foo", etc. [C++ and wild matching] Pedro Alves
2017-08-08 23:48   ` Keith Seitz
2017-11-22 16:48     ` Pedro Alves
2017-11-24 16:48       ` Pedro Alves
2017-11-24 16:57         ` Pedro Alves
2017-11-28  0:39         ` Keith Seitz
2017-11-28  0:02       ` Keith Seitz
2017-11-28  0:21         ` Pedro Alves
2017-11-28  0:42           ` Keith Seitz
2017-06-02 12:30 ` [PATCH 30/40] Use search_domain::FUNCTIONS_DOMAIN when setting breakpoints Pedro Alves
2017-08-08 21:07   ` Keith Seitz
2017-11-08 16:20     ` Pedro Alves
2017-06-02 12:31 ` [PATCH 29/40] Simplify completion_list_add_name | remove sym_text / sym_text_len Pedro Alves
2017-08-08 20:59   ` Keith Seitz
2017-11-08 16:19     ` Pedro Alves
2017-06-02 12:31 ` [PATCH 04/40] Fix TAB-completion + .gdb_index slowness (generalize filename_seen_cache) Pedro Alves
2017-07-13 20:41   ` Keith Seitz
2017-07-14 19:40     ` Pedro Alves
2017-07-17 10:51       ` Pedro Alves
2017-06-02 12:31 ` [PATCH 22/40] get_int_var_value Pedro Alves
2017-07-17 22:11   ` Keith Seitz
2017-07-20 17:15     ` Pedro Alves
2017-06-02 12:33 ` [PATCH 31/40] Handle custom completion match prefix / LCD Pedro Alves
2017-08-08 21:28   ` Keith Seitz
2017-11-27 17:11     ` Pedro Alves
2017-06-02 12:39 ` [PATCH 25/40] Introduce lookup_name_info and generalize Ada's FULL/WILD name matching Pedro Alves
2017-07-18 20:14   ` Keith Seitz
2017-07-18 22:31     ` Pedro Alves
2017-07-20 19:00       ` Pedro Alves
2017-07-20 19:06         ` Pedro Alves
2017-08-08 20:29           ` Keith Seitz
2017-10-19 17:36             ` Pedro Alves
2017-11-01 15:38               ` Joel Brobecker
2017-11-08 16:10                 ` Pedro Alves
2017-11-08 22:15                   ` Joel Brobecker
2017-06-02 12:39 ` [PATCH 23/40] Make language_def O(1) Pedro Alves
2017-07-17 23:03   ` Keith Seitz
2017-07-20 17:40     ` Pedro Alves
2017-07-20 18:12       ` Get rid of "set language local"? (was: Re: [PATCH 23/40] Make language_def O(1)) Pedro Alves
2017-07-20 23:44         ` Matt Rice
2017-06-02 12:39 ` [PATCH 26/40] Optimize .gdb_index symbol name searching Pedro Alves
2017-08-08 20:32   ` Keith Seitz
2017-11-08 16:14     ` Pedro Alves
2017-11-08 16:16       ` [pushed] Reorder/reindent dw2_expand_symtabs_matching & friends (Re: [PATCH 26/40] Optimize .gdb_index symbol name searching) Pedro Alves
2017-11-18  5:23   ` [PATCH 26/40] Optimize .gdb_index symbol name searching Simon Marchi
2017-11-20  0:33     ` Pedro Alves
2017-11-20  0:42       ` [PATCH 1/3] 0xff chars in name components table; cp-name-parser lex UTF-8 identifiers Pedro Alves
2017-11-20  1:38         ` Simon Marchi [this message]
2017-11-20 11:56           ` Pedro Alves
2017-11-20 16:50             ` Simon Marchi
2017-11-21  0:11               ` Pedro Alves
2017-11-20  0:42       ` [PATCH 3/3] Fix mapped_index::find_name_components_bounds upper bound computation Pedro Alves
2017-11-20  3:17         ` Simon Marchi
2017-11-20  0:42       ` [PATCH 2/3] Unit test name-component bounds searching directly Pedro Alves
2017-11-20  3:16         ` Simon Marchi
2017-11-20 14:17           ` Pedro Alves
2017-06-02 15:26 ` [PATCH 00/40] C++ debugging improvements: breakpoints, TAB completion, more Pedro Alves

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=087a3f24-13ec-77b3-3b2b-fff1d0814ec1@simark.ca \
    --to=simark@simark.ca \
    --cc=gdb-patches@sourceware.org \
    --cc=palves@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox