From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (qmail 93482 invoked by alias); 22 May 2018 15:17:33 -0000 Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm Precedence: bulk List-Id: List-Subscribe: List-Archive: List-Post: List-Help: , Sender: gdb-patches-owner@sourceware.org Received: (qmail 93465 invoked by uid 89); 22 May 2018 15:17:32 -0000 Authentication-Results: sourceware.org; auth=none X-Virus-Found: No X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=-21.8 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_SHORT,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=8:n=c3, 8:=c3=a3, 8:=a7=c3, discussions?= X-HELO: mx1.redhat.com Received: from mx3-rdu2.redhat.com (HELO mx1.redhat.com) (66.187.233.73) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 22 May 2018 15:17:29 +0000 Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 8A694CB9C0; Tue, 22 May 2018 15:17:25 +0000 (UTC) Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4]) by smtp.corp.redhat.com (Postfix) with ESMTP id 9EBD62024CAD; Tue, 22 May 2018 15:17:24 +0000 (UTC) Subject: Re: support C/C++ identifiers named with non-ASCII characters To: =?UTF-8?B?5by15L+K6Iqd?= , gdb-patches@sourceware.org References: <9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name> <8c8af079-dbb8-207b-5edf-86b99e9f5db8@simark.ca> <1b915196-3e97-4892-7426-be4211fe7889@zjz.name> <32da087b-da41-7414-3a56-f2e4587fe287@zjz.name> <253bd3ae-5c38-0c01-6e51-f59fc17b781d@redhat.com> From: Pedro Alves Message-ID: Date: Tue, 22 May 2018 16:42:00 -0000 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0 MIME-Version: 1.0 In-Reply-To: Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit X-SW-Source: 2018-05/txt/msg00561.txt.bz2 On 05/22/2018 03:50 PM, Pedro Alves wrote: > On 05/22/2018 03:32 PM, 張俊芝 wrote: >> >> Pedro Alves 於 2018/5/22 下午10:15 寫道: >>> >>> I actually already started writing a patch for this a few months >>> back, including a C testcase, after these discussions: >>> >>>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html >>>    https://sourceware.org/ml/gdb/2017-11/msg00022.html >>> >>> Let me try to find it.  I don't recall exactly where I left off, >>> but I think I had something working. >> >> I just started writing a test case when I saw your letter. >> >> Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings? > > I found the patch. Let me rebase it and send it / post it. It'll > be easier to just look at the patch. Here it is. So this is reusing the same logic added to cp-name-parser.y, in the C/C++ expression parser as well. The testcase passes cleanly, except for the test that does "b fun". That finds two functions that start with "fun", but GDB/readline displays them in an odd way, with no space in between the matches: (gdb) b função[tab] função1função2 (gdb) b função I suspect the issue is in our readline replacements in gdb/completer.c, like gdb_fnwidth. HANDLE_MULTIBYTE isn't defined for me, for example. >From ea6eafba4e32b760afdd1e00a5847772b30a2cbd Mon Sep 17 00:00:00 2001 From: Pedro Alves Date: Tue, 22 May 2018 15:35:21 +0100 Subject: [PATCH] Support UTF-8 identifiers in C/C++ Factor out cp_ident_is_alpha/cp_ident_is_alnum out of cp-name-parser.y and use it in the C/C++ expression parser too. New test included. gdb/ChangeLog: yyyy-mm-dd Pedro Alves * c-exp.y: Include "c-support.h". (parse_number, c_parse_escape, lex_one_token): Use TOLOWER instead of tolower. Use c_ident_is_alpha to scan names. * c-lang.c: Include "c-support.h". (convert_ucn, convert_octal, convert_hex, convert_escape): Use ISXDIGIT instead of isxdigit and ISDIGIT instead of isdigit. * c-support.h: New file, with bits factored out from ... * cp-name-parser.y: ... this file. Include "c-support.h". (cp_ident_is_alpha, cp_ident_is_alnum): Deleted, moved to c-support.h and renamed. (symbol_end, yylex): Adjust. gdb/testsuite/ChangeLog: yyyy-mm-dd Pedro Alves * gdb.base/utf8-identifiers.c: New file. * gdb.base/utf8-identifiers.exp: New file. --- gdb/c-exp.y | 27 +++++----- gdb/c-lang.c | 11 +++-- gdb/c-support.h | 46 +++++++++++++++++ gdb/cp-name-parser.y | 29 ++--------- gdb/testsuite/gdb.base/utf8-identifiers.c | 71 ++++++++++++++++++++++++++ gdb/testsuite/gdb.base/utf8-identifiers.exp | 77 +++++++++++++++++++++++++++++ 6 files changed, 217 insertions(+), 44 deletions(-) create mode 100644 gdb/c-support.h create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.c create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.exp diff --git a/gdb/c-exp.y b/gdb/c-exp.y index 5e10d2a3b4..ae31af52df 100644 --- a/gdb/c-exp.y +++ b/gdb/c-exp.y @@ -42,6 +42,7 @@ #include "parser-defs.h" #include "language.h" #include "c-lang.h" +#include "c-support.h" #include "bfd.h" /* Required by objfiles.h. */ #include "symfile.h" /* Required by objfiles.h. */ #include "objfiles.h" /* For have_full_symbols and have_partial_symbols */ @@ -1806,13 +1807,13 @@ parse_number (struct parser_state *par_state, len -= 2; } /* Handle suffixes: 'f' for float, 'l' for long double. */ - else if (len >= 1 && tolower (p[len - 1]) == 'f') + else if (len >= 1 && TOLOWER (p[len - 1]) == 'f') { putithere->typed_val_float.type = parse_type (par_state)->builtin_float; len -= 1; } - else if (len >= 1 && tolower (p[len - 1]) == 'l') + else if (len >= 1 && TOLOWER (p[len - 1]) == 'l') { putithere->typed_val_float.type = parse_type (par_state)->builtin_long_double; @@ -2023,9 +2024,9 @@ c_parse_escape (const char **ptr, struct obstack *output) if (output) obstack_grow_str (output, "\\x"); ++tokptr; - if (!isxdigit (*tokptr)) + if (!ISXDIGIT (*tokptr)) error (_("\\x escape without a following hex digit")); - while (isxdigit (*tokptr)) + while (ISXDIGIT (*tokptr)) { if (output) obstack_1grow (output, *tokptr); @@ -2048,7 +2049,7 @@ c_parse_escape (const char **ptr, struct obstack *output) if (output) obstack_grow_str (output, "\\"); for (i = 0; - i < 3 && isdigit (*tokptr) && *tokptr != '8' && *tokptr != '9'; + i < 3 && ISDIGIT (*tokptr) && *tokptr != '8' && *tokptr != '9'; ++i) { if (output) @@ -2073,9 +2074,9 @@ c_parse_escape (const char **ptr, struct obstack *output) obstack_1grow (output, *tokptr); } ++tokptr; - if (!isxdigit (*tokptr)) + if (!ISXDIGIT (*tokptr)) error (_("\\%c escape without a following hex digit"), c); - for (i = 0; i < len && isxdigit (*tokptr); ++i) + for (i = 0; i < len && ISXDIGIT (*tokptr); ++i) { if (output) obstack_1grow (output, *tokptr); @@ -2668,7 +2669,7 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name) size_t len = strlen ("selector"); if (strncmp (p, "selector", len) == 0 - && (p[len] == '\0' || isspace (p[len]))) + && (p[len] == '\0' || ISSPACE (p[len]))) { lexptr = p + len; return SELECTOR; @@ -2677,9 +2678,9 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name) goto parse_string; } - while (isspace (*p)) + while (ISSPACE (*p)) p++; - if (strncmp (p, "entry", len) == 0 && !isalnum (p[len]) + if (strncmp (p, "entry", len) == 0 && !c_ident_is_alnum (p[len]) && p[len] != '_') { lexptr = &p[len]; @@ -2741,16 +2742,14 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name) } } - if (!(c == '_' || c == '$' - || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))) + if (!(c == '_' || c == '$' || c_ident_is_alpha (c))) /* We must have come across a bad character (e.g. ';'). */ error (_("Invalid character '%c' in expression."), c); /* It's a name. See how long it is. */ namelen = 0; for (c = tokstart[namelen]; - (c == '_' || c == '$' || (c >= '0' && c <= '9') - || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');) + (c == '_' || c == '$' || c_ident_is_alnum (c) || c == '<');) { /* Template parameter lists are part of the name. FIXME: This mishandles `print $a<4&&$a>3'. */ diff --git a/gdb/c-lang.c b/gdb/c-lang.c index 15e633f8c8..6bbb470957 100644 --- a/gdb/c-lang.c +++ b/gdb/c-lang.c @@ -25,6 +25,7 @@ #include "language.h" #include "varobj.h" #include "c-lang.h" +#include "c-support.h" #include "valprint.h" #include "macroscope.h" #include "charset.h" @@ -382,7 +383,7 @@ convert_ucn (char *p, char *limit, const char *dest_charset, gdb_byte data[4]; int i; - for (i = 0; i < length && p < limit && isxdigit (*p); ++i, ++p) + for (i = 0; i < length && p < limit && ISXDIGIT (*p); ++i, ++p) result = (result << 4) + host_hex_value (*p); for (i = 3; i >= 0; --i) @@ -424,7 +425,7 @@ convert_octal (struct type *type, char *p, unsigned long value = 0; for (i = 0; - i < 3 && p < limit && isdigit (*p) && *p != '8' && *p != '9'; + i < 3 && p < limit && ISDIGIT (*p) && *p != '8' && *p != '9'; ++i) { value = 8 * value + host_hex_value (*p); @@ -447,7 +448,7 @@ convert_hex (struct type *type, char *p, { unsigned long value = 0; - while (p < limit && isxdigit (*p)) + while (p < limit && ISXDIGIT (*p)) { value = 16 * value + host_hex_value (*p); ++p; @@ -488,7 +489,7 @@ convert_escape (struct type *type, const char *dest_charset, case 'x': ADVANCE; - if (!isxdigit (*p)) + if (!ISXDIGIT (*p)) error (_("\\x used with no following hex digits.")); p = convert_hex (type, p, limit, output); break; @@ -510,7 +511,7 @@ convert_escape (struct type *type, const char *dest_charset, int length = *p == 'u' ? 4 : 8; ADVANCE; - if (!isxdigit (*p)) + if (!ISXDIGIT (*p)) error (_("\\u used with no following hex digits")); p = convert_ucn (p, limit, dest_charset, output, length); } diff --git a/gdb/c-support.h b/gdb/c-support.h new file mode 100644 index 0000000000..669db60cd6 --- /dev/null +++ b/gdb/c-support.h @@ -0,0 +1,46 @@ +/* Helper routines for C support in GDB. + Copyright (C) 2017 Free Software Foundation, Inc. + + This file is part of GDB. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . */ + +#ifndef C_SUPPORT_H +#define C_SUPPORT_H + +#include "safe-ctype.h" + +/* Like ISALPHA, but also returns true for the union of all UTF-8 + multi-byte sequence bytes and non-ASCII characters in + extended-ASCII charsets (e.g., Latin1). I.e., returns true if the + high bit is set. Note that not all UTF-8 ranges are allowed in C++ + identifiers, but we don't need to be pedantic so for simplicity we + ignore that here. Plus this avoids the complication of actually + knowing what was the right encoding. */ + +static inline bool +c_ident_is_alpha (unsigned char ch) +{ + return ISALPHA (ch) || ch >= 0x80; +} + +/* Similarly, but Like ISALNUM. */ + +static inline bool +c_ident_is_alnum (unsigned char ch) +{ + return ISALNUM (ch) || ch >= 0x80; +} + +#endif /* C_SUPPORT_H */ diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y index f522e46419..ebae56261b 100644 --- a/gdb/cp-name-parser.y +++ b/gdb/cp-name-parser.y @@ -35,6 +35,7 @@ #include "safe-ctype.h" #include "demangle.h" #include "cp-support.h" +#include "c-support.h" /* Bison does not make it easy to create a parser without global state, unfortunately. Here are all the global variables used @@ -1304,28 +1305,6 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs)); } -/* Like ISALPHA, but also returns true for the union of all UTF-8 - multi-byte sequence bytes and non-ASCII characters in - extended-ASCII charsets (e.g., Latin1). I.e., returns true if the - high bit is set. Note that not all UTF-8 ranges are allowed in C++ - identifiers, but we don't need to be pedantic so for simplicity we - ignore that here. Plus this avoids the complication of actually - knowing what was the right encoding. */ - -static inline bool -cp_ident_is_alpha (unsigned char ch) -{ - return ISALPHA (ch) || ch >= 0x80; -} - -/* Similarly, but Like ISALNUM. */ - -static inline bool -cp_ident_is_alnum (unsigned char ch) -{ - return ISALNUM (ch) || ch >= 0x80; -} - /* Find the end of a symbol name starting at LEXPTR. */ static const char * @@ -1333,7 +1312,7 @@ symbol_end (const char *lexptr) { const char *p = lexptr; - while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.')) + while (*p && (c_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.')) p++; return p; @@ -1813,7 +1792,7 @@ yylex (void) return ERROR; } - if (!(c == '_' || c == '$' || cp_ident_is_alpha (c))) + if (!(c == '_' || c == '$' || c_ident_is_alpha (c))) { /* We must have come across a bad character (e.g. ';'). */ yyerror (_("invalid character")); @@ -1824,7 +1803,7 @@ yylex (void) namelen = 0; do c = tokstart[++namelen]; - while (cp_ident_is_alnum (c) || c == '_' || c == '$'); + while (c_ident_is_alnum (c) || c == '_' || c == '$'); lexptr += namelen; diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.c b/gdb/testsuite/gdb.base/utf8-identifiers.c new file mode 100644 index 0000000000..c80b42a03d --- /dev/null +++ b/gdb/testsuite/gdb.base/utf8-identifiers.c @@ -0,0 +1,71 @@ +/* -*- coding: utf-8 -*- */ + +/* This testcase is part of GDB, the GNU debugger. + + Copyright 2017-2018 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see . +*/ + +/* UTF-8 "função1". */ +#define FUNCAO1 fun\u00e7\u00e3o1 + +/* UTF-8 "função2". */ +#define FUNCAO2 fun\u00e7\u00e3o2 + +/* UTF-8 "my_função". */ +#define MY_FUNCAO my_fun\u00e7\u00e3o + +/* UTF-8 "num_€". */ +#define NUM_EUROS num_\u20ac + +struct S +{ + int NUM_EUROS; +} g_s; + +void +FUNCAO1 (void) +{ + g_s.NUM_EUROS = 1000; +} + +void +FUNCAO2 (void) +{ + g_s.NUM_EUROS = 1000; +} + +void +MY_FUNCAO (void) +{ +} + +int NUM_EUROS = 2000; + +static void +done () +{ +} + +int +main () +{ + FUNCAO1 (); + done (); + FUNCAO2 (); + MY_FUNCAO (); + + return 0; +} diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.exp b/gdb/testsuite/gdb.base/utf8-identifiers.exp new file mode 100644 index 0000000000..9e91cc3659 --- /dev/null +++ b/gdb/testsuite/gdb.base/utf8-identifiers.exp @@ -0,0 +1,77 @@ +# -*- coding: utf-8 -*- */ + +# This testcase is part of GDB, the GNU debugger. + +# Copyright 2017-2018 Free Software Foundation, Inc. + +# This program is free software; you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation; either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see . + +# Test GDB's support for UTF-8 C/C++ identifiers. + +load_lib completion-support.exp + +standard_testfile + +# Enable basic use of UTF-8. LC_ALL gets reset for each testfile. +setenv LC_ALL C.UTF-8 + +if { [prepare_for_testing "failed to prepare" ${testfile} [list $srcfile]] } { + return -1 +} + +if ![runto done] { + fail "couldn't run to done" + return +} + +# Test expressions. +gdb_test "print g_s.num_€" " = 1000" +gdb_test "print num_€" " = 2000" + +# Test linespecs/breakpoints. +gdb_test "break função2" "Breakpoint $decimal at .*$srcfile.*" + +set test "info breakpoints" +gdb_test_multiple $test $test { + -re "in função2 at .*$srcfile.*$gdb_prompt $" { + pass $test + } +} + +gdb_test "continue" \ + "Breakpoint $decimal, função2 \\(\\) at .*$srcfile.*" + +# Unload symbols from shared libraries to avoid random symbol and file +# names getting in the way of completion. +gdb_test_no_output "nosharedlibrary" + +# Test linespec completion. + +# A unique completion. +test_gdb_complete_unique "break my_fun" "break my_função" + +# A multiple-matches completion: + +# kfailed because gdb/readline display the completion match list like +# this, with no separating space: +# +# (gdb) break função[TAB] +# função1função2 +# +# ... which is bogus. +setup_kfail "gdb/NNNN" "*-*-*" +test_gdb_complete_multiple "break " "fun" "ção" {"função1" "função2"} + +# Test expression completion. +test_gdb_complete_unique "print g_s.num" "print g_s.num_€" -- 2.14.3