From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-147543-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 93482 invoked by alias); 22 May 2018 15:17:33 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Received: (qmail 93465 invoked by uid 89); 22 May 2018 15:17:32 -0000
Authentication-Results: sourceware.org; auth=none
X-Virus-Found: No
X-Spam-SWARE-Status: =?ISO-8859-1?Q?No, score=-21.8 required=5.0 tests=AWL,BAYES_00,BODY_8BITS,GARBLED_BODY,GIT_PATCH_0,GIT_PATCH_1,GIT_PATCH_2,GIT_PATCH_3,KAM_SHORT,SPF_HELO_PASS autolearn=ham version=3.3.2 spammy=8:n=c3, 8:=c3=a3, 8:=a7=c3, discussions?=
X-HELO: mx1.redhat.com
Received: from mx3-rdu2.redhat.com (HELO mx1.redhat.com) (66.187.233.73) by sourceware.org (qpsmtpd/0.93/v0.84-503-g423c35a) with ESMTP; Tue, 22 May 2018 15:17:29 +0000
Received: from smtp.corp.redhat.com (int-mx04.intmail.prod.int.rdu2.redhat.com [10.11.54.4])	(using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits))	(No client certificate requested)	by mx1.redhat.com (Postfix) with ESMTPS id 8A694CB9C0;	Tue, 22 May 2018 15:17:25 +0000 (UTC)
Received: from [127.0.0.1] (ovpn04.gateway.prod.ext.ams2.redhat.com [10.39.146.4])	by smtp.corp.redhat.com (Postfix) with ESMTP id 9EBD62024CAD;	Tue, 22 May 2018 15:17:24 +0000 (UTC)
Subject: Re: support C/C++ identifiers named with non-ASCII characters
To: =?UTF-8?B?5by15L+K6Iqd?= <zjz@zjz.name>, gdb-patches@sourceware.org
References: <9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name> <8c8af079-dbb8-207b-5edf-86b99e9f5db8@simark.ca> <1b915196-3e97-4892-7426-be4211fe7889@zjz.name> <32da087b-da41-7414-3a56-f2e4587fe287@zjz.name> <253bd3ae-5c38-0c01-6e51-f59fc17b781d@redhat.com> <ba30bb2e-f146-024e-b5fa-725a5d824d5d@zjz.name> <ce0bdad9-1fed-4c9d-11dd-df31f0abb230@redhat.com>
From: Pedro Alves <palves@redhat.com>
Message-ID: <f69cbb32-8973-095c-ed93-9c80fed6db9a@redhat.com>
Date: Tue, 22 May 2018 16:42:00 -0000
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.6.0
MIME-Version: 1.0
In-Reply-To: <ce0bdad9-1fed-4c9d-11dd-df31f0abb230@redhat.com>
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
X-SW-Source: 2018-05/txt/msg00561.txt.bz2

On 05/22/2018 03:50 PM, Pedro Alves wrote:
> On 05/22/2018 03:32 PM, 張俊芝 wrote:
>>
>> Pedro Alves wrote on 2018/5/22 at 10:15 PM:
>>>
>>> I actually already started writing a patch for this a few months
>>> back, including a C testcase, after these discussions:
>>>
>>>    https://sourceware.org/ml/gdb-patches/2017-11/msg00428.html
>>>    https://sourceware.org/ml/gdb/2017-11/msg00022.html
>>>
>>> Let me try to find it.  I don't recall exactly where I left off,
>>> but I think I had something working.
>>
>> I just started writing a test case when I saw your letter.
>>
>> Could you shed light on how you delimit identifiers in your patch, Pedro? Does it check all invalid non-ASCII characters, is it dedicated to some encoding such as UTF-8, or to any encodings?
> 
> I found the patch.  Let me rebase it and send it / post it.  It'll
> be easier to just look at the patch.

Here it is.  It reuses the logic that was added to
cp-name-parser.y, now in the C/C++ expression parser as well.
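
To illustrate how names end up delimited under that rule, here is a
minimal standalone sketch.  It is not code from the patch:
ident_is_alnum is a stand-in for c_ident_is_alnum that uses <ctype.h>
instead of libiberty's safe-ctype.h, and ident_length is a
hypothetical helper.  Any byte with the high bit set counts as part
of a name, so no particular encoding is assumed or validated.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for c_ident_is_alnum from the patch.  Uses isalnum from
   <ctype.h> (an assumption made so the sketch builds on its own)
   rather than ISALNUM from safe-ctype.h.  */
static bool
ident_is_alnum (unsigned char ch)
{
  return isalnum (ch) || ch >= 0x80;
}

/* Hypothetical helper: return the length in bytes of the run of
   identifier bytes at the start of S.  The check that a name must
   not start with a digit is left out for brevity.  */
static size_t
ident_length (const char *s)
{
  size_t n = 0;

  assert (s != NULL);
  while (s[n] != '\0'
	 && (ident_is_alnum ((unsigned char) s[n])
	     || s[n] == '_' || s[n] == '$'))
    n++;
  return n;
}
```

For example, ident_length ("fun\xc3\xa7\xc3\xa3o1 + 1") returns 9:
three ASCII bytes, two two-byte UTF-8 sequences, then "o1".  The scan
stops at the space.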

The testcase passes cleanly, except for the test that does
"b fun<tab><tab>".  That finds two functions that start with
"fun", but GDB/readline displays them in an odd way, with no
space in between the matches:

 (gdb) b função[tab]
 função1função2
 (gdb) b função

I suspect the issue is in our readline replacements in
gdb/completer.c, like gdb_fnwidth.  HANDLE_MULTIBYTE
isn't defined for me, for example.
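
For illustration only, and not a diagnosis: the byte length and the
display width of these names differ, which is the kind of mismatch a
width function without multibyte handling would run into.  A minimal
sketch, assuming valid UTF-8 and one column per code point (which
does not hold for wide or combining characters).  utf8_columns is a
hypothetical helper, not GDB's gdb_fnwidth.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the code points in the UTF-8 string S
   by skipping continuation bytes (those of the form 10xxxxxx).
   Assumes S is valid UTF-8.  */
static size_t
utf8_columns (const char *s)
{
  size_t cols = 0;

  assert (s != NULL);
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xc0) != 0x80)
      cols++;
  return cols;
}
```

With the UTF-8 encoding of "função1" ("fun\xc3\xa7\xc3\xa3o1"), the
string is 9 bytes long while utf8_columns returns 7.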

>From ea6eafba4e32b760afdd1e00a5847772b30a2cbd Mon Sep 17 00:00:00 2001
From: Pedro Alves <palves@redhat.com>
Date: Tue, 22 May 2018 15:35:21 +0100
Subject: [PATCH] Support UTF-8 identifiers in C/C++

Factor cp_ident_is_alpha/cp_ident_is_alnum out of cp-name-parser.y
and use them in the C/C++ expression parser too.

New test included.

gdb/ChangeLog:
yyyy-mm-dd  Pedro Alves  <palves@redhat.com>

	* c-exp.y: Include "c-support.h".
	(parse_number, c_parse_escape, lex_one_token): Use TOLOWER,
	ISXDIGIT, ISDIGIT and ISSPACE instead of tolower, isxdigit,
	isdigit and isspace.  Use c_ident_is_alpha and c_ident_is_alnum
	to scan names.
	* c-lang.c: Include "c-support.h".
	(convert_ucn, convert_octal, convert_hex, convert_escape): Use
	ISXDIGIT instead of isxdigit and ISDIGIT instead of isdigit.
	* c-support.h: New file, with bits factored out from ...
	* cp-name-parser.y: ... this file.
	Include "c-support.h".
	(cp_ident_is_alpha, cp_ident_is_alnum): Deleted, moved to
	c-support.h and renamed.
	(symbol_end, yylex): Adjust.

gdb/testsuite/ChangeLog:
yyyy-mm-dd  Pedro Alves  <palves@redhat.com>

	* gdb.base/utf8-identifiers.c: New file.
	* gdb.base/utf8-identifiers.exp: New file.
---
 gdb/c-exp.y                                 | 27 +++++-----
 gdb/c-lang.c                                | 11 +++--
 gdb/c-support.h                             | 46 +++++++++++++++++
 gdb/cp-name-parser.y                        | 29 ++---------
 gdb/testsuite/gdb.base/utf8-identifiers.c   | 71 ++++++++++++++++++++++++++
 gdb/testsuite/gdb.base/utf8-identifiers.exp | 77 +++++++++++++++++++++++++++++
 6 files changed, 217 insertions(+), 44 deletions(-)
 create mode 100644 gdb/c-support.h
 create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.c
 create mode 100644 gdb/testsuite/gdb.base/utf8-identifiers.exp
diff --git a/gdb/c-exp.y b/gdb/c-exp.y
index 5e10d2a3b4..ae31af52df 100644
--- a/gdb/c-exp.y
+++ b/gdb/c-exp.y
@@ -42,6 +42,7 @@
 #include "parser-defs.h"
 #include "language.h"
 #include "c-lang.h"
+#include "c-support.h"
 #include "bfd.h" /* Required by objfiles.h.  */
 #include "symfile.h" /* Required by objfiles.h.  */
 #include "objfiles.h" /* For have_full_symbols and have_partial_symbols */
@@ -1806,13 +1807,13 @@ parse_number (struct parser_state *par_state,
 	  len -= 2;
 	}
       /* Handle suffixes: 'f' for float, 'l' for long double.  */
-      else if (len >= 1 && tolower (p[len - 1]) == 'f')
+      else if (len >= 1 && TOLOWER (p[len - 1]) == 'f')
 	{
 	  putithere->typed_val_float.type
 	    = parse_type (par_state)->builtin_float;
 	  len -= 1;
 	}
-      else if (len >= 1 && tolower (p[len - 1]) == 'l')
+      else if (len >= 1 && TOLOWER (p[len - 1]) == 'l')
 	{
 	  putithere->typed_val_float.type
 	    = parse_type (par_state)->builtin_long_double;
@@ -2023,9 +2024,9 @@ c_parse_escape (const char **ptr, struct obstack *output)
       if (output)
 	obstack_grow_str (output, "\\x");
       ++tokptr;
-      if (!isxdigit (*tokptr))
+      if (!ISXDIGIT (*tokptr))
 	error (_("\\x escape without a following hex digit"));
-      while (isxdigit (*tokptr))
+      while (ISXDIGIT (*tokptr))
 	{
 	  if (output)
 	    obstack_1grow (output, *tokptr);
@@ -2048,7 +2049,7 @@ c_parse_escape (const char **ptr, struct obstack *output)
 	if (output)
 	  obstack_grow_str (output, "\\");
 	for (i = 0;
-	     i < 3 && isdigit (*tokptr) && *tokptr != '8' && *tokptr != '9';
+	     i < 3 && ISDIGIT (*tokptr) && *tokptr != '8' && *tokptr != '9';
 	     ++i)
 	  {
 	    if (output)
@@ -2073,9 +2074,9 @@ c_parse_escape (const char **ptr, struct obstack *output)
 	    obstack_1grow (output, *tokptr);
 	  }
 	++tokptr;
-	if (!isxdigit (*tokptr))
+	if (!ISXDIGIT (*tokptr))
 	  error (_("\\%c escape without a following hex digit"), c);
-	for (i = 0; i < len && isxdigit (*tokptr); ++i)
+	for (i = 0; i < len && ISXDIGIT (*tokptr); ++i)
 	  {
 	    if (output)
 	      obstack_1grow (output, *tokptr);
@@ -2668,7 +2669,7 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
 	    size_t len = strlen ("selector");
 
 	    if (strncmp (p, "selector", len) == 0
-		&& (p[len] == '\0' || isspace (p[len])))
+		&& (p[len] == '\0' || ISSPACE (p[len])))
 	      {
 		lexptr = p + len;
 		return SELECTOR;
@@ -2677,9 +2678,9 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
 	      goto parse_string;
 	  }
 
-	while (isspace (*p))
+	while (ISSPACE (*p))
 	  p++;
-	if (strncmp (p, "entry", len) == 0 && !isalnum (p[len])
+	if (strncmp (p, "entry", len) == 0 && !c_ident_is_alnum (p[len])
 	    && p[len] != '_')
 	  {
 	    lexptr = &p[len];
@@ -2741,16 +2742,14 @@ lex_one_token (struct parser_state *par_state, bool *is_quoted_name)
       }
     }
 
-  if (!(c == '_' || c == '$'
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')))
+  if (!(c == '_' || c == '$' || c_ident_is_alpha (c)))
     /* We must have come across a bad character (e.g. ';').  */
     error (_("Invalid character '%c' in expression."), c);
 
   /* It's a name.  See how long it is.  */
   namelen = 0;
   for (c = tokstart[namelen];
-       (c == '_' || c == '$' || (c >= '0' && c <= '9')
-	|| (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '<');)
+       (c == '_' || c == '$' || c_ident_is_alnum (c) || c == '<');)
     {
       /* Template parameter lists are part of the name.
 	 FIXME: This mishandles `print $a<4&&$a>3'.  */
diff --git a/gdb/c-lang.c b/gdb/c-lang.c
index 15e633f8c8..6bbb470957 100644
--- a/gdb/c-lang.c
+++ b/gdb/c-lang.c
@@ -25,6 +25,7 @@
 #include "language.h"
 #include "varobj.h"
 #include "c-lang.h"
+#include "c-support.h"
 #include "valprint.h"
 #include "macroscope.h"
 #include "charset.h"
@@ -382,7 +383,7 @@ convert_ucn (char *p, char *limit, const char *dest_charset,
   gdb_byte data[4];
   int i;
 
-  for (i = 0; i < length && p < limit && isxdigit (*p); ++i, ++p)
+  for (i = 0; i < length && p < limit && ISXDIGIT (*p); ++i, ++p)
     result = (result << 4) + host_hex_value (*p);
 
   for (i = 3; i >= 0; --i)
@@ -424,7 +425,7 @@ convert_octal (struct type *type, char *p,
   unsigned long value = 0;
 
   for (i = 0;
-       i < 3 && p < limit && isdigit (*p) && *p != '8' && *p != '9';
+       i < 3 && p < limit && ISDIGIT (*p) && *p != '8' && *p != '9';
        ++i)
     {
       value = 8 * value + host_hex_value (*p);
@@ -447,7 +448,7 @@ convert_hex (struct type *type, char *p,
 {
   unsigned long value = 0;
 
-  while (p < limit && isxdigit (*p))
+  while (p < limit && ISXDIGIT (*p))
     {
       value = 16 * value + host_hex_value (*p);
       ++p;
@@ -488,7 +489,7 @@ convert_escape (struct type *type, const char *dest_charset,
 
     case 'x':
       ADVANCE;
-      if (!isxdigit (*p))
+      if (!ISXDIGIT (*p))
 	error (_("\\x used with no following hex digits."));
       p = convert_hex (type, p, limit, output);
       break;
@@ -510,7 +511,7 @@ convert_escape (struct type *type, const char *dest_charset,
 	int length = *p == 'u' ? 4 : 8;
 
 	ADVANCE;
-	if (!isxdigit (*p))
+	if (!ISXDIGIT (*p))
 	  error (_("\\u used with no following hex digits"));
 	p = convert_ucn (p, limit, dest_charset, output, length);
       }
diff --git a/gdb/c-support.h b/gdb/c-support.h
new file mode 100644
index 0000000000..669db60cd6
--- /dev/null
+++ b/gdb/c-support.h
@@ -0,0 +1,46 @@
+/* Helper routines for C support in GDB.
+   Copyright (C) 2017 Free Software Foundation, Inc.
+
+   This file is part of GDB.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+#ifndef C_SUPPORT_H
+#define C_SUPPORT_H
+
+#include "safe-ctype.h"
+
+/* Like ISALPHA, but also returns true for the union of all UTF-8
+   multi-byte sequence bytes and non-ASCII characters in
+   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
+   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
+   identifiers, but we don't need to be pedantic so for simplicity we
+   ignore that here.  Plus this avoids the complication of actually
+   knowing what was the right encoding.  */
+
+static inline bool
+c_ident_is_alpha (unsigned char ch)
+{
+  return ISALPHA (ch) || ch >= 0x80;
+}
+
+/* Similarly, but like ISALNUM.  */
+
+static inline bool
+c_ident_is_alnum (unsigned char ch)
+{
+  return ISALNUM (ch) || ch >= 0x80;
+}
+
+#endif /* C_SUPPORT_H */
diff --git a/gdb/cp-name-parser.y b/gdb/cp-name-parser.y
index f522e46419..ebae56261b 100644
--- a/gdb/cp-name-parser.y
+++ b/gdb/cp-name-parser.y
@@ -35,6 +35,7 @@
 #include "safe-ctype.h"
 #include "demangle.h"
 #include "cp-support.h"
+#include "c-support.h"
 
 /* Bison does not make it easy to create a parser without global
    state, unfortunately.  Here are all the global variables used
@@ -1304,28 +1305,6 @@ d_binary (const char *name, struct demangle_component *lhs, struct demangle_comp
 		      fill_comp (DEMANGLE_COMPONENT_BINARY_ARGS, lhs, rhs));
 }
 
-/* Like ISALPHA, but also returns true for the union of all UTF-8
-   multi-byte sequence bytes and non-ASCII characters in
-   extended-ASCII charsets (e.g., Latin1).  I.e., returns true if the
-   high bit is set.  Note that not all UTF-8 ranges are allowed in C++
-   identifiers, but we don't need to be pedantic so for simplicity we
-   ignore that here.  Plus this avoids the complication of actually
-   knowing what was the right encoding.  */
-
-static inline bool
-cp_ident_is_alpha (unsigned char ch)
-{
-  return ISALPHA (ch) || ch >= 0x80;
-}
-
-/* Similarly, but Like ISALNUM.  */
-
-static inline bool
-cp_ident_is_alnum (unsigned char ch)
-{
-  return ISALNUM (ch) || ch >= 0x80;
-}
-
 /* Find the end of a symbol name starting at LEXPTR.  */
 
 static const char *
@@ -1333,7 +1312,7 @@ symbol_end (const char *lexptr)
 {
   const char *p = lexptr;
 
-  while (*p && (cp_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
+  while (*p && (c_ident_is_alnum (*p) || *p == '_' || *p == '$' || *p == '.'))
     p++;
 
   return p;
@@ -1813,7 +1792,7 @@ yylex (void)
       return ERROR;
     }
 
-  if (!(c == '_' || c == '$' || cp_ident_is_alpha (c)))
+  if (!(c == '_' || c == '$' || c_ident_is_alpha (c)))
     {
       /* We must have come across a bad character (e.g. ';').  */
       yyerror (_("invalid character"));
@@ -1824,7 +1803,7 @@ yylex (void)
   namelen = 0;
   do
     c = tokstart[++namelen];
-  while (cp_ident_is_alnum (c) || c == '_' || c == '$');
+  while (c_ident_is_alnum (c) || c == '_' || c == '$');
 
   lexptr += namelen;
 
diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.c b/gdb/testsuite/gdb.base/utf8-identifiers.c
new file mode 100644
index 0000000000..c80b42a03d
--- /dev/null
+++ b/gdb/testsuite/gdb.base/utf8-identifiers.c
@@ -0,0 +1,71 @@
+/* -*- coding: utf-8 -*- */
+
+/* This testcase is part of GDB, the GNU debugger.
+
+   Copyright 2017-2018 Free Software Foundation, Inc.
+
+   This program is free software; you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation; either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.
+*/
+
+/* UTF-8 "função1".  */
+#define FUNCAO1 fun\u00e7\u00e3o1
+
+/* UTF-8 "função2".  */
+#define FUNCAO2 fun\u00e7\u00e3o2
+
+/* UTF-8 "my_função".  */
+#define MY_FUNCAO my_fun\u00e7\u00e3o
+
+/* UTF-8 "num_€".  */
+#define NUM_EUROS num_\u20ac
+
+struct S
+{
+  int NUM_EUROS;
+} g_s;
+
+void
+FUNCAO1 (void)
+{
+  g_s.NUM_EUROS = 1000;
+}
+
+void
+FUNCAO2 (void)
+{
+  g_s.NUM_EUROS = 1000;
+}
+
+void
+MY_FUNCAO (void)
+{
+}
+
+int NUM_EUROS = 2000;
+
+static void
+done ()
+{
+}
+
+int
+main ()
+{
+  FUNCAO1 ();
+  done ();
+  FUNCAO2 ();
+  MY_FUNCAO ();
+
+  return 0;
+}
diff --git a/gdb/testsuite/gdb.base/utf8-identifiers.exp b/gdb/testsuite/gdb.base/utf8-identifiers.exp
new file mode 100644
index 0000000000..9e91cc3659
--- /dev/null
+++ b/gdb/testsuite/gdb.base/utf8-identifiers.exp
@@ -0,0 +1,77 @@
+# -*- coding: utf-8 -*-
+
+# This testcase is part of GDB, the GNU debugger.
+
+# Copyright 2017-2018 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test GDB's support for UTF-8 C/C++ identifiers.
+
+load_lib completion-support.exp
+
+standard_testfile
+
+# Enable basic use of UTF-8.  LC_ALL gets reset for each testfile.
+setenv LC_ALL C.UTF-8
+
+if { [prepare_for_testing "failed to prepare" ${testfile} [list $srcfile]] } {
+    return -1
+}
+
+if ![runto done] {
+    fail "couldn't run to done"
+    return
+}
+
+# Test expressions.
+gdb_test "print g_s.num_€" " = 1000"
+gdb_test "print num_€" " = 2000"
+
+# Test linespecs/breakpoints.
+gdb_test "break função2" "Breakpoint $decimal at .*$srcfile.*"
+
+set test "info breakpoints"
+gdb_test_multiple $test $test {
+    -re "in função2 at .*$srcfile.*$gdb_prompt $" {
+	pass $test
+    }
+}
+
+gdb_test "continue" \
+    "Breakpoint $decimal, função2 \\(\\) at .*$srcfile.*"
+
+# Unload symbols from shared libraries to avoid random symbol and file
+# names getting in the way of completion.
+gdb_test_no_output "nosharedlibrary"
+
+# Test linespec completion.
+
+# A unique completion.
+test_gdb_complete_unique "break my_fun" "break my_função"
+
+# A multiple-matches completion:
+
+# kfailed because gdb/readline display the completion match list like
+# this, with no separating space:
+#
+#  (gdb) break função[TAB]
+#  função1função2
+#
+# ... which is bogus.
+setup_kfail "gdb/NNNN" "*-*-*"
+test_gdb_complete_multiple "break " "fun" "ção" {"função1" "função2"}
+
+# Test expression completion.
+test_gdb_complete_unique "print g_s.num" "print g_s.num_€"
-- 
2.14.3