From: Tom Tromey
To: gdb-patches@sourceware.org
Subject: [PATCH] Allow non-ASCII characters in Rust identifiers
Date: Wed, 26 Jan 2022 16:15:01 -0700
Message-Id: <20220126231501.1031201-1-tom@tromey.com>
X-Mailer: git-send-email 2.31.1
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Cc: Tom Tromey

Rust 1.53 (quite a while ago now) ungated the support for non-ASCII
identifiers.  This didn't work in gdb.  This is PR rust/20166.

This patch fixes the problem by allowing non-ASCII characters to be
considered as identifier components.  It seemed simplest to just pass
them through -- doing any extra checking didn't seem worthwhile.

The new test also verifies that such characters are allowed in strings
and character literals as well.  The latter also required a bit of
work in the lexer.

Bug: https://sourceware.org/bugzilla/show_bug.cgi?id=20166
---
 gdb/rust-parse.c                   | 70 ++++++++++++++++++++++--------
 gdb/testsuite/gdb.rust/unicode.exp | 51 ++++++++++++++++++++++
 gdb/testsuite/gdb.rust/unicode.rs  | 26 +++++++++++
 3 files changed, 129 insertions(+), 18 deletions(-)
 create mode 100644 gdb/testsuite/gdb.rust/unicode.exp
 create mode 100644 gdb/testsuite/gdb.rust/unicode.rs

diff --git a/gdb/rust-parse.c b/gdb/rust-parse.c
index 31a1ee3b38f..aa215f9cf2a 100644
--- a/gdb/rust-parse.c
+++ b/gdb/rust-parse.c
@@ -33,6 +33,12 @@
 
 using namespace expr;
 
+#if WORDS_BIGENDIAN
+#define UTF32 "UTF-32BE"
+#else
+#define UTF32 "UTF-32LE"
+#endif
+
 /* A regular expression for matching Rust numbers.  This is split up
    since it is very long and this gives us a way to comment the
    sections.  */
@@ -577,6 +583,35 @@ rust_parser::lex_escape (int is_byte)
   return result;
 }
 
+/* A helper for lex_character.  Search forward for the closing single
+   quote, then convert the bytes from the host charset to UTF-32.  */
+
+static uint32_t
+lex_multibyte_char (const char *text, int *len)
+{
+  /* Scan forward to the closing quote.  A literal holding more than
+     one character is rejected below, after conversion to UTF-32.  */
+  int quote;
+  gdb_assert (text[0] != '\'');
+  for (quote = 1; text[quote] != '\0' && text[quote] != '\''; ++quote)
+    ;
+  *len = quote;
+  /* The caller will issue an error.  */
+  if (text[quote] == '\0')
+    return 0;
+
+  auto_obstack result;
+  convert_between_encodings (host_charset (), UTF32, (const gdb_byte *) text,
+                             quote, 1, &result, translit_none);
+
+  int size = obstack_object_size (&result);
+  if (size > 4)
+    error (_("overlong character literal"));
+  uint32_t value;
+  memcpy (&value, obstack_finish (&result), size);
+  return value;
+}
+
 /* Lex a character constant.  */
 
 int
@@ -592,13 +627,15 @@ rust_parser::lex_character ()
     }
   gdb_assert (pstate->lexptr[0] == '\'');
   ++pstate->lexptr;
-  /* This should handle UTF-8 here.  */
-  if (pstate->lexptr[0] == '\\')
+  if (pstate->lexptr[0] == '\'')
+    error (_("empty character literal"));
+  else if (pstate->lexptr[0] == '\\')
     value = lex_escape (is_byte);
   else
     {
-      value = pstate->lexptr[0] & 0xff;
-      ++pstate->lexptr;
+      int len;
+      value = lex_multibyte_char (&pstate->lexptr[0], &len);
+      pstate->lexptr += len;
     }
 
   if (pstate->lexptr[0] != '\'')
@@ -695,16 +732,9 @@ rust_parser::lex_string ()
           if (is_byte)
             obstack_1grow (&obstack, value);
           else
-            {
-#if WORDS_BIGENDIAN
-#define UTF32 "UTF-32BE"
-#else
-#define UTF32 "UTF-32LE"
-#endif
-              convert_between_encodings (UTF32, "UTF-8", (gdb_byte *) &value,
-                                         sizeof (value), sizeof (value),
-                                         &obstack, translit_none);
-            }
+            convert_between_encodings (UTF32, "UTF-8", (gdb_byte *) &value,
+                                       sizeof (value), sizeof (value),
+                                       &obstack, translit_none);
         }
       else if (pstate->lexptr[0] == '\0')
         error (_("Unexpected EOF in string"));
@@ -746,7 +776,10 @@ rust_identifier_start_p (char c)
   return ((c >= 'a' && c <= 'z')
           || (c >= 'A' && c <= 'Z')
           || c == '_'
-          || c == '$');
+          || c == '$'
+          /* Allow any non-ASCII character as an identifier.  There
+             doesn't seem to be a need to be picky about this.  */
+          || (c & 0x80) != 0);
 }
 
 /* Lex an identifier.  */
@@ -772,13 +805,14 @@ rust_parser::lex_identifier ()
 
   ++pstate->lexptr;
 
-  /* For the time being this doesn't handle Unicode rules.  Non-ASCII
-     identifiers are gated anyway.  */
+  /* Allow any non-ASCII character here.  This "handles" UTF-8 by
+     passing it through.  */
   while ((pstate->lexptr[0] >= 'a' && pstate->lexptr[0] <= 'z')
          || (pstate->lexptr[0] >= 'A' && pstate->lexptr[0] <= 'Z')
          || pstate->lexptr[0] == '_'
          || (is_gdb_var && pstate->lexptr[0] == '$')
-         || (pstate->lexptr[0] >= '0' && pstate->lexptr[0] <= '9'))
+         || (pstate->lexptr[0] >= '0' && pstate->lexptr[0] <= '9')
+         || (pstate->lexptr[0] & 0x80) != 0)
     ++pstate->lexptr;
 
 
diff --git a/gdb/testsuite/gdb.rust/unicode.exp b/gdb/testsuite/gdb.rust/unicode.exp
new file mode 100644
index 00000000000..9de0a0e724f
--- /dev/null
+++ b/gdb/testsuite/gdb.rust/unicode.exp
@@ -0,0 +1,51 @@
+# Copyright (C) 2022 Free Software Foundation, Inc.
+
+# This program is free software; you can redistribute it and/or modify
+# it under the terms of the GNU General Public License as published by
+# the Free Software Foundation; either version 3 of the License, or
+# (at your option) any later version.
+#
+# This program is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+# Test non-ASCII identifiers.
+
+load_lib rust-support.exp
+if {[skip_rust_tests]} {
+    continue
+}
+
+# Non-ASCII identifiers were allowed starting in 1.53.
+set v [split [rust_compiler_version] .]
+if {[lindex $v 0] == 1 && [lindex $v 1] < 53} {
+    untested "this test requires rust 1.53 or greater"
+    return -1
+}
+
+# Enable basic use of UTF-8.  LC_ALL gets reset for each testfile.
+setenv LC_ALL C.UTF-8
+
+standard_testfile .rs
+if {[prepare_for_testing "failed to prepare" $testfile $srcfile {debug rust}]} {
+    return -1
+}
+
+set line [gdb_get_line_number "set breakpoint here"]
+if {![runto ${srcfile}:$line]} {
+    untested "could not run to breakpoint"
+    return -1
+}
+
+gdb_test "print 𝕯" " = 98" "print D"
+gdb_test "print \"𝕯\"" " = \"𝕯\"" "print D in string"
+# This output is maybe not ideal, but it also isn't incorrect.
+gdb_test "print '𝕯'" " = 120175 '\\\\u\\\{01d56f\\\}'" \
+    "print D as char"
+gdb_test "print cç" " = 97" "print cc"
+
+gdb_test "print 'çc'" "overlong character literal" "print cc as char"
diff --git a/gdb/testsuite/gdb.rust/unicode.rs b/gdb/testsuite/gdb.rust/unicode.rs
new file mode 100644
index 00000000000..c6ca90e6450
--- /dev/null
+++ b/gdb/testsuite/gdb.rust/unicode.rs
@@ -0,0 +1,26 @@
+// Copyright (C) 2022 Free Software Foundation, Inc.
+
+// This program is free software; you can redistribute it and/or modify
+// it under the terms of the GNU General Public License as published by
+// the Free Software Foundation; either version 3 of the License, or
+// (at your option) any later version.
+//
+// This program is distributed in the hope that it will be useful,
+// but WITHOUT ANY WARRANTY; without even the implied warranty of
+// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+// GNU General Public License for more details.
+//
+// You should have received a copy of the GNU General Public License
+// along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+#![allow(dead_code)]
+#![allow(unused_variables)]
+#![allow(unused_assignments)]
+#![allow(uncommon_codepoints)]
+#![allow(non_snake_case)]
+
+fn main() {
+    let 𝕯 = 98;
+    let cç = 97;
+    println!("{}, {}", 𝕯, cç); // set breakpoint here
+}
-- 
2.31.1
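
For anyone who wants to poke at the identifier rule outside the gdb
tree, here is a rough standalone sketch in C.  It is an illustration
only, not the code in rust-parse.c: ident_start_p, ident_length and
ident_self_check are hypothetical names, the input is assumed to be
UTF-8, and the gdb-specific '$' handling in lex_identifier
(is_gdb_var) is left out.  The sketch relies on the fact that every
byte of a multi-byte UTF-8 sequence has its high bit set, so testing
(c & 0x80) keeps a whole code point inside one token without decoding
it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for rust_identifier_start_p: accept ASCII
   letters, '_', '$', or any byte with the high bit set.  */
bool
ident_start_p (char c)
{
  return ((c >= 'a' && c <= 'z')
          || (c >= 'A' && c <= 'Z')
          || c == '_'
          || c == '$'
          || (c & 0x80) != 0);
}

/* Hypothetical helper: return the length in bytes of the identifier
   at the start of TEXT, or 0 if TEXT does not start with one.
   Non-ASCII bytes are passed through without validation.  */
size_t
ident_length (const char *text)
{
  if (!ident_start_p (text[0]))
    return 0;

  size_t n = 1;
  while ((text[n] >= 'a' && text[n] <= 'z')
         || (text[n] >= 'A' && text[n] <= 'Z')
         || text[n] == '_'
         || (text[n] >= '0' && text[n] <= '9')
         || (text[n] & 0x80) != 0)
    ++n;
  return n;
}

/* Small self-check using the two names from unicode.rs, written as
   explicit UTF-8 bytes.  */
void
ident_self_check (void)
{
  /* "cç" is 'c' followed by the two-byte sequence C3 A7.  */
  assert (ident_length ("c\xc3\xa7 + 1") == 3);
  /* U+1D56F is the four-byte sequence F0 9D 95 AF.  */
  assert (ident_length ("\xf0\x9d\x95\xaf") == 4);
  /* A leading digit still does not start an identifier.  */
  assert (ident_length ("9abc") == 0);
}
```

Because the check looks only at the high bit, it works the same
whether plain char is signed or unsigned on the host.  It also accepts
byte sequences that are not valid Rust identifiers, which matches the
"pass them through" approach described in the commit message.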