Date: Mon, 21 May 2018 16:16:00 -0000
Message-Id: <834lj1f0ne.fsf@gnu.org>
From: Eli Zaretskii
CC: simark@simark.ca, zjz@zjz.name, gdb-patches@sourceware.org
In-reply-to: (Paul.Koning@dell.com)
Subject: Re: support C/C++ identifiers named with non-ASCII characters
Reply-to: Eli Zaretskii
References: <9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name> <8c8af079-dbb8-207b-5edf-86b99e9f5db8@simark.ca>

> From:
> CC: ,
> Date: Mon, 21 May 2018 14:12:12 +0000
>
> > Given unlimited time, would the right solution be to use a lib to parse the
> > string as utf-8, and reject strings that are not valid utf-8?
>
> This sounds like a scenario where "stringprep" is helpful (or
> necessary).  It validates strings to be valid utf-8, can check that
> they obey certain rules (such as "word elements only" which rejects
> punctuation and the like), and can convert them to a canonical form
> so equal strings match whether they are encoded the same or not.

Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
can not include invalid UTF-8 sequences?
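
For illustration only, the three steps described in the quoted text (validate the bytes as UTF-8, restrict them to word elements, convert them to a canonical form) can be sketched in Python with the standard library. This is neither stringprep itself nor GDB code. The name canonical_identifier is hypothetical, NFC is an assumed choice of canonical form, and str.isidentifier() merely stands in for the "word elements only" rule.

```python
import unicodedata


def canonical_identifier(raw: bytes) -> str:
    """Return the NFC form of raw if it is valid UTF-8 and word-like.

    Hypothetical helper, not part of GDB or of any stringprep library.
    Raises ValueError for invalid UTF-8 or for non-word characters.
    """
    # Step 1: strict decoding rejects malformed UTF-8 sequences.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {exc}") from exc

    # Step 2: canonical form (assumed here to be NFC), so that composed
    # and decomposed spellings of the same name compare equal.
    canonical = unicodedata.normalize("NFC", text)

    # Step 3: "word elements only". str.isidentifier() applies Python's
    # own identifier rules, used here as a stand-in for that check.
    if not canonical.isidentifier():
        raise ValueError(f"not a word-like identifier: {canonical!r}")
    return canonical


if __name__ == "__main__":
    composed = "caf\u00e9".encode("utf-8")     # e-acute as one code point
    decomposed = "cafe\u0301".encode("utf-8")  # 'e' plus combining acute
    print(canonical_identifier(composed) == canonical_identifier(decomposed))
```

Whether such validation is appropriate at all depends on the answer to the question above: if identifiers may arrive in another encoding, or may legitimately contain invalid sequences, strict decoding would reject them.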