From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-patches-return-147471-listarch-gdb-patches=sources.redhat.com@sourceware.org>
Received: (qmail 8237 invoked by alias); 21 May 2018 16:12:41 -0000
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Id: <gdb-patches.sourceware.org>
List-Subscribe: <mailto:gdb-patches-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb-patches/>
List-Post: <mailto:gdb-patches@sourceware.org>
List-Help: <mailto:gdb-patches-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-patches-owner@sourceware.org
Date: Mon, 21 May 2018 16:16:00 -0000
Message-Id: <834lj1f0ne.fsf@gnu.org>
From: Eli Zaretskii <eliz@gnu.org>
To: <Paul.Koning@dell.com>
CC: simark@simark.ca, zjz@zjz.name, gdb-patches@sourceware.org
In-reply-to: <CF83AA8F-D3F8-446C-A078-252ADFB6D4C8@dell.com>	(Paul.Koning@dell.com)
Subject: Re: support C/C++ identifiers named with non-ASCII characters
Reply-to: Eli Zaretskii <eliz@gnu.org>
References: <9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name> <8c8af079-dbb8-207b-5edf-86b99e9f5db8@simark.ca> <CF83AA8F-D3F8-446C-A078-252ADFB6D4C8@dell.com>
X-SW-Source: 2018-05/txt/msg00489.txt.bz2

> From: <Paul.Koning@dell.com>
> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
> Date: Mon, 21 May 2018 14:12:12 +0000
> 
> > Given unlimited time, would the right solution be to use a lib to parse the
> > string as utf-8, and reject strings that are not valid utf-8?
> 
> This sounds like a scenario where "stringprep" is helpful (or
> necessary).  It checks that strings are valid utf-8, can check that
> they obey certain rules (such as "word elements only", which rejects
> punctuation and the like), and can convert them to a canonical form
> so that equal strings match whether or not they are encoded the same
> way.

Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
cannot include invalid UTF-8 sequences?
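
For concreteness, the two operations being discussed (rejecting byte
strings that are not valid UTF-8, and converting to a canonical form so
that equal strings match) can be sketched roughly as below.  This is
only an illustration using Python's standard library; it is not
stringprep itself and not GDB code.  The helper names is_valid_utf8 and
canonical are hypothetical, and the choice of NFC as the canonical form
is an assumption made for the example.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def canonical(name: str) -> str:
    """Normalize to NFC; the normalization form is an assumption here."""
    return unicodedata.normalize("NFC", name)


# The same visible identifier, encoded two different ways:
# U+00E9 (precomposed e-acute) versus "e" followed by U+0301
# (combining acute accent).
precomposed = "caf\u00e9".encode("utf-8")
decomposed = "cafe\u0301".encode("utf-8")

# A Latin-1 encoded name: the lone 0xE9 byte is not valid UTF-8.
latin1_name = b"caf\xe9"
```

With these definitions, precomposed and decomposed differ byte for byte
but compare equal after decoding and passing through canonical(), while
latin1_name is rejected by is_valid_utf8().  Whether a debugger should
enforce either step depends on the answer to the question above.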