From mboxrd@z Thu Jan 1 00:00:00 1970
Mailing-List: contact gdb-patches-help@sourceware.org; run by ezmlm
Sender: gdb-patches-owner@sourceware.org
Subject: Re: support C/C++ identifiers named with non-ASCII characters
Date: Mon, 21 May 2018 15:27:00 -0000
References: <9418d4f0-f22a-c587-cc34-2fa67afbd028@zjz.name> <8c8af079-dbb8-207b-5edf-86b99e9f5db8@simark.ca>
In-Reply-To: <8c8af079-dbb8-207b-5edf-86b99e9f5db8@simark.ca>
Content-Type: text/plain; charset="iso-2022-jp"
MIME-Version: 1.0
X-SW-Source: 2018-05/txt/msg00481.txt.bz2

> On May 21, 2018, at 10:03 AM, Simon Marchi wrote:
>
> ...
> I am not a specialist in lexing and parsing C, so can you explain quickly why
> you think this is a good solution?  Quickly, I understand that you change the
> identifier recognition algorithm to a blacklist of characters rather than
> a whitelist, so bytes that are not recognized (such as those that compose
> the utf-8 encoded characters) are not rejected.
>
> Given unlimited time, would the right solution be to use a lib to parse the
> string as utf-8, and reject strings that are not valid utf-8?

This sounds like a scenario where "stringprep" is helpful (or necessary).
It checks that strings are valid UTF-8, can check that they obey certain
rules (such as "word elements only", which rejects punctuation and the
like), and can convert them to a canonical form so that equivalent strings
compare equal even when they are encoded differently.

	paul
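
To make the three steps above concrete, here is a rough sketch in Python. It is not stringprep itself and not what GDB does. The function name prepare_identifier is hypothetical, NFC is only one possible choice of canonical form, and str.isidentifier() is used as a stand-in for a "word elements only" rule.

```python
import unicodedata


def prepare_identifier(raw: bytes) -> str:
    """Validate and canonicalize a UTF-8 identifier (illustrative sketch only)."""
    # Step 1: reject byte sequences that are not valid UTF-8.
    try:
        text = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"not valid UTF-8: {exc}") from exc

    # Step 2: convert to a canonical form (NFC is an assumption here) so that
    # differently encoded spellings of the same text compare equal.
    text = unicodedata.normalize("NFC", text)

    # Step 3: apply a "word elements only" style rule. str.isidentifier() is a
    # stand-in: it accepts letters, digits and underscore, rejects punctuation.
    if not text.isidentifier():
        raise ValueError(f"not a valid identifier: {text!r}")
    return text
```

With this sketch, the precomposed bytes for "café" (U+00E9) and the decomposed bytes ("e" followed by U+0301) both come back as the same string, while input such as b"a-b" or b"\xff\xfe" raises ValueError.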