Re: support C/C++ identifiers named with non-ASCII characters


From: <Paul.Koning@dell.com>
To: <eliz@gnu.org>
Cc: <simark@simark.ca>, <zjz@zjz.name>, <gdb-patches@sourceware.org>
Subject: Re: support C/C++ identifiers named with non-ASCII characters
Date: Mon, 21 May 2018 18:34:00 -0000
Message-ID: <FCEC48CC-5F04-438F-9B6C-2D8933E64A97@dell.com>
In-Reply-To: <834lj1f0ne.fsf@gnu.org>

> On May 21, 2018, at 12:12 PM, Eli Zaretskii <eliz@gnu.org> wrote:
> 
>> From: <Paul.Koning@dell.com>
>> CC: <zjz@zjz.name>, <gdb-patches@sourceware.org>
>> Date: Mon, 21 May 2018 14:12:12 +0000
>> 
>>> Given unlimited time, would the right solution be to use a lib to parse the
>>> string as utf-8, and reject strings that are not valid utf-8?
>> 
>> This sounds like a scenario where "stringprep" is helpful (or necessary).  It validates strings to be valid utf-8, can check that they obey certain rules (such as "word elements only" which rejects punctuation and the like), and can convert them to a canonical form so equal strings match whether they are encoded the same or not.
> 
> Is it a fact that non-ASCII identifiers must be encoded in UTF-8, and
> can not include invalid UTF-8 sequences?

Encoding is an I/O question.  "UTF-8" and "Unicode" are often confused, but they are distinct.  Unicode is a character set, in which each character is identified by a number (its code point).  For example, 張 is Unicode character number 24373 (0x5f35, usually written U+5F35).

UTF-8 is one of several ways to encode Unicode characters as a byte stream.  The UTF-8 encoding of 張 is e5 bc b5.
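To make the distinction concrete, here is a minimal sketch in Python (chosen only for illustration; none of this is GDB code).  It prints the code point of 張 and then two different byte encodings of that same character.

```python
ch = "張"

# The code point belongs to the character itself, independent of any encoding.
code_point = ord(ch)
print(code_point, hex(code_point))   # 24373 0x5f35

# UTF-8 is one way to serialize that code point as bytes.
utf8 = ch.encode("utf-8")
print(utf8.hex(" "))                 # e5 bc b5

# UTF-16 (big-endian) is another: same character, different bytes.
utf16 = ch.encode("utf-16-be")
print(utf16.hex(" "))                # 5f 35
```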

I don't know what the C/C++ standards say about non-ASCII identifiers.  I assume they are stated to be Unicode, and presumably specific Unicode character classes.  So there are some sequences of Unicode characters that are valid identifiers, while others are not -- exactly as "abc" is a valid ASCII identifier while "a@bc" is not.
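As a rough illustration of "some sequences are valid identifiers, others are not", the sketch below uses Python's own identifier rule as a stand-in.  That is an assumption made purely for the example: Python's rule is based on Unicode character classes, but it is not the C or C++ rule, and the helper name looks_like_identifier is hypothetical.

```python
def looks_like_identifier(name: str) -> bool:
    # Assumption: Python's identifier rule stands in for the C/C++ one.
    # It accepts letters (including non-ASCII ones), digits after the
    # first character, and underscore; it rejects punctuation such as '@'.
    return name.isidentifier()

for candidate in ["abc", "a@bc", "張"]:
    print(candidate, looks_like_identifier(candidate))
```

Here "abc" and "張" are accepted and "a@bc" is rejected, which mirrors the ASCII example above.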

A separate question is the encoding of files.  The encoding rule could be that UTF-8 is required -- or that the encoding is selectable.  There also has to be an encoding in output files (debug data for example).  And when strings are entered at the GDB user interface, they arrive in some encoding.  For all these, UTF-8 is a logical answer.

Not all byte strings are valid UTF-8 strings.  When a byte string is delivered from the outside, it makes sense to check that it is a valid encoding before it is used.  Or you can assume that inputs are valid and rely on "symbol not found" as the general way to handle anything that doesn't match.  For gdb, that may be good enough.
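A validation step of the kind described above might look like the following Python sketch.  The helper name is_valid_utf8 is hypothetical; it simply relies on Python's strict UTF-8 decoder, which rejects truncated sequences, stray bytes, and overlong forms.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")   # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"\xe5\xbc\xb5"))  # complete encoding of 張
print(is_valid_utf8(b"\xe5\xbc"))      # truncated three-byte sequence
print(is_valid_utf8(b"\xff"))          # 0xff never appears in UTF-8
```

The alternative mentioned above, skipping validation and letting the lookup fail, needs no code at all: an invalid byte string simply never matches a symbol.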

Yet another issue: for many characters, there are multiple ways to represent them in Unicode.  For example, ü (latin small letter u with dieresis) can be coded as the single Unicode character 0xfc, or as the pair 0x75 0x0308 (latin small letter u, followed by combining dieresis; a combining mark comes after its base character).  These are supposed to be synonymous; when doing string matches, you'd want them to be taken as equivalent.  The stringprep library helps with this by offering a conversion to a standard form, at which point memcmp will give the correct answer.
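The effect of converting to a standard form can be sketched in Python with the standard unicodedata module.  This is not stringprep itself: the sketch assumes NFC as the canonical form for simplicity (stringprep, as I understand it, specifies NFKC), and the helper name canonical_bytes is hypothetical.

```python
import unicodedata

precomposed = "\u00fc"    # ü as a single code point
decomposed = "u\u0308"    # u followed by combining dieresis

def canonical_bytes(s: str) -> bytes:
    # Assumption: NFC is the agreed canonical form for comparison.
    return unicodedata.normalize("NFC", s).encode("utf-8")

# A raw byte comparison (what memcmp would see) treats them as different.
raw_equal = precomposed.encode("utf-8") == decomposed.encode("utf-8")
print(raw_equal)

# After normalization the byte strings are identical.
normalized_equal = canonical_bytes(precomposed) == canonical_bytes(decomposed)
print(normalized_equal)
```

Both sides of a comparison have to be normalized the same way, so in practice the conversion would be applied once when a name enters the symbol table and once when the user types it.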

	paul


Thread overview: 19+ messages
2018-05-21  9:54 張俊芝
2018-05-21 14:21 ` Simon Marchi
2018-05-21 15:27   ` Paul.Koning
2018-05-21 16:16     ` Eli Zaretskii
2018-05-21 18:34       ` Paul.Koning [this message]
     [not found]         ` <83tvr0ev0p.fsf@gnu.org>
2018-05-21 19:25           ` Paul.Koning
2018-05-21 20:43         ` Joseph Myers
2018-05-22 10:31           ` 張俊芝
2018-05-22  8:34         ` 張俊芝
     [not found]   ` <1b915196-3e97-4892-7426-be4211fe7889@zjz.name>
2018-05-21 18:00     ` 張俊芝
2018-05-21 18:03     ` 張俊芝
2018-05-21 18:14       ` Matt Rice
2018-05-22  7:06         ` 張俊芝
2018-05-22 14:39       ` Pedro Alves
2018-05-22 14:39         ` 張俊芝
2018-05-22 15:17           ` Pedro Alves
2018-05-22 16:42             ` Pedro Alves
2018-05-22 17:31               ` 張俊芝
2018-05-22 17:38                 ` Pedro Alves
