From: Eli Zaretskii <eliz@gnu.org>
To: "Jim Blandy" <jimb@red-bean.com>
Cc: ghost@cs.msu.su, gdb@sources.redhat.com
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 19:16:00 -0000
Message-ID: <ubqv4108c.fsf@gnu.org>
In-Reply-To: <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com> (jimb@red-bean.com)
> Date: Fri, 14 Apr 2006 10:53:44 -0700
> From: "Jim Blandy" <jimb@red-bean.com>
> Cc: "Vladimir Prus" <ghost@cs.msu.su>, gdb@sources.redhat.com
>
> I think folks are seeing difficult problems where there aren't any.
What difficulties? There _are_ no difficulties ;-)
> Suppose we have a wide string where wchar_t values are Unicode code
> points. Suppose our host character set is plain ASCII. Suppose the
> user's program has a string containing the digits '123', followed by
> some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
> 'xyz'. When asked to print that string, GDB should print the
> following twenty-one ASCII characters:
>
> L"123\x0f04\x0fccxyz"
This will work, if we accept your assumptions (which are by no means
universally correct, e.g. parts of our discussion were around whether
the string contains U+XXXX Unicode codepoints or their UTF-16
encodings). But all you did is invent an encoding (and a
variable-size encoding at that). Something in the GUI FE still has to
interpret that encoding, i.e. convert it back to binary representation
of the characters, because your encoding cannot be displayed by any
known GUI API.
Compare this with the facility that we already have today:
(gdb) print *warray@8
{0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}
Except for using up 60-odd characters where you used 21, this is IMHO
better, since it doesn't require any code on the FE side: just convert
the strings to integers, and you've got Unicode, ready to be used for
whatever purposes.
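As a minimal sketch of that conversion on the FE side (Python is used only for illustration; parse_gdb_array is a hypothetical helper, and the sketch assumes the array is printed in the hex form shown above and that each element is a Unicode codepoint, not a UTF-16 code unit):

```python
import re

def parse_gdb_array(text):
    """Turn '{0x0031, 0x0032, ...}' into a list of integer codepoints."""
    # Assumption: every element is printed as a 0x-prefixed hex number.
    return [int(tok, 16) for tok in re.findall(r"0x([0-9A-Fa-f]+)", text)]

gdb_output = "{0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}"
code_points = parse_gdb_array(gdb_output)

# With fixed-size elements, the result maps one-to-one onto characters.
text = "".join(chr(cp) for cp in code_points)
```

Here len(code_points) is the number of characters, and code_points[n] is the n-th character, with no further interpretation needed.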
> Since this is a valid way to write that string in a source program, a
> user at the GDB command line should understand it. Since consumers of
> MI information must contain parsers for C values already, they can
> reliably find the contents of the string.
I only partly agree with the first sentence, and not at all with the
second.
For the interactive user, understanding non-ASCII strings in the
suggested ASCII encoding might not be easy at all. For example, for
all my knowledge of Hebrew, if someone shows me \x05D2, I will have a
hard time recognizing the letter Gimel.
As for the second sentence, ``reliably find the contents of the
string'' obviously doesn't consider the complexities of handling
wide characters. In my experience, for any non-trivial string
processing, working with variable-size encoding is much harder than
with fixed-size wchar_t arrays, because you need to interpret the
bytes as you go, even if all you need is to find the n-th character.
Even the simple task of computing the number of characters in the
string becomes complicated.
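To make this concrete, here is a rough sketch of what an FE would have to do with the escaped form just to count characters or find the n-th one. It is only an illustration under stated assumptions: decode_escaped and nth_char are hypothetical helpers, only \x escapes are handled (no \\, \" or octal escapes), and a hex escape is taken to extend over all following hex digits, as in C.

```python
import re

# One hex escape (greedy, as in C) or any single other character.
TOKEN = re.compile(r"\\x([0-9A-Fa-f]+)|(.)", re.DOTALL)

def decode_escaped(literal):
    """Decode an L"..." literal into a list of codepoints."""
    body = literal[2:-1]  # strip the leading L" and the trailing "
    points = []
    for m in TOKEN.finditer(body):
        if m.group(1) is not None:
            points.append(int(m.group(1), 16))  # \xNNNN escape
        else:
            points.append(ord(m.group(2)))      # plain character
    return points

def nth_char(literal, n):
    """Finding the n-th character requires scanning from the start."""
    return decode_escaped(literal)[n]

escaped = r'L"123\x0f04\x0fccxyz"'
points = decode_escaped(escaped)
```

The escaped text is 21 characters long but holds 8 characters of the target string, so neither its length nor a simple index into it says anything directly about the string's contents; both need the scan above.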
> Note that this gets a GUI the contents of the string in the *target*
> character set. The GUI itself should be responsible for converting
> target characters to whatever character set it wants to use to present
> data to its user. Here, GDB's 'host' character set is just the
> character set used to carry information from GDB to the GUI; it should
> probably be set to ASCII, just to avoid needless variation. But
> either way, it's just acting as a medium for values in C source code
> syntax, and has no bearing on either the character set the target
> program is using, or the character set the GUI will use to present
> data to its user.
What you are suggesting is simple for GDB, but IMHO leaves too much
complexity to the FE. I think GDB could do better. In particular, if
I'm sitting at a UTF-8 enabled xterm, I'd be grateful if GDB would
show me Unicode characters in their normal glyphs, which would require
GDB to output the characters in their UTF-8 encoding (which the
terminal will then display in human-readable form). Your suggestion
doesn't allow such a feature, AFAICS, at least not for CLI users.
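As a small illustration of the UTF-8 output I have in mind (a sketch only, not GDB code; to_utf8 is a hypothetical helper, and the wchar_t values are assumed to be Unicode codepoints):

```python
def to_utf8(code_points):
    """Encode a sequence of Unicode codepoints as UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")

code_points = [0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A]
utf8_bytes = to_utf8(code_points)

# A UTF-8 enabled terminal would render these bytes as glyphs, e.g. via
# sys.stdout.buffer.write(utf8_bytes)
```

The two Tibetan characters take three bytes each in UTF-8, so the eight characters become twelve bytes on the wire, and the terminal does the rest.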
That said, if someone volunteers to do the job of adding your
suggestions to GDB, I won't object to accepting the patches, because
whoever does the job gets to choose the tools.
> Unicode technical report #17 lays out the terminology the Unicode
> folks use for all this stuff, with good explanations:
> http://www.unicode.org/reports/tr17/
Yes, that's good background reading for related stuff.
> According to the ISO C standard, the coding character set used by
> wchar_t must be a superset of that used by char for members of the
> basic character set. See ISO/IEC 9899:1999 (E) section 7.17,
> paragraph 2. So I think it's sufficient for the user to specify the
> coding character set used by wide characters; that fixes the ccs used
> for char values.
If wchar_t uses fixed-size characters, not their variable-size
encodings, then specifying the CCS will do. Encodings are another
matter; as I wrote earlier, there could be many different encodings of
the same CCS, and I suppose some weirdo software somewhere could stuff
such an encoding into a wchar_t.