Date: Sat, 15 Apr 2006 07:14:00 -0000
From: Eli Zaretskii
To: "Jim Blandy"
Cc: ghost@cs.msu.su, gdb@sources.redhat.com
Subject: Re: printing wchar_t*
In-reply-to: <8f2776cb0604141216m216ba87ch529180cd079ce971@mail.gmail.com> (jimb@red-bean.com)
References: <200604141257.41690.ghost@cs.msu.su>
 <200604141837.26618.ghost@cs.msu.su>
 <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com>
 <8f2776cb0604141216m216ba87ch529180cd079ce971@mail.gmail.com>

> Date: Fri, 14 Apr 2006 12:16:36 -0700
> From: "Jim Blandy"
> Cc: ghost@cs.msu.su, gdb@sources.redhat.com
>
> > (gdb) print *warray@8
> > {0x0031, 0x0032, 0x0033, 0x0F04, 0x0FCC, 0x0078, 0x0079, 0x007A}
> >
> > Except for using up 60-odd characters where you used 21, this is IMHO
> > better, since it doesn't require any code on the FE side: just convert
> > the strings to integers, and you've got Unicode, ready to be used for
> > whatever purposes.
>
> If you're printing an expression that evaluates to a string, sure.
> But what if you're printing a value of type struct { wchar *key;
> wchar_t *value }?  What if you're using -stack-list-arguments to show
> values in a stack frame?
Sorry, I don't see the difference.  Perhaps I'm too dense.  Are you
talking about the number of ASCII characters, or something else?

> My point is, MI consumers are already parsing ISO C strings.  They
> just need to parse more of them.

This ``more parsing'' is not magic.  It's a lot of work, in general.

> > For the interactive user, understanding non-ASCII strings in the
> > suggested ASCII encoding might not be easy at all.  For example, for
> > all my knowledge of Hebrew, if someone shows me \x05D2, I will have
> > a hard time recognizing the letter Gimel.
>
> If the host character set includes Gimel, then GDB won't print it with
> a hex escape.

The host character set has nothing to do, in general, with what
characters can be displayed.  The same host character set can be
displayed on an appropriately localized xterm, but not on a bare-bones
character terminal.  Not every system that runs in the Hebrew locale
has a Hebrew-enabled xterm.  Some characters may be missing from a
particular font, especially a Unicode-based font (because there are so
many Unicode characters).  Etc., etc.

Even if I do have a Hebrew-enabled xterm, chances are that it cannot
display characters sent as raw 16-bit Unicode codepoints; it will want
some byte-oriented encoding, like UTF-8, or maybe a single-byte one
like ISO 8859-8.  GDB will generally know nothing about these
complications, unless we teach it.  For example, to display Hebrew
letters on a UTF-8 enabled xterm, we (i.e. the user, through
appropriate GDB commands) will have to tell GDB that wchar_t strings
should be encoded in UTF-8 by the CLI output routines.  Sometimes
these settings can be gleaned from the environment variables, but
Emacs's experience shows how very unreliable and error-prone this is.

> > As for the second sentence, ``reliably find the contents of the
> > string'' there obviously doesn't consider the complexities of
> > handling wide characters.
> > In my experience, for any non-trivial string processing, working
> > with a variable-size encoding is much harder than with fixed-size
> > wchar_t arrays, because you need to interpret the bytes as you go,
> > even if all you need is to find the n-th character.  Even the
> > simple task of computing the number of characters in the string
> > becomes complicated.
>
> I don't understand what you mean.  The rules for parsing ISO C string
> literals into arrays of chars and wide string literals into arrays of
> wide characters are straightforward.

You seem to assume here that the target's and the front end's
character sets, and their notions of wchar_t, are identical.
Otherwise, what was a valid array of wide characters on the target
side will be gibberish on the host side, and will certainly not
display as anything legible.

Unlike the GDB core, which just wants to pass the bytes from here to
there, the UI needs to be able to display the string, and for that it
needs to understand how it is encoded, how many glyphs it will produce
on the screen, where it can be broken into several lines if it is too
long, etc.  This is all trivial with 7-bit ASCII (every byte produces
a single glyph, except a few non-printables; whitespace characters
signal possible locations to break the line; etc.), but it can get
very complex with other character sets.

GDB cannot be asked to know about all of those complications, but I
think it should at least provide a few simple translation services, so
that a front end will not have to work too hard to handle and display
strings as mostly readable text.  Passing the characters as fixed-size
codepoints expressed as ASCII hex strings leaves the front end with
only a very simple job.  What's more, it uses an existing feature:
array printing.

> > What you are suggesting is simple for GDB, but IMHO leaves too much
> > complexity to the FE.  I think GDB could do better.
> > In particular, if I'm sitting at a UTF-8 enabled xterm, I'd be
> > grateful if GDB would show me Unicode characters in their normal
> > glyphs, which would require GDB to output the characters in their
> > UTF-8 encoding (which the terminal will then display in
> > human-readable form).  Your suggestion doesn't allow such a
> > feature, AFAICS, at least not for CLI users.
>
> When the host character set contains a character, there's no need for
> GDB to use an escape to show it.

Whose host character set?  GDB's?  But GDB is not displaying the
strings, the front end is.  And as I wrote above, there's no guarantee
that the host character set can be transparently displayed on the
screen.  This only works for ASCII and some simple single-byte
encodings, mostly Latin ones.  But it doesn't work in general.

And why are you talking about the host character set?  The
L"123\x0f04\x0fccxyz" string came from the target; GDB simply
converted it to 7-bit ASCII.  These are characters from the target
character set.  And the target doesn't necessarily talk in the host
locale's character set and language: you could be debugging a program
which talks Farsi with a GDB that runs in a German locale.

> > If wchar_t uses fixed-size characters, not their variable-size
> > encodings, then specifying the CCS will do.
>
> There is no provision in ISO C for variable-size wchar_t encodings.
> The portion of the standard I referred to says that wchar_t "...is an
> integer type whose range of values can represent distinct codes for
> all members of the largest extended character set specified among the
> supported locales".

I agree, but Windows (and who knows what else) violates that.  Of
course, for the BMP, UTF-16 is indistinguishable from Unicode
codepoints, so in practice this might not matter too much.