From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 16363 invoked by alias); 17 Apr 2006 08:35:46 -0000
Received: (qmail 16113 invoked by uid 22791); 17 Apr 2006 08:35:45 -0000
X-Spam-Check-By: sourceware.org
Received: from nitzan.inter.net.il (HELO nitzan.inter.net.il) (192.114.186.20)
    by sourceware.org (qpsmtpd/0.31) with ESMTP; Mon, 17 Apr 2006 08:35:43 +0000
Received: from HOME-C4E4A596F7 (IGLD-80-230-11-227.inter.net.il [80.230.11.227])
    by nitzan.inter.net.il (MOS 3.7.3-GA) with ESMTP id DDQ50873 (AUTH halo1);
    Mon, 17 Apr 2006 11:35:07 +0300 (IDT)
Date: Mon, 17 Apr 2006 08:58:00 -0000
Message-Id:
From: Eli Zaretskii
To: Vladimir Prus
CC: jimb@red-bean.com, gdb@sources.redhat.com
In-reply-to: <200604171036.48833.ghost@cs.msu.su> (message from Vladimir Prus
    on Mon, 17 Apr 2006 10:36:47 +0400)
Subject: Re: printing wchar_t*
Reply-to: Eli Zaretskii
References: <8f2776cb0604141216m216ba87ch529180cd079ce971@mail.gmail.com>
    <200604171036.48833.ghost@cs.msu.su>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe:
List-Archive:
List-Post:
List-Help:
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00231.txt.bz2

> From: Vladimir Prus
> Date: Mon, 17 Apr 2006 10:36:47 +0400
> Cc: "Jim Blandy" ,
>     gdb@sources.redhat.com
>
> On Saturday 15 April 2006 01:37, Eli Zaretskii wrote:
> > > My point is, MI consumers are already parsing ISO C strings.  They
> > > just need to parse more of them.
> >
> > This ``more parsing'' is not magic.  It's a lot of work, in general.
>
> I don't quite get it.  Say that frontend and gdb somehow agree on the 8-bit
> encoding using by gdb to print the strings.  Then frontend can look at the
> string and:
>
> - If it sees \x, look at the following hex digits and convert it to either
>   code point or code unit
> - If it sees anything else, convert it from local 8-bit to Unicode

That's what Jim was saying.
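For concreteness, the two-step recipe sketched above could look roughly like this in a front end (a hedged Python sketch, not actual GDB or frontend code; the function name is mine, and I assume C-style maximal munch for the hex digits after \x, with every other byte taken as a literal character):

```python
import re

# A \x escape followed by one or more hex digits (maximal munch, as in ISO C).
_HEX_ESCAPE = re.compile(r'\\x([0-9a-fA-F]+)')

def parse_escaped_wide_string(s):
    """Turn the body of an escaped literal like 123\\x0f04\\x0fccxyz
    into a list of integer codes (code points or code units)."""
    out = []
    i = 0
    while i < len(s):
        m = _HEX_ESCAPE.match(s, i)
        if m:
            out.append(int(m.group(1), 16))   # \x escape -> its numeric value
            i = m.end()
        else:
            out.append(ord(s[i]))             # literal char -> its code
            i += 1
    return out
```

For L"123\x0f04\x0fccxyz" this yields [49, 50, 51, 3844, 4044, 120, 121, 122] -- exactly the conversion both sides agree is easy; knowing what those codes actually encode is the harder part discussed below.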
He thought (or so it seemed to me) that, once the ASCII-encoded string was
read by the front end and converted back to the integer values, the job is
done.  That is, in Jim's example with L"123\x0f04\x0fccxyz", the character
`1' is converted to its code 49 decimal, \x0f04 is converted to the 16-bit
code 3844 decimal, `x' is converted to 120 decimal, etc.

What I was saying is that, while this conversion is indeed easy, it doesn't
even come close to what the front end generally wants to do with the string.
You want to _process_ the string, which means you want to know its length in
characters (not bytes), you want to know what character set they encode, you
want to be able to find the n-th character in the string, etc.  The encoding
suggested by Jim makes these tasks very hard, much harder than if we send
the string as an array of fixed-length wide characters.

> Note that due to charset function interface using 'int', you can't use UTF-8
> for encoding passed to frontend, but using ASCII + \x is still feasible.

I don't understand why UTF-8 cannot be used (an int can hold an 8-bit byte
just fine), nor can I see why this is an issue.  We are not discussing the
addition of UTF-8 encoding to GDB; we are discussing how to pass to a front
end wide-character strings held within the debuggee.  Or at least that's
what I thought you were trying to solve.

> There's one nice thing about this approach.  If there's new 'print array
> until XX' syntax, I indeed need to special-case processing of values in
> several contexts -- most notably arguments in stack trace.  With "\x"
> escapes I'd need to write a code to handle them once.  In fact, I can add
> this code right to MI parser (which operates using Unicode-enabled QString
> class already).  That will be more convenient than invoking 'print array'
> for any wchar_t* I ever see.

I don't think we should optimize GDB for one specific toolkit, even if that
toolkit is Qt.

> I don't quite get.
> First you say you want \x05D2 to display using Unicode font
> on console, now you say it's very hard.

No, I said that a GUI front end will be able to display the _binary_ _code_
0x05D2 with a suitable Unicode font.  Jim suggested that seeing the _string_
"\x05D2" in GDB's output would allow me to read the text, to which I replied
that it would not be easy at all, since humans generally don't remember
Unicode codepoints by heart, even for their native languages.

> Now, if you want Unicode display for
> \x05D2, there should be some method to tell gdb that your console can display
> Unicode, and if user told that Unicode is supported, what are the problems?

Please read my other messages: the program being debugged might talk Hebrew
in Unicode codepoints, but the locale where we are running GDB might not
support Hebrew on the console.  So, as long as we are talking about console
output (which is different from a GUI front end), just sending Unicode to
the display is not enough.

I suggest not mixing issues relevant to GUI front ends with those relevant
to text-mode front ends, including the CLI ``front end'' built into GDB
itself.  These are different issues, each with its own set of complexities.
Jim's L"123\x0f04\x0fccxyz" proposal was (I think) oriented more toward text
terminals and the CLI, so the discussion wandered off in that direction.  I
don't think your original problem is related to that.

> > how many glyphs will it produce
> > on the screen, where it can be broken into several lines if it is too
> > long, etc.  This is all trivial with 7-bit ASCII (every byte produces
> > a single glyph, except a few non-printables, whitespace characters
> > signal possible locations to break the line, etc.), but can get very
> > complex with other character sets.
>
> Isn't this completely outside of GDB?

No, not completely: the ui_output routines do this for the console output.
Again, this part was about text-mode output, and the CLI in particular.
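The locale-mismatch point above is easy to demonstrate: a codepoint the debuggee uses may simply have no representation in the console's charset.  A minimal Python sketch (the helper name is mine, chosen for illustration):

```python
def console_can_display(ch, console_encoding):
    # A console can only show characters its locale encoding covers:
    # HEBREW LETTER GIMEL (U+05D2) is fine in UTF-8 or ISO-8859-8,
    # but a Latin-1 (e.g. German-locale) console has no byte for it.
    try:
        ch.encode(console_encoding)
        return True
    except UnicodeEncodeError:
        return False

print(console_can_display('\u05d2', 'utf-8'))      # True
print(console_can_display('\u05d2', 'latin-1'))    # False
```

So even if GDB faithfully emits the Unicode codepoints, a text-mode display in the wrong locale still cannot render them; the GUI and console cases really are separate problems.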
> > GDB cannot be asked to know about all of those complications, but I
> > think it should at least provide a few simple translation services so
> > that a front end will not have to work too hard to handle and display
> > strings as mostly readable text.  Passing the characters as fixed-size
> > codepoints expressed as ASCII hex strings leaves the front-end with
> > only very simple job.  What's more, it uses an existing feature: array
> > printing.
>
> Using \x escapes, provided they encode *code units*, leaves frontend with the
> same simple job.

Yes, but GDB will need to generate the code units first, e.g. convert
fixed-size Unicode wide characters into UTF-8.  That's extra work for GDB.
(Again, we were originally talking about wchar_t, not multibyte strings.)

> Really, using strings with \x escapes differs from array
> printing in just one point: some characters are printed not as hex values,
> but as characters in local 8-bit encoding.  Why do you think this is a
> problem?

Because knowing what the ``local 8-bit encoding'' is, is in itself a huge
problem.  Emacs has been trying to solve it since 1996, and it still hasn't
got all the details right in some marginal cases, although we have people on
the Emacs development team who understand more about i18n than I ever will.
In short, there's no reliable method of finding out the correct 8-bit
encoding in which to talk to any given text-mode display.

And you certainly do NOT want any local 8-bit encodings when you are going
to display the string on a GUI, because that would require the front end to
do the extra work of converting the encoded text back to what it needs to
communicate with the text widgets.

> > And why are you talking about host character set?  The
> > L"123\x0f04\x0fccxyz" string came from the target, GDB simply
> > converted it to 7-bit ASCII.  These are characters from the target
> > character set.
> > And the target doesn't necessarily talk in the host
> > locale's character set and language, you could be debugging a program
> > which talks Farsi with GDB that runs in a German locale.
>
> So, characters that happen to exist in German locale are printed as literal
> chars.  Other characters are printed using \x.  FE reads the string, and when
> it sees literal char, it converts it from German locale to Unicode used
> internally.  Where's the problem?

If this conversion is lossless, it's redundant: it is easier to just send
everything as hex escapes, since no human will see them, only the FE.  That
saves the needless conversion (and the potential problems with an incorrect
notion of the current locale and encoding).

But some conversions to ``literal characters'' (i.e. to 8-bit binary codes)
are lossy, because the underlying converter needs state information to
correctly interpret the byte stream.  This state information is thrown away
once the conversion is done, and so the opposite conversion fails to
reconstruct the original codepoints.  This is usually the case with ISO-2022
encodings.

So I think on balance it's better to send the original wide characters as
hex, the only downside being that it uses more bytes per character.  (Again,
this is about GUI front ends, not about GDB's own CLI output routines.)
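The trade-offs above can be contrasted in a few lines of Python (a sketch only; the function names and the fixed four-hex-digit escape width are my assumptions, not anything GDB actually emits): per-code-unit escaping needs an extra UTF-8 conversion and a variable number of escapes per character; fixed-width per-codepoint hex round-trips trivially; and a stateful encoding like ISO-2022-JP shows why a byte stream cannot be reinterpreted without its shift state:

```python
def code_unit_escapes(code_points):
    # The *code unit* variant: convert to UTF-8 first, then emit one
    # \x escape per byte.  U+0F04 becomes three escapes.
    utf8 = ''.join(chr(cp) for cp in code_points).encode('utf-8')
    return ''.join('\\x%02x' % b for b in utf8)

def code_point_escapes(code_points):
    # The fixed-width variant: one four-digit escape per wide character.
    return ''.join('\\x%04x' % cp for cp in code_points)

def unescape_code_points(text):
    # Exact inverse of code_point_escapes: every escape is 6 chars long.
    return [int(text[i + 2:i + 6], 16) for i in range(0, len(text), 6)]

# One escape per character vs. three escapes for the same character:
assert code_point_escapes([0x0f04]) == '\\x0f04'
assert code_unit_escapes([0x0f04]) == '\\xe0\\xbc\\x84'

# The fixed-width form round-trips losslessly, whatever the characters:
original = [0x05d2, 0x0f04, 0x61]
assert unescape_code_points(code_point_escapes(original)) == original

# A stateful encoding is not recoverable from mid-stream: drop the
# leading shift sequence of ISO-2022-JP output and the remaining bytes
# decode as plain ASCII, not the original character.
stream = '\u3042'.encode('iso2022_jp')   # ESC $ B 24 22 ESC ( B
assert stream[3:].decode('iso2022_jp') != '\u3042'
```

The only cost of the fixed-width form, as noted above, is that it spends six ASCII bytes on every wide character, readable or not.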