Date: Mon, 17 Apr 2006 12:26:00 -0000
From: Eli Zaretskii
To: Vladimir Prus
CC: jimb@red-bean.com, gdb@sources.redhat.com
In-reply-to: <200604171301.59881.ghost@cs.msu.su> (message from Vladimir Prus on Mon, 17 Apr 2006 13:01:58 +0400)
Subject: Re: printing wchar_t*
References: <200604171036.48833.ghost@cs.msu.su> <200604171301.59881.ghost@cs.msu.su>
X-SW-Source: 2006-04/txt/msg00238.txt.bz2

> From: Vladimir Prus
> Date: Mon, 17 Apr 2006 13:01:58 +0400
> Cc: jimb@red-bean.com,
>  gdb@sources.redhat.com
>
> > What I was saying that indeed this conversion is easy, but it's not
> > even close to doing what the front end generally would like to do with
> > the string.  You want to _process_ the string, which means you want to
> > know its length in characters (not bytes), you want to know what
> > character set they encode, you want to be able to find the n-th
> > character in the string, etc.  The encoding suggested by Jim makes
> > these tasks very hard, much harder than if we send the string as an
> > array of fixed-length wide characters.
>
> That's a *completely* different topic.

Yes, it is.  But we must keep it in mind because the front ends want
strings to do something with them.

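A minimal sketch of the difference, for illustration only (Python is
used just for brevity; the sample string and the helper names
nth_char_wide and nth_char_utf8 are hypothetical, and UTF-8 stands in
for "some encoding"): with fixed-length wide characters, the length and
the n-th character are plain array operations, while with code units
the front end has to decode before it can count or index.

```python
# Hypothetical sample: four Hebrew letters.
text = "\u05e9\u05dc\u05d5\u05dd"

# Fixed-length wide characters: one integer per character.
wide = [ord(ch) for ch in text]

# An encoding (UTF-8 here, purely as an example): a sequence of code units.
units = list(text.encode("utf-8"))


def nth_char_wide(values, n):
    """Index directly into the array of fixed-width values."""
    return chr(values[n])


def nth_char_utf8(code_units, n):
    """The byte sequence must be decoded before it can be indexed."""
    return bytes(code_units).decode("utf-8")[n]


# Character count versus code-unit count for the same string.
print(len(wide), len(units))
```

The two counts differ because each of these letters takes two UTF-8
code units, so a byte offset says nothing about the character position.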
> Second, frontend needs to display the data, however it will operate
> using its own data structures, and it does not matter if \x escapes
> were used or not. No frontend will ever work on a string containing
> embedded "\x" escapes.

I was saying that the ASCII encoding suggested by Jim makes it harder
to convert the text into wide characters, that's all.

> > > Using \x escapes, provided they encode *code units*, leaves frontend with
> > > the same simple job.
> >
> > Yes, but GDB will need to generate the code units first, e.g. convert
> > fixed-size Unicode wide characters into UTF-8.
>
> Sorry, where does that UTF-8 comes from?

UTF-8 was an example, the general point being that code units are
present only in encodings, not in fixed-length wide characters.

> > That's extra job for
> > GDB.  (Again, we were originally talking about wchar_t, not multibyte
> > strings.)
>
> I don't understand what's this extra job. This is as simple as:
>
>    for c in wchar_t* literal:
>        if c is representable in host encoding:
>            output_literal
>        else
>            output_hex_escape

That might sound simple to you, but it isn't, in general.  The
``representable in host encoding'' part is very non-trivial; for
example, how do you tell whether the Unicode codepoints 0x05C3 and
0x05C4 can be represented in the Windows codepage 1255 (the former
can, the latter cannot)?  This is generally impossible without using
very complicated algorithms and/or large databases.

The other complex part is ``output_literal'': again, there's no simple
algorithm to map Unicode's 0x05C3 into cp1255's 0xD3.  You need tables
again, and you need separate tables for each possible encoding (Hebrew
has at least 3 widely used ones, Russian has at least 5, etc.).

> > > Really, using strings with \x escapes differs from array
> > > printing in just one point: some characters are printed not as hex
> > > values, but as characters in local 8-bit encoding. Why do you think this
> > > is a problem?
> >
> > Because knowing what is the ``local 8-bit encoding'' is in itself a
> > huge problem.
> [...]
> I trust you on that, but nothing prevents user/frontend to explicitly specify
> the encoding.

What makes you think the user and/or front end will know what to
specify?  Experience shows they generally don't.

> > And you certainly do NOT want any local 8-bit encodings when you are
> > going to display the string on a GUI, because that would require that
> > the front end does some extra job of converting the encoded text back
> > to what it needs to communicate with the text widgets.
>
> I would expect that any GUI toolkit that pretend to support Unicode *has* to
> support conversion from local 8 bit encodings. Otherwise, such toolkit is of
> no use in real world.

Then most of them are ``of no use''.  You can rely on most of the
modern GUI toolkits to support conversion from UTF-8 to Unicode, but
that's about it.  For anything more complex, your best bet is to link
against libiconv or similar.

> By the way, unless your target encoding is ASCII, frontend has to be aware of
> local 8 bit encoding anyway. If I wrote program using KOI8-R and frontend
> shows the char* (not wchar_t*) strings as ASCII, the frontend is broken
> already.

This only works as long as you use the encoding that matches your
default fonts.  Once it doesn't match, or the encoded characters come
from a program written for different locale conventions, you are out
of luck.

It is important to realize that programs don't know anything about
characters, all they see is integer code values.  To display those
codes in human-readable form, a program needs to know what display API
to call and which font to request.  This kind of information is absent
from simple text files that hold encoded non-ASCII text, so programs
generally need additional info to DTRT.  The same holds for arbitrary
strings GDB spills on you from some address in the debuggee.

> 1. Gbd should be modified to print wchar_t* literals.

``Print'' is ambiguous in this context.  I believe you mean ``send to
the front end'', since this was your original problem.  If the front
end is charged with displaying the wchar_t strings, GDB does not need
to print anything by itself.  Am I right?

> It should use the same
> logic as for char* to decide if value is representable in the host charset,

I hope I explained above why this part is highly non-trivial.  That is
why I think GDB should use hex notation for all characters, and leave
it to the FE to deal with their display.
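
As a rough sketch of that division of labor (Python only for
illustration; dump_wide_hex, fe_decode and representable are
hypothetical names, not GDB or MI functions, the brace-and-comma
output format is made up, and the wide characters are assumed to hold
Unicode codepoints): the debugger side formats every wide character as
a hex number with no charset logic at all, and the front end converts
the numbers into whatever its toolkit needs.  The last helper shows
where the representability answer actually comes from: the codec's
mapping table for cp1255, not any simple rule.

```python
def dump_wide_hex(values):
    """Debugger side: format every wide character as hex, no charset logic."""
    return "{" + ", ".join("0x%04x" % v for v in values) + "}"


def fe_decode(dump):
    """Front-end side: parse the hex list back into a native string.

    Assumes the values are Unicode codepoints.
    """
    body = dump.strip("{}")
    return "".join(chr(int(tok, 16)) for tok in body.split(",") if tok.strip())


def representable(codepoint, encoding):
    """True if the codec's mapping table for `encoding` has this codepoint."""
    try:
        chr(codepoint).encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


dump = dump_wide_hex([0x05C3, 0x05C4])
print(dump)
print(representable(0x05C3, "cp1255"), representable(0x05C4, "cp1255"))
```

Note that representable() works here only because Python ships a
table for cp1255; a debugger making the same decision would need an
equivalent table for every host encoding it wants to support.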