From mboxrd@z Thu Jan  1 00:00:00 1970
Received: (qmail 27705 invoked by alias); 14 Apr 2006 13:59:24 -0000
Received: (qmail 27696 invoked by uid 22791); 14 Apr 2006 13:59:23 -0000
X-Spam-Check-By: sourceware.org
Received: from romy.inter.net.il (HELO romy.inter.net.il) (192.114.186.66)
    by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 13:59:22 +0000
Received: from HOME-C4E4A596F7 (IGLD-83-130-214-179.inter.net.il [83.130.214.179])
    by romy.inter.net.il (MOS 3.7.3-GA) with ESMTP id DZD20753 (AUTH halo1);
    Fri, 14 Apr 2006 16:59:16 +0300 (IDT)
Date: Fri, 14 Apr 2006 14:16:00 -0000
From: Eli Zaretskii
To: Vladimir Prus
CC: gdb@sources.redhat.com
In-reply-to: <200604141257.41690.ghost@cs.msu.su> (message from Vladimir Prus
    on Fri, 14 Apr 2006 12:57:41 +0400)
Subject: Re: printing wchar_t*
Reply-to: Eli Zaretskii
References: <200604141257.41690.ghost@cs.msu.su>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00188.txt.bz2

> From: Vladimir Prus
> Date: Fri, 14 Apr 2006 12:57:41 +0400
> Cc: gdb@sources.redhat.com
>
> > In particular, if the original wchar_t uses Unicode codepoints, then
> > presumably there should be some GUI API call, specific to your
> > windowing system, that would accept such a wchar_t string and display
> > it using a Unicode font.
>
> Sure, I know how to display Unicode string. The question is how to get at pass
> raw Unicode data from gdb to frontend in the form suitable for me and most
> reasonable to other users of gdb.

I suggested to use array features for that.

> In an original post, I've asked if gdb can print wchar_t just as a raw
> sequence of values, like this:
>
>     0x56, 0x1456

The answer is YES.  Use array notation, and add a feature to report
the length of a wchar_t array.

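To make the raw-values format concrete, here is a minimal sketch in C
of what such output amounts to.  It is only an illustration under my
own assumptions: format_wchars_hex is a hypothetical helper, not a GDB
function, and the length is passed in explicitly because a wchar_t
array does not carry its own length.

```c
#include <stddef.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper (not part of GDB): format the first `len`
   elements of a wchar_t array as comma-separated hex values, for
   example "0x56, 0x1456".  Returns the number of characters written
   (excluding the terminating NUL), or -1 if `out` is too small.
   Assumes the element values are non-negative.  */
int format_wchars_hex(const wchar_t *s, size_t len, char *out, size_t outsize)
{
    size_t used = 0;

    if (outsize == 0)
        return -1;
    out[0] = '\0';

    for (size_t i = 0; i < len; i++) {
        /* Emit a ", " separator before every element except the first.  */
        int n = snprintf(out + used, outsize - used, "%s0x%lx",
                         i ? ", " : "", (unsigned long)s[i]);
        if (n < 0 || (size_t)n >= outsize - used)
            return -1;          /* output would have been truncated */
        used += (size_t)n;
    }
    return (int)used;
}
```

A front end that receives the values in this form can convert them to
whatever its display API expects, without GDB having to decide
whether the characters are printable.
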
> "foo" and L"foo" are other alternatives which might be more handy for general
> users of gdb.

L"foo" will not help you here, because the characters in question are
not printable.  If GDB outputs L"foo" where every character is not
printable, you will have the same problem as you have now.

> > > But then there's a problem:
> > >
> > > - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >
> > You don't need to assume, you can ask the application.  Wouldn't
> > "sizeof(wchar_t)" do the trick?
>
> Deciding if it's UTF-16 or UTF-32 is not the problem.

Well, you did ask about the distinction.

> In fact, exactly the same code will handle both encodings just fine.

Again, please don't say "encoding" when you mean a character's
codepoint.  It's confusing, and risks obscuring the problem.  See
below.

> The question if we allow encodings which are not UTF-16 or UTF-32. I
> don't know about any such encodings, but I'm not an i18n expert.

There are a myriad of encodings, but the only ones that could ever
qualify as wchar_t are single-byte (8-bit) encodings that are
generally used for Latin languages (and for several others, like
Cyrillic and Hebrew).

What you need is a way to tell GDB how the strings are represented in
the debuggee's wchar_t, and then GDB should convert that
representation into something your FE can display.  Assuming your FE
is able to display Unicode characters, GDB should convert to Unicode
if the debuggee's wchar_t is not Unicode already.

There's no universal way for GDB to know what the debuggee holds in
wchar_t, so I think the only reasonable way is for the user to say
so.  A reasonable default would be 16-bit Unicode codepoints from the
BMP, or 32-bit Unicode codepoints from the entire range of Unicode
characters.  (I think glibc uses the latter.)

> > > - how user-specified encoding will be handled
> >
> > wchar_t is not an encoding, it's the characters' codes themselves.
>
> I don't understand what you say here, sorry.
> Do you mean that each wchar_t is
> in general code point, not a complete abstract character. Yes, true, and
> what? If wchar_t* literals can use encoding other then UTF-16 and UTF-32, you
> need the code to handle that encoding, and the question arises where you'll
> get that code, will it be iconv or something else.
>
> > Encoded characters are (in general) multibyte character strings, not
> > wchar_t.  See, for example, the description of library functions
> > mbsinit, mbrlen, mbrtowc, etc., for more about this distinction.
>
> I know about this distinction.

If you know about this distinction, then you should have no trouble
understanding what I said about wchar_t NOT being an encoding.  UTF-8
and UTF-16 are multibyte variable-length _encodings_ of Unicode
characters' _codepoints_.  For example, the Cyrillic letter ``small
a'' has Unicode codepoint 0x0430, but its UTF-8 encoding is a
two-byte sequence 0xD0 0xB0.  The codepoint is something you will
find in a wchar_t array, while the UTF-8 encoding is something you
will find in a multibyte string.

Now, the same letter ``small a'' can be encoded in several other
ways: for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31
0x28 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is
0xD0, etc.

It should be obvious that, of all the encodings, only the
fixed-length ones can be used in a wchar_t array (because wchar_t
arrays are stateless, while multibyte encodings produce stateful
strings, where the beginning of each encoded character cannot be
decided without processing all the characters before it).  It should
also be obvious that using wchar_t for single-byte encodings is not
useful (you waste storage).  Thus, the only practical use of wchar_t
is for character sets that do not fit into a single byte, and for
those, all the encodings I know of are variable-length multibyte
encodings, which are not suitable for wchar_t, as mentioned above.
This is why I said that wchar_t is not used for an encoding (such as
ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.

It is nowadays almost universally accepted that wchar_t holds a
Unicode codepoint, the only difference between applications being
whether only the first 64K characters (the so-called BMP) are
supported by a 16-bit wchar_t, or the entire 21-bit range (up to
0x10FFFF) is supported by a 32-bit wchar_t.
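
To illustrate the codepoint/encoding distinction, here is a minimal
sketch in C that turns one Unicode codepoint (what a wchar_t element
would hold) into its UTF-8 encoding (what a multibyte string would
hold).  The function name utf8_encode is my own choice for the
example, not a library routine; real code would more likely use
wcrtomb or iconv.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative helper (hypothetical name): encode one Unicode
   codepoint as UTF-8 into out[0..3].  Returns the number of bytes
   written, or 0 if `cp` is not a valid codepoint (a surrogate, or
   above 0x10FFFF).  */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 1 byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not characters */
        return 0;
    if (cp < 0x10000) {                 /* 3 bytes: rest of the BMP */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 4 bytes: beyond the BMP */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode (0x0430, buf) stores the two bytes 0xD0 0xB0 and
returns 2, matching the Cyrillic ``small a'' example: one fixed-size
value on the wchar_t side, a variable number of bytes on the
multibyte side.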