From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Fri, 14 Apr 2006 14:16:00 -0000
Message-Id: <uu08w1cnf.fsf@gnu.org>
From: Eli Zaretskii <eliz@gnu.org>
To: Vladimir Prus <ghost@cs.msu.su>
CC: gdb@sources.redhat.com
In-reply-to: <200604141257.41690.ghost@cs.msu.su> (message from Vladimir Prus 	on Fri, 14 Apr 2006 12:57:41 +0400)
Subject: Re: printing wchar_t*
Reply-to: Eli Zaretskii <eliz@gnu.org>
References: <e1lsqg$aml$1@sea.gmane.org> <e1necb$gen$1@sea.gmane.org> <u3bgg35u9.fsf@gnu.org> <200604141257.41690.ghost@cs.msu.su>

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Fri, 14 Apr 2006 12:57:41 +0400
> Cc: gdb@sources.redhat.com
> 
> > In particular, if the original wchar_t uses Unicode codepoints, then
> > presumably there should be some GUI API call, specific to your
> > windowing system, that would accept such a wchar_t string and display
> > it using a Unicode font.
> 
> Sure, I know how to display Unicode string. The question is how to get at pass 
> raw Unicode data from gdb to frontend in the form suitable for me and most 
> reasonable to other users of gdb.

I suggested using the array features for that.

> In an original post, I've asked if gdb can print wchar_t just as a raw 
> sequence of values, like this:
> 
>     0x56, 0x1456

The answer is YES.  Use array notation, and add a feature to report
the length of a wchar_t array.
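
For illustration only, here is a minimal Python sketch of what a front
end could do with such output.  It assumes GDB prints the array as a
comma-separated list of hex values, in the form quoted above, and that
each value is a Unicode codepoint.  The name parse_wchar_values is made
up for this example; it is not a GDB interface.

```python
def parse_wchar_values(text):
    """Turn a string such as "0x56, 0x1456" into a list of integers."""
    return [int(token.strip(), 16) for token in text.split(",") if token.strip()]


# The sample values from the quoted message.
values = parse_wchar_values("0x56, 0x1456")

# Assumption: each value is a Unicode codepoint, so chr() maps it
# directly to the character the front end should display.
text = "".join(chr(v) for v in values)
```

The front end then hands `text` to whatever API it uses to draw
Unicode strings.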

> "foo" and L"foo" are other alternatives which might be more handy for general 
> users of gdb.

L"foo" will not help you here, because the characters in question are
not printable.  If GDB outputs L"foo" where none of the characters is
printable, you will have the same problem you have now.

> > > But then there's a problem:
> > >
> > >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >
> > You don't need to assume, you can ask the application.  Wouldn't
> > "sizeof(wchar_t)" do the trick?
> 
> Deciding if it's UTF-16 or UTF-32 is not the problem.

Well, you did ask about the distinction.

> In fact, exactly the same code will handle both encodings just fine.

Again, please don't say "encoding" when you mean a character's
codepoint.  It's confusing, and risks obscuring the problem.  See below.

> The question if we allow encodings which are not UTF-16 or UTF-32. I
> don't know about any such encodings, but I'm not an i18n expert.

There are a myriad of encodings, but the only ones that could ever
qualify as wchar_t are single-byte (8-bit) encodings that are
generally used for Latin languages (and for several others, like
Cyrillic and Hebrew).

What you need is a way to tell GDB how strings are represented in
the debuggee's wchar_t, and then GDB should convert that
representation into something your FE can display.  Assuming your FE
is able to display Unicode characters, GDB should convert to
Unicode, if the debuggee's wchar_t is not Unicode already.

There's no universal way for GDB to know what the debuggee holds in
wchar_t, so I think the only reasonable way is for the user to
say so.  A reasonable default would be 16-bit Unicode codepoints
from the BMP, or 32-bit Unicode codepoints from the entire range of
Unicode characters.  (I think glibc uses the latter.)
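
To make the idea concrete, here is a rough Python sketch of such a
conversion.  It is not GDB code: the function name and the `charset`
parameter are hypothetical, and the charset names are Python codec
names, not GDB settings.  With no charset given, the values are taken
to be Unicode codepoints (the default suggested above); otherwise each
value is assumed to be one byte of the named single-byte charset.

```python
def wchar_values_to_unicode(values, charset=None):
    """Convert raw wchar_t values read from the debuggee to a Unicode string.

    charset=None  -> values are already Unicode codepoints.
    charset=name  -> values are bytes of a single-byte charset
                     (hypothetical user setting, e.g. "koi8_r").
    """
    if charset is None:
        # chr() raises ValueError for values outside the Unicode range.
        return "".join(chr(v) for v in values)
    # bytes() raises ValueError for values above 255, which would mean
    # the user named a single-byte charset that does not match the data.
    return bytes(values).decode(charset)


# Cyrillic small a, first as a Unicode codepoint, then as a KOI8-R byte.
from_unicode = wchar_values_to_unicode([0x0430])
from_koi8r = wchar_values_to_unicode([0xC1], "koi8_r")
```

Both calls yield the same one-character string, which is what the FE
would then display.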

> > >      - how user-specified encoding will be handled
> >
> > wchar_t is not an encoding, it's the characters' codes themselves.
> 
> I don't understand what you say here, sorry. Do you mean that each wchar_t is 
> in general code point, not a complete abstract character. Yes, true, and 
> what? If wchar_t* literals can use encoding other then UTF-16 and UTF-32, you 
> need the code to handle that encoding, and the question arises where you'll 
> get that code, will it be iconv or something else.
> 
> > Encoded characters are (in general) multibyte character strings, not
> > wchar_t.  See, for example, the description of library functions
> > mbsinit, mbrlen, mbrtowc, etc., for more about this distinction.
> 
> I know about this distinction.

If you know about this distinction, then you should have no trouble
understanding what I said about wchar_t NOT being an encoding.  UTF-8
and UTF-16 are multibyte variable-length _encodings_ of Unicode
characters' _codepoints_.  For example, the Cyrillic letter ``small
a'' has Unicode codepoint 0x0430, but its UTF-8 encoding is a two-byte
sequence 0xD0 0xB0.  The codepoint is something you will find in a
wchar_t array, while the UTF-8 encoding is something you will find in
a multibyte string.
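
The same distinction can be checked in a few lines of Python (used here
only as a convenient calculator; the variable names are arbitrary):

```python
ch = "\u0430"                 # Cyrillic small letter a

codepoint = ord(ch)           # the character's code itself
utf8 = ch.encode("utf-8")     # one particular encoding of that code

print(hex(codepoint))         # 0x430
print(utf8.hex(" "))          # d0 b0
```

The first value is what a wchar_t array element would hold; the second
is what a UTF-8 multibyte string would hold.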

Now, the same letter ``small a'' can be encoded in several other ways:
for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
0x50, its KOI8-R encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
etc.  It should be obvious that, of all the encodings, only the
fixed-length ones can be used in a wchar_t array (because wchar_t
arrays are stateless, while multibyte encodings produce stateful
strings, where the beginning of each encoded character cannot be
decided without processing all the characters before it).  It should
also be obvious that using wchar_t for single-byte encodings is not
useful (you waste storage).  Thus, the only practical use of wchar_t
is for character sets that do not fit into a single byte, and for
those, all the encodings I know of are variable-length multibyte
encodings, which are not suitable for wchar_t, as mentioned above.
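
As a small sketch of the above, using Python's codecs (the codec names
"koi8_r" and "iso8859_5" are Python's; the ISO-2022-7bit variant quoted
above is not among Python's standard codecs, so it is left out):

```python
small_a = "\u0430"            # Cyrillic small letter a

# Fixed-length single-byte encodings: always one byte per character.
koi8r = small_a.encode("koi8_r")
iso5 = small_a.encode("iso8859_5")

# A variable-length multibyte encoding: the byte count depends on the
# character, so element N of the byte string is not character N.
utf8_lengths = [len(c.encode("utf-8")) for c in ("a", "\u0430", "\u20ac")]

print(koi8r.hex(), iso5.hex())   # c1 d0
print(utf8_lengths)              # [1, 2, 3]
```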

This is why I said that wchar_t is not used for an encoding (such as
ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.  It is
nowadays almost universally accepted that wchar_t is a Unicode
codepoint, the only difference between applications being whether only
the first 64K characters (the so-called BMP) are supported by 16-bit
wchar_t, or the entire 21-bit range is supported by a 32-bit wchar_t.
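
A minimal sketch of that difference, again in Python and with a
made-up helper name: a codepoint can be stored in a single 16-bit
wchar_t only if it lies in the BMP, i.e. does not exceed 0xFFFF.

```python
BMP_MAX = 0xFFFF  # last codepoint of the Basic Multilingual Plane


def fits_in_16_bit_wchar(ch):
    """True if the character's codepoint fits in one 16-bit wchar_t."""
    return ord(ch) <= BMP_MAX


in_bmp = fits_in_16_bit_wchar("\u0430")          # Cyrillic small a
outside_bmp = fits_in_16_bit_wchar("\U00010330")  # a Gothic letter, beyond the BMP
```

Here `in_bmp` is true and `outside_bmp` is false; the second character
needs a 32-bit wchar_t to be held as a single codepoint.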