From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-24950-listarch-gdb=sources.redhat.com@sourceware.org>
Received: (qmail 27188 invoked by alias); 17 Apr 2006 11:21:38 -0000
Received: (qmail 27178 invoked by uid 22791); 17 Apr 2006 11:21:37 -0000
X-Spam-Check-By: sourceware.org
Received: from nitzan.inter.net.il (HELO nitzan.inter.net.il) (192.114.186.20)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Mon, 17 Apr 2006 11:21:36 +0000
Received: from HOME-C4E4A596F7 (IGLD-80-230-11-227.inter.net.il [80.230.11.227]) 	by nitzan.inter.net.il (MOS 3.7.3-GA) 	with ESMTP id DDR05643 (AUTH halo1); 	Mon, 17 Apr 2006 14:21:29 +0300 (IDT)
Date: Mon, 17 Apr 2006 12:26:00 -0000
Message-Id: <uzmikxxab.fsf@gnu.org>
From: Eli Zaretskii <eliz@gnu.org>
To: Vladimir Prus <ghost@cs.msu.su>
CC: jimb@red-bean.com, gdb@sources.redhat.com
In-reply-to: <200604171301.59881.ghost@cs.msu.su> (message from Vladimir Prus 	on Mon, 17 Apr 2006 13:01:58 +0400)
Subject: Re: printing wchar_t*
Reply-to: Eli Zaretskii <eliz@gnu.org>
References: <e1lsqg$aml$1@sea.gmane.org> <200604171036.48833.ghost@cs.msu.su> <u7j5ozjk1.fsf@gnu.org> <200604171301.59881.ghost@cs.msu.su>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb/>
List-Post: <mailto:gdb@sourceware.org>
List-Help: <mailto:gdb-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00238.txt.bz2

> From: Vladimir Prus <ghost@cs.msu.su>
> Date: Mon, 17 Apr 2006 13:01:58 +0400
> Cc: jimb@red-bean.com,
>  gdb@sources.redhat.com
> 
> > What I was saying that indeed this conversion is easy, but it's not
> > even close to doing what the front end generally would like to do with
> > the string.  You want to _process_ the string, which means you want to
> > know its length in characters (not bytes), you want to know what
> > character set they encode, you want to be able to find the n-th
> > character in the string, etc.  The encoding suggested by Jim makes
> > these tasks very hard, much harder than if we send the string as an
> > array of fixed-length wide characters.
> 
> That's a *completely* different topic.

Yes, it is.  But we must keep it in mind, because front ends want
these strings in order to do something with them.

> Second, frontend needs to display the data, however it will operate
> using its own data structures, and it does not matter if \x escapes
> were used or not. No frontend will ever work on a string containing
> embedded "\x" escapes.

I was saying that the ASCII encoding suggested by Jim makes it harder
to convert the text into wide characters, that's all.

> > > Using \x escapes, provided they encode *code units*, leaves frontend with
> > > the same simple job.
> >
> > Yes, but GDB will need to generate the code units first, e.g. convert
> > fixed-size Unicode wide characters into UTF-8.  
> 
> Sorry, where does that UTF-8 comes from?

UTF-8 was an example, the general point being that code units are
present only in encodings, not in fixed-length wide characters.
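
To illustrate the difference between fixed-length wide characters and
code units, here is a minimal Python sketch.  It assumes each
character fits in one wide character (as with a 4-byte wchar_t); the
function names are made up for the example and have nothing to do with
GDB's sources.

```python
# Four Hebrew letters, written as escapes so the source stays ASCII.
text = "\u05e9\u05dc\u05d5\u05dd"

# Fixed-length wide characters: one integer per character.
wide = [ord(ch) for ch in text]

# UTF-8: a sequence of code units (bytes), two per character here.
utf8 = text.encode("utf-8")


def nth_char_wide(units, n):
    """Direct indexing works: unit n is character n."""
    return chr(units[n])


def nth_char_utf8(data, n):
    """Must scan from the start, counting lead bytes."""
    count = -1
    start = None
    for i, byte in enumerate(data):
        if byte & 0xC0 != 0x80:      # not a continuation byte
            count += 1
            if count == n:
                start = i
            elif count == n + 1:
                return data[start:i].decode("utf-8")
    if start is not None:
        return data[start:].decode("utf-8")
    raise IndexError(n)


print(len(wide), len(utf8))
print(nth_char_wide(wide, 2) == nth_char_utf8(utf8, 2))
```

With the wide array, length and indexing are trivial; with the encoded
form, both require walking the code units.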

> > That's extra job for 
> > GDB.  (Again, we were originally talking about wchar_t, not multibyte
> > strings.)
> 
> I don't understand what's this extra job. This is as simple as:
> 
>    for c in wchar_t* literal:
>        if c is representable in host encoding:
>             output_literal
>        else
>             output_hex_escape

That might sound simple to you, but in general it isn't.  The
``representable in host encoding'' part is far from trivial; for
example, how do you tell whether the Unicode codepoints 0x05C3 and
0x05C4 can be represented in the Windows codepage 1255 (the former
can, the latter cannot)?  This is generally impossible without using
very complicated algorithms and/or large databases.

The other complex part is ``output_literal'': again, there's no simple
algorithm to map Unicode's 0x05C3 into cp1255's 0xD3.  You need tables
again, and you need separate tables for each possible encoding (Hebrew
has at least 3 widely used ones, Russian has at least 5, etc.).
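
A short Python sketch of the two points above.  It leans on the
conversion tables that ship with Python's codecs, which is exactly the
sort of per-encoding data the pseudocode quietly assumes; host_bytes
is a hypothetical helper name, not anything in GDB.

```python
def host_bytes(codepoint, encoding):
    """Return the bytes for codepoint in encoding, or None if the
    encoding's table has no entry for it."""
    try:
        return chr(codepoint).encode(encoding)
    except UnicodeEncodeError:
        return None


# The example from the text: U+05C3 maps to 0xD3 in cp1255,
# while U+05C4 has no mapping at all.
print(host_bytes(0x05C3, "cp1255"))
print(host_bytes(0x05C4, "cp1255"))

# One Hebrew letter (U+05D0), three Hebrew encodings, each needing
# its own table.
for enc in ("cp1255", "iso8859_8", "cp862"):
    print(enc, host_bytes(0x05D0, enc))
```

Both the ``is it representable'' test and the ``output_literal'' step
reduce to a table lookup, and the table differs for every encoding.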

> > > Really, using strings with \x escapes differs from array
> > > printing in just one point: some characters are printed not as hex
> > > values, but as characters in local 8-bit encoding. Why do you think this
> > > is a problem?
> >
> > Because knowing what is the ``local 8-bit encoding'' is in itself a
> > huge problem.
> [...]
> I trust you on that, but nothing prevents user/frontend to explicitly specify 
> the encoding.

What makes you think the user and/or front end will know what to
specify?  Experience shows they generally don't.

> > And you certainly do NOT want any local 8-bit encodings when you are
> > going to display the string on a GUI, because that would require that
> > the front end does some extra job of converting the encoded text back
> > to what it needs to communicate with the text widgets.
> 
> I would expect that any GUI toolkit that pretend to support Unicode *has* to 
> support conversion from local 8 bit encodings. Otherwise, such toolkit is of 
> no use in real world.

Then most of them are ``of no use''.  You can rely on most of the
modern GUI toolkits to support conversion from UTF-8 to Unicode, but
that's about it.  For anything more complex, your best bet is to link
against libiconv or similar.

> By the way, unless your target encoding is ASCII, frontend has to be aware of 
> local 8 bit encoding anyway. If I wrote program using KOI8-R and frontend 
> shows the char* (not wchar_t*) strings as ASCII, the frontend is broken 
> already.

This only works as long as you use the encoding that matches your
default fonts.  Once it doesn't match, or the encoded characters come
from a program written for different locale conventions, you are out
of luck.

It is important to realize that programs don't know anything about
characters, all they see is integer code values.  To display those
codes in human-readable form, a program needs to know what display API
to call and which font to request.  This kind of information is absent
from simple text files that hold encoded non-ASCII text, so programs
generally need additional info to do the right thing.  The same holds
for arbitrary strings GDB spills on you from some address in the
debuggee.
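
As a small illustration (a Python sketch, with a made-up helper name),
the same byte value turns into three different characters depending
on which encoding the reader assumes:

```python
raw = b"\xe0"   # a single byte, as it might sit in the debuggee


def decode_as(data, encoding):
    """Interpret the same integer code under an assumed encoding."""
    return data.decode(encoding)


for enc in ("cp1255", "cp1251", "latin-1"):
    ch = decode_as(raw, enc)
    print(enc, "U+%04X" % ord(ch))
```

Nothing in the byte itself says which of these readings is the
intended one; that information has to come from somewhere else.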

> 1. Gbd should be modified to print wchar_t* literals.

``Print'' is ambiguous in this context.  I believe you mean ``send to
the front end'', since this was your original problem.  If the front
end is charged with displaying the wchar_t strings, GDB does not need
to print anything by itself.  Am I right?

> It should use the same 
> logic as for char* to decide if value is representable in the host charset, 

I hope I explained above why this part is highly non-trivial.  That is
why I think GDB should use hex notation for all characters, and leave
it to the front end to deal with their display.
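
A rough sketch of what that could look like, in Python.  The textual
format (space-separated 0x values) is only an assumption made for the
example, not an existing GDB or MI output format, and both function
names are hypothetical.

```python
def to_hex_units(code_points):
    """Debugger side: emit every wide character as a hex value,
    with no attempt to decide what is printable."""
    return " ".join("0x%04x" % cp for cp in code_points)


def from_hex_units(text):
    """Front-end side: rebuild a string of characters from the
    hex values, ready for its own data structures."""
    return "".join(chr(int(tok, 16)) for tok in text.split())


wire = to_hex_units([0x0068, 0x0069, 0x05C3])
print(wire)
print(from_hex_units(wire) == "hi\u05c3")
```

The debugger side needs no knowledge of the host character set, and
the front end gets fixed-length values it can index and count
directly.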