From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-24881-listarch-gdb=sources.redhat.com@sourceware.org>
Received: (qmail 12356 invoked by alias); 14 Apr 2006 07:29:42 -0000
Received: (qmail 12347 invoked by uid 22791); 14 Apr 2006 07:29:41 -0000
X-Spam-Check-By: sourceware.org
Received: from xproxy.gmail.com (HELO xproxy.gmail.com) (66.249.82.200)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 07:29:39 +0000
Received: by xproxy.gmail.com with SMTP id h29so9623wxd         for <gdb@sources.redhat.com>; Fri, 14 Apr 2006 00:29:37 -0700 (PDT)
Received: by 10.70.92.11 with SMTP id p11mr1749000wxb;         Fri, 14 Apr 2006 00:29:37 -0700 (PDT)
Received: by 10.70.125.5 with HTTP; Fri, 14 Apr 2006 00:29:37 -0700 (PDT)
Message-ID: <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>
Date: Fri, 14 Apr 2006 08:07:00 -0000
From: "Jim Blandy" <jimb@red-bean.com>
To: "Vladimir Prus" <ghost@cs.msu.su>
Subject: Re: printing wchar_t*
Cc: gdb@sources.redhat.com
In-Reply-To: <e1necb$gen$1@sea.gmane.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <e1lsqg$aml$1@sea.gmane.org> 	 <8f2776cb0604131031g370d6fa9p9361421bd21d178@mail.gmail.com> 	 <e1necb$gen$1@sea.gmane.org>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb/>
List-Post: <mailto:gdb@sourceware.org>
List-Help: <mailto:gdb-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00169.txt.bz2

On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> Jim Blandy wrote:
>
> > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> >> I have a user-defined command that can produce the output I want, but is
> >> defining a custom command the right approach?
> >
> > Well, you'd like wide strings to be printed properly when they appear
> > in structures, as arguments to functions, and so on, right?  So a
> > user-defined command isn't ideal.
>
> I think I'll still need to do some processing for wchar_t* on frontend side.
> The problem is that I don't see any way how gdb can print wchar_t in a way
> that does not require post-processing. It can print it as UTF8, but then
> for printing char* gdb should use local 8 bit encoding, which is likely to
> be *not* UTF8. Gdb can probably use some extra markers for values: like:
>
>    "foo"  for string in local 8-bit encoding
>    L"foo" for string in UTF8 encoding.
>
> It's also possible to use "\u" escapes.
>
> But then there's a problem:
>
>    - Do we assume that wchar_t is always UTF-16 or UTF-32?
>    - If not:
>      - how user can select this?
>      - how user-specified encoding will be handled

You can't hard-code assumptions about the character set into GDB.  Nor
can you hard-code the assumption that the host and target character
sets are the same.  GDB needs to do explicit conversions between the
two as needed, and handle mismatches in some reasonable way.

GDB already has the commands 'set host-charset' and 'set
target-charset', so you can assume that you have accurate information
about the character sets at hand.  They fall back to ASCII.

> > The best approach would be to extend charset.[ch] to handle wide
> > character sets as well, and then add code to the language-specific
> > printing routines to use the charset functions.  (This is fortunately
> > much simpler than adding support for multibyte characters.)
>
> So, for each wchar_t element language-specific code will call
> 'target_wchar_t_to_host', that will output specific representation of that
> wchar_t. Hmm, the interface there seems to assume there's 1<->1 mapping
> between target and host characters.  This makes L"UTF8" format and ascii
> string with \u escapes format impossible, it seems.

Not at all.  The current character and string printing code uses those
routines, and it handles unprintable and invalid characters just fine.
See, for example, host_print_char_literally, and
c_target_char_has_backslash_escape.

GDB tries to print characters and strings as they would appear in
source code.  C doesn't assume that the source and execution character
sets are the same; by using numeric escapes, you can write programs
for any execution character set in any source character set.  You just
need enough information to manage the overlap.
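
To make the numeric-escape point concrete, here is a small sketch.  It
assumes, purely for illustration, that the execution character set is
ISO-8859-1, where e-acute is byte 0xE9 (octal 351).  The names greeting
and greeting_byte are made up for this example.

```c
#include <stddef.h>

/* "caf\351" spells the last character by its numeric value in the
   execution character set (assumed here to be ISO-8859-1, where 0xE9
   is e-acute), so the source file itself can stay pure ASCII.  */
static const char greeting[] = "caf\351";

/* Return the byte at INDEX as an unsigned value, or -1 if INDEX is
   past the end of the array (the terminating NUL counts as in range).  */
int
greeting_byte (size_t index)
{
  if (index >= sizeof greeting)
    return -1;
  return (unsigned char) greeting[index];
}
```

The compiler stores the byte 0xE9 no matter what encoding the source
file was written in, which is why the escaped form is safe to show on
any host.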

As far as 1-to-1 mappings are concerned, the only necessary property
is that host_char_to_target and target_char_to_host be inverses, and
return zero for characters that can't make a round trip.  The existing
string-printing code will automatically use numeric escapes for
characters that target_char_to_host won't translate.
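
A rough sketch of that property, not GDB's actual code: every sketch_*
name below is a hypothetical stand-in, and it assumes an ASCII host and
an ISO-8859-1 target.  The two conversion routines are inverses on the
range they share and return zero elsewhere; the formatter falls back to
a three-digit octal escape whenever the conversion fails or the result
is not printable.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins, not GDB's real routines.  Assumed host
   charset: ASCII; assumed target charset: ISO-8859-1.  Each returns
   nonzero on success, zero if the character can't make a round trip.  */
int
sketch_target_char_to_host (int target_char, int *host_char)
{
  if (target_char < 0 || target_char > 0x7f)
    return 0;
  *host_char = target_char;
  return 1;
}

int
sketch_host_char_to_target (int host_char, int *target_char)
{
  if (host_char < 0 || host_char > 0x7f)
    return 0;
  *target_char = host_char;
  return 1;
}

/* Write the LEN target bytes at STR into OUT the way they could be
   spelled inside a C string literal on the host.  Returns the number
   of characters written (not counting the NUL), or -1 if OUT is too
   small.  */
int
sketch_format_string (const unsigned char *str, size_t len,
                      char *out, size_t out_size)
{
  size_t used = 0;
  size_t i;

  if (out_size == 0)
    return -1;
  out[0] = '\0';

  for (i = 0; i < len; i++)
    {
      char piece[5];		/* Longest piece is "\ooo" plus NUL.  */
      int host_char;
      int n;

      if (sketch_target_char_to_host (str[i], &host_char)
	  && host_char >= 0x20 && host_char <= 0x7e)
	{
	  /* Translatable and printable: emit it, quoting '"' and '\'.  */
	  if (host_char == '"' || host_char == '\\')
	    n = snprintf (piece, sizeof piece, "\\%c", host_char);
	  else
	    n = snprintf (piece, sizeof piece, "%c", host_char);
	}
      else
	/* No host equivalent, or unprintable: use a numeric escape.  */
	n = snprintf (piece, sizeof piece, "\\%03o", (unsigned) str[i]);

      if (used + (size_t) n + 1 > out_size)
	return -1;
      memcpy (out + used, piece, (size_t) n + 1);
      used += (size_t) n;
    }
  return (int) used;
}
```

Given the four target bytes 'c', 'a', 'f', 0xE9, sketch_format_string
produces the seven characters caf\351: the first three translate
directly, and the last one, which the ASCII host can't represent, comes
out as a numeric escape.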