From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-24912-listarch-gdb=sources.redhat.com@sourceware.org>
Received: (qmail 18338 invoked by alias); 14 Apr 2006 17:53:50 -0000
Received: (qmail 18325 invoked by uid 22791); 14 Apr 2006 17:53:48 -0000
X-Spam-Check-By: sourceware.org
Received: from xproxy.gmail.com (HELO xproxy.gmail.com) (66.249.82.193)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 17:53:46 +0000
Received: by xproxy.gmail.com with SMTP id h29so67225wxd         for <gdb@sources.redhat.com>; Fri, 14 Apr 2006 10:53:44 -0700 (PDT)
Received: by 10.70.59.18 with SMTP id h18mr2354164wxa;         Fri, 14 Apr 2006 10:53:44 -0700 (PDT)
Received: by 10.70.125.5 with HTTP; Fri, 14 Apr 2006 10:53:44 -0700 (PDT)
Message-ID: <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com>
Date: Fri, 14 Apr 2006 18:03:00 -0000
From: "Jim Blandy" <jimb@red-bean.com>
To: "Eli Zaretskii" <eliz@gnu.org>
Subject: Re: printing wchar_t*
Cc: "Vladimir Prus" <ghost@cs.msu.su>, gdb@sources.redhat.com
In-Reply-To: <uirpc19u8.fsf@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <e1lsqg$aml$1@sea.gmane.org> <200604141257.41690.ghost@cs.msu.su> 	 <uu08w1cnf.fsf@gnu.org> <200604141837.26618.ghost@cs.msu.su> 	 <uirpc19u8.fsf@gnu.org>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb/>
List-Post: <mailto:gdb@sourceware.org>
List-Help: <mailto:gdb-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00200.txt.bz2

I think folks are seeing difficult problems where there aren't any.
Even if the host character set (that is, the character set GDB is
using to communicate with its user, or in its MI communications) is
plain, old ASCII, GDB can, without any loss of information, convey the
contents of a wide string using an arbitrary target character set via
MI to a GUI, using code the GUI must already have.

Suppose we have a wide string where wchar_t values are Unicode code
points.  Suppose our host character set is plain ASCII.  Suppose the
user's program has a string containing the digits '123', followed by
some funky Tibetan characters U+0F04 U+0FCC, followed by the letters
'xyz'.  When asked to print that string, GDB should print the
following twenty-one ASCII characters:

L"123\x0f04\x0fccxyz"

Since this is a valid way to write that string in a source program, a
user at the GDB command line should understand it.  Since consumers of
MI information must contain parsers for C values already, they can
reliably find the contents of the string.
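
As a rough sketch of the formatting side (Python for brevity; this is
not GDB's actual code, and format_wide_literal is a made-up name), the
function below turns a sequence of wchar_t values into an ASCII-only C
wide string literal.  It assumes the values are non-negative and that
printable ASCII values in the target string denote the same characters
as in ASCII.  One wrinkle: a C hex escape swallows every hex digit that
follows it, so a hex-digit character right after an escape is escaped
too.

```python
import string


def format_wide_literal(code_points):
    """Render wchar_t values as an ASCII-only C wide string literal.

    Hypothetical helper for illustration; not part of GDB.
    """
    out = []
    after_hex_escape = False
    for cp in code_points:
        printable = 0x20 <= cp <= 0x7E
        ch = chr(cp) if printable else None
        if printable and ch in '"\\':
            # Quote and backslash need a backslash in C source.
            out.append("\\" + ch)
            after_hex_escape = False
        elif printable and not (after_hex_escape and ch in string.hexdigits):
            out.append(ch)
            after_hex_escape = False
        else:
            # Non-ASCII value, control character, or a hex digit that
            # would otherwise be absorbed into the preceding \x escape.
            out.append("\\x%04x" % cp)
            after_hex_escape = True
    return 'L"' + "".join(out) + '"'


example = format_wide_literal(
    [0x31, 0x32, 0x33, 0x0F04, 0x0FCC, 0x78, 0x79, 0x7A])
print(example)
```

For the '123' / Tibetan / 'xyz' string above, this prints the same
twenty-one characters.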

Note that this gets a GUI the contents of the string in the *target*
character set.  The GUI itself should be responsible for converting
target characters to whatever character set it wants to use to present
data to its user.  Here, GDB's 'host' character set is just the
character set used to carry information from GDB to the GUI; it should
probably be set to ASCII, just to avoid needless variation.  But
either way, it's just acting as a medium for values in C source code
syntax, and has no bearing on either the character set the target
program is using, or the character set the GUI will use to present
data to its user.
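
On the GUI side, a minimal sketch of the decoding step might look like
the following (again Python, with made-up names parse_wide_literal and
to_display).  It handles only plain characters, \x escapes, \" and \\,
which is less than a real C value parser accepts.  It assumes the
literal arrives in ASCII, and, for the display step only, that the
target's wide characters are Unicode code points; a GUI facing a
different target character set would substitute its own conversion
there.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_wide_literal(text):
    """Return the target code values held in an L"..." literal.

    Hypothetical helper; supports only \\x, \\" and \\\\ escapes.
    """
    if len(text) < 3 or not text.startswith('L"') or not text.endswith('"'):
        raise ValueError("not a wide string literal")
    body = text[2:-1]
    code_points = []
    i = 0
    while i < len(body):
        if body[i] != "\\":
            # Plain ASCII character: its code is the target value.
            code_points.append(ord(body[i]))
            i += 1
        elif body.startswith("\\x", i):
            # A hex escape runs for as many hex digits as follow it.
            j = i + 2
            while j < len(body) and body[j] in HEX_DIGITS:
                j += 1
            if j == i + 2:
                raise ValueError("\\x with no hex digits")
            code_points.append(int(body[i + 2:j], 16))
            i = j
        elif i + 1 < len(body) and body[i + 1] in '"\\':
            code_points.append(ord(body[i + 1]))
            i += 2
        else:
            raise ValueError("unsupported escape in this sketch")
    return code_points


def to_display(code_points):
    # Assumption: target wchar_t values are Unicode code points.
    return "".join(chr(cp) for cp in code_points)


values = parse_wide_literal('L"123\\x0f04\\x0fccxyz"')
shown = to_display(values)
```

The point of splitting it this way is that parse_wide_literal knows
nothing about character sets beyond ASCII; all knowledge of the target
character set lives in to_display, which belongs to the GUI.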

Unicode Technical Report #17 lays out the terminology the Unicode
folks use for all this stuff, with good explanations:
http://www.unicode.org/reports/tr17/

According to the ISO C standard, each member of the basic character
set must have the same code value as a wchar_t that it has as a char,
so the coded character set used by wchar_t agrees with the one used by
char on the basic character set.  See ISO/IEC 9899:1999 (E) section
7.17, paragraph 2.  So I think it's sufficient for the user to specify
the coded character set used by wide characters; that fixes the CCS
used for char values, at least for the basic character set.