From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 18338 invoked by alias); 14 Apr 2006 17:53:50 -0000
Received: (qmail 18325 invoked by uid 22791); 14 Apr 2006 17:53:48 -0000
X-Spam-Check-By: sourceware.org
Received: from xproxy.gmail.com (HELO xproxy.gmail.com) (66.249.82.193) by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 17:53:46 +0000
Received: by xproxy.gmail.com with SMTP id h29so67225wxd for ; Fri, 14 Apr 2006 10:53:44 -0700 (PDT)
Received: by 10.70.59.18 with SMTP id h18mr2354164wxa; Fri, 14 Apr 2006 10:53:44 -0700 (PDT)
Received: by 10.70.125.5 with HTTP; Fri, 14 Apr 2006 10:53:44 -0700 (PDT)
Message-ID: <8f2776cb0604141053v73e512e3o2d1c9086312316bd@mail.gmail.com>
Date: Fri, 14 Apr 2006 18:03:00 -0000
From: "Jim Blandy"
To: "Eli Zaretskii"
Subject: Re: printing wchar_t*
Cc: "Vladimir Prus" , gdb@sources.redhat.com
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <200604141257.41690.ghost@cs.msu.su> <200604141837.26618.ghost@cs.msu.su>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00200.txt.bz2

I think folks are seeing difficult problems where there aren't any.

Even if the host character set (that is, the character set GDB is using to communicate with its user, or in its MI communications) is plain, old ASCII, GDB can, without any loss of information, convey the contents of a wide string using an arbitrary target character set via MI to a GUI, using code the GUI must already have.

Suppose we have a wide string where wchar_t values are Unicode code points. Suppose our host character set is plain ASCII.
Suppose the user's program has a string containing the digits '123', followed by some funky Tibetan characters U+0F04 U+0FCC, followed by the letters 'xyz'. When asked to print that string, GDB should print the following twenty-one ASCII characters:

L"123\x0f04\x0fccxyz"

Since this is a valid way to write that string in a source program, a user at the GDB command line should understand it. Since consumers of MI information must contain parsers for C values already, they can reliably find the contents of the string.

Note that this gets a GUI the contents of the string in the *target* character set. The GUI itself should be responsible for converting target characters to whatever character set it wants to use to present data to its user.

Here, GDB's 'host' character set is just the character set used to carry information from GDB to the GUI; it should probably be set to ASCII, just to avoid needless variation. But either way, it's just acting as a medium for values in C source code syntax, and has no bearing on either the character set the target program is using, or the character set the GUI will use to present data to its user.

Unicode Technical Report #17 lays out the terminology the Unicode folks use for all this stuff, with good explanations:

http://www.unicode.org/reports/tr17/

According to the ISO C standard, the coded character set used by wchar_t must be a superset of that used by char for members of the basic character set. See ISO/IEC 9899:1999 (E) section 7.17, paragraph 2. So I think it's sufficient for the user to specify the coded character set used by wide characters; that fixes the CCS used for char values.
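
To make the round trip concrete, here is a minimal sketch in C of both halves: a producer that writes a wide string as an ASCII-only C wide string literal, and a consumer that parses it back into target code values. The names escape_wide and unescape_wide are hypothetical, made up for this illustration; they are not GDB or MI functions. The sketch assumes the target's wchar_t values fit in the host's wchar_t. It also covers one detail the example string does not exercise: a C hex escape consumes every hex digit that follows it, so a literal hex digit that comes right after an escape is written as an escape too. Quotes and backslashes are written as hex escapes as well, purely to keep the parser small.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Printable ASCII that can appear unescaped inside a C string literal. */
static int is_plain(wchar_t c)
{
    return c >= 0x20 && c <= 0x7e && c != L'"' && c != L'\\';
}

/* Producer side (hypothetical helper): write `ws` as an ASCII-only C wide
   string literal.  Returns 0 on success, -1 if `out` is too small. */
static int escape_wide(const wchar_t *ws, char *out, size_t cap)
{
    size_t n;
    int after_hex = 0;              /* was the previous element a \x escape? */
    int w = snprintf(out, cap, "L\"");

    if (w < 0 || (size_t)w >= cap)
        return -1;
    n = (size_t)w;

    for (; *ws != L'\0'; ws++) {
        wchar_t c = *ws;
        /* A hex digit directly after a \x escape would be absorbed into
           that escape, so it has to be escaped as well. */
        int esc = !is_plain(c) || (after_hex && isxdigit((unsigned char)c));

        if (esc)
            w = snprintf(out + n, cap - n, "\\x%04lx", (unsigned long)c);
        else
            w = snprintf(out + n, cap - n, "%c", (char)c);
        if (w < 0 || (size_t)w >= cap - n)
            return -1;
        n += (size_t)w;
        after_hex = esc;
    }

    w = snprintf(out + n, cap - n, "\"");
    return (w < 0 || (size_t)w >= cap - n) ? -1 : 0;
}

/* Consumer side (hypothetical helper): parse the subset of C literal syntax
   that escape_wide emits.  Returns the number of wide characters stored,
   not counting the terminator, or -1 on malformed input or overflow. */
static long unescape_wide(const char *s, wchar_t *out, size_t cap)
{
    size_t len = strlen(s);
    size_t n = 0;
    const char *p, *end;

    if (len < 3 || s[0] != 'L' || s[1] != '"' || s[len - 1] != '"')
        return -1;
    p = s + 2;
    end = s + len - 1;              /* points at the closing quote */

    while (p < end) {
        unsigned long v = 0;

        if (*p == '\\') {
            if (p[1] != 'x' || !isxdigit((unsigned char)p[2]))
                return -1;
            /* Greedy: take every hex digit, as a C compiler would. */
            for (p += 2; p < end && isxdigit((unsigned char)*p); p++) {
                int ch = tolower((unsigned char)*p);
                v = v * 16 + (unsigned long)(isdigit(ch) ? ch - '0'
                                                         : ch - 'a' + 10);
            }
        } else {
            v = (unsigned char)*p++;
        }
        if (n + 1 >= cap)
            return -1;
        out[n++] = (wchar_t)v;      /* value in the *target* character set */
    }
    out[n] = L'\0';
    return (long)n;
}
```

Given the eight wide characters '1', '2', '3', 0x0f04, 0x0fcc, 'x', 'y', 'z', escape_wide produces the twenty-one character string shown above, and unescape_wide recovers the same eight values from it. What to do with those values (for instance, converting them to the GUI's display encoding) is left to the consumer, as the message argues.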