From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (qmail 12356 invoked by alias); 14 Apr 2006 07:29:42 -0000
Received: (qmail 12347 invoked by uid 22791); 14 Apr 2006 07:29:41 -0000
X-Spam-Check-By: sourceware.org
Received: from xproxy.gmail.com (HELO xproxy.gmail.com) (66.249.82.200) by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 07:29:39 +0000
Received: by xproxy.gmail.com with SMTP id h29so9623wxd for ; Fri, 14 Apr 2006 00:29:37 -0700 (PDT)
Received: by 10.70.92.11 with SMTP id p11mr1749000wxb; Fri, 14 Apr 2006 00:29:37 -0700 (PDT)
Received: by 10.70.125.5 with HTTP; Fri, 14 Apr 2006 00:29:37 -0700 (PDT)
Message-ID: <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>
Date: Fri, 14 Apr 2006 08:07:00 -0000
From: "Jim Blandy"
To: "Vladimir Prus"
Subject: Re: printing wchar_t*
Cc: gdb@sources.redhat.com
In-Reply-To:
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline
References: <8f2776cb0604131031g370d6fa9p9361421bd21d178@mail.gmail.com>
X-IsSubscribed: yes
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe:
List-Archive:
List-Post:
List-Help: ,
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00169.txt.bz2

On 4/13/06, Vladimir Prus wrote:
> Jim Blandy wrote:
>
> > On 4/13/06, Vladimir Prus wrote:
> >> I have a user-defined command that can produce the output I want, but is
> >> defining a custom command the right approach?
> >
> > Well, you'd like wide strings to be printed properly when they appear
> > in structures, as arguments to functions, and so on, right? So a
> > user-defined command isn't ideal.
>
> I think I'll still need to do some processing for wchar_t* on frontend side.
> The problem is that I don't see any way how gdb can print wchar_t in a way
> that does not require post-processing.
> It can print it as UTF8, but then
> for printing char* gdb should use local 8 bit encoding, which is likely to
> be *not* UTF8. Gdb can probably use some extra markers for values, like:
>
>     "foo" for string in local 8-bit encoding
>     L"foo" for string in UTF8 encoding.
>
> It's also possible to use "\u" escapes.
>
> But then there's a problem:
>
>    - Do we assume that wchar_t is always UTF-16 or UTF-32?
>    - If not:
>         - how can the user select this?
>         - how will the user-specified encoding be handled?

You can't hard-code assumptions about the character set into GDB. Nor
can you hard-code the assumption that the host and target character
sets are the same. GDB needs to do explicit conversions between the
two as needed, and handle mismatches in some reasonable way.

GDB already has the commands 'set host-charset' and 'set
target-charset', so you can assume that you have accurate information
about the character sets at hand. They fall back to ASCII.

> > The best approach would be to extend charset.[ch] to handle wide
> > character sets as well, and then add code to the language-specific
> > printing routines to use the charset functions. (This is fortunately
> > much simpler than adding support for multibyte characters.)
>
> So, for each wchar_t element language-specific code will call
> 'target_wchar_t_to_host', that will output specific representation of that
> wchar_t. Hmm, the interface there seems to assume there's a 1<->1 mapping
> between target and host characters. This makes L"UTF8" format and ascii
> string with \u escapes format impossible, it seems.

Not at all. The current character and string printing code uses those
routines, and it handles unprintable and invalid characters just fine.
See, for example, host_print_char_literally and
c_target_char_has_backslash_escape.

GDB tries to print characters and strings as they would appear in
source code.
C doesn't assume that the source and execution character sets are the
same; by using numeric escapes, you can write programs for any
execution character set in any source character set. You just need
enough information to manage the overlap.

As far as 1-to-1 mappings are concerned, the only necessary property
is that host_char_to_target and target_char_to_host be inverses, and
return zero for characters that can't make a round trip. The existing
string-printing code will automatically use numeric escapes for
characters that target_char_to_host won't translate.
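
To make the fallback concrete, here is a minimal sketch in C. It is
not GDB's code: the sketch_ functions are hypothetical stand-ins that
only follow the convention described above (return zero when a
character can't make the round trip), and the toy charset, printable
ASCII on both host and target, is an assumption chosen to keep the
example short.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in, NOT GDB's implementation.  Toy assumption:
   only printable ASCII exists in both charsets.  Returns non-zero and
   stores the host character on success, zero if there is no
   translation.  */
static int sketch_target_char_to_host(int target_char, int *host_char)
{
    if (target_char >= 0x20 && target_char <= 0x7e) {
        *host_char = target_char;
        return 1;
    }
    return 0;
}

/* Inverse of the function above, under the same toy assumption.  */
static int sketch_host_char_to_target(int host_char, int *target_char)
{
    if (host_char >= 0x20 && host_char <= 0x7e) {
        *target_char = host_char;
        return 1;
    }
    return 0;
}

/* Append PIECE to OUT if it fits (including the terminating NUL);
   pieces that do not fit are dropped.  Returns the new length.  */
static size_t sketch_append(char *out, size_t out_size, size_t used,
                            const char *piece)
{
    size_t n = strlen(piece);
    if (used + n + 1 <= out_size) {
        memcpy(out + used, piece, n + 1);
        used += n;
    }
    return used;
}

/* Format LEN target characters as a C string literal in OUT.
   Characters that translate are emitted directly (with a backslash
   before '"' and '\\'); characters that do not translate become
   three-digit octal escapes, so a following digit is never absorbed
   into the escape.  */
static void sketch_format_string(const unsigned char *target, size_t len,
                                 char *out, size_t out_size)
{
    size_t used = 0;
    size_t i;
    char piece[8];

    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    used = sketch_append(out, out_size, used, "\"");
    for (i = 0; i < len; i++) {
        int host_char;
        if (sketch_target_char_to_host(target[i], &host_char)) {
            if (host_char == '"' || host_char == '\\')
                snprintf(piece, sizeof piece, "\\%c", host_char);
            else
                snprintf(piece, sizeof piece, "%c", host_char);
        } else {
            /* No host equivalent: fall back to a numeric escape.  */
            snprintf(piece, sizeof piece, "\\%03o", (unsigned int) target[i]);
        }
        used = sketch_append(out, out_size, used, piece);
    }
    sketch_append(out, out_size, used, "\"");
}
```

Calling sketch_format_string on the four target bytes 'a', '"', 0xe9,
'1' with a large enough buffer yields "a\"\3511": the byte 0xe9 has no
host equivalent in the toy charset, so it comes out as the octal
escape \351, and the output is still valid C source on the host.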