From: Vladimir Prus
To: "Jim Blandy"
Cc: gdb@sources.redhat.com
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 08:30:00 -0000
Message-Id: <200604141157.15185.ghost@cs.msu.su>
In-Reply-To: <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>

On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right? So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on frontend
> > side.
> > The problem is that I don't see any way how gdb can print wchar_t
> > in a way that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use local 8 bit encoding, which is
> > likely to be *not* UTF8. Gdb can probably use some extra markers for
> > values: like:
> >
> >     "foo"   for string in local 8-bit encoding
> >     L"foo"  for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> > - Do we assume that wchar_t is always UTF-16 or UTF-32?
> > - If not:
> >    - how user can select this?
> >    - how user-specified encoding will be handled
>
> You can't hard-code assumptions about the character set into GDB. Nor
> can you hard-code the assumption that the host and target character
> sets are the same. GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand. They fall back to ASCII.

Good, but you need to separately set host-charset for char* and for
wchar_t*. The first can be KOI8-R and the second can be UTF-32 in the same
program at the same time.

> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions. (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > For, for each wchar_t element language-specific code will call
> > 'target_wchar_t_to_host', that will output specific representation of
> > that wchar_t. Hmm, the interface there seem to assume theres 1<->1
> > mapping between target and host characters. This makes L"UTF8" format
> > and ascii string with \u escapes format impossible, It seems.
>
> Not at all.
> The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
> See, for example, host_print_char_literally, and
> c_target_char_has_backslash_escape.

Can this code output using UTF8 encoding? Consider this code from c-lang.c:

    static void
    c_emit_char (int c, struct ui_file *stream, int quoter)
    {
      const char *escape;
      int host_char;

      c &= 0xFF;                /* Avoid sign bit follies */

      escape = c_target_char_has_backslash_escape (c);
      if (escape)
        {
          if (quoter == '"' && strcmp (escape, "0") == 0)
            /* Print nulls embedded in double quoted strings as \000 to
               prevent ambiguity.  */
            fprintf_filtered (stream, "\\000");
          else
            fprintf_filtered (stream, "\\%s", escape);
        }
      else if (target_char_to_host (c, &host_char)
               && host_char_print_literally (host_char))
        {
          if (host_char == '\\' || host_char == quoter)
            fputs_filtered ("\\", stream);
          fprintf_filtered (stream, "%c", host_char);
        }
      else
        fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
    }

With UTF8 host encoding, we'd want up to 6 host bytes to be output for a
single wchar_t, without escaping. The size of 'host_char' here is 4 bytes,
so there's no way for 'target_char_to_host' to produce 6 characters.

> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip. The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.

So, assuming numeric escapes are fine with me, I'd need to:

1. Add a way to specify the encoding of wchar_t* values.
2. Write a version of c_printstr that will handle wchar_t*. The current
   version just accesses the i-th element of the string, so it won't work
   with UTF-16.
3. Make the new c_printstr call LA_PRINT_CHAR just like c_printstr does,
   and that will handle escapes automatically.
4. Make sure the new version of c_printstr is invoked for wchar_t* values.
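
To make point 2 and the UTF8 size concern concrete, here is a minimal
standalone sketch in C. It is not GDB code: utf16_next, utf8_encode and
utf16_to_utf8_length are hypothetical names invented for illustration, and
the sketch assumes the wchar_t data is UTF-16 already read into host memory
in host byte order. It shows why a wide string has to be walked by code
point rather than by index, and that one code point can need several UTF8
bytes. Note that the 6-byte limit comes from the original UTF8 definition;
RFC 3629 restricts UTF8 to U+10FFFF, so 4 bytes is the most a code point can
take today. Either way it is more than one host character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of GDB: decode one code point from the
   UTF-16 buffer S of LEN units, starting at *POS and advancing *POS.
   An unpaired surrogate is returned unchanged so that the caller can
   decide how to escape it.  */
static uint32_t
utf16_next (const uint16_t *s, size_t len, size_t *pos)
{
  uint32_t u = s[(*pos)++];

  if (u >= 0xD800 && u <= 0xDBFF && *pos < len)
    {
      uint32_t lo = s[*pos];

      if (lo >= 0xDC00 && lo <= 0xDFFF)
        {
          (*pos)++;
          return 0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
  return u;
}

/* Hypothetical helper: encode CP as UTF-8 into OUT, which must have room
   for 4 bytes.  Returns the number of bytes written.  */
static int
utf8_encode (uint32_t cp, unsigned char *out)
{
  assert (cp <= 0x10FFFF);

  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  out[0] = (unsigned char) (0xF0 | (cp >> 18));
  out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
  out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
  out[3] = (unsigned char) (0x80 | (cp & 0x3F));
  return 4;
}

/* Walk a UTF-16 string by code point and return how many UTF-8 bytes it
   would occupy.  This sketch counts an unpaired surrogate as 3 bytes; a
   real printer would emit a numeric escape for it instead.  */
static size_t
utf16_to_utf8_length (const uint16_t *s, size_t len)
{
  size_t pos = 0, total = 0;
  unsigned char buf[4];

  while (pos < len)
    total += (size_t) utf8_encode (utf16_next (s, len, &pos), buf);
  return total;
}
```

For the four UTF-16 units 0x0041, 0x0416, 0xD834, 0xDD1E, the walk yields
three code points (U+0041, U+0416, U+1D11E) that need 1, 2 and 4 UTF8 bytes,
so neither "element i of the array" nor "one int per character" holds.
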

Is that about right?

- Volodya