From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-24882-listarch-gdb=sources.redhat.com@sourceware.org>
Received: (qmail 22198 invoked by alias); 14 Apr 2006 07:58:00 -0000
Received: (qmail 22190 invoked by uid 22791); 14 Apr 2006 07:57:59 -0000
X-Spam-Check-By: sourceware.org
Received: from zigzag.lvk.cs.msu.su (HELO zigzag.lvk.cs.msu.su) (158.250.17.23)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 07:57:56 +0000
Received: from Debian-exim by zigzag.lvk.cs.msu.su with spam-scanned (Exim 4.50) 	id 1FUJBb-0002Xb-QD 	for gdb@sources.redhat.com; Fri, 14 Apr 2006 11:57:53 +0400
Received: from zigzag.lvk.cs.msu.su ([158.250.17.23]) 	by zigzag.lvk.cs.msu.su with esmtp (Exim 4.50) 	id 1FUJB6-0002Ro-Bk; Fri, 14 Apr 2006 11:57:16 +0400
From: Vladimir Prus <ghost@cs.msu.su>
To: "Jim Blandy" <jimb@red-bean.com>
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 08:30:00 -0000
User-Agent: KMail/1.7.2
Cc: gdb@sources.redhat.com
References: <e1lsqg$aml$1@sea.gmane.org> <e1necb$gen$1@sea.gmane.org> <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>
In-Reply-To: <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>
MIME-Version: 1.0
Content-Type: text/plain;   charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200604141157.15185.ghost@cs.msu.su>
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb/>
List-Post: <mailto:gdb@sourceware.org>
List-Help: <mailto:gdb-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00170.txt.bz2

On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right?  So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on frontend
> > side. The problem is that I don't see any way how gdb can print wchar_t
> > in a way that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use local 8 bit encoding, which is
> > likely to be *not* UTF8. Gdb can probably use some extra markers for
> > values: like:
> >
> >    "foo"  for string in local 8-bit encoding
> >    L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >    - If not:
> >      - how user can select this?
> >      - how user-specified encoding will be handled
>
> You can't hard-code assumptions about the character set into GDB.  Nor
> can you hard-code the assumption that the host and target character
> sets are the same.  GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand.  They fall back to ASCII.

Good, but you need to set the target charset separately for char* and for
wchar_t*. The first can be KOI8-R and the second UTF-32 in the same program
at the same time.
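
To make that concrete, here is a rough sketch of one word stored both ways.
It assumes the debugged program uses KOI8-R for char* and UTF-32 for
wchar_t*. The struct and the charset_for_element helper are hypothetical
names made up for illustration; they are not part of GDB.

```c
#include <stddef.h>
#include <stdint.h>

/* The Russian word "da" (Cyrillic de, a) as a char* string,
   assuming the narrow encoding is KOI8-R.  */
const unsigned char narrow_da[] = { 0xC4, 0xC1, 0x00 };

/* The same word as a wide string, assuming wchar_t holds UTF-32
   code points (uint32_t stands in for a 4-byte wchar_t).  */
const uint32_t wide_da[] = { 0x0434, 0x0430, 0x0000 };

/* Hypothetical settings: one charset per string kind, because a
   single setting cannot describe both arrays above.  */
struct target_charsets
{
  const char *narrow;   /* encoding of char* strings */
  const char *wide;     /* encoding of wchar_t* strings */
};

/* Pick the charset from the element size of the string being
   printed.  */
const char *
charset_for_element (const struct target_charsets *cs, size_t elem_size)
{
  return elem_size == 1 ? cs->narrow : cs->wide;
}
```

With settings of { "KOI8-R", "UTF-32" }, the first array would be decoded
as KOI8-R and the second as UTF-32, even though both hold the same text.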

> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions.  (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > So, for each wchar_t element language-specific code will call
> > 'target_wchar_t_to_host', that will output specific representation of
> > that wchar_t. Hmm, the interface there seems to assume there's a 1<->1
> > mapping between target and host characters.  This makes L"UTF8" format
> > and ascii string with \u escapes format impossible, it seems.
>
> Not at all.  The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
>  See, for example, host_print_char_literally, and
> c_target_char_has_backslash_escape.

Can this code produce output in UTF-8 encoding? Consider this code from
c-lang.c:

  static void
  c_emit_char (int c, struct ui_file *stream, int quoter)
  {
    const char *escape;
    int host_char;

    c &= 0xFF;  /* Avoid sign bit follies */

    escape = c_target_char_has_backslash_escape (c);
    if (escape)
      {
        if (quoter == '"' && strcmp (escape, "0") == 0)
          /* Print nulls embedded in double quoted strings as \000 to
             prevent ambiguity.  */
          fprintf_filtered (stream, "\\000");
        else
          fprintf_filtered (stream, "\\%s", escape);
      }
    else if (target_char_to_host (c, &host_char)
             && host_char_print_literally (host_char))
      {
        if (host_char == '\\' || host_char == quoter)
          fputs_filtered ("\\", stream);
        fprintf_filtered (stream, "%c", host_char);
      }
    else
      fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
  }

With a UTF-8 host encoding, we'd want up to 6 host bytes to be output for a
single wchar_t, without escaping. But 'host_char' here is a single int
(4 bytes), and it is printed with "%c" as one character, so there's no way
for 'target_char_to_host' to produce a 6-byte sequence.
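
For illustration, here is a minimal sketch of a UTF-8 encoder showing that
one code point expands to several host bytes. utf8_encode is a hypothetical
helper, not a GDB function. Note that it follows RFC 3629, which caps
sequences at 4 bytes; the 6-byte figure above comes from the original UTF-8
definition. Either way the result is a byte sequence, not one character.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode the code point CP as UTF-8 into BUF, which must have room
   for 4 bytes.  Return the number of bytes written, or 0 if CP is a
   surrogate or lies above U+10FFFF.  */
size_t
utf8_encode (uint32_t cp, unsigned char *buf)
{
  if (cp < 0x80)
    {
      buf[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      buf[0] = (unsigned char) (0xC0 | (cp >> 6));
      buf[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* Surrogates are not characters.  */
      buf[0] = (unsigned char) (0xE0 | (cp >> 12));
      buf[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp < 0x110000)
    {
      buf[0] = (unsigned char) (0xF0 | (cp >> 18));
      buf[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      buf[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      buf[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

Printing a wchar_t literally would therefore need an interface that returns
a buffer and a length, rather than a single int passed to "%c".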

> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip.  The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.

So, assuming numeric escapes are fine with me, I'd need to:

  1. Add a way to specify the encoding of wchar_t* values.
  2. Write a version of c_printstr that handles wchar_t*. The current
     version just accesses the i-th element of the string, so it won't
     work with UTF-16.
  3. Make the new c_printstr call LA_PRINT_CHAR, just as the current
     c_printstr does, so that escapes are handled automatically.
  4. Make sure the new version of c_printstr is invoked for wchar_t* values.

Is that about right?
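
To show what I mean in step 2, here is a rough sketch, assuming the wide
encoding is UTF-16. One character can occupy two 16-bit units (a surrogate
pair), so the loop has to advance by a variable amount instead of indexing
the i-th element. Printable ASCII is emitted literally and everything else
as a numeric escape. Both function names are hypothetical; this is not the
c_printstr interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Decode one code point from the LEN units at S (LEN >= 1).  Store it
   in *CP and return the number of units consumed (1 or 2).  An
   unpaired surrogate is passed through as-is, consuming one unit.  */
static size_t
utf16_decode (const uint16_t *s, size_t len, uint32_t *cp)
{
  uint16_t hi = s[0];

  if (hi >= 0xD800 && hi <= 0xDBFF && len > 1
      && s[1] >= 0xDC00 && s[1] <= 0xDFFF)
    {
      *cp = 0x10000 + (((uint32_t) (hi - 0xD800) << 10)
                       | (uint32_t) (s[1] - 0xDC00));
      return 2;
    }
  *cp = hi;
  return 1;
}

/* Format the UTF-16 string S of LEN units into OUT as ASCII with
   numeric escapes.  Return the number of characters written, not
   counting the terminating NUL; stop early if OUT is too small.  */
size_t
format_utf16 (const uint16_t *s, size_t len, char *out, size_t outsz)
{
  size_t i = 0, n = 0;

  if (outsz == 0)
    return 0;
  out[0] = '\0';
  while (i < len)
    {
      uint32_t cp;
      int w;

      i += utf16_decode (s + i, len - i, &cp);
      if (cp == '"' || cp == '\\')
        w = snprintf (out + n, outsz - n, "\\%c", (int) cp);
      else if (cp >= 0x20 && cp < 0x7F)
        w = snprintf (out + n, outsz - n, "%c", (int) cp);
      else if (cp <= 0xFFFF)
        w = snprintf (out + n, outsz - n, "\\u%04X", (unsigned int) cp);
      else
        w = snprintf (out + n, outsz - n, "\\U%08X", (unsigned int) cp);
      if (w < 0 || (size_t) w >= outsz - n)
        {
          out[n] = '\0';        /* Drop the partially written item.  */
          break;
        }
      n += (size_t) w;
    }
  return n;
}
```

A UTF-32 variant would be simpler, since each element is already a whole
code point and only the escaping part is needed.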

- Volodya