Re: printing wchar_t* - Vladimir Prus

Mirror of the gdb mailing list
 help / color / mirror / Atom feed

From: Vladimir Prus <ghost@cs.msu.su>
To: "Jim Blandy" <jimb@red-bean.com>
Cc: gdb@sources.redhat.com
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 08:30:00 -0000	[thread overview]
Message-ID: <200604141157.15185.ghost@cs.msu.su> (raw)
In-Reply-To: <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>

On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right?  So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on frontend
> > side. The problem is that I don't see any way how gdb can print wchar_t
> > in a way that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use local 8 bit encoding, which is
> > likely to be *not* UTF8. Gdb can probably use some extra markers for
> > values: like:
> >
> >    "foo"  for string in local 8-bit encoding
> >    L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> >    - Do we assume that wchar_t is always UTF-16 or UTF-32?
> >    - If not:
> >      - how user can select this?
> >      - how user-specified encoding will be handled
>
> You can't hard-code assumptions about the character set into GDB.  Nor
> can you hard-code the assumption that the host and target character
> sets are the same.  GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand.  They fall back to ASCII.

Good, but you need to separately set host-charset for char* and for wchar_t*.
The first can be KOI8-R and the second can be UTF-32 in the same program at 
the same time.

> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions.  (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > For, for each wchar_t element language-specific code will call
> > 'target_wchar_t_to_host', that will output specific representation of
> > that wchar_t. Hmm, the interface there seem to assume theres 1<->1
> > mapping between target and host characters.  This makes L"UTF8" format
> > and ascii string with \u escapes format impossible, It seems.
>
> Not at all.  The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
>  See, for example, host_print_char_literally, and
> c_target_char_has_backslash_escape.

Can this code output using UTF8-encoding? Consider this code from c-lang.c:

  static void
  c_emit_char (int c, struct ui_file *stream, int quoter)
 {
  const char *escape;
  int host_char;

  c &= 0xFF;			/* Avoid sign bit follies */

  escape = c_target_char_has_backslash_escape (c);
  if (escape)
    {
      if (quoter == '"' && strcmp (escape, "0") == 0)
	/* Print nulls embedded in double quoted strings as \000 to
	   prevent ambiguity.  */
	fprintf_filtered (stream, "\\000");
      else
	fprintf_filtered (stream, "\\%s", escape);
    }
  else if (target_char_to_host (c, &host_char)
           && host_char_print_literally (host_char))
    {
      if (host_char == '\\' || host_char == quoter)
        fputs_filtered ("\\", stream);
      fprintf_filtered (stream, "%c", host_char);
    }
  else
    fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
 }

With UTF8 host encoding, we'd want up to 6 host bytes to be output for a 
single wchar_t, without escaping. The size of 'host_char' here is 4 bytes, so 
there's no way for 'target_char_to_host' to produce 6 characters. 

> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip.  The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.

So, assuming numeric escapes are fine with me, I'd need to:

  1. Add a way to specify encoding of wchar_t* values.
  2. Write a version of c_printstr that will handle wchar_t*. The current
     version just accesses i-th element of the string, so won't work with
     UTF-16.
  3. Make new c_printstr call LA_PRINT_CHAR just like c_printstr, and that
     will handle escapes automatically.
  4. Make sure new version of c_printstr is invoked for wchar_t* values.

Is that about right?

- Volodya

next prev parent reply	other threads:[~2006-04-14  7:58 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-04-13 17:07 Vladimir Prus
2006-04-13 17:25 ` Eli Zaretskii
2006-04-14  7:29   ` Vladimir Prus
2006-04-14  8:47     ` Eli Zaretskii
2006-04-14 12:47       ` Vladimir Prus
2006-04-14 13:05         ` Eli Zaretskii
2006-04-14 13:06           ` Vladimir Prus
2006-04-14 13:15             ` Robert Dewar
2006-04-14 13:17           ` Daniel Jacobowitz
2006-04-14 13:59             ` Robert Dewar
2006-04-14 14:37             ` Eli Zaretskii
2006-04-14 14:08       ` Paul Koning
2006-04-14 14:47         ` Eli Zaretskii
2006-04-14 15:00           ` Vladimir Prus
2006-04-14 17:53             ` Eli Zaretskii
2006-04-17  7:05               ` Vladimir Prus
2006-04-17  8:35                 ` Eli Zaretskii
2006-04-13 18:06 ` Jim Blandy
2006-04-13 21:18   ` Eli Zaretskii
2006-04-14  6:02     ` Jim Blandy
2006-04-14  8:43       ` Eli Zaretskii
2006-04-14  7:58   ` Vladimir Prus
2006-04-14  8:07     ` Jim Blandy
2006-04-14  8:30       ` Vladimir Prus [this message]
2006-04-14  8:57     ` Eli Zaretskii
2006-04-14 12:52       ` Vladimir Prus
2006-04-14 13:07         ` Daniel Jacobowitz
2006-04-14 14:23           ` Eli Zaretskii
2006-04-14 14:29             ` Daniel Jacobowitz
2006-04-14 14:53               ` Eli Zaretskii
2006-04-14 17:10                 ` Daniel Jacobowitz
2006-04-14 17:55               ` Jim Blandy
2006-04-14 18:27                 ` Eli Zaretskii
2006-04-14 18:30                   ` Jim Blandy
2006-04-14 19:19                     ` Eli Zaretskii
2006-04-14 14:16         ` Eli Zaretskii
2006-04-14 14:50           ` Vladimir Prus
2006-04-14 17:18             ` Eli Zaretskii
2006-04-14 18:03               ` Jim Blandy
2006-04-14 19:16                 ` Eli Zaretskii
2006-04-14 19:22                   ` Jim Blandy
2006-04-14 22:18                     ` Daniel Jacobowitz
2006-04-16 11:39                       ` Jim Blandy
2006-04-16 15:07                         ` Eli Zaretskii
2006-04-15  7:14                     ` Eli Zaretskii
2006-04-17  7:16                       ` Vladimir Prus
2006-04-17  8:58                         ` Eli Zaretskii
2006-04-17 10:35                           ` Vladimir Prus
2006-04-17 12:26                             ` Eli Zaretskii
2006-04-17 13:56                               ` Vladimir Prus
2006-04-18  5:31                                 ` Eli Zaretskii
2006-04-14 19:53                 ` Mark Kettenis

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=200604141157.15185.ghost@cs.msu.su \
    --to=ghost@cs.msu.su \
    --cc=gdb@sources.redhat.com \
    --cc=jimb@red-bean.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox