From: Vladimir Prus <ghost@cs.msu.su>
To: "Jim Blandy" <jimb@red-bean.com>
Cc: gdb@sources.redhat.com
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 08:30:00 -0000
Message-ID: <200604141157.15185.ghost@cs.msu.su>
In-Reply-To: <8f2776cb0604140029r44decd6atfa728aad53cb596d@mail.gmail.com>
On Friday 14 April 2006 11:29, Jim Blandy wrote:
> On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > Jim Blandy wrote:
> > > On 4/13/06, Vladimir Prus <ghost@cs.msu.su> wrote:
> > >> I have a user-defined command that can produce the output I want, but
> > >> is defining a custom command the right approach?
> > >
> > > Well, you'd like wide strings to be printed properly when they appear
> > > in structures, as arguments to functions, and so on, right? So a
> > > user-defined command isn't ideal.
> >
> > I think I'll still need to do some processing for wchar_t* on the
> > frontend side. The problem is that I don't see how gdb can print wchar_t
> > in a way that does not require post-processing. It can print it as UTF8,
> > but then for printing char* gdb should use the local 8-bit encoding, which
> > is likely to be *not* UTF8. Gdb can probably use some extra markers for
> > values, like:
> >
> > "foo" for string in local 8-bit encoding
> > L"foo" for string in UTF8 encoding.
> >
> > It's also possible to use "\u" escapes.
> >
> > But then there's a problem:
> >
> > - Do we assume that wchar_t is always UTF-16 or UTF-32?
> > - If not:
> >     - how can the user select this?
> >     - how will the user-specified encoding be handled?
>
> You can't hard-code assumptions about the character set into GDB. Nor
> can you hard-code the assumption that the host and target character
> sets are the same. GDB needs to do explicit conversions between the
> two as needed, and handle mismatches in some reasonable way.
>
> GDB already has the commands 'set host-charset' and 'set
> target-charset', so you can assume that you have accurate information
> about the character sets at hand. They fall back to ASCII.

Good, but the target charset needs to be set separately for char* and for
wchar_t*. The first can be KOI8-R and the second can be UTF-32 in the same
program at the same time.

> > > The best approach would be to extend charset.[ch] to handle wide
> > > character sets as well, and then add code to the language-specific
> > > printing routines to use the charset functions. (This is fortunately
> > > much simpler than adding support for multibyte characters.)
> >
> > So, for each wchar_t element, language-specific code will call
> > 'target_wchar_t_to_host', which will output a specific representation of
> > that wchar_t. Hmm, the interface there seems to assume there's a 1<->1
> > mapping between target and host characters. This makes the L"UTF8" format
> > and the ASCII string with \u escapes format impossible, it seems.
>
> Not at all. The current character and string printing code uses those
> routines, and it handles unprintable and invalid characters just fine.
> See, for example, host_char_print_literally, and
> c_target_char_has_backslash_escape.

Can this code produce output in UTF-8 encoding? Consider this code from
c-lang.c:

static void
c_emit_char (int c, struct ui_file *stream, int quoter)
{
  const char *escape;
  int host_char;
  c &= 0xFF; /* Avoid sign bit follies */
  escape = c_target_char_has_backslash_escape (c);
  if (escape)
    {
      if (quoter == '"' && strcmp (escape, "0") == 0)
        /* Print nulls embedded in double quoted strings as \000 to
           prevent ambiguity. */
        fprintf_filtered (stream, "\\000");
      else
        fprintf_filtered (stream, "\\%s", escape);
    }
  else if (target_char_to_host (c, &host_char)
           && host_char_print_literally (host_char))
    {
      if (host_char == '\\' || host_char == quoter)
        fputs_filtered ("\\", stream);
      fprintf_filtered (stream, "%c", host_char);
    }
  else
    fprintf_filtered (stream, "\\%.3o", (unsigned int) c);
}

With a UTF-8 host encoding, we'd want up to 6 host bytes to be output for a
single wchar_t, without escaping. But 'host_char' here is a single 4-byte
int, so there's no way for 'target_char_to_host' to return 6 bytes.

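
To make the size argument concrete, here is a minimal sketch of a UTF-8
encoder for a single code point. The name 'utf8_encode' is hypothetical and
is not part of GDB. Note one assumption: the sketch follows the current
UTF-8 definition (RFC 3629), which caps a sequence at 4 bytes; the 6-byte
figure comes from the original, wider definition. Either way, one wide
character can expand to several host bytes, which a single 'int host_char'
printed with "%c" cannot carry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a GDB function.  Encode the code point CP
   as UTF-8 into OUT, which must have room for at least 4 bytes.
   Return the number of bytes written, or 0 if CP is a surrogate or
   lies beyond U+10FFFF.  */
size_t
utf8_encode (uint32_t cp, unsigned char *out)
{
  if (cp < 0x80)
    {
      out[0] = (unsigned char) cp;
      return 1;
    }
  if (cp < 0x800)
    {
      out[0] = (unsigned char) (0xC0 | (cp >> 6));
      out[1] = (unsigned char) (0x80 | (cp & 0x3F));
      return 2;
    }
  if (cp < 0x10000)
    {
      if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;               /* Surrogates are not characters.  */
      out[0] = (unsigned char) (0xE0 | (cp >> 12));
      out[1] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[2] = (unsigned char) (0x80 | (cp & 0x3F));
      return 3;
    }
  if (cp <= 0x10FFFF)
    {
      out[0] = (unsigned char) (0xF0 | (cp >> 18));
      out[1] = (unsigned char) (0x80 | ((cp >> 12) & 0x3F));
      out[2] = (unsigned char) (0x80 | ((cp >> 6) & 0x3F));
      out[3] = (unsigned char) (0x80 | (cp & 0x3F));
      return 4;
    }
  return 0;
}
```

For example, the Cyrillic letter U+0416 becomes the two bytes 0xD0 0x96, so
a conversion routine for wide characters would need to return a byte
sequence plus a length rather than a single character.
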
> As far as 1-to-1 mappings are concerned, the only necessary property
> is that host_char_to_target and target_char_to_host be inverses, and
> return zero for characters that can't make a round trip. The existing
> string-printing code will automatically use numeric escapes for
> characters that target_char_to_host won't translate.

So, assuming numeric escapes are fine with me, I'd need to:

1. Add a way to specify the encoding of wchar_t* values.
2. Write a version of c_printstr that handles wchar_t*. The current
   version just accesses the i-th element of the string, so it won't
   work with UTF-16.
3. Make the new c_printstr call LA_PRINT_CHAR, just as the existing
   c_printstr does, so that escapes are handled automatically.
4. Make sure the new version of c_printstr is invoked for wchar_t* values.

Is that about right?

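
As a rough illustration of steps 2 and 3, here is a standalone sketch, not
GDB code. It assumes the target string is UTF-16 already in host byte order,
and it uses \u and \U escapes as the numeric escape form. The names
'utf16_next' and 'format_utf16' are hypothetical. It shows why indexing the
i-th element is not enough: a surrogate pair takes two elements but is
printed as one character.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper.  Decode one code point from the UTF-16 units
   S[*I] .. S[LEN - 1] and advance *I past the units consumed.  An
   unpaired surrogate is returned as its own value, so the caller
   still shows it as an escape.  */
static uint32_t
utf16_next (const uint16_t *s, size_t len, size_t *i)
{
  uint32_t hi = s[(*i)++];

  if (hi >= 0xD800 && hi <= 0xDBFF
      && *i < len && s[*i] >= 0xDC00 && s[*i] <= 0xDFFF)
    {
      uint32_t lo = s[(*i)++];
      return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    }
  return hi;
}

/* Hypothetical helper.  Format S as L"..." into OUT (OUTSZ bytes).
   Printable ASCII is copied literally, quotes and backslashes are
   backslash-escaped, everything else becomes \uXXXX or \UXXXXXXXX.
   Return 0 on success, -1 if OUT is too small.  */
int
format_utf16 (const uint16_t *s, size_t len, char *out, size_t outsz)
{
  size_t i = 0, used = 2;
  char piece[16];

  if (outsz < 4)                /* Room for L, two quotes and NUL.  */
    return -1;
  out[0] = 'L';
  out[1] = '"';
  while (i < len)
    {
      uint32_t cp = utf16_next (s, len, &i);
      size_t plen;

      if (cp == '"' || cp == '\\')
        snprintf (piece, sizeof piece, "\\%c", (int) cp);
      else if (cp >= 0x20 && cp < 0x7F)
        snprintf (piece, sizeof piece, "%c", (int) cp);
      else if (cp <= 0xFFFF)
        snprintf (piece, sizeof piece, "\\u%04lX", (unsigned long) cp);
      else
        snprintf (piece, sizeof piece, "\\U%08lX", (unsigned long) cp);

      plen = strlen (piece);
      if (used + plen + 2 > outsz)      /* Keep room for quote and NUL.  */
        return -1;
      memcpy (out + used, piece, plen);
      used += plen;
    }
  out[used++] = '"';
  out[used] = '\0';
  return 0;
}
```

In GDB itself the per-character decision would presumably stay in the
LA_PRINT_CHAR path, and the element width and encoding would come from
whatever setting step 1 introduces; both are fixed here only to keep the
sketch self-contained.
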
- Volodya