Hi,

This patch contains (at least the start of) support for printing
wchar_t strings from a debugged program within GDB. This is the subject
of GDB bugs 9103 (and its duplicates 9369, 9268) and maybe 7821.

Notes on the implementation:

1. I've added a new configuration variable, similar to "host-charset"
and "target-charset". The latter can't be used for printing wide
characters, because regular C strings and wide strings aren't
necessarily (or in fact ever) encoded using the same encoding. The new
variable is set like:

(gdb) set target-wide-charset UTF-32

I considered adding "set target-wide-charset auto" to detect the
charset used for wchar_t strings automatically (e.g. probably
4 bytes -> UCS-4, 2 bytes -> UTF-16), but that's not done presently.
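
For illustration only, the "auto" heuristic mentioned above could be
sketched as follows. This is not code from the patch:
guess_wide_charset is a hypothetical helper, and the size-to-charset
mapping is just the guess described above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the patch: guess a charset name for
   wchar_t strings from the size in bytes of the target's wchar_t.
   Returns NULL when there is no sensible guess.  */
const char *
guess_wide_charset (size_t wchar_size)
{
  switch (wchar_size)
    {
    case 4:
      return "UCS-4";
    case 2:
      return "UTF-16";
    default:
      return NULL;
    }
}
```

A real implementation would also need to pick the byte order, which
this sketch ignores.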

2. The host terminal may be able to print Unicode characters if it is
fed UTF-8 encoded characters. There are some limitations: I
don't think Unix terminals support combining character sequences --
I've ignored that for now. GDB currently defaults "host-charset" to
ISO-8859-1, although a given terminal may not print
top-bit-set characters correctly.

I've added a new way of setting the host character set from the
host's locale (using nl_langinfo (CODESET)), like so:

(gdb) set host-charset auto

If the terminal supports UTF-8 (e.g. LC_ALL is set to en_US.UTF-8), we
will then see:

(gdb) show host-charset
The host character set is "UTF-8" (auto).

If the terminal only supports ASCII (e.g. LC_ALL is set to C), we will
instead see:

(gdb) show host-charset
The host character set is "ANSI_X3.4-1968" (auto).
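
As a rough sketch of the lookup behind "auto" (an illustration, not
the code in the patch; host_codeset_from_locale is a hypothetical
name), the codeset can be queried like this:

```c
#include <langinfo.h>
#include <locale.h>

/* Illustrative only: return the codeset name of the locale selected by
   the environment (LC_ALL, LC_CTYPE, LANG).  */
const char *
host_codeset_from_locale (void)
{
  /* Adopt the user's character-type locale.  If that fails, the
     process stays in the "C" locale and nl_langinfo reports the
     codeset of that locale instead.  */
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}
```

Only LC_CTYPE is switched here, on the assumption that the rest of the
program should not see locale-dependent number formatting.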

3. Types which are literally called "wchar_t" are assumed to be wide
characters. So we can do:

wchar_t *msg = L"Hello world";

and then:

(gdb) p msg
$1 = (wchar_t *) 0x85c4 "Hello world"

If the message contains non-ASCII characters, and the user has typed
"set host-charset auto" on a UTF-8 capable terminal, they will be
printed nicely:

(gdb) p msg
$2 = (wchar_t *) 0x85c4 "Schöne Grüße"

The caveat is that there's no way for GDB to know if you have a font
with the right glyphs in it. If not, you can fall back to ASCII:

(gdb) set host-charset ASCII
(gdb) p msg
$3 = (wchar_t *) 0x85c4 "Sch\x00f6ne Gr\x00fc\x00dfe"
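
The ASCII fallback shown above can be approximated by a sketch like
the following. It is an assumption about the escaping rule, inferred
from the output format, rather than the code in the patch:
escape_to_ascii is a hypothetical helper, and it does not treat
backslashes or quotes specially.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: write the LEN code points in SRC to DST as
   ASCII.  Printable ASCII is copied; anything else becomes \xNNNN
   with at least four hex digits.  DST is always NUL-terminated when
   DST_SIZE is nonzero.  Returns the number of characters written,
   not counting the terminating NUL.  */
size_t
escape_to_ascii (const uint32_t *src, size_t len, char *dst, size_t dst_size)
{
  size_t used = 0;
  size_t i;

  if (dst_size == 0)
    return 0;
  dst[0] = '\0';

  for (i = 0; i < len; i++)
    {
      int n;

      if (src[i] >= 0x20 && src[i] <= 0x7e)
        n = snprintf (dst + used, dst_size - used, "%c", (int) src[i]);
      else
        n = snprintf (dst + used, dst_size - used, "\\x%04lx",
                      (unsigned long) src[i]);

      /* Stop at the last character that fit completely, discarding
         any partially written escape.  */
      if (n < 0 || (size_t) n >= dst_size - used)
        {
          dst[used] = '\0';
          break;
        }
      used += (size_t) n;
    }

  return used;
}
```

With the code points for "Schöne" as input, this writes Sch\x00f6ne.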

4. If you want to print an integer array type which isn't literally
called "wchar_t" but nevertheless contains a wchar_t string, you can
override using "/s", just like with regular strings, e.g.:

(gdb) p/s intmsg
$2 = (int *) 0x85c4 "Schöne Grüße"

5. The existing string-printing code is careful about not printing out
lots of repeating characters. For wchar_t strings (taking into account
the differences between what they represent on various platforms
mentioned above), each character generally maps X input bytes to Y
output bytes, where X and Y can vary: to detect repeats, we convert
however many input bytes make up a character to UCS-4, detect
repeated UCS-4 values, then translate each value to its Y output
characters.
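
The middle step, detecting repeats on UCS-4 values rather than on raw
bytes, can be sketched as below. This is a simplified illustration,
not the code in the patch, and ucs4_run_length is a hypothetical name.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the number of consecutive elements of
   CHARS, starting at index START, that are equal to CHARS[START].
   Returns 0 if START is not below LEN.  */
size_t
ucs4_run_length (const uint32_t *chars, size_t len, size_t start)
{
  size_t end = start;

  while (end < len && chars[end] == chars[start])
    end++;

  return end - start;
}
```

Because the comparison is made after conversion to UCS-4, it does not
matter how many bytes each character occupies on the target or on the
host.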

Current shortcomings:

1. There's no support for non-C-like languages.

2. I've probably broken building with iconv disabled (actually I
couldn't figure out how to build without iconv() support -- even for
e.g. a mingw32 host which shouldn't support it).

3. Currently wrong-endian wide characters from the target will confuse
things (but you can explicitly set target-wide-charset to UCS-4LE or
UCS-4BE for example).

4. I've not written documentation or altered test cases yet
(charset.exp shows some regressions).
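
To make the endianness problem in shortcoming 3 concrete, here is a
small sketch (not from the patch; ucs4_from_bytes is a hypothetical
helper) showing how the same four target bytes give different code
points depending on the assumed byte order:

```c
#include <stdint.h>

/* Hypothetical helper: assemble one UCS-4 code point from four target
   bytes.  BIG_ENDIAN_P nonzero means the bytes are UCS-4BE, zero
   means UCS-4LE.  */
uint32_t
ucs4_from_bytes (const unsigned char *bytes, int big_endian_p)
{
  if (big_endian_p)
    return ((uint32_t) bytes[0] << 24) | ((uint32_t) bytes[1] << 16)
           | ((uint32_t) bytes[2] << 8) | (uint32_t) bytes[3];

  return ((uint32_t) bytes[3] << 24) | ((uint32_t) bytes[2] << 16)
         | ((uint32_t) bytes[1] << 8) | (uint32_t) bytes[0];
}
```

The bytes f6 00 00 00 decode to 0xf6 as UCS-4LE but to 0xf6000000 as
UCS-4BE, which is not a valid code point.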

Tom Tromey is working on a patch related to this. Some of his comments
are incorporated in this patch relative to an earlier version sent to
him privately (thanks!).

Regression tested on x86-64 Linux, and spot-checked with an ARM Linux
cross debugger (from x86 build/host). As mentioned above, there are
some regressions so far.

OK to apply, or any comments?

Cheers,

Julian

ChangeLog

    gdb/
    * c-valprint.c (textual_element_type): Alter TYPE to be the type of
    the element before looking through typedefs, and update comment. Add
    wide-character support.
    (c_val_print): Pass type before typedef resolution to
    textual_element_type calls.
    * charset.c (langinfo.h): Include, if HAVE_LANGINFO_CODESET.
    (GDB_DEFAULT_TARGET_WIDE_CHARSET, GDB_INTERNAL_CODESET): New macros.
    (host_charset_auto): New.
    (show_host_charset_name): Indicate automatically-selected charset.
    (target_wide_charset_name, show_target_wide_charset_name): New.
    (host_charset_enum): Add "auto".
    (target_wide_charset_enum): New. Support a limited number of
    wchar_t character sets.
    (iconv_char_print_literally): New.
    (iconv_to_control): New.
    (lookup_and_register_iconv_charset): New.
    (default_c_internal_char_has_backslash_escape): New.
    (current_target_wide_charset, internal_charset): New.
    (set_host_charset): Add support for "auto" host charset.
    (show_charset): Show target wide charset.
    (set_target_wide_charset, set_target_wide_charset_sfunc)
    (target_wide_charset, cached_iconv_target_to_internal)
    (cached_iconv_internal_to_host, target_to_internal_iconv_t)
    (internal_to_host_iconv_t, reset_host_char_state)
    (target_char_to_internal, internal_char_host_emit): New.
    (_initialize_charset): Add wide-character support.
    * charset.h (target_wide_charset, reset_host_char_state)
    (target_char_to_internal, internal_char_host_emit): Add prototypes.
    * c-lang.c (c_internal_char_host_emit, c_printwidestr): New.
    (c_printstr): Call c_printwidestr when appropriate.
    * printcmd.c (print_formatted): Add wide-character support.
    * configure.ac (AM_LANGINFO_CODESET): Add.
    * acinclude.m4 (../config/codeset.m4): Include.
    * config.in: Regenerate.
    * configure: Regenerate.