Default target wide character set

* Default target wide character set
@ 2009-09-15 14:12 Alexey Feldgendler
  2009-09-16 18:56 ` Tom Tromey
  0 siblings, 1 reply; 11+ messages in thread
From: Alexey Feldgendler @ 2009-09-15 14:12 UTC (permalink / raw)
  To: gdb-patches

Hello, all.

I'm Alexey Feldgendler, a developer in a software company that uses gdb to  
debug on *nix systems. I got assigned part-time to contribute to gdb,  
mostly by fixing bugs that affect us, but also to implement new features.  
Because I'm new to gdb internals, I'm trying to be very careful about  
making anything but trivial fixes like <> for now, so I'd like to discuss  
with you something I'm trying to achieve. Thank you in advance for your  
time.

Currently, UCS-4 is the default target wide character set. However, what  
this setting is really used for is handling of wide character strings,  
i.e. sequences of wide characters, which in C[++] are represented by the  
type wchar_t. By default, gcc indeed considers wchar_t 4 bytes wide, but  
it has an option to make wchar_t 2 bytes wide (-fshort-wchar). When this  
option is used, the default setting for the target wide character set  
becomes wrong.

Looking at charset_for_string_type(), it seems to handle C_STRING_16 and  
C_STRING_32 sort of correctly based on the character width. However, for  
C_WIDE_STRING it simply uses target_wide_charset(), no matter whether it's  
reasonable or not.

I have two alternative ideas for how to tackle this problem.

A. Have the default target wide character set depend on the size of the
type named wchar_t. If I understand it correctly, in this case the default
needs to be updated whenever the symbol table gets loaded. Of course, any
user-specified value should override the computed default. There should
also be some way to reset the option to its dynamic default.
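
As an illustration of option A, here is a minimal sketch in C. It is not gdb code: the struct, the function names, and the reset keyword "auto" are all hypothetical, and an unknown wchar_t size is assumed to keep today's UCS-4 default.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of option A; none of these names exist in gdb.  */
struct wide_charset_setting
{
  const char *user_value;  /* NULL: no explicit choice, use the default.  */
  int wchar_size;          /* Target wchar_t size in bytes; 0 if unknown.  */
};

/* Called when a symbol table is loaded and the size of wchar_t is known.  */
void
wide_charset_note_wchar_size (struct wide_charset_setting *s, int size)
{
  assert (size >= 0);
  s->wchar_size = size;
}

/* "set" handler: an explicit name overrides the default, while the
   hypothetical keyword "auto" returns to the dynamic default.  */
void
wide_charset_set (struct wide_charset_setting *s, const char *name)
{
  s->user_value = strcmp (name, "auto") == 0 ? NULL : name;
}

/* The charset actually used when printing wchar_t strings.  */
const char *
wide_charset_effective (const struct wide_charset_setting *s)
{
  if (s->user_value != NULL)
    return s->user_value;
  return s->wchar_size == 2 ? "UCS-2" : "UCS-4";
}
```

Keeping the user's choice separate from the computed default means that reloading symbols never clobbers an explicit setting.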

Side question: how does gdb figure out sizeof(wchar_t)? Does it come from  
the symbol table or from elsewhere?

B. Have charset_for_string_type() check after calling  
target_wide_charset() whether the width of the returned character set  
matches the width of the actual string type, and use fallback similar to  
what's done for C_STRING_16 and C_STRING_32 if it doesn't. By width of the  
character set I mean the smallest possible width of a character in it,  
that would be e.g. 1 for UTF-8 and 2 for UCS-2. In this case, what “show
charset” shows sometimes won't match what's actually used to print a
wchar_t[] string.
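
Option B could look roughly like the following sketch. It assumes a small hand-written table of minimum character widths, since no standard API provides one; the table, the function names, and the fallback charset names are hypothetical and are not taken from gdb.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical width table: smallest character width in bytes.  */
struct charset_width
{
  const char *name;
  int min_width;
};

static const struct charset_width known_widths[] = {
  { "UTF-8", 1 }, { "UCS-2", 2 }, { "UTF-16", 2 },
  { "UCS-4", 4 }, { "UTF-32", 4 },
};

/* Return the minimum width of NAME, or 0 if it is not in the table.  */
int
charset_min_width (const char *name)
{
  size_t i;

  for (i = 0; i < sizeof known_widths / sizeof known_widths[0]; i++)
    if (strcmp (known_widths[i].name, name) == 0)
      return known_widths[i].min_width;
  return 0;
}

/* Use CONFIGURED unless its known width contradicts ELEM_WIDTH, the
   element size of the string being printed; then fall back by width.  */
const char *
charset_for_wide_string (const char *configured, int elem_width)
{
  int width = charset_min_width (configured);

  assert (elem_width > 0);
  if (width == 0 || width == elem_width)
    return configured;  /* Unknown or matching width: keep the setting.  */
  switch (elem_width)
    {
    case 2:
      return "UCS-2";
    case 4:
      return "UCS-4";
    default:
      return configured;
    }
}
```

With this sketch, charset_for_wide_string ("UCS-4", 2) yields "UCS-2", while a charset missing from the table, such as "SJIS", is passed through unchanged.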

What do you think of options A and B? Or is there maybe another possibility
that I'm overlooking?

-- 
Alexey Feldgendler <alexeyf@opera.com>
Software Developer, Desktop Platform/Delivery Team, Opera Software ASA
[ICQ: 115226275] http://my.opera.com/feldgendler/

* Re: Default target wide character set
  2009-09-15 14:12 Default target wide character set Alexey Feldgendler
@ 2009-09-16 18:56 ` Tom Tromey
  2009-09-16 19:11   ` Pedro Alves
                     ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Tom Tromey @ 2009-09-16 18:56 UTC (permalink / raw)
  To: Alexey Feldgendler; +Cc: gdb-patches

>>>>> "Alexey" == Alexey Feldgendler <alexeyf@opera.com> writes:

Alexey> I got assigned part-time to contribute to gdb, mostly by fixing
Alexey> bugs that affect us, but also to implement new features.

Welcome to GDB.

I don't know your copyright assignment situation, but if you are
planning to submit patches, it doesn't hurt to get started on that
early.  Send me email off-list if you want to do this.

Alexey> A. Have the default target wide character set depend on the size of
Alexey> the type named wchar_t.

Alexey> Side question: how does gdb figure out sizeof(wchar_t)? Does it come
Alexey> from the symbol table or from elsewhere?

Yeah, look in c-lang.c for a call to lookup_typename with an argument of
"wchar_t".  The resulting type can be queried for its attributes.

Alexey> B. Have charset_for_string_type() check after calling
Alexey> target_wide_charset() whether the width of the returned character set
Alexey> matches the width of the actual string type, and use fallback similar
Alexey> to  what's done for C_STRING_16 and C_STRING_32 if it doesn't. 

Alexey> What do you think of options A and B? Or is there maybe another
Alexey> possibility that I'm overlooking?

Yeah, I think there is another solution.  It is pretty similar to these,
though.

The general problem is that the relevant standards put very few
constraints on wchar_t.  There is no guarantee that they use any form of
UCS -- and there are systems which in fact do not.

Therefore, if the user picks some random target wide charset, I think we
ought to honor his request.

Another wrinkle is that there are no good ways to determine any
characteristics of character sets.  This simply isn't part of any
standardized API (we could of course implement our own database for
this... but I was not motivated to do so).  What this means is that we
can also do very little error checking in practice -- if the target uses
UCS-4, but the user says "set target-wide-charset SJIS", well, he will
get nonsense in response, with no warning from GDB.

What I would propose doing is adding a new charset named "UCS".  If this
is selected as the target wide charset, then we would automatically pick
UCS-2 or UCS-4 depending on sizeof(target wchar_t).  This would probably
mean having a few special cases in the code (like we do for the -BE and
-LE variants).  We would then make this the default target wide charset.

What do you think of that?
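
A minimal sketch of how such a pseudo-charset might be resolved, assuming the -BE/-LE special-casing mentioned above. The enum and function name are hypothetical; only the resulting names (UCS-2BE, UCS-2LE, UCS-4BE, UCS-4LE) are ones that iconv implementations such as glibc's accept.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the target's byte order.  */
enum sketch_byte_order { SKETCH_BIG_ENDIAN, SKETCH_LITTLE_ENDIAN };

/* Map the pseudo-charset "UCS" to a concrete charset name based on the
   size of the target's wchar_t; any other setting is returned as-is so
   that an explicit user choice is always honored.  */
const char *
resolve_wide_charset (const char *setting, int wchar_size,
                      enum sketch_byte_order order)
{
  if (strcmp (setting, "UCS") != 0)
    return setting;

  assert (wchar_size == 2 || wchar_size == 4);
  if (wchar_size == 2)
    return order == SKETCH_BIG_ENDIAN ? "UCS-2BE" : "UCS-2LE";
  return order == SKETCH_BIG_ENDIAN ? "UCS-4BE" : "UCS-4LE";
}
```

For example, a little-endian target built with -fshort-wchar would resolve "UCS" to "UCS-2LE", while "set target-wide-charset SJIS" would still be passed through untouched.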

Tom

* Re: Default target wide character set
  2009-09-16 18:56 ` Tom Tromey
@ 2009-09-16 19:11   ` Pedro Alves
  2009-09-16 20:41     ` Eli Zaretskii
  2009-09-16 20:40   ` Eli Zaretskii
  2009-09-17  8:50   ` Alexey Feldgendler
  2 siblings, 1 reply; 11+ messages in thread
From: Pedro Alves @ 2009-09-16 19:11 UTC (permalink / raw)
  To: gdb-patches, tromey; +Cc: Alexey Feldgendler

On Wednesday 16 September 2009 19:55:52, Tom Tromey wrote:
> What I would propose doing is adding a new charset named "UCS".  If this
> is selected as the target wide charset, then we would automatically pick
> UCS-2 or UCS-4 depending on sizeof(target wchar_t).  This would probably
> mean having a few special cases in the code (like we do for the -BE and
> -LE variants).  We would then make this the default target wide charset.
> 
> What do you think of that?

Something like that would also be useful for Windows targets, and would
address PR9996:

 http://sourceware.org/bugzilla/show_bug.cgi?id=9996

-- 
Pedro Alves


* Re: Default target wide character set
  2009-09-16 18:56 ` Tom Tromey
  2009-09-16 19:11   ` Pedro Alves
@ 2009-09-16 20:40   ` Eli Zaretskii
  2009-09-16 20:46     ` Tom Tromey
  2009-09-17  8:50   ` Alexey Feldgendler
  2 siblings, 1 reply; 11+ messages in thread
From: Eli Zaretskii @ 2009-09-16 20:40 UTC (permalink / raw)
  To: tromey; +Cc: alexeyf, gdb-patches

> From: Tom Tromey <tromey@redhat.com>
> Cc: gdb-patches@sourceware.org
> Date: Wed, 16 Sep 2009 12:55:52 -0600
> 
> What I would propose doing is adding a new charset named "UCS".  If this
> is selected as the target wide charset, then we would automatically pick
> UCS-2 or UCS-4 depending on sizeof(target wchar_t).

AFAIK, Windows (whose wchar_t is 16-bit) uses UTF-16, not UCS-2.

What other platforms have a 16-bit wchar_t, and are you sure any
significant portion of them use UCS-2 (which is an obsolete encoding,
AFAIK)?


* Re: Default target wide character set
  2009-09-16 19:11   ` Pedro Alves
@ 2009-09-16 20:41     ` Eli Zaretskii
  2009-09-16 20:46       ` Tom Tromey
  2009-09-17 10:25       ` Pedro Alves
  0 siblings, 2 replies; 11+ messages in thread
From: Eli Zaretskii @ 2009-09-16 20:41 UTC (permalink / raw)
  To: Pedro Alves; +Cc: gdb-patches, tromey, alexeyf

> From: Pedro Alves <pedro@codesourcery.com>
> Date: Wed, 16 Sep 2009 20:11:18 +0100
> Cc: "Alexey Feldgendler" <alexeyf@opera.com>
> 
> Something like that would also be useful for Windows targets, and would
> address PR9996:
> 
>  http://sourceware.org/bugzilla/show_bug.cgi?id=9996

Are you sure this PR still holds with the newer Cygwin 1.7? Does
Cygwin still use a 2-byte wchar_t?


* Re: Default target wide character set
  2009-09-16 20:40   ` Eli Zaretskii
@ 2009-09-16 20:46     ` Tom Tromey
  2009-09-16 20:54       ` Eli Zaretskii
  0 siblings, 1 reply; 11+ messages in thread
From: Tom Tromey @ 2009-09-16 20:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: alexeyf, gdb-patches

>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

>> What I would propose doing is adding a new charset named "UCS".  If this
>> is selected as the target wide charset, then we would automatically pick
>> UCS-2 or UCS-4 depending on sizeof(target wchar_t).

Eli> AFAIK, Windows (whose wchar_t is 16-bit) uses UTF-16, not UCS-2.

Ok.

We could name it "auto" then.

Eli> What other platforms have a 16-bit wchar_t, and are you sure any
Eli> significant portion of them use UCS-2 (which is an obsolete encoding,
Eli> AFAIK)?

I don't know.

Tom


* Re: Default target wide character set
  2009-09-16 20:41     ` Eli Zaretskii
@ 2009-09-16 20:46       ` Tom Tromey
  2009-09-17 10:25       ` Pedro Alves
  1 sibling, 0 replies; 11+ messages in thread
From: Tom Tromey @ 2009-09-16 20:46 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: Pedro Alves, gdb-patches, alexeyf

>>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:

>> http://sourceware.org/bugzilla/show_bug.cgi?id=9996

Eli> Are you sure this PR still holds with the newer Cygwin 1.7? does
Eli> Cygwin still use 2-byte wchar_t?

I don't know about that, but the underlying problem remains: you can
have a target-wide-charset which disagrees with your target, and get
nonsense.  There's not even any good way for GDB to detect and warn
about this.

I've been meaning to add a note to the manual for a while explaining
this gotcha.  It is pretty confusing when it does happen.

Tom

* Re: Default target wide character set
  2009-09-16 20:46     ` Tom Tromey
@ 2009-09-16 20:54       ` Eli Zaretskii
  0 siblings, 0 replies; 11+ messages in thread
From: Eli Zaretskii @ 2009-09-16 20:54 UTC (permalink / raw)
  To: Tom Tromey; +Cc: alexeyf, gdb-patches

> From: Tom Tromey <tromey@redhat.com>
> Cc: alexeyf@opera.com, gdb-patches@sourceware.org
> Date: Wed, 16 Sep 2009 14:46:30 -0600
> 
> >>>>> "Eli" == Eli Zaretskii <eliz@gnu.org> writes:
> 
> >> What I would propose doing is adding a new charset named "UCS".  If this
> >> is selected as the target wide charset, then we would automatically pick
> >> UCS-2 or UCS-4 depending on sizeof(target wchar_t).
> 
> Eli> AFAIK, Windows (whose wchar_t is 16-bit) uses UTF-16, not UCS-2.
> 
> Ok.
> 
> We could name it "auto" then.

Yup.  Just to make myself clear: I was talking about the native
Windows runtime (MinGW).  I don't know what Cygwin uses; if it uses
something different, then we would need two different defaults.


* Re: Default target wide character set
  2009-09-16 18:56 ` Tom Tromey
  2009-09-16 19:11   ` Pedro Alves
  2009-09-16 20:40   ` Eli Zaretskii
@ 2009-09-17  8:50   ` Alexey Feldgendler
  2009-09-18 20:42     ` Tom Tromey
  2 siblings, 1 reply; 11+ messages in thread
From: Alexey Feldgendler @ 2009-09-17  8:50 UTC (permalink / raw)
  To: gdb-patches

On Wed, 16 Sep 2009 20:55:52 +0200, Tom Tromey <tromey@redhat.com> wrote:

> I don't know your copyright assignment situation, but if you are
> planning to submit patches, it doesn't hurt to get started on that
> early.  Send me email off-list if you want to do this.

Sure, I'll make any small fix that makes sense on its own as a separate  
patch.

Alexey>> Side question: how does gdb figure out sizeof(wchar_t)? Does it
Alexey>> come from the symbol table or from elsewhere?

> Yeah, look in c-lang.c for a call to lookup_typename with an argument of
> "wchar_t".  The resulting type can be queried for its attributes.

What happens when no symbol table is available?

> What I would propose doing is adding a new charset named "UCS".  If this
> is selected as the target wide charset, then we would automatically pick
> UCS-2 or UCS-4 depending on sizeof(target wchar_t).  This would probably
> mean having a few special cases in the code (like we do for the -BE and
> -LE variants).  We would then make this the default target wide charset.
>
> What do you think of that?

I think it's a very good idea. Indeed, it's much more user-friendly to  
have an auto-sensing option with clearly defined semantics.

On Wed, 16 Sep 2009 22:40:01 +0200, Eli Zaretskii <eliz@gnu.org> wrote:

> AFAIK, Windows (whose wchar_t is 16-bit) uses UTF-16, not UCS-2.
>
> What other platforms have a 16-bit wchar_t, and are you sure any
> significant portion of them use UCS-2 (which is an obsolete encoding,
> AFAIK)?

Because UCS-2 is a subset of UTF-16, it won't hurt to just change all uses  
of UCS-2 in gdb to UTF-16.
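
The subset relationship can be seen in a small encoder sketch. The helper utf16_encode is hypothetical, not a gdb function; it shows that every code point below U+10000 (outside the surrogate range) becomes a single 16-bit unit equal to its UCS-2 value, and that only code points beyond that need a surrogate pair.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode code point CP as UTF-16 code units in OUT.  Return the number
   of units written (1 or 2), or 0 if CP cannot be encoded.  */
size_t
utf16_encode (uint32_t cp, uint16_t out[2])
{
  assert (out != NULL);
  if (cp >= 0xD800 && cp <= 0xDFFF)
    return 0;  /* Surrogate values are not characters.  */
  if (cp < 0x10000)
    {
      out[0] = (uint16_t) cp;  /* Same single unit as in UCS-2.  */
      return 1;
    }
  if (cp > 0x10FFFF)
    return 0;  /* Outside the Unicode range.  */

  cp -= 0x10000;
  out[0] = (uint16_t) (0xD800 + (cp >> 10));    /* High surrogate.  */
  out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));  /* Low surrogate.  */
  return 2;
}
```

So data that is valid UCS-2 decodes identically under UTF-16; the only difference shows up for text that actually contains surrogate pairs.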

On Wed, 16 Sep 2009 22:46:30 +0200, Tom Tromey <tromey@redhat.com> wrote:

> We could name it "auto" then.

I agree, and it should become the default.

It seems we agree here, so I'll submit a patch soon.


-- 
Alexey Feldgendler
Software Developer, Desktop Team, Opera Software ASA
[ICQ: 115226275] http://my.opera.com/feldgendler/


* Re: Default target wide character set
  2009-09-16 20:41     ` Eli Zaretskii
  2009-09-16 20:46       ` Tom Tromey
@ 2009-09-17 10:25       ` Pedro Alves
  1 sibling, 0 replies; 11+ messages in thread
From: Pedro Alves @ 2009-09-17 10:25 UTC (permalink / raw)
  To: gdb-patches, Eli Zaretskii; +Cc: tromey, alexeyf

On Wednesday 16 September 2009 21:40:56, Eli Zaretskii wrote:
> > From: Pedro Alves <pedro@codesourcery.com>
> > Date: Wed, 16 Sep 2009 20:11:18 +0100
> > Cc: "Alexey Feldgendler" <alexeyf@opera.com>
> > 
> > Something like that would also be useful for Windows targets, and would
> > address PR9996:
> > 
> >  http://sourceware.org/bugzilla/show_bug.cgi?id=9996
> 
> Are you sure this PR still holds with the newer Cygwin 1.7? does
> Cygwin still use 2-byte wchar_t?

I'm not 100% sure, but I'm 99.9(9)% sure, but only because I
didn't take the trouble to confirm it myself.  Why wouldn't it?  Given
that 2-byte WCHAR is what Windows NT uses internally to store and handle
strings, and hence what is passed back and forth between Cygwin and
the NT and Win32 APIs Cygwin uses to do its business, switching away
from 2-byte wchar_t on Cygwin doesn't sound like a good idea to me.

-- 
Pedro Alves

* Re: Default target wide character set
  2009-09-17  8:50   ` Alexey Feldgendler
@ 2009-09-18 20:42     ` Tom Tromey
  0 siblings, 0 replies; 11+ messages in thread
From: Tom Tromey @ 2009-09-18 20:42 UTC (permalink / raw)
  To: Alexey Feldgendler; +Cc: gdb-patches

>>>>> "Alexey" == Alexey Feldgendler <alexeyf@opera.com> writes:

Tom> Yeah, look in c-lang.c for a call to lookup_typename with an argument of
Tom> "wchar_t".  The resulting type can be queried for its attributes.

Alexey> What happens when no symbol table is available?

It just doesn't work.  I think the user will get an error.

If this is an important failure mode, we could perhaps add a new flag to
print to mean "print as wchar_t", akin to the existing print/s.

Tom
