RE: [RFC] Signed/unsigned character arrays are not strings

Mirror of the gdb mailing list

* RE: [RFC] Signed/unsigned character arrays are not strings
@ 2007-02-28 13:05 pkoning
  2007-03-01 11:01 ` Mark Kettenis
  0 siblings, 1 reply; 28+ messages in thread
From: pkoning @ 2007-02-28 13:05 UTC (permalink / raw)
  To: jimb, mark.kettenis
  Cc: drow, eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb

> Okay, here's a horrible idea.  :)  With this patch:
> 
> $ cat chars.c
> #include <stdio.h>
> #include <stdint.h>
> 
> typedef char byte_t;
> 
> char *c = "chars";
> unsigned char *uc = "unsigned chars";
> signed char *sc = "signed chars";
> byte_t *b = "bytes";
> int8_t *i8 = "int8_t's";
> uint8_t *ui8 = "uint8_t's";

Neat.

I would tweak it a little.  People might be using typedefs
for character strings that wrap, say, "unsigned char".
So if you're going to do a heuristic on the name, treat
it as a character string if the name ends in "char" (not 
necessarily with a preceding space) or "char_t" (because
many people use _t as the suffix for typedef names).

	paul
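
A minimal sketch of the suffix heuristic suggested above, written in C.  The
helper names `ends_with` and `name_looks_like_char_type` are hypothetical and
are not GDB functions; the sketch only illustrates the name test itself, not
how GDB would obtain a type's name.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if NAME ends with SUFFIX (case-sensitive). */
static bool ends_with(const char *name, const char *suffix)
{
    size_t name_len = strlen(name);
    size_t suffix_len = strlen(suffix);

    return name_len >= suffix_len
           && strcmp(name + (name_len - suffix_len), suffix) == 0;
}

/* Hypothetical helper: treat a type as a character type when its name
   ends in "char" or "char_t", with or without a preceding space. */
bool name_looks_like_char_type(const char *name)
{
    return ends_with(name, "char") || ends_with(name, "char_t");
}
```

With this test, "unsigned char", "uchar" and "my_char_t" are accepted, while
"byte_t" and "int8_t" are not.  The comparison is case-sensitive, so a typedef
such as libxml2's `xmlChar` would not match.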

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-28 13:05 [RFC] Signed/unsigned character arrays are not strings pkoning
@ 2007-03-01 11:01 ` Mark Kettenis
  0 siblings, 0 replies; 28+ messages in thread
From: Mark Kettenis @ 2007-03-01 11:01 UTC (permalink / raw)
  To: pkoning
  Cc: jimb, mark.kettenis, drow, eliz, dewar, nickrob, jan.kratochvil,
	Mathieu.Lacage, gdb

> Date: Tue, 27 Feb 2007 20:51:54 -0500
> From: <pkoning@equallogic.com>
> 
> > Okay, here's a horrible idea.  :)  With this patch:
> > 
> > $ cat chars.c
> > #include <stdio.h>
> > #include <stdint.h>
> > 
> > typedef char byte_t;
> > 
> > char *c = "chars";
> > unsigned char *uc = "unsigned chars";
> > signed char *sc = "signed chars";
> > byte_t *b = "bytes";
> > int8_t *i8 = "int8_t's";
> > uint8_t *ui8 = "uint8_t's";
> 
> Neat.
> 
> I would tweak it a little.  People might be using typedefs
> for character strings that wrap, say, "unsigned char".
> So if you're going to do a heuristic on the name, treat
> it as a character string if the name ends in "char" (not 
> necessarily with a preceding space) or "char_t" (because
> many people use _t as the suffix for typedef names).

I think this is a bad idea.  These choices are pretty arbitrary; isn't
"string" a reasonable name for a character string typedef too?  And
then of course you'd need to add "string_t" too.  Really, if we're
going to have a list of types, I think it should be a list of types
for which we are not going to print the result as a string.  And it
should only include typedefs mentioned in the relevant language
standard.

Oh and of course POSIX explicitly reserves the _t suffix, so people
really should not be naming their types "char_t".

Mark



* Re: [RFC] Signed/unsigned character arrays are not strings
@ 2007-02-24 16:13 Nick Roberts
  2007-02-24 20:11 ` Daniel Jacobowitz
  0 siblings, 1 reply; 28+ messages in thread
From: Nick Roberts @ 2007-02-24 16:13 UTC (permalink / raw)
  To: gdb


On Thu, 25 Jan 2007 02:54:22 +0100, Jan Kratochvil wrote (gdb-patches):

> currently all these types are printed as strings:
> 	char
> 	signed char
> 	unsigned char
> the patch will reduce the printed strings only to
> 	char
> and the signed/unsigned version gets printed as an array of byte values
> (characters).  I hope nobody uses sign-specification for strings.
> On the other hand byte arrays become unreadable if printed as strings.

Emacs uses sign-specification for strings.  There is a user defined command
called xbacktrace that prints a backtrace of Lisp functions.  Previously it
got printed like:

(gdb) xbacktrace
"split-window" (0x838c8c9)
"split-window-vertically" (0x838c8c9)
"call-interactively" (0x85b3ac9)

It now gets printed as:

(gdb) xbacktrace
{115 's', 112 'p', 108 'l', 105 'i', 116 't', 45 '-', 119 'w', 105 'i',
  110 'n', 100 'd', 111 'o', 119 'w'} (0x838c8c9)
{115 's', 112 'p', 108 'l', 105 'i', 116 't', 45 '-', 119 'w', 105 'i',
  110 'n', 100 'd', 111 'o', 119 'w', 45 '-', 118 'v', 101 'e', 114 'r',
  116 't', 105 'i', 99 'c', 97 'a', 108 'l', 108 'l', 121 'y'} (0x838c8c9)
{99 'c', 97 'a', 108 'l', 108 'l', 45 '-', 105 'i', 110 'n', 116 't', 101 'e',
  114 'r', 97 'a', 99 'c', 116 't', 105 'i', 118 'v', 101 'e', 108 'l',
  121 'y'} (0x85b3ac9)

-- 
Nick                                           http://www.inet.net.nz/~nickrob



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-24 16:13 Nick Roberts
@ 2007-02-24 20:11 ` Daniel Jacobowitz
  2007-02-24 20:53   ` Nick Roberts
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-24 20:11 UTC (permalink / raw)
  To: gdb

On Sat, Feb 24, 2007 at 09:23:42PM +1300, Nick Roberts wrote:
> Emacs uses sign-specification for strings.  There is a user defined command
> called xbacktrace that prints a backtrace of Lisp functions.  Previously it
> got printed like:
> 
> (gdb) xbacktrace
> "split-window" (0x838c8c9)
> "split-window-vertically" (0x838c8c9)
> "call-interactively" (0x85b3ac9)

Does adding an appropriate (char *) cast fix the problem, and do you
think it's reasonable?

-- 
Daniel Jacobowitz
CodeSourcery



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-24 20:11 ` Daniel Jacobowitz
@ 2007-02-24 20:53   ` Nick Roberts
  2007-02-24 21:07     ` Jan Kratochvil
  2007-02-25 21:07     ` mathieu lacage
  0 siblings, 2 replies; 28+ messages in thread
From: Nick Roberts @ 2007-02-24 20:53 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: gdb

 > Does adding an appropriate (char *) cast fix the problem, and do you
 > think it's reasonable?

If you mean in .gdbinit, changing

define xprintstr
  set $data = $arg0->data
  output ($arg0->size > 1000) ? 0 : ($data[0])@($arg0->size_byte < 0 ? $arg0->size & ~gdb_array_mark_flag : $arg0->size_byte)
end

to:

define xprintstr
  set $data = $arg0->data
  output (char *) ($arg0->size > 1000) ? 0 : ($data[0])@($arg0->size_byte < 0 ? $arg0->size & ~gdb_array_mark_flag : $arg0->size_byte)
end

makes no difference.

I don't understand why unsigned chars should be printed as arrays except to
solve Jan's particular problem.  Maybe Emacs uses unsigned char for 8 bit
character sets like iso_8859-1:

2000-01-04  Gerd Moellmann  <gerd@gnu.org>

	* lisp.h (struct Lisp_String): Make DATA member `unsigned char *'.

Like another change that Ulrich Drepper is proposing (%a) this patch changes
existing behaviour.  Why not just add a new output format, or boolean variable?


-- 
Nick                                           http://www.inet.net.nz/~nickrob



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-24 20:53   ` Nick Roberts
@ 2007-02-24 21:07     ` Jan Kratochvil
  2007-02-25  8:00       ` Daniel Jacobowitz
  2007-02-25 21:07     ` mathieu lacage
  1 sibling, 1 reply; 28+ messages in thread
From: Jan Kratochvil @ 2007-02-24 21:07 UTC (permalink / raw)
  To: Nick Roberts; +Cc: Daniel Jacobowitz, gdb

[-- Attachment #1: Type: text/plain, Size: 1839 bytes --]

On Sat, 24 Feb 2007 21:11:02 +0100, Nick Roberts wrote:
>  > Does adding an appropriate (char *) cast fix the problem, and do you
>  > think it's reasonable?
> 
> If you mean in .gdbinit, changing
...
>   output ($arg0->size > 1000) ? 0 : ($data[0])@($arg0->size_byte < 0 ? $arg0->size & ~gdb_array_mark_flag : $arg0->size_byte)
...
> to:
...
>   output (char *) ($arg0->size > 1000) ? 0 : ($data[0])@($arg0->size_byte < 0 ? $arg0->size & ~gdb_array_mark_flag : $arg0->size_byte)
...
> makes no difference.

You have an operator precedence bug there: the cast applies only to the
condition.  In fact, such a workaround works once the whole expression is
parenthesized [attached].
	(gdb) xbacktrace 
	0x92e75aa "recursive-edit" (0xa420a23)
	0x92e685c "byte-code" (0x825e7db)
	...
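
The precedence problem can be reproduced in plain C, assuming GDB parses the
expression with C operator precedence, as the fix implies.  In this sketch
(C11; the names `TYPE_NAME`, `type_unparenthesized` and `type_parenthesized`
are hypothetical), `_Generic` reports the static type of each form without
evaluating it:

```c
#include <stdint.h>

/* Map the static type of an expression to a printable name. */
#define TYPE_NAME(expr) _Generic((expr), \
    char *: "char *",                    \
    unsigned char *: "unsigned char *",  \
    default: "other")

/* Same shape as the non-working macro line: the cast binds to the
   condition only, so the result keeps the type of DATA. */
const char *type_unparenthesized(intptr_t too_long, unsigned char *data)
{
    return TYPE_NAME((char *) too_long ? 0 : data);
}

/* Same shape as the fixed line: the cast covers the whole conditional. */
const char *type_parenthesized(intptr_t too_long, unsigned char *data)
{
    return TYPE_NAME((char *) (too_long ? 0 : data));
}
```

The unparenthesized form still has type `unsigned char *`, because the cast is
applied to the condition alone; wrapping the whole conditional in parentheses
makes the result `char *`.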

> I don't understand why unsigned chars should be printed as arrays except to
> solve Jan's particular problem.  Maybe Emacs uses unsigned char for 8 bit
> character sets like iso_8859-1:
> 
> 2000-01-04  Gerd Moellmann  <gerd@gnu.org>
> 
> 	* lisp.h (struct Lisp_String): Make DATA member `unsigned char *'.

Either you consider the type still as a normal C string, and in this case it
should be `{,const} char *'.  Or you consider it as some arbitrary data (due to
its character set not matching the system default one), and in this case it
is best to display the numerical value of each of its array elements.  Printing
it as a string just outputs invalid characters to the debugger's screen.

I believe Emacs should revert to using `char *' but I do not know the reasons
for Gerd Moellmann's change above.


> Like another change that Ulrich Drepper is proposing (%a) this patch changes
> existing behaviour.  Why not just add a new output format, or boolean variable?

In that case one could drop the patch completely, as no one would ever discover
that such a new output format exists.  Of course, such a decision would also
make sense.



Regards,
Jan

[-- Attachment #2: emacs-22.0.93-gdb-xprintstr-fix.patch --]
[-- Type: text/plain, Size: 502 bytes --]

--- emacs-22.0.93/src/.gdbinit-orig	2007-01-21 23:51:40.000000000 +0100
+++ emacs-22.0.93/src/.gdbinit	2007-02-24 21:35:08.000000000 +0100
@@ -978,7 +978,7 @@
 
 define xprintstr
   set $data = $arg0->data
-  output ($arg0->size > 1000) ? 0 : ($data[0])@($arg0->size_byte < 0 ? $arg0->size & ~gdb_array_mark_flag : $arg0->size_byte)
+  output (const char *) (($arg0->size > 1000) ? 0 : ($data[0])@($arg0->size_byte < 0 ? $arg0->size & ~gdb_array_mark_flag : $arg0->size_byte))
 end
 
 define xprintsym


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-24 21:07     ` Jan Kratochvil
@ 2007-02-25  8:00       ` Daniel Jacobowitz
  2007-02-25 19:54         ` Nick Roberts
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-25  8:00 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: Nick Roberts, gdb

On Sat, Feb 24, 2007 at 09:46:26PM +0100, Jan Kratochvil wrote:
> You have an association bug there.  In fact such workaround works [attached].
> 	(gdb) xbacktrace 
> 	0x92e75aa "recursive-edit" (0xa420a23)
> 	0x92e685c "byte-code" (0x825e7db)

Or try "set $data = (char *) $arg0->data" or so.  That should work
too.

-- 
Daniel Jacobowitz
CodeSourcery



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-25  8:00       ` Daniel Jacobowitz
@ 2007-02-25 19:54         ` Nick Roberts
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Roberts @ 2007-02-25 19:54 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Jan Kratochvil, gdb

Daniel Jacobowitz writes:
 > On Sat, Feb 24, 2007 at 09:46:26PM +0100, Jan Kratochvil wrote:
 > > You have an association bug there.  In fact such workaround works [attached].
 > > 	(gdb) xbacktrace 
 > > 	0x92e75aa "recursive-edit" (0xa420a23)
 > > 	0x92e685c "byte-code" (0x825e7db)
 > 
 > Or try "set $data = (char *) $arg0->data" or so.  That should work
 > too.

Yes, this works, thanks (although Jan's suggestion to use (const char *) didn't
work for me).  I still don't see why strings should only be seven-bit, as
Emacs, at least, works with different character sets, but my immediate problem
is fixed now, anyway.

-- 
Nick                                           http://www.inet.net.nz/~nickrob



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-24 20:53   ` Nick Roberts
  2007-02-24 21:07     ` Jan Kratochvil
@ 2007-02-25 21:07     ` mathieu lacage
  2007-02-26  0:45       ` Jan Kratochvil
  1 sibling, 1 reply; 28+ messages in thread
From: mathieu lacage @ 2007-02-25 21:07 UTC (permalink / raw)
  To: Nick Roberts; +Cc: Daniel Jacobowitz, gdb

On Sun, 2007-02-25 at 09:11 +1300, Nick Roberts wrote:

> I don't understand why unsigned chars should be printed as arrays except to
> solve Jan's particular problem.  Maybe Emacs uses unsigned char for 8 bit
> character sets like iso_8859-1:

I don't know how useful that is to you but a lot of people (the first
which comes to my mind is libxml2) decided to use "unsigned char *" to
identify utf-8 encoded strings in C.

Mathieu




* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-25 21:07     ` mathieu lacage
@ 2007-02-26  0:45       ` Jan Kratochvil
  2007-02-27  7:17         ` Eli Zaretskii
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Jan Kratochvil @ 2007-02-26  0:45 UTC (permalink / raw)
  To: mathieu lacage; +Cc: Nick Roberts, Daniel Jacobowitz, gdb

[-- Attachment #1: Type: text/plain, Size: 464 bytes --]

On Sun, 25 Feb 2007 08:59:41 +0100, mathieu lacage wrote:
...
> I don't know how useful that is to you but a lot of people (the first
> which comes to my mind is libxml2) decided to use "unsigned char *" to
> identify utf-8 encoded strings in C.

Given the attached responses from RMS, I am now more inclined to revert this
change and provide only a "$xmm"-specific fix instead (probably for the GDB
int8_t/uint8_t internal types).

OK to submit the patch?


Jan

[-- Attachment #2: Type: message/rfc822, Size: 2450 bytes --]

From: Richard Stallman <rms@gnu.org>
To: Nick Roberts <nickrob@snap.net.nz>
Cc: jan.kratochvil@redhat.com
Cc: bug-gdb@gnu.org
Subject: Re: Emacs .gdbinit incompatible with latest GDB
Date: Sun, 25 Feb 2007 14:30:23 -0500
Message-ID: <E1HLP4h-0003js-7j@fencepost.gnu.org>

    the recent GDB has problems running GDB `xbacktrace' on EMACS
    http://sources.redhat.com/ml/gdb/2007-02/msg00252.html

It seems clear why the change was made:

	On the other hand byte arrays become unreadable if printed as strings.

However, it seems that their hope this would not bother anyone was based
on an assumption which is inaccurate:

      I hope nobody uses sign-specification for strings.

Which GDB behavior is better is a matter of how often each one is
convenient and how often it causes trouble.  I don't know enough to
have an opinion about that, but if neither one is clearly better
overall, it would be best to leave GDB the way it was.


[-- Attachment #3: Type: message/rfc822, Size: 2136 bytes --]

From: Richard Stallman <rms@gnu.org>
To: Nick Roberts <nickrob@snap.net.nz>
Cc: jan.kratochvil@redhat.com, emacs-pretest-bug@gnu.org
Subject: Re: Emacs .gdbinit incompatible with latest GDB
Date: Sun, 25 Feb 2007 14:30:24 -0500
Message-ID: <E1HLP4i-0003k4-IF@fencepost.gnu.org>

    2000-01-04  Gerd Moellmann  <gerd@gnu.org>

	    * lisp.h (struct Lisp_String): Make DATA member `unsigned char *'.

    I guess the questions to ask are:

    1) Why was this change made?

Probably to make it easier to avoid incorrect conversions when
extracting elements.  We don't want to get negative numbers
for byte values above 127.
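
The conversion problem described above can be shown with a short C sketch.
The function names are hypothetical; `signed char` is used instead of plain
`char` so that the result does not depend on the platform's choice of
signedness for `char`, and a two's-complement machine is assumed.

```c
#include <stddef.h>

/* Widen one element to int through a signed byte pointer: bytes above
   127 come out negative. */
int element_as_signed(const signed char *data, size_t index)
{
    return data[index];
}

/* Widen one element to int through an unsigned byte pointer: the value
   stays in the range 0..255. */
int element_as_unsigned(const unsigned char *data, size_t index)
{
    return data[index];
}
```

For the byte 0xE9 (e with acute accent in ISO 8859-1), the signed accessor
yields -23 and the unsigned one yields 233.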


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-26  0:45       ` Jan Kratochvil
@ 2007-02-27  7:17         ` Eli Zaretskii
  2007-02-27  9:29         ` Daniel Jacobowitz
  2007-04-10 21:59         ` Daniel Jacobowitz
  2 siblings, 0 replies; 28+ messages in thread
From: Eli Zaretskii @ 2007-02-27  7:17 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: Mathieu.Lacage, nickrob, drow, gdb

> Date: Sun, 25 Feb 2007 20:53:50 +0100
> From: Jan Kratochvil <jan.kratochvil@redhat.com>
> Cc: Nick Roberts <nickrob@snap.net.nz>, Daniel Jacobowitz <drow@false.org>,         gdb@sourceware.org
> 
> OK to submit the patch?

Yes, please.



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-26  0:45       ` Jan Kratochvil
  2007-02-27  7:17         ` Eli Zaretskii
@ 2007-02-27  9:29         ` Daniel Jacobowitz
  2007-02-27 12:02           ` Nick Roberts
  2007-04-10 21:59         ` Daniel Jacobowitz
  2 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-27  9:29 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: mathieu lacage, Nick Roberts, gdb

On Sun, Feb 25, 2007 at 08:53:50PM +0100, Jan Kratochvil wrote:
> On Sun, 25 Feb 2007 08:59:41 +0100, mathieu lacage wrote:
> ...
> > I don't know how useful that is to you but a lot of people (the first
> > which comes to my mind is libxml2) decided to use "unsigned char *" to
> > identify utf-8 encoded strings in C.
> 
> Together with the attached RMS's response I became more inclined to revert this
> change and provide only "$xmm"-specific fix instead (probably for the GDB
> int8_t/uint8_t internal types).
> 
> OK to submit the patch?

RMS wrote:

> Which GDB behavior is better is a matter of how often each one is
> convenient and how often it causes trouble.  I don't know enough to
> have an opinion about that, but if neither one is clearly better
> overall, it would be best to leave GDB the way it was.

For myself, I think the new behavior is clearly better overall, and
that relatively few packages rely on sign-specified character types
for strings; that's why I approved Jan's patch.  I even proposed an
extension that I would find even more useful, to suppress the
single-quoted characters for arrays of signed or unsigned byte
variables.  (No one's commented on that; I'll wait until we decide
about this one first.)

Do you think that Emacs's behavior - an important GNU application, but
only one - changes the overall situation?  I don't, and I am generally
opposed to backing this change out.  I believe there are more
applications which use single byte arrays for numerical data than for
character data.  We can document how to produce string output more
clearly in the manual, perhaps?

-- 
Daniel Jacobowitz
CodeSourcery


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27  9:29         ` Daniel Jacobowitz
@ 2007-02-27 12:02           ` Nick Roberts
  2007-02-27 17:06             ` Robert Dewar
  2007-02-27 22:12             ` Daniel Jacobowitz
  0 siblings, 2 replies; 28+ messages in thread
From: Nick Roberts @ 2007-02-27 12:02 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: Jan Kratochvil, mathieu lacage, gdb

 > > Which GDB behavior is better is a matter of how often each one is
 > > convenient and how often it causes trouble.  I don't know enough to
 > > have an opinion about that, but if neither one is clearly better
 > > overall, it would be best to leave GDB the way it was.
 > 
 > For myself, I think the new behavior is clearly better overall, and
 > that relatively few packages rely on sign-specified character types
 > for strings; that's why I approved Jan's patch.  I even proposed an
 > extension that I would find even more useful, to suppress the
 > single-quoted characters for arrays of signed or unsigned byte
 > variables.  (No one's commented on that; I'll wait until we decide
 > about this one first.)
 > 
 > Do you think that Emacs's behavior - an important GNU application, but
 > only one - changes the overall situation?  I don't, and I am generally
 > opposed to backing this change out.  

I'm not sure who you are addressing, but I don't think anyone is saying
Emacs changes the overall situation.

 >                                      I believe there are more
 > applications which use single byte arrays for numerical data than for
 > character data.  

That answers the question that we are really asking and justifies the patch.

 >                 We can document how to produce string output more
 > clearly in the manual, perhaps?

Yes I think that is important, especially for an incompatible change.

The following could also be updated in the manual:

         (gdb) print $xmm1
         $1 = {
           v4_float = {0, 3.43859137e-038, 1.54142831e-044, 1.821688e-044},
           v2_double = {9.92129282474342e-303, 2.7585945287983262e-313},
           v16_int8 = "\000\000\000\000\3706;\001\v\000\000\000\r\000\000",
           v8_int16 = {0, 0, 14072, 315, 11, 0, 13, 0},
           v4_int32 = {0, 20657912, 11, 13},
           v2_int64 = {88725056443645952, 55834574859},
           uint128 = 0x0000000d0000000b013b36f800000000
         }

-- 
Nick                                           http://www.inet.net.nz/~nickrob



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 12:02           ` Nick Roberts
@ 2007-02-27 17:06             ` Robert Dewar
  2007-02-27 18:42               ` Daniel Jacobowitz
  2007-02-27 21:47               ` Eli Zaretskii
  2007-02-27 22:12             ` Daniel Jacobowitz
  1 sibling, 2 replies; 28+ messages in thread
From: Robert Dewar @ 2007-02-27 17:06 UTC (permalink / raw)
  To: Nick Roberts; +Cc: Daniel Jacobowitz, Jan Kratochvil, mathieu lacage, gdb

Nick Roberts wrote:

> That answers the question that we are really asking and justifies the patch.

Not necessarily. First, it is only a claim, without documentation;
second, any incompatible change seems basically problematic. I
prefer to avoid this incompatible change; it seems like it would
be a surprise to a substantial number of users. For sure the use
of unsigned char for character data is pretty common.
> 
>  >                 We can document how to produce string output more
>  > clearly in the manual, perhaps?

I would instead document more clearly how to produce the integer
output.



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 17:06             ` Robert Dewar
@ 2007-02-27 18:42               ` Daniel Jacobowitz
  2007-02-27 21:53                 ` Eli Zaretskii
  2007-02-27 21:47               ` Eli Zaretskii
  1 sibling, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-27 18:42 UTC (permalink / raw)
  To: Robert Dewar; +Cc: Nick Roberts, Jan Kratochvil, mathieu lacage, gdb

On Tue, Feb 27, 2007 at 07:51:53AM -0500, Robert Dewar wrote:
> Nick Roberts wrote:
> 
> >That answers the question that we are really asking and justifies the 
> >patch.
> 
> Not necessarily. First it is only a claim, without documentation,

Do you have any reasonable ideas on how to gather data?  I'm listening
:-)

I spent a little while poking at Google CodeSearch.  There were
definitely some matches of people assigning strings to "unsigned char
*" variables - most of the ones I looked at were in test code for
crypto libraries, or things like base64 / locale ctype tables.  There
were an order of magnitude (about 75x) more matches for plain "char
*".

signed\ char.*\ =\ .*\"		about   7000
unsigned\ char.*\ =\ .*\"	about  10600
char.*\ =\ .*\"			about 753000

I know that as a GDB developer, debugging GDB, I'd want explicitly
signed or unsigned characters to be printed as data; we made a
deliberate switch to using gdb_byte (which is unsigned char) for
unknown data read from target memory.  We cast it to char * when we
read strings.

> second, any incompatible change seems basically problematic.

I have some trouble understanding this.  Could someone explain it to
me?

It's an honest and serious question, I'm not asking for a lecture on
compatibility concepts here.  This is user interface, not core
functionality.  It's more like clarifying the text of one of GCC's
warning messages than changing the dialect of C it accepts.  I think
we have a lot of freedom to adapt our default output to be more useful
to our users, especially when we provide a way to get the old
behavior.  In this case that method is even completely backwards
compatible.

I think we have a lot of freedom to make this kind of change.  The
same reasoning applies to the print/x floating point discussion.

> > >                 We can document how to produce string output more
> > > clearly in the manual, perhaps?
> 
> I would instead document more clearly how to produce the integer
> output.

Without this patch there wasn't any way to produce the integer output
for single byte elements.  Which drove me batty working with vector
registers - I'm glad Jan posted the patch!

-- 
Daniel Jacobowitz
CodeSourcery


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 18:42               ` Daniel Jacobowitz
@ 2007-02-27 21:53                 ` Eli Zaretskii
  2007-02-27 22:12                   ` Daniel Jacobowitz
  0 siblings, 1 reply; 28+ messages in thread
From: Eli Zaretskii @ 2007-02-27 21:53 UTC (permalink / raw)
  To: Daniel Jacobowitz; +Cc: dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb

> Date: Tue, 27 Feb 2007 08:14:42 -0500
> From: Daniel Jacobowitz <drow@false.org>
> Cc: Nick Roberts <nickrob@snap.net.nz>, 	Jan Kratochvil <jan.kratochvil@redhat.com>, 	mathieu lacage <Mathieu.Lacage@sophia.inria.fr>, gdb@sourceware.org
> 
> I spent a little while poking at Google CodeSearch.  There were
> definitely some matches of people assigning strings to "unsigned char
> *" variables - most of the ones I looked at were in test code for
> crypto libraries, or things like base64 / locale ctype tables.  There
> were an order of magnitude (about 75x) more matches for plain "char
> *".

Doesn't a similar situation exist with "unsigned int" and "int", or
with "unsigned long" and "long"?  And yet we don't treat them
differently.

IOW, I think it's quite expected that explicit signedness is
relatively rare, since in the vast majority of cases it is simply not
needed.  Interpreting this phenomenon as saying something about what
kind of data is stored is not necessarily a good idea.

> I know that as a GDB developer, debugging GDB, I'd want explicitly
> signed or unsigned characters to be printed as data

That is indeed one reason to use unsigned char.  But there is another,
as demonstrated by Emacs's Lisp_String type: to store non-ASCII
characters whose upper bit might be set.  And in those latter cases,
we do want the data displayed as text, not as numeric codes.

> > second, any incompatible change seems basically problematic.
> 
> I have some trouble understanding this.  Could someone explain it to
> me?
> 
> It's an honest and serious question, I'm not asking for a lecture on
> compatibility concepts here.

If you are not asking about general principles, then I really don't
understand what kind of explanations you would like to hear.

> This is user interface, not core
> functionality.  It's more like clarifying the text of one of GCC's
> warning messages than changing the dialect of C it accepts.  I think
> we have a lot of freedom to adapt our default output to be more useful
> to our users, especially when we provide a way to get the old
> behavior.

The issue is precisely that it is controversial whether the proposed
output is necessarily more useful to the user.  It is clearly more
useful in some cases, but not in the others.

> In this case that method is even completely backwards compatible.

??? Now _I_ would like to ask for explanations.  Do you mean the cast
to "char *" trick? if so, that's not backward compatibility, because
existing scripts in .gdbinit files need to be modified to get back
past behavior.


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 21:53                 ` Eli Zaretskii
@ 2007-02-27 22:12                   ` Daniel Jacobowitz
  2007-02-27 22:14                     ` Mark Kettenis
  0 siblings, 1 reply; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-27 22:12 UTC (permalink / raw)
  To: Eli Zaretskii; +Cc: dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb

On Tue, Feb 27, 2007 at 11:06:17PM +0200, Eli Zaretskii wrote:
> Doesn't a similar situation exist with "unsigned int" and "int", or
> with "unsigned long" and "long"?  And yet we don't treat them
> differently.
> 
> IOW, I think it's quite expected that explicit signedness is
> relatively rare, since in the vast majority of cases it is simply not
> needed.  Interpreting this phenomenon as saying something about what
> kind of data is stored is not necessarily a good idea.

I feel that this is different for two reasons.  One is that the
situation for int and long is not the same, because "int" and "signed
int" are the same type in C - but "char" and "signed char" are not.
Char is explicitly of indeterminate sign.  The other is that there is
a widespread use of "char" for string data and "signed char" or
"unsigned char" for non-string data.
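
The type-system point can be checked with C11 `_Generic`; this is only a
sketch, and the macro and function names are hypothetical.  Listing both `int`
and `signed int` in one `_Generic` would be rejected by the compiler as two
compatible types, whereas `char`, `signed char` and `unsigned char` can all
appear as separate associations:

```c
/* Three separate associations are allowed because the three character
   types are distinct, whatever the signedness of plain char. */
#define CHAR_KIND(x) _Generic((x),   \
    char: "char",                    \
    signed char: "signed char",      \
    unsigned char: "unsigned char",  \
    default: "other")

/* "signed int" cannot be listed next to "int": it is the same type. */
#define INT_KIND(x) _Generic((x),    \
    int: "int",                      \
    unsigned int: "unsigned int",    \
    default: "other")

const char *kind_of_plain_char(void)
{
    char c = 0;
    return CHAR_KIND(c);
}

const char *kind_of_signed_char(void)
{
    signed char c = 0;
    return CHAR_KIND(c);
}

const char *kind_of_signed_int(void)
{
    signed int i = 0;
    return INT_KIND(i);
}
```

Here a `signed int` variable selects the `int` branch, while plain `char` and
`signed char` select different branches.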

Of course, the first reason is a matter of standards and the second is
only a matter of my feeling and fumbling around with search engines.

> > I know that as a GDB developer, debugging GDB, I'd want explicitly
> > signed or unsigned characters to be printed as data
> 
> That is indeed one reason to use unsigned char.  But there is another,
> as demonstrated by Emacs's Lisp_String type: to store non-ASCII
> characters whose upper bit might be set.  And in those latter cases,
> we do want the data displayed as text, not as numeric codes.

Yes, there are good counterexamples.  Though I believe emacs also
stores some numeric non-character data in its strings (isn't there a
length or kind byte?).  Plus, this gets dangerously close to support
for explicitly printing strings of different character sets and
encodings - UTF-8 support is requested once a year or so.

Anyway, that's a project for another week :-)

> > This is user interface, not core
> > functionality.  It's more like clarifying the text of one of GCC's
> > warning messages than changing the dialect of C it accepts.  I think
> > we have a lot of freedom to adapt our default output to be more useful
> > to our users, especially when we provide a way to get the old
> > behavior.
> 
> The issue is precisely that it is controversial whether the proposed
> output is necessarily more useful to the user.  It is clearly more
> useful in some cases, but not in the others.

Yes, this is the part I think is really important.  I've provided what
information I can to support the fact that the new behavior is more
useful:

  - it's right for debugging GDB itself
  - it's right for processor vector registers supported by GDB
  - it seems to be right more often than not, given my abuse of
    CodeSearch.

But this is fuzzy.  I don't know how to find out more.  We have no way
to poll users.  I tried polling my coworkers, and got only agreement
that strings are usually stored in char * and not in unsigned char *.

There are some concrete reasons to do that, too.  GCC in some
configurations warns about passing an unsigned char * to a function
expecting a char * - functions like strlen.  As you know, some
applications have disabled this warning because they disagree or
because they agree but it would be too much labor to clean up; but I
know of plenty of projects that are fine with the new warning.
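
As an illustration of the warning mentioned above: with GCC, passing an
`unsigned char *` directly to `strlen` is reported under `-Wpointer-sign`,
which is implied by `-Wall`.  A sketch of the usual wrapper, under the
hypothetical name `ustrlen`:

```c
#include <stddef.h>
#include <string.h>

/* Length of a NUL-terminated string stored as unsigned bytes.
   Calling strlen(s) directly is what -Wpointer-sign flags; the cast
   makes the pointer conversion explicit. */
size_t ustrlen(const unsigned char *s)
{
    return strlen((const char *) s);
}
```

Projects that keep their strings in `unsigned char *` either add casts like
this one at each call site or disable the warning.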

> > In this case that method is even completely backwards compatible.
> 
> ??? Now _I_ would like to ask for explanations.  Do you mean the cast
> to "char *" trick? if so, that's not backward compatibility, because
> existing scripts in .gdbinit files need to be modified to get back
> past behavior.

Other way round: using (char *) casts would result in a backwards
compatible .gdbinit, because it would work with both old and new
versions of GDB.

Anyway, if we end up leaving the change in I will try to clarify the
manual.  There is almost nothing in it now about printing strings
using "print".

-- 
Daniel Jacobowitz
CodeSourcery


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 22:12                   ` Daniel Jacobowitz
@ 2007-02-27 22:14                     ` Mark Kettenis
  2007-02-28  0:47                       ` Paul Koning
                                         ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: Mark Kettenis @ 2007-02-27 22:14 UTC (permalink / raw)
  To: drow; +Cc: eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb

> Date: Tue, 27 Feb 2007 16:53:16 -0500
> From: Daniel Jacobowitz <drow@false.org>
> 
> On Tue, Feb 27, 2007 at 11:06:17PM +0200, Eli Zaretskii wrote:
> > Doesn't a similar situation exist with "unsigned int" and "int", or
> > with "unsigned long" and "long"?  And yet we don't treat them
> > differently.
> > 
> > IOW, I think it's quite expected that explicit signedness is
> > relatively rare, since in the vast majority of cases it is simply not
> > needed.  Interpreting this phenomenon as saying something about what
> > kind of data is stored is not necessarily a good idea.
> 
> I feel that this is different for two reasons.  One is that the
> situation for int and long is not the same, because "int" and "signed
> int" are the same type in C - but "char" and "signed char" are not.
> Char is explicitly of indeterminate sign.  The other is that there is
> a widespread use of "char" for string data and "signed char" or
> "unsigned char" for non-string data.

Well, "char" really is "signed char" on most machines and "unsigned
char" on others.  One way to prevent the sign-extension problems this
sometimes causes, is to explicitly use "unsigned char *" for strings.
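
A small sketch of the sign-extension problem, assuming 8-bit bytes.
The function names are invented for illustration.  The point is only
that the same byte widens to a different int depending on whether it
is read through plain char or through unsigned char.

```c
#include <limits.h>

/* Widen the first byte of S to int, going through plain char.  The
   result is negative for bytes above 0x7f when char is signed.  */
static int
widen_plain (const char *s)
{
  return s[0];
}

/* Same, but through unsigned char: always in the range 0..UCHAR_MAX.  */
static int
widen_unsigned (const unsigned char *s)
{
  return s[0];
}

/* Non-zero if plain char is signed on this target.  */
static int
plain_char_is_signed (void)
{
  return CHAR_MIN < 0;
}
```

On a target where plain char is signed, the byte 0xe9 comes back from
widen_plain as a negative number, which is the case that bites code
comparing characters against int constants or using them as indexes.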

Anyway, I'm in favour of restoring the traditional gdb behaviour of
printing all three as strings.

Mark



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 22:14                     ` Mark Kettenis
@ 2007-02-28  0:47                       ` Paul Koning
  2007-02-28  1:14                       ` Jim Blandy
  2007-02-28 14:35                       ` Daniel Jacobowitz
  2 siblings, 0 replies; 28+ messages in thread
From: Paul Koning @ 2007-02-28  0:47 UTC (permalink / raw)
  To: mark.kettenis
  Cc: drow, eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb

>>>>> "Mark" == Mark Kettenis <mark.kettenis@xs4all.nl> writes:

 >> Date: Tue, 27 Feb 2007 16:53:16 -0500
 >> From: Daniel Jacobowitz <drow@false.org>
 >> 
 >> On Tue, Feb 27, 2007 at 11:06:17PM +0200, Eli Zaretskii wrote:
 >> > Doesn't a similar situation exist with "unsigned int" and "int", or
 >> > with "unsigned long" and "long"?  And yet we don't treat them
 >> > differently.
 >> > 
 >> > IOW, I think it's quite expected that explicit signedness is
 >> > relatively rare, since in the vast majority of cases it is simply not
 >> > needed.  Interpreting this phenomenon as saying something about what
 >> > kind of data is stored is not necessarily a good idea.
 >> 
 >> I feel that this is different for two reasons.  One is that the
 >> situation for int and long is not the same, because "int" and
 >> "signed int" are the same type in C - but "char" and "signed char"
 >> are not.  Char is explicitly of indeterminate sign.  The other is
 >> that there is a widespread use of "char" for string data and
 >> "signed char" or "unsigned char" for non-string data.

 Mark> Well, "char" really is "signed char" on most machines and
 Mark> "unsigned char" on others.  One way to prevent the
 Mark> sign-extension problems this sometimes causes, is to explicitly
 Mark> use "unsigned char *" for strings.

Right.  For example, if you do string manipulation by table lookup on
the characters, then this is a good thing to do.  (It avoids the
painful 384 entry table hacks you otherwise need.)
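
As a sketch of what I mean (names invented for this example): with
unsigned char every index lands in 0..UCHAR_MAX, so an ordinary
256-entry table is enough.  As I understand the usual alternative, a
table meant to be indexed by a possibly signed char has to cover
-128..255, which is where the 384 entries come from.

```c
#include <limits.h>

/* One flag per possible byte value; filled in by init_vowel_table.  */
static unsigned char is_vowel[UCHAR_MAX + 1];

static void
init_vowel_table (void)
{
  const unsigned char *p;

  for (p = (const unsigned char *) "aeiouAEIOU"; *p != '\0'; p++)
    is_vowel[*p] = 1;
}

/* Count vowels in S.  Indexing through unsigned char keeps every
   index inside the table, even for bytes with the high bit set.  */
static int
count_vowels (const char *s)
{
  const unsigned char *p;
  int n = 0;

  for (p = (const unsigned char *) s; *p != '\0'; p++)
    n += is_vowel[*p];
  return n;
}
```

After a call to init_vowel_table, count_vowels can be handed any byte
string, including one containing bytes above 0x7f, without reading
outside the table.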

 Mark> Anyway, I'm in favour of restore the traditional gdb behaviour
 Mark> of printing all three as strings.

Same here.

     paul



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 22:14                     ` Mark Kettenis
  2007-02-28  0:47                       ` Paul Koning
@ 2007-02-28  1:14                       ` Jim Blandy
  2007-02-28  1:59                         ` Jim Blandy
  2007-02-28 14:35                       ` Daniel Jacobowitz
  2 siblings, 1 reply; 28+ messages in thread
From: Jim Blandy @ 2007-02-28  1:14 UTC (permalink / raw)
  To: Mark Kettenis
  Cc: drow, eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb


Okay, here's a horrible idea.  :)  With this patch:

$ cat chars.c
#include <stdio.h>
#include <stdint.h>

typedef char byte_t;

char *c = "chars";
unsigned char *uc = "unsigned chars";
signed char *sc = "signed chars";
byte_t *b = "bytes";
int8_t *i8 = "int8_t's";
uint8_t *ui8 = "uint8_t's";

int
main (int argc, char **argv)
{
  puts ("Hi!");
}
$ gcc -g chars.c -o chars
$ ~/uberbaum/build-cvs-out/gdb/gdb chars
GNU gdb 6.6.50.20070227-cvs
...
(gdb) print c
$1 = 0x8048450 "chars"
(gdb) print uc
$2 = (unsigned char *) 0x8048456 "unsigned chars"
(gdb) print sc
$3 = (signed char *) 0x8048465 "signed chars"
(gdb) print b
$4 = (byte_t *) 0x8048472
(gdb) print i8
$5 = (int8_t *) 0x8048478
(gdb) print ui8
$6 = (uint8_t *) 0x8048481
(gdb) start
Breakpoint 1 at 0x8048365: file chars.c, line 16.
Starting program: /home/jimb/play/chars 
main () at chars.c:16
16        puts ("Hi!");
(gdb) print $xmm0
$7 = {v4_float = {0, 0, 0, 0}, v2_double = {0, 0}, v16_int8 = {0 <repeats 16 times>}, v8_int16 = {
    0, 0, 0, 0, 0, 0, 0, 0}, v4_int32 = {0, 0, 0, 0}, v2_int64 = {0, 0}, 
  uint128 = 0x00000000000000000000000000000000}
(gdb) 

Because of the way C works, heuristics about what's textual and what's
numeric are inevitable here.  Whether or not you like the one I'm
suggesting, we should definitely consolidate the heuristic in one
place, so it's consistent.

If people like this, there are probably test suite changes needed.

gdb/ChangeLog:
2007-02-27  Jim Blandy  <jimb@codesourcery.com>

	* c-valprint.c (textual_element_type): New function.
	(c_val_print): Use textual_element_type to decide whether to print
	arrays, pointer types, and integer types as strings and characters.
	(c_value_print): Doc fix.

Index: gdb/c-valprint.c
===================================================================
RCS file: /cvs/src/src/gdb/c-valprint.c,v
retrieving revision 1.42
diff -u -r1.42 c-valprint.c
--- gdb/c-valprint.c	26 Jan 2007 20:54:16 -0000	1.42
+++ gdb/c-valprint.c	28 Feb 2007 00:39:07 -0000
@@ -56,6 +56,56 @@
 }
 
 
+/* Return non-zero if an array of TYPE or a pointer to TYPE should be
+   printed as a textual string, or zero if it should be treated as an
+   array of /pointer to integers.  */
+static int
+textual_element_type (struct type *type)
+{
+  /* GDB doesn't use TYPE_CODE_CHAR for the C 'char' types; instead,
+     it uses one-byte TYPE_CODE_INT types, with TYPE_NAMEs like
+     "char", "unsigned char", etc. and appropriate flags.  For various
+     reasons, this works out well in some places.
+
+     But this means that we have no clear distinction between types
+     representing text and types representing one-byte integers, used
+     numerically.  It's not too uncommon for programs to use 'unsigned
+     char' and 'signed char' for text.
+
+     So, our heuristic is that, if a one-byte TYPE_CODE_INT has a
+     TYPE_NAME of "char" or something ending with " char", then we
+     treat it as text; otherwise, we assume it's being used as data.
+     This makes all our SIMD types like builtin_type_v8_int8 and the
+     <stdint.h> types like uint8_t print numerically, but all 'char'
+     types print textually.  Code which says what it means does
+     well.  */
+  struct type *true_type = check_typedef (type);
+
+  if (TYPE_CODE (true_type) == TYPE_CODE_CHAR)
+    return 1;
+
+  /* Is this a one-byte integer type?  */
+  if (TYPE_CODE (true_type) == TYPE_CODE_INT
+      && TYPE_LENGTH (true_type) == 1)
+    {
+      int name_len;
+      
+      /* All integer types should have names.  */
+      gdb_assert (TYPE_NAME (type));
+
+      name_len = strlen (TYPE_NAME (type));
+
+      /* Is the name "char", or does it end with " char"?  */
+      if (strcmp (TYPE_NAME (type), "char") == 0
+          || (name_len > 5
+              && strcmp (TYPE_NAME (type) + name_len - 5, " char") == 0))
+        return 1;
+    }
+
+  return 0;
+}
+
+
 /* Print data of type TYPE located at VALADDR (within GDB), which came from
    the inferior at address ADDRESS, onto stdio stream STREAM according to
    FORMAT (a letter or 0 for natural format).  The data at VALADDR is in
@@ -94,11 +144,9 @@
 	    {
 	      print_spaces_filtered (2 + 2 * recurse, stream);
 	    }
-	  /* For an array of chars, print with string syntax.  */
-	  if (eltlen == 1 &&
-	      ((TYPE_CODE (elttype) == TYPE_CODE_INT && TYPE_NOSIGN (elttype))
-	       || ((current_language->la_language == language_m2)
-		   && (TYPE_CODE (elttype) == TYPE_CODE_CHAR)))
+
+	  /* Print arrays of textual chars with a string syntax.  */
+          if (textual_element_type (TYPE_TARGET_TYPE (type))
 	      && (format == 0 || format == 's'))
 	    {
 	      /* If requested, look for the first null char and only print
@@ -191,12 +239,11 @@
 	      deprecated_print_address_numeric (addr, 1, stream);
 	    }
 
-	  /* For a pointer to char or unsigned char, also print the string
+	  /* For a pointer to a textual type, also print the string
 	     pointed to, unless pointer is null.  */
 	  /* FIXME: need to handle wchar_t here... */
 
-	  if (TYPE_LENGTH (elttype) == 1
-	      && TYPE_CODE (elttype) == TYPE_CODE_INT
+	  if (textual_element_type (TYPE_TARGET_TYPE (type))
 	      && (format == 0 || format == 's')
 	      && addr != 0)
 	    {
@@ -398,7 +445,7 @@
 	     Since we don't know whether the value is really intended to
 	     be used as an integer or a character, print the character
 	     equivalent as well. */
-	  if (TYPE_LENGTH (type) == 1)
+	  if (textual_element_type (type))
 	    {
 	      fputs_filtered (" ", stream);
 	      LA_PRINT_CHAR ((unsigned char) unpack_long (type, valaddr + embedded_offset),
@@ -500,7 +547,9 @@
       || TYPE_CODE (type) == TYPE_CODE_REF)
     {
       /* Hack:  remove (char *) for char strings.  Their
-         type is indicated by the quoted string anyway. */
+         type is indicated by the quoted string anyway.
+         (Don't use textual_element_type here; quoted strings
+         are always exactly (char *).)  */
       if (TYPE_CODE (type) == TYPE_CODE_PTR
 	  && TYPE_NAME (type) == NULL
 	  && TYPE_NAME (TYPE_TARGET_TYPE (type)) != NULL



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-28  1:14                       ` Jim Blandy
@ 2007-02-28  1:59                         ` Jim Blandy
  2007-02-28  5:26                           ` Nick Roberts
  0 siblings, 1 reply; 28+ messages in thread
From: Jim Blandy @ 2007-02-28  1:59 UTC (permalink / raw)
  To: Mark Kettenis
  Cc: drow, eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb


Jim Blandy <jimb@codesourcery.com> writes:
> Because of the way C works, heuristics about what's textual and what's
> numeric are inevitable here.  Whether or not you like the one I'm
> suggesting, we should definitely consolidate the heuristic in one
> place, so it's consistent.

And what I really want to know is, when I read this thread in Gmail,
why does Google offer me an ad for "Lingerie in Leather"?



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-28  1:59                         ` Jim Blandy
@ 2007-02-28  5:26                           ` Nick Roberts
  0 siblings, 0 replies; 28+ messages in thread
From: Nick Roberts @ 2007-02-28  5:26 UTC (permalink / raw)
  To: Jim Blandy; +Cc: gdb

 > > Because of the way C works, heuristics about what's textual and what's
 > > numeric are inevitable here.  Whether or not you like the one I'm
 > > suggesting, we should definitely consolidate the heuristic in one
 > > place, so it's consistent.
 > 
 > And what I really want to know is, when I read this thread in Gmail,
 > why does Google offer me an ad for "Lingerie in Leather"?

Because of its high textual content?

-- 
Nick                                           http://www.inet.net.nz/~nickrob



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 22:14                     ` Mark Kettenis
  2007-02-28  0:47                       ` Paul Koning
  2007-02-28  1:14                       ` Jim Blandy
@ 2007-02-28 14:35                       ` Daniel Jacobowitz
  2007-03-01  0:43                         ` Jim Blandy
  2007-03-01  0:54                         ` Nick Roberts
  2 siblings, 2 replies; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-28 14:35 UTC (permalink / raw)
  To: Mark Kettenis; +Cc: eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb

On Tue, Feb 27, 2007 at 11:11:32PM +0100, Mark Kettenis wrote:
> Anyway, I'm in favour of restore the traditional gdb behaviour of
> printing all three as strings.

While I haven't seen responses to some of my arguments in favor of the
new behavior, it's obvious that I'm in the minority here.  So, the
behavior ought to change.

I was concerned about Jim's patch, but thinking about it some more, I
would be happy with it (probably with Paul's suggestion).  That's
because "typedef char *name" is reasonably common, but "typedef char
name_unit;" is much less so.  What do those who prefer the old
behavior think of that?

If it's no better, then we're back to what Jan suggested: something
which only works for the GDB int8_t and uint8_t types, to fix the
display of vector registers.

By the way, I was thinking about this last night and wondered if
this is hinting at a sensible meaning for "print /s" ...

-- 
Daniel Jacobowitz
CodeSourcery


* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-28 14:35                       ` Daniel Jacobowitz
@ 2007-03-01  0:43                         ` Jim Blandy
  2007-03-01  0:54                         ` Nick Roberts
  1 sibling, 0 replies; 28+ messages in thread
From: Jim Blandy @ 2007-03-01  0:43 UTC (permalink / raw)
  To: Mark Kettenis; +Cc: eliz, dewar, nickrob, jan.kratochvil, Mathieu.Lacage, gdb


Daniel Jacobowitz <drow@false.org> writes:
> On Tue, Feb 27, 2007 at 11:11:32PM +0100, Mark Kettenis wrote:
>> Anyway, I'm in favour of restore the traditional gdb behaviour of
>> printing all three as strings.
>
> While I haven't seen responses to some of my arguments in favor of the
> new behavior, it's obvious that I'm in the minority here.  So, the
> behavior ought to change.
>
> I was concerned about Jim's patch, but thinking about it some more, I
> would be happy with it (probably with Paul's suggestion).  That's
> because "typedef char *name" is reasonably common, but "typedef char
> name_unit;" is much less so.  What do those who prefer the old
> behavior think of that?

For whatever it's worth, here's a revised patch that implements Paul's
suggestions and has better comments (more accurately apologetic):

gdb/ChangeLog:
2007-02-27  Jim Blandy  <jimb@codesourcery.com>

	* c-valprint.c (textual_element_type): New function.
	(c_val_print): Use textual_element_type to decide whether to print
	arrays, pointer types, and integer types as strings and characters.
	(c_value_print): Doc fix.

Index: gdb/c-valprint.c
===================================================================
RCS file: /cvs/src/src/gdb/c-valprint.c,v
retrieving revision 1.42
diff -u -r1.42 c-valprint.c
--- gdb/c-valprint.c	26 Jan 2007 20:54:16 -0000	1.42
+++ gdb/c-valprint.c	28 Feb 2007 23:09:29 -0000
@@ -56,6 +56,66 @@
 }
 
 
+/* Apply a heuristic to decide whether an array of TYPE or a pointer
+   to TYPE should be printed as a textual string.  Return non-zero if
+   it should, or zero if it should be treated as an array of /pointer
+   to integers.
+
+   It's a shame that we need to use a heuristic here, but C hasn't
+   historically provided distinct types for bytes and characters; you
+   get 'char', which is guaranteed to occupy a byte.  You can put
+   qualifiers on it, but people use the qualified char types for text
+   too, so the qualifier's presence doesn't really tell you anything.
+
+   What's especially a shame here is that the heuristic is fixed.  The
+   user knows which of their types are textual and which are numeric,
+   but we don't give them a way to tell us, beyond using '/s' every time
+   they print something.  */
+static int
+textual_element_type (struct type *type)
+{
+  /* GDB doesn't use TYPE_CODE_CHAR for the C 'char' types; instead,
+     it uses one-byte TYPE_CODE_INT types, with TYPE_NAMEs like
+     "char", "unsigned char", etc. and appropriate flags.  For various
+     reasons, this works out well in some places.  */
+
+
+  struct type *true_type = check_typedef (type);
+
+  /* TYPE_CODE_CHAR is always textual.  But I don't think it ever
+     occurs in C code.  */
+  if (TYPE_CODE (true_type) == TYPE_CODE_CHAR)
+    return 1;
+
+  /* If a one-byte TYPE_CODE_INT has a TYPE_NAME of "char" or
+     something ending with " char", then we treat it as text;
+     otherwise, we assume it's being used as data.  This makes all our
+     SIMD types like builtin_type_v8_int8 and the <stdint.h> types
+     like uint8_t print numerically, but all 'char' types print
+     textually.  Code which says what it means does well.
+     We also recognize types ending in 'char_t'.  */
+  if (TYPE_CODE (true_type) == TYPE_CODE_INT
+      && TYPE_LENGTH (true_type) == 1)
+    {
+      int name_len;
+      
+      /* All integer types should have names.  */
+      gdb_assert (TYPE_NAME (type));
+
+      name_len = strlen (TYPE_NAME (type));
+
+      if (strcmp (TYPE_NAME (type), "char") == 0
+          || (name_len > 5
+              && strcmp (TYPE_NAME (type) + name_len - 5, " char") == 0)
+          || (name_len >= 6
+              && strcmp (TYPE_NAME (type) + name_len - 6, "char_t") == 0))
+        return 1;
+    }
+
+  return 0;
+}
+
+
 /* Print data of type TYPE located at VALADDR (within GDB), which came from
    the inferior at address ADDRESS, onto stdio stream STREAM according to
    FORMAT (a letter or 0 for natural format).  The data at VALADDR is in
@@ -94,11 +154,9 @@
 	    {
 	      print_spaces_filtered (2 + 2 * recurse, stream);
 	    }
-	  /* For an array of chars, print with string syntax.  */
-	  if (eltlen == 1 &&
-	      ((TYPE_CODE (elttype) == TYPE_CODE_INT && TYPE_NOSIGN (elttype))
-	       || ((current_language->la_language == language_m2)
-		   && (TYPE_CODE (elttype) == TYPE_CODE_CHAR)))
+
+	  /* Print arrays of textual chars with a string syntax.  */
+          if (textual_element_type (TYPE_TARGET_TYPE (type))
 	      && (format == 0 || format == 's'))
 	    {
 	      /* If requested, look for the first null char and only print
@@ -191,12 +249,11 @@
 	      deprecated_print_address_numeric (addr, 1, stream);
 	    }
 
-	  /* For a pointer to char or unsigned char, also print the string
+	  /* For a pointer to a textual type, also print the string
 	     pointed to, unless pointer is null.  */
 	  /* FIXME: need to handle wchar_t here... */
 
-	  if (TYPE_LENGTH (elttype) == 1
-	      && TYPE_CODE (elttype) == TYPE_CODE_INT
+	  if (textual_element_type (TYPE_TARGET_TYPE (type))
 	      && (format == 0 || format == 's')
 	      && addr != 0)
 	    {
@@ -398,7 +455,7 @@
 	     Since we don't know whether the value is really intended to
 	     be used as an integer or a character, print the character
 	     equivalent as well. */
-	  if (TYPE_LENGTH (type) == 1)
+	  if (textual_element_type (type))
 	    {
 	      fputs_filtered (" ", stream);
 	      LA_PRINT_CHAR ((unsigned char) unpack_long (type, valaddr + embedded_offset),
@@ -500,7 +557,9 @@
       || TYPE_CODE (type) == TYPE_CODE_REF)
     {
       /* Hack:  remove (char *) for char strings.  Their
-         type is indicated by the quoted string anyway. */
+         type is indicated by the quoted string anyway.
+         (Don't use textual_element_type here; quoted strings
+         are always exactly (char *).)  */
       if (TYPE_CODE (type) == TYPE_CODE_PTR
 	  && TYPE_NAME (type) == NULL
 	  && TYPE_NAME (TYPE_TARGET_TYPE (type)) != NULL
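
For anyone who wants to play with the name test outside of GDB, here
is a minimal standalone sketch of it.  It is not part of the patch:
ends_with and name_looks_textual are hypothetical names, and the
caller is assumed to have checked already that the type is a one-byte
integer.

```c
#include <string.h>

/* Return non-zero if NAME ends with SUFFIX.  */
static int
ends_with (const char *name, const char *suffix)
{
  size_t name_len = strlen (name);
  size_t suffix_len = strlen (suffix);

  return (name_len >= suffix_len
          && strcmp (name + name_len - suffix_len, suffix) == 0);
}

/* Hypothetical stand-in for the name test in textual_element_type:
   "char" itself, anything ending in " char", or anything ending in
   "char_t".  */
static int
name_looks_textual (const char *name)
{
  return (strcmp (name, "char") == 0
          || ends_with (name, " char")
          || ends_with (name, "char_t"));
}
```

With this, "char", "unsigned char" and a made-up typedef such as
"xmlchar_t" count as text, while "uint8_t", "int8_t" and "byte_t" do
not.  Taking the offset from the suffix itself means the length and
the literal cannot get out of step.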



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-28 14:35                       ` Daniel Jacobowitz
  2007-03-01  0:43                         ` Jim Blandy
@ 2007-03-01  0:54                         ` Nick Roberts
  1 sibling, 0 replies; 28+ messages in thread
From: Nick Roberts @ 2007-03-01  0:54 UTC (permalink / raw)
  To: Daniel Jacobowitz
  Cc: Mark Kettenis, eliz, dewar, jan.kratochvil, Mathieu.Lacage, gdb

 > While I haven't seen responses to some of my arguments in favor of the
 > new behavior, it's obvious that I'm in the minority here.

Yes, but I think choices should be based on reason, not gut feeling.  The
change didn't receive much attention for a month, so it can't be that great a
problem.

 >...
 > By the way, I was thinking about this last night and wondered if
 > this is hinting at a sensible meaning for "print /s" ...

I like this approach because it's simple.  Jim's is too complicated (well,
for me, at least!).

-- 
Nick                                           http://www.inet.net.nz/~nickrob



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 17:06             ` Robert Dewar
  2007-02-27 18:42               ` Daniel Jacobowitz
@ 2007-02-27 21:47               ` Eli Zaretskii
  1 sibling, 0 replies; 28+ messages in thread
From: Eli Zaretskii @ 2007-02-27 21:47 UTC (permalink / raw)
  To: Robert Dewar; +Cc: nickrob, drow, jan.kratochvil, Mathieu.Lacage, gdb

> Date: Tue, 27 Feb 2007 07:51:53 -0500
> From: Robert Dewar <dewar@adacore.com>
> CC: Daniel Jacobowitz <drow@false.org>,   Jan Kratochvil <jan.kratochvil@redhat.com>,  mathieu lacage <Mathieu.Lacage@sophia.inria.fr>,  gdb@sourceware.org
> 
> I prefer to avoid this incompatible change, it seems like it would
> be a surprise to a substantial number of users.

In case it was unclear, I'm of the same opinion.



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-27 12:02           ` Nick Roberts
  2007-02-27 17:06             ` Robert Dewar
@ 2007-02-27 22:12             ` Daniel Jacobowitz
  1 sibling, 0 replies; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-02-27 22:12 UTC (permalink / raw)
  To: Nick Roberts; +Cc: Jan Kratochvil, mathieu lacage, gdb

On Wed, Feb 28, 2007 at 12:02:06AM +1300, Nick Roberts wrote:
> The following could also be updated in the manual:
> 
>          (gdb) print $xmm1
>          $1 = {
>            v4_float = {0, 3.43859137e-038, 1.54142831e-044, 1.821688e-044},
>            v2_double = {9.92129282474342e-303, 2.7585945287983262e-313},
>            v16_int8 = "\000\000\000\000\3706;\001\v\000\000\000\r\000\000",
>            v8_int16 = {0, 0, 14072, 315, 11, 0, 13, 0},
>            v4_int32 = {0, 20657912, 11, 13},
>            v2_int64 = {88725056443645952, 55834574859},
>            uint128 = 0x0000000d0000000b013b36f800000000
>          }

Ooh, thanks.  Added to my list to update if the patch stays in.

-- 
Daniel Jacobowitz
CodeSourcery



* Re: [RFC] Signed/unsigned character arrays are not strings
  2007-02-26  0:45       ` Jan Kratochvil
  2007-02-27  7:17         ` Eli Zaretskii
  2007-02-27  9:29         ` Daniel Jacobowitz
@ 2007-04-10 21:59         ` Daniel Jacobowitz
  2 siblings, 0 replies; 28+ messages in thread
From: Daniel Jacobowitz @ 2007-04-10 21:59 UTC (permalink / raw)
  To: Jan Kratochvil; +Cc: mathieu lacage, Nick Roberts, gdb

On Sun, Feb 25, 2007 at 08:53:50PM +0100, Jan Kratochvil wrote:
> On Sun, 25 Feb 2007 08:59:41 +0100, mathieu lacage wrote:
> ...
> > I don't know how useful that is to you but a lot of people (the first
> > which comes to my mind is libxml2) decided to use "unsigned char *" to
> > identify utf-8 encoded strings in C.
> 
> Together with the attached RMS's response I became more inclined to revert this
> change and provide only "$xmm"-specific fix instead (probably for the GDB
> int8_t/uint8_t internal types).

There was a lot of discussion about how to treat signed char, unsigned
char, signed char *, et cetera.  There weren't a lot of conclusions,
but several people did not like the new behavior, and then discussion
trailed off.

I don't want to just revert the patch, because the problem that Jan
was fixing (unhelpful display of $xmm registers) is really quite
annoying.  I see these options:

1.  Make vector types special.  Treat arrays of single byte integers
as characters, like before, unless they occur in a vector type.  This
is reasonable, but tricky to implement.

2.  Make two special single byte integer types, with a GDB internal
"not a char" flag set.  Use them for our builtin int8_t and uint8_t.
Use these to build types for vector registers.  Print all other single
byte types from user code as chars or strings.  This is similar to
#1, a little less helpful, but fairly easy.

3.  Treat "char" as a character, but "unsigned char" and "signed char"
as numbers (Jan's patch started down this road and Jim's went a bit
further).  Treat pointers/arrays of char as strings and
pointers/arrays of unsigned or signed char as numbers.  Add a "/s"
flag to the print command that treats single byte types as
characters or strings.

For example:
  char str[] = "hi";
  unsigned char version[] = "6.5";

(gdb) p version
$1 = { 54, 46, 53 }
(gdb) p/s version
$2 = "6.5"
(gdb) p str
$3 = "hi"

4. Like #3, except that instead of adding a /s modifier, add a "set"
knob.  Of course in this case we get to argue about the default value.

I think it's important that we resolve this open issue before we
release a new version of GDB, so please post which you prefer.  I like
#3 best, followed by #2; #4 is a good compromise but I worry that we
are proliferating knobs that no one ever changes.  I'm interested in
any other suggestions, though I think we've ruled out guessing based
on the type name.
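
To show roughly what I mean by #2, here is a toy model.  Everything in
it is hypothetical: struct toy_type, the not_a_char field and the
toy_* names are made up for this message and are not GDB's real type
representation.

```c
#include <stddef.h>

/* Toy type descriptor; GDB's real struct type is far richer.  */
struct toy_type
{
  const char *name;
  size_t length;		/* Size in bytes.  */
  int is_integer;		/* Non-zero for integer types.  */
  int not_a_char;		/* Hypothetical flag from option 2.  */
};

/* GDB's own builtin byte types would carry the flag...  */
static const struct toy_type toy_builtin_int8 = { "int8_t", 1, 1, 1 };
static const struct toy_type toy_builtin_uint8 = { "uint8_t", 1, 1, 1 };

/* ...while types read from the user's program would not.  */
static const struct toy_type toy_user_uchar = { "unsigned char", 1, 1, 0 };
static const struct toy_type toy_user_int = { "int", 4, 1, 0 };

/* Non-zero if arrays of, or pointers to, T should print as strings.  */
static int
toy_prints_as_text (const struct toy_type *t)
{
  return t->is_integer && t->length == 1 && !t->not_a_char;
}
```

In this model the vector register types built from the two builtin
types print numerically, and every one-byte integer type that comes
from user code keeps printing as characters or strings, with no
guessing based on the name.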

-- 
Daniel Jacobowitz
CodeSourcery


Thread overview: 28+ messages
2007-02-28 13:05 [RFC] Signed/unsigned character arrays are not strings pkoning
2007-03-01 11:01 ` Mark Kettenis
  -- strict thread matches above, loose matches on Subject: below --
2007-02-24 16:13 Nick Roberts
2007-02-24 20:11 ` Daniel Jacobowitz
2007-02-24 20:53   ` Nick Roberts
2007-02-24 21:07     ` Jan Kratochvil
2007-02-25  8:00       ` Daniel Jacobowitz
2007-02-25 19:54         ` Nick Roberts
2007-02-25 21:07     ` mathieu lacage
2007-02-26  0:45       ` Jan Kratochvil
2007-02-27  7:17         ` Eli Zaretskii
2007-02-27  9:29         ` Daniel Jacobowitz
2007-02-27 12:02           ` Nick Roberts
2007-02-27 17:06             ` Robert Dewar
2007-02-27 18:42               ` Daniel Jacobowitz
2007-02-27 21:53                 ` Eli Zaretskii
2007-02-27 22:12                   ` Daniel Jacobowitz
2007-02-27 22:14                     ` Mark Kettenis
2007-02-28  0:47                       ` Paul Koning
2007-02-28  1:14                       ` Jim Blandy
2007-02-28  1:59                         ` Jim Blandy
2007-02-28  5:26                           ` Nick Roberts
2007-02-28 14:35                       ` Daniel Jacobowitz
2007-03-01  0:43                         ` Jim Blandy
2007-03-01  0:54                         ` Nick Roberts
2007-02-27 21:47               ` Eli Zaretskii
2007-02-27 22:12             ` Daniel Jacobowitz
2007-04-10 21:59         ` Daniel Jacobowitz
