[RFC] string handling in python

Mirror of the gdb mailing list

* [RFC] string handling in python
@ 2008-07-07  5:25 Thiago Jung Bauermann
  2008-07-07 23:31 ` Tom Tromey
  0 siblings, 1 reply; 4+ messages in thread
From: Thiago Jung Bauermann @ 2008-07-07  5:25 UTC
  To: gdb ml

Hi folks,

I've been thinking about how we should handle strings and charsets in
Python, and I have some ideas. I'd like to ask your opinion about them.

First, some explanation about strings in Python, and how it deals with
different character sets (warning, I just learned about this stuff today
and I may be wrong about it...):

By default it is quite simple: Python doesn't deal with the problem. The
regular string type is just a byte array with some convenience methods
that treat it as a string of 8-bit characters. If Python ever needs to
assume a charset, it will assume ASCII (it is possible to change the
default charset in Python, but that is highly discouraged). Because of
this, you can easily run into trouble if you use non-ASCII characters
(even Latin-1) in regular Python strings.
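
A minimal sketch of the byte-array point, written for Python 3, where
bytes behaves like the regular string described here (that mapping, the
sample word and the helper name decodes_as_ascii are assumptions of this
sketch, not anything taken from GDB):

```python
# Python 3 sketch: bytes plays the role of the Python 2 regular string.
raw = "acentuação".encode("utf-8")  # the bytes a UTF-8 terminal would send

# A byte array counts bytes, not characters: ç and ã take two bytes each.
byte_count = len(raw)
char_count = len(raw.decode("utf-8"))


def decodes_as_ascii(data):
    """Return True if data contains only ASCII bytes."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


print(byte_count, char_count)
print(decodes_as_ascii(b"hello"), decodes_as_ascii(raw))
```

The second call returns False because assuming ASCII cannot account for
any byte above 127.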

There's another string type, the Unicode string. You get one by
prefixing a string literal with u, as in u"hello, world!". I believe
the internal representation is UTF-32 or UCS-4, but I'm not sure and in
fact it doesn't matter. Python can convert back and forth between
Unicode and several charsets, and from what I read you can mix Unicode
strings with regular strings and things will work (as long as the
regular strings are ASCII-only or you explicitly convert them to Unicode
using string.decode("some_charset")).
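
As a rough illustration of explicit decoding, again in Python 3 terms
(bytes for the regular string, str for the Unicode string; the byte
values and variable names are assumptions chosen for this sketch):

```python
# "acentuação" as it would be stored in two different charsets.
latin1_bytes = b"acentua\xe7\xe3o"        # Latin-1: one byte per character
utf8_bytes = b"acentua\xc3\xa7\xc3\xa3o"  # UTF-8: two bytes for ç and ã

# Explicit decoding names the charset the bytes are in.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")

# Once decoded, both are the same Unicode string and can be mixed freely.
same_text = from_latin1 == from_utf8
greeting = "palavra: " + from_latin1

# Naming the wrong charset does not always fail: Latin-1 accepts any byte,
# so the result is silently garbled instead.
garbled = utf8_bytes.decode("latin-1")

print(same_text, greeting, len(garbled))
```

The last case is why the charset has to be known rather than guessed:
no exception is raised, but the text comes out two characters longer
than it should be.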

There's some more info about this in
http://effbot.org/zone/unicode-objects.htm

So, in my opinion for GDB's Python bindings we should always use Unicode
strings, and convert to/from desired encodings as necessary. Strings
provided by the user would be assumed to have host_charset () encoding,
and strings coming from/going to the inferior would be assumed to have
target_charset () encoding.

So for example, to create a value object of char * type using a string
provided by the user or coming from Python code, GDB would first convert
the Python string object (assumed to be in the host charset) to a
unicode object (this process is called "decoding", in python parlance),
and then convert it from unicode to a string in the target charset. This
is what is implemented at the moment in gdbpy_make_value in the git
repo, BTW.
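
In pure Python the two steps would look roughly like the sketch below.
It is not the gdbpy_make_value code: the charset names are hard-coded
assumptions standing in for whatever host_charset () and
target_charset () return, and it uses Python 3, where bytes corresponds
to the regular string.

```python
# Hypothetical setup: UTF-8 host terminal, Latin-1 inferior.
HOST_CHARSET = "utf-8"
TARGET_CHARSET = "iso-8859-1"

# Bytes as typed by the user, in the host charset.
user_input = b"acentua\xc3\xa7\xc3\xa3o"

# Step 1, "decoding": host-charset bytes -> Unicode string.
text = user_input.decode(HOST_CHARSET)

# Step 2, "encoding": Unicode string -> target-charset bytes.
for_inferior = text.encode(TARGET_CHARSET)

print(repr(text), for_inferior)
```

With these two charsets the inferior receives 10 bytes for the 12 bytes
the user typed, which is the kind of difference the intermediate
Unicode step takes care of.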

What do you think?
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center

* Re: [RFC] string handling in python
  2008-07-07  5:25 [RFC] string handling in python Thiago Jung Bauermann
@ 2008-07-07 23:31 ` Tom Tromey
  2008-07-08  5:31   ` Thiago Jung Bauermann
  0 siblings, 1 reply; 4+ messages in thread
From: Tom Tromey @ 2008-07-07 23:31 UTC
  To: Thiago Jung Bauermann; +Cc: gdb ml

>>>>> "Thiago" == Thiago Jung Bauermann <bauerman@br.ibm.com> writes:

Thiago> So, in my opinion for GDB's Python bindings we should always
Thiago> use Unicode strings, and convert to/from desired encodings as
Thiago> necessary. Strings provided by the user would be assumed to
Thiago> have host_charset () encoding, and strings coming from/going
Thiago> to the inferior would be assumed to have target_charset ()
Thiago> encoding.

Sounds reasonable to me.

I thought we already did some of this... search for host_charset in
the python directory.

Thiago> So for example, to create a value object of char * type using
Thiago> a string provided by the user or coming from Python code, GDB
Thiago> would first convert the Python string object (assumed to be in
Thiago> the host charset) to a unicode object (this process is called
Thiago> "decoding", in python parlance), and then convert it from
Thiago> unicode to a string in the target charset.

This sounds like a good candidate for convenience functions, one for
each direction.
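
One possible shape for such a pair, sketched in Python 3 rather than in
GDB's C code. The names target_to_unicode and unicode_to_target and the
explicit charset parameter are hypothetical, invented for this sketch:

```python
def target_to_unicode(target_bytes, target_charset):
    """Inferior -> Python: decode bytes read from the inferior."""
    return target_bytes.decode(target_charset)


def unicode_to_target(text, target_charset):
    """Python -> inferior: encode a Unicode string for the inferior."""
    return text.encode(target_charset)


# Round trip, assuming a Latin-1 target.
original = b"acentua\xe7\xe3o"
text = target_to_unicode(original, "iso-8859-1")
round_trip = unicode_to_target(text, "iso-8859-1")

print(repr(text), round_trip == original)
```

Keeping the charset as a parameter is a choice made for the sketch; the
real helpers could just as well look it up themselves.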

Tom

* Re: [RFC] string handling in python
  2008-07-07 23:31 ` Tom Tromey
@ 2008-07-08  5:31   ` Thiago Jung Bauermann
  2008-07-08  5:35     ` Thiago Jung Bauermann
  0 siblings, 1 reply; 4+ messages in thread
From: Thiago Jung Bauermann @ 2008-07-08  5:31 UTC
  To: tromey; +Cc: gdb ml

On Mon, 2008-07-07 at 17:30 -0600, Tom Tromey wrote:
> >>>>> "Thiago" == Thiago Jung Bauermann <bauerman@br.ibm.com> writes:
> 
> Thiago> So, in my opinion for GDB's Python bindings we should always
> Thiago> use Unicode strings, and convert to/from desired encodings as
> Thiago> necessary. Strings provided by the user would be assumed to
> Thiago> have host_charset () encoding, and strings coming from/going
> Thiago> to the inferior would be assumed to have target_charset ()
> Thiago> encoding.
> 
> Sounds reasonable to me.
> 
> I thought we already did some of this... search for host_charset in
> the python directory.

It doesn't really work. PyString_Decode transforms the string from
host_charset to unicode, and then from unicode to Python's default
charset (almost always ASCII). So if you have any non-ASCII character in
the string, Python won't even be able to print the string on screen. I
just made the test, by making valpy_str use PyString_Decode instead of
PyString_String:

(gdb) p s
$2 = 0x80484f0 "acentuação"
(gdb) py a = gdb.get_value_from_history (1)
(gdb) py print a
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 17: ordinal not in range(128)

Oddly enough, if I use PyString_String then the example works. I'm not
sure why, though. Probably PyString_String doesn't try to convert back
and forth between charsets, and just prints the stream of bytes to the
screen hoping for the best...
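
The difference between the two paths can be sketched as follows. This
is only an illustration of the guess above, in Python 3, with an
in-memory buffer standing in for the terminal (an assumption of the
sketch); it does not call the C API functions themselves.

```python
import io

text = "acentuação"

# Encoding to ASCII is the step that fails in the traceback above.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Passing the bytes through untouched never raises; whether the result is
# readable depends on the terminal expecting the same charset.
raw = text.encode("utf-8")
sink = io.BytesIO()  # stands in for the terminal's byte stream
sink.write(raw)
passed_through = sink.getvalue()

print(ascii_ok, passed_through == raw)
```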

> Thiago> So for example, to create a value object of char * type using
> Thiago> a string provided by the user or coming from Python code, GDB
> Thiago> would first convert the Python string object (assumed to be in
> Thiago> the host charset) to a unicode object (this process is called
> Thiago> "decoding", in python parlance), and then convert it from
> Thiago> unicode to a string in the target charset.
> 
> This sounds like a good candidate for convenience functions, one for
> each direction.

Right, I'll add them.
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center


* Re: [RFC] string handling in python
  2008-07-08  5:31   ` Thiago Jung Bauermann
@ 2008-07-08  5:35     ` Thiago Jung Bauermann
  0 siblings, 0 replies; 4+ messages in thread
From: Thiago Jung Bauermann @ 2008-07-08  5:35 UTC
  To: tromey; +Cc: gdb ml

On Tue, 2008-07-08 at 02:30 -0300, Thiago Jung Bauermann wrote:
> I
> just made the test, by making valpy_str use PyString_Decode instead of
> PyString_String:

Er, please read PyString_FromString instead of PyString_String.
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center

