[RFC] string handling in python - Thiago Jung Bauermann

From: Thiago Jung Bauermann <bauerman@br.ibm.com>
To: gdb ml <gdb@sourceware.org>
Subject: [RFC] string handling in python
Date: Mon, 07 Jul 2008 05:25:00 -0000
Message-ID: <1215408302.1795.38.camel@localhost.localdomain>

Hi folks,

I've been thinking about how we should handle strings and charsets in
Python, and I have some ideas. I'd like to ask your opinion about them.

First, some explanation about strings in Python, and how it deals with
different character sets (warning, I just learned about this stuff today
and I may be wrong about it...):

By default it is quite simple: Python doesn't deal with the problem. The
regular string type is just a byte array with some convenience methods
that treat it as a string of 8-bit characters. If Python ever needs to
assume a charset, it will assume ASCII (it is possible to change the
default charset in Python, but doing so is highly discouraged). Because
of this, you can easily run into trouble if you use non-ASCII characters
(even Latin-1) in regular Python strings.
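
A minimal sketch of the failure mode described above. It is written so
that it also runs on Python 3, where the byte-array type is spelled
bytes. The sample data and the helper name decodes_as_ascii are
assumptions made for illustration, not anything from GDB.

```python
# Sketch: a byte string holding Latin-1 text cannot be read as ASCII.
# The helper name and sample data are hypothetical.

def decodes_as_ascii(raw):
    """Return True if the byte string `raw` contains only ASCII bytes."""
    try:
        raw.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False

plain_bytes = b"cafe"        # every byte is below 0x80
latin1_bytes = b"caf\xe9"    # 0xE9 is e-acute in Latin-1, outside ASCII

print(decodes_as_ascii(plain_bytes))   # True
print(decodes_as_ascii(latin1_bytes))  # False
```

The second call fails as soon as the codec meets a byte above 0x7F,
which is the kind of trouble a non-ASCII regular string runs into when
ASCII is assumed.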

There's another string type, the Unicode string. You get one by
prefixing a string literal with u, as in u"hello, world!". I believe
the internal representation is UTF-32 or UCS-4, but I'm not sure and in
fact it doesn't matter. Python can convert back and forth between
Unicode and several charsets, and from what I read you can mix Unicode
strings with regular strings and things will work (as long as the
regular strings are ASCII-only or you explicitly convert them to Unicode
using string.decode("some_charset")).
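
The explicit conversion mentioned above might look like the following
sketch. The variable names and the sample bytes are assumptions chosen
for illustration. Only the explicit decode/encode route is shown,
because Python 3 no longer mixes byte strings and Unicode strings
implicitly.

```python
# Sketch: decode bytes into a Unicode string, then encode that string
# into a different charset. Names and data are hypothetical.

raw_latin1 = b"caf\xe9"               # Latin-1 bytes, 4 bytes long
text = raw_latin1.decode("latin-1")   # Unicode string, 4 characters
raw_utf8 = text.encode("utf-8")       # same text as UTF-8, 5 bytes long

# Once the byte string has been decoded, joining it with a Unicode
# literal is unambiguous.
greeting = u"hello, " + text

print(len(raw_latin1), len(text), len(raw_utf8))  # 4 4 5
```

The character count stays the same through the conversion, while the
byte count depends on the charset chosen for encoding.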

There's some more info about this in
http://effbot.org/zone/unicode-objects.htm

So, in my opinion for GDB's Python bindings we should always use Unicode
strings, and convert to/from desired encodings as necessary. Strings
provided by the user would be assumed to have host_charset () encoding,
and strings coming from/going to the inferior would be assumed to have
target_charset () encoding.

So for example, to create a value object of char * type from a string
provided by the user or coming from Python code, GDB would first convert
the Python string object (assumed to be in the host charset) to a
Unicode object (this process is called "decoding" in Python parlance),
and then convert it from Unicode to a string in the target charset. This
is what is implemented at the moment in gdbpy_make_value in the git
repo, BTW.
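
The two-step conversion can be sketched in plain Python as follows.
This is not the code in gdbpy_make_value; the function name
host_to_target and the charset names passed to it are assumptions
standing in for whatever host_charset () and target_charset () would
return.

```python
# Sketch of the proposed conversion: host charset -> Unicode -> target
# charset. The function name and the charset choices are hypothetical.

def host_to_target(raw, host_charset, target_charset):
    """Re-encode the byte string `raw` from host_charset to target_charset."""
    text = raw.decode(host_charset)     # step 1: "decoding" to Unicode
    return text.encode(target_charset)  # step 2: Unicode to target bytes

# Example: a UTF-8 host talking to a Latin-1 target.
host_bytes = b"caf\xc3\xa9"
target_bytes = host_to_target(host_bytes, "utf-8", "latin-1")

print(target_bytes == b"caf\xe9")  # True
```

Going through Unicode means each charset only needs a conversion to and
from one common representation. A character that the target charset
cannot represent makes the encode step raise UnicodeEncodeError, so a
real implementation would have to decide how to report that to the
user.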

What do you think?
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center
