Subject: [RFC] string handling in python
From: Thiago Jung Bauermann <bauerman@br.ibm.com>
To: gdb ml <gdb@sourceware.org>
Content-Type: text/plain
Date: Mon, 07 Jul 2008 05:25:00 -0000
Message-Id: <1215408302.1795.38.camel@localhost.localdomain>

Hi folks,

I've been thinking about how we should handle strings and charsets in
Python, and I have some ideas. I'd like to ask your opinion about them.

First, some explanation about strings in Python, and how it deals with
different character sets (warning, I just learned about this stuff today
and I may be wrong about it...):

By default it is quite simple: Python doesn't deal with the problem. The
regular string type is just a byte array with some convenience
methods that treat it as a string of 8-bit characters. If Python ever
needs to assume a charset, it will assume ASCII (it is possible to
change the default charset in Python, but it is highly discouraged).
Because of this, you can easily run into trouble if you use non-ASCII
characters (even Latin 1) in regular Python strings.
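
To make the failure mode concrete, here is a minimal sketch. It assumes
Python 3, where the bytes type plays the role of the "regular" byte
string described above; the helper name decode_as_ascii is made up for
illustration and is not part of Python or GDB.

```python
# Assumption: Python 3, with bytes standing in for the regular byte string.
raw = "caf\u00e9".encode("latin-1")  # four bytes, the last one is 0xE9


def decode_as_ascii(data):
    """Hypothetical helper: return the text, or None if data is not pure ASCII."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError:
        # Any byte above 0x7F cannot be interpreted under the ASCII assumption.
        return None


ascii_ok = decode_as_ascii(b"hello")  # pure ASCII decodes fine
ascii_bad = decode_as_ascii(raw)      # the Latin 1 byte 0xE9 is rejected
```

The point is only that a single Latin 1 character is enough to break
code that silently assumes ASCII.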

There's another string type, the Unicode string. You get one by
prefixing a string literal with u, as in u"hello, world!". I believe
the internal representation is UCS-2 or UCS-4, depending on how Python
was built, but I'm not sure and in fact it doesn't matter. Python can
convert back and forth between Unicode and several charsets, and from
what I read you can mix Unicode strings with regular strings and things
will work (as long as the regular strings are ASCII-only or you
explicitly convert them to Unicode using
string.decode("some_charset")).
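
As a rough illustration of converting back and forth, the sketch below
decodes a byte string into Unicode and encodes it again in a different
charset. It assumes Python 3 (bytes for the byte string, str for the
Unicode string); the sample data and charset choices are arbitrary.

```python
# Assumption: Python 3, where str is always Unicode and bytes is the raw form.
latin1_bytes = b"ma\xf1ana"                 # "manana" with n-tilde, in Latin 1

text = latin1_bytes.decode("latin-1")       # byte string -> Unicode ("decoding")
greeting = "hola, " + text                  # Unicode strings concatenate freely
utf8_bytes = greeting.encode("utf-8")       # Unicode -> another charset ("encoding")
```

Once the text is Unicode, the original charset no longer matters; it
only comes back into play at the point of encoding.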

There's some more info about this in
http://effbot.org/zone/unicode-objects.htm

So, in my opinion for GDB's Python bindings we should always use Unicode
strings, and convert to/from desired encodings as necessary. Strings
provided by the user would be assumed to have host_charset () encoding,
and strings coming from/going to the inferior would be assumed to have
target_charset () encoding.

So for example, to create a value object of char * type using a string
provided by the user or coming from Python code, GDB would first convert
the Python string object (assumed to be in the host charset) to a
Unicode object (this process is called "decoding" in Python parlance),
and then convert it from Unicode to a string in the target charset. This
is what is implemented at the moment in gdbpy_make_value in the git
repo, BTW.
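
The two-step conversion can be sketched in plain Python as follows.
This is not the code in gdbpy_make_value; the function name
recode_for_target and the charset names passed to it are placeholders
standing in for whatever host_charset () and target_charset () would
report, and the sketch assumes Python 3.

```python
def recode_for_target(host_bytes, host_charset, target_charset):
    """Hypothetical helper: host-charset bytes -> Unicode -> target-charset bytes."""
    text = host_bytes.decode(host_charset)   # step 1: decode to Unicode
    return text.encode(target_charset)       # step 2: encode for the inferior


# Assumed example: the host uses UTF-8 and the target uses Latin 1.
host = "na\u00efve".encode("utf-8")
target = recode_for_target(host, "utf-8", "latin-1")
```

Going through Unicode means only one decoder and one encoder per
charset are needed, rather than a converter for every pair of charsets.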

What do you think?
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center