From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-32337-listarch-gdb=sources.redhat.com@sourceware.org>
Subject: Re: [RFC] string handling in python
From: Thiago Jung Bauermann <bauerman@br.ibm.com>
To: tromey@redhat.com
Cc: gdb ml <gdb@sourceware.org>
In-Reply-To: <m3prpp9srd.fsf@fleche.redhat.com>
References: <1215408302.1795.38.camel@localhost.localdomain> 	 <m3prpp9srd.fsf@fleche.redhat.com>
Content-Type: text/plain; charset=utf-8
Date: Tue, 08 Jul 2008 05:31:00 -0000
Message-Id: <1215495013.1795.79.camel@localhost.localdomain>
Mime-Version: 1.0
X-Mailer: Evolution 2.22.2
Content-Transfer-Encoding: 8bit

On Mon, 2008-07-07 at 17:30 -0600, Tom Tromey wrote:
> >>>>> "Thiago" == Thiago Jung Bauermann <bauerman@br.ibm.com> writes:
> 
> Thiago> So, in my opinion for GDB's Python bindings we should always
> Thiago> use Unicode strings, and convert to/from desired encodings as
> Thiago> necessary. Strings provided by the user would be assumed to
> Thiago> have host_charset () encoding, and strings coming from/going
> Thiago> to the inferior would be assumed to have target_charset ()
> Thiago> encoding.
> 
> Sounds reasonable to me.
> 
> I thought we already did some of this... search for host_charset in
> the python directory.

It doesn't really work. PyString_Decode converts the string from
host_charset to Unicode, and then from Unicode to Python's default
charset (almost always ASCII). So if the string contains any non-ASCII
character, Python can't even print it on screen. I just tested this by
making valpy_str use PyString_Decode instead of PyString_String:

(gdb) p s
$2 = 0x80484f0 "acentuação"
(gdb) py a = gdb.get_value_from_history (1)
(gdb) py print a
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 17: ordinal not in range(128)

Oddly enough, if I use PyString_String then the example works. I'm not
sure why, though. Probably PyString_String doesn't try to convert back
and forth between charsets, and just prints the stream of bytes to the
screen hoping for the best...
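
The failure mode itself can be reproduced outside GDB. What follows is
only a minimal sketch in plain Python 3, not the Python 2 C API calls
discussed here. It assumes the bytes in the inferior are UTF-8, and the
helper name try_ascii is made up for illustration:

```python
# Hypothetical illustration, not GDB code: show why re-encoding a decoded
# (Unicode) string with the ASCII codec fails for non-ASCII text.

raw = "acentuação".encode("utf-8")   # bytes as they might sit in the inferior (assumed UTF-8)
text = raw.decode("utf-8")           # "decoding": bytes -> Unicode


def try_ascii(s):
    """Return the ASCII-encoded bytes, or the error message if encoding fails."""
    try:
        return s.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)


print(try_ascii("plain"))   # pure ASCII text encodes without trouble
print(try_ascii(text))      # non-ASCII text triggers the codec error
```

Keeping the value as Unicode and encoding it explicitly only at the
point of output avoids the implicit ASCII step.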

> Thiago> So for example, to create a value object of char * type using
> Thiago> a string provided by the user or coming from Python code, GDB
> Thiago> would first convert the Python string object (assumed to be in
> Thiago> the host charset) to a unicode object (this process is called
> Thiago> "decoding", in python parlance), and then convert it from
> Thiago> unicode to a string in the target charset.
> 
> This sounds like a good candidate for convenience functions, one for
> each direction.

Right, I'll add them.
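
A rough sketch of what such a pair could look like, written in plain
Python 3 rather than against GDB's C sources. The function names and
the explicit charset arguments are assumptions for illustration only;
in GDB the charsets would come from host_charset () and
target_charset ():

```python
# Hypothetical helpers; names and signatures are assumptions, not GDB API.
# Each direction goes through Unicode as the intermediate form.


def host_to_target(data, host_charset, target_charset):
    """Convert bytes in the host charset to bytes in the target charset."""
    text = data.decode(host_charset)      # host bytes -> Unicode
    return text.encode(target_charset)    # Unicode -> target bytes


def target_to_host(data, host_charset, target_charset):
    """Convert bytes in the target charset to bytes in the host charset."""
    text = data.decode(target_charset)    # target bytes -> Unicode
    return text.encode(host_charset)      # Unicode -> host bytes


# Round trip with an assumed UTF-8 host and ISO-8859-1 target.
host_bytes = "acentuação".encode("utf-8")
target_bytes = host_to_target(host_bytes, "utf-8", "iso-8859-1")
assert target_to_host(target_bytes, "utf-8", "iso-8859-1") == host_bytes
```

Either direction raises an error if the text has no representation in
the destination charset, so callers would need to decide how to report
that.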
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center