Subject: Re: [RFC] string handling in python
From: Thiago Jung Bauermann
To: tromey@redhat.com
Cc: gdb ml
References: <1215408302.1795.38.camel@localhost.localdomain>
Date: Tue, 08 Jul 2008 05:31:00 -0000
Message-Id: <1215495013.1795.79.camel@localhost.localdomain>

On Mon, 2008-07-07 at 17:30 -0600, Tom Tromey wrote:
> >>>>> "Thiago" == Thiago Jung Bauermann writes:
>
> Thiago> So, in my opinion for GDB's Python bindings we should always
> Thiago> use Unicode strings, and convert to/from desired encodings as
> Thiago> necessary.
> Thiago> Strings provided by the user would be assumed to
> Thiago> have host_charset () encoding, and strings coming from/going
> Thiago> to the inferior would be assumed to have target_charset ()
> Thiago> encoding.
>
> Sounds reasonable to me.
>
> I thought we already did some of this... search for host_charset in
> the python directory.

It doesn't really work. PyString_Decode transforms the string from
host_charset to unicode, and then from unicode to Python's default
charset (almost always ASCII). So if you have any non-ASCII character in
the string, Python won't even be able to print the string on screen. I
just made the test, by making valpy_str use PyString_Decode instead of
PyString_String:

(gdb) p s
$2 = 0x80484f0 "acentuação"
(gdb) py a = gdb.get_value_from_history (1)
(gdb) py print a
Traceback (most recent call last):
  File "", line 1, in
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe5' in position 17: ordinal not in range(128)

Oddly enough, if I use PyString_String then the example works. I'm not
sure why, though. Probably PyString_String doesn't try to convert back
and forth between charsets, and just prints the stream of bytes to the
screen hoping for the best...

> Thiago> So for example, to create a value object of char * type using
> Thiago> a string provided by the user or coming from Python code, GDB
> Thiago> would first convert the Python string object (assumed to be in
> Thiago> the host charset) to a unicode object (this process is called
> Thiago> "decoding", in python parlance), and then convert it from
> Thiago> unicode to a string in the target charset.
>
> This sounds like a good candidate for convenience functions, one for
> each direction.

Right, I'll add them.
-- 
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center
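
For illustration, the two conversion directions discussed above (decode
from one charset to Unicode, then encode into the other) might look
roughly like the sketch below. This is not GDB code: the function names
and the HOST_CHARSET / TARGET_CHARSET constants are hypothetical
stand-ins for whatever host_charset () and target_charset () return, and
it assumes Python 3 semantics, where str is Unicode text and bytes is
the encoded form. The last helper only reproduces the kind of failure
shown in the traceback, where non-ASCII text is pushed through the
ASCII codec.

```python
# Hypothetical stand-ins for GDB's host_charset () and target_charset ().
# The concrete values are assumptions chosen only for this example.
HOST_CHARSET = "utf-8"
TARGET_CHARSET = "iso-8859-1"


def host_to_target(data: bytes,
                   host: str = HOST_CHARSET,
                   target: str = TARGET_CHARSET) -> bytes:
    """Decode host-charset bytes to Unicode, then encode for the target."""
    text = data.decode(host)      # "decoding": bytes -> Unicode
    return text.encode(target)    # "encoding": Unicode -> bytes


def target_to_host(data: bytes,
                   host: str = HOST_CHARSET,
                   target: str = TARGET_CHARSET) -> bytes:
    """Decode target-charset bytes to Unicode, then encode for the host."""
    text = data.decode(target)
    return text.encode(host)


def fails_as_ascii(text: str) -> bool:
    """Return True if encoding text with the ASCII codec raises an error."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


if __name__ == "__main__":
    user_bytes = "acentuação".encode(HOST_CHARSET)
    inferior_bytes = host_to_target(user_bytes)
    # Converting back should give the original host-charset bytes.
    print(target_to_host(inferior_bytes) == user_bytes)
    # Non-ASCII text cannot go through the ASCII codec.
    print(fails_as_ascii("acentuação"))
```

Keeping Unicode as the only intermediate form means each direction is a
single decode followed by a single encode, so neither helper depends on
the interpreter's default charset.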