Subject: [RFC] string handling in python
From: Thiago Jung Bauermann
To: gdb ml
Date: Mon, 07 Jul 2008 05:25:00 -0000
Message-Id: <1215408302.1795.38.camel@localhost.localdomain>

Hi folks,

I've been thinking about how we should handle strings and charsets in
Python, and I have some ideas. I'd like to ask your opinion about them.

First, some explanation about strings in Python, and how it deals with
different character sets (warning, I just learned about this stuff today
and I may be wrong about it...):

By default it is quite simple: Python doesn't deal with the problem.
The regular string type is just a byte array with some convenience
methods that treat it as a string of 8-bit characters. If Python ever
needs to assume a charset, it will assume ASCII (it is possible to change
the default charset in Python, but it is highly discouraged). Because of
this, you can easily run into trouble if you use non-ASCII characters
(even Latin 1) in regular Python strings.

There's another string type, which is the Unicode string. You get one by
prefixing a string literal with u, as in u"hello, world!". I believe the
internal representation is UTF-32 or UCS-4, but I'm not sure and in fact
it doesn't matter. Python can convert back and forth between Unicode and
several charsets, and from what I read you can mix Unicode strings with
regular strings and things will work (as long as the regular strings are
ASCII-only or you explicitly convert them to Unicode using
string.decode("some_charset")).

There's some more info about this in
http://effbot.org/zone/unicode-objects.htm

So, in my opinion, for GDB's Python bindings we should always use Unicode
strings, and convert to/from the desired encodings as necessary. Strings
provided by the user would be assumed to have host_charset () encoding,
and strings coming from/going to the inferior would be assumed to have
target_charset () encoding.

So for example, to create a value object of char * type using a string
provided by the user or coming from Python code, GDB would first convert
the Python string object (assumed to be in the host charset) to a unicode
object (this process is called "decoding", in Python parlance), and then
convert it from unicode to a string in the target charset. This is what
is implemented at the moment in gdbpy_make_value in the git repo, BTW.

What do you think?
--
[]'s
Thiago Jung Bauermann
Software Engineer
IBM Linux Technology Center
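
The two-step conversion proposed above (decode from the host charset to
Unicode, then encode to the target charset) can be sketched in plain
Python. This is only an illustration, not the gdbpy_make_value code: the
helper name host_to_target and its charset arguments are hypothetical
stand-ins for whatever host_charset () and target_charset () return, and
the sketch assumes Python 3, where bytes plays the role of the regular
string and str is the Unicode string.

```python
def host_to_target(raw, host_charset, target_charset):
    """Re-encode bytes from the host charset into the target charset.

    Hypothetical helper for illustration only; the charset names stand in
    for the values of host_charset () and target_charset ().
    """
    # "Decoding": bytes in the host charset -> Unicode string.
    text = raw.decode(host_charset)
    # "Encoding": Unicode string -> bytes in the target charset.
    return text.encode(target_charset)


# Assumed scenario: a UTF-8 host and a Latin-1 inferior.
host_bytes = "caf\u00e9".encode("utf-8")
target_bytes = host_to_target(host_bytes, "utf-8", "iso-8859-1")
print(repr(target_bytes))
```

Going through Unicode means neither side needs to know about the other's
charset. A character that the target charset cannot represent makes the
encode step raise UnicodeEncodeError, so the failure is explicit rather
than silent.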