From mboxrd@z Thu Jan  1 00:00:00 1970
From: Vladimir Prus
To: Eli Zaretskii
Cc: gdb@sources.redhat.com
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 14:50:00 -0000
Message-Id: <200604141837.26618.ghost@cs.msu.su>
References: <200604141257.41690.ghost@cs.msu.su>
X-SW-Source: 2006-04/txt/msg00193.txt.bz2

On Friday 14 April 2006 17:59, Eli Zaretskii wrote:

> > In an original post, I've asked if gdb can print wchar_t just as a raw
> > sequence of values, like this:
> >
> > 0x56, 0x1456
>
> The answer is YES. Use array notation, and add a feature to report
> the length of a wchar_t array.

Ok.

> Now, the same letter ``small a'' can be encoded in several other ways:
> for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> etc.
> It should be obvious that, of all the encodings, only the
> fixed-length ones can be used in a wchar_t array (because wchar_t
> arrays are stateless,

I don't think this statement is backed up by anything.

> This is why I said that wchar_t is not used for an encoding (such as
> ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints. It is
> nowadays almost universally accepted that wchar_t is a Unicode
> codepoint,

Again, can you provide any specific pointers to support that view?

> the only difference between applications being whether only
> the first 64K characters (the so-called BMP) are supported by 16-bit
> wchar_t, or the entire 23-bit range is supported by a 32-bit wchar_t.

I believe that on Windows:

- wchar_t is 16-bit
- wchar_t* values are supposed to be in UTF-16 encoding
  (see http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp)

Do you disagree with either of the above statements? If not, then it
directly follows that a given wchar_t is not a Unicode code point, but a
code unit in a specific representation (UTF-16), and a given code point
takes either one or two code units, that is, either one or two wchar_t.
This is contrary to your statement that wchar_t is a single code point.

Anyway, this is quickly getting off-topic for the gdb list, so maybe we
should take this somewhere else.

- Volodya
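
To make the code-unit argument above concrete, here is a minimal Python
sketch. It is not gdb code; the helper names utf16_code_units and
format_raw are made up for illustration. It splits a Unicode code point
into UTF-16 code units and prints them as raw values in the
"0x56, 0x1456" style asked about in the original post.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units (as ints) for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x10000:
        # Inside the BMP: one 16-bit code unit is enough.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def format_raw(units):
    """Format code units as a comma-separated list of hex values."""
    return ", ".join(f"0x{u:x}" for u in units)


if __name__ == "__main__":
    # U+0430 (Cyrillic small a) fits in one unit; U+1D11E needs two.
    for cp in (0x0430, 0x1D11E):
        print(f"U+{cp:04X}: {format_raw(utf16_code_units(cp))}")
```

With a 16-bit wchar_t holding UTF-16, the second case occupies two
array elements, which is the situation described above.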