From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <gdb-return-24905-listarch-gdb=sources.redhat.com@sourceware.org>
Received: (qmail 21398 invoked by alias); 14 Apr 2006 14:37:52 -0000
Received: (qmail 21386 invoked by uid 22791); 14 Apr 2006 14:37:51 -0000
X-Spam-Check-By: sourceware.org
Received: from zigzag.lvk.cs.msu.su (HELO zigzag.lvk.cs.msu.su) (158.250.17.23)     by sourceware.org (qpsmtpd/0.31) with ESMTP; Fri, 14 Apr 2006 14:37:48 +0000
Received: from Debian-exim by zigzag.lvk.cs.msu.su with spam-scanned (Exim 4.50) 	id 1FUPQW-0008VN-G7 	for gdb@sources.redhat.com; Fri, 14 Apr 2006 18:37:41 +0400
Received: from zigzag.lvk.cs.msu.su ([158.250.17.23]) 	by zigzag.lvk.cs.msu.su with esmtp (Exim 4.50) 	id 1FUPQN-0008TX-MZ; Fri, 14 Apr 2006 18:37:27 +0400
From: Vladimir Prus <ghost@cs.msu.su>
To: Eli Zaretskii <eliz@gnu.org>
Subject: Re: printing wchar_t*
Date: Fri, 14 Apr 2006 14:50:00 -0000
User-Agent: KMail/1.7.2
Cc: gdb@sources.redhat.com
References: <e1lsqg$aml$1@sea.gmane.org> <200604141257.41690.ghost@cs.msu.su> <uu08w1cnf.fsf@gnu.org>
In-Reply-To: <uu08w1cnf.fsf@gnu.org>
MIME-Version: 1.0
Content-Type: text/plain;   charset="koi8-r"
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Message-Id: <200604141837.26618.ghost@cs.msu.su>
Mailing-List: contact gdb-help@sourceware.org; run by ezmlm
Precedence: bulk
List-Subscribe: <mailto:gdb-subscribe@sourceware.org>
List-Archive: <http://sourceware.org/ml/gdb/>
List-Post: <mailto:gdb@sourceware.org>
List-Help: <mailto:gdb-help@sourceware.org>, <http://sourceware.org/ml/#faqs>
Sender: gdb-owner@sourceware.org
X-SW-Source: 2006-04/txt/msg00193.txt.bz2

On Friday 14 April 2006 17:59, Eli Zaretskii wrote:

> > In an original post, I've asked if gdb can print wchar_t just as a raw
> > sequence of values, like this:
> >
> >     0x56, 0x1456
>
> The answer is YES.  Use array notation, and add a feature to report
> the length of a wchar_t array.

Ok.

> Now, the same letter ``small a'' can be encoded in several other ways:
> for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> etc.  It should be obvious that, of all the encodings, only the
> fixed-length ones can be used in a wchar_t array (because wchar_t
> arrays are stateless, 

I don't think this statement is backed up by anything.

> This is why I said that wchar_t is not used for an encoding (such as
> ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.  It is
> nowadays almost universally accepted that wchar_t is a Unicode
> codepoint, 

Again, can you provide any specific pointers to support that view?

> the only difference between applications being whether only 
> the first 64K characters (the so-called BMP) are supported by 16-bit
> wchar_t, or the entire 23-bit range is supported by a 32-bit wchar_t.

I believe that on Windows:

- wchar_t is 16-bit
- wchar_t* values are supposed to be in UTF-16 encoding
See:
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp

Do you disagree with either of the above statements? If not, then it directly 
follows that a given wchar_t is not a Unicode code point, but a code unit in a 
specific representation (UTF-16), and a given code point takes either one or 
two code units, that is, either one or two wchar_t. This is contrary to your 
statement that wchar_t is a single code point.
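
To make the code unit vs. code point distinction concrete, here is a minimal
sketch in C of how a code point maps to UTF-16 code units. The function name
utf16_encode is made up for illustration, and uint16_t stands in for a 16-bit
wchar_t; this is not code from gdb or from any Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 code units.
   Returns the number of 16-bit units written to out (1 or 2).
   Hypothetical helper, written only to illustrate the argument. */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Valid code points are at most U+10FFFF and are not surrogates. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        /* Inside the BMP: one code unit, equal to the code point. */
        out[0] = (uint16_t)cp;
        return 1;
    }

    /* Outside the BMP: split the remaining 20 bits into a surrogate pair. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For the values 0x56 and 0x1456 from my original example this yields one unit
each, but a code point such as U+1D11E comes out as two units, 0xD834 0xDD1E.
So a single 16-bit wchar_t taken from such a string need not be a code point
at all.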

Anyway, this is quickly getting off-topic for the gdb list, so maybe we should 
take this somewhere else.

- Volodya