Date: Fri, 14 Apr 2006 17:18:00 -0000
From: Eli Zaretskii
To: Vladimir Prus
CC: gdb@sources.redhat.com
In-reply-to: <200604141837.26618.ghost@cs.msu.su> (message from Vladimir Prus on Fri, 14 Apr 2006 18:37:25 +0400)
Subject: Re: printing wchar_t*
Reply-to: Eli Zaretskii
References: <200604141257.41690.ghost@cs.msu.su> <200604141837.26618.ghost@cs.msu.su>

> From: Vladimir Prus
> Date: Fri, 14 Apr 2006 18:37:25 +0400
> Cc: gdb@sources.redhat.com
>
> > Now, the same letter ``small a'' can be encoded in several other ways:
> > for example, its ISO-2022-7bit encoding is 0x1B 0x24 0x2C 0x31 0x28
> > 0x50, its KOI8-r encoding is 0xC1, its ISO-8859-5 encoding is 0xD0,
> > etc. It should be obvious that, of all the encodings, only the
> > fixed-length ones can be used in a wchar_t array (because wchar_t
> > arrays are stateless,
>
> I don't think this statement is backed up by anything.
>
> > This is why I said that wchar_t is not used for an encoding (such as
> > ISO-8859-5 or UTF-8 or UTF-16), but for characters' codepoints.
> > It is nowadays almost universally accepted that wchar_t is a Unicode
> > codepoint,
>
> Again, can you provide any specific pointers to support that view?

I think Robert and I already explained that in later messages.
Feel free to ask specific questions if something is still unclear.

> I believe that on Windows:
>
>    - wchar_t is 16-bit
>    - wchar_t* values are supposed to be in UTF-16 encoding
>      (see
> http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_9i79.asp
>
> Do you disagree with any of the above statements?

wchar_t is just an integer type. You can stuff _anything_ into an
integer array, but if you put UTF-16 there, each element is no longer
a character; it is one of a few 16-bit integers that together encode a
character. In other words, it's a variant of multibyte strings,
except that each element is 16 bits wide.

Now, I know that Windows holds 16-bit UTF-16 encodings in wchar_t
arrays, but those are not the L"foo" strings of wide characters. In the
L"foo" notation, each of the 3 string characters _always_ occupies
exactly one wchar_t element, and L"foo"[1] is _always_ the second
character of the string. This is not true for UTF-16, as I hope is
clear from this discussion. In UTF-16, array[1] is the second 16-bit
value in the array; it is not necessarily the second character, because
a character's encoding could need more than one 16-bit value.

> If not, then it directly
> follows that a given wchar_t is not a Unicode code point, but a code unit in
> specific representation (UTF-16), and a given code point takes either one or
> two code units, that is either one or two wchar_t. This is contrary to your
> statement that wchar_t is a single code point.

My statement was based on the assumption that you are coding for a
system where wchar_t is used for complete characters, not for UTF-16
strings. Only in that case can you talk about ``wide characters'' and
about wchar_t being a character. In UTF-16, an arbitrary element of
the array might not be a complete character.

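To make the difference concrete, here is a minimal C sketch of the
point above. It assumes the UTF-16 code units are held in uint16_t
rather than wchar_t, so that it does not depend on how wide wchar_t is
on a given platform. The function names are made up for illustration;
they are not GDB or library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if u is the first half (high surrogate) of a UTF-16 pair. */
int is_high_surrogate(uint16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

/* Nonzero if u is the second half (low surrogate) of a UTF-16 pair. */
int is_low_surrogate(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

/* Decode the character that starts at units[i].  Stores its code point
   in *cp and returns the number of 16-bit units consumed (1 or 2).
   An unpaired surrogate is passed through unchanged as 1 unit. */
size_t utf16_decode_one(const uint16_t *units, size_t len, size_t i,
                        uint32_t *cp)
{
    assert(units != NULL && cp != NULL && i < len);
    uint16_t hi = units[i];
    if (is_high_surrogate(hi) && i + 1 < len
        && is_low_surrogate(units[i + 1])) {
        uint16_t lo = units[i + 1];
        /* Combine the 10 payload bits of each half, then add the offset. */
        *cp = 0x10000u + (((uint32_t)(hi - 0xD800) << 10)
                          | (uint32_t)(lo - 0xDC00));
        return 2;
    }
    *cp = hi;
    return 1;
}

/* Count complete characters (code points) in a UTF-16 array of len units. */
size_t utf16_count_chars(const uint16_t *units, size_t len)
{
    size_t n = 0, i = 0;
    uint32_t cp;
    while (i < len) {
        i += utf16_decode_one(units, len, i, &cp);
        n++;
    }
    return n;
}
```

For the four units 0x0061 0xD834 0xDD1E 0x0062 (that is, "a", U+1D11E,
"b"), utf16_count_chars returns 3, while element [1] on its own is just
the high surrogate 0xD834 and not a complete character.
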
> Anyway, this is quickly getting off-topic for gdb list, so maybe we should
> bring this somewhere else.

It _is_ on topic, IMHO, as long as we discuss features to be added to
GDB.