From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: Working efficiently with 32-bit Unicode output streams, locale etc.
Date: Tue, 01 Dec 2015 12:53:57 -0800
Organization: None to speak of

BartC writes:
[...]
> If I run this code, where it prints the first 4 'somethings' of the string:
>
>   printf("%.4s","£100pw");
>
> Then it outputs "£10" in UTF8, not "£100". £90 is a big difference!

The pound sign in your article is printed in my newsreader (actually in
GNU Emacs) as \243.  Your article headers include:

    Content-Type: text/plain; charset=windows-1252; format=flowed

Apparently my system (I'm using Linux) isn't configured to understand
windows-1252, so it falls back to displaying the character in octal.

I see you're using Thunderbird on Windows.  Is there any way you can
configure it to post using UTF-8?

Anyway ...

> So does that 4 represent bytes or characters?
>
> The specs for printf on MSDN say printf returns the number of characters
> printed, while the C standard says it's the number of characters
> transmitted.
>
> But here it returns 4 for an output of "£10", clearly not 4 characters.
> So it's all a bit of a mess.
The C standard says that printf returns "the number of characters
transmitted, or a negative value if an output or encoding error
occurred".  It appears to be using the word "character" in the sense
defined in 3.7.1:

    character
    single-byte character
    bit representation that fits in a byte

as opposed to 3.7:

    character
    member of a set of elements used for the organization, control,
    or representation of data

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"