Path: csiph.com!eternal-september.org!feeder.eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail
From: Keith Thompson <kst-u@mib.org>
Newsgroups: comp.lang.c
Subject: Re: Working efficiently with 32-bit Unicode output streams, locale etc.
Date: Wed, 02 Dec 2015 14:12:20 -0800
Organization: None to speak of
Lines: 28
Message-ID: <lnbna8sdob.fsf@kst-u.example.com>
References: <n3dfgs$a24$1@speranza.aioe.org> <n3g5eq$pvg$1@speranza.aioe.org> <dc1l2pF2cb8U1@mid.individual.net> <n3g77s$sor$1@speranza.aioe.org> <dc1m91F2cb8U2@mid.individual.net> <n3g921$vlq$1@speranza.aioe.org> <e037cc57-2024-491d-a992-8e821cd8014b@googlegroups.com> <n3h61a$v9h$1@dont-email.me> <n3hejb$ujn$1@dont-email.me> <lnh9k3w59i.fsf@kst-u.example.com> <n3iil7$ld7$1@dont-email.me> <n3j2ke$3do$1@dont-email.me> <n3k48m$9j6$1@dont-email.me> <n3kd1c$cnt$1@dont-email.me> <n3ku7i$kjd$1@dont-email.me> <n3l8s5$4v5$1@dont-email.me> <n3ld6f$k49$1@dont-email.me> <n3n33g$qs7$1@dont-email.me> <n3nafp$pd0$1@dont-email.me> <dc92reFi96mU6@mid.individual.net> <n3nmo8$e5j$1@dont-email.me>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: mx02.eternal-september.org; posting-host="945944de09706c9b4e29b53c9d2efdc2"; logging-data="25540"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+VDLWTZpL7WX0ndNu7dGs+"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/23.1 (gnu/linux)
Cancel-Lock: sha1:TQLIzYvhCIJn7SGECI5PzwAaz9s= sha1:dRMRQGO58VBYAji1ZIJ8IMzJhSQ=
Xref: csiph.com comp.lang.c:77674

BartC <bc@freeuk.com> writes:
[...]
> It's like taking existing code which works perfectly well with 
> characters, and changing it to work with variable-length words.
>
> Who wants to code like that? I think UTF8 is a fine compression scheme 
> for storing text on disk, otherwise...

Or for transmitting text.

*Some* in-memory text processing can be done perfectly well using
UTF-8.  If that's all you're doing, there's no point in translating
UTF-8 to some other internal form.
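
As a rough illustration of processing that works directly on UTF-8,
here is a minimal sketch.  It assumes the input is valid,
NUL-terminated UTF-8.  The names utf8_count and utf8_find are made up
for this example and are not from any library.  Counting code points
only requires skipping continuation bytes, and substring search can
be left to plain strstr(), because a valid UTF-8 sequence never
appears in the middle of a different one.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by counting
   every byte that is not a continuation byte (10xxxxxx).
   Assumes the input is valid UTF-8; no validation is done. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Byte-wise search.  For valid UTF-8 haystack and needle, a match
   always starts on a code point boundary, so strstr() is enough. */
const char *utf8_find(const char *haystack, const char *needle)
{
    return strstr(haystack, needle);
}
```

For example, utf8_count("\xE2\x82\xAC") treats the three-byte
encoding of the euro sign as one code point.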

Other kinds of processing can be more difficult due to the
variable-length encoding.  In that case, you can convert the UTF-8
you read from a file to, say, UTF-32 (or wchar_t[] if wchar_t is
32 bits on your system), do your stuff, then convert back to UTF-8
for output.
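
The round trip described above might look something like the
following sketch.  It uses uint32_t for UTF-32 code points rather
than wchar_t, since the width of wchar_t varies between systems.
The function names utf8_decode and utf8_encode are hypothetical, and
the error handling (return 0 on anything malformed) is just one
possible policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from s (n bytes available) into *cp.
   Returns the number of bytes consumed, or 0 if the input is
   truncated or malformed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    /* Smallest code point that legitimately needs each length. */
    static const uint32_t min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    size_t len, i;
    uint32_t c;

    if (n == 0) return 0;
    if (s[0] < 0x80)                { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; }
    else return 0;          /* stray continuation or invalid lead byte */

    if (n < len) return 0;  /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < min_cp[len]) return 0;             /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* UTF-16 surrogate */
    if (c > 0x10FFFF) return 0;                /* beyond Unicode range */
    *cp = c;
    return len;
}

/* Encode cp as UTF-8 into out, which must have room for 4 bytes.
   Returns the number of bytes written, or 0 if cp is not a valid
   Unicode scalar value. */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

A caller would loop utf8_decode over the input buffer to fill a
uint32_t array, do the fixed-width processing there, then loop
utf8_encode over the result to produce the output bytes.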

Yes, it's more complicated than working with 7-bit ASCII text that
can't represent accented letters or European currency symbols.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org  <http://www.ghoti.net/~kst>
Working, but not speaking, for JetHead Development, Inc.
"We must do something.  This is something.  Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"