From: Keith Thompson
Newsgroups: comp.lang.c
Subject: Re: Working efficiently with 32-bit Unicode output streams, locale etc.
Date: Wed, 02 Dec 2015 14:12:20 -0800
Organization: None to speak of

BartC writes:
[...]
> It's like taking existing code which works perfectly well with
> characters, and changing it to work with variable-length words.
>
> Who wants to code like that? I think UTF8 is a fine compression scheme
> for storing text on disk, otherwise...

Or for transmitting text.

*Some* in-memory text processing can be done perfectly well using UTF-8.
If that's all you're doing, there's no point in translating UTF-8 to
some other internal form.

Other kinds of processing can be more difficult due to the
variable-length encoding. In that case, you can convert the UTF-8 you
read from a file to, say, UTF-32 (or wchar_t[] if wchar_t is 32 bits on
your system), do your stuff, then convert back to UTF-8 for output.

Yes, it's more complicated than working with 7-bit ASCII text that can't
represent accented letters or European currency symbols.

-- 
Keith Thompson (The_Other_Keith) kst-u@mib.org
Working, but not speaking, for JetHead Development, Inc.
"We must do something. This is something. Therefore, we must do this."
    -- Antony Jay and Jonathan Lynn, "Yes Minister"
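
For illustration, the round trip described in the post (read UTF-8, convert to UTF-32, process, convert back to UTF-8) might look roughly like the sketch below. The function names utf8_decode_one, utf8_to_utf32 and utf32_to_utf8 are hypothetical and do not come from the post or from any standard library. The sketch uses uint32_t rather than wchar_t so it does not depend on wchar_t being 32 bits, and it rejects malformed input instead of substituting a replacement character, which is only one of several reasonable policies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence from s (n bytes available).
   Returns the number of bytes consumed, or 0 if the input is malformed. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len;
    uint32_t c, min;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; c = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; c = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; c = s[0] & 0x07; min = 0x10000; }
    else return 0;                      /* stray continuation or invalid lead byte */

    if (n < len) return 0;              /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    /* Reject overlong forms, surrogates, and values beyond U+10FFFF. */
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* Convert n bytes of UTF-8 into at most cap code points.
   Returns the number of code points, or (size_t)-1 on error. */
size_t utf8_to_utf32(const unsigned char *src, size_t n,
                     uint32_t *dst, size_t cap)
{
    size_t i = 0, count = 0;

    assert(src != NULL && dst != NULL);
    while (i < n) {
        uint32_t cp;
        size_t used = utf8_decode_one(src + i, n - i, &cp);
        if (used == 0 || count == cap) return (size_t)-1;
        dst[count++] = cp;
        i += used;
    }
    return count;
}

/* Convert n code points into at most cap bytes of UTF-8.
   Returns the number of bytes written, or (size_t)-1 on error. */
size_t utf32_to_utf8(const uint32_t *src, size_t n,
                     unsigned char *dst, size_t cap)
{
    static const unsigned char lead[] = { 0, 0, 0xC0, 0xE0, 0xF0 };
    size_t out = 0;

    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        uint32_t c = src[i];
        size_t len = c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;

        if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return (size_t)-1;
        if (cap - out < len) return (size_t)-1;
        if (len == 1) {
            dst[out++] = (unsigned char)c;
            continue;
        }
        /* Lead byte carries the top bits; each continuation byte carries 6. */
        dst[out++] = (unsigned char)(lead[len] | (c >> (6 * (len - 1))));
        for (size_t k = len - 1; k-- > 0; )
            dst[out++] = (unsigned char)(0x80 | ((c >> (6 * k)) & 0x3F));
    }
    return out;
}
```

Once the text is in a uint32_t array, each element is one code point, so indexing and fixed-width scanning behave the way they do with single-byte characters. Note that a code point is still not always what a user would call a character (combining sequences remain multi-element), so this removes only the encoding-level variable length.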