From: Jakob Bohm
Newsgroups: comp.std.c++
Subject: Re: Unicode support in C++ 17
Organization: WiseMo A/S
Date: Fri, 14 Aug 2015 13:33:14 CST
Message-ID: <_oqdnWYF3MbGYFHInZ2dnUU78IWdnZ2d@giganews.com>

On 12/08/2015 21:00, stg wrote:
>
> I would really like to see improved unicode support in C++ 17. After
> reading the following discussion, I thought maybe I might be able to
> participate in the discussion:
> [ https://groups.google.com/a/isocpp.org/forum/?fromgroups=#!searchin/
> std-proposals/unicode/std-proposals/SGFtQkKE0bU/overview]
>
> Everything in this document reflects my best understanding
> about Unicode, and C++. I would be delighted to have that
> understanding improved or corrected.
>
> I was hoping the knowledgeable folk in this newsgroup might help me
> evaluate some ideas. Please find my thoughts below, and be both
> critical and kind:
>
>
> 1.2 Desired functionality
> ~~~~~~~~~~~~~~~~~~~~~~~~~
>
> 1. composed-character awareness -- single display character may be
>    composed of multiple codepoints, or may be comprised of ligatures.

The subset of programs which care about this consists mostly of those
programs which do additional text formatting (e.g. columns, word line
breaks etc.) and/or control cursor navigation in text input (like a C++
equivalent of GNU readline etc.).  Such programs are generally more
concerned with whether a sequence of codepoints represents a single
screen location (and how big that is) on the actual output device in
use, not with whether a hypothetical mega-implementation of all Unicode
formatting features would do so.  For instance, some display systems
will artificially cause multi-codepoint (and sometimes even
multi-char_t) characters to occupy as much space as their encoding,
while others will not.  Some display systems will do the right-to-left
vs. left-to-right direction shifts automatically, while others expect
applications to reorder displayed characters before output.

One thing that is of general interest, but is too big/slow to be done
implicitly during every string operation, is to convert Unicode strings
to one of the official normalized forms (NFC, NFD, NFKC, NFKD, plus any
future standard form that prevents visually equivalent display strings
from having different encodings, to avoid security attacks that depend
on fooling humans into accepting a made up name that looks just like a
different name they trust, such as 0bama vs. Obama or V1adimir vs.
Vladimir).

These things are already available in libraries such as IBM's libuni.

> 2. multi-byte codepoint awareness.

This is important for UTF-8 and the higher codepoints in UTF-16, and
has always been important for non-Unicode encodings of East Asian
alphabets.  Thus where possible, standard library features for this
should be done as natural extensions / bugfixes for the existing
library functions that have always done this for traditional encodings.
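As a rough illustration of what codepoint awareness means at the char_t
level for UTF-8, here is a minimal sketch.  It assumes the input is
already well-formed UTF-8, and count_code_points is a name made up for
this illustration, not an existing library function.  It relies only on
the fact that UTF-8 continuation bytes always have the bit pattern
10xxxxxx.

```cpp
#include <cassert>   // only needed for the checks in the usage note
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard library function).
// Counts the codepoints in a string that is ASSUMED to hold
// well-formed UTF-8.  Every byte that is not a continuation byte
// (bit pattern 10xxxxxx) starts a new codepoint.
std::size_t count_code_points(const std::string& utf8)
{
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;   // lead byte or plain ASCII byte
        }
    }
    return n;
}
```

For example, the 3-byte string "e\xCC\x81" (e followed by U+0301
COMBINING ACUTE ACCENT) counts as 2 codepoints here, even though most
displays show it as one character; that is the difference between items
1 and 2 above.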
For UTF-8 and to a lesser degree UTF-16, the Unicode standard designers
did extra work to ensure that things like sorting and searching would
work in most cases when naively using routines that only use char_t
values, specifically:

1. No UTF-8 or UTF-16 encoding of a codepoint will match at
   half-character locations when using a char_t based string search
   algorithm.

2. Comparing the UTF-8, UTF-32 or plain UCS-4 encodings of two strings
   using code that treats them simply as arrays of unsigned char or
   char32_t will get the same result and ordering as comparing those
   strings codepoint by codepoint using the equivalent codepoint
   numbers in the Unicode standard.

3. Comparing the UTF-16 encodings of two strings using code that treats
   them simply as arrays of char16_t values will get the same result as
   codepoint by codepoint comparisons, except that codepoints
   U+0000E000 to U+0000FFFF sort after U+0010FFFF rather than between
   U+0000D7FF and U+00010000.  However this odd result is often needed
   for compatibility with existing systems that were originally
   designed for UCS-2, where that was the correct algorithm due to the
   historic non-existence of codepoints at or above U+00010000.

These nice properties do not hold for traditional East Asian encodings,
though some of those encodings may happen to match some locale specific
lexicographic orderings in a similar way.

> 3. char_t indexing -- This is the current default behavior, and I
>    suppose we must keep it for the sake of backward compatibility, and
>    for the implementation of 1 & 2.

Also because this is the most relevant form in the following cases:

1. When processing text strings for purposes of storage or
   transmission, since most storage/transmission systems store/transmit
   bits and bytes, not abstract characters.

2. When using the string class as an efficient and convenient container
   for arrays of non-text bytes.  Such code often gains great benefits
   from the ways string classes differ from vectors/lists of bytes, but
   would fail horribly if the string classes started having opinions on
   what bytes can be stored there.

The computer industry has a long history of the insane costs imposed
when interfaces are defined to process characters (in any character
set) rather than sequences of bits and bytes.  For instance, because
the Internet e-mail protocols were historically defined to operate on
sequences of human readable English characters from a common subset of
ASCII and EBCDIC, even though actual transmission was always ASCII
bytes, every e-mail containing attachments, pictures or non-English
text needs to be transmitted using clunky Base64 and Hex encodings just
in case some mail gateway on the way might temporarily process the
e-mail using arcane character representations (e.g. on older IBM
operating systems).  And this is just one instance of how such a
decision in the past has come back to haunt us.

Thus it is best if most standard library classes, methods, types and
functions are defined to be what some people call "8-bit clean",
meaning that they won't mangle or damage arbitrary binary data given to
them, if at all possible (the classic std::strxxx() and std::wcsxxx()
functions obviously need to treat a char_t value of 0 specially as per
their definitions, but must refrain from mistreating other values).

>
> Currently 3 is the default, but we can get 1&2 compliant behavior for
> much string handling by specifying a locale. We can steer the default
> behavior by setting the global locale, and a great deal of work has
> been done to improve C++'s locale handling (see boost::locale).
>
> I consider that 1 is in fact the usual use-case, and 2 and 3 are
> typically only of interest to library implementers.
>

In my experience, 3 is the most common use case where strings are not
treated as opaque blobs (then there is no difference), the one
exception being country-specific lexicographic ordering (which is never
the same as any sorting done purely for computational efficiency).

Real world situations that truly care about codepoints or display
characters often also care about words and sentences.  For instance in
many locales a list sorted for human consumption should ideally go like
this:

    has one
    hás one
    hat on
    hât on
    have not

This requires processing at the word and sentence level, not just the
code point level.  Such rules tend to reflect the way written text is
usually pronounced (and thus memorized) amongst native speakers in that
culture/language combination.

I have heard rumors that some schools teach computing the other way
round, but that is mostly an artifact of those educators lacking
experience and/or deeper technical understanding before overconfidently
instilling superficial misunderstandings into their pupils.

>
> 1.3 Current behavior
> ~~~~~~~~~~~~~~~~~~~~
>
> Let's consider a concrete example which is likely to be a very common
> use case in the future: migrating legacy code from latin1 to utf-8, or
> a developer who is used to thinking in terms of ascii wants to write a
> new application as a utf-8 application. I think this specific example
> generalizes (e.g. to utf-16 or 32) in a trivial way, but I welcome
> further insight.
>
> The developer may start by setting the global locale. If she wants
> numbers to behave like the c-locale, except when given specific
> context instructions, she might use a boost::locale, or perhaps she
> rolls her own locale, comprising it out of existing facets that suit
> her needs. The relevant detail is that the locale specifies that she
> will be working with a utf-8 character set.
>
> If there is a legacy application being modernized or replaced, she'll
> have to convert data sources and sinks to utf-8, but that's likely to
> be a pretty trivial task.
>
> Streaming operations will work as expected, so she won't have to
> modify the std::iostream and std::stringstream stuff.
>
> std::string will work fine as a container. That's where the good news
> ends.
>
>
> 1.3.1 sorting
> -------------
>
> To use std::sort she would have to specify that the application use
> the locale() operator:
>
> ,----
> | std::sort (str.begin(), str.end(), std::locale);
> `----
>
> As the default sort uses the numeric < operator --
> i.e. it's a byte order sort that is efficient, but not humanly
> meaningful. The above code works but isn't parsimonious.

This depends on the purpose of the sort:

If the sort is used for a purpose where an ASCII application would be
happy to sort lowercase a after uppercase Z, then sorting by (32 bit)
Unicode code point is the natural equivalent, and utf-8 was
specifically designed (this is explicitly stated in the original
standards) such that the naive byte comparison will yield the correct
result with no extra effort.

If the sort is used for a purpose where an ASCII application would want
upper and lower case A/a to sort in close proximity, then the
application will already need to use a more intelligent string
comparison function.
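To make the first case (plain code point order) concrete, here is a
small sketch that checks the ordering property for a few sample code
points, and shows where UTF-16 differs.  The helper names to_utf8,
to_utf16 and bytes_less are invented for this illustration, and the
encoders assume valid code points up to U+10FFFF; this is not
production conversion code.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// All helpers below are hypothetical, written only for this example.

// Encodes one code point as UTF-8.  Assumes cp <= 0x10FFFF and that cp
// is not a surrogate (0xD800..0xDFFF).
std::string to_utf8(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Encodes one code point as UTF-16 (same assumptions as to_utf8).
std::u16string to_utf16(char32_t cp)
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    std::u16string out;
    if (cp < 0x10000) {
        out += static_cast<char16_t>(cp);
    } else {
        cp -= 0x10000;
        out += static_cast<char16_t>(0xD800 | (cp >> 10));    // lead unit
        out += static_cast<char16_t>(0xDC00 | (cp & 0x3FF));  // trail unit
    }
    return out;
}

// Naive byte-wise "less than", treating every char as unsigned char.
bool bytes_less(const std::string& a, const std::string& b)
{
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](char x, char y) {
            return static_cast<unsigned char>(x) <
                   static_cast<unsigned char>(y);
        });
}
```

With these, bytes_less(to_utf8(0xFFFD), to_utf8(0x10000)) is true,
matching code point order, whereas to_utf16(0xFFFD) < to_utf16(0x10000)
is false, because U+10000 is encoded with a leading 0xD800 unit.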
For ASCII a simple case-insensitive string compare function would do
the trick, while for anything else, the application would need a highly
locale-sensitive non-trivial comparison function such as the
parametrized string comparison function from the Unicode standard (that
function takes a bunch of parameters specifying most of the commonly
occurring locale oddities, such as rules for the treatment of accents,
uppercase/lowercase, multiple spaces and even punctuation), or more
practically a truly locale specific comparison function that can take
into account locale-specific issues not covered by such a generic
function.

In practice this would simply involve delegating the comparison
operation to a virtual method of the locale object, of which there can
be several depending on usage context; for instance some locales have
different rules for sorting dictionaries versus phone books.

>
>
> 1.3.2 find and substr
> ---------------------
>
> Consider:
> ,----
> | auto pos1 = foo.find(someChar);
> | // sanity check...
> | auto bar = foo.substr(pos1, 3);
> `----
>
> The determination of pos1 can fail because it might find a match
> inside a composite character. The determination of bar will
> fail whenever there's a composite or multi-byte character within
> the next three positions.

For all the standard UNICODE encodings (except UTF-7, a victim of the
e-mail design mistake previously mentioned), the encoding has been
designed to guarantee that searching for a valid encoding of a string
or character in a valid encoding of a string will not result in false
matches.  However for any encoding that uses multiple char_t-s to
represent a single code point, code point operations must be treated as
substring operations, never as character operations.

In your example above, if someChar is of type char_t, then it can only
be a single-char_t codepoint, if it is a codepoint at all.
If someChar is of type string, then extracting text where it was found
should already account for someChar.length(), whatever unit that
function measures its result in.

pos1 can use any unit of measurement: inches of paper, microliters of
ink, count of codepoints etc., but count of char_t-s is just as useful
for values that are treated simply as abstract non-iterable iterators.

As for the second step of extracting a known character plus the next
two characters, such an operation makes sense only when the context
makes clear why exactly two extra characters are requested, and whether
that reason refers to two display characters, two codepoints or two
char_t-s.  This semantic problem cannot be defaulted away without
leading to lots of malfunctioning applications (namely those that
needed either of the other two semantics in that particular code line,
unrelated to what the rest of the application needs in unrelated code
lines).

For instance if we are looking for a marker sign followed by a
two-letter abbreviation in some human-originated convention, then one
must look at that convention to see if these abbreviations are defined
to consist of two display characters, two codepoints or two char_t-s,
taking into account that many real world human-written documents will
use those words to refer to any of the other two meanings.  If the
relevant specification is unclear, then the conversion of this program
from ASCII to utf-8 is the perfect time to settle that ambiguity before
failing to interoperate with another application whose author would
otherwise have interpreted the convention differently.
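A minimal sketch of the difference between asking for three char_t-s
and asking for three codepoints, assuming well-formed UTF-8 held in a
std::string.  substr_code_points is a hypothetical helper written for
this illustration, not a proposed or existing library function; a
display-character version would additionally need combining-character
data and is not attempted here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the part of `utf8` that starts at byte
// offset `pos` and spans `count` codepoints (fewer if the string ends
// first).  ASSUMES well-formed UTF-8 and that `pos` is on a codepoint
// boundary, e.g. a position returned by find() for a valid needle.
std::string substr_code_points(const std::string& utf8,
                               std::size_t pos, std::size_t count)
{
    assert(pos <= utf8.size());
    std::size_t end = pos;
    while (end < utf8.size() && count > 0) {
        ++end;  // step over the lead byte (or ASCII byte)
        // then over its continuation bytes (bit pattern 10xxxxxx)
        while (end < utf8.size() &&
               (static_cast<unsigned char>(utf8[end]) & 0xC0) == 0x80) {
            ++end;
        }
        --count;
    }
    return utf8.substr(pos, end - pos);
}
```

Take foo = "a\xC3\xA9\xE2\x82\xACz" (a, U+00E9, U+20AC, z).  Searching
for the needle "\xC3\xA9" with foo.find() gives byte position 1, and
searching for "\xC2\xA9" (U+00A9) finds nothing, as expected.  But
foo.substr(1, 3) then cuts the 3-byte U+20AC sequence after its first
byte, while substr_code_points(foo, 1, 3) returns the three codepoints
U+00E9, U+20AC and z intact.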
If on the other hand we are looking to display the beginning of a text
in a narrow indicator field, then we obviously want 3 display character
cells, using whichever definition of that concept matches the actual
properties of the intended output device.  We might even want to change
this to the first "3em" of the text using a specific font such as
"Helvetica", or the first 3 6-point cells in braille.

>
>
> 1.4 My naive proposal:
> ~~~~~~~~~~~~~~~~~~~~~~
>
> - A std::basic_string has a locale awareness, either NONE (default,
>   current implementation), CODEPOINT (mainly for library implementers
>   who want to investigate codepoints, not composed characters), and
>   COMPOSITE (alternatively DISPLAY, or CHARACTER -- a displayable
>   character).
> - std::locale gets a cc_iterator (composed-character iterator --
>   iterates over displayable characters).
> - std::locale gets a cp_iterator (codepoint iterator -- iterates over
>   displayable characters. for utf-32 locales this is just the byte
>   operator)
> - std::string methods use the locale-aware iterators if the string is
>   locale-aware. So size() returns the number of displayable characters
>   for a std::string, the number of codepoints for a
>   std::string, and the number of bytes for a
>   std::string
>
> For a locale-aware string, the following behavior would change:
> - std::sort would use the locale's () operator by default. Maps with
>   a la_string key would work in a locale aware way, maps with a
>   std::string would work with the old byte <.
> - integer positional arguments would refer to *composed characters*.
>   So s.substr(pos,3) would give the last 3 display
>   characters, regardless of whether or not they are ligatures,
>   composed, or simply 1-byte ascii codepoints. That would apply to
>   str[i] and str.size() as well.
>
>
> 1.4.1 Pros
> ----------
>
> - updating legacy code should be almost-trivial -- change the
>   string construction to create locale-aware strings, and everything
>   should work as desired.
>

Only if that is the desired behavior, which often it is not once one
starts looking at the code details.

> - Minimal language pollution. Seems consistent with current language
>   design.
>
>
> 1.4.2 Cons
> ----------
>
> - What to do when comparing std::string with a
>   std::string? I suggest default behavior is
>   byte-comparison, but compilers should generate a warning. May need
>   to introduce a cast operations to avoid the warning.
> - I don't see a way to prevent a developer from setting an
>   incompatible locale, and using an incompatible string. I suppose
>   this would have to throw an exception.
> - std::string or std::la_string is clunky.
>
>
> 1.5 Questions
> ~~~~~~~~~~~~~
>
> - change locale awareness via typecasting?
>

Having all that locale-aware code in std::basic_string will seriously
bloat any application wanting only the non-locale aware form.  It is
thus better to have std::basic_lstring as a subclass of
std::basic_string, such that all the extra code will not be linked into
statically linked utility programs that don't need this extra library
code.

Making std::basic_string a protected base class of std::basic_lstring
will have additional benefits:

- accidentally mixing string and lstring types will cause type errors
  except where std::basic_lstring provides overloaded operations to
  handle the combination.

- functions that need to be much more complex in std::basic_lstring can
  do this without forcing their simpler cousins in std::basic_string to
  be virtual and incur the resulting call overhead, which may easily
  exceed the low cost of the trivial non-locale implementations.

As an alternative to hiding the basic_string properties of a
basic_lstring, one could use different names for the non-basic
operations while keeping the basic operations from the base class
available.  For example:

    size_t length() const;   // Length in char_t units, usually
                             // quick, inherited from basic_string
    size_t vlength() const;  // Number of codepoints in string.
                             // Often expensive and charset
                             // dependent, but may be cached
                             // for speed.
    size_t tlength() const;  // Text length in ideal screen
                             // character cells, assuming a
                             // semi-ideal display which merges
                             // all accents etc. into the main
                             // cell and uses no space for any
                             // occurrence of formatting specials
                             // such as the BOM.
                             // Expensive
    size_t hlength() const;  // Text length in ideal screen
                             // character halfwidth cells,
                             // assuming an ideal Asian (east) display
                             // which merges all accents etc. into
                             // the main cell, treats western
                             // characters as half-width unless
                             // explicitly marked full-width in the
                             // character standard.  Also counts no
                             // space for non-spacing and formatting
                             // characters.
                             // Expensive
    size_t flength() const;  // Text length in ideal screen
                             // character fullwidth cells,
                             // assuming an ideal Asian (east) display
                             // which merges all accents etc. into
                             // the main cell, treats western
                             // characters as full-width unless
                             // explicitly marked half-width in the
                             // character standard.  Also counts no
                             // space for non-spacing and formatting
                             // characters.
                             // Expensive

Similarly for the various substring and indexing operations.


P.S.

In the above document I distinguish explicitly between:

UCS-4: 4-byte/31-bit char32_t encoding of the full potential of the
Unicode Character Set, allowing codepoints from U+00000000 to
U+7FFFFFFF.  Note that the sign bit is still reserved, just as it was
in 1-byte/7-bit ASCII.

UTF-32: 4-byte/31-bit char32_t encoding of the subset of the Unicode
Character Set which can be encoded using the current UTF-16 encoding,
i.e. the codepoints U+00000000 to U+0010FFFF inclusive.  This is the
subset that will be assigned meanings first, just as the codepoints
from 0 to 127 were the first to be assigned in ASCII-derived character
sets.

UCS-2: Historic 2-byte/16-bit char16_t encoding of the first 64K code
points in the Unicode Character set.
More than 20 years ago some believed this and not UCS-4 would become
the final standard and thus designed protocols and systems accordingly;
this includes the designs of Java, Microsoft Windows, and mobile text
messaging (SMS) standards of 160 7-bit chars or 70 16-bit chars.

UTF-16: An encoding of the first about 1 million Unicode codepoints
which is the same as UCS-2 for the common codepoints and a special
char16_t[2] encoding of codepoints from U+00010000 to U+0010FFFF.  This
is mostly used when retrofitting UCS-2 systems to support a larger
number of Unicode codepoints.

UTF-8: An encoding of the full Unicode character range from U+00000000
to U+7FFFFFFF using a variable number of 8-bit chars such that the
ASCII subset U+00000000 to U+0000007F encodes as itself and having many
other practical properties.  Many official documents have changed the
original UTF-8 definition to formally prohibit the encoding of
codepoints that cannot be encoded using UTF-16, but I view this as
short sighted and potentially subject to future reversal.


Enjoy

Jakob
--
Jakob Bohm, CIO, Partner, WiseMo A/S.  https://www.wisemo.com
Transformervej 29, 2860 Søborg, Denmark.  Direct +45 31 13 16 10
This public discussion message is non-binding and may contain errors.
WiseMo - Remote Service Management for PCs, Phones and Embedded