| From | gah4 <gah4@u.washington.edu> |
|---|---|
| Newsgroups | comp.compilers |
| Subject | Re: Lexing Unicode strings? |
| Date | 2021-05-04 14:47 -0700 |
| Organization | Compilers Central |
| Message-ID | <21-05-004@comp.compilers> |
| References | <21-05-001@comp.compilers> <21-05-002@comp.compilers> |
On Tuesday, May 4, 2021 at 8:33:50 AM UTC-7, gah4 wrote:

(snip, I wrote)

> Note that in addition to having a 16 bit Unicode char, the Java language
> itself is defined in terms of Unicode. Variable names can be any Unicode
> letter, followed by Unicode letters and digits. Presumably, then, the
> designers of Java compilers have figured this out, I suspect using the 16 bit char.

(snip)

> Yes, Unicode can be fun!

> [Remember that Unicode is a 21 bit code and for characters outside the first 64K,
> Java's UTF-16 uses pairs of 16 bit chars known as surrogates that make UTF-8 seem clean and beautiful. -John]

I did know that Java used 16 bits, but never tried to figure out what they did with the rest of the characters. There should be enough in the first 64K for writing programs.

I did once use π for a variable name, with the obvious value. It seems it is \u03c0. I even found an editor that allowed entering such characters, and then would write out the file with \u escapes. As far as I know, that is more usual than UTF-8.

I believe that the Java parser converts \u escapes fairly early, such that you can quote strings with \u0022, and then you should be able to put \uu0022 inside the strings.

[If you're only going to allow the lower 64K, your users will be sad when they try to use quoted strings with uncommon Chinese characters or with emoji, or more likely your compiler will barf since they will be encoded as two surrogate characters and your lexer won't know what to do with them. If you're going to deal with Unicode, better bite the bullet and deal with the whole mess. -John]
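
To make the two points above concrete, here is a minimal Java sketch. It is not how javac or any particular compiler is written. The class name `UnicodeLexSketch` and the method `scanIdentifier` are hypothetical names made up for illustration. The only library calls are the standard `String.codePointAt` and the `java.lang.Character` code point methods. The first part relies on \u escapes being translated before tokenization. As far as the escape grammar in the language specification goes, extra u's are permitted and translate to the same character, so \uu0022 would end a string literal just as \u0022 does, rather than embed a quote in it. The second part scans an identifier by code point instead of by 16 bit char, so a surrogate pair is classified as one character.

```java
// Hypothetical sketch; UnicodeLexSketch and scanIdentifier are illustrative
// names, not part of javac or any library.
final class UnicodeLexSketch {

    // Returns the index just past the identifier that begins at `start`,
    // or `start` itself if no identifier begins there. The scan advances by
    // code point, so a surrogate pair is classified as one character.
    static int scanIdentifier(String src, int start) {
        int i = start;
        while (i < src.length()) {
            int cp = src.codePointAt(i);
            boolean ok = (i == start)
                    ? Character.isJavaIdentifierStart(cp)
                    : Character.isJavaIdentifierPart(cp);
            if (!ok) break;
            i += Character.charCount(cp); // 1 for the first 64K, 2 for a pair
        }
        return i;
    }

    public static void main(String[] args) {
        // Unicode escapes are translated before tokenization, so this
        // declares a variable whose name is the Greek letter pi.
        double \u03c0 = Math.PI;

        // Escaped double quotes can delimit a literal; extra u's are allowed
        // by the escape grammar and translate to the same character.
        String a = \u0022hello\u0022;
        String b = \uu0022hello\uu0022;
        System.out.println(a.equals(b) && a.length() == 5);

        // U+20000 is a CJK ideograph outside the first 64K, stored as two chars.
        String src = "\uD840\uDC00x = " + \u03c0;
        System.out.println(src.codePointAt(0) == 0x20000);
        System.out.println(scanIdentifier(src, 0));

        // A char-at-a-time test sees only the high surrogate and rejects it.
        System.out.println(Character.isJavaIdentifierStart(src.charAt(0)));
    }
}
```

In this sketch the identifier made of U+20000 followed by `x` ends at index 3, because the ideograph occupies two chars. A lexer that classified one 16 bit char at a time would reject the high surrogate on its own and never see the letter, which is the failure the moderator's note describes.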
Re: Lexing Unicode strings? "Johann 'Myrkraverk' Oskarsson" <johann@myrkraverk.com> - 2021-05-03 19:58 -0400
Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 01:11 -0700
Re: Lexing Unicode strings? gah4 <gah4@u.washington.edu> - 2021-05-04 14:47 -0700
Re: Lexing Unicode strings? Hans Aberg <haberg-news@telia.com> - 2021-07-14 15:39 -0400