From: markspace <-@.>
Newsgroups: comp.lang.java.programmer
Subject: Re: unicode
Date: Mon, 12 Sep 2011 20:16:12 -0700

On 9/12/2011 5:46 PM, Lew wrote:
>
> That would defeat its purpose, which is somewhat similar to the
> purpose of trigraphs in C, AIUI.

There are only nine trigraphs, and they're a lot harder to "hit"
accidentally.

> That is, if your keyboard lacks
> certain characters, you can express source in "\u" notation and the
> source parser will read it correctly.

The problem is that \u is a lot more common than ??-. For example, \u
also occurs in regex, which unfortunately seems to be the source of the
OP's confusion.

> Its whole raison d'etre is to
> precede compilation, not to be part of it. So how could it go away?
> What would you do instead?

I'd make the \u sequence a string and character escape. \u000A would be
interpreted the same as \n. It would put a new line in the string, not
in the compiler input.
Every other type of \u escape (comments, parts of code) would
be interpreted literally. Legacy code that relies on \u outside of
strings and character constants would break. If you need to type a
character that your keyboard doesn't have, get your editor to recognize
an escape sequence, not the compiler.

There are also digraphs in C, which are recognized only during
tokenization, not as a preprocessing substitution. These are much
better, as they are not recognized in string literals, character
literals, or comments. I'd consider replacing \u for "missing keys"
with something like C's digraphs. There are only five digraphs in C.

The presence of \u in comments is especially pernicious, imo. The
Javadoc tool already has HTML escapes; we don't need a second, redundant
method of specifying unusual characters.
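
To make the regex point concrete, here is a minimal sketch of how current Java treats the same escape at two different stages. The class and method names are made up for illustration. The single-backslash form is translated by the compiler before the string literal is even tokenized, while the doubled-backslash form survives into the string and is decoded later by java.util.regex.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and methods are illustrative only.
final class UnicodeEscapeDemo {

    // The compiler's early translation pass turns the six-character escape
    // below into the single letter A before the string literal is tokenized.
    static String translatedByCompiler() {
        return "\u0041";
    }

    // With the backslash doubled, the translation pass leaves the text alone,
    // so the literal holds six characters: a backslash, 'u', and four digits.
    static String leftForRegexEngine() {
        return "\\u0041";
    }

    public static void main(String[] args) {
        String early = translatedByCompiler();
        String late = leftForRegexEngine();

        System.out.println(early.length()); // 1
        System.out.println(late.length());  // 6

        // Both work as patterns matching "A": the first because it already is
        // the letter A, the second because the regex engine decodes the escape.
        System.out.println(Pattern.matches(early, "A")); // true
        System.out.println(Pattern.matches(late, "A"));  // true
    }
}
```

The comments in the sketch spell out "backslash" on purpose. A bare backslash-u followed by anything other than four hex digits is rejected by javac even inside a comment, which is exactly the hazard described above.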