From: markspace <-@.>
Newsgroups: comp.lang.java.programmer
Subject: Re: unicode
Date: Mon, 12 Sep 2011 20:16:12 -0700
Organization: A noiseless patient Spider
Lines: 40
Message-ID: <j4mhtv$ppb$1@dont-email.me>
References: <6c991195-ab57-417c-92e0-6d5ee1c451dc@dq7g2000vbb.googlegroups.com> <nfss679ije8c4r70tn9kmnr055vm6nfua0@4ax.com> <4e6e7a2a$0$309$14726298@news.sunsite.dk> <j4m4rs$l5g$1@dont-email.me> <88ff0d8c-af5f-4086-8232-26c80e5d8270@glegroupsg2000goo.googlegroups.com>

On 9/12/2011 5:46 PM, Lew wrote:
>
> That would defeat its purpose, which is somewhat similar to the
> purpose of trigraphs in C, AIUI.


There are only nine trigraphs, and they're a lot harder to "hit" accidentally.


>  That is, if your keyboard lacks
> certain characters, you can express source in "\u" notation and the
> source parser will read it correctly.


The problem is that \u is a lot more common than ??-.  For example, \u 
also occurs in regex, which unfortunately seems to be the OP's confusion.
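
To make the two meanings of \u concrete, here is a minimal sketch. The class and member names are mine, made up for illustration. A single backslash in source is consumed by the compiler before the string exists; a doubled backslash leaves the six characters in the string for the regex engine to interpret.

```java
import java.util.regex.Pattern;

class UnicodeEscapeDemo {
    // Translated by the compiler before tokenization: the literal holds one char, 'A'.
    static final String COMPILER_ESCAPE = "\u0041";

    // The doubled backslash survives into the string, so the regex engine
    // receives six characters and resolves the escape itself.
    static final String REGEX_ESCAPE = "\\u0041";

    // True if the given regex matches the one-character input "A".
    static boolean matchesA(String regex) {
        return Pattern.matches(regex, "A");
    }

    public static void main(String[] args) {
        System.out.println(COMPILER_ESCAPE.length()); // 1
        System.out.println(REGEX_ESCAPE.length());    // 6
        System.out.println(matchesA(COMPILER_ESCAPE)); // true
        System.out.println(matchesA(REGEX_ESCAPE));    // true
    }
}
```

Both forms match here, which is probably why the two get confused. They stop being interchangeable as soon as the escaped character means something to the compiler, such as a quote or a line terminator.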


>  Its whole raison d'etre is to
> precede compilation, not to be part of it.  So how could it go away?
> What would you do instead?


I'd make the \u sequence a string and character escape.  \u000A would be
interpreted the same as \n: it would put a newline in the string, not
in the compiler input.  A \u sequence anywhere else (comments,
identifiers, other code) would be read literally.  Legacy code that
relies on \u outside of strings and character constants would break.  If
you need to type a character that your keyboard doesn't have, get your
editor to recognize an escape sequence, not the compiler.
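
For reference, this is roughly what the current rules allow, as a small sketch with hypothetical names. The escape is translated before tokenization, so it is legal in an identifier, while the line-feed escape cannot be used inside a literal at all.

```java
class EscapeOutsideLiterals {
    static int identifierDemo() {
        // The escape below is the letter 'a', so this declares a variable named a.
        int \u0061 = 41;
        return a + 1;
    }

    static String lineFeed() {
        // Spelling the line feed (hex 000A) with a backslash-u escape inside the
        // quotes would split the literal across two lines and fail to compile,
        // so the ordinary string escape is the only working form today.
        return "\n";
    }
}
```

Under the change I'm describing, the first method would stop compiling and the second could use either spelling.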

There are also digraphs in C, which are recognized only during
tokenization, not substituted in an earlier pass the way trigraphs are.
These are much better, as they are not recognized in string literals,
character literals, or comments.  I'd consider replacing \u for "missing
keys" with C's digraphs.  There are only five digraphs in C.

The presence of \u in comments is especially pernicious, imo.  The
Javadoc tool already has HTML escapes; we don't need a second, redundant
method of specifying unusual characters.
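
A short sketch of the comment problem, with a made-up class name. The line-feed escape ends the // comment early, so whatever follows it on the same source line is compiled as code.

```java
class CommentEscapeDemo {
    static int counter() {
        int n = 0;
        // The comment on the next line hides a statement: the line-feed escape
        // terminates the comment, and the increment after it is compiled.
        // This looks like a comment only. \u000A n++;
        return n;
    }
}
```

Here counter() returns 1, not 0.  The other direction is just as annoying: a comment mentioning a Windows path such as C:\users\me is a compile error, because \u followed by non-hex characters is an illegal escape.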