
unicode

Started by	bob <bob@coolgroups.com>
First post	2011-09-12 12:24 -0700
Last post	2011-09-12 15:48 -0700
Articles	20 on this page of 24 — 8 participants


  unicode bob <bob@coolgroups.com> - 2011-09-12 12:24 -0700
    Re: unicode Knute Johnson <nospam@knutejohnson.com> - 2011-09-12 14:04 -0700
    Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 14:08 -0700
      Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-12 17:31 -0400
        Re: unicode markspace <-@.> - 2011-09-12 16:33 -0700
          Re: unicode Lew <lewbloch@gmail.com> - 2011-09-12 17:46 -0700
            Re: unicode markspace <-@.> - 2011-09-12 20:16 -0700
            Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 22:05 -0700
              Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 22:10 -0700
              Re: unicode Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at> - 2011-09-13 07:18 +0000
          Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-12 20:57 -0400
            Re: unicode markspace <-@.> - 2011-09-12 19:51 -0700
              Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-13 20:17 -0400
                Re: unicode markspace <-@.> - 2011-09-13 19:32 -0700
                  Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-14 11:49 -0700
                  Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-11-07 21:30 -0500
                    Re: unicode markspace <-@.> - 2011-11-07 19:18 -0800
                      Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-11-07 22:47 -0500
                        Re: unicode markspace <-@.> - 2011-11-07 21:12 -0800
            Re: unicode Paul Cager <paul.cager@googlemail.com> - 2011-09-13 04:05 -0700
          Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 22:02 -0700
            Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-13 20:30 -0400
    Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-12 17:29 -0400
      Re: unicode Lew <lewbloch@gmail.com> - 2011-09-12 15:48 -0700


#7914 — unicode

From	bob <bob@coolgroups.com>
Date	2011-09-12 12:24 -0700
Subject	unicode
Message-ID	<6c991195-ab57-417c-92e0-6d5ee1c451dc@dq7g2000vbb.googlegroups.com>

Why does Java complain when I do this?

	html = html.replaceAll("\u000A", " ");

This one works:

		// nbsp
		html = html.replaceAll("\u00A0", " ");


#7915

From	Knute Johnson <nospam@knutejohnson.com>
Date	2011-09-12 14:04 -0700
Message-ID	<j4ls4c$pa8$1@dont-email.me>
In reply to	#7914

On 9/12/2011 12:24 PM, bob wrote:
> Why does Java complain when I do this?
>
> 	html = html.replaceAll("\u000A", " ");
>
> This one works:
>
> 		// nbsp
> 		html = html.replaceAll("\u00A0", " ");

That was interesting.  From the docs on java.util.regex.Pattern:

"Comparison to Perl 5

The Pattern engine performs traditional NFA-based matching with ordered 
alternation as occurs in Perl 5.

Perl constructs not supported by this class:
     .
     .
     .
     The preprocessing operations \l \u, \L, and \U."

You can however do this, which is even more interesting!

public class test {
     public static void main(String[] args) {
         String html = "hello world!\n";
         html = html.replaceAll("\\x{a}","~");
         System.out.println(html);
     }
}

C:\Documents and Settings\Knute Johnson>java test
hello world!~

From the Java Language Specification:

"Because Unicode escapes are processed very early, it is not correct to 
write "\u000a" for a string literal containing a single linefeed (LF); 
the Unicode escape \u000a is transformed into an actual linefeed in 
translation step 1 (§3.3) and the linefeed becomes a LineTerminator in 
step 2 (§3.4), and so the string literal is not valid in step 3. 
Instead, one should write "\n" (§3.10.6). Similarly, it is not correct 
to write "\u000d" for a string literal containing a single carriage 
return (CR). Instead use "\r"."

And there you have it!
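
A minimal sketch of the workarounds that follow from the JLS passage above. The class and method names are made up for illustration; only String.replaceAll, String.replace and the regex escape syntax are real API. The key assumption is that the goal is simply to turn each linefeed into a space.

```java
final class NewlineReplace {
    // Option 1: the string escape for LF, which the lexer handles after
    // Unicode translation has already happened.
    static String viaStringEscape(String html) {
        return html.replaceAll("\n", " ");
    }

    // Option 2: a doubled backslash stops javac from translating the escape,
    // so the regex engine receives the six characters and decodes them itself.
    static String viaRegexEscape(String html) {
        return html.replaceAll("\\u000A", " ");
    }

    // Option 3: no regex at all; String.replace works on literal characters.
    static String viaPlainReplace(String html) {
        return html.replace('\n', ' ');
    }

    public static void main(String[] args) {
        String html = "hello\nworld";
        System.out.println(viaStringEscape(html));
        System.out.println(viaRegexEscape(html));
        System.out.println(viaPlainReplace(html));
    }
}
```

Each call should print "hello world". Option 3 is probably the simplest choice when no pattern matching is needed.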

-- 

Knute Johnson


#7916

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-09-12 14:08 -0700
Message-ID	<nfss679ije8c4r70tn9kmnr055vm6nfua0@4ax.com>
In reply to	#7914

On Mon, 12 Sep 2011 12:24:47 -0700 (PDT), bob <bob@coolgroups.com>
wrote, quoted or indirectly quoted someone who said :

>	html = html.replaceAll("\u000A", " ");
that expands to

html = html.replaceAll("
", " " );

\u is treated in a rather flat-footed way, as if by a preprocessor.

see http://mindprod.com/jgloss/literal.html
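
To illustrate that the translation happens before the compiler sees any tokens, here is a small sketch (the class and method names are hypothetical) in which an escape stands in for part of an identifier as well as for a character literal:

```java
final class EscapeTranslation {
    static int answer() {
        int \u0061 = 41;      // after translation this declares a variable named a
        return a + 1;
    }

    static char letter() {
        return '\u0041';      // the same as writing 'A'
    }

    public static void main(String[] args) {
        System.out.println(answer());  // 42
        System.out.println(letter());  // A
    }
}
```

The escape in the declaration and the plain "a" in the return statement refer to the same variable, which is why the linefeed escape in the original question ends up splitting the string literal across two lines.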
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is, 
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)


#7918

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2011-09-12 17:31 -0400
Message-ID	<4e6e7a2a$0$309$14726298@news.sunsite.dk>
In reply to	#7916

On 9/12/2011 5:08 PM, Roedy Green wrote:
> On Mon, 12 Sep 2011 12:24:47 -0700 (PDT), bob<bob@coolgroups.com>
> wrote, quoted or indirectly quoted someone who said :
>
>> 	html = html.replaceAll("\u000A", " ");
> that expands to
>
> html = html.replaceAll("
> ", " " );
>
> \u is treated in rather flat footed way, as if by a preprocessor.

It is treated per spec.

And I would not use the term preprocessor - it is Java not C.

Arne


#7931

From	markspace <-@.>
Date	2011-09-12 16:33 -0700
Message-ID	<j4m4rs$l5g$1@dont-email.me>
In reply to	#7918

On 9/12/2011 2:31 PM, Arne Vajhøj wrote:

> On 9/12/2011 5:08 PM, Roedy Green wrote:
>> \u is treated in rather flat footed way, as if by a preprocessor.

> It is treated per spec.

Actually I agree with Roedy on this one.  Per spec or not, it's a dumb 
idea.  I think it should go away, frankly.

> And I would not use the term preprocessor - it is Java not C.

I've always heard this part of the Java compiler described as a 
preprocessor.  Is there some other documentation that refers to it 
differently?


#7936

From	Lew <lewbloch@gmail.com>
Date	2011-09-12 17:46 -0700
Message-ID	<88ff0d8c-af5f-4086-8232-26c80e5d8270@glegroupsg2000goo.googlegroups.com>
In reply to	#7931

markspace wrote:
> Arne Vajhï¿½j wrote:

markspace, you need to post in a way that preserves characters.

>> Roedy Green wrote:
>>> \u is treated in rather flat footed way, as if by a preprocessor.
> 
>> It is treated per spec.
> 
> Actually I agree with Roedy on this one.  Per spec or not, it's a dumb 
> idea.  I think it should go away, frankly.

That would defeat its purpose, which is somewhat similar to the purpose of trigraphs in C, AIUI.  That is, if your keyboard lacks certain characters, you can express source in "\u" notation and the source parser will read it correctly.  Its whole raison d'etre is to precede compilation, not to be part of it.  So how could it go away?  What would you do instead?

Anyway, per spec is what we have to live with, like it or not.

>> And I would not use the term preprocessor - it is Java not C.
> 
> I've always heard this part of the Java compiler described as a 
> preprocessor.  Is there some other documentation that refers to it 
> differently?

-- 
Lew


#7947

From	markspace <-@.>
Date	2011-09-12 20:16 -0700
Message-ID	<j4mhtv$ppb$1@dont-email.me>
In reply to	#7936

On 9/12/2011 5:46 PM, Lew wrote:
>
> That would defeat its purpose, which is somewhat similar to the
> purpose of trigraphs in C, AIUI.

There are only nine trigraphs, and they're a lot harder to "hit" accidentally.

>  That is, if your keyboard lacks
> certain characters, you can express source in "\u" notation and the
> source parser will read it correctly.

The problem is that \u is a lot more common than ??-.  For example, \u 
also occurs in regex, which unfortunately seems to be the OP's confusion.

>  Its whole raison d'etre is to
> precede compilation, not to be part of it.  So how could it go away?
> What would you do instead?

I'd make the \u sequence a string and character escape.  \u000A would be 
interpreted the same as \n.  It would put a new line in the string, not 
in the compiler input.  Every other type of \u escape (comments, parts 
of code) would be interpreted literally.  Legacy code that relies on \u 
outside of strings and character constants would break.  If you need to 
type a character that your keyboard doesn't have, get your editor to 
recognize an escape sequence, not the compiler.

There are also digraphs in C, which are only recognized in tokenization, 
not as a preprocessed type of substitution.  These are much better, as 
they are not recognized in string literals, character literals, or 
comments.  I'd consider replacing \u for "missing keys" with C's 
digraphs.  There are only five digraphs in C.

The presence of \u in comments is especially pernicious, imo.  The Java 
doc tool already has HTML escapes, we don't need a second redundant 
method of specifying unusual characters.
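
A small sketch of why escapes in comments can bite. The class name is invented for illustration; the behaviour relies only on the translation step quoted from the JLS earlier in the thread. The escape inside the line comment becomes a real line terminator, so the text after it is compiled as code:

```java
final class CommentEscape {
    static String run() {
        StringBuilder sb = new StringBuilder();
        sb.append("before;");
        // This line comment ends early: \u000A sb.append("hidden;");
        sb.append("after");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());  // before;hidden;after
    }
}
```

The middle append looks commented out but still runs, since the compiler sees it on a line of its own.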


#7957

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-09-12 22:05 -0700
Message-ID	<ovot67hkb6a3lahse3jmo61fkpiqbkc36g@4ax.com>
In reply to	#7936

On Mon, 12 Sep 2011 17:46:51 -0700 (PDT), Lew <lewbloch@gmail.com>
wrote, quoted or indirectly quoted someone who said :

> Its whole raison d'etre is to precede compilation, not to be part
> of it.  So how could it go away?  What would you do instead?

You can't change the meaning of existing code, but if some language
evolved out of java it could redefine

out.println( "\u000a"); 
to have the same meaning as
out.println( "\n"); 

Though I doubt this would break many real-world programs even if you
changed the definition in Java 1.8.

-- 
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is, 
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)


#7961

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-09-12 22:10 -0700
Message-ID	<b8pt6711n3a1c43eir37igu3874pv29b8k@4ax.com>
In reply to	#7957

On Mon, 12 Sep 2011 22:05:19 -0700, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>
>You can't change the meaning of existing code, but if some language
>evolved out of java it could redefine
>
>out.println( "\u000a"); 
>to have the same meaning as
>out.println( "\n"); 
>
>Though I doubt this would break many real-world programs even if you
>changed the definition in Java 1.8.

Why fix it ever?

1. the current syntax strongly violates the principle of least
surprise. It amounts to a Java newbie hazing ritual.

2. It is a pointless lack of consistency.  You have to treat some
Unicode characters in various special magic ways. This is particularly
a nuisance for code generation. For that you want as simple an
algorithm as possible to produce a String literal from a String.

It was a smart Alec idea that was not thought through far enough. It
was seductively easy to implement.
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is, 
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)


#7964

From	Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at>
Date	2011-09-13 07:18 +0000
Message-ID	<slrnj6u0tn.6gl.avl@gamma.logic.tuwien.ac.at>
In reply to	#7957

General context was:
> "\u000a"

If line breaks were simply allowed to occur within
strings, then this particular problem would disappear,
leaving only "\u0022" (double quote) as a problem.

If the argument was to enable writing of characters not
present on certain keyboards, then what would be the procedure
for synthesizing a backslash on a keyboard without such?
\u005c ? ;-)
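
Both cases can be tried out directly. The sketch below uses hypothetical names and assumes only the JLS rule that a translated character takes part in ordinary tokenization: the escaped double quote closes the string literal, and the escaped backslash goes on to form an ordinary string escape with the letter after it.

```java
final class QuoteAndBackslash {
    // After translation the escape is a real double quote, so the literal
    // is just "abc" and the rest of the line is a method call on it.
    static int lengthViaEscapedQuote() {
        return "abc\u0022.length();
    }

    // The escape yields a backslash, which then combines with the n that
    // follows it into the usual newline escape.
    static String backslashThenN() {
        return "\u005cn";
    }

    public static void main(String[] args) {
        System.out.println(lengthViaEscapedQuote());        // 3
        System.out.println(backslashThenN().equals("\n"));  // true
    }
}
```

So a backslash can indeed be synthesized that way, although the result still has to make sense as an escape sequence once it is inside a literal.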


#7939

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2011-09-12 20:57 -0400
Message-ID	<4e6eaa8a$0$305$14726298@news.sunsite.dk>
In reply to	#7931

On 9/12/2011 7:33 PM, markspace wrote:
> On 9/12/2011 2:31 PM, Arne Vajhøj wrote:
>> On 9/12/2011 5:08 PM, Roedy Green wrote:
>>> \u is treated in rather flat footed way, as if by a preprocessor.
>
>> It is treated per spec.
>
> Actually I agree with Roedy on this one. Per spec or not, it's a dumb
> idea. I think it should go away, frankly.

I can not think of a better way to solve the problem that this
construct solves.

>> And I would not use the term preprocessor - it is Java not C.
>
> I've always heard this part of the Java compiler described as a
> preprocessor. Is there some other documentation that refers to it
> differently?

JLS uses "translating" and "translation".

Arne


#7946

From	markspace <-@.>
Date	2011-09-12 19:51 -0700
Message-ID	<j4mgeo$h1c$2@dont-email.me>
In reply to	#7939

On 9/12/2011 5:57 PM, Arne Vajhøj wrote:
> I can not think of a better way to solve the problem that this
> construct solves.


Which problem is that?  Because I honestly can't think of a single use 
case for it.


#7997

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2011-09-13 20:17 -0400
Message-ID	<4e6ff2a9$0$313$14726298@news.sunsite.dk>
In reply to	#7946

On 9/12/2011 10:51 PM, markspace wrote:
> On 9/12/2011 5:57 PM, Arne Vajhøj wrote:
>> I can not think of a better way to solve the problem that this
>> construct solves.
>
> Which problem is that? Because I honestly can't think of a single use case
> for it.

To quote the JLS:

<quote>
The Java programming language specifies a standard way of transforming a 
program written in Unicode into ASCII that changes a program into a form 
that can be processed by ASCII-based tools. The transformation involves 
converting any Unicode escapes in the source text of the program to 
ASCII by adding an extra u-for example, \uxxxx becomes \uuxxxx-while 
simultaneously converting non-ASCII characters in the source text to 
Unicode escapes containing a single u each.

This transformed version is equally acceptable to a compiler for the 
Java programming language ("Java compiler") and represents the exact 
same program. The exact Unicode source can later be restored from this 
ASCII form by converting each escape sequence where multiple u's are 
present to a sequence of Unicode characters with one fewer u, while 
simultaneously converting each escape sequence with a single u to the 
corresponding single Unicode character.
</quote>

It allows you to use any Unicode characters in names and strings with
tools that do not support those characters.
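
The Unicode-to-ASCII direction described in the quote can be sketched roughly as follows. This is an illustration written for this thread, not the JDK's own tooling, and the names are made up. It assumes the JLS rule that a backslash starts a Unicode escape only when it is preceded by an even number of other backslashes, and it does not handle the corner case of a non-ASCII character that directly follows an unescaped backslash.

```java
final class AsciiTransform {
    /** Adds one extra u to existing escapes and escapes every non-ASCII char. */
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int backslashRun = 0;  // length of the run of backslashes just before this char
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '\\') {
                backslashRun++;
                out.append(c);
                continue;
            }
            if (c == 'u' && backslashRun % 2 == 1) {
                out.append("uu");                                // existing escape: one more u
            } else if (c > 0x7F) {
                out.append(String.format("\\u%04x", (int) c));   // new single-u escape
            } else {
                out.append(c);
            }
            backslashRun = 0;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The first word contains a real non-ASCII letter; the second part is
        // the six literal characters of an escape already present in the text.
        System.out.println(toAscii("caf\u00e9 \\u0041"));
    }
}
```

The reverse direction would drop one u from each multi-u escape and decode each single-u escape, which is what lets the original Unicode source be restored exactly.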

Arne


#8006

From	markspace <-@.>
Date	2011-09-13 19:32 -0700
Message-ID	<j4p3oh$pa2$1@dont-email.me>
In reply to	#7997

On 9/13/2011 5:17 PM, Arne Vajhøj wrote:

> It allows you to use any Unicode characters in names and strings with
> tools that do not support those characters.

I understand what it does, I just don't think it's a problem.  That is, 
the \u preprocessor escape in Java is just a solution in search of a use 
case that doesn't exist, or at least is so corner-case-ish that it might 
as well not exist.  While at the same time it causes rather huge 
problems, relative to the one it fixes (if any).

Again, I just think the darn thing is pernicious and should be removed. 
  At minimum, it should be removed from comments, that's just silly. 
(And I've personally been bit by the \u thing in a comment twice now. 
It's REALLY annoying when you're trying to comment how you print \u for 
escape processing and you can't because "\u" isn't a valid string in a 
comment.)


#8029

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-09-14 11:49 -0700
Message-ID	<f9t1779pr6a6irlcctkusd9911al27i7rj@4ax.com>
In reply to	#8006

On Tue, 13 Sep 2011 19:32:46 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :

>Again, I just think the darn thing is pernicious and should be removed. 
>  At minimum, it should be removed from comments, that's just silly. 
>(And I've personally been bit by the \u thing in a comment twice now. 
>It's REALLY annoying when you're trying to comment how you print \u for 
>escape processing and you can't because "\u" isn't a valid string in a 
>comment.)

I know I have been bit by this too, but I forget the details.  
Could you give an example of an invalid \u comment and just when the
IDE/compiler complains or missteps?  I would like to enshrine it at
http://mindprod.com/jgloss/gotchas.html#BSU
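
One way to reproduce the complaint is to hand javac a tiny source file whose comment contains a backslash, a u, and then something other than four hex digits. The sketch below does this through the standard javax.tools API. The helper class and its names are invented for illustration, and it assumes a full JDK is running the code, since a bare JRE has no system compiler.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.file.Files;
import java.util.List;
import javax.tools.Diagnostic;
import javax.tools.DiagnosticCollector;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

final class CommentEscapeCheck {
    /** Compiles one in-memory class and reports whether javac accepted it. */
    static boolean compiles(String className, String source) {
        JavaCompiler javac = ToolProvider.getSystemJavaCompiler();
        if (javac == null) {
            throw new IllegalStateException("no system Java compiler (JRE only?)");
        }
        JavaFileObject unit = new SimpleJavaFileObject(
                URI.create("string:///" + className + ".java"), JavaFileObject.Kind.SOURCE) {
            @Override
            public CharSequence getCharContent(boolean ignoreEncodingErrors) {
                return source;
            }
        };
        DiagnosticCollector<JavaFileObject> diagnostics = new DiagnosticCollector<>();
        List<String> options;
        try {
            // Send any class files to a scratch directory, not the working directory.
            options = List.of("-d", Files.createTempDirectory("escape-check").toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        boolean ok = javac.getTask(null, null, diagnostics, options, null, List.of(unit)).call();
        for (Diagnostic<? extends JavaFileObject> d : diagnostics.getDiagnostics()) {
            System.out.println(d.getKind() + " at line " + d.getLineNumber()
                    + ": " + d.getMessage(null));
        }
        return ok;
    }

    public static void main(String[] args) {
        // The comment in Bad holds a lone backslash followed by u and a space.
        String bad = "class Bad {\n    // how to print \\u for escape processing\n}\n";
        // Doubling the backslash in the comment makes the same text harmless.
        String good = "class Good {\n    // how to print \\\\u for escape processing\n}\n";
        System.out.println("Bad compiles:  " + compiles("Bad", bad));
        System.out.println("Good compiles: " + compiles("Good", good));
    }
}
```

javac should reject the first source with an illegal Unicode escape error on the comment line and accept the second. The exact wording of the message may differ between JDK versions.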


-- 
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is, 
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)


#9765

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2011-11-07 21:30 -0500
Message-ID	<4eb89437$0$286$14726298@news.sunsite.dk>
In reply to	#8006

On 9/13/2011 10:32 PM, markspace wrote:
> On 9/13/2011 5:17 PM, Arne Vajhøj wrote:
>> It allows you to use any Unicode characters in names and strings with
>> tools that do not support those characters.
>
> I understand what it does, I just don't think it's a problem. That is,
> the \u preprocessor escape in Java is just a solution in search of a use
> case that doesn't exist, or at least is so corner-case-ish that it might
> as well not exist. While at the same time it causes rather huge
> problems, relative to the one it fixes (if any).

You asked what problem it solves.

Now you know what problem it solves.

You still do not think it is a serious problem, but that is
a different discussion.

I would tend to agree that it is not a problem today, but I doubt
that unicode support was that common when Java 1.0 was brand new.

> Again, I just think the darn thing is pernicious and should be removed.
> At minimum, it should be removed from comments, that's just silly. (And
> I've personally been bit by the \u thing in a comment twice now. It's
> REALLY annoying when you're trying to comment how you print \u for escape
> processing and you can't because "\u" isn't a valid string in a comment.)

Once put in the language, then they can never remove it without breaking
existing code.

Arne


#9768

From	markspace <-@.>
Date	2011-11-07 19:18 -0800
Message-ID	<j9a71d$a3k$1@dont-email.me>
In reply to	#9765

On 11/7/2011 6:30 PM, Arne Vajhøj wrote:
> You asked what problem it solves.
>
> Now you know what problem it solves.
>
> You still do not think it is a serious problem, but that is
> a different discussion.

No, I disagree with that assertion.  If it's not an actual use case, 
something that doesn't actually come from a user, or solve a real user 
need, then it's just a pointless maintenance expense.  Just like any 
other "feature" that nobody needs or uses, it can just be removed.

> Once put in the language, then they can never remove it without breaking
> existing code.

My understanding about these things is that they grep (*) through the code 
base of the most important customers and do an evaluation of the code 
changes required.  The question is "can we afford to make the changes 
this would require?"  It's a ROI question, not slavish devotion to 
backwards compatibility.  Yes the holy grail is "no code changes 
required" but that isn't a given, necessarily.  Sometimes you gotta 
break those eggs to make your omelet.

(*) Figuratively.  Not necessarily using the grep program.  It's a code 
inspection process.


#9771

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2011-11-07 22:47 -0500
Message-ID	<4eb8a64f$0$286$14726298@news.sunsite.dk>
In reply to	#9768

On 11/7/2011 10:18 PM, markspace wrote:
> On 11/7/2011 6:30 PM, Arne Vajhøj wrote:
>> You asked what problem it solves.
>>
>> Now you know what problem it solves.
>>
>> You still do not think it is a serious problem, but that is
>> a different discussion.
>
> No, I disagree with that assertion. If it's not an actual use case,
> something that doesn't actually come from a user, or solve a real user
> need, then it's just a pointless maintenance expense. Just like any
> other "feature" that nobody needs or uses, it can just be removed.

If you search the Java bug database then you will see that
SUN got lots of bug reports including some that were compiler
bugs about this feature.

Somebody did use the feature.

>> Once put in the language, then they can never remove it without breaking
>> existing code.
>
>
> My understanding about these things is that they grep (*) through the code
> base of the most important customers and do an evaluation of the code
> changes required. The question is "can we afford to make the changes
> this would require?" It's a ROI question, not slavish devotion to
> backwards compatibility. Yes the holy grail is "no code changes
> required" but that isn't a given, necessarily. Sometimes you gotta break
> those eggs to make your omelet.

I find it very difficult to see why people (and their employers) that
have coded according to spec should suffer to help people that have
not studied the spec.

Arne


#9773

From	markspace <-@.>
Date	2011-11-07 21:12 -0800
Message-ID	<j9adna$9pk$1@dont-email.me>
In reply to	#9771

On 11/7/2011 7:47 PM, Arne Vajhøj wrote:
> If you search the Java bug database then you will see that
> SUN got lots of bug reports including some that were compiler
> bugs about this feature.
>
> Somebody did use the feature.

I'll take a look.

> I find it very difficult to see why people (and their employers) that
> have coded according to spec should suffer to help people that have
> not studied the spec.

It's still a $ and cents equation in my mind.  "Suffering" doesn't enter 
the equation.  No matter what the feature, there's got to be a point 
where it's rational to drop support for it.  Happens all the time.

You may disagree with that in regards to this particular issue, but I'm 
having a somewhat hard time seeing why it isn't obvious in the general 
case: sometimes features must be dropped.


#7967

From	Paul Cager <paul.cager@googlemail.com>
Date	2011-09-13 04:05 -0700
Message-ID	<029b8f6c-9a19-4ab2-a650-6dcb7ec6d670@w8g2000yqi.googlegroups.com>
In reply to	#7939

On Sep 13, 1:57 am, Arne Vajhøj <a...@vajhoej.dk> wrote:
> On 9/12/2011 7:33 PM, markspace wrote:
> > I've always heard this part of the Java compiler described as a
> > preprocessor. Is there some other documentation that refers to it
> > differently?
>
> JLS uses "translating" and "translation".

The phrase "lexical analysis stage" would also be a good description
(I imagine the JLS avoids it because it implies a particular compiler
_implementation_ technique).
