| Started by | bob <bob@coolgroups.com> |
|---|---|
| First post | 2011-09-12 12:24 -0700 |
| Last post | 2011-09-12 15:48 -0700 |
| Articles | 20 on this page of 24 — 8 participants |
unicode bob <bob@coolgroups.com> - 2011-09-12 12:24 -0700
Re: unicode Knute Johnson <nospam@knutejohnson.com> - 2011-09-12 14:04 -0700
Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 14:08 -0700
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-12 17:31 -0400
Re: unicode markspace <-@.> - 2011-09-12 16:33 -0700
Re: unicode Lew <lewbloch@gmail.com> - 2011-09-12 17:46 -0700
Re: unicode markspace <-@.> - 2011-09-12 20:16 -0700
Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 22:05 -0700
Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 22:10 -0700
Re: unicode Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at> - 2011-09-13 07:18 +0000
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-12 20:57 -0400
Re: unicode markspace <-@.> - 2011-09-12 19:51 -0700
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-13 20:17 -0400
Re: unicode markspace <-@.> - 2011-09-13 19:32 -0700
Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-14 11:49 -0700
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-11-07 21:30 -0500
Re: unicode markspace <-@.> - 2011-11-07 19:18 -0800
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-11-07 22:47 -0500
Re: unicode markspace <-@.> - 2011-11-07 21:12 -0800
Re: unicode Paul Cager <paul.cager@googlemail.com> - 2011-09-13 04:05 -0700
Re: unicode Roedy Green <see_website@mindprod.com.invalid> - 2011-09-12 22:02 -0700
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-13 20:30 -0400
Re: unicode Arne Vajhøj <arne@vajhoej.dk> - 2011-09-12 17:29 -0400
Re: unicode Lew <lewbloch@gmail.com> - 2011-09-12 15:48 -0700
| From | bob <bob@coolgroups.com> |
|---|---|
| Date | 2011-09-12 12:24 -0700 |
| Subject | unicode |
| Message-ID | <6c991195-ab57-417c-92e0-6d5ee1c451dc@dq7g2000vbb.googlegroups.com> |
Why does Java complain when I do this?
html = html.replaceAll("\u000A", " ");
This one works:
// nbsp
html = html.replaceAll("\u00A0", " ");
| From | Knute Johnson <nospam@knutejohnson.com> |
|---|---|
| Date | 2011-09-12 14:04 -0700 |
| Message-ID | <j4ls4c$pa8$1@dont-email.me> |
| In reply to | #7914 |
On 9/12/2011 12:24 PM, bob wrote:
> Why does Java complain when I do this?
>
> html = html.replaceAll("\u000A", " ");
>
> This one works:
>
> // nbsp
> html = html.replaceAll("\u00A0", " ");
That was interesting. From the docs on java.util.regex.Pattern;
"Comparison to Perl 5
The Pattern engine performs traditional NFA-based matching with ordered
alternation as occurs in Perl 5.
Perl constructs not supported by this class:
.
.
.
The preprocessing operations \l \u, \L, and \U."
You can however do this, which is even more interesting!
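public class test {
    public static void main(String[] args) {
        String html = "hello world!\n";
        // The doubled backslash hands \x{a} to the regex engine, which
        // reads it as a hexadecimal escape for linefeed.
        html = html.replaceAll("\\x{a}","~");
        System.out.println(html);
    }
}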
C:\Documents and Settings\Knute Johnson>java test
hello world!~
From the Java Language Specification
"Because Unicode escapes are processed very early, it is not correct to
write "\u000a" for a string literal containing a single linefeed (LF);
the Unicode escape \u000a is transformed into an actual linefeed in
translation step 1 (§3.3) and the linefeed becomes a LineTerminator in
step 2 (§3.4), and so the string literal is not valid in step 3.
Instead, one should write "\n" (§3.10.6). Similarly, it is not correct
to write "\u000d" for a string literal containing a single carriage
return (CR). Instead use "\r"."
And there you have it!
--
Knute Johnson
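For reference, a minimal sketch of three spellings that avoid the compile error bob ran into. The class and method names are made up for illustration; the only library features assumed are String.replaceAll and the escapes documented for java.util.regex.Pattern. Each method replaces every linefeed with a space.

```java
public class NewlineReplace {
    // Ordinary string escape: the compiler stores a real LF in the literal.
    static String viaStringEscape(String html) {
        return html.replaceAll("\n", " ");
    }

    // Doubled backslash: the literal holds a backslash, the letter u and four
    // hex digits, and the regex engine (not the compiler) reads that as LF.
    static String viaRegexUnicodeEscape(String html) {
        return html.replaceAll("\\u000A", " ");
    }

    // Regex hexadecimal escape, also interpreted by the regex engine.
    static String viaRegexHexEscape(String html) {
        return html.replaceAll("\\x0A", " ");
    }

    public static void main(String[] args) {
        String html = "line one\nline two";
        System.out.println(viaStringEscape(html));
        System.out.println(viaRegexUnicodeEscape(html));
        System.out.println(viaRegexHexEscape(html));
    }
}
```

All three calls print "line one line two". Since no pattern matching is needed here, html.replace('\n', ' ') would also do the job without involving the regex engine.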
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-09-12 14:08 -0700 |
| Message-ID | <nfss679ije8c4r70tn9kmnr055vm6nfua0@4ax.com> |
| In reply to | #7914 |
On Mon, 12 Sep 2011 12:24:47 -0700 (PDT), bob <bob@coolgroups.com>
wrote, quoted or indirectly quoted someone who said :
> html = html.replaceAll("\u000A", " ");
that expands to
html = html.replaceAll("
", " " );
\u is treated in rather flat footed way, as if by a preprocessor.
see http://mindprod.com/jgloss/literal.html
--
Roedy Green Canadian Mind Products
http://mindprod.com
The modern conservative is engaged in one of man's oldest exercises in moral philosophy; that is,
the search for a superior moral justification for selfishness.
~ John Kenneth Galbraith (born: 1908-10-15 died: 2006-04-29 at age: 97)
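A small sketch of the behaviour described above: the escape is translated before the compiler looks for tokens, so it is accepted anywhere in the source, not only inside literals. The class and method names are invented for this example.

```java
public class EscapeEverywhere {
    // The method name is spelled with an escape for the letter a (U+0061).
    // After translation the compiler sees the ordinary identifier "answer".
    static int \u0061nswer() {
        return 42;
    }

    static String quoted() {
        // U+0041 is the letter A; it is substituted before the literal is scanned.
        return "\u0041BC";
    }

    public static void main(String[] args) {
        System.out.println(answer()); // calls the method declared above
        System.out.println(quoted());
    }
}
```

This prints 42 and then ABC. The same early substitution is what turns the escape for LF into a real line break in the middle of bob's string literal.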
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-09-12 17:31 -0400 |
| Message-ID | <4e6e7a2a$0$309$14726298@news.sunsite.dk> |
| In reply to | #7916 |
On 9/12/2011 5:08 PM, Roedy Green wrote:
> On Mon, 12 Sep 2011 12:24:47 -0700 (PDT), bob<bob@coolgroups.com>
> wrote, quoted or indirectly quoted someone who said :
>
>> html = html.replaceAll("\u000A", " ");
> that expands to
>
> html = html.replaceAll("
> ", " " );
>
> \u is treated in rather flat footed way, as if by a preprocessor.
It is treated per spec.
And I would not use the term preprocessor - it is Java not C.
Arne
| From | markspace <-@.> |
|---|---|
| Date | 2011-09-12 16:33 -0700 |
| Message-ID | <j4m4rs$l5g$1@dont-email.me> |
| In reply to | #7918 |
On 9/12/2011 2:31 PM, Arne Vajhøj wrote:
> On 9/12/2011 5:08 PM, Roedy Green wrote:
>> \u is treated in rather flat footed way, as if by a preprocessor.
> It is treated per spec.
Actually I agree with Roedy on this one. Per spec or not, it's a dumb
idea. I think it should go away, frankly.
> And I would not use the term preprocessor - it is Java not C.
I've always heard this part of the Java compiler described as a
preprocessor. Is there some other documentation that refers to it
differently?
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2011-09-12 17:46 -0700 |
| Message-ID | <88ff0d8c-af5f-4086-8232-26c80e5d8270@glegroupsg2000goo.googlegroups.com> |
| In reply to | #7931 |
markspace wrote:
> Arne Vajh�j wrote:
markspace, you need to post in a way that preserves characters.
>> Roedy Green wrote:
>>> \u is treated in rather flat footed way, as if by a preprocessor.
>
>> It is treated per spec.
>
> Actually I agree with Roedy on this one. Per spec or not, it's a dumb
> idea. I think it should go away, frankly.
That would defeat its purpose, which is somewhat similar to the purpose
of trigraphs in C, AIUI. That is, if your keyboard lacks certain
characters, you can express source in "\u" notation and the source
parser will read it correctly. Its whole raison d'etre is to precede
compilation, not to be part of it. So how could it go away? What would
you do instead?
Anyway, per spec is what we have to live with, like it or not.
>> And I would not use the term preprocessor - it is Java not C.
>
> I've always heard this part of the Java compiler described as a
> preprocessor. Is there some other documentation that refers to it
> differently?
--
Lew
| From | markspace <-@.> |
|---|---|
| Date | 2011-09-12 20:16 -0700 |
| Message-ID | <j4mhtv$ppb$1@dont-email.me> |
| In reply to | #7936 |
On 9/12/2011 5:46 PM, Lew wrote:
> That would defeat its purpose, which is somewhat similar to the
> purpose of trigraphs in C, AIUI.
There are only nine trigraphs, and they're a lot harder to "hit"
accidentally.
> That is, if your keyboard lacks
> certain characters, you can express source in "\u" notation and the
> source parser will read it correctly.
The problem is that \u is a lot more common than ??-. For example, \u
also occurs in regex, which unfortunately seems to be the OP's confusion.
> Its whole raison d'etre is to
> precede compilation, not to be part of it. So how could it go away?
> What would you do instead?
I'd make the \u sequence a string and character escape. \u000A would be
interpreted the same as \n. It would put a new line in the string, not
in the compiler input. Every other type of \u escape (comments, parts of
code) would be interpreted literally. Legacy code that relies on \u
outside of strings and character constants would break. If you need to
type a character that your keyboard doesn't have, get your editor to
recognize an escape sequence, not the compiler.
There are also digraphs in C, which are only recognized in tokenization,
not as a preprocessed type of substitution. These are much better, as
they are not recognized in string literals, character literals, or
comments. I'd consider replacing \u for "missing keys" with C's
digraphs. There's only five digraphs in C.
The presence of \u in comments is especially pernicious, imo. The Java
doc tool already has HTML escapes; we don't need a second redundant
method of specifying unusual characters.
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-09-12 22:05 -0700 |
| Message-ID | <ovot67hkb6a3lahse3jmo61fkpiqbkc36g@4ax.com> |
| In reply to | #7936 |
On Mon, 12 Sep 2011 17:46:51 -0700 (PDT), Lew <lewbloch@gmail.com>
wrote, quoted or indirectly quoted someone who said :
> Its whole raison d'etre is to precede compilation, not to be part of
> it. So how could it go away? What would you do instead?
You can't change the meaning of existing code, but if some language
evolved out of java it could redefine
out.println( "\u000a");
to have the same meaning as
out.println( "\n");
Though I doubt this would break many real-world programs even if you
changed the definition in Java 1.8.
--
Roedy Green Canadian Mind Products
http://mindprod.com
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-09-12 22:10 -0700 |
| Message-ID | <b8pt6711n3a1c43eir37igu3874pv29b8k@4ax.com> |
| In reply to | #7957 |
On Mon, 12 Sep 2011 22:05:19 -0700, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :
>
>You can't change the meaning of existing code, but if some language
>evolved out of java it could redefine
>
>out.println( "\u000a");
>to have the same meaning as
>out.println( "\n");
>
>Though I doubt this would break many real-world programs even if you
>changed the definition in Java 1.8.
Why fix it ever?
1. the current syntax strongly violates the principle of least surprise.
It amounts to a Java newbie hazing ritual.
2. It is a pointless lack of consistency. You have to treat some
Unicode characters in various special magic ways. This is particularly
a nuisance for code generation. For that you want as simple an
algorithm as possible to produce a String literal from a String.
It was a smart Alec idea that was not thought through far enough. It
was seductively easy to implement.
--
Roedy Green Canadian Mind Products
http://mindprod.com
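To illustrate the code-generation complaint, here is a rough sketch of a String-to-literal writer. The name toLiteral is hypothetical and the choice of which characters to escape is an assumption, not anything prescribed in this thread. The four special cases are characters that cannot be written as Unicode escapes inside a literal: LF and CR would become line terminators, the double quote would end the literal, and the backslash would start an escape sequence.

```java
public class LiteralWriter {
    /** Builds Java source text for a string literal whose value is s. */
    static String toLiteral(String s) {
        StringBuilder out = new StringBuilder("\"");
        for (char ch : s.toCharArray()) {
            switch (ch) {
                // These four need string escapes; a Unicode escape would
                // break the literal.
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                default:
                    if (ch < 0x20 || ch > 0x7E) {
                        // Everything else outside printable ASCII can use
                        // the four-digit Unicode form.
                        out.append(String.format("\\u%04x", (int) ch));
                    } else {
                        out.append(ch);
                    }
            }
        }
        return out.append('"').toString();
    }

    public static void main(String[] args) {
        System.out.println(toLiteral("tab\there, quote\" and newline\n"));
    }
}
```

Without the special cases the loop body would be a single rule, which is the simplicity Roedy is asking for.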
| From | Andreas Leitgeb <avl@gamma.logic.tuwien.ac.at> |
|---|---|
| Date | 2011-09-13 07:18 +0000 |
| Message-ID | <slrnj6u0tn.6gl.avl@gamma.logic.tuwien.ac.at> |
| In reply to | #7957 |
General context was:
> "\u000a"
If line-breaks were simply allowed to occur within strings, then this
particular problem would disappear, leaving only "\u0022" (double quote)
as a problem.
If the argument was to enable writing of characters not present on
certain keyboards, then what would be the procedure for synthesizing a
backslash on a keyboard without such? \u005c ? ;-)
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-09-12 20:57 -0400 |
| Message-ID | <4e6eaa8a$0$305$14726298@news.sunsite.dk> |
| In reply to | #7931 |
On 9/12/2011 7:33 PM, markspace wrote:
> On 9/12/2011 2:31 PM, Arne Vajhøj wrote:
>> On 9/12/2011 5:08 PM, Roedy Green wrote:
>>> \u is treated in rather flat footed way, as if by a preprocessor.
>
>> It is treated per spec.
>
> Actually I agree with Roedy on this one. Per spec or not, it's a dumb
> idea. I think it should go away, frankly.
I can not think of a better way to solve the problem that this
construct solves.
>> And I would not use the term preprocessor - it is Java not C.
>
> I've always heard this part of the Java compiler described as a
> preprocessor. Is there some other documentation that refers to it
> differently?
JLS uses "translating" and "translation".
Arne
| From | markspace <-@.> |
|---|---|
| Date | 2011-09-12 19:51 -0700 |
| Message-ID | <j4mgeo$h1c$2@dont-email.me> |
| In reply to | #7939 |
On 9/12/2011 5:57 PM, Arne Vajhøj wrote:
> I can not think of a better way to solve the problem that this
> construct solves.
Which problem is that? Because I honestly can't think of a single use
case for it.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-09-13 20:17 -0400 |
| Message-ID | <4e6ff2a9$0$313$14726298@news.sunsite.dk> |
| In reply to | #7946 |
On 9/12/2011 10:51 PM, markspace wrote:
> On 9/12/2011 5:57 PM, Arne Vajhøj wrote:
>> I can not think of a better way to solve the problem that this
>> construct solves.
>
> Which problem is that? Because I honestly can't think of a single use case
> for it.
To quote the JLS:
<quote>
The Java programming language specifies a standard way of transforming a
program written in Unicode into ASCII that changes a program into a form
that can be processed by ASCII-based tools. The transformation involves
converting any Unicode escapes in the source text of the program to
ASCII by adding an extra u - for example, \uxxxx becomes \uuxxxx - while
simultaneously converting non-ASCII characters in the source text to
Unicode escapes containing a single u each.
This transformed version is equally acceptable to a compiler for the
Java programming language ("Java compiler") and represents the exact
same program. The exact Unicode source can later be restored from this
ASCII form by converting each escape sequence where multiple u's are
present to a sequence of Unicode characters with one fewer u, while
simultaneously converting each escape sequence with a single u to the
corresponding single Unicode character.
</quote>
It allows you to use any unicode in names and strings with tools
that do not support those characters.
Arne
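The transformation in the JLS passage quoted above can be sketched in a few lines. This is only an illustration under stated assumptions: the class and method names are made up, the input is treated as a sequence of UTF-16 code units, and only the Unicode-to-ASCII direction is shown.

```java
public class AsciiForm {
    /**
     * Rewrites source text into the ASCII form: existing escapes gain one
     * extra u, and every non-ASCII code unit becomes a single-u escape.
     */
    static String toAscii(String source) {
        StringBuilder out = new StringBuilder();
        int precedingBackslashes = 0; // contiguous backslashes just before index i
        for (int i = 0; i < source.length(); i++) {
            char ch = source.charAt(i);
            if (ch == '\\') {
                // Only a backslash preceded by an even number of backslashes
                // can begin a Unicode escape.
                boolean eligible = precedingBackslashes % 2 == 0;
                boolean startsEscape = eligible
                        && i + 1 < source.length()
                        && source.charAt(i + 1) == 'u';
                out.append(ch);
                if (startsEscape) {
                    out.append('u'); // the extra u; the original u is copied next
                }
                precedingBackslashes++;
            } else {
                precedingBackslashes = 0;
                if (ch > 0x7F) {
                    out.append(String.format("\\u%04x", (int) ch));
                } else {
                    out.append(ch);
                }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The argument holds the six characters backslash, u, 0, 0, e, 9.
        System.out.println(toAscii("\\u00e9"));
        // Here the compiler has already turned the escape into one character.
        System.out.println(toAscii("caf\u00e9"));
    }
}
```

The first call prints the escape with a doubled u, the second prints "caf" followed by a single-u escape. Reversing the process (dropping one u, or decoding single-u escapes) restores the original text, which is the round trip the JLS describes.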
| From | markspace <-@.> |
|---|---|
| Date | 2011-09-13 19:32 -0700 |
| Message-ID | <j4p3oh$pa2$1@dont-email.me> |
| In reply to | #7997 |
On 9/13/2011 5:17 PM, Arne Vajhøj wrote:
> It allows you to use any unicode in names and strings with tools
> that do not support those characters.
I understand what it does, I just don't think it's a problem. That is,
the \u preprocessor escape in Java is just a solution in search of a use
case that doesn't exist, or at least is so corner-case-ish that it might
as well not exist. While at the same time it causes rather huge
problems, relative to the one it fixes (if any).
Again, I just think the darn thing is pernicious and should be removed.
At minimum, it should be removed from comments, that's just silly. (And
I've personally been bit by the \u thing in a comment twice now. It's
REALLY annoying when you're trying to comment how you print \u for
escape processing and you can't because "\u" isn't a valid string in a
comment.)
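One way around the comment problem, sketched under the assumption that the goal is a helper which prints escape text. The class and method names are hypothetical. The comments describe the sequence in words, and the string literal doubles the backslash.

```java
public class PrintEscape {
    /**
     * Returns the escape text for one UTF-16 code unit, that is, a
     * backslash, the letter u, and four hex digits.
     *
     * This comment spells the sequence out in words on purpose: a lone
     * backslash directly followed by u begins a Unicode escape even inside
     * a comment, and the file stops compiling unless hex digits follow.
     */
    static String escape(char ch) {
        // In the literal below the second backslash follows an odd number of
        // backslashes, so the compiler does not treat it as a Unicode escape.
        return String.format("\\u%04x", (int) ch);
    }

    public static void main(String[] args) {
        System.out.println(escape('A'));
    }
}
```

The doubled backslash is also accepted inside a comment, for the same even/odd reason, so writing it that way is an alternative to spelling the sequence out.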
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-09-14 11:49 -0700 |
| Message-ID | <f9t1779pr6a6irlcctkusd9911al27i7rj@4ax.com> |
| In reply to | #8006 |
On Tue, 13 Sep 2011 19:32:46 -0700, markspace <-@.> wrote, quoted or
indirectly quoted someone who said :
>Again, I just think the darn thing is pernicious and should be removed.
> At minimum, it should be removed from comments, that's just silly.
>(And I've personally been bit by the \u thing in a comment twice now.
>It's REALLY annoying when you're trying to comment how you print \u for
>escape processing and you can't because "\u" isn't a valid string in a
>comment.)
I know I have been bit by this too, but I forget the details. Could you
give an example of an invalid \u comment and just when the IDE/compiler
complains or missteps? I would like to enshrine it at
http://mindprod.com/jgloss/gotchas.html#BSU
--
Roedy Green Canadian Mind Products
http://mindprod.com
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-11-07 21:30 -0500 |
| Message-ID | <4eb89437$0$286$14726298@news.sunsite.dk> |
| In reply to | #8006 |
On 9/13/2011 10:32 PM, markspace wrote:
> On 9/13/2011 5:17 PM, Arne Vajhøj wrote:
>> It allows you to use any unicode in names and strings with tools
>> that do not support those characters.
>
> I understand what it does, I just don't think it's a problem. That is,
> the \u preprocessor escape in Java is just a solution in search of a use
> case that doesn't exist, or at least is so corner-case-ish that it might
> as well not exist. While at the same time it causes rather huge
> problems, relative to the one it fixes (if any).
You asked what problem it solves.
Now you know what problem it solves.
You still do not think it is a serious problem, but that is
a different discussion.
I would tend to agree that it is not a problem today, but I doubt
that unicode support was that common when Java 1.0 was brand new.
> Again, I just think the darn thing is pernicious and should be removed.
> At minimum, it should be removed from comments, that's just silly. (And
> I've personally been bit by the \u thing in a comment twice now. It's
> REALLY annoying when you're trying to comment how you print \u for escape
> processing and you can't because "\u" isn't a valid string in a comment.)
Once put in the language, then they can never remove it without
breaking existing code.
Arne
| From | markspace <-@.> |
|---|---|
| Date | 2011-11-07 19:18 -0800 |
| Message-ID | <j9a71d$a3k$1@dont-email.me> |
| In reply to | #9765 |
On 11/7/2011 6:30 PM, Arne Vajhøj wrote:
> You asked what problem it solves.
>
> Now you know what problem it solves.
>
> You still do not think it is a serious problem, but that is
> a different discussion.
No, I disagree with that assertion. If it's not an actual use case,
something that doesn't actually come from a user, or solve a real user
need, then it's just a pointless maintenance expense. Just like any
other "feature" that nobody needs or uses, it can just be removed.
> Once put in the language, then they can never remove it without breaking
> existing code.
My understanding of these things is that they grep (*) through the code
base of the most important customers and do an evaluation of the code
changes required. The question is "can we afford to make the changes
this would require?" It's a ROI question, not slavish devotion to
backwards compatibility. Yes the holy grail is "no code changes
required" but that isn't a given, necessarily. Sometimes you gotta break
those eggs to make your omelet.
(*) Figuratively. Not necessarily using the grep program. It's a code
inspection process.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-11-07 22:47 -0500 |
| Message-ID | <4eb8a64f$0$286$14726298@news.sunsite.dk> |
| In reply to | #9768 |
On 11/7/2011 10:18 PM, markspace wrote:
> On 11/7/2011 6:30 PM, Arne Vajhøj wrote:
>> You asked what problem it solves.
>>
>> Now you know what problem it solves.
>>
>> You still do not think it is a serious problem, but that is
>> a different discussion.
>
> No, I disagree with that assertion. If it's not an actual use case,
> something that doesn't actually come from a user, or solve a real user
> need, then it's just a pointless maintenance expense. Just like any
> other "feature" that nobody needs or uses, it can just be removed.
If you search the Java bug database then you will see that
SUN got lots of bug reports including some that were compiler
bugs about this feature.
Somebody did use the feature.
>> Once put in the language, then they can never remove it without breaking
>> existing code.
>
> My understanding of these things is that they grep (*) through the code
> base of the most important customers and do an evaluation of the code
> changes required. The question is "can we afford to make the changes
> this would require?" It's a ROI question, not slavish devotion to
> backwards compatibility. Yes the holy grail is "no code changes
> required" but that isn't a given, necessarily. Sometimes you gotta break
> those eggs to make your omelet.
I find it very difficult to see why people (and their employers) that
have coded according to spec should suffer to help people that have
not studied the spec.
Arne
| From | markspace <-@.> |
|---|---|
| Date | 2011-11-07 21:12 -0800 |
| Message-ID | <j9adna$9pk$1@dont-email.me> |
| In reply to | #9771 |
On 11/7/2011 7:47 PM, Arne Vajhøj wrote:
> If you search the Java bug database then you will see that
> SUN got lots of bug reports including some that were compiler
> bugs about this feature.
>
> Somebody did use the feature.
I'll take a look.
> I find it very difficult to see why people (and their employers) that
> have coded according to spec should suffer to help people that have
> not studied the spec.
It's still a $ and cents equation in my mind. "Suffering" doesn't enter
the equation. No matter what the feature, there's got to be a point
where it's rational to drop support for it. Happens all the time. You
may disagree with that in regards to this particular issue, but I'm
having a somewhat hard time seeing why it isn't obvious in the general
case: sometimes features must be dropped.
| From | Paul Cager <paul.cager@googlemail.com> |
|---|---|
| Date | 2011-09-13 04:05 -0700 |
| Message-ID | <029b8f6c-9a19-4ab2-a650-6dcb7ec6d670@w8g2000yqi.googlegroups.com> |
| In reply to | #7939 |
On Sep 13, 1:57 am, Arne Vajhøj <a...@vajhoej.dk> wrote:
> On 9/12/2011 7:33 PM, markspace wrote:
> > I've always heard this part of the Java compiler described as a
> > preprocessor. Is there some other documentation that refers to it
> > differently?
>
> JLS uses "translating" and "translation".
The phrase "Lexical Analysis" stage would also be a good description (I
imagine the JLS avoids it because it implies a particular compiler
_implementation_ technique).