From: Knute Johnson
Newsgroups: comp.lang.java.programmer
Subject: Re: unicode
Date: Mon, 12 Sep 2011 14:04:13 -0700

On 9/12/2011 12:24 PM, bob wrote:
> Why does Java complain when I do this?
>
> html = html.replaceAll("\u000A", " ");
>
> This one works:
>
> // nbsp
> html = html.replaceAll("\u00A0", " ");

That was interesting. From the docs on java.util.regex.Pattern:

"Comparison to Perl 5

The Pattern engine performs traditional NFA-based matching with ordered
alternation as occurs in Perl 5.

Perl constructs not supported by this class:
. . .
The preprocessing operations \l \u, \L, and \U."

You can however do this, which is even more interesting!

public class test {
    public static void main(String[] args) {
        String html = "hello world!\n";
        html = html.replaceAll("\\x{a}","~");
        System.out.println(html);
    }
}

C:\Documents and Settings\Knute Johnson>java test
hello world!~

From the Java Language Specification:

"Because Unicode escapes are processed very early, it is not correct to
write "\u000a" for a string literal containing a single linefeed (LF);
the Unicode escape \u000a is transformed into an actual linefeed in
translation step 1 (§3.3) and the linefeed becomes a LineTerminator in
step 2 (§3.4), and so the string literal is not valid in step 3.
Instead, one should write "\n" (§3.10.6). Similarly, it is not correct
to write "\u000d" for a string literal containing a single carriage
return (CR). Instead use "\r"."

And there you have it!

-- 
Knute Johnson
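
For comparison, here is a rough sketch of three spellings that should compile and match a linefeed. The class name LinefeedEscapes and the helper replaceLf are made up for illustration. The first spelling is the "\n" the JLS recommends. The other two double the backslash so the compiler leaves the escape alone and the regex engine decodes it; the Pattern documentation lists both \xhh and \uhhhh among its character escapes.

```java
public class LinefeedEscapes {
    // Hypothetical helper: replaces every match of the regex with '~'.
    public static String replaceLf(String input, String regex) {
        return input.replaceAll(regex, "~");
    }

    public static void main(String[] args) {
        String html = "hello world!\n";

        // The compiler turns this escape into a real linefeed character,
        // and the regex engine matches that character literally.
        System.out.println(replaceLf(html, "\n"));

        // Doubled backslash: the compiler does not treat this as a Unicode
        // escape, so the regex engine receives the escape text and decodes it.
        System.out.println(replaceLf(html, "\\u000A"));

        // Same idea with the two-digit hexadecimal escape.
        System.out.println(replaceLf(html, "\\x0A"));
    }
}
```

Each call should print hello world!~ just like the \\x{a} version above. Note that the single-backslash Unicode escape for a linefeed is deliberately kept out of the source, comments included, since the compiler translates it before it even recognizes comments.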