From: Knute Johnson
Newsgroups: comp.lang.java.programmer
Subject: Re: Unicode escapes and String literals?
Date: Thu, 13 Dec 2012 16:11:46 -0800
References: <50ca6046$0$284$14726298@news.sunsite.dk>

On 12/13/2012 3:09 PM, Arne Vajhøj wrote:
> On 12/13/2012 12:31 PM, Knute Johnson wrote:
>> I just had a great revelation as I was putting together my SSCCE
>> for the question I was going to ask. So it has changed my
>> question. How do I do the conversion of unicode escape sequences
>> to a String that are done by string literals?
>>
>> String s = "\u0066\u0065\u0064";
>>
>> becomes "fed" but if you create a String with \u0066\u0065\u0064 in
>> it without using the literal it stays \u0066\u0065\u0064. Is there
>> a built in mechanism in Java for doing that translation to a
>> String?
>
> I don't think there is anything built in.
>
> But it is trivial to code.
>
> This was posted just a few months back:
>
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class Unescape {
>     private static final Pattern p =
>         Pattern.compile("\\\\u([0-9A-F]{4})");
>     public static String U2U(String s) {
>         //String res = s;
>         //Matcher m = p.matcher(res);
>         //while (m.find()) {
>         //    res = res.replaceAll("\\" + m.group(0),
>         //        Character.toString((char) Integer.parseInt(m.group(1), 16)));
>         //}
>         //return res;
>         Matcher m = p.matcher(s);
>         StringBuffer res = new StringBuffer();
>         while (m.find()) {
>             m.appendReplacement(res,
>                 Character.toString((char) Integer.parseInt(m.group(1), 16)));
>         }
>         m.appendTail(res);
>         return res.toString();
>     }
>     public static void main(String[] args) {
>         System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
>     }
> }
>
> Arne

Well, brilliant minds think alike. Where were you when I asked the
first time :-). I don't remember a thread on this going by but that's
getting harder to do all the time.

I originally had String.valueOf() instead of Character.toString(). I
think the latter is better but not sure if it makes any difference.
Could be a non-trivial Unicode gotcha, eh Daniel?

Thanks everybody.
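On the String.valueOf() versus Character.toString() question: for a single char the two should be interchangeable, since both are documented to return a one-character String holding that char. The check below is only a sketch. CharToStringCheck, sameResult and decode are made-up names for illustration and are not part of either posted program. It also shows that a supplementary character written as two escapes decodes to a surrogate pair, which the (char) cast handles one half at a time.

```java
// Hypothetical check class, for illustration only.
final class CharToStringCheck {

    // True when both conversions give an equal String for this char.
    static boolean sameResult(char c) {
        return String.valueOf(c).equals(Character.toString(c));
    }

    // Decodes one four-hex-digit group the same way the posted loops do.
    static String decode(String hex4) {
        return Character.toString((char) Integer.parseInt(hex4, 16));
    }

    public static void main(String[] args) {
        // Walk every char value, surrogates included.
        for (int i = Character.MIN_VALUE; i <= Character.MAX_VALUE; i++) {
            if (!sameResult((char) i)) {
                throw new AssertionError("differs at " + Integer.toHexString(i));
            }
        }
        System.out.println("no difference found");

        // Two escapes forming a surrogate pair: two chars, one code point.
        String pair = decode("d83d") + decode("de00");
        System.out.println(pair.length());                            // 2
        System.out.println(pair.codePointCount(0, pair.length()));    // 1
        System.out.println(Integer.toHexString(pair.codePointAt(0))); // 1f600
    }
}
```

So the choice between the two calls looks like a matter of taste; any Unicode surprise is more likely to come from how the decoded chars are used afterwards than from the conversion itself.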
import java.util.regex.*;

public class test6 {
    public static void main(String[] args) {
        String clear = "byte me!";
        System.out.println(clear);

        String escpd = unicodeEscape(clear);
        System.out.println(escpd);

        Pattern p = Pattern.compile("\\\\u([0-9a-fA-F]{4})");
        Matcher m = p.matcher(escpd);
        StringBuffer buf = new StringBuffer();
        while (m.find()) {
            String repl =
                Character.toString((char)Integer.parseInt(m.group(1),16));
            m.appendReplacement(buf,repl);
        }
        m.appendTail(buf);
        System.out.println(buf);
    }

    public static String unicodeEscape(char c) {
        return String.format("\\u%04x",(int)c);
    }

    public static String unicodeEscape(Character c) {
        if (c == null)
            return null;
        return unicodeEscape(c.charValue());
    }

    public static String unicodeEscape(String str) {
        StringBuilder buf = new StringBuilder();
        // The archived copy lost the text after "i" in the loop header;
        // the rest of this method is reconstructed from the output below.
        for (int i=0; i<str.length(); i++)
            buf.append(unicodeEscape(str.charAt(i)));
        return buf.toString();
    }
}

java test6
byte me!
\u0062\u0079\u0074\u0065\u0020\u006d\u0065\u0021
byte me!

-- 
Knute Johnson
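One gotcha with the appendReplacement() loop used in both programs: the replacement argument is itself parsed, with '$' treated as a group reference and a backslash as an escape. If an escape decodes to either of those characters, the loop throws or drops the character. A minimal sketch of one way around that, assuming Matcher.quoteReplacement() is acceptable here; SafeUnescape and unescape are hypothetical names, not anything from the thread:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the posted code.
final class SafeUnescape {

    // A backslash, 'u', then exactly four hex digits in either case.
    private static final Pattern ESCAPE =
        Pattern.compile("\\\\u([0-9a-fA-F]{4})");

    static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String decoded =
                String.valueOf((char) Integer.parseInt(m.group(1), 16));
            // quoteReplacement keeps a decoded backslash or dollar sign from
            // being read as replacement-string syntax by appendReplacement.
            m.appendReplacement(out, Matcher.quoteReplacement(decoded));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(unescape("\\u0066\\u0065\\u0064")); // fed
        System.out.println(unescape("cost: \\u00245"));        // cost: $5
    }
}
```

The only change from the posted loops is the quoteReplacement() call; the pattern and the (char) cast are the same, so the output for "byte me!" would be unchanged.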