retriving escape unicode sequences from files ...

Started by	"qwertmonkey" <qwertmonkey@1:261/38.remove-yy0-this>
First post	2012-08-03 18:54 +0000
Last post	2012-08-08 06:20 +0000
Articles	9 — 9 participants

  retriving escape unicode sequences from files ... "qwertmonkey" <qwertmonkey@1:261/38.remove-yy0-this> - 2012-08-03 18:54 +0000
    Re: retriving escape unicode sequences from files ... "markspace" <markspace@1:261/38.remove-yy0-this> - 2012-08-03 18:54 +0000
    Re: retriving escape unicode sequences from files ... "Roedy Green" <roedy.green@1:261/38.remove-yy0-this> - 2012-08-03 18:54 +0000
    Re: retriving escape unicode sequences from files ... "glen herrmannsfeldt" <glen.herrmannsfeldt@1:261/38.remove-5qr-this> - 2012-08-04 18:41 +0000
    Re: retriving escape unicode sequences from files ... "Arne Vajhøj" <arne.vajhøj@1:261/38.remove-5qr-this> - 2012-08-04 18:41 +0000
      Re: retriving escape unicode sequences from files ... "Daniel Pitts" <daniel.pitts@1:261/38.remove-5qr-this> - 2012-08-04 18:41 +0000
        Re: retriving escape unicode sequences from files ... "markspace" <markspace@1:261/38.remove-5qr-this> - 2012-08-04 18:41 +0000
          Re: retriving escape unicode sequences from files ... "Lew" <lew@1:261/38.remove-5qr-this> - 2012-08-04 18:41 +0000
        Re: retriving escape unicode sequences from files ... "Arne Vajhøj" <arne.vajhøj@1:261/38.remove-p82-this> - 2012-08-08 06:20 +0000

#17083 — retriving escape unicode sequences from files ...

From	"qwertmonkey" <qwertmonkey@1:261/38.remove-yy0-this>
Date	2012-08-03 18:54 +0000
Subject	retriving escape unicode sequences from files ...
Message-ID	<501C1568.56042.calajapr@time.synchro.net>

From: qwertmonkey@syberianoutpost.ru

 Why is it that if you save a Unicode escape sequence in a file, say for "français"
~
\u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
~
 and then retrieve it as a String, you can't then convert it back to a UTF-8 String?
~
 As you can test with this piece of code, you can simply declare the String as
a literal or give it at the command prompt, but retrieving what seems to be
the same sequence of characters (as they print to standard out) from a file
doesn't seem to work.
~
import java.io.ByteArrayOutputStream; import java.io.PrintStream;
import java.io.UnsupportedEncodingException; import java.io.IOException;

// __
public class UniKdEnk00Test{
 private static final String aNWLn = System.getProperty("line.separator");
// __
 public static void main (String[] aArgs){
  try{
// __
   if((aArgs == null) ||  (aArgs.length != 1)){ throw new IOException(aNWLn +
"// __ usage:" + aNWLn + aNWLn +
" java UniKdEnk00Test \\u0066\\u0072\\u0061\\u006e\\u00e7\\u0061\\u0069\\u0073"
+ aNWLn);  }
   String aUniKdEnk = "\u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073";
   byte[] bAr = aUniKdEnk.getBytes("UTF-8");
   ByteArrayOutputStream BOS = new ByteArrayOutputStream();
   BOS.write(bAr, 0, bAr.length);
   String aUTF8L = new String(BOS.toByteArray(), "UTF-8");
   System.out.println(aUTF8L);
   BOS.reset();
  }catch(UnsupportedEncodingException UEncX){ UEncX.printStackTrace(); }
    catch(IOException IOX) { IOX.printStackTrace(); }
// __
 }
}
~
 lbrtchx
 comp.lang.java.programmer: escape unicode sequences in files ...

--- BBBS/Li6 v4.10 Dada-1
 * Origin: Prism bbs (1:261/38)
--- Synchronet 3.16a-Win32 NewsLink 1.98
Time Warp of the Future BBS - telnet://time.synchro.net:24

#17085

From	"markspace" <markspace@1:261/38.remove-yy0-this>
Date	2012-08-03 18:54 +0000
Message-ID	<501C1568.56044.calajapr@time.synchro.net>
In reply to	#17083

  To: qwertmonkey
From: markspace <-@.>

On 8/2/2012 8:52 PM, qwertmonkey@syberianoutpost.ru wrote:
>   Why is it that if you save a unicode sequence in a file, say "français"
> ~
> \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
> ~
>   and then retrieve as a String you can't then convert it back to a UTF-8 String

Because it isn't French; it's just the ASCII characters \, u, 0, 0, 6, 6, etc.
This is a totally different concept from the escape sequences that the
compiler interprets for you in source code.

If you want to read French out of a file, put *French* in the file, not ASCII
escapes. Nothing reads the escapes for you at run time.

If you want to interpret ASCII as escape sequences, you'll have to write the
interpreter. The Java Properties object reads escape sequences, but I don't
think you can separate just the escape parser out.
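
As a rough illustration of the Properties remark, the sketch below wraps the escaped text in a one-entry properties document and lets `Properties.load` do the decoding. The class name `PropertiesUnescape`, the method `decode`, and the key `k` are made up for this example. It is a workaround under assumptions rather than a general unescaper: Properties parsing also interprets other backslash escapes, trims leading whitespace from the value, and treats line breaks as entry separators, so it only suits single-line input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class PropertiesUnescape {
    // Decodes backslash-u escapes by parsing a one-entry properties document.
    // The key name "k" is arbitrary and exists only for this sketch.
    public static String decode(String escaped) {
        Properties props = new Properties();
        try {
            props.load(new StringReader("k=" + escaped));
        } catch (IOException e) {
            // A StringReader should not fail, but load declares IOException.
            throw new UncheckedIOException(e);
        }
        return props.getProperty("k");
    }

    public static void main(String[] args) {
        // The doubled backslashes give the run-time text backslash, u, 0, 0, 6, 6, ...
        // which is what a file holding the escapes would contain.
        String fromFile = "\\u0066\\u0072\\u0061\\u006e\\u00e7\\u0061\\u0069\\u0073";
        System.out.println(decode(fromFile).length()); // 8 characters after decoding
    }
}
```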

#17088

From	"Roedy Green" <roedy.green@1:261/38.remove-yy0-this>
Date	2012-08-03 18:54 +0000
Message-ID	<501C1569.56047.calajapr@time.synchro.net>
In reply to	#17083

  To: qwertmonkey
From: Roedy Green <see_website@mindprod.com.invalid>

On Fri, 3 Aug 2012 03:52:12 +0000 (UTC), qwertmonkey@syberianoutpost.ru wrote, 
quoted or indirectly quoted someone who said :

> Why is it that if you save a unicode sequence in a file, say "français"

What follows is a bit of a simplification.
You need to understand encoding, which kicks in when you use a Reader or
Writer.  Otherwise you are dealing with raw bytes and InputStreams and
OutputStreams.

Encoding converts your 16-bit internal Unicode chars to bytes (UTF-8, for
example), and decoding converts those bytes back to chars.

See http://mindprod.com/applet/fileio.html for sample code, and
http://mindprod.com/jgloss/encoding.html for an explanation of encoding and the
various types of encoding.
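
To make the Reader point concrete, here is a minimal sketch (not taken from the pages linked above) that decodes UTF-8 bytes through an `InputStreamReader` with an explicit charset. A byte array stands in for a file that really contains the word in UTF-8; the class and method names are invented for the example.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReaderEncodingDemo {
    // Decodes the given bytes as UTF-8 through a Reader and returns the first line.
    public static String readFirstLine(byte[] bytes) {
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a UTF-8 encoded file.
        byte[] utf8 = "fran\u00e7ais".getBytes(StandardCharsets.UTF_8);
        System.out.println(utf8.length);                  // 9 bytes: the c-cedilla takes two
        System.out.println(readFirstLine(utf8).length()); // 8 chars after decoding
    }
}
```

Naming the charset explicitly avoids depending on the platform default, which is a common reason the same file reads differently on different machines.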

--
Roedy Green Canadian Mind Products
http://mindprod.com
The greatest shortcoming of the human race is our inability to understand the 
exponential function.
 ~ Dr. Albert A. Bartlett (born: 1923-03-21 age: 89)
http://www.youtube.com/watch?v=F-QA2rkpBSY

#17156

From	"glen herrmannsfeldt" <glen.herrmannsfeldt@1:261/38.remove-5qr-this>
Date	2012-08-04 18:41 +0000
Message-ID	<501D6353.56116.calajapr@time.synchro.net>
In reply to	#17083

  To: qwertmonkey
From: glen herrmannsfeldt <gah@ugcs.caltech.edu>

qwertmonkey@syberianoutpost.ru wrote:
> Why is it that if you save a unicode sequence in a file, say "français"
> ~
> \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073

Note the difference between \u0066 and \uu0066.

Specifically, consider the java program:

class quote {
public static void main(String args[]) {
   System.out.println(\u0022hi there\u0021\u0022);
   }
}
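
The program above compiles because the compiler translates Unicode escapes before it does anything else with the source. A small sketch of the consequence for strings, with an invented class name: an escape written in source becomes one character, while a doubled backslash produces the six literal characters that a text file holding the escape would give you.

```java
public class EscapeLengthDemo {
    // The compiler translates this escape while reading the source,
    // so the string holds the single character 'f'.
    public static String compiled() {
        return "\u0066";
    }

    // The doubled backslash yields one literal backslash, so this string holds
    // six characters: backslash, u, 0, 0, 6, 6.
    public static String raw() {
        return "\\u0066";
    }

    public static void main(String[] args) {
        System.out.println(compiled().length()); // 1
        System.out.println(raw().length());      // 6
    }
}
```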

-- glen

#17157

From	"Arne Vajhøj" <arne.vajhøj@1:261/38.remove-5qr-this>
Date	2012-08-04 18:41 +0000
Message-ID	<501D6353.56117.calajapr@time.synchro.net>
In reply to	#17083

  To: qwertmonkey
From: Arne Vajhoj <arne@vajhoej.dk>

On 8/2/2012 11:52 PM, qwertmonkey@syberianoutpost.ru wrote:
>   Why is it that if you save a unicode sequence in a file, say "français"
> ~
> \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
> ~
>   and then retrieve as a String you can't then convert it back to a UTF-8 String
> ~

Some code from my shelf:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Unescape {
     // Matches a literal backslash, the letter u, and four hex digits in either case.
     private static final Pattern p = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
     public static String U2U(String s) {
         String res = s;
         Matcher m = p.matcher(res);
         while(m.find()) {
             // quoteReplacement keeps a decoded '$' or backslash from being read as replacement syntax
             res = res.replaceAll("\\" + m.group(0),
                     Matcher.quoteReplacement(Character.toString((char)Integer.parseInt(m.group(1), 16))));
         }
         return res;
     }
     public static void main(String[] args) {
         System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
     }
}

Arne

#17160

From	"Daniel Pitts" <daniel.pitts@1:261/38.remove-5qr-this>
Date	2012-08-04 18:41 +0000
Message-ID	<501D6353.56120.calajapr@time.synchro.net>
In reply to	#17157

  To: Arne Vajhøj
From: Daniel Pitts <newsgroup.nospam@virtualinfinity.net>

On 8/3/12 5:37 PM, Arne Vajhoj wrote:
> On 8/2/2012 11:52 PM, qwertmonkey@syberianoutpost.ru wrote:
>>   Why is it that if you save a unicode sequence in a file, say "français"
>> ~
>> \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
>> ~
>>   and then retrieve as a String you can't then convert it back to a
>> UTF-8 String
>> ~
>
> Some code from my shelf:
>
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> public class Unescape {
>      private static final Pattern p =
> Pattern.compile("\\\\u([0-9A-F]{4})");
>      public static String U2U(String s) {
>          String res = s;
>          Matcher m = p.matcher(res);
>          while(m.find()) {
>              res = res.replaceAll("\\" + m.group(0),
> Character.toString((char)Integer.parseInt(m.group(1), 16)));
>          }
>          return res;
>      }
>      public static void main(String[] args) {
>
> System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
>
>      }
> }
And if you wanted this to be efficient, you'd use appendReplacement instead of
res.replaceAll()

#17161

From	"markspace" <markspace@1:261/38.remove-5qr-this>
Date	2012-08-04 18:41 +0000
Message-ID	<501D6354.56122.calajapr@time.synchro.net>
In reply to	#17160

  To: Daniel Pitts
From: markspace <-@.>

On 8/3/2012 8:49 PM, Daniel Pitts wrote:

> And if you wanted this to be efficient, you'd use appendReplacement
> instead of res.replaceAll()
>


Free code is free.  Not efficient.  ;-)

#17163

From	"Lew" <lew@1:261/38.remove-5qr-this>
Date	2012-08-04 18:41 +0000
Message-ID	<501D6354.56123.calajapr@time.synchro.net>
In reply to	#17161

  To: markspace
From: Lew <lewbloch@gmail.com>

markspace wrote:
> Daniel Pitts wrote:
>> And if you wanted this to be efficient, you'd use appendReplacement
>> instead of res.replaceAll()
>
> Free code is free.  Not efficient.  ;-)

Not always. But after reviewers suggest improvements, it converges on efficiency.

Posting to Usenet valuably opens the code up to public review and to suggestions
for improvement like this one.

The pedagogical value of exposing code to tweaks offered by commenters is 
beyond measure.

--
Lew

#17338

From	"Arne Vajhøj" <arne.vajhøj@1:261/38.remove-p82-this>
Date	2012-08-08 06:20 +0000
Message-ID	<5021F864.56292.calajapr@time.synchro.net>
In reply to	#17160

  To: Daniel Pitts
From: Arne Vajhoj <arne@vajhoej.dk>

On 8/3/2012 11:49 PM, Daniel Pitts wrote:
> On 8/3/12 5:37 PM, Arne Vajhoj wrote:
>> On 8/2/2012 11:52 PM, qwertmonkey@syberianoutpost.ru wrote:
>>>   Why is it that if you save a unicode sequence in a file, say
>>> "français"
>>> ~
>>> \u0066\u0072\u0061\u006e\u00e7\u0061\u0069\u0073
>>> ~
>>>   and then retrieve as a String you can't then convert it back to a
>>> UTF-8 String
>>> ~
>>
>> Some code from my shelf:
>>
>> import java.util.regex.Matcher;
>> import java.util.regex.Pattern;
>>
>> public class Unescape {
>>      private static final Pattern p =
>> Pattern.compile("\\\\u([0-9A-F]{4})");
>>      public static String U2U(String s) {
>>          String res = s;
>>          Matcher m = p.matcher(res);
>>          while(m.find()) {
>>              res = res.replaceAll("\\" + m.group(0),
>> Character.toString((char)Integer.parseInt(m.group(1), 16)));
>>          }
>>          return res;
>>      }
>>      public static void main(String[] args) {
>>
>> System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
>>
>>
>>      }
>> }
> And if you wanted this to be efficient, you'd use appendReplacement
> instead of res.replaceAll()

I did not even know that existed.

So:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Unescape {
        // Accept hex digits in either case, so lowercase escapes also match.
        private static final Pattern p = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");
        public static String U2U(String s) {
                Matcher m = p.matcher(s);
                StringBuffer res = new StringBuffer();
                while (m.find()) {
                        // quoteReplacement keeps a decoded '$' or backslash from being read as replacement syntax
                        m.appendReplacement(res, Matcher.quoteReplacement(
                                Character.toString((char) Integer.parseInt(m.group(1), 16))));
                }
                m.appendTail(res);
                return res.toString();
        }
        public static void main(String[] args) {
                System.out.println(U2U("\\u0041\\u0042\\u0043\\u000A\\u0031\\u0032\\u0033"));
        }
}
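
For the original question, which was about text read from a file, the same appendReplacement approach could be wired up roughly as follows. This is only a sketch under assumptions: the class name `UnescapeFile`, the method names, and the temporary file are invented for the example, the file is assumed to contain plain ASCII, and the pattern accepts hex digits in either case and quotes the replacement so that a decoded `$` or backslash is inserted literally.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnescapeFile {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every escape in s with the char it denotes, in a single pass.
    public static String unescape(String s) {
        Matcher m = ESCAPE.matcher(s);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char decoded = (char) Integer.parseInt(m.group(1), 16);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(decoded)));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Reads the whole file as ASCII text and decodes the escapes it contains.
    public static String unescapeFile(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.US_ASCII);
        return unescape(text);
    }

    public static void main(String[] args) throws IOException {
        // Write the escaped text to a temporary file, then read it back decoded.
        Path tmp = Files.createTempFile("escapes", ".txt");
        String escaped = "\\u0066\\u0072\\u0061\\u006e\\u00e7\\u0061\\u0069\\u0073";
        Files.write(tmp, escaped.getBytes(StandardCharsets.US_ASCII));
        System.out.println(unescapeFile(tmp).length()); // 8 characters after decoding
        Files.delete(tmp);
    }
}
```

Whether the decoded word then prints correctly depends on the console's encoding, which is a separate matter from decoding the escapes.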

Arne
