Groups > comp.lang.java.programmer > #19834 > unrolled thread

Detect XML document encodings with SAX

Started by	Sebastian <sebastian@undisclosed.invalid>
First post	2012-11-21 15:32 +0100
Last post	2012-12-16 17:43 +0200
Articles	20 on this page of 43 — 9 participants


  Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-21 15:32 +0100
    Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 11:31 -0800
      Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 00:39 +0100
        Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 16:37 -0800
          Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 07:41 +0100
            Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-21 23:18 -0800
              Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-22 07:53 +0000
                Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-22 08:31 -0800
              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:21 -0500
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:11 -0500
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:20 -0500
        Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-24 02:14 -0800
          Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-24 22:18 +0100
            Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:07 -0500
              Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:50 +0100
            Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 17:12 -0800
              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:17 -0500
                Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:02 -0800
                  Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:10 -0500
                    Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:25 -0800
                      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:37 -0500
                        Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 21:01 -0800
                          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:30 -0500
                            Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 18:03 -0800
                              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 21:09 -0500
                                Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 18:58 -0800
                                  Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 22:17 -0500
                                    Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 22:51 -0800
                                Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 21:52 -0800
                  Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:45 +0100
                    Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:23 -0500
                    Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-25 13:24 -0800
                  Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:58 +0100
          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:13 -0500
          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:19 -0500
    Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 03:24 -0800
      Re: Detect XML document encodings with SAX "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:13 +0100
        Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:22 -0500
    Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-25 11:00 +0000
      Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 12:32 +0100
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:41 -0500
    Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-12-12 20:32 -0800
    Re: Detect XML document encodings with SAX Stanimir Stamenkov <s7an10@netscape.net> - 2012-12-16 17:43 +0200


#19920

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 21:37 -0500
Message-ID	<50b18453$0$290$14726298@news.sunsite.dk>
In reply to	#19919

On 11/24/2012 9:25 PM, markspace wrote:
> On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
>> On 11/24/2012 9:02 PM, markspace wrote:
>>> (Your code throws an exception because I have no directory named
>>> "/work".  Look up StringReader or ByteArrayInputStream.  Use those
>>> instead of relying on actual files and a file system.)
>>
>> The code actually writes the files.
>
> Yeah, I'd also rather not have artifacts on my hard drive too, thanks.

Then why ask for an SSCCE?

Or do Java files not count as artifacts?

Arne


#19921

From	markspace <-@.>
Date	2012-11-24 21:01 -0800
Message-ID	<k8s8nk$fku$1@dont-email.me>
In reply to	#19920

On 11/24/2012 6:37 PM, Arne Vajhøj wrote:
> On 11/24/2012 9:25 PM, markspace wrote:
>> On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
>>> On 11/24/2012 9:02 PM, markspace wrote:
>>>> (Your code throws an exception because I have no directory named
>>>> "/work".  Look up StringReader or ByteArrayInputStream.  Use those
>>>> instead of relying on actual files and a file system.)
>>>
>>> The code actually writes the files.
>>
>> Yeah, I'd also rather not have artifacts on my hard drive too, thanks.
>
> Then why ask for an SSCCE?


Because it can be done without using files.

It isn't self-contained if it depends on my file system.  I have no
/work directory, so the example program fails, but not in the way
intended.  Ergo, it's not an SSCCE.


#19960

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-25 16:30 -0500
Message-ID	<50b28ded$0$289$14726298@news.sunsite.dk>
In reply to	#19921

On 11/25/2012 12:01 AM, markspace wrote:
> On 11/24/2012 6:37 PM, Arne Vajhøj wrote:
>> On 11/24/2012 9:25 PM, markspace wrote:
>>> On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
>>>> On 11/24/2012 9:02 PM, markspace wrote:
>>>>> (Your code throws an exception because I have no directory named
>>>>> "/work".  Look up StringReader or ByteArrayInputStream.  Use those
>>>>> instead of relying on actual files and a file system.)
>>>>
>>>> The code actually writes the files.
>>>
>>> Yeah, I'd also rather not have artifacts on my hard drive too, thanks.
>>
>> Then why ask for an SSCCE?
>>
>> Or do Java files not count as artifacts?
>
> Because it can be done without using files.

That really does not answer the question of why you want an SSCCE
if you don't want artifacts on your hard drive.

And you can obviously not read from a file without reading
from a file.

You may be assuming that XML parsers do not
use the underlying input classes to get the encoding.

As I have already explained to you once, that is not
always the case.

> It isn't self-contained if it depends on my file system.  I have no
> /work directory, so the example program fails, but not in the way
> intended.  Ergo, it's not an SSCCE.

Actually SSCCE allows for input files.

If you don't want input files, then ask for a MSSSCCE and link
to the rules for that.

Arne


#20281

From	Gene Wirchenko <genew@telus.net>
Date	2012-12-12 18:03 -0800
Message-ID	<qndic811jjnaoqep1jc95h7c27ivrhinoq@4ax.com>
In reply to	#19960

On Sun, 25 Nov 2012 16:30:20 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:

[snip]

>If you don't want input files, then ask for a MSSSCCE and link
                                               ^^^^^^^
>to the rules for that.

     Please expand your new acronym.

Sincerely,

Gene Wirchenko


#20282

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-12-12 21:09 -0500
Message-ID	<50c938de$0$289$14726298@news.sunsite.dk>
In reply to	#20281

On 12/12/2012 9:03 PM, Gene Wirchenko wrote:
> On Sun, 25 Nov 2012 16:30:20 -0500, Arne Vajhøj <arne@vajhoej.dk>
> wrote:
>
> [snip]
>
>> If you don't want input files, then ask for a MSSSCCE and link
>                                                 ^^^^^^^
>> to the rules for that.
>
>       Please expand your new acronym.

MarkSpace SSCCE

:-)

Arne


#20283

From	Lew <lewbloch@gmail.com>
Date	2012-12-12 18:58 -0800
Message-ID	<0037028d-8fb7-41ed-ac1f-42e00e50d67e@googlegroups.com>
In reply to	#20282

Arne Vajhøj wrote:
> MarkSpace SSCCE

Apparently the OP gave up on getting help and was unwilling to provide the 
materials requested.

-- 
Lew


#20284

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-12-12 22:17 -0500
Message-ID	<50c948dd$0$288$14726298@news.sunsite.dk>
In reply to	#20283

On 12/12/2012 9:58 PM, Lew wrote:
> Apparently the OP gave up on getting help and was unwilling to provide the
> materials requested.

????

Steven Simpson solved the problem with the provided information.

And OP acknowledged it.

Arne


#20288

From	Lew <lewbloch@gmail.com>
Date	2012-12-12 22:51 -0800
Message-ID	<eb266140-3ebd-4f41-99df-921e55c456aa@googlegroups.com>
In reply to	#20284

Arne Vajhøj wrote:
> Lew wrote:
>> Apparently the OP gave up on getting help and was unwilling to provide the
>> materials requested.
> 
> ????
> 
> Steven Simpson solved the problem with the provided information.
> 
> And OP acknowledged it.

I stand corrected.

-- 
Lew


#20286

From	Gene Wirchenko <genew@telus.net>
Date	2012-12-12 21:52 -0800
Message-ID	<r8ric81k3himnab3kqiv9p014f7tfc3tbq@4ax.com>
In reply to	#20282

On Wed, 12 Dec 2012 21:09:32 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:

>On 12/12/2012 9:03 PM, Gene Wirchenko wrote:
>> On Sun, 25 Nov 2012 16:30:20 -0500, Arne Vajhøj <arne@vajhoej.dk>
>> wrote:
>>
>> [snip]
>>
>>> If you don't want input files, then ask for a MSSSCCE and link
>>                                                 ^^^^^^^
>>> to the rules for that.
>>
>>       Please expand your new acronym.
>
>MarkSpace SSCCE
>
>:-)

    Thank you.

Sincerely,

Gene Wirchenko


#19925

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-25 10:45 +0100
Message-ID	<k8sp7h$h5h$1@news.albasani.net>
In reply to	#19917

On 25.11.2012 03:02, markspace wrote:
> On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
>> On 11/24/2012 8:12 PM, markspace wrote:
>>> I personally am still looking for an SSCCE, as your last one didn't
>>> reproduce the error for me.
>>
>> Did you try my 1 2 3 example?
>
>
> No that errors on me too. Really all the OP did was cut and paste the
> example code from his link as his SSCCE, it didn't even contain his data
> file. There's no way I'm going to bother writing code for anyone who is
> that lazy.

I did include an XML file with my original post, and said to use 
Elliotte Rusty's code to see what it (wrongly) says about the encoding.
What else would you require? Both Arne Vajhøj and Steven Simpson could
reproduce the phenomenon that way.

-- Sebastian


#19958

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-25 16:23 -0500
Message-ID	<50b28c66$0$289$14726298@news.sunsite.dk>
In reply to	#19925

On 11/25/2012 4:45 AM, Sebastian wrote:
> On 25.11.2012 03:02, markspace wrote:
>> On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
>>> On 11/24/2012 8:12 PM, markspace wrote:
>>>> I personally am still looking for an SSCCE, as your last one didn't
>>>> reproduce the error for me.
>>>
>>> Did you try my 1 2 3 example?
>>
>>
>> No that errors on me too. Really all the OP did was cut and paste the
>> example code from his link as his SSCCE, it didn't even contain his data
>> file. There's no way I'm going to bother writing code for anyone who is
>> that lazy.
>
> I did include an XML file with my original post, and said to use
> Elliotte Rusty's code to see what it (wrongly) says about the encoding.
> What else would you require? Both Arne Vajhøj and Steven Simpson could
> reproduce the phenomenon that way.

Just as some younger people find it uber-cool to have
their pants hanging down just over their knees, some
not quite as young people find it uber-cool to scream
for SSCCE and clarification.

:-)

Arne


#19959

From	markspace <-@.>
Date	2012-11-25 13:24 -0800
Message-ID	<k8u29o$ol8$1@dont-email.me>
In reply to	#19925

On 11/25/2012 1:45 AM, Sebastian wrote:

> I did include an XML file with my original post, and said to use
> Elliotte Rusty's code to see what it (wrongly) says about the encoding.
> What else would you require?

The code you copied prints "unknown" for me.  I mentioned that a while
back.  I assume the encoding in the xml file you attached did not
survive the interwebs.

I require: a 100% self-contained program.  Use a StringReader and a Java
string constant.  Those can express character values greater than 127
with ASCII-only \uXXXX escapes, so I assume they'll survive Usenet just fine.

Self-contained means self-contained, not "except for the bits I left off."
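
For illustration, a file-free test along the lines described above might look like the sketch below. The class name and the sample text are made up for this example. One caveat: an InputSource built from a Reader hands the parser characters rather than bytes, so the encoding declaration plays no part in decoding here; to exercise byte-level detection, a ByteArrayInputStream would be needed instead.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class StringReaderParse {
    // Hypothetical sample document; Unicode escapes keep the source pure ASCII.
    static final String XML =
        "<?xml version=\"1.0\" encoding=\"windows-1252\"?>"
        + "<root>L\u00f6we \u20ac</root>";

    /** Parses the XML text from memory and returns its character content. */
    static String parse(String xml) throws Exception {
        final StringBuilder content = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                content.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return content.toString();
    }

    public static void main(String[] args) throws Exception {
        String content = parse(XML);
        // Print code points so the result is readable on any console encoding.
        for (int i = 0; i < content.length(); i++) {
            System.out.printf("U+%04X ", (int) content.charAt(i));
        }
        System.out.println();
    }
}
```

Nothing touches the file system, so the program runs the same way on any machine with a JDK.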


#19927

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-25 10:58 +0100
Message-ID	<k8sq0l$iss$1@news.albasani.net>
In reply to	#19917

On 25.11.2012 03:02, markspace wrote:
[snip]
> file. There's no way I'm going to bother writing code for anyone who is
> that lazy.

I wasn't really asking for code to be written. I was interested to know 
whether others see the same behavior and consider it quirky, too, and 
whether they had suggestions for a parser that would behave more as I 
expected.

Of course, I am still grateful to Arne for having posted his code.

-- Sebastian


#19906

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 17:13 -0500
Message-ID	<50b14677$0$282$14726298@news.sunsite.dk>
In reply to	#19888

On 11/24/2012 5:14 AM, Lew wrote:
> Arne Vajhøj wrote:
>> Lew wrote:
>>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
>>
>> Output usually means System.out.println - that works fine with a parser.
>
> His phrasing wasn't clear to me. That's why I asked for clarification.

Then maybe we need "How to ask for clarifications the smart way".

>>> However, according to
>>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
>>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>>> and EUC-JP,
>>> So it looks like you must not accept XML documents with such a
>>> non-standard encoding.
>>
>> Those who have researched would know that the XML spec does not
>> limit the encodings at all. An XML processor must support UTF-8
>> and UTF-16, but is free to support others.
>
> Perhaps the OP's parser doesn't exercise that freedom, judging by the
> symptoms.

There is nothing in OP's symptoms that indicates lack of support
for encodings.

OP's symptom is that it parses fine with encoding XYZ, but when asked
by the caller it wrongly claims to be using UTF-8.

Arne


#19907

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 17:19 -0500
Message-ID	<50b147f7$0$294$14726298@news.sunsite.dk>
In reply to	#19888

On 11/24/2012 5:14 AM, Lew wrote:
> Obviously I don't know the answer, but he's asking for suggestions
> to investigate, AIUI. He's having encoding problems. His XML is apparently
> encoded in Windows-1252, a notoriously funky encoding especially for
> the variety of characters with which one might wish to deal.

CP-1252 is just another encoding. It is no more or less funky than
any other encoding.

In fact it is identical to ISO-8859-1 for all characters except
128-159, which are control characters/unmapped in ISO-8859-1 but have
various extra characters assigned in CP-1252.
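
The difference can be seen by decoding the same byte with both charsets. This is a minimal sketch with illustrative class and method names; it assumes the JDK in use provides the windows-1252 charset, which standard distributions normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Cp1252VersusLatin1 {
    /** Decodes one byte with the given charset and returns the code point. */
    static int decode(int b, Charset cs) {
        String s = new String(new byte[] { (byte) b }, cs);
        return s.codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;
        // 0x80 lies in the 128-159 range; 0xF6 lies outside it.
        for (int b : new int[] { 0x80, 0xF6 }) {
            System.out.printf(
                "byte %02X: windows-1252 -> U+%04X, ISO-8859-1 -> U+%04X%n",
                b, decode(b, cp1252), decode(b, latin1));
        }
    }
}
```

Byte 0x80 decodes to the euro sign (U+20AC) in windows-1252 but to the control character U+0080 in ISO-8859-1, while 0xF6 is U+00F6 in both.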

>                                                             So why not
> investigate obtaining material that isn't in such a notoriously funky
> encoding, like, oh, say, the old reliable standard UTF-8?

If one can choose the data files and the software, then life is easy.

Arne


#19847

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2012-11-22 03:24 -0800
Message-ID	<lj2sa8pruu1mtn20amkj2olt97h87n2d7k@4ax.com>
In reply to	#19834

On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
<sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
someone who said :

>Does anyone have an idea why that is so? And how I could
>go about making some XML parser determine the correct encoding?

See http://mindprod.com/products2.html#ENCODINGRECOGNISER

This is a manual assist tool to help you guess the encoding.

Encodings are not embedded in any way in files. You just have to know.

ARGHHH!

See http://mindprod.com/jgloss/encoding.html
for how to use native2ascii to interconvert encodings.  

The XML world likes UTF-8.  Using anything else is just asking for
trouble.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them.


#19872

From	"Peter J. Holzer" <hjp-usenet2@hjp.at>
Date	2012-11-24 00:13 +0100
Message-ID	<slrnkb00o8.jbt.hjp-usenet2@hrunkner.hjp.at>
In reply to	#19847

On 2012-11-22 11:24, Roedy Green <see_website@mindprod.com.invalid> wrote:
> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
><sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
> someone who said :
>>Does anyone have an idea why that is so? And how I could
>>go about making some XML parser determine the correct encoding?
>
> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
>
> This is a manual assist tool to help you guess the encoding.

No need to guess.

> Encodings are not embedded in any way in files. You just have to know.

Not true for XML. The file Sebastian posted starts with

<?xml version="1.0" encoding="windows-1250"?>

	hp


-- 
   _  | Peter J. Holzer    | Fluch der elektronischen Textverarbeitung:
|_|_) | Sysadmin WSR       | Man feilt solange an seinen Text um, bis
| |   | hjp@hjp.at         | die Satzbestandteile des Satzes nicht mehr
__/   | http://www.hjp.at/ | zusammenpaßt. -- Ralph Babel


#19881

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-23 21:22 -0500
Message-ID	<50b02f7e$0$283$14726298@news.sunsite.dk>
In reply to	#19872

On 11/23/2012 6:13 PM, Peter J. Holzer wrote:
> On 2012-11-22 11:24, Roedy Green <see_website@mindprod.com.invalid> wrote:
>> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
>> <sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
>> someone who said :
>>> Does anyone have an idea why that is so? And how I could
>>> go about making some XML parser determine the correct encoding?
>>
>> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
>>
>> This is a manual assist tool to help you guess the encoding.
>
> No need to guess.
>
>> Encodings are not embedded in any way in files. You just have to know.
>
> Not true for XML. The file Sebastian posted starts with
>
> <?xml version="1.0" encoding="windows-1250"?>

New around here?

Don't expect Roedy's posts to relate that much to what he is
replying to.

Arne


#19928

From	Steven Simpson <ss@domain.invalid>
Date	2012-11-25 11:00 +0000
Message-ID	<maa9o9-2vr.ln1@s.simpson148.btinternet.com>
In reply to	#19834

On 21/11/12 14:32, Sebastian wrote:
> Does anyone have an idea why that is so? And how I could
> go about making some XML parser determine the correct encoding?

Sussed it!  (Come to think of it, I feel I've sussed this before...)

The charset returned by the locator changes during parsing.  At 
startDocument(), it is the assumed charset, possibly based on the first 
four-or-so bytes.  At endDocument(), it is reset to null.  On the first 
call to startElement, it has the correct value.  There might be an 
earlier event where it is correct - I didn't investigate.

SSCCE...

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

import java.io.*;
import java.nio.charset.*;

public class SAXEncodingDetector extends DefaultHandler {
     static void escape(PrintWriter out, CharsetEncoder enc, CharSequence text) {
         final int len = text.length();
         for (int i = 0; i < len; i++) {
             char c = text.charAt(i);
             if (enc.canEncode(c))
                 out.print(c);
             else
                 out.printf("&#x%x;", (int) c);
         }
     }

     static final String MESSAGE = "L\u00f6we \u20ac";

     static byte[] createXMLBytes(String charsetName)
         throws UnsupportedEncodingException {
         Charset charset = Charset.forName(charsetName);
         CharsetEncoder encoder = charset.newEncoder();
         ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
         PrintWriter out =
             new PrintWriter(new OutputStreamWriter(bytesOut, charset));
         out.printf("<?xml version=\"1.0\" encoding=\"%s\" ?>%n", charsetName);
         out.print("<root>");
         escape(out, encoder, MESSAGE);
         out.println("</root>");
         out.close();
         return bytesOut.toByteArray();
     }

     public static void main(String[] args) throws SAXException, IOException {
         for (int i = 0; i < args.length; i++) {
             String inCharset = args[i];
             byte[] bytes = createXMLBytes(inCharset);
             System.out.printf("%nCharset %s: (%d bytes)%n",
                               inCharset, bytes.length);
             printBytes(bytes, System.out);
             ByteArrayInputStream in = new ByteArrayInputStream(bytes);

             XMLReader parser = XMLReaderFactory.createXMLReader();
             SAXEncodingDetector handler = new SAXEncodingDetector();
             parser.setContentHandler(handler);
             parser.parse(new InputSource(in));

             System.out.printf("Charset at document start: %s%n",
                               handler.encodingAtDocumentStart);
             System.out.printf(" Charset at element start: %s%n",
                               handler.encodingAtElementStart);
             System.out.printf("   Charset at element end: %s%n",
                               handler.encodingAtElementEnd);
             System.out.printf("  Charset at document end: %s%n",
                               handler.encodingAtDocumentEnd);
             String content = handler.content.toString();
             System.out.println("Content: " + content);
             if (!content.equals(MESSAGE))
                 System.out.println("Warning: message corrupted");
         }
     }
     
     private String encodingAtDocumentStart;
     private String encodingAtElementStart;
     private String encodingAtElementEnd;
     private String encodingAtDocumentEnd;
     private Locator2 locator;
     private StringWriter content = new StringWriter();

     private boolean inElement;

     @Override
     public void setDocumentLocator(Locator locator) {
         if (locator instanceof Locator2) {
             this.locator = (Locator2) locator;
         }
     }
     
     @Override
     public void startDocument() throws SAXException {
         if (locator != null) {
             this.encodingAtDocumentStart = locator.getEncoding();
         }
     }

     @Override
     public void endDocument() throws SAXException {
         if (locator != null) {
             this.encodingAtDocumentEnd = locator.getEncoding();
         }
     }

     @Override
     public void startElement(String uri, String localName,
                              String qName, Attributes atts) {
         if (localName.equals("root")) {
             if (locator != null)
                 this.encodingAtElementStart = locator.getEncoding();
             inElement = true;
         }
     }

     @Override
     public void endElement(String uri, String localName, String qName) {
         if (localName.equals("root")) {
             if (locator != null)
                 this.encodingAtElementEnd = locator.getEncoding();
             inElement = false;
         }
     }

     @Override
     public void characters(char[] ch, int start, int length) {
         if (inElement)
             content.write(ch, start, length);
     }

     static void printBytes(byte[] bytes, PrintStream out) {
         for (int major = 0; major < bytes.length; major += 16) {
             final int lim = Math.min(major + 16, bytes.length) - major;
             for (int minor = 0; minor < 16; minor++) {
                 if (minor < lim) {
                     final int pos = major + minor;
                     out.printf("%02X ", bytes[pos]);
                 } else {
                     out.print(".. ");
                 }
             }

             for (int minor = 0; minor < 16; minor++) {
                 if (minor < lim) {
                     final int pos = major + minor;
                     final int c = bytes[pos] & 0xff;
                     if (c == 10) {
                         out.print("\\n");
                     } else if (c == 13) {
                         out.print("\\r");
                     } else if (c == 9) {
                         out.print("\\t");
                     } else if (c < 32) {
                         out.printf("^%c", (char) (c + 64));
                     } else if (c >= 127 && c <= 160) {
                         out.printf("%02X", c);
                     } else {
                         out.printf("%c ", (char) c);
                     }
                 } else {
                     out.print("..");
                 }
             }

             out.println();
         }
     }
}



Command:

java SAXEncodingDetector US-ASCII ISO-8859-1 UTF-8 windows-1252



Output:

Charset US-ASCII: (75 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l   v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 53 . 0 "   e n c o d i n g = " U S
2D 41 53 43 49 49 22 20 3F 3E 0A 3C 72 6F 6F 74 - A S C I I "   ? > \n< r o o t
3E 4C 26 23 78 66 36 3B 77 65 20 26 23 78 32 30 > L & # x f 6 ; w e   & # x 2 0
61 63 3B 3C 2F 72 6F 6F 74 3E 0A .. .. .. .. .. a c ; < / r o o t > \n..........
Charset at document start: UTF-8
  Charset at element start: US-ASCII
    Charset at element end: US-ASCII
   Charset at document end: null
Content: Löwe €

Charset ISO-8859-1: (72 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l   v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 49 53 . 0 "   e n c o d i n g = " I S
4F 2D 38 38 35 39 2D 31 22 20 3F 3E 0A 3C 72 6F O - 8 8 5 9 - 1 "   ? > \n< r o
6F 74 3E 4C F6 77 65 20 26 23 78 32 30 61 63 3B o t > L ö w e   & # x 2 0 a c ;
3C 2F 72 6F 6F 74 3E 0A .. .. .. .. .. .. .. .. < / r o o t > \n................
Charset at document start: UTF-8
  Charset at element start: ISO-8859-1
    Charset at element end: ISO-8859-1
   Charset at document end: null
Content: Löwe €

Charset UTF-8: (63 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l   v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 . 0 "   e n c o d i n g = " U T
46 2D 38 22 20 3F 3E 0A 3C 72 6F 6F 74 3E 4C C3 F - 8 "   ? > \n< r o o t > L Ã
B6 77 65 20 E2 82 AC 3C 2F 72 6F 6F 74 3E 0A .. ¶ w e   â 82¬ < / r o o t > \n..
Charset at document start: UTF-8
  Charset at element start: UTF-8
    Charset at element end: UTF-8
   Charset at document end: null
Content: Löwe €

Charset windows-1252: (67 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l   v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 77 69 . 0 "   e n c o d i n g = " w i
6E 64 6F 77 73 2D 31 32 35 32 22 20 3F 3E 0A 3C n d o w s - 1 2 5 2 "   ? > \n<
72 6F 6F 74 3E 4C F6 77 65 20 80 3C 2F 72 6F 6F r o o t > L ö w e   80< / r o o
74 3E 0A .. .. .. .. .. .. .. .. .. .. .. .. .. t > \n..........................
Charset at document start: UTF-8
  Charset at element start: windows-1252
    Charset at element end: windows-1252
   Charset at document end: null
Content: Löwe €
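
Condensed to the essential step, the finding above amounts to reading the locator's encoding at the first startElement() rather than at startDocument(). The following is a minimal sketch, assuming the JDK's bundled parser supplies a Locator2 as it did in the run above; the class and method names are illustrative and not taken from the SSCCE.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

class FirstElementEncoding extends DefaultHandler {
    private Locator2 locator;
    private String encoding;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Only Locator2 exposes getEncoding(); other locators are ignored.
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        // Record the encoding once, at the first element event.
        if (encoding == null && locator != null) {
            encoding = locator.getEncoding();
        }
    }

    /** Returns the encoding reported at the first element, or null. */
    static String detect(byte[] xmlBytes) throws Exception {
        FirstElementEncoding handler = new FirstElementEncoding();
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        return handler.encoding;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                   + "<root>L\u00f6we</root>";
        System.out.println(detect(xml.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

Other SAX implementations may not provide a Locator2 at all, in which case detect() returns null.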



-- 
ss at comp dot lancs dot ac dot uk


#19929

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-25 12:32 +0100
Message-ID	<k8svhc$nk$1@news.albasani.net>
In reply to	#19928

On 25.11.2012 12:00, Steven Simpson wrote:
> On 21/11/12 14:32, Sebastian wrote:
>> Does anyone have an idea why that is so? And how I could
>> go about making some XML parser determine the correct encoding?
>
> Sussed it! (Come to think of it, I feel I've sussed this before...)
>
> The charset returned by the locator changes during parsing. At
> startDocument(), it is the assumed charset, possibly based on the first
> four-or-so bytes. At endDocument(), it is reset to null. On the first
> call to startElement, it has the correct value. There might be an
> earlier event where it is correct - I didn't investigate.

Oh, that is it! Thanks for the explanation...

> SSCCE...
[snip]
...and the code. (And now I know what a real SSCCE is, too.)

-- Sebastian
