Detect XML document encodings with SAX

Started by	Sebastian <sebastian@undisclosed.invalid>
First post	2012-11-21 15:32 +0100
Last post	2012-12-16 17:43 +0200
Articles	20 on this page of 43 — 9 participants

  Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-21 15:32 +0100
    Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 11:31 -0800
      Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 00:39 +0100
        Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 16:37 -0800
          Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 07:41 +0100
            Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-21 23:18 -0800
              Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-22 07:53 +0000
                Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-22 08:31 -0800
              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:21 -0500
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:11 -0500
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:20 -0500
        Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-24 02:14 -0800
          Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-24 22:18 +0100
            Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:07 -0500
              Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:50 +0100
            Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 17:12 -0800
              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:17 -0500
                Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:02 -0800
                  Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:10 -0500
                    Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:25 -0800
                      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:37 -0500
                        Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 21:01 -0800
                          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:30 -0500
                            Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 18:03 -0800
                              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 21:09 -0500
                                Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 18:58 -0800
                                  Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 22:17 -0500
                                    Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 22:51 -0800
                                Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 21:52 -0800
                  Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:45 +0100
                    Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:23 -0500
                    Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-25 13:24 -0800
                  Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:58 +0100
          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:13 -0500
          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:19 -0500
    Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 03:24 -0800
      Re: Detect XML document encodings with SAX "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:13 +0100
        Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:22 -0500
    Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-25 11:00 +0000
      Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 12:32 +0100
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:41 -0500
    Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-12-12 20:32 -0800
    Re: Detect XML document encodings with SAX Stanimir Stamenkov <s7an10@netscape.net> - 2012-12-16 17:43 +0200

#19834 — Detect XML document encodings with SAX

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-21 15:32 +0100
Subject	Detect XML document encodings with SAX
Message-ID	<k8ioi7$2e2$1@news.albasani.net>

Hello there,

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/

and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file
makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).

Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?

-- Sebastian

#19837

From	Lew <lewbloch@gmail.com>
Date	2012-11-21 11:31 -0800
Message-ID	<0b3b04bf-24dd-4d59-a16d-14c745b66c76@googlegroups.com>
In reply to	#19834

Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
> 
> and implemented both approaches (SAX and Xerces XNI).
> 
> Unfortunately, for the attached XML file, both methods

Don't do attachments on Usenet.

> output an encoding of UTF-8, while looking at the file

as they should. XML should be encoded in UTF-8 nearly always.

But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

> makes it clear that it is not UTF-8 encoded (all characters,
> including the umlaut and the Euro-sign, take one byte, and the
> declared encoding also is not UTF-8).

http://sscce.org/

> Does anyone have an idea why that is so? And how I could

You used the default encoding in your Writer.

> go about making some XML parser determine the correct encoding?

Your problem is writing the file, no? That has nothing to do with parsing.

If your problem is with reading the file, then the encoding in the XML declaration 
should suffice to guide the parser. But then why do you talk about methods that 
"output an encoding"?

However, according to 
http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2, 
ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP, 
as you would have learned had you researched your question.

So it looks like you must not accept XML documents with such a non-standard 
encoding.

Show us the code, or at least an SSCCE of it.

-- 
Lew

#19838

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-22 00:39 +0100
Message-ID	<k8jokk$kco$1@news.albasani.net>
In reply to	#19837

On 21.11.2012 20:31, Lew wrote:
> Sebastian wrote:
>> I discovered this post:
>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>
>> and implemented both approaches (SAX and Xerces XNI).
[snip]

>
> Your problem is writing the file, no? That has nothing to do with parsing.
No, it is with parsing the file. Parsing with the purpose of detecting
the encoding.

> If your problem is with reading the file, then the encoding in the XML declaration
> should suffice to guide the parser.
My question is exactly why in this case this does not suffice.

> But then why do you talk about methods that
> "output an encoding"?
I meant the System.out.println() statements in the code.

[snip]

> Show us the code, or at least an SSCCE of it.
>
I was referring to the code in the IBM developerworks article that I 
linked to. Perhaps I should simply have copied out that code into my 
original post. So here goes:

import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;

import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    /**
     * Print the encodings of all URLs given on the command line.
     */
    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                // startDocument() always throws, so the result is printed here
                System.out.println(handler.encoding);
            }
        }
    }

    private String encoding;
    private Locator2 locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        // Only Locator2 exposes getEncoding()
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            this.encoding = "unknown";
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        // Abort the parse: only the encoding is of interest
        throw new SAXException("Early termination");
    }

}
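
One thing that may be worth testing, and which nothing in this thread confirms: the handler above asks the locator in startDocument(), and a parser may fire that callback before it has read the XML declaration. The sketch below is a hypothetical variant (the class LateEncodingProbe and its fields are invented for illustration) that records Locator2.getEncoding() both in startDocument() and at the first startElement(), so the two values can be compared on whatever parser is installed. It feeds the parser an in-memory ISO-8859-1 document, which is an assumption chosen to keep the example self-contained.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

public class LateEncodingProbe extends DefaultHandler {

    private Locator2 locator;
    private boolean stopped;
    public String atStartDocument = "unknown";
    public String atFirstElement = "unknown";

    @Override
    public void setDocumentLocator(Locator l) {
        if (l instanceof Locator2) {
            locator = (Locator2) l;
        }
    }

    @Override
    public void startDocument() {
        // Possibly too early: the declaration may not have been read yet
        if (locator != null) {
            atStartDocument = locator.getEncoding();
        }
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        // By the root start tag the whole prolog has been consumed
        if (locator != null) {
            atFirstElement = locator.getEncoding();
        }
        stopped = true;
        throw new SAXException("Early termination after the root start tag");
    }

    /** Parses the bytes up to the root start tag and returns the filled-in handler. */
    public static LateEncodingProbe probe(byte[] xml) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        LateEncodingProbe handler = new LateEncodingProbe();
        reader.setContentHandler(handler);
        try {
            reader.parse(new InputSource(new ByteArrayInputStream(xml)));
        } catch (SAXException ex) {
            if (!handler.stopped) {
                throw ex;   // a real parse error, not the deliberate exit
            }
        }
        return handler;
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = "<?xml version='1.0' encoding='ISO-8859-1'?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        LateEncodingProbe p = probe(xml);
        System.out.println("startDocument: " + p.atStartDocument);
        System.out.println("startElement:  " + p.atFirstElement);
    }
}
```

Stopping at the first start tag keeps the cost close to that of the original approach, since only the prolog and one tag are read.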

#19839

From	Lew <lewbloch@gmail.com>
Date	2012-11-21 16:37 -0800
Message-ID	<bdb9651d-4fdb-4844-a718-aa93c7fe44ab@googlegroups.com>
In reply to	#19838

Sebastian wrote:
> Lew wrote:
>> Sebastian wrote:
>>> I discovered this post:
>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>>
>>> and implemented both approaches (SAX and Xerces XNI).
> 
> [snip]
> 
>> Your problem is writing the file, no? That has nothing to do with parsing.
> 
> No, it is with parsing the file. Parsing with the purpose of detecting
> the encoding.

Not clear from your phrasing.

>> If your problem is with reading the file, then the encoding in the XML declaration
>> should suffice to guide the parser.
> 
> My question is exactly why in this case this does not suffice.

Did my answer to that question not suffice?

I notice you didn't address my answer in your response; in fact you snipped it.

-- 
Lew

#19842

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-22 07:41 +0100
Message-ID	<k8khbm$vgq$1@news.albasani.net>
In reply to	#19839

On 22.11.2012 01:37, Lew wrote:
> Sebastian wrote:
>> Lew wrote:
>>> Sebastian wrote:
>>>> I discovered this post:
>>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>>>
>>>> and implemented both approaches (SAX and Xerces XNI).
>>
>> [snip]
>>
>>> Your problem is writing the file, no? That has nothing to do with parsing.
>>
>> No, it is with parsing the file. Parsing with the purpose of detecting
>> the encoding.
>
> Not clear from your phrasing.
>
>>> If your problem is with reading the file, then the encoding in the XML declaration
>>> should suffice to guide the parser.
>>
>> My question is exactly why in this case this does not suffice.
>
> Did my answer to that question not suffice?
>
> I notice you didn't address my answer in your response; in fact you snipped it.

The answer cannot be that windows-1250 is non-standard. In fact, the 
declared encoding of the XML file does not seem to matter. The code will 
always output "UTF-8".

I am using Java 7 on Windows XP.

-- Sebastian

#19844

From	markspace <-@.>
Date	2012-11-21 23:18 -0800
Message-ID	<k8kjl4$skg$1@dont-email.me>
In reply to	#19842

On 11/21/2012 10:41 PM, Sebastian wrote:

>
> The answer cannot be that windows-1250 is non-standard. In fact, the
> declared encoding of the XML file does not seem to matter. The code will
> always output "UTF-8".
>

Maybe this quote from the article will help you out:

"This approach works 90 percent of the time, maybe a little more. But 
SAX parsers aren't required to support the Locator interface, much less 
Locator2, and a few don't. A second option, if you know you're using 
Xerces, is to work with XNI"

Since the output of the program is "unknown", I'd guess that this 
particular SAX parser doesn't support Locator2, like it says.

#19846

From	Steven Simpson <ss@domain.invalid>
Date	2012-11-22 07:53 +0000
Message-ID	<9921o9-usm.ln1@s.simpson148.btinternet.com>
In reply to	#19844

On 22/11/12 07:18, markspace wrote:
> On 11/21/2012 10:41 PM, Sebastian wrote:
>>
>> The answer cannot be that windows-1250 is non-standard. In fact, the
>> declared encoding of the XML file does not seem to matter. The code will
>> always output "UTF-8".
>>
>
> Maybe this quote from the article will help you out:
>
> "This approach works 90 percent of the time, maybe a little more. But 
> SAX parsers aren't required to support the Locator interface, much 
> less Locator2, and a few don't. A second option, if you know you're 
> using Xerces, is to work with XNI"
>
>
> Since the output of the program is "unknown", I'd guess that this 
> particular SAX parser doesn't support Locator2, like it says.

Like the OP, I'm getting "UTF-8", and tracing in the code shows that it 
is getting a Locator2.


-- 
ss at comp dot lancs dot ac dot uk

#19853

From	markspace <-@.>
Date	2012-11-22 08:31 -0800
Message-ID	<k8lk20$euc$1@dont-email.me>
In reply to	#19846

On 11/21/2012 11:53 PM, Steven Simpson wrote:

> Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
> is getting a Locator2.


Oh, well mine doesn't.  I guess we have two different implementations. 
Sorry can't guess what is up with yours.

#19880

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-23 21:21 -0500
Message-ID	<50b02f35$0$283$14726298@news.sunsite.dk>
In reply to	#19844

On 11/22/2012 2:18 AM, markspace wrote:
> On 11/21/2012 10:41 PM, Sebastian wrote:
>> The answer cannot be that windows-1250 is non-standard. In fact, the
>> declared encoding of the XML file does not seem to matter. The code will
>> always output "UTF-8".
>>
>
> Maybe this quote from the article will help you out:
>
> "This approach works 90 percent of the time, maybe a little more. But
> SAX parsers aren't required to support the Locator interface, much less
> Locator2, and a few don't. A second option, if you know you're using
> Xerces, is to work with XNI"
>
> Since the output of the program is "unknown", I'd guess that this
> particular SAX parser doesn't support Locator2, like it says.

Except that it does not return Unknown - it returns UTF-8.

Arne

#19878

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-23 21:11 -0500
Message-ID	<50b02ce7$0$287$14726298@news.sunsite.dk>
In reply to	#19837

Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file

I tried.

And I cannot get it to work either.

SAX reports UTF-8 no matter what the encoding really is.

StAX never seems to detect it, and W3C DOM always seems to
detect it correctly.

I cannot offer an explanation. Obviously the parsers
need to detect it correctly internally. Otherwise they
could not parse correctly.

Code below.

Arne

====

import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlEncodingDectect {
    private static final String FNM1 = "/work/foobar1.xml";
    private static final String FNM2 = "/work/foobar2.xml";
    private static final String FNM3 = "/work/foobar3.xml";

    // Test file that declares UTF-8
    private static void gen1() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM1));
        pw.println("<?xml version='1.0' encoding='UTF-8'?>");
        pw.println("<root/>");
        pw.close();
    }

    // Test file that declares ISO-8859-1
    private static void gen2() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM2));
        pw.println("<?xml version='1.0' encoding='ISO-8859-1'?>");
        pw.println("<root/>");
        pw.close();
    }

    // Test file without an encoding declaration
    private static void gen3() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM3));
        pw.println("<?xml version='1.0'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static String encoding;

    private static String detectSAX(String fnm) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            private Locator2 locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                if (locator instanceof Locator2) {
                    this.locator = (Locator2) locator;
                } else {
                    encoding = "Unknown";
                }
            }

            @Override
            public void startDocument() throws SAXException {
                if (locator != null) {
                    encoding = locator.getEncoding();
                }
            }
        });
        parser.parse(new InputSource(new FileInputStream(fnm)));
        return encoding;
    }

    private static String detectW3CDOM(String fnm) throws ParserConfigurationException,
            FileNotFoundException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new FileInputStream(fnm)));
        String encoding = doc.getXmlEncoding();
        return encoding != null ? encoding : "Unknown";
    }

    private static String detectStAX(String fnm) throws FileNotFoundException, XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream(fnm));
        String encoding = null;
        while (xsr.hasNext()) {
            // Note: next() runs before the first getEventType(), so the reader's
            // initial START_DOCUMENT state is skipped and the case below never matches.
            xsr.next();
            switch (xsr.getEventType()) {
                case XMLStreamReader.START_DOCUMENT:
                    encoding = xsr.getEncoding();
                    break;
                default:
                    break;
            }
        }
        return encoding != null ? encoding : "Unknown";
    }

    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, XMLStreamException {
        gen1();
        System.out.println(detectSAX(FNM1));
        System.out.println(detectW3CDOM(FNM1));
        System.out.println(detectStAX(FNM1));
        gen2();
        System.out.println(detectSAX(FNM2));
        System.out.println(detectW3CDOM(FNM2));
        System.out.println(detectStAX(FNM2));
        gen3();
        System.out.println(detectSAX(FNM3));
        System.out.println(detectW3CDOM(FNM3));
        System.out.println(detectStAX(FNM3));
    }
}
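
A possible reason the StAX variant never reports anything: a freshly created XMLStreamReader is already positioned on START_DOCUMENT, so a loop that calls next() before looking at the event type never sees that state. The sketch below, offered as an assumption to try rather than a confirmed fix, queries the reader before the first next(). It uses getCharacterEncodingScheme(), which is documented to return the encoding named in the XML declaration, and reads from an in-memory byte array; the class and method names are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class StaxDeclaredEncoding {

    /** Returns the encoding named in the XML declaration, or "Unknown". */
    public static String declaredEncoding(byte[] xml) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            // Ask while the reader is still on START_DOCUMENT, before any next()
            if (reader.getEventType() != XMLStreamConstants.START_DOCUMENT) {
                return "Unknown";
            }
            String declared = reader.getCharacterEncodingScheme();
            return declared != null ? declared : "Unknown";
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] latin1 = "<?xml version='1.0' encoding='ISO-8859-1'?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] noDeclaredEncoding = "<?xml version='1.0'?><root/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(declaredEncoding(latin1));
        System.out.println(declaredEncoding(noDeclaredEncoding));
    }
}
```

Like SAX, this reads only the start of the stream, so it should not require holding a large document in memory.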

#19879

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-23 21:20 -0500
Message-ID	<50b02ee6$0$283$14726298@news.sunsite.dk>
In reply to	#19837

On 11/21/2012 2:31 PM, Lew wrote:
> Sebastian wrote:
>> I discovered this post:
>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>
>> and implemented both approaches (SAX and Xerces XNI).
>>
>> Unfortunately, for the attached XML file, both methods
>
> Don't do attachments on Usenet.
>
>> output an encoding of UTF-8, while looking at the file
>
> as they should.

No.

If the XML prolog specifies another encoding than UTF-8,
then it should not return UTF-8.

>                 XML should be encoded in UTF-8 nearly always.

XML allows for other encodings.

And Java XML parsers support it.

So it should always work.

> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

Output usually means System.out.println - that works fine with a parser.

> If your problem is with reading the file, then the encoding in the XML declaration
> should suffice to guide the parser. But then why do you talk about methods that
> "output an encoding"?

Because he wants to know what it is.

> However, according to
> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
> and EUC-JP, as you would have learned had you researched your question.
>
> So it looks like you must not accept XML documents with such a
> non-standard encoding.

Those who have researched it would know that the XML spec does not
limit the encodings at all. An XML processor must support UTF-8
and UTF-16, but is free to support others.
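
As a small illustration of that point, the sketch below parses a document that is declared and encoded as ISO-8859-1 with the JDK's default DOM parser, then checks both the declared encoding and the decoded text. The class name and the in-memory sample document are assumptions made for the example; ISO-8859-1 was picked because every Java platform is required to support it.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;

public class NonUtf8Demo {

    /** Parses raw bytes; the parser chooses the decoder from the declaration. */
    public static Document parse(byte[] xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new ByteArrayInputStream(xml));
    }

    public static void main(String[] args) throws Exception {
        // \u00e4 is a-umlaut; in ISO-8859-1 it is the single byte 0xE4
        String text = "<?xml version='1.0' encoding='ISO-8859-1'?><root>\u00e4</root>";
        Document doc = parse(text.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(doc.getXmlEncoding());
        System.out.println("\u00e4".equals(doc.getDocumentElement().getTextContent()));
    }
}
```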

Arne

#19888

From	Lew <lewbloch@gmail.com>
Date	2012-11-24 02:14 -0800
Message-ID	<d64baf3c-d582-4308-b6b4-714ef3049ef5@googlegroups.com>
In reply to	#19879

Arne Vajhøj wrote:
> Lew wrote:
>> Sebastian wrote:
[snip]
>>> output an encoding of UTF-8, while looking at the file
>> as they should.
> 
> No.
> 
> If the XML prolog specifies another encoding than UTF-8,
> then it should not return UTF-8.

True, but I'm saying they should specify UTF-8 in the prolog.

>>                 XML should be encoded in UTF-8 nearly always.

See?

> XML allows for other encodings.

So? You should use UTF-8 nearly always, i.e., unless there's a compelling 
reason not to.

> And Java XML parsers support it.

For those rare times when you deviate from the usual UTF-8.

> So it should always work.

>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
> 
> Output usually means System.out.println - that works fine with a parser.

His phrasing wasn't clear to me. That's why I asked for clarification.

I could have guessed, too.

>> If your problem is with reading the file, then the encoding in the XML declaration

See? You're preaching to the choir.

>> should suffice to guide the parser. But then why do you talk about methods that

>> "output an encoding"?
> 
> Because he wants to know what it is.
> 
>> However, according to
>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, 
>> and EUC-JP,
>> So it looks like you must not accept XML documents with such a 
>> non-standard encoding.
>
> Those who have researched it would know that the XML spec does not
> limit the encodings at all. An XML processor must support UTF-8
> and UTF-16, but is free to support others.

Perhaps the OP's parser doesn't exercise that freedom, judging by the 
symptoms.

'sall I'm sayin'.

Obviously I don't know the answer, but he's asking for suggestions 
to investigate, AIUI. He's having encoding problems. His XML is apparently 
encoded in Windows-1252, a notoriously funky encoding especially for 
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky 
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's 
the nature of brainstorming.

-- 
Lew

#19904

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-24 22:18 +0100
Message-ID	<k8rdfq$gbg$1@news.albasani.net>
In reply to	#19888

Sebastian wrote:
 > I discovered this post:
 > http://www.ibm.com/developerworks/library/x-tipsaxxni/
 >
 > and implemented both approaches (SAX and Xerces XNI).
 >
 > Unfortunately, for the attached XML file, both methods
 > output an encoding of UTF-8, while looking at the file

On 24.11.2012 11:14, Lew wrote:
[snip]
>
> Obviously I don't know the answer, but he's asking for suggestions
> to investigate, AIUI. He's having encoding problems. His XML is apparently
> encoded in Windows-1252, a notoriously funky encoding especially for
> the variety of characters with which one might wish to deal. So why not
> investigate obtaining material that isn't in such a notoriously funky
> encoding, like, oh, say, the old reliable standard UTF-8?
>
> Perhaps that isn't feasible, for reasons as yet unstated, but that's
> the nature of brainstorming.

Here's the background to my question:
I am dealing with other people's code that processes XML files.
Unfortunately, that code, which I have no control over, seems to use
some home-grown parsing algorithm, which DOES NOT always detect
encodings correctly, but expects to be told them.

The XML files come from several sources in different encodings, and I
cannot dictate anything there either.

So I thought, well, why don't I add a little preprocessor to discover
the encoding to give to that terrible file processor I'm stuck with.
Shouldn't be that hard, because, as Arne said:

 > On 24.11.2012 03:11, Arne Vajhøj wrote:
 > Obviously the parsers
 > need to detect it correctly internally. Otherwise they
 > could not parse correctly.

The only approach that seems to work (at least for Arne), namely
W3C DOM, is out of the question for me, because the files are
potentially huge and I cannot keep a complete document model in memory.
I need something along the lines of SAX. I'll have to look around some more.

-- Sebastian

PS: The author of that article from which I took the code isn't just
anyone. Elliotte Rusty Harold hosts the XML web site 
http://www.cafeconleche.org/ and is affiliated with the University of 
North Carolina. Perhaps I could try to get in touch with him.

#19905

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 17:07 -0500
Message-ID	<50b14516$0$282$14726298@news.sunsite.dk>
In reply to	#19904

On 11/24/2012 4:18 PM, Sebastian wrote:
> Am 24.11.2012 11:14, schrieb Lew:
> [snip]
>>
>> Obviously I don't know the answer, but he's asking for suggestions
>> to investigate, AIUI. He's having encoding problems. His XML is
>> apparently
>> encoded in Windows-1252, a notoriously funky encoding especially for
>> the variety of characters with which one might wish to deal. So why not
>> investigate obtaining material that isn't in such a notoriously funky
>> encoding, like, oh, say, the old reliable standard UTF-8?
>>
>> Perhaps that isn't feasible, for reasons as yet unstated, but that's
>> the nature of brainstorming.
>
> Here's the background to my question:
> I am dealing with other people's code that processes XML files.
> Unfortunately, that code, which I have no control over, seems to use
> some home-grown parsing algorithm, which DOES NOT always detect
> encodings correctly, but expects to be told them.
>
> The XML files come from several sources in different encodings, and I
> cannot dictate anything there either.

I would consider it tempting to rewrite that app to use a standard
XML parser.

It would solve this problem and possibly also some future problems.

> So I thought, well, why don't I add a little preprocessor to discover
> the encoding to give to that terrible file processor I'm stuck with.
> Shouldn't be that hard, because, as Arne said:
>
>  > On 24.11.2012 03:11, Arne Vajhøj wrote:
>  > Obviously the parsers
>  > need to detect it correctly internally. Otherwise they
>  > could not parse correctly.
>
> The only approach that seems to work (at least for Arne), namely
> W3C DOM, is out of the question for me, because the files are
> potentially huge and I cannot keep a complete document model in memory.
> I need something along the lines of SAX. I'll have to look around some
> more.

What about just reading the first few lines until you have the
XML declaration?

Parsing the encoding out of that should be simple.

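	private static final Pattern encpat = Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");
	private static String detectSimple(String fnm) throws IOException {
		// FileReader decodes with the platform default charset; that is only
		// adequate here while the file's encoding is ASCII-compatible.
		BufferedReader br = new BufferedReader(new FileReader(fnm));
		String firstpart = "";
		String line;
		// Also stop at end of file, so a file without '>' cannot loop forever.
		while(!firstpart.contains(">") && (line = br.readLine()) != null) firstpart += line;
		br.close();
		Matcher m = encpat.matcher(firstpart);
		if(m.find()) {
			return m.group(1);
		} else {
			return "Unknown";
		}
	}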

I do not like the solution, but given the restrictions in this
context, it may be what you need.
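
The same idea can be made slightly more defensive by working on raw bytes instead of a Reader. The sketch below is one hypothetical way to do it, not a complete implementation of the XML autodetection rules: it reads a fixed number of leading bytes (200 is an arbitrary assumption), checks for a UTF-8 or UTF-16 byte order mark, and otherwise looks for the encoding pseudo-attribute in the declaration. When neither is present it answers UTF-8, which is the default the XML specification prescribes in the absence of other information. UTF-32 and EBCDIC inputs are deliberately not handled.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DeclarationSniffer {

    // Matches only an XML declaration at the very start of the input
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /** Hypothetical helper: names the encoding from the first bytes of the stream. */
    public static String sniff(InputStream in) throws IOException {
        byte[] head = new byte[200];   // assumed large enough for the declaration
        int n = 0;
        int r;
        while (n < head.length && (r = in.read(head, n, head.length - n)) > 0) {
            n += r;
        }

        // A byte order mark settles the question before any text is decoded
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && (((head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF)
                || ((head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE))) {
            return "UTF-16";   // Java's UTF-16 decoder uses the mark to pick the byte order
        }

        // ISO-8859-1 maps every byte to exactly one char, so this decoding cannot fail
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(1) : "UTF-8";
    }

    public static void main(String[] args) throws IOException {
        byte[] sample = "<?xml version='1.0' encoding='windows-1250'?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(new ByteArrayInputStream(sample)));
    }
}
```

Because only a bounded prefix is read, the cost does not depend on the size of the file.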

> PS: The author of that article from which I took the code isn't just
> anyone. Elliotte Rusty Harold hosts the XML web site
> http://www.cafeconleche.org/ and is affiliated with the University of
> North Carolina. Perhaps I could try to get in touch with him.

Teaching at a university is no guarantee of good practical
programming skills.

Arne

#19926

From	Sebastian <sebastian@undisclosed.invalid>
Date	2012-11-25 10:50 +0100
Message-ID	<k8sphg$hn4$1@news.albasani.net>
In reply to	#19905

On 24.11.2012 23:07, Arne Vajhøj wrote:
[snip]
> I would consider it tempting to rewrite that app to use a standard
> XML parser.
>
> It would solve this problem and possibly also some future problems.

Yes, I wish I could do that (or rather, have that done...) It seems that
app also handles other types of files (like csv) and regardless of
file type they always do the same, namely open an InputStreamReader
given a charset name.
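
If the detected name is going to be handed to an InputStreamReader, it may help to validate it first, since an XML declaration can name an encoding the running JVM does not know. A minimal sketch, assuming a UTF-8 fallback is acceptable (the class and method names are made up for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetNames {

    /** Resolves an encoding name to a Charset, or returns the fallback. */
    public static Charset resolve(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Covers null, illegal and unsupported charset names alike
            return fallback;
        }
    }

    /** Opens a Reader with the named charset, falling back to UTF-8 (an assumption). */
    public static Reader open(InputStream in, String name) {
        return new InputStreamReader(in, resolve(name, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        byte[] data = { (byte) 0xE4 };   // a-umlaut in ISO-8859-1
        try (Reader r = open(new ByteArrayInputStream(data), "ISO-8859-1")) {
            System.out.println(r.read() == 0x00E4);
        }
    }
}
```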

[snip]

> What about just reading the first few lines until you have the
> XML declaration?
>
> Parsing the encoding out of that should be simple.
>
> private static final Pattern encpat =
> Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");
> private static String detectSimple(String fnm) throws IOException {
> BufferedReader br = new BufferedReader(new FileReader(fnm));
> String firstpart = "";
> while(!firstpart.contains(">")) firstpart += br.readLine();
> br.close();
> Matcher m = encpat.matcher(firstpart);
> if(m.find()) {
> return m.group(1);
> } else {
> return "Unknown";
> }
> }
>
> I do not like the solution, but given the restrictions in this
> context, it may be what you need.

Thanks for the suggestion. I'll use that idea until a better solution 
becomes feasible.

-- Sebastian

#19914

From	markspace <-@.>
Date	2012-11-24 17:12 -0800
Message-ID	<k8rral$nb1$1@dont-email.me>
In reply to	#19904

On 11/24/2012 1:18 PM, Sebastian wrote:
> I am dealing with other people's code that processes XML files.
> Unfortunately, that code, which I have no control over, seems to use
> some home-grown parsing algorithm, which DOES NOT always detect
> encodings correctly, but expects to be told them.

That's not a big deal.  Several of the Java components work this way. 
Open the file with an assumed encoding, and test the encoding.  If you 
are wrong, throw an exception, which causes the stream to be re-opened 
with the correct encoding (now that the correct encoding has been detected).

Be careful you're not subverting an established, working process here.

I personally am still looking for an SSCCE, as your last one didn't 
reproduce the error for me.
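
The "assume, test, retry" pattern described above can be sketched with a strict CharsetDecoder. This is a generic illustration, not a description of how any particular Java component works internally: the code first decodes with UTF-8 and reports malformed input as an exception instead of silently substituting characters, then retries with a caller-supplied fallback. The class name and the choice of UTF-8 as the first guess are assumptions.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class AssumeThenVerify {

    /** Decodes strictly: malformed input raises an exception instead of being replaced. */
    public static String decodeStrict(byte[] data, Charset cs) throws CharacterCodingException {
        return cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data))
                .toString();
    }

    /** Tries UTF-8 first and, if the bytes are not valid UTF-8, the fallback charset. */
    public static String decode(byte[] data, Charset fallback) throws CharacterCodingException {
        try {
            return decodeStrict(data, StandardCharsets.UTF_8);
        } catch (CharacterCodingException wrongGuess) {
            return decodeStrict(data, fallback);
        }
    }

    public static void main(String[] args) throws CharacterCodingException {
        byte[] latin1 = { (byte) 0xE4 };   // valid ISO-8859-1, malformed as UTF-8
        System.out.println("\u00e4".equals(decode(latin1, StandardCharsets.ISO_8859_1)));
    }
}
```

A successful strict UTF-8 decode is strong evidence but not proof, and single-byte encodings such as ISO-8859-1 accept any byte sequence, so the fallback can never be verified this way.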

#19915

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 20:17 -0500
Message-ID	<50b171be$0$292$14726298@news.sunsite.dk>
In reply to	#19914

On 11/24/2012 8:12 PM, markspace wrote:
> I personally am still looking for an SSCCE, as your last one didn't
> reproduce the error for me.

Did you try my 1 2 3 example?

Arne

#19917

From	markspace <-@.>
Date	2012-11-24 18:02 -0800
Message-ID	<k8ru8h$4na$1@dont-email.me>
In reply to	#19915

On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
> On 11/24/2012 8:12 PM, markspace wrote:
>> I personally am still looking for an SSCCE, as your last one didn't
>> reproduce the error for me.
>
> Did you try my 1 2 3 example?

No that errors on me too.  Really all the OP did was cut and paste the 
example code from his link as his SSCCE, it didn't even contain his data 
file.  There's no way I'm going to bother writing code for anyone who is 
that lazy.

(Your code throws an exception because I have no directory named 
"/work".  Look up StringReader or ByteArrayInputStream.  Use those 
instead of relying on actual files and a file system.)

#19918

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2012-11-24 21:10 -0500
Message-ID	<50b17e18$0$288$14726298@news.sunsite.dk>
In reply to	#19917

On 11/24/2012 9:02 PM, markspace wrote:
> On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
>> On 11/24/2012 8:12 PM, markspace wrote:
>>> I personally am still looking for an SSCCE, as your last one didn't
>>> reproduce the error for me.
>>
>> Did you try my 1 2 3 example?
>
> No that errors on me too.  Really all the OP did was cut and paste the
> example code from his link as his SSCCE, it didn't even contain his data
> file.  There's no way I'm going to bother writing code for anyone who is
> that lazy.
>
> (Your code throws an exception because I have no directory named
> "/work".  Look up StringReader or ByteArrayInputStream.  Use those
> instead of relying on actual files and a file system.)

The code actually writes the files. Just change the path to
something valid for write.

I could have read from memory, but when it comes to encoding
I want to use a real file.

I have previously had bad experience with XML parsed from a String
always being considered UTF-16. Not in Java, but still.
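
The difference between in-memory characters and in-memory bytes can be shown without touching the file system. In the sketch below (class name and sample document are assumptions), the same text is parsed once from a StringReader, where decoding has already happened and the declaration cannot influence it, and once from a byte array that is really UTF-8 but declares ISO-8859-1, so the parser decodes each byte of the two-byte sequence as a separate character.

```java
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;

public class ReaderVersusBytes {

    /** Returns the text content of the root element. */
    public static String rootText(InputSource source) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(source).getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version='1.0' encoding='ISO-8859-1'?><root>\u00e4</root>";

        // Character stream: one character in, one character out
        String fromReader = rootText(new InputSource(new StringReader(xml)));

        // Byte stream that contradicts its own declaration: UTF-8 bytes read as ISO-8859-1
        byte[] utf8 = xml.getBytes(StandardCharsets.UTF_8);
        String fromBytes = rootText(new InputSource(new ByteArrayInputStream(utf8)));

        System.out.println(fromReader.length());   // 1
        System.out.println(fromBytes.length());    // 2
    }
}
```

A ByteArrayInputStream therefore exercises the same byte-level decoding path as a real file, while a String or Reader source bypasses it.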

Arne

#19919

From	markspace <-@.>
Date	2012-11-24 18:25 -0800
Message-ID	<k8rvin$9of$1@dont-email.me>
In reply to	#19918

On 11/24/2012 6:10 PM, Arne Vajhøj wrote:

> On 11/24/2012 9:02 PM, markspace wrote:
>> (Your code throws an exception because I have no directory named
>> "/work".  Look up StringReader or ByteArrayInputStream.  Use those
>> instead of relying on actual files and a file system.)

>
> The code actually writes the files.


Yeah, I'd rather not have artifacts on my hard drive either, thanks.
