| Started by | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| First post | 2012-11-21 15:32 +0100 |
| Last post | 2012-12-16 17:43 +0200 |
| Articles | 20 on this page of 43 — 9 participants |
Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-21 15:32 +0100
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 11:31 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 00:39 +0100
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 16:37 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 07:41 +0100
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-21 23:18 -0800
Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-22 07:53 +0000
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-22 08:31 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:21 -0500
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:11 -0500
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:20 -0500
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-24 02:14 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-24 22:18 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:07 -0500
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:50 +0100
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 17:12 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:17 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:02 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:10 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:25 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:37 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 21:01 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:30 -0500
Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 18:03 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 21:09 -0500
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 18:58 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 22:17 -0500
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 22:51 -0800
Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 21:52 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:45 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:23 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-25 13:24 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:58 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:13 -0500
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:19 -0500
Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 03:24 -0800
Re: Detect XML document encodings with SAX "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:13 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:22 -0500
Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-25 11:00 +0000
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 12:32 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:41 -0500
Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-12-12 20:32 -0800
Re: Detect XML document encodings with SAX Stanimir Stamenkov <s7an10@netscape.net> - 2012-12-16 17:43 +0200
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-21 15:32 +0100 |
| Subject | Detect XML document encodings with SAX |
| Message-ID | <k8ioi7$2e2$1@news.albasani.net> |
Hello there,
I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/
and implemented both approaches (SAX and Xerces XNI).
Unfortunately, for the attached XML file, both methods
output an encoding of UTF-8, while looking at the file
makes it clear that it is not UTF-8 encoded (all characters,
including the umlaut and the Euro-sign, take one byte, and the
declared encoding also is not UTF-8).
Does anyone have an idea why that is so? And how I could
go about making some XML parser determine the correct encoding?
-- Sebastian
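The byte-count observation in this post can be checked directly. The following is a minimal sketch, not code from the thread: the class name ByteCountDemo and the sample string are invented for illustration, and windows-1252 is only an assumed stand-in for whatever single-byte encoding the attachment actually used.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not taken from the thread.
final class ByteCountDemo {

    /** Number of bytes the text occupies when encoded with the given charset. */
    static int encodedLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    public static void main(String[] args) {
        String sample = "\u00e4\u20ac"; // a-umlaut followed by the Euro sign
        Charset singleByte = Charset.forName("windows-1252"); // assumed stand-in encoding
        System.out.println(encodedLength(sample, singleByte));
        System.out.println(encodedLength(sample, StandardCharsets.UTF_8));
    }
}
```

In UTF-8 neither of these two characters fits into a single byte, so a file in which each of them occupies exactly one byte cannot be UTF-8.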
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-11-21 11:31 -0800 |
| Message-ID | <0b3b04bf-24dd-4d59-a16d-14c745b66c76@googlegroups.com> |
| In reply to | #19834 |
Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
Don't do attachments on Usenet.
> output an encoding of UTF-8, while looking at the file
as they should. XML should be encoded in UTF-8 nearly always.
But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
> makes it clear that it is not UTF-8 encoded (all characters,
> including the umlaut and the Euro-sign, take one byte, and the
> declared encoding also is not UTF-8).
http://sscce.org/
> Does anyone have an idea why that is so? And how I could
You used the default encoding in your Writer.
> go about making some XML parser determine the correct encoding?
Your problem is writing the file, no? That has nothing to do with parsing.
If your problem is with reading the file, then the encoding in the XML declaration
should suffice to guide the parser. But then why do you talk about methods that
"output an encoding"?
However, according to
http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP,
as you would have learned had you researched your question.
So it looks like you must not accept XML documents with such a non-standard
encoding.
Show us the code, or at least an SSCCE of it.
-- Lew
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-22 00:39 +0100 |
| Message-ID | <k8jokk$kco$1@news.albasani.net> |
| In reply to | #19837 |
On 21.11.2012 20:31, Lew wrote:
> Sebastian wrote:
>> I discovered this post:
>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>
>> and implemented both approaches (SAX and Xerces XNI).
[snip]
>
> Your problem is writing the file, no? That has nothing to do with parsing.
No, it is with parsing the file. Parsing with the purpose of detecting
the encoding.
> If your problem is with reading the file, then the encoding in the XML declaration
> should suffice to guide the parser.
My question is exactly why in this case this does not suffice.
>But then why do you talk about methods that
> "output an encoding"?
I meant the System.out.println() statements in the code.
[snip]
> Show us the code, or at least an SSCCE of it.
>
I was referring to the code in the IBM developerworks article that I
linked to. Perhaps I should simply have copied out that code into my
original post. So here goes:
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.IOException;

public class SAXEncodingDetector extends DefaultHandler {

    /**
     * print the encodings of all URLs given on the command line.
     */
    public static void main(String[] args) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        SAXEncodingDetector handler = new SAXEncodingDetector();
        parser.setContentHandler(handler);
        for (int i = 0; i < args.length; i++) {
            try {
                parser.parse(args[i]);
            }
            catch (SAXException ex) {
                System.out.println(handler.encoding);
            }
        }
    }

    private String encoding;
    private Locator2 locator;

    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
        else {
            this.encoding = "unknown";
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encoding = locator.getEncoding();
        }
        throw new SAXException("Early termination");
    }
}
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-11-21 16:37 -0800 |
| Message-ID | <bdb9651d-4fdb-4844-a718-aa93c7fe44ab@googlegroups.com> |
| In reply to | #19838 |
Sebastian wrote:
> Lew wrote:
>> Sebastian wrote:
>>> I discovered this post:
>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>>
>>> and implemented both approaches (SAX and Xerces XNI).
>
> [snip]
>
>> Your problem is writing the file, no? That has nothing to do with parsing.
>
> No, it is with parsing the file. Parsing with the purpose of detecting
> the encoding.
Not clear from your phrasing.
>> If your problem is with reading the file, then the encoding in the XML declaration
>> should suffice to guide the parser.
>
> My question is exactly why in this case this does not suffice.
Did my answer to that question not suffice?
I notice you didn't address my answer in your response; in fact you snipped it.
-- Lew
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-22 07:41 +0100 |
| Message-ID | <k8khbm$vgq$1@news.albasani.net> |
| In reply to | #19839 |
On 22.11.2012 01:37, Lew wrote:
> Sebastian wrote:
>> Lew wrote:
>>> Sebastian wrote:
>>>> I discovered this post:
>>>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>>>
>>>> and implemented both approaches (SAX and Xerces XNI).
>>
>> [snip]
>>
>>> Your problem is writing the file, no? That has nothing to do with parsing.
>>
>> No, it is with parsing the file. Parsing with the purpose of detecting
>> the encoding.
>
> Not clear from your phrasing.
>
>>> If your problem is with reading the file, then the encoding in the XML declaration
>>> should suffice to guide the parser.
>>
>> My question is exactly why in this case this does not suffice.
>
> Did my answer to that question not suffice?
>
> I notice you didn't address my answer in your response; in fact you snipped it.
The answer cannot be that windows-1250 is non-standard. In fact, the
declared encoding of the XML file does not seem to matter. The code will
always output "UTF-8".
I am using Java 7 on Windows XP.
-- Sebastian
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-21 23:18 -0800 |
| Message-ID | <k8kjl4$skg$1@dont-email.me> |
| In reply to | #19842 |
On 11/21/2012 10:41 PM, Sebastian wrote:
>
> The answer cannot be that windows-1250 is non-standard. In fact, the
> declared encoding of the XML file does not seem to matter. The code will
> always output "UTF-8".
>
Maybe this quote from the article will help you out:
"This approach works 90 percent of the time, maybe a little more. But
SAX parsers aren't required to support the Locator interface, much less
Locator2, and a few don't. A second option, if you know you're using
Xerces, is to work with XNI"
Since the output of the program is "unknown", I'd guess that this
particular SAX parser doesn't support Locator2, like it says.
| From | Steven Simpson <ss@domain.invalid> |
|---|---|
| Date | 2012-11-22 07:53 +0000 |
| Message-ID | <9921o9-usm.ln1@s.simpson148.btinternet.com> |
| In reply to | #19844 |
On 22/11/12 07:18, markspace wrote:
> On 11/21/2012 10:41 PM, Sebastian wrote:
>>
>> The answer cannot be that windows-1250 is non-standard. In fact, the
>> declared encoding of the XML file does not seem to matter. The code will
>> always output "UTF-8".
>>
>
> Maybe this quote from the article will help you out:
>
> "This approach works 90 percent of the time, maybe a little more. But
> SAX parsers aren't required to support the Locator interface, much
> less Locator2, and a few don't. A second option, if you know you're
> using Xerces, is to work with XNI"
>
>
> Since the output of the program is "unknown", I'd guess that this
> particular SAX parser doesn't support Locator2, like it says.
Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
is getting a Locator2.
-- ss at comp dot lancs dot ac dot uk
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-22 08:31 -0800 |
| Message-ID | <k8lk20$euc$1@dont-email.me> |
| In reply to | #19846 |
On 11/21/2012 11:53 PM, Steven Simpson wrote:
> Like the OP, I'm getting "UTF-8", and tracing in the code shows that it
> is getting a Locator2.
Oh, well mine doesn't. I guess we have two different implementations.
Sorry can't guess what is up with yours.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-23 21:21 -0500 |
| Message-ID | <50b02f35$0$283$14726298@news.sunsite.dk> |
| In reply to | #19844 |
On 11/22/2012 2:18 AM, markspace wrote:
> On 11/21/2012 10:41 PM, Sebastian wrote:
>> The answer cannot be that windows-1250 is non-standard. In fact, the
>> declared encoding of the XML file does not seem to matter. The code will
>> always output "UTF-8".
>>
>
> Maybe this quote from the article will help you out:
>
> "This approach works 90 percent of the time, maybe a little more. But
> SAX parsers aren't required to support the Locator interface, much less
> Locator2, and a few don't. A second option, if you know you're using
> Xerces, is to work with XNI"
>
> Since the output of the program is "unknown", I'd guess that this
> particular SAX parser doesn't support Locator2, like it says.
Except that it does not return Unknown - it returns UTF-8.
Arne
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-23 21:11 -0500 |
| Message-ID | <50b02ce7$0$287$14726298@news.sunsite.dk> |
| In reply to | #19837 |
Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file
I tried.
And I can not get it to work either.
SAX detects UTF-8 no matter what it really is.
StAX seems never to detect and W3C DOM seems to
always detect correct.
I can not offer an explanation. Obviously the parsers
need to internally detect correct. Otherwise they
could not parse correct.
Code below.
Arne
====
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.XMLReaderFactory;
import org.xml.sax.helpers.DefaultHandler;

public class XmlEncodingDectect {
    private static final String FNM1 = "/work/foobar1.xml";
    private static final String FNM2 = "/work/foobar2.xml";
    private static final String FNM3 = "/work/foobar3.xml";

    private static void gen1() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM1));
        pw.println("<?xml version='1.0' encoding='UTF-8'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static void gen2() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM2));
        pw.println("<?xml version='1.0' encoding='ISO-8859-1'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static void gen3() throws IOException {
        PrintWriter pw = new PrintWriter(new FileWriter(FNM3));
        pw.println("<?xml version='1.0'?>");
        pw.println("<root/>");
        pw.close();
    }

    private static String encoding;

    private static String detectSAX(String fnm) throws SAXException, IOException {
        XMLReader parser = XMLReaderFactory.createXMLReader();
        parser.setContentHandler(new DefaultHandler() {
            private Locator2 locator;

            @Override
            public void setDocumentLocator(Locator locator) {
                if (locator instanceof Locator2) {
                    this.locator = (Locator2) locator;
                } else {
                    encoding = "Unknown";
                }
            }

            @Override
            public void startDocument() throws SAXException {
                if (locator != null) {
                    encoding = locator.getEncoding();
                }
            }
        });
        parser.parse(new InputSource(new FileInputStream(fnm)));
        return encoding;
    }

    private static String detectW3CDOM(String fnm) throws ParserConfigurationException,
            FileNotFoundException, SAXException, IOException {
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder();
        Document doc = db.parse(new InputSource(new FileInputStream(fnm)));
        String encoding = doc.getXmlEncoding();
        return encoding != null ? encoding : "Unknown";
    }

    private static String detectStAX(String fnm) throws FileNotFoundException,
            XMLStreamException {
        XMLInputFactory xif = XMLInputFactory.newInstance();
        XMLStreamReader xsr = xif.createXMLStreamReader(new FileInputStream(fnm));
        String encoding = null;
        while(xsr.hasNext()) {
            xsr.next();
            switch(xsr.getEventType()) {
                case XMLStreamReader.START_DOCUMENT:
                    encoding = xsr.getEncoding();
                    break;
                default:
                    break;
            }
        }
        return encoding != null ? encoding : "Unknown";
    }

    public static void main(String[] args) throws IOException, SAXException,
            ParserConfigurationException, XMLStreamException {
        gen1();
        System.out.println(detectSAX(FNM1));
        System.out.println(detectW3CDOM(FNM1));
        System.out.println(detectStAX(FNM1));
        gen2();
        System.out.println(detectSAX(FNM2));
        System.out.println(detectW3CDOM(FNM2));
        System.out.println(detectStAX(FNM2));
        gen3();
        System.out.println(detectSAX(FNM3));
        System.out.println(detectW3CDOM(FNM3));
        System.out.println(detectStAX(FNM3));
    }
}
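A possible reason the StAX variant in this post never reports an encoding: a freshly created XMLStreamReader is already positioned on START_DOCUMENT, so a loop that calls next() before looking at the event type never sees that event. The sketch below asks the reader for the declared encoding right after creation instead. It is only an illustration under assumptions: the class and method names (StaxDeclaredEncoding, declaredEncoding) are made up, the input is an in-memory byte array rather than a file, and the behaviour was reasoned about for the StAX implementation bundled with the JDK, not for every StAX provider.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper; not part of the code posted in the thread.
final class StaxDeclaredEncoding {

    /** Returns the encoding named in the XML declaration, or "Unknown" if none is declared. */
    static String declaredEncoding(byte[] xml) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(xml));
            try {
                // The reader starts out on START_DOCUMENT, so no next() call is needed here.
                String declared = reader.getCharacterEncodingScheme();
                return declared != null ? declared : "Unknown";
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("input is not parseable as XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = "<?xml version='1.0' encoding='ISO-8859-1'?><root>\u00e4</root>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] noDeclaration = "<root/>".getBytes(StandardCharsets.US_ASCII);
        System.out.println(declaredEncoding(latin1));
        System.out.println(declaredEncoding(noDeclaration));
    }
}
```

Because StAX is a streaming API, this reads only the start of the document and does not build a tree, which may matter for very large files.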
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-23 21:20 -0500 |
| Message-ID | <50b02ee6$0$283$14726298@news.sunsite.dk> |
| In reply to | #19837 |
On 11/21/2012 2:31 PM, Lew wrote:
> Sebastian wrote:
>> I discovered this post:
>> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>>
>> and implemented both approaches (SAX and Xerces XNI).
>>
>> Unfortunately, for the attached XML file, both methods
>
> Don't do attachments on Usenet.
>
>> output an encoding of UTF-8, while looking at the file
>
> as they should.
No.
If the XML prolog specifies another encoding than UTF-8,
then it should not return UTF-8.
> XML should be encoded in UTF-8 nearly always.
XML allows for other encodings.
And Java XML parsers support it.
So it should always work.
> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
Output usually mean System.out.println - that works fine with a parser.
> If your problem is with reading the file, then the encoding in the XML declaration
> should suffice to guide the parser. But then why do you talk about methods that
> "output an encoding"?
Because he wants to know what it is.
> However, according to
> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP,
> as you would have learned had you researched your question.
>
> So it looks like you must not accept XML documents with such a non-standard
> encoding.
Those that has researched would know that the XML spec do not
limit the encodings at all. The XML processor must support UTF-8
and UTF-16, but are free to support others.
Arne
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-11-24 02:14 -0800 |
| Message-ID | <d64baf3c-d582-4308-b6b4-714ef3049ef5@googlegroups.com> |
| In reply to | #19879 |
Arne Vajhøj wrote:
> Lew wrote:
>> Sebastian wrote:
[snip]
>>> output an encoding of UTF-8, while looking at the file
>> as they should.
>
> No.
>
> If the XML prolog specifies another encoding than UTF-8,
> then it should not return UTF-8.
True, but I'm saying they should specify UTF-8 in the prolog.
>> XML should be encoded in UTF-8 nearly always.
See?
> XML allows for other encodings.
So? You should use UTF-8 nearly always, i.e., unless there's a compelling reason not to.
> And Java XML parsers support it.
For those rare times when you deviate from the usual UTF-8.
> So it should always work.
>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
>
> Output usually mean System.out.println - that works fine with a parser.
His phrasing wasn't clear to me. That's why I asked for clarification.
I could have guessed, too.
>> If your problem is with reading the file, then the encoding in the XML declaration
See? You're preaching to the choir.
>> should suffice to guide the parser. But then why do you talk about methods that
>> "output an encoding"?
>
> Because he wants to know what it is.
>
>> However, according to
>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>> and EUC-JP,
>> So it looks like you must not accept XML documents with such a
>> non-standard encoding.
>
> Those that has researched would know that the XML spec do not
> limit the encodings at all. The XML processor must support UTF-8
> and UTF-16, but are free to support others.
Perhaps the OP's parser doesn't exercise that freedom, judging by the symptoms.
'sall I'm sayin'.
Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?
Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.
-- Lew
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-24 22:18 +0100 |
| Message-ID | <k8rdfq$gbg$1@news.albasani.net> |
| In reply to | #19888 |
Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file
On 24.11.2012 11:14, Lew wrote:
[snip]
>
> Obviously I don't know the answer, but he's asking for suggestions
> to investigate, AIUI. He's having encoding problems. His XML is apparently
> encoded in Windows-1252, a notoriously funky encoding especially for
> the variety of characters with which one might wish to deal. So why not
> investigate obtaining material that isn't in such a notoriously funky
> encoding, like, oh, say, the old reliable standard UTF-8?
>
> Perhaps that isn't feasible, for reasons as yet unstated, but that's
> the nature of brainstorming.
Here's the background to my question:
I am dealing with other people's code that processes XML files.
Unfortunately, that code, which I have no control over, seems to use
some home-grown parsing algorithm, which DOES NOT always detect
encodings correctly, but expects to be told them.
The XML files come from several sources in different encodings, and I
cannot dictate anything there either.
So I thought, well, why don't I add a little preprocessor to discover
the encoding to give to that terrible file processor I'm stuck with.
Shouldn't be that hard, because, as Arne said:
> On 24.11.2012 03:11, Arne Vajhøj wrote:
> Obviously the parsers
> need to internally detect correct. Otherwise they
> could not parse correct.
The only approach that seems to work (at least for Arne), namely
W3C DOM, is out of the question for me, because the files are
potentially huge and I cannot keep a complete document model in memory.
I need something along the lines of SAX. I'll have to look around some
more.
-- Sebastian
PS: The author of that article from which I took the code isn't just
anyone. Elliotte Rusty Harold hosts the XML web site
http://www.cafeconleche.org/ and is affiliated with the University of
North Carolina. Perhaps I could try to get in touch with him.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 17:07 -0500 |
| Message-ID | <50b14516$0$282$14726298@news.sunsite.dk> |
| In reply to | #19904 |
On 11/24/2012 4:18 PM, Sebastian wrote:
> On 24.11.2012 11:14, Lew wrote:
> [snip]
>>
>> Obviously I don't know the answer, but he's asking for suggestions
>> to investigate, AIUI. He's having encoding problems. His XML is
>> apparently
>> encoded in Windows-1252, a notoriously funky encoding especially for
>> the variety of characters with which one might wish to deal. So why not
>> investigate obtaining material that isn't in such a notoriously funky
>> encoding, like, oh, say, the old reliable standard UTF-8?
>>
>> Perhaps that isn't feasible, for reasons as yet unstated, but that's
>> the nature of brainstorming.
>
> Here's the background to my question:
> I am dealing with other people's code that processes XML files.
> Unfortunately, that code, which I have no control over, seems to use
> some home-grown parsing algorithm, which DOES NOT always detect
> encodings correctly, but expects to be told them.
>
> The XML files come from several sources in different encodings, and I
> cannot dictate anything there either.
I would consider it tempting to rewrite that app to use a standard
XML parser.
It would solve this problem and possibly also some future problems.
> So I thought, well, why don't I add a little preprocessor to discover
> the encoding to give to that terrible file processor I'm stuck with.
> Shouldn't be that hard, because, as Arne said:
>
> > On 24.11.2012 03:11, Arne Vajhøj wrote:
> > Obviously the parsers
> > need to internally detect correct. Otherwise they
> > could not parse correct.
>
> The only approach that seems to work (at least for Arne), namely
> W3C DOM, is out of the question for me, because the files are
> potentially huge and I cannot keep a complete document model in memory.
> I need something along the lines of SAX. I'll have to look around some
> more.
What about just reading the first few lines until you have the
XML declaration.
Parsing the encoding out of that should be simple.
private static final Pattern encpat =
    Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");

private static String detectSimple(String fnm) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(fnm));
    String firstpart = "";
    while(!firstpart.contains(">")) firstpart += br.readLine();
    br.close();
    Matcher m = encpat.matcher(firstpart);
    if(m.find()) {
        return m.group(1);
    } else {
        return "Unknown";
    }
}
I do not like the solution, but given the restrictions in the
context, then maybe it is what you need.
> PS: The author of that article from which I took the code isn't just
> anyone. Elliotte Rusty Harold hosts the XML web site
> http://www.cafeconleche.org/ and is affiliated with the University of
> North Carolina. Perhaps I could try to get in touch with him.
Teaching at a university is no guarantee of good practical
programming skills.
Arne
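Two things are worth watching in the snippet from this post: FileReader decodes with the platform default charset, and readLine() returns null at end of file, so an input without any ">" would keep the loop running. The following is a hedged sketch of the same "look only at the declaration" idea working on raw bytes instead. All names in it (XmlDeclSniffer, fromPrefix, fromStream) are hypothetical, it only understands ASCII-compatible encodings plus UTF-8/UTF-16 byte order marks (UTF-32 and EBCDIC are ignored), and it is not a replacement for a real XML parser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative and not from the thread.
final class XmlDeclSniffer {

    // Matches the encoding pseudo-attribute of an XML declaration at the very start of the text.
    private static final Pattern ENCODING = Pattern.compile(
            "^<\\?xml\\s[^>]*?encoding\\s*=\\s*(['\"])([A-Za-z][A-Za-z0-9._-]*)\\1");

    /** Guesses the encoding from the first len bytes of a document. */
    static String fromPrefix(byte[] head, int len) {
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";     // UTF-8 byte order mark
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";  // big-endian byte order mark
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";  // little-endian byte order mark
        }
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII declaration survives.
        String text = new String(head, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        // Without a byte order mark or a declaration, the XML default is UTF-8.
        return m.find() ? m.group(2) : "UTF-8";
    }

    /** Reads at most 1024 bytes from the stream and applies fromPrefix to them. */
    static String fromStream(InputStream in) throws IOException {
        byte[] head = new byte[1024];
        int total = 0;
        while (total < head.length) {
            int r = in.read(head, total, head.length - total);
            if (r < 0) {
                break; // end of stream reached before the buffer filled up
            }
            total += r;
        }
        return fromPrefix(head, total);
    }

    public static void main(String[] args) {
        byte[] sample = "<?xml version='1.0' encoding='windows-1250'?><root/>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(fromPrefix(sample, sample.length));
    }
}
```

The returned name could then be passed to the InputStreamReader that the existing processor opens, after checking with Charset.isSupported that the JVM knows it.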
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-25 10:50 +0100 |
| Message-ID | <k8sphg$hn4$1@news.albasani.net> |
| In reply to | #19905 |
On 24.11.2012 23:07, Arne Vajhøj wrote:
[snip]
> I would consider it tempting to rewrite that app to use a standard
> XML parser.
>
> It would solve this problem and possibly also some future problems.
Yes, I wish I could do that (or rather, have that done...) It seems that
app also handles other types of files (like csv) and regardless of
file type they always do the same, namely open an InputStreamReader
given a charset name.
[snip]
> What about just reading the first few lines until you have the
> XML declaration.
>
> Parsing the encoding out of that should be simple.
>
> private static final Pattern encpat =
>     Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");
> private static String detectSimple(String fnm) throws IOException {
>     BufferedReader br = new BufferedReader(new FileReader(fnm));
>     String firstpart = "";
>     while(!firstpart.contains(">")) firstpart += br.readLine();
>     br.close();
>     Matcher m = encpat.matcher(firstpart);
>     if(m.find()) {
>         return m.group(1);
>     } else {
>         return "Unknown";
>     }
> }
>
> I do not like the solution, but given the restrictions in the
> context, then maybe it is what you need.
Thanks for the suggestion. I'll use that idea until a better solution
becomes feasible.
-- Sebastian
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-24 17:12 -0800 |
| Message-ID | <k8rral$nb1$1@dont-email.me> |
| In reply to | #19904 |
On 11/24/2012 1:18 PM, Sebastian wrote:
> I am dealing with other people's code that processes XML files.
> Unfortunately, that code, which I have no control over, seems to use
> some home-grown parsing algorithm, which DOES NOT always detect
> encodings correctly, but expects to be told them.
That's not a big deal. Several of the Java components work this way.
Open the file with an assumed encoding, and test the encoding. If you
are wrong, throw an exception, which causes the stream to be re-opened
with the correct encoding (now that the correct encoding has been
detected).
Be careful you're not subverting an established, working process here.
I personally am still looking for an SSCCE, as your last one didn't
reproduce the error for me.
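The "assume an encoding, test it, retry with another" idea can be sketched for plain byte decoding. This is a loosely related illustration, not what any XML parser or the application under discussion is known to do: the class StrictDecodeProbe and its methods are invented names, and the approach can only rule encodings out. A single-byte charset such as ISO-8859-1 accepts every byte sequence, so it should come last in the candidate list.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not taken from the thread.
final class StrictDecodeProbe {

    /** True if the bytes decode under the charset without malformed or unmappable input. */
    static boolean decodesAs(byte[] data, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // the assumed encoding was wrong for this input
        }
    }

    /** Returns the first candidate that decodes the data strictly, or null if none does. */
    static Charset firstThatDecodes(byte[] data, Charset... candidates) {
        for (Charset candidate : candidates) {
            if (decodesAs(data, candidate)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // 0xE4 followed by 0x80 and '<' is not a valid UTF-8 sequence.
        byte[] singleByteText = {'<', 'x', '>', (byte) 0xE4, (byte) 0x80, '<', '/', 'x', '>'};
        System.out.println(firstThatDecodes(singleByteText,
                StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1));
    }
}
```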
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 20:17 -0500 |
| Message-ID | <50b171be$0$292$14726298@news.sunsite.dk> |
| In reply to | #19914 |
On 11/24/2012 8:12 PM, markspace wrote:
> I personally am still looking for an SSCCE, as your last one didn't
> reproduce the error for me.
Did you try my 1 2 3 example?
Arne
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-24 18:02 -0800 |
| Message-ID | <k8ru8h$4na$1@dont-email.me> |
| In reply to | #19915 |
On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
> On 11/24/2012 8:12 PM, markspace wrote:
>> I personally am still looking for an SSCCE, as your last one didn't
>> reproduce the error for me.
>
> Did you try my 1 2 3 example?
No that errors on me too. Really all the OP did was cut and paste the
example code from his link as his SSCCE, it didn't even contain his data
file. There's no way I'm going to bother writing code for anyone who is
that lazy.
(Your code throws an exception because I have no directory named
"/work". Look up StringReader or ByteArrayInputStream. Use those
instead of relying on actual files and a file system.)
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 21:10 -0500 |
| Message-ID | <50b17e18$0$288$14726298@news.sunsite.dk> |
| In reply to | #19917 |
On 11/24/2012 9:02 PM, markspace wrote:
> On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
>> On 11/24/2012 8:12 PM, markspace wrote:
>>> I personally am still looking for an SSCCE, as your last one didn't
>>> reproduce the error for me.
>>
>> Did you try my 1 2 3 example?
>
> No that errors on me too. Really all the OP did was cut and paste the
> example code from his link as his SSCCE, it didn't even contain his data
> file. There's no way I'm going to bother writing code for anyone who is
> that lazy.
>
> (Your code throws an exception because I have no directory named
> "/work". Look up StringReader or ByteArrayInputStream. Use those
> instead of relying on actual files and a file system.)
The code actually writes the files.
Just change the path to something valid for write.
I could have read from memory, but when it comes to encoding I want
to use a real file. I have previously had bad experience with XML
parsed from a String always being considered UTF-16. Not in Java,
but still.
Arne
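A middle ground between the two positions might be to keep the test data in memory but as bytes rather than as a String, so the parser still has to work out the encoding from the byte stream. The sketch below does that with the DOM parser that ships with the JDK. The class name InMemoryDomProbe and the sample document are invented for illustration; DOM is used here only because the fixture is tiny, not as a suggestion for the large files discussed earlier.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

// Hypothetical test fixture; not part of the code posted in the thread.
final class InMemoryDomProbe {

    /** Parses XML held in a byte array, leaving encoding detection to the parser. */
    static Document parse(byte[] xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // The a-umlaut is stored as the single byte 0xE4, as the declaration promises.
        byte[] xml = "<?xml version='1.0' encoding='ISO-8859-1'?><root>\u00e4</root>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = parse(xml);
        System.out.println(doc.getXmlEncoding());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```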
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-24 18:25 -0800 |
| Message-ID | <k8rvin$9of$1@dont-email.me> |
| In reply to | #19918 |
On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
> On 11/24/2012 9:02 PM, markspace wrote:
>> (Your code throws an exception because I have no directory named
>> "/work". Look up StringReader or ByteArrayInputStream. Use those
>> instead of relying on actual files and a file system.)
>
> The code actually writes the files.
Yeah, I'd also rather not have artifacts on my hard drive too, thanks.