Groups > comp.lang.java.programmer > #19834 > unrolled thread
| Started by | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| First post | 2012-11-21 15:32 +0100 |
| Last post | 2012-12-16 17:43 +0200 |
| Articles | 20 on this page of 43 — 9 participants |
Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-21 15:32 +0100
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 11:31 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 00:39 +0100
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 16:37 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 07:41 +0100
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-21 23:18 -0800
Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-22 07:53 +0000
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-22 08:31 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:21 -0500
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:11 -0500
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:20 -0500
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-24 02:14 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-24 22:18 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:07 -0500
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:50 +0100
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 17:12 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:17 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:02 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:10 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:25 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:37 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 21:01 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:30 -0500
Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 18:03 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 21:09 -0500
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 18:58 -0800
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 22:17 -0500
Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 22:51 -0800
Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 21:52 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:45 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:23 -0500
Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-25 13:24 -0800
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:58 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:13 -0500
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:19 -0500
Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 03:24 -0800
Re: Detect XML document encodings with SAX "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:13 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:22 -0500
Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-25 11:00 +0000
Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 12:32 +0100
Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:41 -0500
Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-12-12 20:32 -0800
Re: Detect XML document encodings with SAX Stanimir Stamenkov <s7an10@netscape.net> - 2012-12-16 17:43 +0200
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 21:37 -0500 |
| Message-ID | <50b18453$0$290$14726298@news.sunsite.dk> |
| In reply to | #19919 |
On 11/24/2012 9:25 PM, markspace wrote:
> On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
>> On 11/24/2012 9:02 PM, markspace wrote:
>>> (Your code throws an exception because I have no directory named
>>> "/work". Look up StringReader or ByteArrayInputStream. Use those
>>> instead of relying on actual files and a file system.)
>>
>> The code actually writes the files.
>
> Yeah, I'd also rather not have artifacts on my hard drive too, thanks.

Then why ask for an SSCCE?

Or do Java files not count as artifacts?

Arne
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-24 21:01 -0800 |
| Message-ID | <k8s8nk$fku$1@dont-email.me> |
| In reply to | #19920 |
On 11/24/2012 6:37 PM, Arne Vajhøj wrote:
> On 11/24/2012 9:25 PM, markspace wrote:
>> On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
>>> On 11/24/2012 9:02 PM, markspace wrote:
>>>> (Your code throws an exception because I have no directory named
>>>> "/work". Look up StringReader or ByteArrayInputStream. Use those
>>>> instead of relying on actual files and a file system.)
>>>
>>> The code actually writes the files.
>>
>> Yeah, I'd also rather not have artifacts on my hard drive too, thanks.
>
> Then why ask for an SSCCE?

Because it can be done without using files.

It isn't self-contained if it depends on my file-system. I have no
/work directory, so the example program fails, but not in the way
intended. Ergo, it's not an SSCCE.
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 16:30 -0500 |
| Message-ID | <50b28ded$0$289$14726298@news.sunsite.dk> |
| In reply to | #19921 |
On 11/25/2012 12:01 AM, markspace wrote:
> On 11/24/2012 6:37 PM, Arne Vajhøj wrote:
>> On 11/24/2012 9:25 PM, markspace wrote:
>>> On 11/24/2012 6:10 PM, Arne Vajhøj wrote:
>>>> On 11/24/2012 9:02 PM, markspace wrote:
>>>>> (Your code throws an exception because I have no directory named
>>>>> "/work". Look up StringReader or ByteArrayInputStream. Use those
>>>>> instead of relying on actual files and a file system.)
>>>>
>>>> The code actually writes the files.
>>>
>>> Yeah, I'd also rather not have artifacts on my hard drive too, thanks.
>>
>> Then why ask for an SSCCE?
>>
>> Or do Java files not count as artifacts?
>
> Because it can be done without using files.

That really does not answer the question of why you want an SSCCE
if you don't want artifacts on your hard drive.

And you obviously cannot read from a file without reading from
a file.

You may be assuming that XML parsers do not use the underlying
input classes to get the encoding. As I have already explained
to you once, that is not always the case.

> It isn't self-contained if it depends on my file-system. I have no
> /work directory, so the example program fails, but not in the way
> intended. Ergo, it's not an SSCCE.

Actually SSCCE allows for input files.

If you don't want input files, then ask for a MSSSCCE and link
to the rules for that.

Arne
| From | Gene Wirchenko <genew@telus.net> |
|---|---|
| Date | 2012-12-12 18:03 -0800 |
| Message-ID | <qndic811jjnaoqep1jc95h7c27ivrhinoq@4ax.com> |
| In reply to | #19960 |
On Sun, 25 Nov 2012 16:30:20 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:
[snip]
>If you don't want input files, then ask for a MSSSCCE and link
                                               ^^^^^^^
>to the rules for that.
Please expand your new acronym.
Sincerely,
Gene Wirchenko
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-12-12 21:09 -0500 |
| Message-ID | <50c938de$0$289$14726298@news.sunsite.dk> |
| In reply to | #20281 |
On 12/12/2012 9:03 PM, Gene Wirchenko wrote:
> On Sun, 25 Nov 2012 16:30:20 -0500, Arne Vajhøj <arne@vajhoej.dk>
> wrote:
>
> [snip]
>
>> If you don't want input files, then ask for a MSSSCCE and link
>                                                ^^^^^^^
>> to the rules for that.
>
> Please expand your new acronym.

MarkSpace SSCCE

:-)

Arne
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-12-12 18:58 -0800 |
| Message-ID | <0037028d-8fb7-41ed-ac1f-42e00e50d67e@googlegroups.com> |
| In reply to | #20282 |
Arne Vajhøj wrote:
> MarkSpace SSCCE

Apparently the OP gave up on getting help and was unwilling to provide the
materials requested.

--
Lew
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-12-12 22:17 -0500 |
| Message-ID | <50c948dd$0$288$14726298@news.sunsite.dk> |
| In reply to | #20283 |
On 12/12/2012 9:58 PM, Lew wrote:
> Apparently the OP gave up on getting help and was unwilling to provide the
> materials requested.

????

Steven Simpson solved the problem with the provided information.

And OP acknowledged it.

Arne
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2012-12-12 22:51 -0800 |
| Message-ID | <eb266140-3ebd-4f41-99df-921e55c456aa@googlegroups.com> |
| In reply to | #20284 |
Arne Vajhøj wrote:
> Lew wrote:
>> Apparently the OP gave up on getting help and was unwilling to provide the
>> materials requested.
>
> ????
>
> Steven Simpson solved the problem with the provided information.
>
> And OP acknowledged it.

I stand corrected.

--
Lew
| From | Gene Wirchenko <genew@telus.net> |
|---|---|
| Date | 2012-12-12 21:52 -0800 |
| Message-ID | <r8ric81k3himnab3kqiv9p014f7tfc3tbq@4ax.com> |
| In reply to | #20282 |
On Wed, 12 Dec 2012 21:09:32 -0500, Arne Vajhøj <arne@vajhoej.dk>
wrote:
>On 12/12/2012 9:03 PM, Gene Wirchenko wrote:
>> On Sun, 25 Nov 2012 16:30:20 -0500, Arne Vajhøj <arne@vajhoej.dk>
>> wrote:
>>
>> [snip]
>>
>>> If you don't want input files, then ask for a MSSSCCE and link
>>                                                ^^^^^^^
>>> to the rules for that.
>>
>> Please expand your new acronym.
>
>MarkSpace SSCCE
>
>:-)
Thank you.
Sincerely,
Gene Wirchenko
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-25 10:45 +0100 |
| Message-ID | <k8sp7h$h5h$1@news.albasani.net> |
| In reply to | #19917 |
On 25.11.2012 03:02, markspace wrote:
> On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
>> On 11/24/2012 8:12 PM, markspace wrote:
>>> I personally am still looking for an SSCCE, as your last one didn't
>>> reproduce the error for me.
>>
>> Did you try my 1 2 3 example?
>
>
> No that errors on me too. Really all the OP did was cut and paste the
> example code from his link as his SSCCE, it didn't even contain his data
> file. There's no way I'm going to bother writing code for anyone who is
> that lazy.

I did include an XML file with my original post, and said to use
Elliotte Rusty's code to see what it (wrongly) says about the encoding.
What else would you require? Both Arne Vajhøj and Steven Simpson could
reproduce the phenomenon that way.

-- Sebastian
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-25 16:23 -0500 |
| Message-ID | <50b28c66$0$289$14726298@news.sunsite.dk> |
| In reply to | #19925 |
On 11/25/2012 4:45 AM, Sebastian wrote:
> On 25.11.2012 03:02, markspace wrote:
>> On 11/24/2012 5:17 PM, Arne Vajhøj wrote:
>>> On 11/24/2012 8:12 PM, markspace wrote:
>>>> I personally am still looking for an SSCCE, as your last one didn't
>>>> reproduce the error for me.
>>>
>>> Did you try my 1 2 3 example?
>>
>>
>> No that errors on me too. Really all the OP did was cut and paste the
>> example code from his link as his SSCCE, it didn't even contain his data
>> file. There's no way I'm going to bother writing code for anyone who is
>> that lazy.
>
> I did include an XML file with my original post, and said to use
> Elliotte Rusty's code to see what it (wrongly) says about the encoding.
> What else would you require? Both Arne Vajhøj and Steven Simpson could
> reproduce the phenomenon that way.

Just as some younger people find it uber-cool to have their
pants hanging down just over their knees, some not quite as
young people find it uber-cool to scream for SSCCE and clarification.

:-)

Arne
| From | markspace <-@.> |
|---|---|
| Date | 2012-11-25 13:24 -0800 |
| Message-ID | <k8u29o$ol8$1@dont-email.me> |
| In reply to | #19925 |
On 11/25/2012 1:45 AM, Sebastian wrote:

> I did include an XML file with my original post, and said to use
> Elliotte Rusty's code to see what it (wrongly) says about the encoding.
> What else would you require?

The code you copied prints "unknown" for me. I mentioned that awhile
back. I assume the encoding in the xml file you attached did not
survive the interwebs.

I require: a 100% self-contained program. Use a StringReader and a Java
string constant. Those can encode character values greater than 127 as
ASCII just fine, so I assume they'll survive Usenet just fine.

Self-contained means self-contained, not "except for the bits I left off."
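
For readers following the file-versus-memory argument, here is a minimal sketch of a test case that touches no file system. The class name `InMemoryXml` and the helper `textOf` are invented for this illustration and do not come from the thread. It also shows why the two suggestions in the post are not equivalent: a ByteArrayInputStream hands the parser raw bytes, so the parser must work out the encoding itself, while a StringReader hands it already-decoded characters, and per the SAX `InputSource` documentation the encoding declaration is then disregarded.

```java
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;
import javax.xml.parsers.SAXParserFactory;
import java.io.ByteArrayInputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;

public class InMemoryXml {
    /** Parses the source with the JDK's default SAX parser and returns all character data. */
    static String textOf(InputSource source) throws Exception {
        StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // The non-ASCII character is written as a Java escape so the source stays pure ASCII.
        String doc = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root>L\u00f6we</root>";

        // Byte stream: the parser decodes the bytes using the declared encoding.
        byte[] bytes = doc.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(textOf(new InputSource(new ByteArrayInputStream(bytes))));

        // Character stream: no byte decoding happens, so encoding detection is not exercised.
        System.out.println(textOf(new InputSource(new StringReader(doc))));
    }
}
```

Both calls should yield the same text; only the byte-stream variant is a meaningful reproduction of an encoding-detection problem.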
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-25 10:58 +0100 |
| Message-ID | <k8sq0l$iss$1@news.albasani.net> |
| In reply to | #19917 |
On 25.11.2012 03:02, markspace wrote:
[snip]
> file. There's no way I'm going to bother writing code for anyone who is
> that lazy.

I wasn't really asking for code to be written. I was interested to know
whether others see the same behavior and consider it quirky, too, and
whether they had suggestions for a parser that would behave more as I
expected. Of course, I am still grateful to Arne for having posted his
code.

-- Sebastian
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 17:13 -0500 |
| Message-ID | <50b14677$0$282$14726298@news.sunsite.dk> |
| In reply to | #19888 |
On 11/24/2012 5:14 AM, Lew wrote:
> Arne Vajhøj wrote:
>> Lew wrote:
>>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
>>
>> Output usually means System.out.println - that works fine with a parser.
>
> His phrasing wasn't clear to me. That's why I asked for clarification.

Then maybe we need "How to ask for clarifications the smart way".

>>> However, according to
>>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
>>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>>> and EUC-JP,
>>> So it looks like you must not accept XML documents with such a
>>> non-standard encoding.
>>
>> Those who have researched it would know that the XML spec does not
>> limit the encodings at all. An XML processor must support UTF-8
>> and UTF-16, but is free to support others.
>
> Perhaps the OP's parser doesn't exercise that freedom, judging by the
> symptoms.

There is nothing in OP's symptoms that indicates lack of support
for encodings.

OP's symptom is that it parses fine with encoding XYZ, but when
asked by the caller it wrongly claims to be using UTF-8.

Arne
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-24 17:19 -0500 |
| Message-ID | <50b147f7$0$294$14726298@news.sunsite.dk> |
| In reply to | #19888 |
On 11/24/2012 5:14 AM, Lew wrote:
> Obviously I don't know the answer, but he's asking for suggestions
> to investigate, AIUI. He's having encoding problems. His XML is apparently
> encoded in Windows-1252, a notoriously funky encoding especially for
> the variety of characters with which one might wish to deal.

CP-1252 is just another encoding. It is not more or less funky
than any other encoding.

In fact it is identical to ISO-8859-1 for all characters
except 128-159, which are control characters/unmapped in ISO-8859-1
but have various extra values in CP-1252.

> So why not
> investigate obtaining material that isn't in such a notoriously funky
> encoding, like, oh, say, the old reliable standard UTF-8?

If one can choose the data files and the software, then life
is easy.

Arne
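
The claim about the 128-159 range is easy to check. The following is a minimal sketch, assuming the running JDK provides the `windows-1252` charset (the output later in this thread suggests it normally does); the class name `Cp1252VsLatin1` and the helper `differingBytes` are made up for this example. It decodes every single-byte value in both charsets and collects the values where the resulting characters differ.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class Cp1252VsLatin1 {
    /** Returns every byte value (0-255) that the two charsets decode to different characters. */
    static List<Integer> differingBytes(Charset a, Charset b) {
        List<Integer> diffs = new ArrayList<>();
        for (int value = 0; value < 256; value++) {
            byte[] single = { (byte) value };
            String viaA = new String(single, a);
            String viaB = new String(single, b);
            if (!viaA.equals(viaB)) {
                diffs.add(value);
            }
        }
        return diffs;
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        List<Integer> diffs = differingBytes(cp1252, StandardCharsets.ISO_8859_1);
        System.out.println("Differing byte values: " + diffs);

        // Byte 0x80 is the euro sign in windows-1252, matching the hex dump in Steven Simpson's post.
        char euro = new String(new byte[] { (byte) 0x80 }, cp1252).charAt(0);
        System.out.printf("0x80 decodes to U+%04X in windows-1252%n", (int) euro);
    }
}
```

Every reported value should fall between 128 and 159; outside that range the two charsets agree byte for byte.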
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2012-11-22 03:24 -0800 |
| Message-ID | <lj2sa8pruu1mtn20amkj2olt97h87n2d7k@4ax.com> |
| In reply to | #19834 |
On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
<sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
someone who said :

>Does anyone have an idea why that is so? And how I could
>go about making some XML parser determine the correct encoding?

See http://mindprod.com/products2.html#ENCODINGRECOGNISER

This is a manual assist tool to help you guess the encoding.

Encodings are not embedded in any way in files. You just have to know.
ARGHHH!

See http://mindprod.com/jgloss/encoding.html
for how to use native2ascii to interconvert encodings.

The XML world likes UTF-8. Using anything else is just asking for
trouble.
--
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish
as couch potatoes who hire others to go to the gym for them.
| From | "Peter J. Holzer" <hjp-usenet2@hjp.at> |
|---|---|
| Date | 2012-11-24 00:13 +0100 |
| Message-ID | <slrnkb00o8.jbt.hjp-usenet2@hrunkner.hjp.at> |
| In reply to | #19847 |
On 2012-11-22 11:24, Roedy Green <see_website@mindprod.com.invalid> wrote:
> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
><sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
> someone who said :
>>Does anyone have an idea why that is so? And how I could
>>go about making some XML parser determine the correct encoding?
>
> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
>
> This is a manual assist tool to help you guess the encoding.

No need to guess.

> Encodings are not embedded in any way in files. You just have to know.

Not true for XML. The file Sebastian posted starts with

<?xml version="1.0" encoding="windows-1250"?>

        hp

--
   _  | Peter J. Holzer    | The curse of electronic word processing:
|_|_) | Sysadmin WSR       | you keep polishing at your text around until
| |   | hjp@hjp.at         | the parts of the sentence no longer
__/   | http://www.hjp.at/ | fits together. -- Ralph Babel
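
The declared encoding can also be read programmatically rather than by eye. The following is a minimal sketch, assuming the StAX implementation bundled with the JDK; the class name `DeclaredEncoding` and its helper method are hypothetical names chosen for this example. It relies on `XMLStreamReader.getCharacterEncodingScheme()`, which is specified to return the encoding named in the XML declaration, or null if none was declared.

```java
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;

public class DeclaredEncoding {
    /** Returns the encoding named in the XML declaration, or null if the document has none. */
    static String declaredEncoding(byte[] xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
                .createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            // The reader starts out positioned at START_DOCUMENT.
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        // Same declaration as the file quoted above; the body is invented for the example.
        String doc = "<?xml version=\"1.0\" encoding=\"windows-1250\"?>\n<root>L\u00f6we</root>\n";
        byte[] bytes = doc.getBytes(Charset.forName("windows-1250"));
        System.out.println("Declared encoding: " + declaredEncoding(bytes));
    }
}
```

This reads only what the document claims about itself; it does not verify that the bytes are actually valid in that encoding.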
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2012-11-23 21:22 -0500 |
| Message-ID | <50b02f7e$0$283$14726298@news.sunsite.dk> |
| In reply to | #19872 |
On 11/23/2012 6:13 PM, Peter J. Holzer wrote:
> On 2012-11-22 11:24, Roedy Green <see_website@mindprod.com.invalid> wrote:
>> On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
>> <sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
>> someone who said :
>>> Does anyone have an idea why that is so? And how I could
>>> go about making some XML parser determine the correct encoding?
>>
>> See http://mindprod.com/products2.html#ENCODINGRECOGNISER
>>
>> This is a manual assist tool to help you guess the encoding.
>
> No need to guess.
>
>> Encodings are not embedded in any way in files. You just have to know.
>
> Not true for XML. The file Sebastian posted starts with
>
> <?xml version="1.0" encoding="windows-1250"?>

New around here?

Don't expect Roedy's posts to relate that much to what he is replying to.

Arne
| From | Steven Simpson <ss@domain.invalid> |
|---|---|
| Date | 2012-11-25 11:00 +0000 |
| Message-ID | <maa9o9-2vr.ln1@s.simpson148.btinternet.com> |
| In reply to | #19834 |
On 21/11/12 14:32, Sebastian wrote:
> Does anyone have an idea why that is so? And how I could
> go about making some XML parser determine the correct encoding?
Sussed it! (Come to think of it, I feel I've sussed this before...)
The charset returned by the locator changes during parsing. At
startDocument(), it is the assumed charset, possibly based on the first
four-or-so bytes. At endDocument(), it is reset to null. On the first
call to startElement, it has the correct value. There might be an
earlier event where it is correct - I didn't investigate.
SSCCE...
import org.xml.sax.*;
import org.xml.sax.ext.*;
import org.xml.sax.helpers.*;
import java.io.*;
import java.nio.charset.*;

public class SAXEncodingDetector extends DefaultHandler {
    static void escape(PrintWriter out, CharsetEncoder enc, CharSequence text) {
        final int len = text.length();
        for (int i = 0; i < len; i++) {
            char c = text.charAt(i);
            if (enc.canEncode(c))
                out.print(c);
            else
                out.printf("&#x%x;", (int) c);
        }
    }

    static final String MESSAGE = "L\u00f6we \u20ac";

    static byte[] createXMLBytes(String charsetName)
        throws UnsupportedEncodingException {
        Charset charset = Charset.forName(charsetName);
        CharsetEncoder encoder = charset.newEncoder();
        ByteArrayOutputStream bytesOut = new ByteArrayOutputStream();
        PrintWriter out =
            new PrintWriter(new OutputStreamWriter(bytesOut, charset));
        out.printf("<?xml version=\"1.0\" encoding=\"%s\" ?>%n", charsetName);
        out.print("<root>");
        escape(out, encoder, MESSAGE);
        out.println("</root>");
        out.close();
        return bytesOut.toByteArray();
    }

    public static void main(String[] args) throws SAXException, IOException {
        for (int i = 0; i < args.length; i++) {
            String inCharset = args[i];
            byte[] bytes = createXMLBytes(inCharset);
            System.out.printf("%nCharset %s: (%d bytes)%n",
                              inCharset, bytes.length);
            printBytes(bytes, System.out);
            ByteArrayInputStream in = new ByteArrayInputStream(bytes);
            XMLReader parser = XMLReaderFactory.createXMLReader();
            SAXEncodingDetector handler = new SAXEncodingDetector();
            parser.setContentHandler(handler);
            parser.parse(new InputSource(in));
            System.out.printf("Charset at document start: %s%n",
                              handler.encodingAtDocumentStart);
            System.out.printf(" Charset at element start: %s%n",
                              handler.encodingAtElementStart);
            System.out.printf(" Charset at element end: %s%n",
                              handler.encodingAtElementEnd);
            System.out.printf(" Charset at document end: %s%n",
                              handler.encodingAtDocumentEnd);
            String content = handler.content.toString();
            System.out.println("Content: " + content);
            if (!content.equals(MESSAGE))
                System.out.println("Warning: message corrupted");
        }
    }

    private String encodingAtDocumentStart;
    private String encodingAtElementStart;
    private String encodingAtElementEnd;
    private String encodingAtDocumentEnd;
    private Locator2 locator;
    private StringWriter content = new StringWriter();
    private boolean inElement;

    @Override
    public void setDocumentLocator(Locator locator) {
        if (locator instanceof Locator2) {
            this.locator = (Locator2) locator;
        }
    }

    @Override
    public void startDocument() throws SAXException {
        if (locator != null) {
            this.encodingAtDocumentStart = locator.getEncoding();
        }
    }

    @Override
    public void endDocument() throws SAXException {
        if (locator != null) {
            this.encodingAtDocumentEnd = locator.getEncoding();
        }
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        if (localName.equals("root")) {
            if (locator != null)
                this.encodingAtElementStart = locator.getEncoding();
            inElement = true;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (localName.equals("root")) {
            if (locator != null)
                this.encodingAtElementEnd = locator.getEncoding();
            inElement = false;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inElement)
            content.write(ch, start, length);
    }

    static void printBytes(byte[] bytes, PrintStream out) {
        for (int major = 0; major < bytes.length; major += 16) {
            final int lim = Math.min(major + 16, bytes.length) - major;
            for (int minor = 0; minor < 16; minor++) {
                if (minor < lim) {
                    final int pos = major + minor;
                    out.printf("%02X ", bytes[pos]);
                } else {
                    out.print(".. ");
                }
            }
            for (int minor = 0; minor < 16; minor++) {
                if (minor < lim) {
                    final int pos = major + minor;
                    final int c = bytes[pos] & 0xff;
                    if (c == 10) {
                        out.print("\\n");
                    } else if (c == 13) {
                        out.print("\\r");
                    } else if (c == 9) {
                        out.print("\\t");
                    } else if (c < 32) {
                        out.printf("^%c", (char) (c + 64));
                    } else if (c >= 127 && c <= 160) {
                        out.printf("%02X", c);
                    } else {
                        out.printf("%c ", (char) c);
                    }
                } else {
                    out.print("..");
                }
            }
            out.println();
        }
    }
}
Command:
java SAXEncodingDetector US-ASCII ISO-8859-1 UTF-8 windows-1252
Output:
Charset US-ASCII: (75 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 53 . 0 " e n c o d i n g = " U S
2D 41 53 43 49 49 22 20 3F 3E 0A 3C 72 6F 6F 74 - A S C I I " ? > \n< r o o t
3E 4C 26 23 78 66 36 3B 77 65 20 26 23 78 32 30 > L & # x f 6 ; w e & # x 2 0
61 63 3B 3C 2F 72 6F 6F 74 3E 0A .. .. .. .. .. a c ; < / r o o t > \n..........
Charset at document start: UTF-8
Charset at element start: US-ASCII
Charset at element end: US-ASCII
Charset at document end: null
Content: Löwe €
Charset ISO-8859-1: (72 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 49 53 . 0 " e n c o d i n g = " I S
4F 2D 38 38 35 39 2D 31 22 20 3F 3E 0A 3C 72 6F O - 8 8 5 9 - 1 " ? > \n< r o
6F 74 3E 4C F6 77 65 20 26 23 78 32 30 61 63 3B o t > L ö w e & # x 2 0 a c ;
3C 2F 72 6F 6F 74 3E 0A .. .. .. .. .. .. .. .. < / r o o t > \n................
Charset at document start: UTF-8
Charset at element start: ISO-8859-1
Charset at element end: ISO-8859-1
Charset at document end: null
Content: Löwe €
Charset UTF-8: (63 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 55 54 . 0 " e n c o d i n g = " U T
46 2D 38 22 20 3F 3E 0A 3C 72 6F 6F 74 3E 4C C3 F - 8 " ? > \n< r o o t > L Ã
B6 77 65 20 E2 82 AC 3C 2F 72 6F 6F 74 3E 0A .. ¶ w e â 82¬ < / r o o t > \n..
Charset at document start: UTF-8
Charset at element start: UTF-8
Charset at element end: UTF-8
Charset at document end: null
Content: Löwe €
Charset windows-1252: (67 bytes)
3C 3F 78 6D 6C 20 76 65 72 73 69 6F 6E 3D 22 31 < ? x m l v e r s i o n = " 1
2E 30 22 20 65 6E 63 6F 64 69 6E 67 3D 22 77 69 . 0 " e n c o d i n g = " w i
6E 64 6F 77 73 2D 31 32 35 32 22 20 3F 3E 0A 3C n d o w s - 1 2 5 2 " ? > \n<
72 6F 6F 74 3E 4C F6 77 65 20 80 3C 2F 72 6F 6F r o o t > L ö w e 80< / r o o
74 3E 0A .. .. .. .. .. .. .. .. .. .. .. .. .. t > \n..........................
Charset at document start: UTF-8
Charset at element start: windows-1252
Charset at element end: windows-1252
Charset at document end: null
Content: Löwe €
--
ss at comp dot lancs dot ac dot uk
| From | Sebastian <sebastian@undisclosed.invalid> |
|---|---|
| Date | 2012-11-25 12:32 +0100 |
| Message-ID | <k8svhc$nk$1@news.albasani.net> |
| In reply to | #19928 |
On 25.11.2012 12:00, Steven Simpson wrote:
> On 21/11/12 14:32, Sebastian wrote:
>> Does anyone have an idea why that is so? And how I could
>> go about making some XML parser determine the correct encoding?
>
> Sussed it! (Come to think of it, I feel I've sussed this before...)
>
> The charset returned by the locator changes during parsing. At
> startDocument(), it is the assumed charset, possibly based on the first
> four-or-so bytes. At endDocument(), it is reset to null. On the first
> call to startElement, it has the correct value. There might be an
> earlier event where it is correct - I didn't investigate.

Oh, that is it! Thanks for the explanation...

> SSCCE...
[snip]

...and the code. (And now I know what a real SSCCE is, too.)

-- Sebastian