Newsgroups: comp.lang.java.programmer
Subject: Re: Detect XML document encodings with SAX
From: Lew
Date: Sat, 24 Nov 2012 02:14:12 -0800 (PST)

Arne Vajhøj wrote:
> Lew wrote:
>> Sebastian wrote:
[snip]
>>> output an encoding of UTF-8, while looking at the file
>> as they should.
>
> No.
>
> If the XML prolog specifies another encoding than UTF-8,
> then it should not return UTF-8.

True, but I'm saying they should specify UTF-8 in the prolog.

>> XML should be encoded in UTF-8 nearly always.

See?

> XML allows for other encodings.

So? You should use UTF-8 nearly always, i.e., unless there's a compelling
reason not to.

> And Java XML parsers support it.

For those rare times when you deviate from the usual UTF-8.

> So it should always work.

>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
>
> Output usually mean System.out.println - that works fine with a parser.

His phrasing wasn't clear to me. That's why I asked for clarification.
I could have guessed, too.

>> If your problem is with reading the file, then the encoding in the XML declaration

See? You're preaching to the choir.

>> should suffice to guide the parser. But then why do you talk about methods that
>> "output an encoding"?
>
> Because he wants to know what it is.
>
>> However, according to
>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>> and EUC-JP,
>> So it looks like you must not accept XML documents with such a
>> non-standard encoding.
>
> Those that has researched would know that the XML spec do not
> limit the encodings at all. The XML processor must support UTF-8
> and UTF-16, but are free to support others.

Perhaps the OP's parser doesn't exercise that freedom, judging by the
symptoms. 'sall I'm sayin'.

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.

-- 
Lew
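
One way to investigate both questions in the thread (what encoding the parser thinks it is reading, and whether it honours a Windows-1252 declaration at all) might look like the sketch below. It is a minimal sketch, not anyone's actual code from this discussion. It assumes the parser passes the handler a locator that implements org.xml.sax.ext.Locator2, which is implementation-dependent. The class name DeclaredEncodingProbe, its helper methods, and the sample document are hypothetical, made up for illustration. The sample is declared as windows-1252 and contains the single byte 0x80, which is the euro sign in that encoding, so the decoded text shows whether the declaration was followed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, for illustration only.
final class DeclaredEncodingProbe {

    /** What the probe saw: the reported encoding (null if unavailable) and the decoded text. */
    static final class Result {
        final String encoding;
        final String text;

        Result(String encoding, String text) {
            this.encoding = encoding;
            this.text = text;
        }
    }

    /** Builds a tiny document declared as windows-1252 whose only text is the byte 0x80. */
    static byte[] sample() {
        byte[] head = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><price>"
                .getBytes(StandardCharsets.US_ASCII);
        byte[] tail = "</price>".getBytes(StandardCharsets.US_ASCII);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write(head, 0, head.length);
        out.write(0x80); // euro sign in windows-1252, not a valid UTF-8 sequence on its own
        out.write(tail, 0, tail.length);
        return out.toByteArray();
    }

    /** Parses the bytes with the default SAX parser and records encoding and text. */
    static Result probe(byte[] xml) {
        final String[] encoding = new String[1];
        final StringBuilder text = new StringBuilder();

        DefaultHandler handler = new DefaultHandler() {
            private Locator locator;

            @Override
            public void setDocumentLocator(Locator l) {
                locator = l; // may or may not be a Locator2, depending on the parser
            }

            @Override
            public void startElement(String uri, String local, String qName, Attributes atts) {
                // Asked at the first element, i.e. after the XML declaration has been read.
                if (encoding[0] == null && locator instanceof Locator2) {
                    encoding[0] = ((Locator2) locator).getEncoding();
                }
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };

        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml), handler);
        } catch (Exception e) {
            // Includes the case where the parser does not support the declared encoding.
            throw new IllegalStateException("parser rejected the document", e);
        }
        return new Result(encoding[0], text.toString());
    }

    public static void main(String[] args) {
        Result r = probe(sample());
        System.out.println("reported encoding: " + r.encoding);
        System.out.println("decoded text: " + r.text);
    }
}
```

Asking in startElement rather than startDocument is a cautious choice: by the first element the declaration has certainly been processed. If the locator is not a Locator2, the sketch reports null instead of guessing, and a parser that does not support the declared encoding surfaces as the IllegalStateException.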