Received: by 10.66.79.40 with SMTP id g8mr1927428pax.45.1353752052714; Sat, 24 Nov 2012 02:14:12 -0800 (PST)
Received: by 10.50.163.66 with SMTP id yg2mr2632976igb.0.1353752052467; Sat, 24 Nov 2012 02:14:12 -0800 (PST)
Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!nntp.club.cc.cmu.edu!newsfeed.news.ucla.edu!usenet.stanford.edu!kr7no3105308pbb.0!news-out.google.com!s9ni6931pbb.0!nntp.google.com!kr7no3105295pbb.0!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
Newsgroups: comp.lang.java.programmer
Date: Sat, 24 Nov 2012 02:14:12 -0800 (PST)
In-Reply-To: <50b02ee6$0$283$14726298@news.sunsite.dk>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=173.164.137.214; posting-account=CP-lKQoAAAAGtB5diOuGlDQk0jIwmH0T
NNTP-Posting-Host: 173.164.137.214
References: <k8ioi7$2e2$1@news.albasani.net> <0b3b04bf-24dd-4d59-a16d-14c745b66c76@googlegroups.com> <50b02ee6$0$283$14726298@news.sunsite.dk>
User-Agent: G2/1.0
MIME-Version: 1.0
Message-ID: <d64baf3c-d582-4308-b6b4-714ef3049ef5@googlegroups.com>
Subject: Re: Detect XML document encodings with SAX
From: Lew <lewbloch@gmail.com>
Injection-Date: Sat, 24 Nov 2012 10:14:12 +0000
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Xref: csiph.com comp.lang.java.programmer:19888

Arne Vajhøj wrote:
> Lew wrote:
>> Sebastian wrote:
[snip]
>>> output an encoding of UTF-8, while looking at the file
>> as they should.
>
> No.
>
> If the XML prolog specifies another encoding than UTF-8,
> then it should not return UTF-8.

True, but I'm saying they should specify UTF-8 in the prolog.

>>                 XML should be encoded in UTF-8 nearly always.

See?

> XML allows for other encodings.

So? You should use UTF-8 nearly always, i.e., unless there's a compelling
reason not to.

> And Java XML parsers support it.

For those rare times when you deviate from the usual UTF-8.
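To make "the parser supports it" concrete, here is a minimal sketch that
hands a SAX parser raw bytes and lets the XML declaration tell it how to
decode them. The class and method names (DeclaredEncodingDemo, textOf,
sample) and the sample document are made up for illustration; it assumes
the JDK's bundled parser and a JVM that provides the windows-1252 charset.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the thread.
class DeclaredEncodingDemo {

    /** Parses raw XML bytes and returns all character data the parser reports. */
    static String textOf(byte[] xmlBytes) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // Pass a byte stream, not a Reader, so the parser itself has to
        // work out the encoding from the XML declaration.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), handler);
        return text.toString();
    }

    /** Builds a small sample document declared and encoded in the given charset. */
    static byte[] sample(String charsetName) {
        String xml = "<?xml version=\"1.0\" encoding=\"" + charsetName + "\"?>"
                + "<note>caf\u00e9 \u20ac5</note>";
        return xml.getBytes(Charset.forName(charsetName));
    }

    public static void main(String[] args) throws Exception {
        // Same text, different bytes on disk; the declaration guides decoding.
        String fromCp1252 = textOf(sample("windows-1252"));
        String fromUtf8 = textOf(sample("UTF-8"));
        System.out.println(fromCp1252.equals(fromUtf8));
    }
}
```

If the declaration and the actual bytes disagree, all bets are off, which
is one more argument for sticking with UTF-8 unless you can't.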

> So it should always work.

>> But SAX is a parser, so it doesn't output, it inputs. What are you telling us?
>
> Output usually mean System.out.println - that works fine with a parser.

His phrasing wasn't clear to me. That's why I asked for clarification.

I could have guessed, too.

>> If your problem is with reading the file, then the encoding in the XML declaration

See? You're preaching to the choir.

>> should suffice to guide the parser. But then why do you talk about methods that
>> "output an encoding"?
>
> Because he wants to know what it is.
>
>> However, according to
>> http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
>> supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2,
>> ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS,
>> and EUC-JP,
>> So it looks like you must not accept XML documents with such a
>> non-standard encoding.
>
> Those that has researched would know that the XML spec do not
> limit the encodings at all. The XML processor must support UTF-8
> and UTF-16, but are free to support others.

Perhaps the OP's parser doesn't exercise that freedom, judging by the
symptoms.

'sall I'm sayin'.
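One thing worth investigating is what the parser itself says it settled
on. A rough sketch, assuming the parser hands out a locator that
implements org.xml.sax.ext.Locator2 (the JDK's bundled one does, as far
as I know; others may not). Whether this is the method Sebastian was
calling is not stated in the thread, and the names EncodingProbe and
reportedEncoding are invented here. The value is read in startElement
rather than earlier, on the assumption that a parser may only have its
initial guess before it has read the declaration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not from the thread.
class ReportedEncodingDemo {

    /** Remembers the encoding the locator reports at the first element. */
    static class EncodingProbe extends DefaultHandler {
        private Locator locator;
        String encoding; // stays null if the parser offers no Locator2

        @Override
        public void setDocumentLocator(Locator locator) {
            this.locator = locator;
        }

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            // Ask once, at the first element, after the prolog has been read.
            if (encoding == null && locator instanceof Locator2) {
                encoding = ((Locator2) locator).getEncoding();
            }
        }
    }

    /** Returns whatever encoding name the parser reports, or null. */
    static String reportedEncoding(byte[] xmlBytes) throws Exception {
        EncodingProbe probe = new EncodingProbe();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new ByteArrayInputStream(xmlBytes)), probe);
        return probe.encoding;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><note>caf\u00e9</note>";
        byte[] bytes = xml.getBytes(Charset.forName("windows-1252"));
        // The exact spelling of the reported name is parser-dependent.
        System.out.println(reportedEncoding(bytes));
    }
}
```

If that prints something other than what the declaration says, that's a
data point about the parser, or about when the question was asked.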

Obviously I don't know the answer, but he's asking for suggestions
to investigate, AIUI. He's having encoding problems. His XML is apparently
encoded in Windows-1252, a notoriously funky encoding especially for
the variety of characters with which one might wish to deal. So why not
investigate obtaining material that isn't in such a notoriously funky
encoding, like, oh, say, the old reliable standard UTF-8?

Perhaps that isn't feasible, for reasons as yet unstated, but that's
the nature of brainstorming.
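If the source can't be persuaded to deliver UTF-8, one could re-encode
on receipt. A sketch using the JAXP identity transform, assuming the
incoming declaration is truthful about the bytes; ToUtf8Demo and toUtf8
are illustrative names only, and the sample document is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical demo class, not from the thread.
class ToUtf8Demo {

    /** Re-serializes an XML document as UTF-8, whatever encoding it declared. */
    static byte[] toUtf8(byte[] xmlBytes) throws Exception {
        // newTransformer() with no stylesheet gives an identity transform.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Byte stream in, byte stream out: the parser reads the old
        // declaration, the serializer writes a new one.
        identity.transform(new StreamSource(new ByteArrayInputStream(xmlBytes)),
                           new StreamResult(out));
        return out.toByteArray();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><note>caf\u00e9</note>";
        byte[] cp1252 = xml.getBytes(Charset.forName("windows-1252"));
        String converted = new String(toUtf8(cp1252), StandardCharsets.UTF_8);
        System.out.println(converted.contains("encoding=\"UTF-8\""));
    }
}
```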

-- 
Lew