Newsgroups: comp.lang.java.programmer
Date: Wed, 21 Nov 2012 11:31:36 -0800 (PST)
Message-ID: <0b3b04bf-24dd-4d59-a16d-14c745b66c76@googlegroups.com>
Subject: Re: Detect XML document encodings with SAX
From: Lew

Sebastian wrote:
> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods

Don't do attachments on Usenet.

> output an encoding of UTF-8, while looking at the file

as they should. XML should be encoded in UTF-8 nearly always. But SAX is a parser, so it doesn't output, it inputs. What are you telling us?

> makes it clear that it is not UTF-8 encoded (all characters,
> including the umlaut and the Euro-sign, take one byte, and the
> declared encoding also is not UTF-8).

http://sscce.org/

> Does anyone have an idea why that is so? And how I could

You used the default encoding in your Writer.

> go about making some XML parser determine the correct encoding?

Your problem is writing the file, no? That has nothing to do with parsing.
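
Since no code was posted, this is only a guess at what the writing side might look like once the default encoding is taken out of the picture. It is a minimal sketch: the class name XmlWriteSketch, the method writeXml, and the <note> element are made up for illustration, and escaping of markup characters is deliberately left out. The point is that the Charset handed to the OutputStreamWriter and the name in the XML declaration come from the same value, so they cannot disagree.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the original poster's code.
final class XmlWriteSketch {

    // Writes a tiny document whose declared encoding is the charset actually
    // used to turn characters into bytes. The text is assumed to contain no
    // '<' or '&' (no escaping is done here).
    static byte[] writeXml(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Passing the charset explicitly avoids the platform default encoding.
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + charset.name() + "\"?>\n");
            w.write("<note>" + text + "</note>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe"; // contains an umlaut and a sharp s
        byte[] latin1 = writeXml(text, StandardCharsets.ISO_8859_1);
        byte[] utf8 = writeXml(text, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes:      " + utf8.length);
    }
}
```

Swapping the ByteArrayOutputStream for a FileOutputStream gives the file-writing variant; the explicit Charset argument is the part that matters.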
If your problem is with reading the file, then the encoding in the XML declaration should suffice to guide the parser. But then why do you talk about methods that "output an encoding"?

However, according to
http://xmlwriter.net/xml_guide/xml_declaration.shtml#Encoding
supported encodings only include UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, and EUC-JP, as you would have learned had you researched your question. So it looks like you must not accept XML documents with such a non-standard encoding.

Show us the code, or at least an SSCCE of it.

-- 
Lew
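
For the reading side, here is one way to see which encoding a document declares. Note that it uses StAX (javax.xml.stream) rather than the SAX or XNI approaches from the linked article, simply because XMLStreamReader.getCharacterEncodingScheme() is documented to return the encoding named in the XML declaration, or null if none is declared. The class name XmlEncodingProbe and the sample document are invented for illustration; treat it as a sketch, not as what the original code did.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
final class XmlEncodingProbe {

    // Returns the encoding named in the XML declaration of the given bytes,
    // or null if the declaration has no encoding pseudo-attribute.
    static String declaredEncoding(byte[] xmlBytes) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        try {
            // Hand the parser raw bytes, not a Reader, so that it does the
            // encoding detection itself.
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new ByteArrayInputStream(xmlBytes));
            try {
                return reader.getCharacterEncodingScheme();
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("input could not be read as XML", e);
        }
    }

    public static void main(String[] args) {
        byte[] doc = ("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<note>Gr\u00fc\u00dfe</note>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("declared encoding: " + declaredEncoding(doc));
    }
}
```

The input is passed as an InputStream on purpose: wrapping it in a Reader first would force a charset choice before the parser ever sees the declaration.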