From: Sebastian
Newsgroups: comp.lang.java.programmer
Subject: Detect XML document encodings with SAX
Date: Wed, 21 Nov 2012 15:32:19 +0100
Organization: albasani.net
Mime-Version: 1.0
Content-Type: multipart/mixed; boundary="------------010502090900010202040501"

This is a multi-part message in MIME format.
--------------010502090900010202040501
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: 7bit

Hello there,

I discovered this post:
http://www.ibm.com/developerworks/library/x-tipsaxxni/
and implemented both approaches (SAX and Xerces XNI).

Unfortunately, for the attached XML file, both methods report an
encoding of UTF-8, although a look at the file makes it clear that it
is not UTF-8 encoded: every character, including the umlaut and the
euro sign, takes one byte, and the declared encoding is not UTF-8
either.

Does anyone have an idea why that is so? And how could I go about
making some XML parser determine the correct encoding?
-- 
Sebastian

--------------010502090900010202040501
Content-Type: text/xml; name="test.xml"
Content-Transfer-Encoding: base64
Content-Disposition: attachment; filename="test.xml"

PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0id2luZG93cy0xMjUwIj8+DQo8Zm9vPg0K
ICAgIDxiYXo+TPZ3ZSCAPC9iYXo+DQo8L2Zvbz4=
--------------010502090900010202040501--
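
For reference, the observations in the post can be cross-checked with a
small Java sketch. It decodes the attached test.xml, confirms that the
bytes are not valid UTF-8, reads the declared encoding straight from
the XML declaration, and then asks a parser for the same value. Note
the assumptions: the class and method names (XmlEncodingProbe,
attachmentBytes, isValidUtf8, declaredEncoding, staxDeclaredEncoding)
are made up for illustration, and this is not the SAX or XNI code from
the linked article. It swaps in StAX (javax.xml.stream) instead, and
what getCharacterEncodingScheme() returns may differ between StAX
implementations.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, not part of the original post or the linked article.
class XmlEncodingProbe {

    // The base64 attachment from the post (test.xml), copied verbatim.
    static final String ATTACHMENT_BASE64 =
        "PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0id2luZG93cy0xMjUwIj8+DQo8Zm9vPg0K"
        + "ICAgIDxiYXo+TPZ3ZSCAPC9iYXo+DQo8L2Zvbz4=";

    // Pulls the encoding pseudo-attribute out of an XML declaration.
    private static final Pattern DECLARED_ENCODING = Pattern.compile(
        "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    public static byte[] attachmentBytes() {
        return Base64.getDecoder().decode(ATTACHMENT_BASE64);
    }

    // True only if the whole byte sequence decodes as UTF-8 without errors.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            // A fresh decoder reports malformed input instead of replacing it.
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Manual sniffing: assumes an ASCII-compatible encoding, so the
    // declaration can be read byte-for-byte via ISO-8859-1.
    public static String declaredEncoding(byte[] xml) {
        int len = Math.min(xml.length, 200);
        String head = new String(xml, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARED_ENCODING.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // Parser-based variant: hands StAX the raw bytes (not a Reader) and
    // returns the encoding named in the XML declaration, if any.
    public static String staxDeclaredEncoding(byte[] xml) throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance()
            .createXMLStreamReader(new ByteArrayInputStream(xml));
        try {
            return reader.getCharacterEncodingScheme();
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] xml = attachmentBytes();
        String declared = declaredEncoding(xml);
        System.out.println("valid UTF-8:        " + isValidUtf8(xml));
        System.out.println("declared (sniffed): " + declared);
        System.out.println("declared (StAX):    " + staxDeclaredEncoding(xml));
        // Decode with the declared charset to see the actual text content.
        System.out.println(new String(xml, Charset.forName(declared)));
    }
}
```

Comparing the sniffed value with what a parser reports may help narrow
down where a wrong answer comes from, for example whether the parser
was handed a Reader rather than the raw byte stream, or whether the
encoding was queried before the XML declaration had been processed.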