Groups > comp.lang.java.programmer > #19834 > unrolled thread

Detect XML document encodings with SAX

Started by: Sebastian <sebastian@undisclosed.invalid>
First post: 2012-11-21 15:32 +0100
Last post: 2012-12-16 17:43 +0200
Articles: 3 on this page of 43; 9 participants



Contents

  Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-21 15:32 +0100
    Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 11:31 -0800
      Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 00:39 +0100
        Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-21 16:37 -0800
          Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-22 07:41 +0100
            Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-21 23:18 -0800
              Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-22 07:53 +0000
                Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-22 08:31 -0800
              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:21 -0500
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:11 -0500
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:20 -0500
        Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-11-24 02:14 -0800
          Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-24 22:18 +0100
            Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:07 -0500
              Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:50 +0100
            Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 17:12 -0800
              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 20:17 -0500
                Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:02 -0800
                  Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:10 -0500
                    Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 18:25 -0800
                      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 21:37 -0500
                        Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-24 21:01 -0800
                          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:30 -0500
                            Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 18:03 -0800
                              Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 21:09 -0500
                                Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 18:58 -0800
                                  Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-12-12 22:17 -0500
                                    Re: Detect XML document encodings with SAX Lew <lewbloch@gmail.com> - 2012-12-12 22:51 -0800
                                Re: Detect XML document encodings with SAX Gene Wirchenko <genew@telus.net> - 2012-12-12 21:52 -0800
                  Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:45 +0100
                    Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 16:23 -0500
                    Re: Detect XML document encodings with SAX markspace <-@.> - 2012-11-25 13:24 -0800
                  Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 10:58 +0100
          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:13 -0500
          Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-24 17:19 -0500
    Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-11-22 03:24 -0800
      Re: Detect XML document encodings with SAX "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2012-11-24 00:13 +0100
        Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-23 21:22 -0500
    Re: Detect XML document encodings with SAX Steven Simpson <ss@domain.invalid> - 2012-11-25 11:00 +0000
      Re: Detect XML document encodings with SAX Sebastian <sebastian@undisclosed.invalid> - 2012-11-25 12:32 +0100
      Re: Detect XML document encodings with SAX Arne Vajhøj <arne@vajhoej.dk> - 2012-11-25 14:41 -0500
    Re: Detect XML document encodings with SAX Roedy Green <see_website@mindprod.com.invalid> - 2012-12-12 20:32 -0800
    Re: Detect XML document encodings with SAX Stanimir Stamenkov <s7an10@netscape.net> - 2012-12-16 17:43 +0200

Page 3 of 3


#19952

From: Arne Vajhøj <arne@vajhoej.dk>
Date: 2012-11-25 14:41 -0500
Message-ID: <50b27476$0$281$14726298@news.sunsite.dk>
In reply to: #19928

On 11/25/2012 6:00 AM, Steven Simpson wrote:
> On 21/11/12 14:32, Sebastian wrote:
>> Does anyone have an idea why that is so? And how I could
>> go about making some XML parser determine the correct encoding?
>
> Sussed it!  (Come to think of it, I feel I've sussed this before...)
>
> The charset returned by the locator changes during parsing.  At
> startDocument(), it is the assumed charset, possibly based on the first
> four-or-so bytes.  At endDocument(), it is reset to null.  On the first
> call to startElement, it has the correct value.

Cool.

Arne
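
The behaviour Steven describes can be observed with a small handler that samples the locator at two points. This is only a sketch: the class name `EncodingProbe` and its fields are made up for illustration, the sample document and its ISO-8859-1 declaration are assumptions, and it relies on the parser handing out a locator that implements `org.xml.sax.ext.Locator2` (the handler returns null if it does not). What is reported at startDocument() depends on the parser.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.ext.Locator2;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: records what the locator reports at two parse events.
class EncodingProbe extends DefaultHandler {
    private Locator locator;
    private boolean seenElement;
    String atStartDocument;   // parser's early guess, may not be the declared value
    String atFirstElement;    // sampled after the XML declaration has been read

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startDocument() {
        atStartDocument = currentEncoding();
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (!seenElement) {
            seenElement = true;
            atFirstElement = currentEncoding();
        }
    }

    // Locator2 is an optional SAX2 extension, so check before casting.
    private String currentEncoding() {
        return (locator instanceof Locator2) ? ((Locator2) locator).getEncoding() : null;
    }

    static EncodingProbe probe(byte[] xml) {
        EncodingProbe handler = new EncodingProbe();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(xml), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
        return handler;
    }

    public static void main(String[] args) {
        // Assumed sample input: one-byte umlaut, declared as ISO-8859-1.
        byte[] xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00e4</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        EncodingProbe p = probe(xml);
        System.out.println("startDocument: " + p.atStartDocument);
        System.out.println("first element: " + p.atFirstElement);
    }
}
```

Sampling only the first startElement() call keeps the handler cheap; a real application would do its normal processing in the same handler.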



#20285

From: Roedy Green <see_website@mindprod.com.invalid>
Date: 2012-12-12 20:32 -0800
Message-ID: <38mic8hnlk2uuc2irrg0rco49sf3odsgr1@4ax.com>
In reply to: #19834

On Wed, 21 Nov 2012 15:32:19 +0100, Sebastian
<sebastian@undisclosed.invalid> wrote, quoted or indirectly quoted
someone who said :

>Does anyone have an idea why that is so? And how I could
>go about making some XML parser determine the correct encoding?

Not many encodings are easy to recognise.
See http://mindprod.com/products.html#ENCODINGRECOGNISER

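I think you are better off figuring out what the encoding is and
converting the file to UTF-8. native2ascii can do that in two passes:
first to \uXXXX escapes, then back with -reverse -encoding UTF-8.
See http://mindprod.com/jgloss/encoding.html#NATIVE2ASCII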

XML and UTF-8 are the expected pair. You are just asking for trouble
using some other encoding.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Students who hire or con others to do their homework are as foolish 
as couch potatoes who hire others to go to the gym for them. 
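
Once the source encoding is known, the conversion Roedy suggests can also be done in plain Java. The following is a minimal sketch, not a general tool: the class `XmlTranscoder` and the method `toUtf8` are hypothetical names, windows-1252 is only an assumed source encoding (it fits a file where the umlaut and the Euro sign each take one byte), and the regular expression handles just the simple case of an XML declaration at the very start of the file. The declaration has to be rewritten too, otherwise the file would still claim its old encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: re-encodes an XML document whose charset is already known.
class XmlTranscoder {

    static byte[] toUtf8(byte[] input, Charset source) {
        // Decode with the known source charset.
        String text = new String(input, source);
        // Rewrite only the encoding pseudo-attribute of a leading XML declaration.
        String fixed = text.replaceFirst(
                "^(<\\?xml[^>]*?encoding\\s*=\\s*)([\"'])[^\"']*\\2",
                "$1$2UTF-8$2");
        // Encode the result as UTF-8.
        return fixed.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252"); // assumed source encoding
        byte[] in = "<?xml version=\"1.0\" encoding=\"windows-1252\"?><a>\u00e4\u20ac</a>"
                .getBytes(cp1252);
        byte[] out = toUtf8(in, cp1252);
        System.out.println(new String(out, StandardCharsets.UTF_8));
        System.out.println(in.length + " bytes in, " + out.length + " bytes out");
    }
}
```

A document without an encoding declaration passes through the regular expression unchanged, which is fine because UTF-8 is the default for XML.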



#20378

From: Stanimir Stamenkov <s7an10@netscape.net>
Date: 2012-12-16 17:43 +0200
Message-ID: <kakq61$oiu$1@dont-email.me>
In reply to: #19834

Wed, 21 Nov 2012 15:32:19 +0100, /Sebastian/:

> I discovered this post:
> http://www.ibm.com/developerworks/library/x-tipsaxxni/
>
> and implemented both approaches (SAX and Xerces XNI).
>
> Unfortunately, for the attached XML file, both methods
> output an encoding of UTF-8, while looking at the file
> makes it clear that it is not UTF-8 encoded (all characters,
> including the umlaut and the Euro-sign, take one byte, and the
> declared encoding also is not UTF-8).
>
> Does anyone have an idea why that is so? And how I could
> go about making some XML parser determine the correct encoding?

Sorry if this has been answered already elsewhere in the thread. 
The XML specification has a guideline for detecting the source encoding:

http://www.w3.org/TR/xml/#sec-guessing

and this is basically what parsers do.  Single-byte encodings are
basically indistinguishable from one another, so they can be detected
reliably only in the presence of an explicit encoding declaration.

-- 
Stanimir
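
The detection outlined in that section of the specification can be approximated in a few lines. This is a simplified sketch, not what any particular parser does: `XmlEncodingSniffer` and `guess` are hypothetical names, the EBCDIC and unmarked UTF-32 cases are left out, a byte order mark is mapped straight to a BE/LE name, and the declaration is read from at most the first 200 bytes. It checks for a byte order mark, then for the byte pattern of "<?" in UTF-16, and otherwise reads the encoding declaration as if the file were ASCII-compatible, falling back to UTF-8.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: rough encoding guess from the first bytes of an XML document.
class XmlEncodingSniffer {
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml[^>]*?encoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    static String guess(byte[] b) {
        // Byte order marks; the four-byte patterns must be tested before the two-byte ones.
        if (startsWith(b, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(b, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(b, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(b, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(b, 0xFF, 0xFE)) return "UTF-16LE";
        // No BOM, but "<?" encoded as 16-bit units.
        if (startsWith(b, 0x00, 0x3C, 0x00, 0x3F)) return "UTF-16BE";
        if (startsWith(b, 0x3C, 0x00, 0x3F, 0x00)) return "UTF-16LE";
        // ASCII-compatible case: the declaration itself is readable as Latin-1.
        String head = new String(b, 0, Math.min(b.length, 200), StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(head);
        // Without a declaration nothing distinguishes one single-byte encoding
        // from another, so fall back to the XML default.
        return m.find() ? m.group(1) : "UTF-8";
    }

    private static boolean startsWith(byte[] b, int... prefix) {
        if (b.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((b[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-15\"?><a/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = "<a/>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guess(declared));
        System.out.println(guess(undeclared));
    }
}
```

The fallback branch is the point made above: with no declaration, a file in a single-byte encoding is simply taken for UTF-8.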


