From: Sebastian
Newsgroups: comp.lang.java.programmer
Subject: Re: Detect XML document encodings with SAX
Date: Sun, 25 Nov 2012 10:50:25 +0100

On 24.11.2012 23:07, Arne Vajhøj wrote:
[snip]
> I would consider it tempting to rewrite that app to use a standard
> XML parser.
>
> It would solve this problem and possibly also some future problems.

Yes, I wish I could do that (or rather, have that done...). It seems
that app also handles other types of files (like CSV), and regardless of
file type it always does the same thing, namely open an
InputStreamReader given a charset name.

[snip]
> What about just reading the first few lines until you have the
> XML declaration.
>
> Parsing the encoding out of that should be simple.
>
> private static final Pattern encpat =
>     Pattern.compile("encoding\\s*=\\s*['\"]([^'\"]+)['\"]");
> private static String detectSimple(String fnm) throws IOException {
>     BufferedReader br = new BufferedReader(new FileReader(fnm));
>     String firstpart = "";
>     while(!firstpart.contains(">")) firstpart += br.readLine();
>     br.close();
>     Matcher m = encpat.matcher(firstpart);
>     if(m.find()) {
>         return m.group(1);
>     } else {
>         return "Unknown";
>     }
> }
>
> I do not like the solution, but given the restrictions in the
> context, then maybe it is what you need.

Thanks for the suggestion. I'll use that idea until a better solution
becomes feasible.

-- 
Sebastian
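
For reference, here is one way the quoted detectSimple idea might be hardened. It is only a sketch, not what the app in question does. Two things in the quoted version look risky: FileReader decodes with the default charset, and the loop keeps appending "null" forever if the file ends before a ">" shows up. The version below reads a bounded number of raw bytes and decodes them as ISO-8859-1, so every byte maps to a character. The class name XmlEncodingSniffer, the 1024-byte cap, and the "UTF-8" fallback (the XML default when nothing is declared) are my own assumptions, as is the premise that the file uses an ASCII-compatible encoding (so no UTF-16). It needs Java 9 or later for readNBytes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the name and structure are illustrative only.
final class XmlEncodingSniffer {

    // Same idea as the quoted encpat, but anchored to the XML declaration
    // at the very start of the document.
    private static final Pattern DECL = Pattern.compile(
            "^<\\?xml[^>]*?\\bencoding\\s*=\\s*['\"]([^'\"]+)['\"]");

    // Arbitrary cap on how much of the file is inspected (an assumption).
    private static final int MAX_HEAD = 1024;

    /** Returns the declared encoding name, or "UTF-8" if none is declared. */
    static String detect(byte[] head, int len) {
        int start = 0;
        // Skip a UTF-8 byte order mark if one is present.
        if (len >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            start = 3;
        }
        // ISO-8859-1 maps each byte to one char, so decoding cannot fail and
        // the ASCII-only declaration stays readable.
        String text = new String(head, start, len - start,
                StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(text);
        return m.find() ? m.group(1) : "UTF-8";
    }

    /** Reads at most MAX_HEAD bytes from the file and inspects them. */
    static String detect(Path file) throws IOException {
        byte[] buf = new byte[MAX_HEAD];
        try (InputStream in = Files.newInputStream(file)) {
            int len = in.readNBytes(buf, 0, buf.length);
            return detect(buf, len);
        }
    }
}
```

The returned name could then be handed to the existing InputStreamReader-based code as the charset name, for example via new InputStreamReader(stream, XmlEncodingSniffer.detect(path)).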