| From | Owen Jacobson <angrybaldguy@gmail.com> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Message-ID | <2011021122151950227-angrybaldguy@gmailcom> (permalink) |
| References | <15bd3363-c781-487b-98d5-2243eff7cc8f@24g2000yqa.googlegroups.com> |
| Subject | Re: replace extended characters |
| Date | 2011-02-11 22:15 -0500 |
On 2011-02-10 18:33:39 -0500, VIDEO MAN said:

> Hi,
>
> I'm trying to create a java utility that will read in a file that may
> or may not contain extended ascii characters and replace these
> characters with a predetermined character e.g. replace é with e and
> then write the amended file out.
>
> How would people suggest I approach this from an efficiency point of
> view given that the input files could be pretty large?
>
> Any guidance appreciated.

This process already has a name: "normalization". The Java standard library includes tools (java.text.Normalizer and friends) for applying the standard Unicode normalizations.

A rough sketch for the program would be:

1. Load your input text under the correct encoding. You will likely want to leave the choice of encoding up to the user; "extended ASCII" is not an encoding but a loose term covering several possible encodings, and since encoding detection is only statistical, never deterministically correct, it's better not to guess.

2. Normalize the text under NFD. This will replace characters like 'ü' with a sequence containing a simple character ('u' in this case) followed by a sequence of combining marks ('¨' in this case).

   (Alternately, use NFKD instead of NFD. NFKD is more liberal about changing the meaning of the normalized text, but permits things like detaching ligatures, which NFD does not do. Examples are in the normalization spec; see figure 6.)

3. Output the resulting normalized text under the target encoding (presumably US-ASCII). You'll want to do this the "hard way", via java.nio.charset.Charset and CharsetEncoder, so that you can use the onUnmappableCharacter action CodingErrorAction.IGNORE to strip unencodable characters.

You'll want to read up on normalization and the Unicode normalization specs[0] before proceeding. This topic is fraught with non-obvious edge cases.

The suggestion that you use iconv or other existing Unicode-aware encoding conversion tools is a good one, incidentally. This isn't really a problem you need to solve yourself, unless you're completely convinced that none of the usual normalization rules is right for your use case.

-o

[0] <http://unicode.org/reports/tr15/>
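
For illustration, the three steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the class name `AsciiFold` and the method `fold` are hypothetical, it assumes Java 7 or later (for `StandardCharsets` and try-with-resources), and it assumes that processing the text one line at a time is acceptable, which keeps memory use flat for large files but rewrites every line terminator as `\n`.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical helper class; the name and method signature are illustrative only.
final class AsciiFold {

    /**
     * Reads text from `in` using the caller-supplied charset (step 1),
     * normalizes each line with the given form, e.g. NFD or NFKD (step 2),
     * and writes it to `out` as US-ASCII, silently dropping any character
     * that US-ASCII cannot represent (step 3).
     */
    static void fold(InputStream in, Charset inputCharset,
                     Normalizer.Form form, OutputStream out) throws IOException {
        // Step 3 setup: an encoder that drops unencodable characters
        // instead of substituting '?' for them.
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.IGNORE)
                .onMalformedInput(CodingErrorAction.IGNORE);

        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in, inputCharset));
             Writer writer =
                     new BufferedWriter(new OutputStreamWriter(out, encoder))) {
            String line;
            // Streaming line by line: only one line is held in memory at a time.
            while ((line = reader.readLine()) != null) {
                // Step 2: decompose, e.g. U+00E9 becomes 'e' followed by U+0301.
                writer.write(Normalizer.normalize(line, form));
                // Simplification: every line terminator is rewritten as '\n'.
                writer.write('\n');
            }
        }
    }
}
```

Calling `AsciiFold.fold(in, StandardCharsets.ISO_8859_1, Normalizer.Form.NFD, out)` would decode the input as Latin-1, decompose accented letters, and drop the leftover combining marks on output. Note that characters with no decomposition at all, such as 'ø' or 'ß', are dropped entirely rather than replaced, which is one of the edge cases worth checking against your requirements.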