Re: replace extended characters

From Owen Jacobson <angrybaldguy@gmail.com>
Newsgroups comp.lang.java.programmer
Message-ID <2011021122151950227-angrybaldguy@gmailcom> (permalink)
References <15bd3363-c781-487b-98d5-2243eff7cc8f@24g2000yqa.googlegroups.com>
Date 2011-02-11 22:15 -0500


On 2011-02-10 18:33:39 -0500, VIDEO MAN said:

> Hi,
> 
> I'm trying to create a java utility that will read in a file that may
> or may not contain extended ascii characters and replace these
> characters with a predetermined character e.g. replace é with e and
> then write the amended file out.
> 
> How would people suggest I approach this from an efficiency  point of
> view given that the input files could be pretty large?
> 
> Any guidance appreciated.

This process already has a name: "normalization". The Java standard 
library includes tools (java.text.Normalizer and friends) for applying 
the standard Unicode normalizations. A rough sketch for the program 
would be:

1. Load your input text under the correct encoding. You will likely 
want to leave the choice of encodings up to the user; "extended ASCII" 
is not an encoding but encompasses several possible options, and since 
there are only statistical and not deterministically correct ways to 
detect encodings, it's better not to guess.

2. Normalize the text under NFD. This will replace characters like 'ü' 
with a sequence containing a simple character - 'u' in this case - 
followed by one or more combining marks - U+0308 COMBINING DIAERESIS, 
which renders like '¨', in this case. (Alternatively, use NFKD instead 
of NFD. NFKD is more liberal about changing the meaning of the 
normalized text, but permits things like detaching ligatures, which 
NFD does not do. Examples are in the normalization spec - see figure 6.)

3. Output the resulting normalized text under the target encoding 
(presumably US-ASCII). You'll want to do this the "hard way", via 
java.nio.charset.Charset and CharsetEncoder, so that you can use the 
onUnmappableCharacter action CodingErrorAction.IGNORE to strip 
unencodable characters.
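
As a rough illustration of steps 1 to 3, here is a minimal sketch, not a 
finished tool. It assumes Java 7 or later (StandardCharsets, java.nio.file 
and try-with-resources), and the class and method names (AsciiFold, fold, 
foldFile, lossyAsciiEncoder) are made up for the example. It works a line 
at a time so that memory use stays bounded on large inputs; the price is 
that every line terminator is rewritten as '\n'.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.text.Normalizer;

// Hypothetical class name; a sketch of steps 1-3, assuming Java 7 or later.
final class AsciiFold {

    // Step 3 helper: a US-ASCII encoder that silently drops anything it cannot encode.
    static CharsetEncoder lossyAsciiEncoder() {
        return StandardCharsets.US_ASCII.newEncoder()
                .onMalformedInput(CodingErrorAction.IGNORE)
                .onUnmappableCharacter(CodingErrorAction.IGNORE);
    }

    // Steps 2 and 3 applied to an in-memory string.
    static String fold(String text) {
        // Step 2: split accented characters into base character + combining marks.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        try {
            // Step 3: encode as US-ASCII; the combining marks are unmappable and get dropped.
            ByteBuffer ascii = lossyAsciiEncoder().encode(CharBuffer.wrap(decomposed));
            return StandardCharsets.US_ASCII.decode(ascii).toString();
        } catch (CharacterCodingException e) {
            // Not expected, since both error actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    // Steps 1-3 applied to files, one line at a time.
    static void foldFile(Path in, Charset inCharset, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, inCharset); // step 1
             Writer writer = new BufferedWriter(
                     new OutputStreamWriter(Files.newOutputStream(out), lossyAsciiEncoder()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(Normalizer.normalize(line, Normalizer.Form.NFD));
                writer.write('\n'); // readLine() strips the original line terminator
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java AsciiFold <input-file> <input-charset> <output-file>
        foldFile(Paths.get(args[0]), Charset.forName(args[1]), Paths.get(args[2]));
    }
}
```

The input charset is taken from the command line rather than guessed, in 
line with step 1. Note that Files.newBufferedReader throws on input that is 
malformed for the named charset, which is probably what you want if the 
user picked the wrong encoding.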

You'll want to read up on normalization and the Unicode normalization 
specs[0] before proceeding. This topic is fraught with non-obvious edge 
cases.
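
A few of those edge cases are easy to see for yourself. The sketch below 
(class and method names are made up for the example) prints the code points 
that java.text.Normalizer produces for a handful of characters. The 
expected values in the comments follow from the Unicode decomposition 
mappings for those characters.

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

// Hypothetical class name; prints code points before and after normalization.
final class NormalizationEdgeCases {

    // Formats a string as space-separated "U+XXXX" code points.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    static String nfd(String s) {
        return Normalizer.normalize(s, Form.NFD);
    }

    static String nfkd(String s) {
        return Normalizer.normalize(s, Form.NFKD);
    }

    public static void main(String[] args) {
        // u with diaeresis decomposes canonically: u + combining diaeresis.
        System.out.println(codePoints(nfd("\u00fc")));   // U+0075 U+0308

        // o with stroke and sharp s have no decomposition at all, under either form.
        System.out.println(codePoints(nfd("\u00f8")));   // U+00F8
        System.out.println(codePoints(nfkd("\u00df")));  // U+00DF

        // The "fi" ligature is left alone by NFD but split by NFKD.
        System.out.println(codePoints(nfd("\ufb01")));   // U+FB01
        System.out.println(codePoints(nfkd("\ufb01")));  // U+0066 U+0069
    }
}
```

The practical consequence: with the normalize-then-strip approach, 
characters such as 'ø' and 'ß' are dropped entirely when encoding to 
US-ASCII rather than turning into "o" or "ss". If you need those, you will 
have to map them explicitly before encoding.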

The suggestion that you use iconv or other existing Unicode-aware 
encoding conversion tools is a good one, incidentally. This isn't 
really a problem you need to solve yourself, unless you're completely 
convinced that none of the usual normalization rules is right for your 
use case.

-o

[0] <http://unicode.org/reports/tr15/>

