From: Robert Klemme <shortcutter@googlemail.com>
Newsgroups: comp.lang.ruby
Subject: Re: regex and decomposed character
Date: Mon, 16 Dec 2013 20:58:57 +0100

On 16.12.2013 17:09, Une Bévue wrote:
> Running on Mac OS X, UTF-8 chars are decomposed; for example, the é is
> represented as:
> 65 301
> instead of E9 (precomposed char)

On Linux with locale en_US.UTF-8:

$ echo 'é' | od -t x1c
0000000  c3  a9  0a
         303 251  \n
0000003
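
The same comparison can be made from Ruby itself. This is only a minimal sketch: the helper name hex_bytes is made up for illustration, and the two strings are built from escapes so the result does not depend on how the editor saved the file.

```ruby
# encoding: utf-8

precomposed = "\u00E9"   # é as one code point (U+00E9)
decomposed  = "e\u0301"  # e (U+0065) followed by combining acute accent (U+0301)

# Hypothetical helper: render the UTF-8 bytes of a string as hex.
def hex_bytes(str)
  str.bytes.map { |b| format('%02x', b) }.join(' ')
end

puts hex_bytes(precomposed)      # c3 a9
puts hex_bytes(decomposed)       # 65 cc 81
puts precomposed == decomposed   # false: same glyph, different bytes
```

The two forms render identically but compare as different strings, which is presumably why a pattern typed in one form misses text stored in the other.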

> Then, in a script, we could have a mixture of both representations.
>
> Normally iconv (only on Mac, not on Linux and other OSes) has the
> ability to transform from UTF-8-MAC (decomposed) to UTF-8, but this
> doesn't work with a regex; I don't know why.
>
> For example, in French the screenshots are named "Capture d'écran...",
> and until now I have been unable to write a working regex for that:
> neither the "é" nor the "'" used by Apple is recognised.
>
> I found a workaround by changing the default string for screenshots to
> "Capture ecran..." (no "é", no "'"); however, I wonder about a more
> efficient solution.
>
> With Ruby 2, is there a way to switch between decomposed and precomposed
> chars?

Can you put a zip up somewhere (e.g. GitHub) with the original text and a
Ruby file you wrote just for matching?  Then we could use that as a
starting point for our own experiments.

Also, did you try to give the source file an explicit encoding like so?

#!/usr/bin/ruby
# encoding: utf-8
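
As for switching between the two forms: a rough sketch, assuming a Ruby new enough to have String#unicode_normalize (2.2 or later, so not the 2.0/2.1 releases). The file name below is hypothetical, and the typographic apostrophe (U+2019) is an assumption about what OS X writes; the helper name nfc is made up for illustration.

```ruby
# encoding: utf-8
# Assumption: Ruby 2.2 or later, where String#unicode_normalize exists.

# Hypothetical file name as OS X might return it:
# decomposed é (e + U+0301) and a typographic apostrophe (U+2019, assumed).
name = "Capture d\u2019e\u0301cran 2013-12-16.png"

# Pattern written with the precomposed é (U+00E9); accepts either apostrophe.
pattern = /\ACapture d['\u2019]\u00E9cran/

# Hypothetical helper: bring a string into precomposed (NFC) form.
def nfc(str)
  str.unicode_normalize(:nfc)
end

puts name =~ pattern ? 'match' : 'no match'        # no match
puts nfc(name) =~ pattern ? 'match' : 'no match'   # match
```

The idea is to normalize every incoming string to one form (NFC here) before matching, so the pattern only has to be written once. Passing :nfd instead converts in the other direction.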

Kind regards

	robert