From: Robert Klemme
Newsgroups: comp.lang.ruby
Subject: Re: regex and decomposed character
Date: Mon, 16 Dec 2013 20:58:57 +0100

On 16.12.2013 17:09, Une Bévue wrote:
> Running on Mac OS X the UTF-8 chars are decomposed, for example the é is
> represented as:
> 65 301
> instead of E9 (precomposed char)

On Linux with locale en_US.UTF-8:

$ echo 'é' | od -t x1c
0000000  c3  a9  0a
        303 251  \n
0000003

> Then, in a script we could have a mixture of both representations.
>
> Normally iconv, only on Mac and not on Linux and other OSes, has the
> ability to transform from UTF-8-MAC (decomposed) to UTF-8, but this
> doesn't work with a regex, I don't know why.
>
> For example, in French the screen shots are named "Capture d'écran..."
> and I'm unable, until now, to write a working regex for that; both the
> "é" and the "'" used by Apple aren't recognised.
>
> I found a workaround by changing the default string for screen shots to
> "Capture ecran..." (no "é", no "'"), however I wonder about a more
> efficient solution.
>
> With Ruby 2 is there a way to switch between decomposed and precomposed
> chars?

Can you put a zip up somewhere (e.g. GitHub) with the original text and a
Ruby file you wrote just for matching? Then we could use that as a
starting point for our own experiments.

Also, did you try to give the source file an explicit encoding like so?

  #!/usr/bin/ruby
  # encoding: utf-8

Kind regards

robert
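
For experiments along these lines, here is a minimal sketch of one way to make a single regex match both representations: normalize the file name to NFC before matching. It is not the original poster's code. It assumes Ruby 2.2 or later, where String#unicode_normalize exists (it was not available in the Ruby 2.0/2.1 releases current at the time of this thread). The sample file names, the PATTERN constant and the screenshot? helper are hypothetical, and the guess that Apple's apostrophe is U+2019 is an assumption the thread does not confirm.

```ruby
# encoding: utf-8
# Sketch only: String#unicode_normalize requires Ruby 2.2 or later.

# Decomposed form, as reported for Mac OS X:
# "e" (U+0065) followed by a combining acute accent (U+0301).
DECOMPOSED = "Capture d'e\u0301cran 2013-12-16.png"

# Precomposed form: a single U+00E9 code point.
PRECOMPOSED = "Capture d'\u00E9cran 2013-12-16.png"

# Hypothetical pattern written with the precomposed character.
# The character class accepts either a plain ASCII apostrophe or
# U+2019 (assumed, not confirmed, to be the one Apple uses).
PATTERN = /\ACapture d['\u2019]\u00E9cran/

# Hypothetical helper: normalize to NFC first, then match, so that
# decomposed and precomposed input are treated the same way.
def screenshot?(name)
  !!(name.unicode_normalize(:nfc) =~ PATTERN)
end

p(DECOMPOSED =~ PATTERN)     # prints nil: raw decomposed text does not match
p screenshot?(DECOMPOSED)    # prints true
p screenshot?(PRECOMPOSED)   # prints true
```

Normalizing the input keeps the pattern itself readable; the alternative of writing both forms into the regex, e.g. (?:\u00E9|e\u0301), also works but has to be repeated for every accented character.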