Re: regex and decomposed character

From Robert Klemme <shortcutter@googlemail.com>
Newsgroups comp.lang.ruby
Subject Re: regex and decomposed character
Date Mon, 16 Dec 2013 20:58:57 +0100
Message-ID <bh94c2Fma4mU1@mid.individual.net>
References <l8n8jl$oev$1@shakotay.alphanet.ch>
In-Reply-To <l8n8jl$oev$1@shakotay.alphanet.ch>

On 16.12.2013 17:09, Une Bévue wrote:
> On Mac OS X, UTF-8 characters are decomposed; for example, the é is
> represented as:
> 65 301
> instead of E9 (the precomposed character).

On Linux with locale en_US.UTF-8:

$ echo 'é' | od -t x1c
0000000  c3  a9  0a
         303 251  \n
0000003
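
For comparison, here is a minimal Ruby sketch of the same check from
inside a script. It assumes that the "65 301" quoted above means the
letter e (0x65) followed by the combining acute accent U+0301; the
helper name hex_bytes is made up for illustration.

```ruby
# encoding: utf-8

# The same visible character in its two Unicode forms.
precomposed = "\u00E9"    # single code point U+00E9 (NFC form)
decomposed  = "e\u0301"   # "e" + COMBINING ACUTE ACCENT (NFD form)

# Hypothetical helper: the bytes of a string as space-separated hex.
def hex_bytes(str)
  str.unpack("C*").map { |b| format("%02x", b) }.join(" ")
end

puts hex_bytes(precomposed)      # c3 a9
puts hex_bytes(decomposed)       # 65 cc 81
puts precomposed == decomposed   # false: the byte sequences differ
```

The two strings look identical on screen but are not equal, so a
pattern written with one form will not match text stored in the other.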

> So a script could end up with a mixture of both representations.
>
> Normally iconv (only on Mac, not on Linux and other OSes) is able
> to convert from UTF-8-MAC (decomposed) to UTF-8, but this
> doesn't work with a regex; I don't know why.
>
> For example, in French the screenshots are named "Capture d'écran..."
> and so far I have been unable to write a working regex for that: neither
> the "é" nor the "'" used by Apple is recognised.
>
> I found a workaround: changing the default name for screenshots to
> "Capture ecran..." (no "é", no "'"). However, I am looking for a more
> efficient solution.
>
> With Ruby 2, is there a way to switch between decomposed and precomposed
> chars?

Can you put a zip up somewhere (e.g. on GitHub) with the original text and
a Ruby file you wrote just for matching?  Then we could use that as a
starting point for our own experiments.

Also, did you try to give the source file an explicit encoding like so?

#!/usr/bin/ruby
# encoding: utf-8
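
If the encoding comment alone does not help, normalizing the string
before matching might. A rough sketch, assuming Ruby 2.2 or later (where
String#unicode_normalize is available), a made-up file name, and that
the apostrophe Apple uses is the typographic one (U+2019):

```ruby
# encoding: utf-8

# Hypothetical screenshot name as it might come from the Mac file system:
# decomposed e-acute (e + U+0301) and, by assumption, apostrophe U+2019.
name = "Capture d\u2019e\u0301cran 2013-12-16.png"

# Pattern written with the precomposed e-acute (U+00E9); it accepts
# either the ASCII apostrophe or U+2019.
pattern = /\ACapture d['\u2019]\u00E9cran/

puts name =~ pattern ? "match" : "no match"   # no match

# Convert to the precomposed form (NFC) before matching.
nfc = name.unicode_normalize(:nfc)
puts nfc =~ pattern ? "match" : "no match"    # match
```

Passing :nfd instead of :nfc goes the other way, from precomposed to
decomposed.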

Kind regards

	robert
