
Re: regex and decomposed character

From Robert Klemme <shortcutter@googlemail.com>
Newsgroups comp.lang.ruby
Subject Re: regex and decomposed character
Date 2013-12-16 20:58 +0100
Message-ID <bh94c2Fma4mU1@mid.individual.net>
References <l8n8jl$oev$1@shakotay.alphanet.ch>



On 16.12.2013 17:09, Une Bévue wrote:
> Running on Mac OS X, UTF-8 chars are decomposed; for example, the é is
> represented as:
> 65 301
> instead of E9 (precomposed char)

On Linux with locale en_US.UTF-8:

$ echo 'é' | od -t x1c
0000000  c3  a9  0a
         303 251  \n
0000003
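
The same check can be done from Ruby. This is only a minimal sketch: both forms are written as \u escapes so the encoding of the source file does not matter, and the constant names and the hex_bytes helper are made up for illustration.

```ruby
# encoding: utf-8

# Hypothetical names, for illustration only.
PRECOMPOSED = "\u00E9"   # é as a single code point (U+00E9)
DECOMPOSED  = "e\u0301"  # e (U+0065) followed by COMBINING ACUTE ACCENT (U+0301)

# Render the bytes of a string as space-separated hex, similar to od -t x1.
def hex_bytes(str)
  str.bytes.map { |b| format("%02x", b) }.join(" ")
end

puts hex_bytes(PRECOMPOSED)      # prints: c3 a9
puts hex_bytes(DECOMPOSED)       # prints: 65 cc 81
puts PRECOMPOSED == DECOMPOSED   # prints: false
```

Both strings display as é, but they are different byte sequences, so a pattern written with one form will not match text stored in the other.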

> Then, in a script we could have a mixture of both representations.
>
> Normally iconv (only on Mac, not on Linux and other OSes) is able to
> convert from UTF-8-MAC (decomposed) to UTF-8, but this doesn't work
> with a regex; I don't know why.
>
> For example, in French the screenshots are named "Capture d'écran..."
> and so far I'm unable to write a working regex for that: neither the "é"
> nor the "'" used by Apple is recognised.
>
> I found a workaround: changing the default name for screenshots to
> "Capture ecran..." (no "é", no "'"). However, I wonder whether there is
> a better solution.
>
> With Ruby 2, is there a way to switch between decomposed and precomposed
> chars?

Can you put a zip up somewhere (e.g. GitHub) with the original text and a
Ruby file you wrote just for the matching?  Then we could use that as a
starting point for our own experiments.

Also, did you try to give the source file an explicit encoding like so?

#!/usr/bin/ruby
# encoding: utf-8
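
If the encoding comment alone does not help, one option is to normalize every name to the precomposed form before matching. The sketch below assumes Ruby 2.2 or later, where String#unicode_normalize is available (so newer than the Ruby 2.0 asked about). The sample file names, the PATTERN constant, and the screenshot? helper are hypothetical, and treating Apple's apostrophe as U+2019 is an assumption to verify against a real file name.

```ruby
# encoding: utf-8

# Accept either the ASCII apostrophe or U+2019 (assumed to be the one Apple
# uses), followed by the precomposed é (U+00E9).
PATTERN = /\ACapture d['\u2019]\u00E9cran/

# Normalize to NFC (precomposed) first, so that decomposed input such as
# "e" + U+0301 is turned into U+00E9 before the regex sees it.
def screenshot?(name)
  !(name.unicode_normalize(:nfc) =~ PATTERN).nil?
end

# Hypothetical sample names: one decomposed (as from OS X), one precomposed.
NFD_NAME = "Capture d\u2019e\u0301cran 2013-12-16.png"
NFC_NAME = "Capture d'\u00E9cran 2013-12-16.png"

puts screenshot?(NFD_NAME)           # prints: true
puts screenshot?(NFC_NAME)           # prints: true
puts((NFD_NAME =~ PATTERN).inspect)  # prints: nil (no match without normalizing)
```

Normalizing the input rather than the pattern keeps the regex readable, and the same pattern then works whichever form the OS hands back.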

Kind regards

	robert


