Diacretical incensitive search

Started by	Olive <diolu.remove_this_part@bigfoot.com>
First post	2013-05-17 08:57 +0200
Last post	2013-05-20 09:10 +0000
Articles	6 — 5 participants

  Diacretical incensitive search Olive <diolu.remove_this_part@bigfoot.com> - 2013-05-17 08:57 +0200
    Re: Diacretical incensitive search Petite Abeille <petite.abeille@gmail.com> - 2013-05-17 09:15 +0200
    Re: Diacretical incensitive search Peter Otten <__peter__@web.de> - 2013-05-17 10:30 +0200
      Re: Diacretical incensitive search Olive <diolu.remove_this_part@bigfoot.com> - 2013-05-17 17:37 +0200
        Re: Diacretical incensitive search jmfauth <wxjmfauth@gmail.com> - 2013-05-17 10:31 -0700
    Re: Diacretical incensitive search Jorgen Grahn <grahn+nntp@snipabacken.se> - 2013-05-20 09:10 +0000

#45471 — Diacretical incensitive search

From	Olive <diolu.remove_this_part@bigfoot.com>
Date	2013-05-17 08:57 +0200
Subject	Diacretical incensitive search
Message-ID	<20130517085704.3f6609e8@pcolivier.chezmoi.net>

One feature that seems to be missing from the re module (or any tool that I know of for searching text) is "diacritic-insensitive search". I would like to have a match for something like this:

re.match("franc", "français")

in about the same way we can have a case-insensitive search:

re.match("(?i)fran", "Français").

Another related and more general problem (in the sense that it could easily be used to solve the first problem) would be to translate a string removing any diacritical mark:

nodiac("Français") -> "Francais"

The algorithm to write such a function is trivial, but there are a lot of marks we can put on a letter. It would be necessary to have the list of "a"'s with something on it, e.g. "à,á,ã", etc., and this for every letter. Trying to make such a list by hand would inevitably lead to some symbols being forgotten (and would be tedious).

Olive

#45472

From	Petite Abeille <petite.abeille@gmail.com>
Date	2013-05-17 09:15 +0200
Message-ID	<mailman.1782.1368774962.3114.python-list@python.org>
In reply to	#45471

On May 17, 2013, at 8:57 AM, Olive <diolu.remove_this_part@bigfoot.com> wrote:

> The algorithm to write such a function is trivial, but there are a lot of marks we can put on a letter. It would be necessary to have the list of "a"'s with something on it, e.g. "à,á,ã", etc., and this for every letter. Trying to make such a list by hand would inevitably lead to some symbols being forgotten (and would be tedious).

Perhaps of interest: Sean M. Burke's Unidecode.

There appear to be several python implementations, e.g.:

https://pypi.python.org/pypi/Unidecode

#45473

From	Peter Otten <__peter__@web.de>
Date	2013-05-17 10:30 +0200
Message-ID	<mailman.1783.1368779403.3114.python-list@python.org>
In reply to	#45471

Olive wrote:

> One feature that seems to be missing from the re module (or any tool that
> I know of for searching text) is "diacritic-insensitive search". I would
> like to have a match for something like this:
> 
> re.match("franc", "français")
> 
> in about the same way we can have a case-insensitive search:
> 
> re.match("(?i)fran", "Français").
> 
> Another related and more general problem (in the sense that it could
> easily be used to solve the first problem) would be to translate a string
> removing any diacritical mark:
> 
> nodiac("Français") -> "Francais"
> 
> The algorithm to write such a function is trivial, but there are a lot of
> marks we can put on a letter. It would be necessary to have the list of
> "a"'s with something on it, e.g. "à,á,ã", etc., and this for every letter.
> Trying to make such a list by hand would inevitably lead to some symbols
> being forgotten (and would be tedious).

[Python 3.3]

>>> import unicodedata
>>> unicodedata.normalize("NFKD", "Français").encode("ascii", "ignore").decode()
'Francais'
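
The one-liner above also deletes any character that has no ASCII decomposition (for example "ø" or "ß"), because encode("ascii", "ignore") drops whatever non-ASCII is left. What follows is only a sketch of an alternative that removes just the combining marks. nodiac is the hypothetical name from the original question, and search_nodiac is likewise made up here; neither exists in the standard library.

```python
import re
import unicodedata


def nodiac(text):
    """Return text with combining marks removed (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() returns 0 for characters that are not
    # combining marks, so only the marks themselves are dropped.
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def search_nodiac(pattern, text, flags=0):
    """Search a folded copy of text (hypothetical helper).

    Assumption: the pattern is plain ASCII. Match offsets refer to the
    folded copy, not to the original string.
    """
    return re.search(pattern, nodiac(text), flags)


print(nodiac("Français"))
print(nodiac("Søren"))  # "ø" has no decomposition, so it is kept as-is
print(bool(search_nodiac("franc", "français")))
```

Because the search runs on a folded copy, this is only suitable when the positions of the match in the original text do not matter.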

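import sys
from collections import defaultdict
from unicodedata import normalize

# Group every code point under the first character of its NFKD decomposition.
d = defaultdict(list)
for i in range(sys.maxunicode + 1):  # + 1 so the last code point is included
    c = chr(i)
    n = normalize("NFKD", c)[0]
    if ord(n) < 128 and n.isalpha():  # optional: keep ASCII base letters only
        d[n].append(c)

# Print each base letter followed by all of its variants.
for k, v in d.items():
    if len(v) > 1:
        print(k, "".join(v))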
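
For the original re.match("franc", "français") question, a table like the one built by the script above can be turned into a regular expression, so that matching happens on the unmodified text and match offsets stay valid. This is only a sketch under assumptions: the helper names build_variants and diacritic_pattern are made up for illustration, the text is assumed to be in precomposed (NFC) form, and because the table is built with NFKD it also admits compatibility characters such as fullwidth or circled letters.

```python
import re
import sys
import unicodedata
from collections import defaultdict


def build_variants():
    """Map each ASCII letter to all code points whose NFKD form starts with it."""
    variants = defaultdict(list)
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        base = unicodedata.normalize("NFKD", c)[0]
        if ord(base) < 128 and base.isalpha():
            variants[base].append(c)
    return variants


def diacritic_pattern(term, variants):
    """Expand every ASCII letter of term into a character class of its variants."""
    parts = []
    for ch in term:
        chars = variants.get(ch)
        if chars:
            parts.append("[" + "".join(re.escape(c) for c in chars) + "]")
        else:
            # Anything that is not an ASCII letter is matched literally.
            parts.append(re.escape(ch))
    return "".join(parts)


variants = build_variants()  # scans all code points, so build it once
pattern = diacritic_pattern("franc", variants)

# Normalize to NFC first so that "c" + combining cedilla becomes one character.
text = unicodedata.normalize("NFC", "français")
print(bool(re.match(pattern, text)))
print(bool(re.match(pattern, "frank")))
```

The design choice here is to pay for one pass over all code points up front and then leave the searched text untouched, in contrast to folding the text before every search.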

See also <http://effbot.org/zone/unicode-convert.htm>

PS: Be warned that experiments on the console may be misleading. The combining cedilla is still in the result even when the console does not show it:

>>> unicodedata.normalize("NFKD", "ç")
'c'
>>> ascii(_)
"'c\\u0327'"

#45476

From	Olive <diolu.remove_this_part@bigfoot.com>
Date	2013-05-17 17:37 +0200
Message-ID	<20130517173717.65f0aa31@pcolivier.chezmoi.net>
In reply to	#45473

Thanks a lot!

#45480

From	jmfauth <wxjmfauth@gmail.com>
Date	2013-05-17 10:31 -0700
Message-ID	<ef50d7a2-9297-4b19-b1b0-249e86e09678@en2g2000vbb.googlegroups.com>
In reply to	#45476

--------


The handling of diacritics makes an especially nice case
study. One can use it to toy with some specific features of
Unicode: normalisation, decomposition, ...

... and also to show how Unicode can be badly implemented.

A first, quick example that came to mind (Python 3.2.5, then Python 3.3.2):

>>> import timeit
>>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
[2.929404406789672, 2.923327801150208, 2.923659417064755]

>>> timeit.repeat("ud.normalize('NFKC', ud.normalize('NFKD', 'ᶑḗḖḕḹ'))", "import unicodedata as ud")
[3.8437222586746884, 3.829490737203514, 3.819266963414293]

jmf

#45609

From	Jorgen Grahn <grahn+nntp@snipabacken.se>
Date	2013-05-20 09:10 +0000
Message-ID	<slrnkpjq3m.3jn.grahn+nntp@frailea.sa.invalid>
In reply to	#45471

On Fri, 2013-05-17, Olive wrote:

> One feature that seems to be missing from the re module (or any tool
> that I know of for searching text) is "diacritic-insensitive search". I
> would like to have a match for something like this:

> re.match("franc", "français")
...

> The algorithm to write such a function is trivial, but there are a
> lot of marks we can put on a letter. It would be necessary to have the
> list of "a"'s with something on it, e.g. "à,á,ã", etc., and this for
> every letter. Trying to make such a list by hand would inevitably lead
> to some symbols being forgotten (and would be tedious).

OK, but please remember that diacritics are of varying importance.
The English "naïve" is easily recognizable when written as "naive".
The Swedish word "får" cannot be spelled "far" and still be understood.
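
Jorgen's point suggests that any folding function should let the caller exempt characters that count as separate letters in the language at hand. A minimal sketch, assuming a made-up helper nodiac_keep and an illustrative Swedish keep-set (å, ä, ö and their capitals); neither comes from this thread or from any library.

```python
import unicodedata


def nodiac_keep(text, keep=frozenset()):
    """Strip combining marks, except on characters listed in keep.

    Hypothetical helper: keep is expected to hold precomposed (NFC)
    characters.
    """
    out = []
    # NFC first, so that each kept letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


# Illustrative keep-set, written with escapes so the literals are
# guaranteed to be precomposed: å ä ö Å Ä Ö
SWEDISH_LETTERS = frozenset("\u00e5\u00e4\u00f6\u00c5\u00c4\u00d6")

print(nodiac_keep("naïve"))                 # diaeresis is dropped
print(nodiac_keep("får", SWEDISH_LETTERS))  # "å" is kept
print(nodiac_keep("får"))                   # without the keep-set it folds to "far"
```

Which characters belong in such a keep-set is a per-language decision that the caller has to make; Unicode normalization alone cannot tell a decorative mark from a distinct letter.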

This is IMHO out of the scope of re, and perhaps case-insensitivity
should have been too.  Perhaps it /would/ have been, if regular
expressions hadn't come from the ASCII world where these things are
easy.

/Jorgen

-- 
  // Jorgen Grahn <grahn@  Oo  o.   .     .
\X/     snipabacken.se>   O  o   .
