Date: Wed, 30 Oct 2013 21:17:04 +0100
From: Ulrich Goebel <ml@fam-goebel.de>
To: Python <python-list@python.org>
Subject: Re: sorting german characters äöü... solved
References: <52715232.5020108@fam-goebel.de> <CANc-5UxZz_h5s_oGSBzHMZZnqnGDfpdALuWfq5vFb5__mD6jwQ@mail.gmail.com>
In-Reply-To: <CANc-5UxZz_h5s_oGSBzHMZZnqnGDfpdALuWfq5vFb5__mD6jwQ@mail.gmail.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.1838.1383164370.18130.python-list@python.org>

Hi,

On 30.10.2013 19:48, Skip Montanaro wrote:
> Perhaps this? http://stackoverflow.com/questions/816285/where-is-pythons-best-ascii-for-this-unicode-database

There I found the unidecode module
(http://pypi.python.org/pypi/Unidecode),
which turned out to be very helpful. Thanks a lot!

So my function normal() now looks like this:

from unidecode import unidecode

def normal(s):
   r = s
   r = r.strip()              # remove blanks at both ends
   r = r.replace(u' ', u'')   # remove all remaining blanks
   r = unidecode(r)           # does the main work: transliterates
                              # to ASCII - see the unidecode docs
   r = r.upper()              # convert everything to uppercase
   return r

def compare(a, b):
   aa = normal(a)
   bb = normal(b)
   if aa < bb:
      return -1
   elif aa == bb:
      return 0
   else:
      return 1
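
A rough sketch of how these two functions might be plugged into sorting. To keep it runnable with the standard library only, normal() here is a stand-in built on unicodedata (an assumption, not the unidecode call used above); it only covers characters that decompose into a base letter plus accents. The word list is made up for illustration.

```python
import unicodedata
from functools import cmp_to_key


def normal(s):
    # Stand-in for the unidecode-based normal() (assumption: stdlib only).
    # Decompose characters, drop combining marks, remove blanks, uppercase.
    decomposed = unicodedata.normalize("NFKD", s.strip())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.replace(" ", "").upper()


def compare(a, b):
    aa, bb = normal(a), normal(b)
    return (aa > bb) - (aa < bb)   # -1, 0 or 1, like compare() above


# Hypothetical sample data
words = ["Zebra", "Äpfel", "Apfel", "Öl", "Ober", "Straße", "Strasse"]

# Option 1: use normal() directly as a sort key
by_key = sorted(words, key=normal)

# Option 2: wrap the old-style comparison function
by_cmp = sorted(words, key=cmp_to_key(compare))

print(sorted(words))   # plain code point order puts the umlauts last
print(by_key)
print(by_cmp)
```

Both options should give the same order, since Python's sort is stable and compare() returns 0 exactly when the keys are equal. The key= form calls normal() once per element, so it is probably the cheaper choice for long lists.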

For the "normal" cases this works quite perfectly, just as I want it to.
For more extreme cases there are limits, but I would say they are
rather wide ones. For example,
   print(normal(u'-£-¥-Ć-û-á-€-Đ-ø-ț-ﬀ-ỗ-Ể-ễ-ḯ-ę-ä-ö-ü-ß-'))
gives
   -PS-Y=-C-U-A-EU-D-O-T-FF-O-E-E-I-E-A-O-U-SS-
That shows a bit of what unidecode does - and what it doesn't.
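
For comparison, here is a small sketch of what plain Unicode decomposition would do with the same test string. It is not how unidecode works internally (unidecode uses lookup tables); strip_marks() and leftovers() are hypothetical helper names introduced only for this illustration.

```python
import unicodedata


def strip_marks(text):
    # Decompose each character (NFKD) and drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def leftovers(text):
    # Characters that are still non-ASCII after decomposition.
    return [c for c in strip_marks(text) if ord(c) > 127]


sample = u'-£-¥-Ć-û-á-€-Đ-ø-ț-ﬀ-ỗ-Ể-ễ-ḯ-ę-ä-ö-ü-ß-'

print(strip_marks(sample))
print(leftovers(sample))
```

On this string, decomposition alone should leave £, ¥, €, Đ, ø and ß untouched, because they do not decompose into a base letter plus accents. Those are the characters that unidecode turned into PS, Y=, EU, D, O and SS in the output above.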

> There is also a rather long-ish recent topic on a similar topic that
> might be worth scanning as well.

Sorry, I didn't find that.

> I've no direct/recent experience with this topic. I'm just an
> interested bystander.

But a helpful bystander nonetheless. Thank you!

Ulrich
