Re: Case-insensitive sorting of strings (Python newbie)

Started by: Peter Otten <__peter__@web.de>
First post: 2015-01-23 18:53 +0100
Last post: 2015-01-24 02:34 -0800
Articles: 4 (4 participants)

This discussion began before the indexed window, so earlier articles aren't shown. The article labeled "Started by" above is the oldest one visible, not the original post.


Contents

  Re: Case-insensitive sorting of strings (Python newbie) Peter Otten <__peter__@web.de> - 2015-01-23 18:53 +0100
    Re: Case-insensitive sorting of strings (Python newbie) Marko Rauhamaa <marko@pacujo.net> - 2015-01-23 21:14 +0200
      Re: Case-insensitive sorting of strings (Python newbie) Chris Angelico <rosuav@gmail.com> - 2015-01-24 06:56 +1100
    Re: Case-insensitive sorting of strings (Python newbie) wxjmfauth@gmail.com - 2015-01-24 02:34 -0800

#84368 — Re: Case-insensitive sorting of strings (Python newbie)

From: Peter Otten <__peter__@web.de>
Date: 2015-01-23 18:53 +0100
Subject: Re: Case-insensitive sorting of strings (Python newbie)
Message-ID: <mailman.18046.1422035592.18130.python-list@python.org>

John Sampson wrote:

> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
> 
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.

The standard recommendation is to convert bytes to unicode as early as 
possible and only manipulate unicode. This is more likely to give correct 
results when slicing or converting a string.

$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56) 
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
... 
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich
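
The last line of that output is the interesting one: on undecoded bytes, lower() only touches ASCII letters, so the two UTF-8 bytes of "Ä" are left alone. A minimal sketch of the same effect, written for Python 3 (an assumption; the session above is Python 2, where str is a byte string):

```python
# Python 3 sketch: bytes.lower() only maps the ASCII letters A-Z.
raw = "ÄHNLICH".encode("utf-8")               # b'\xc3\x84HNLICH'

lowered_bytes = raw.lower()                   # the bytes of "Ä" stay untouched
lowered_text = raw.decode("utf-8").lower()    # decode first, then lowercase

print(lowered_bytes.decode("utf-8"))  # Ähnlich
print(lowered_text)                   # ähnlich
```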

Now the same with unicode. To read text with a specific encoding use either 
codecs.open() or io.open() instead of the built-in open() (replace utf-8 with 
your actual encoding):

>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"): 
...     line = line.strip()
...     print line, line.lower()
... 
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich

Unfortunately this will not give the order that you (or a German speaker, in 
the example below) will probably expect:

>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich
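
The order above is plain code point order of the lowercased strings: "ä" (U+00E4) and "ü" (U+00FC) have higher code points than "z" (U+007A), so the umlaut words land at the end. A small Python 3 sketch (str.lower stands in for Python 2's unicode.lower; the word list is typed in rather than read from tmp.txt):

```python
# Python 3 sketch: sorting by lowercased text compares code points only.
words = ["ähnlich", "üblich", "nötig", "möglich", "Maß", "Maße", "Masse", "ÄHNLICH"]

by_code_point = sorted(words, key=str.lower)
print(by_code_point)

# "ä" sorts after every ASCII letter because its code point is larger.
print(hex(ord("z")), hex(ord("ä")))  # 0x7a 0xe4
```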

For case-insensitive sorting you get better results with the locale-aware 
locale.strxfrm(), but in Python 2 it doesn't accept non-ASCII unicode strings:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 
0: ordinal not in range(128)

As a workaround you can sort the undecoded byte strings first:

>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich

You should still convert the result to unicode if you want to do further 
processing in Python.
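
For comparison, in Python 3 locale.strxfrm() takes ordinary (Unicode) str objects, so the byte-string detour is not needed there. A hedged sketch follows; the helper name locale_sorted and its fallback behaviour are my own assumptions, and the resulting order depends entirely on which locales are installed on the machine:

```python
import locale


def locale_sorted(words, locale_name=""):
    """Sort words using the collation rules of locale_name.

    locale_name="" means "use the user's default locale". If the requested
    locale is not available, the current collation setting is kept.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        pass  # requested locale not installed; keep the current one
    try:
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)  # restore global state


# With a German locale installed one would call, for example,
# locale_sorted(words, "de_DE.UTF-8"); the "C" locale compares code points.
print(locale_sorted(["pear", "Apple", "fig"], "C"))
```

Note that setlocale() changes process-wide state, which is why the sketch restores the previous setting; it is not safe to call from several threads at once.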


#84387

From: Marko Rauhamaa <marko@pacujo.net>
Date: 2015-01-23 21:14 +0200
Message-ID: <873871fgxk.fsf@elektro.pacujo.net>
In reply to: #84368

Peter Otten <__peter__@web.de>:

> The standard recommendation is to convert bytes to unicode as early as
> possible and only manipulate unicode.

Unicode doesn't get you off the hook (as you explain later in your
post). Upper/lowercase mapping and collation order are both ambiguous.
Even with decent locale support, Python can't be expected to do it all
for you.

Well, if Python can't, then who can? Probably nobody in the world, not
generically, anyway.

Example:

    >>> print("re\u0301sume\u0301")
    résumé
    >>> print("r\u00e9sum\u00e9")
    résumé
    >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
    False
    >>> print("\ufb01nd")
    find
    >>> print("find")
    find
    >>> print("\ufb01nd" == "find")
    False

If equality can't be determined, words really can't be sorted.
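
To make the example concrete, the strings that print alike are different code point sequences, which is all that == compares. A short Python 3 sketch that lists them (the helper name code_points is only for illustration):

```python
import unicodedata


def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return ["U+%04X" % ord(ch) for ch in s]


decomposed = "re\u0301sume\u0301"   # "e" followed by a combining accent
precomposed = "r\u00e9sum\u00e9"    # single precomposed "é" characters

print(len(decomposed), code_points(decomposed))
print(len(precomposed), code_points(precomposed))

# The ligature is one character of its own, not "f" followed by "i".
print(unicodedata.name("\u0301"))
print(unicodedata.name("\ufb01"))
```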


Marko


#84392

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-01-24 06:56 +1100
Message-ID: <mailman.18057.1422042982.18130.python-list@python.org>
In reply to: #84387

On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
>     >>> print("re\u0301sume\u0301")
>     résumé
>     >>> print("r\u00e9sum\u00e9")
>     résumé
>     >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
>     False
>     >>> print("\ufb01nd")
>     find
>     >>> print("find")
>     find
>     >>> print("\ufb01nd" == "find")
>     False
>
> If equality can't be determined, words really can't be sorted.

Ah, that's a bit easier to deal with. Just use Unicode normalization.

>>> import unicodedata
>>> print(unicodedata.normalize("NFC", "re\u0301sume\u0301") == unicodedata.normalize("NFC", "r\u00e9sum\u00e9"))
True

It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:

def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims leading and trailing whitespace.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()

Then it's much tidier:

>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
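
One consequence of casefold() that ties back to the German word list earlier in the thread: it maps "ß" to "ss", which lower() does not. A Python 3 sketch that groups strings by such a key (fold_key is a stand-in name that repeats the function above so the block runs on its own):

```python
import unicodedata


def fold_key(s):
    """Compose, strip surrounding whitespace, and case-fold (as in key() above)."""
    return unicodedata.normalize("NFC", s).strip().casefold()


words = ["Maße", "MASSE", "masse ", "Maß"]

# Collect the words that the key treats as equal.
groups = {}
for w in words:
    groups.setdefault(fold_key(w), []).append(w)

print(groups)
print("Maße".lower(), "Maße".casefold())  # maße masse
```

Whether "Maße" and "Masse" should really compare equal is exactly the kind of decision the key function encodes; for German text they are different words.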

You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all strings of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (and maybe zero-width breaking spaces,
too), etc, etc. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.
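
A sketch of what such a stricter search key might look like in Python 3. The name search_key, the choice of U+200B and U+FEFF as the characters to delete, and the second normalization pass are assumptions of this sketch, not something the post prescribes:

```python
import unicodedata

# Characters to delete outright: ZERO WIDTH SPACE and
# ZERO WIDTH NO-BREAK SPACE (assumed set; extend as needed).
_ZERO_WIDTH = dict.fromkeys([0x200B, 0xFEFF])


def search_key(s):
    """Fold a string aggressively for search-style comparison."""
    s = unicodedata.normalize("NFKC", s)   # compatibility forms, e.g. ligatures
    s = s.translate(_ZERO_WIDTH)           # drop zero-width characters
    s = " ".join(s.split())                # collapse whitespace runs to one U+0020
    s = s.casefold()
    return unicodedata.normalize("NFKC", s)  # case folding can denormalize


print(search_key("  \ufb01nd\u00a0\u2003it\u200b ") == search_key("Find it"))
print(search_key("re\u0301sume\u0301") == search_key("R\u00c9SUM\u00c9"))
```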

ChrisA


#84456

From: wxjmfauth@gmail.com
Date: 2015-01-24 02:34 -0800
Message-ID: <2260ad2f-6581-4c2f-896f-b2c3d7a27ebc@googlegroups.com>
In reply to: #84368

On Friday, 23 January 2015 at 18:54:11 UTC+1, Peter Otten wrote:

-------
Hard drive archeology. Python 2 and Python 3.

One way (among others) to do this is to use the Unicode
Collation Algorithm with the Default Unicode Collation Element
Table (DUCET).


Here it is in action with a reduced character set (Latin only,
> ~1000 code points) taken from allkeys.txt. Dirty work.
I added the French word éléphant.

code:

[...]

    li = ['ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße', \
          'möglich', 'nötig', 'üblich']
    li.insert(0, 'éléphant')
    print(li)
    r = sorted(li, key=c.tri)
    print(r)

[...]

output:

>c:\python32\pythonw -u "unicodecollation.py"

['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße', 'möglich', 'nötig', 'üblich']
['ähnlich', 'ÄHNLICH', 'éléphant', 'Maß', 'Maße', 'Masse', 'möglich', 'nötig', 'üblich']
>Exit code: 0
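
The collation code itself is elided above, so it cannot be reproduced here. As a rough stand-in, and explicitly not the Unicode Collation Algorithm, the Python 3 sketch below imitates its multi-level idea with the standard library only: compare base letters first, then accents, then case. The name rough_collation_key is made up for illustration, and the resulting order need not agree with the DUCET or with the output shown above.

```python
import unicodedata


def rough_collation_key(s):
    """Crude multi-level sort key (NOT the UCA): letters, accents, case."""
    decomposed = unicodedata.normalize("NFD", s)
    # Level 1: base letters only, accents removed, case folded.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    primary = base.casefold()
    # Level 2: accents count, case still ignored.
    secondary = decomposed.casefold()
    # Level 3: swapping case makes lowercase sort before uppercase.
    tertiary = s.swapcase()
    # The string itself is a last tie-breaker for a deterministic result.
    return (primary, secondary, tertiary, s)


li = ["éléphant", "ähnlich", "ÄHNLICH", "Maß", "Masse", "Maße",
      "möglich", "nötig", "üblich"]
r = sorted(li, key=rough_collation_key)
print(r)
```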


---

Why continue to waste time with this product,
with its ridiculous(?), absurd(?), ASCII-centric, non-standard,
buggy (definitely) Unicode implementation?

It has become useful only for pedagogical purposes.

jmf
