Groups > comp.lang.python > #84368 > unrolled thread
| Started by | Peter Otten <__peter__@web.de> |
|---|---|
| First post | 2015-01-23 18:53 +0100 |
| Last post | 2015-01-24 02:34 -0800 |
| Articles | 4 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Case-insensitive sorting of strings (Python newbie) Peter Otten <__peter__@web.de> - 2015-01-23 18:53 +0100
Re: Case-insensitive sorting of strings (Python newbie) Marko Rauhamaa <marko@pacujo.net> - 2015-01-23 21:14 +0200
Re: Case-insensitive sorting of strings (Python newbie) Chris Angelico <rosuav@gmail.com> - 2015-01-24 06:56 +1100
Re: Case-insensitive sorting of strings (Python newbie) wxjmfauth@gmail.com - 2015-01-24 02:34 -0800
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2015-01-23 18:53 +0100 |
| Subject | Re: Case-insensitive sorting of strings (Python newbie) |
| Message-ID | <mailman.18046.1422035592.18130.python-list@python.org> |
John Sampson wrote:
> I notice that the string method 'lower' seems to convert some strings
> (input from a text file) to Unicode but not others.
> This messes up sorting if it is used on arguments of 'sorted' since
> Unicode strings come before ordinary ones.
>
> Is there a better way of case-insensitive sorting of strings in a list?
> Is it necessary to convert strings read from a plaintext file
> to Unicode? If so, how? This is Python 2.7.8.
The standard recommendation is to convert bytes to unicode as early as
possible and only manipulate unicode. This is more likely to give correct
results when slicing or converting a string.
$ cat tmp.txt
ähnlich
üblich
nötig
möglich
Maß
Maße
Masse
ÄHNLICH
$ python
Python 2.7.6 (default, Mar 22 2014, 22:59:56)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> for line in open("tmp.txt"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH Ähnlich
Now the same with unicode. To read text with a specific encoding use either
codecs.open() or io.open() instead of the built-in (replace utf-8 with your
actual encoding):
>>> import io
>>> for line in io.open("tmp.txt", encoding="utf-8"):
...     line = line.strip()
...     print line, line.lower()
...
ähnlich ähnlich
üblich üblich
nötig nötig
möglich möglich
Maß maß
Maße maße
Masse masse
ÄHNLICH ähnlich
Unfortunately this will not give the order that you (or a German speaker in
the example below) will probably expect:
>>> print "".join(sorted(io.open("tmp.txt"), key=unicode.lower))
Masse
Maß
Maße
möglich
nötig
ähnlich
ÄHNLICH
üblich
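A minimal Python 3 sketch of why this happens, assuming the same eight words are held in a list instead of read from tmp.txt: sorting by lower() compares code points, and the umlauts have higher code points than any ASCII letter. The helper name `first_codepoint` is made up for this illustration.

```python
# Python 3: str is already Unicode, so str.lower plays the role of unicode.lower.
words = ["ähnlich", "üblich", "nötig", "möglich",
         "Maß", "Maße", "Masse", "ÄHNLICH"]

def first_codepoint(word):
    """Return the code point of the first character after lowercasing."""
    return hex(ord(word.lower()[0]))

# The sort key is the lowercased string, compared code point by code point.
by_lower = sorted(words, key=str.lower)
print(by_lower)

# 'n' is U+006E, 'ä' is U+00E4, 'ü' is U+00FC: every umlaut sorts after 'z'.
print(first_codepoint("nötig"), first_codepoint("ähnlich"), first_codepoint("üblich"))
```

The result is consistent, but it is an ordering by code point, not by any language's alphabet.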
For case-insensitive sorting you get better results with locale.strxfrm() --
but in Python 2 it does not accept unicode objects containing non-ASCII
characters:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
'de_DE.UTF-8'
>>> print "".join(sorted(io.open("tmp.txt"), key=locale.strxfrm))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
As a workaround you can sort the undecoded byte strings, read with the
built-in open():
>>> print "".join(sorted(open("tmp.txt"), key=locale.strxfrm))
ähnlich
ÄHNLICH
Maß
Masse
Maße
möglich
nötig
üblich
You should still convert the result to unicode if you want to do further
processing in Python.
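For comparison, a hedged Python 3 sketch of the same idea. In Python 3, locale.strxfrm() takes str directly, so the byte-string workaround is not needed. The helper name `locale_sorted` is hypothetical, and the locale name in the usage note is an assumption: it only works if that locale is installed on the system.

```python
import locale

def locale_sorted(words, locale_name=""):
    """Sort str objects using the collation rules of the given locale.

    locale_name="" means "take the locale from the environment".
    The previous LC_COLLATE setting is restored afterwards.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        # Raises locale.Error if the requested locale is not available.
        locale.setlocale(locale.LC_COLLATE, locale_name)
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

Usage would look like `locale_sorted(words, "de_DE.UTF-8")`. The resulting order depends on the platform's collation tables, so it may differ between systems.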
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-01-23 21:14 +0200 |
| Message-ID | <873871fgxk.fsf@elektro.pacujo.net> |
| In reply to | #84368 |
Peter Otten <__peter__@web.de>:
> The standard recommendation is to convert bytes to unicode as early as
> possible and only manipulate unicode.
Unicode doesn't get you off the hook (as you explain later in your
post). Upper/lowercase conversion as well as collation order are ambiguous.
Python, even with decent locale support, can't be expected to do it all for you.
Well, if Python can't, then who can? Probably nobody in the world, not
generically, anyway.
Example:
>>> print("re\u0301sume\u0301")
résumé
>>> print("r\u00e9sum\u00e9")
résumé
>>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
False
>>> print("\ufb01nd")
find
>>> print("find")
find
>>> print("\ufb01nd" == "find")
False
If equality can't be determined, words really can't be sorted.
Marko
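A small Python 3 sketch, using only the standard unicodedata module, that shows why the strings in the example compare unequal: they look the same but consist of different code point sequences. The helper name `describe` is made up for this illustration.

```python
import unicodedata

def describe(s):
    """Return the Unicode character name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]

decomposed = "re\u0301sume\u0301"   # base letters followed by combining accents
precomposed = "r\u00e9sum\u00e9"    # single precomposed characters

# Same rendering, different lengths: 8 code points versus 6.
print(len(decomposed), len(precomposed))
print(describe(decomposed)[:3])
print(describe(precomposed)[:2])

# The ligature is one code point of its own, not 'f' followed by 'i'.
print(describe("\ufb01"))
```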
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-01-24 06:56 +1100 |
| Message-ID | <mailman.18057.1422042982.18130.python-list@python.org> |
| In reply to | #84387 |
On Sat, Jan 24, 2015 at 6:14 AM, Marko Rauhamaa <marko@pacujo.net> wrote:
> Well, if Python can't, then who can? Probably nobody in the world, not
> generically, anyway.
>
> Example:
>
> >>> print("re\u0301sume\u0301")
> résumé
> >>> print("r\u00e9sum\u00e9")
> résumé
> >>> print("re\u0301sume\u0301" == "r\u00e9sum\u00e9")
> False
> >>> print("\ufb01nd")
> find
> >>> print("find")
> find
> >>> print("\ufb01nd" == "find")
> False
>
> If equality can't be determined, words really can't be sorted.
Ah, that's a bit easier to deal with. Just use Unicode normalization.
>>> import unicodedata
>>> print(unicodedata.normalize("NFC","re\u0301sume\u0301") == unicodedata.normalize("NFC","r\u00e9sum\u00e9"))
True
It's a bit verbose, but if you're doing a lot of comparisons, you
probably want to make a key-function that folds together everything
that you want to be treated the same way, for instance:
def key(s):
    """Normalize a Unicode string for comparison purposes.

    Composes, case-folds, and trims excess spaces.
    """
    return unicodedata.normalize("NFC", s).strip().casefold()
Then it's much tidier:
>>> print(key("re\u0301sume\u0301") == key("r\u00e9sum\u00e9"))
True
>>> print(key("\ufb01nd") == key("find"))
True
You may want to go further, too; for search comparisons, you'll want
to use NFKC normalization, and probably translate all strings of
Unicode whitespace into single U+0020s, or completely strip out
zero-width non-breaking spaces (and maybe zero-width breaking spaces,
too), etc, etc. It all depends on what you mean by "equality". But
certainly a basic NFC or NFD normalization is safe for general work.
ChrisA
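One possible reading of the "go further" suggestion above, as a Python 3 sketch. The function name `search_key` and the exact choice of folding steps are assumptions made for illustration. What counts as "equal" depends on the application.

```python
import unicodedata

# Zero-width space (U+200B) and zero-width no-break space (U+FEFF).
# Mapping a code point to None makes str.translate() delete it.
ZERO_WIDTH = {0x200B: None, 0xFEFF: None}

def search_key(s):
    """Fold a string aggressively for search-style comparison."""
    s = unicodedata.normalize("NFKC", s)  # compatibility composition
    s = s.translate(ZERO_WIDTH)           # drop zero-width characters
    s = " ".join(s.split())               # runs of whitespace -> one U+0020
    s = s.casefold()                      # caseless form, e.g. "ß" -> "ss"
    # Case folding can change characters, so normalize once more.
    return unicodedata.normalize("NFKC", s)

print(search_key("\ufb01nd") == search_key("find"))
print(search_key("re\u0301sume\u0301") == search_key("R\u00c9SUM\u00c9"))
print(search_key("  two\u00a0\u2003words \n"))
```

A key like this is meant for matching and lookup only. It discards distinctions (case, ligatures, spacing) that should be kept in the stored text.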
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2015-01-24 02:34 -0800 |
| Message-ID | <2260ad2f-6581-4c2f-896f-b2c3d7a27ebc@googlegroups.com> |
| In reply to | #84368 |
On Friday, 23 January 2015 at 18:54:11 UTC+1, Peter Otten wrote:
> John Sampson wrote:
>
> > I notice that the string method 'lower' seems to convert some strings
> > (input from a text file) to Unicode but not others.
> > This messes up sorting if it is used on arguments of 'sorted' since
> > Unicode strings come before ordinary ones.
> >
> > Is there a better way of case-insensitive sorting of strings in a list?
> > Is it necessary to convert strings read from a plaintext file
> > to Unicode? If so, how? This is Python 2.7.8.
>
> The standard recommendation is to convert bytes to unicode as early as
> possible and only manipulate unicode. This is more likely to give correct
> results when slicing or converting a string.
-------
Hard drive archeology. Python 2 and Python 3.
One way (among others) is to use the Unicode Collation Algorithm
with the Default Unicode Collation Element Table (DUCET).
Here it is in action with a reduced character set (Latin only,
more than roughly 1000 code points) taken from allkeys.txt. Dirty work.
I added the French word éléphant.
code:
[...]
li = ['ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße', \
'möglich', 'nötig', 'üblich']
li.insert(0, 'éléphant')
print(li)
r = sorted(li, key=c.tri)
print(r)
[...]
output:
>c:\python32\pythonw -u "unicodecollation.py"
['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße', 'möglich', 'nötig', 'üblich']
['ähnlich', 'ÄHNLICH', 'éléphant', 'Maß', 'Maße', 'Masse', 'möglich', 'nötig', 'üblich']
>Exit code: 0
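The collation code itself is elided above, so the following is not that implementation and not the Unicode Collation Algorithm. It is only a crude standard-library approximation of a primary-strength comparison in Python 3: accents and case are ignored, and ties are left in input order. The function name `primary_key` is made up for this sketch.

```python
import unicodedata

def primary_key(word):
    """Approximate a primary-level sort key: base letters only, caseless."""
    decomposed = unicodedata.normalize("NFD", word)
    # Drop combining marks (accents, diaeresis) left over after decomposition.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()  # also turns "ß" into "ss"

li = ['éléphant', 'ähnlich', 'ÄHNLICH', 'Maß', 'Masse', 'Maße',
      'möglich', 'nötig', 'üblich']

# sorted() is stable, so words with equal keys keep their input order.
r = sorted(li, key=primary_key)
print(r)
```

Because 'Masse' and 'Maße' get the same key here, their relative order is decided by the input list alone. A real UCA implementation breaks such ties at the secondary and tertiary levels, which is why the output above places 'Maße' before 'Masse'.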
---
Why continue to waste time with this product?
With its ridiculous(?), absurd(?), ASCII-centric, non-standard,
buggy (definitely) Unicode implementation?
It has become useful only for pedagogical purposes.
jmf