Re: trying to strip out non ascii.. or rather convert non ascii

Newsgroups comp.lang.python
Date 2013-11-01 02:00 -0700
References (6 earlier) <526f46a2$0$6512$c3e8da3$5496439d@news.astraweb.com> <e018a4c6-e7a5-4356-8929-e26a3fdcb75d@googlegroups.com> <5272025a$0$29862$c3e8da3$5496439d@news.astraweb.com> <4460346f-c715-42fb-8e94-e20b46f1bbf8@googlegroups.com> <52735554$0$29972$c3e8da3$5496439d@news.astraweb.com>
Message-ID <39f9588e-d60d-4e34-8b61-33de32a99d08@googlegroups.com> (permalink)
Subject Re: trying to strip out non ascii.. or rather convert non ascii
From wxjmfauth@gmail.com


On Friday, 1 November 2013 08:16:36 UTC+1, Steven D'Aprano wrote:
> On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
> 
> > On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
> 
> >> I'm glad that you know so much better than Google, Bing, Yahoo, and
> >> other search engines. When I search for "mispealled" Google gives me:
> 
> [...]
> 
> > As far as I know, I recognized my mistake. I had more text processing
> > systems in mind, than search engines.
> 
> Yes, you have, I acknowledge that now. I see now that at the time I made
> my response to you, you had already replied recognising your error.
> Unfortunately I had not seen that. So in that case, I withdraw my
> comments and apologize.
> 
> > I can even tell you, I am really stupid. I wrote pure Unicode software
> > to sort French or German strings.
> > 
> > Pure unicode == independent from any locale.
> 
> Unfortunately it is not that simple. The same code point can have
> different meanings in different languages, and should be treated
> differently when sorting. The natural Unicode sort order satisfies very
> few European languages, including English. A few examples:
> 
> * Swedish ä is a distinct letter of the alphabet, appearing
>   after z: "a b c z ä" is sorted according to Swedish rules.
>   But in German ä is considered to be the letter 'a' plus an
>   umlaut, and is collated after 'a': "a ä b c z" is sorted
>   according to German rules.
> 
> * In German ö is considered to be a variant of o, equivalent
>   to 'oe', while in Finnish ö is a distinct letter which
>   cannot be expanded to 'oe', and which appears at the end
>   of the alphabet.
> 
> * Similarly, in modern English æ is a ligature of ae, while in
>   Danish and Norwegian it is a distinct letter of the alphabet
>   appearing after z: in English dictionaries, "Æsir" will be
>   found with other "A" words, often expanded to "Aesir", while
>   in Norwegian it will be found after "Z" words.
> 
> * Most European languages convert uppercase I to lowercase i,
>   but Turkish has distinct letters for dotted and dotless I.
>   According to Turkish rules, lowercase(I) is ı and uppercase(i)
>   is İ.
> 
> While it is true that the Unicode character set is independent of locale,
> for natural processing of characters, it isn't enough to just use Unicode.
> 
> -- 
> Steven
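
A minimal sketch of two of the quoted points, assuming Python 3 and only the standard library. The helpers `german_key` and `turkish_lower` are hypothetical names made up for this illustration, not library functions, and they cover only the tiny cases shown here, not full German or Turkish rules.

```python
import unicodedata

# "\u00e4" is ä as a single precomposed code point (U+00E4).
words = ["z", "\u00e4", "a", "c", "b"]

# Plain code point order: ä (U+00E4) lands after z (U+007A).
codepoint_order = sorted(words)


def german_key(word):
    """Hypothetical key: compare base letters first, accents only as a tie-break."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, word)


# With the key above, ä collates directly after a.
german_order = sorted(words, key=german_key)

# str.lower() is locale-independent, so "I" always becomes "i".
# Hypothetical Turkish-aware lowercasing: remap the two special letters first.
TURKISH_MAP = str.maketrans({"I": "\u0131", "\u0130": "i"})  # I -> ı, İ -> i


def turkish_lower(text):
    return text.translate(TURKISH_MAP).lower()


print(codepoint_order)
print(german_order)
print("ISPARTA".lower(), turkish_lower("ISPARTA"))
```

The same list of code points gets two different orders and two different lowercase forms depending on which language's rules the key encodes, which is the point being made in the quoted text.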


I'm aware of all the points you gave. That's why
I wrote "French or German strings".

The hard part is not Unicode or the sorting itself,
it is creating the key(s) used for sorting.

E.g. cote, côte, coté, côté. French publishers do not all
sort these words in the same way (because of the diacritics).
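
To make the point about keys concrete, here is a minimal sketch, assuming Python 3 and only the standard library. The function names (`split_accents`, `key_forward`, `key_backward`) are hypothetical, and the keys ignore case, ligatures and everything else a real collation has to handle. Both keys compare the unaccented letters first; they differ only in whether accents are then compared from the start of the word or from the end, the latter being the convention some French dictionaries traditionally follow.

```python
import unicodedata


def split_accents(word):
    """Return (base letters, tuple of accent marks per letter)."""
    bases, marks = [], []
    # Normalize to NFC first so each accented letter is one character.
    for ch in unicodedata.normalize("NFC", word):
        decomposed = unicodedata.normalize("NFD", ch)
        bases.append(decomposed[0])   # the bare letter
        marks.append(decomposed[1:])  # its combining marks, possibly ""
    return "".join(bases), tuple(marks)


def key_forward(word):
    """Accents break ties left to right."""
    base, marks = split_accents(word)
    return (base, marks)


def key_backward(word):
    """Accents break ties right to left (last accent is most significant)."""
    base, marks = split_accents(word)
    return (base, marks[::-1])


words = ["côté", "cote", "coté", "côte"]

forward = sorted(words, key=key_forward)
backward = sorted(words, key=key_backward)

print(forward)   # cote, coté, côte, côté
print(backward)  # cote, côte, coté, côté
```

The sorting call is identical in both cases; only the key changes, and with it the resulting order.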

jmf

PS A *real* case to test the FSR.
