
trying to strip out non ascii.. or rather convert non ascii

Started by	bruce <badouglas@gmail.com>
First post	2013-10-26 16:11 -0400
Last post	2013-10-30 15:25 +0000
Articles	20 on this page of 42 — 14 participants

#58076

From	Michael Torrie <torriem@gmail.com>
Date	2013-10-30 11:54 -0600
Message-ID	<mailman.1821.1383156703.18130.python-list@python.org>
In reply to	#58058

On 10/30/2013 10:08 AM, wxjmfauth@gmail.com wrote:
> My comment had nothing to do with Python, it was a
> general comment. A diacritical mark just makes a letter
> a different letter; an "ï" and an "i" are "as
> different" as an "a" from a "z". A diacritical mark
> is more than a simple ornamentation.

That's nice, but you didn't actually read what Ned said (or the OP).
The OP doesn't care that "ï" and "i" are as different as "a" and "z".
For the purposes of his search he wants them treated as the same
letter.  A fuzzy search treats them all the same. For example, a
search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
fine, even though "o" and "ö" are different characters.  And lo and
behold Google actually does this!  Try it.  It's nice for those of us
who want to find something and our US keyboards don't have the right marks.

https://www.google.ca/search?q=godel+escher+bach

After all this nonsense, that's what the original poster is looking for
(I think... can't be sure since it's been so many days now).  Seems to
me a python module does this quite nicely:

https://pypi.python.org/pypi/Unidecode
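
To make the diacritic-insensitive search described above concrete, here is a minimal sketch that uses only the standard library's unicodedata module rather than Unidecode itself. The function names fold_diacritics and fuzzy_contains are made up for this illustration. The approach only handles characters that decompose into a base letter plus combining marks; as far as I understand it, Unidecode covers far more (for example transliterating non-Latin scripts), so treat this as a rough stand-in, not a replacement.

```python
import unicodedata


def fold_diacritics(text):
    """Drop combining marks so that 'ö' becomes 'o' (hypothetical helper)."""
    # NFKD splits precomposed characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # unicodedata.combining() is non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fuzzy_contains(haystack, needle):
    """Substring test that ignores diacritics and case (hypothetical helper)."""
    return fold_diacritics(needle).casefold() in fold_diacritics(haystack).casefold()


if __name__ == "__main__":
    print(fold_diacritics("Gödel, Escher, Bach"))
    print(fuzzy_contains("Gödel, Escher, Bach", "godel"))
```

Folding both the query and the indexed text is the important design choice here: the search then matches whether or not the user typed the marks.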

#58079

From	wxjmfauth@gmail.com
Date	2013-10-30 11:38 -0700
Message-ID	<78fce490-a583-4a0e-845e-73fec6bf705a@googlegroups.com>
In reply to	#58076

On Wednesday, 30 October 2013 18:54:05 UTC+1, Michael Torrie wrote:
> On 10/30/2013 10:08 AM, wxjmfauth@gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; an "ï" and an "i" are "as
> > different" as an "a" from a "z". A diacritical mark
> > is more than a simple ornamentation.
>
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï" and "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter.  A fuzzy search treats them all the same. For example, a
> search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
> fine, even though "o" and "ö" are different characters.  And lo and
> behold Google actually does this!  Try it.  It's nice for those of us
> who want to find something and our US keyboards don't have the right marks.
>
> https://www.google.ca/search?q=godel+escher+bach
>
> After all this nonsense, that's what the original poster is looking for
> (I think... can't be sure since it's been so many days now).  Seems to
> me a python module does this quite nicely:
>
> https://pypi.python.org/pypi/Unidecode


Ok. You are right. I recognize my mistake. Independently
of the original poster's task, I had not understood it that
way.

Let's say it depends on the context: for a general
search engine, it's good that diacritics are ignored.
For, let's say, a text processing system, it's good
to have only precise matches. That does not mean other
matching possibilities cannot exist.

jmf

#58141

From	Roy Smith <roy@panix.com>
Date	2013-10-30 19:28 -0400
Message-ID	<roy-445BFB.19284330102013@news.panix.com>
In reply to	#58076

In article <mailman.1821.1383156703.18130.python-list@python.org>,
 Michael Torrie <torriem@gmail.com> wrote:

> On 10/30/2013 10:08 AM, wxjmfauth@gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; an "ï" and an "i" are "as
> > different" as an "a" from a "z". A diacritical mark
> > is more than a simple ornamentation.
> 
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï" and "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter.  A fuzzy search treats them all the same.

That's one definition of fuzzy.  But, there's nothing that says you 
can't build a fuzzy matching algorithm which considers some mismatches 
to be worse than others.

For example, it's reasonable to consider any vowel (or string of vowels, 
for that matter) to be closer to another vowel than to a consonant.  A 
great example is the word, "bureaucrat".  As far as I'm concerned, it's 
spelled {b, vowels, r, vowels, c, r, a, t}.  It usually takes me three 
or four tries to get auto-correct to even recognize what I'm trying to 
type and fix it for me.

Likewise for pairs like {c, s}, {j, g}, {v, w}, and so on.

In that spirit, I would think that a, á, and â would all be considered 
more conservative replacements for each other than they would be for k, 
x, or z.

#58168

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-10-31 06:46 -0500
Message-ID	<mailman.1872.1383219887.18130.python-list@python.org>
In reply to	#58141

On 2013-10-30 19:28, Roy Smith wrote:
> For example, it's reasonable to consider any vowel (or string of
> vowels, for that matter) to be closer to another vowel than to a
> consonant.  A great example is the word, "bureaucrat".  As far as
> I'm concerned, it's spelled {b, vowels, r, vowels, c, r, a, t}.  It
> usually takes me three or four tries to get auto-correct to even
> recognize what I'm trying to type and fix it for me.

[glad I'm not the only one who has trouble spelling "bureaucrat"]

Steven D'Aprano wisely mentioned elsewhere in the thread that "The
right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors
and alternative spellings for any letter, not just those with
diacritics."

Often the Levenshtein distance is used for calculating closeness, and
the off-the-shelf algorithm assigns a cost of one per difference
(addition, change, or removal).  It doesn't sound like it would be
that hard[1] to assign varying costs based on what character was
added/changed/removed.  A diacritic might have a cost of N while a
similar character (vowel->vowel or consonant->consonant, or
consonant-cluster shift) might have a cost of 2N, and a totally
arbitrary character shift might have a cost of 3N (or higher).
Unfortunately, the Levenshtein algorithm is already O(M*N) slow and
can't be reasonably precalculated without knowing both strings, so
this just ends up heaping additional lookups/comparisons atop
already-slow code.

-tkc

[1]
http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications
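
The weighted variant described above can be sketched as follows. This is only an illustration under stated assumptions: the cost tiers N / 2N / 3N come from the post, but the cost of an insertion or deletion (set to 3N here), the vowel set, the choice to treat everything that is not a vowel as "consonant-like", and the names base_letter, substitution_cost and weighted_distance are all my own. It is the textbook O(M*N) dynamic programme with a pluggable substitution cost, so it does nothing about the speed concern.

```python
import unicodedata

VOWELS = set("aeiou")  # assumption: a deliberately small, English-centric set


def base_letter(ch):
    """Lowercase ch and drop combining marks, e.g. 'Ö' -> 'o'."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped or ch  # ch was itself a combining mark


def substitution_cost(a, b, n=1):
    """Tiered cost of replacing a with b: 0, N, 2N or 3N."""
    if a == b:
        return 0
    base_a, base_b = base_letter(a), base_letter(b)
    if base_a == base_b:
        return n  # same letter, differing only in diacritic or case
    if (base_a in VOWELS) == (base_b in VOWELS):
        return 2 * n  # vowel -> vowel or consonant-like -> consonant-like
    return 3 * n  # arbitrary change


def weighted_distance(s, t, n=1):
    """Levenshtein distance with tiered substitution costs."""
    indel = 3 * n  # assumption: additions/removals cost as much as the worst change
    previous = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, start=1):
        current = [i * indel]
        for j, b in enumerate(t, start=1):
            current.append(min(
                previous[j] + indel,                          # delete a
                current[j - 1] + indel,                       # insert b
                previous[j - 1] + substitution_cost(a, b, n)  # substitute
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    for pair in [("Godel", "Gödel"), ("cat", "cut"), ("cat", "cbt")]:
        print(pair, weighted_distance(*pair))
```

A candidate list can then be ranked by this distance, so that a missing diacritic is penalised less than a swapped vowel, which in turn is penalised less than an unrelated letter.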

#58137

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-10-30 17:56 -0400
Message-ID	<mailman.1858.1383170226.18130.python-list@python.org>
In reply to	#58058

On 10/30/2013 12:08 PM, wxjmfauth@gmail.com wrote:

>  From a unicode perspective.
> Unicode.org "knows", these chars are very important, that's
> the reason why they exist in two forms, precomposed and
> composed forms.

Only some chars have both forms. I believe the precomposed forms are 
partly a historical accident of what precomposed forms were in the 
various latin-1 sets.

-- 
Terry Jan Reedy
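
The two forms discussed above can be inspected with the standard library's unicodedata.normalize. The sketch below is only an illustration; the helper name forms is made up. It shows "ï" as a single precomposed code point and as "i" plus a combining diaeresis, and one letter ("q" with a diaeresis) for which, to my knowledge, no precomposed code point exists, so both normal forms keep two code points.

```python
import unicodedata


def forms(text):
    """Return (NFC, NFD) versions of text (hypothetical helper)."""
    return (unicodedata.normalize("NFC", text),
            unicodedata.normalize("NFD", text))


if __name__ == "__main__":
    precomposed = "\u00ef"    # ï as one code point
    decomposed = "i\u0308"    # i + COMBINING DIAERESIS
    q_diaeresis = "q\u0308"   # assumption: no precomposed form exists

    for label, text in [("precomposed ï", precomposed),
                        ("decomposed ï", decomposed),
                        ("q + diaeresis", q_diaeresis)]:
        nfc, nfd = forms(text)
        print(label, [hex(ord(c)) for c in nfc], [hex(ord(c)) for c in nfd])
```

Comparing strings after normalizing both sides to the same form avoids treating the two spellings of "ï" as different.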

#58152

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-10-31 07:10 +0000
Message-ID	<5272025a$0$29862$c3e8da3$5496439d@news.astraweb.com>
In reply to	#58012

On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:

>> The right solution to that is to treat it no differently from other
>> fuzzy
>> searches. A good search engine should be tolerant of spelling errors
>> and
>> alternative spellings for any letter, not just those with diacritics.
>> Ideally, a good search engine would successfully match all three of
>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>> of diacritics.
> 
> This is nonsense. The purpose of a diacritical mark is to make a
> letter a different letter. If a tool is supposed to match an ô, there is
> absolutely no reason to match something else.


I'm glad that you know so much better than Google, Bing, Yahoo, and other 
search engines. When I search for "mispealled" Google gives me:

    Showing results for misspelled
    Search instead for mispealled


But I see now that this is nonsense and there is *absolutely no reason* 
to match something other than the ecaxt wrods I typed.

Perhaps you should submit a bug report to Google:

"When I mistype a word, Google correctly gives me the search results I 
wanted, instead of the wrong results I didn't want."



-- 
Steven

#58153

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-31 07:23 +0000
Message-ID	<mailman.1867.1383204246.18130.python-list@python.org>
In reply to	#58152

On 31/10/2013 07:10, Steven D'Aprano wrote:
> On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
>
>>> The right solution to that is to treat it no differently from other
>>> fuzzy
>>> searches. A good search engine should be tolerant of spelling errors
>>> and
>>> alternative spellings for any letter, not just those with diacritics.
>>> Ideally, a good search engine would successfully match all three of
>>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>>> of diacritics.
>>
>> This is nonsense. The purpose of a diacritical mark is to make a
>> letter a different letter. If a tool is supposed to match an ô, there is
>> absolutely no reason to match something else.
>
>
> I'm glad that you know so much better than Google, Bing, Yahoo, and other
> search engines. When I search for "mispealled" Google gives me:
>
>      Showing results for misspelled
>      Search instead for mispealled
>
>
> But I see now that this is nonsense and there is *absolutely no reason*
> to match something other than the ecaxt wrods I typed.
>
> Perhaps you should submit a bug report to Google:
>
> "When I mistype a word, Google correctly gives me the search results I
> wanted, instead of the wrong results I didn't want."
>

I'm sorry Steven but you're completely out of your depth here.  Please 
bow down to the superior intellect of jmf, where jm is for Joseph McCarthy.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#58161

From	wxjmfauth@gmail.com
Date	2013-10-31 03:33 -0700
Message-ID	<4460346f-c715-42fb-8e94-e20b46f1bbf8@googlegroups.com>
In reply to	#58152

On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
> On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
>
> >> The right solution to that is to treat it no differently from other
> >> fuzzy
> >> searches. A good search engine should be tolerant of spelling errors
> >> and
> >> alternative spellings for any letter, not just those with diacritics.
> >> Ideally, a good search engine would successfully match all three of
> >> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> >> of diacritics.
> >
> > This is nonsense. The purpose of a diacritical mark is to make a
> > letter a different letter. If a tool is supposed to match an ô, there is
> > absolutely no reason to match something else.
>
> I'm glad that you know so much better than Google, Bing, Yahoo, and other
> search engines. When I search for "mispealled" Google gives me:
>
>     Showing results for misspelled
>     Search instead for mispealled
>
> But I see now that this is nonsense and there is *absolutely no reason*
> to match something other than the ecaxt wrods I typed.
>
> Perhaps you should submit a bug report to Google:
>
> "When I mistype a word, Google correctly gives me the search results I
> wanted, instead of the wrong results I didn't want."
>
> --
> Steven


As far as I know, I recognized my mistake. I had text
processing systems in mind more than search engines.

I can even tell you, I am really stupid. I wrote pure
Unicode software to sort French or German strings.

Pure Unicode == independent of any locale.

jmf

#58243

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-11-01 07:16 +0000
Message-ID	<52735554$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#58161

On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:

> On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:

>> I'm glad that you know so much better than Google, Bing, Yahoo, and
>> other
>> search engines. When I search for "mispealled" Google gives me:
[...]
> As far as I know, I recognized my mistake. I had more text processing
> systems in mind, than search engines.

Yes, you have, I acknowledge that now. I see now that at the time I made 
my response to you, you had already replied recognising your error. 
Unfortunately I had not seen that. So in that case, I withdraw my 
comments and apologize.

> I can even tell you, I am really stupid. I wrote pure Unicode software
> to sort French or German strings.
> 
> Pure unicode == independent from any locale.

Unfortunately it is not that simple. The same code point can have
different meanings in different languages, and should be treated
differently when sorting. Sorting by raw code point gives the right
order for very few European languages, and not even for English. A few
examples:

* Swedish ä is a distinct letter of the alphabet, appearing
  after z: "a b c z ä" is sorted according to Swedish rules.
  But in German ä is considered to be the letter 'a' plus an
  umlaut, and is collated after 'a': "a ä b c z" is sorted
  according to German rules.

* In German ö is considered to be a variant of o, equivalent
  to 'oe', while in Finnish ö is a distinct letter which
  cannot be expanded to 'oe', and which appears at the end
  of the alphabet.

* Similarly, in modern English æ is a ligature of ae, while in
  Danish and Norwegian it is a distinct letter of the alphabet
  appearing after z: in English dictionaries, "Æsir" will be
  found with other "A" words, often expanded to "Aesir", while
  in Norwegian it will be found after "Z" words.

* Most European languages convert uppercase I to lowercase i, 
  but Turkish has distinct letters for dotted and dotless I. 
  According to Turkish rules, lowercase(I) is ı and uppercase(i)
  is İ.

While it is true that the Unicode character set is independent of locale, 
for natural processing of characters, it isn't enough to just use Unicode.

-- 
Steven
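
The Swedish/German contrast above can be reproduced in a few lines. This is a sketch, not a collation implementation: the helper strip_marks and the tie-breaking key are my own, plain code-point order only happens to match the Swedish example, and the folded key only happens to match the German one. Real applications would normally reach for locale-aware collation (for example locale.strxfrm with a suitable locale installed, or an ICU binding) rather than hand-written keys.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks after canonical decomposition (hypothetical helper)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


letters = ["z", "ä", "b", "a", "c"]

# Plain sorted() compares code points: ä (U+00E4) lands after z (U+007A).
by_code_point = sorted(letters)

# Sort on the base letter first, then on the original string as a tie-breaker,
# so ä follows a directly.
german_like = sorted(letters, key=lambda s: (strip_marks(s), s))

if __name__ == "__main__":
    print(by_code_point)
    print(german_like)
    # str.lower() does not consult the locale, so Turkish dotless ı never appears.
    print("I".lower())
```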

#58247

From	wxjmfauth@gmail.com
Date	2013-11-01 02:00 -0700
Message-ID	<39f9588e-d60d-4e34-8b61-33de32a99d08@googlegroups.com>
In reply to	#58243

On Friday, 1 November 2013 08:16:36 UTC+1, Steven D'Aprano wrote:
> On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
>
> > On Thursday, 31 October 2013 08:10:18 UTC+1, Steven D'Aprano wrote:
>
> >> I'm glad that you know so much better than Google, Bing, Yahoo, and
> >> other
> >> search engines. When I search for "mispealled" Google gives me:
> [...]
> > As far as I know, I recognized my mistake. I had more text processing
> > systems in mind, than search engines.
>
> Yes, you have, I acknowledge that now. I see now that at the time I made
> my response to you, you had already replied recognising your error.
> Unfortunately I had not seen that. So in that case, I withdraw my
> comments and apologize.
>
> > I can even tell you, I am really stupid. I wrote pure Unicode software
> > to sort French or German strings.
> >
> > Pure unicode == independent from any locale.
>
> Unfortunately it is not that simple. The same code point can have
> different meanings in different languages, and should be treated
> differently when sorting. Sorting by raw code point gives the right
> order for very few European languages, and not even for English. A few
> examples:
>
> * Swedish ä is a distinct letter of the alphabet, appearing
>   after z: "a b c z ä" is sorted according to Swedish rules.
>   But in German ä is considered to be the letter 'a' plus an
>   umlaut, and is collated after 'a': "a ä b c z" is sorted
>   according to German rules.
>
> * In German ö is considered to be a variant of o, equivalent
>   to 'oe', while in Finnish ö is a distinct letter which
>   cannot be expanded to 'oe', and which appears at the end
>   of the alphabet.
>
> * Similarly, in modern English æ is a ligature of ae, while in
>   Danish and Norwegian it is a distinct letter of the alphabet
>   appearing after z: in English dictionaries, "Æsir" will be
>   found with other "A" words, often expanded to "Aesir", while
>   in Norwegian it will be found after "Z" words.
>
> * Most European languages convert uppercase I to lowercase i,
>   but Turkish has distinct letters for dotted and dotless I.
>   According to Turkish rules, lowercase(I) is ı and uppercase(i)
>   is İ.
>
> While it is true that the Unicode character set is independent of locale,
> for natural processing of characters, it isn't enough to just use Unicode.
>
> --
> Steven


I'm aware of all the points you gave. That's why
I wrote "French or German strings".

The hard task is not on the side of Unicode or sorting;
it is the creation of the key(s) used for sorting.

E.g., cote, côte, coté, côté. Not all French editors
sort these words in the same way (because of the diacritics).

jmf

PS A *real* case to test the FSR.
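
To illustrate the point about sort keys with the four words above, here is a rough sketch of two possible keys. Both compare the unaccented letters first; they differ only in whether the accents are then compared left to right or right to left. The names split_accents, forward_key and backward_key are invented for this example, and I make no claim about which convention any particular publisher or tool actually follows.

```python
import unicodedata


def split_accents(word):
    """Return (base letters, list of combining marks per letter)."""
    base, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and base:
            marks[-1] += ch      # attach the mark to the preceding letter
        else:
            base.append(ch)
            marks.append("")
    return "".join(base), marks


def forward_key(word):
    """Base letters first, then accents compared left to right."""
    base, marks = split_accents(word)
    return base, marks


def backward_key(word):
    """Base letters first, then accents compared right to left."""
    base, marks = split_accents(word)
    return base, marks[::-1]


if __name__ == "__main__":
    words = ["côté", "cote", "coté", "côte"]
    print(sorted(words, key=forward_key))
    print(sorted(words, key=backward_key))
```

The two keys produce different orders for the same four words, which is exactly the kind of disagreement mentioned above: the sorting algorithm is the same, only the key changes.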

#58250

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-11-01 09:18 +0000
Message-ID	<mailman.1913.1383297511.18130.python-list@python.org>
In reply to	#58247

On 01/11/2013 09:00, wxjmfauth@gmail.com wrote:

I'll ask again, would you please read, digest and action this 
https://wiki.python.org/moin/GoogleGroupsPython

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#57881

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-10-29 05:22 +0000
Message-ID	<526f4612$0$6512$c3e8da3$5496439d@news.astraweb.com>
In reply to	#57823

On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:

> And of course, logically, they are very, very badly handled with the
> Flexible String Representation.

I'm reminded of Cato the Elder, the Roman senator who would end every 
speech, no matter the topic, with "Ceterum censeo Carthaginem esse 
delendam" ("Furthermore, I consider that Carthage must be destroyed").

But at least he had the good grace to present that as an opinion, instead 
of repeating a falsehood as if it were a fact.

-- 
Steven

#57923

From	wxjmfauth@gmail.com
Date	2013-10-29 08:38 -0700
Message-ID	<63fa9fcd-6445-41ee-8873-e1ee046e2031@googlegroups.com>
In reply to	#57881

On Tuesday, 29 October 2013 06:22:27 UTC+1, Steven D'Aprano wrote:
> On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
>
> > And of course, logically, they are very, very badly handled with the
> > Flexible String Representation.
>
> I'm reminded of Cato the Elder, the Roman senator who would end every
> speech, no matter the topic, with "Ceterum censeo Carthaginem esse
> delendam" ("Furthermore, I consider that Carthage must be destroyed").
>
> But at least he had the good grace to present that as an opinion, instead
> of repeating a falsehood as if it were a fact.
>
> --
> Steven

------

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12621293837694095
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.26411553466961735

If you understand character encoding, Unicode,
and what the FSR does, it is child's play to produce a gazillion
examples like this.

(Notice the usage of a Dutch character instead of a boring €).

jmf

#57924

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-10-29 10:52 -0500
Message-ID	<mailman.1761.1383061878.18130.python-list@python.org>
In reply to	#57923

On 2013-10-29 08:38, wxjmfauth@gmail.com wrote:
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")  
> 0.12621293837694095
> >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")  
> 0.26411553466961735

That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."

Is this not what you intended to demonstrate?  'cuz that sounds
like a pretty awesome optimization to me.

-tkc
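
The reading above rests on the Flexible String Representation (PEP 393, CPython 3.3 and later) storing each string at the narrowest width that fits its widest character. A small sketch that makes the storage width visible; bytes_per_char is a made-up helper, and the results are a CPython implementation detail rather than a language guarantee.

```python
import sys


def bytes_per_char(sample_char, length=1000):
    """Estimate per-character storage by differencing two string sizes."""
    short = sample_char * length
    longer = sample_char * (2 * length)
    # The fixed object header cancels out, leaving only the character data.
    return (sys.getsizeof(longer) - sys.getsizeof(short)) / length


if __name__ == "__main__":
    for label, ch in [("ASCII 'd'", "d"),
                      ("BMP 'ĳ' (U+0133)", "\u0133"),
                      ("non-BMP (U+1F600)", "\U0001F600")]:
        print(label, bytes_per_char(ch))
```

On CPython this reports one, two and four bytes per character respectively, which is why 'hundred' and 'hundreĳ' are stored differently even though they have the same length.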

#57961

From	wxjmfauth@gmail.com
Date	2013-10-29 12:16 -0700
Message-ID	<9319e982-4628-4f32-b5cc-60eadca121fc@googlegroups.com>
In reply to	#57924

On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote:
> On 2013-10-29 08:38, wxjmfauth@gmail.com wrote:
> > >>> import timeit
> > >>> timeit.timeit("a = 'hundred'; 'x' in a")
> > 0.12621293837694095
> > >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
> > 0.26411553466961735
>
> That reads to me as "If things were purely UCS4 internally, Python
> would normally take 0.264... seconds to execute this test, but core
> devs managed to optimize a particular (lower 127 ASCII characters
> only) case so that it runs in less than half the time."
>
> Is this not what you intended to demonstrate?  'cuz that sounds
> like a pretty awesome optimization to me.
>
> -tkc

--------

That's very naive. In fact, what happens is just the opposite.
The "best case" with the FSR is worse than the "worst case"
without the FSR.

And that is without counting the time this poor
Python spends switching from one internal
representation to another, not to mention the fact
that this has to be tested every time.
The more Unicode manipulations one applies, the more time
it demands.

Two tasks that come to mind: re and normalization.
It's very interesting to observe what happens when one
normalizes Latin text and polytonic Greek text, both with
plenty of diacritics.

----

Something different, based on my previous example.

What is a European user supposed to think when she/he
sees that she/he can be "penalized" by such an amount,
simply for using non-ASCII characters in a product
which is supposed to be "Unicode compliant"?

jmf

#57964

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-29 19:54 +0000
Message-ID	<mailman.1773.1383076460.18130.python-list@python.org>
In reply to	#57961

On 29/10/2013 19:16, wxjmfauth@gmail.com wrote:
> On Tuesday, 29 October 2013 16:52:49 UTC+1, Tim Chase wrote:
>> On 2013-10-29 08:38, wxjmfauth@gmail.com wrote:
>>> >>> import timeit
>>> >>> timeit.timeit("a = 'hundred'; 'x' in a")
>>> 0.12621293837694095
>>> >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
>>> 0.26411553466961735
>>
>> That reads to me as "If things were purely UCS4 internally, Python
>> would normally take 0.264... seconds to execute this test, but core
>> devs managed to optimize a particular (lower 127 ASCII characters
>> only) case so that it runs in less than half the time."
>>
>> Is this not what you intended to demonstrate?  'cuz that sounds
>> like a pretty awesome optimization to me.
>>
>> -tkc
>
> --------
>
> That's very naive. In fact, what happens is just the opposite.
> The "best case" with the FSR is worse than the "worst case"
> without the FSR.
>
> And that is without counting the time this poor
> Python spends switching from one internal
> representation to another, not to mention the fact
> that this has to be tested every time.
> The more Unicode manipulations one applies, the more time
> it demands.
>
> Two tasks that come to mind: re and normalization.
> It's very interesting to observe what happens when one
> normalizes Latin text and polytonic Greek text, both with
> plenty of diacritics.
>
> ----
>
> Something different, based on my previous example.
>
> What is a European user supposed to think when she/he
> sees that she/he can be "penalized" by such an amount,
> simply for using non-ASCII characters in a product
> which is supposed to be "Unicode compliant"?
>
> jmf
>

Please provide hard evidence to support your claims or stop posting this 
ridiculous nonsense.  Give us real world problems that can be reported 
on the bug tracker, investigated and resolved.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#57988

From	Piet van Oostrum <piet@vanoostrum.org>
Date	2013-10-29 21:33 -0400
Message-ID	<m27gcv1r0x.fsf@cochabamba.vanoostrum.org>
In reply to	#57964

Mark Lawrence <breamoreboy@yahoo.co.uk> writes:

> Please provide hard evidence to support your claims or stop posting this
> ridiculous nonsense.  Give us real world problems that can be reported
> on the bug tracker, investigated and resolved.

I think it is much better just to ignore this nonsense instead of asking for evidence you know you will never get.
-- 
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]

#58014

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-30 09:19 +0000
Message-ID	<mailman.1797.1383124791.18130.python-list@python.org>
In reply to	#57988

On 30/10/2013 01:33, Piet van Oostrum wrote:
> Mark Lawrence <breamoreboy@yahoo.co.uk> writes:
>
>> Please provide hard evidence to support your claims or stop posting this
>> ridiculous nonsense.  Give us real world problems that can be reported
>> on the bug tracker, investigated and resolved.
>
> I think it is much better just to ignore this nonsense instead of asking for evidence you know you will never get.
>

A good point, but note he doesn't have the courage to reply to me but 
always to others.  I guess he spends a lot of time clucking, not because 
he's run out of supplies, but because he's simply a chicken.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#57925

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-29 15:56 +0000
Message-ID	<mailman.1762.1383062192.18130.python-list@python.org>
In reply to	#57923

On 29/10/2013 15:38, wxjmfauth@gmail.com wrote:

It's okay folks I'll snip all the double spaced google crap as the 
poster is clearly too bone idle to follow the instructions that have 
been repeatedly posted here asking for people not to post double spaced 
google crap.

> On Tuesday, 29 October 2013 06:22:27 UTC+1, Steven D'Aprano wrote:
>> On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
>>> And of course, logically, they are very, very badly handled with the
>>> Flexible String Representation.
>>
>> I'm reminded of Cato the Elder, the Roman senator who would end every
>> speech, no matter the topic, with "Ceterum censeo Carthaginem esse
>> delendam" ("Furthermore, I consider that Carthage must be destroyed").
>>
>> But at least he had the good grace to present that as an opinion, instead
>> of repeating a falsehood as if it were a fact.
>>
>> --
>>
>> Steven
>
> ------
>
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12621293837694095
> >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
> 0.26411553466961735
>
> If you understand character encoding, Unicode,
> and what the FSR does, it is child's play to produce a gazillion
> examples like this.
>
> (Notice the usage of a Dutch character instead of a boring €).
>
> jmf
>

You've stated above that logically unicode is badly handled by the fsr. 
  You then provide a trivial timing example.  WTF???

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#57995

From	Chris Angelico <rosuav@gmail.com>
Date	2013-10-30 13:17 +1100
Message-ID	<mailman.1787.1383099445.18130.python-list@python.org>
In reply to	#57923

On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> You've stated above that logically unicode is badly handled by the fsr.  You
> then provide a trivial timing example.  WTF???

His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas instead of even timings all along. But the FSR actually has some
distinct benefits even in the areas he's citing - watch this:

>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreĳ'; 'ģ' in a")
0.3582399439035271

The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:

>>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
10.083378194714726
>>> timeit.timeit("a = 'hundreĳ'*1000; 'x' in a")
18.656413035735
>>> timeit.timeit("a = 'hundreĳ'*1000; 'ģ' in a")
18.436268855399135
>>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
2.8308718007456264

Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!

ChrisA
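
For anyone who wants to repeat the comparison above, here is a small harness that times all four haystack/needle combinations in one go. It is a sketch with made-up names (time_membership, run_matrix), and unlike the one-liners in the thread it builds the haystack once, outside the timed statement, so the absolute numbers will not be comparable to those posted. It needs Python 3.5 or later for the globals argument of timeit.timeit.

```python
import timeit

# One ASCII-only string and one containing a character beyond Latin-1.
HAYSTACKS = {"ascii": "hundred", "bmp": "hundre\u0133"}
NEEDLES = {"ascii": "x", "bmp": "\u0123"}


def time_membership(haystack, needle, number=100000):
    """Time `needle in haystack`, excluding string construction."""
    return timeit.timeit(
        "needle in haystack",
        globals={"needle": needle, "haystack": haystack},
        number=number,
    )


def run_matrix(repeat_factor=1, number=100000):
    """Return {(haystack kind, needle kind): seconds} for all combinations."""
    results = {}
    for h_kind, haystack in HAYSTACKS.items():
        for n_kind, needle in NEEDLES.items():
            results[(h_kind, n_kind)] = time_membership(
                haystack * repeat_factor, needle, number)
    return results


if __name__ == "__main__":
    for factor in (1, 1000):
        print("repeat factor", factor)
        for combo, seconds in sorted(run_matrix(factor).items()):
            print("  ", combo, round(seconds, 4))
```

Timings vary by machine and Python version, so compare the rows within a single run rather than against the figures in the thread.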
