Groups > comp.lang.python > #57656 > unrolled thread
| Started by | bruce <badouglas@gmail.com> |
|---|---|
| First post | 2013-10-26 16:11 -0400 |
| Last post | 2013-10-30 15:25 +0000 |
| Articles | 20 on this page of 42 — 14 participants |
trying to strip out non ascii.. or rather convert non ascii bruce <badouglas@gmail.com> - 2013-10-26 16:11 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-26 22:24 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-10-26 20:51 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-26 21:11 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-27 02:05 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Chris Angelico <rosuav@gmail.com> - 2013-10-27 13:15 +1100
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-27 09:21 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-26 20:41 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-26 21:54 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-26 21:17 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Nobody <nobody@nowhere.com> - 2013-10-27 03:21 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-28 07:01 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-28 14:13 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-28 09:23 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-29 05:24 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 01:49 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Ned Batchelder <ned@nedbatchelder.com> - 2013-10-30 08:44 -0400
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 09:08 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 16:24 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Ned Batchelder <ned@nedbatchelder.com> - 2013-10-30 13:10 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Michael Torrie <torriem@gmail.com> - 2013-10-30 11:54 -0600
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 11:38 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-30 19:28 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-31 06:46 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Terry Reedy <tjreedy@udel.edu> - 2013-10-30 17:56 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-31 07:10 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-31 07:23 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-31 03:33 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-01 07:16 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-11-01 02:00 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-01 09:18 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-29 05:22 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-29 08:38 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-29 10:52 -0500
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-29 12:16 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-29 19:54 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Piet van Oostrum <piet@vanoostrum.org> - 2013-10-29 21:33 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 09:19 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-29 15:56 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Chris Angelico <rosuav@gmail.com> - 2013-10-30 13:17 +1100
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 01:13 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 15:25 +0000
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2013-10-30 11:54 -0600 |
| Message-ID | <mailman.1821.1383156703.18130.python-list@python.org> |
| In reply to | #58058 |
On 10/30/2013 10:08 AM, wxjmfauth@gmail.com wrote:
> My comment had nothing to do with Python, it was a
> general comment. A diacritical mark just makes a letter
> a different letter; a "ï " and a "i" are "as
> diferent" as a "a" from a "z". A diacritical mark
> is more than a simple ornementation.
That's nice, but you didn't actually read what Ned said (or the OP).
The OP doesn't care that "ï " and a "i" are as different as "a" and "z".
For the purposes of his search he wants them treated as the same
letter. A fuzzy searching treats them all the same. For example, a
search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
fine. Even though "o" and "ö" are different characters. And lo and
behold Google actually does this! Try it. It's nice for those of us
who want to find something and our US keyboards don't have the right marks.
https://www.google.ca/search?q=godel+escher+bach
After all this nonsense, that's what the original poster is looking for
(I think... can't be sure since it's been so many days now). Seems to
me a python module does this quite nicely:
https://pypi.python.org/pypi/Unidecode
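A minimal sketch of the accent-insensitive matching described above, using only the standard library's unicodedata module rather than Unidecode. The helper names strip_marks and fold are made up for illustration. This approach only removes combining marks, so letters such as ø or ß, which have no decomposition, would still need a transliteration library like Unidecode.

```python
import unicodedata

def strip_marks(text):
    """Return text with combining marks (accents, umlauts) removed."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def fold(text):
    """Hypothetical search key: strip marks, then case-fold."""
    return strip_marks(text).casefold()

title = "G\u00f6del, Escher, Bach"   # "Gödel, Escher, Bach"
query = "godel, escher, bach"

# Compare the folded forms instead of the raw strings.
print(fold(query) in fold(title))   # True
```

The original strings stay untouched; only the search keys are folded, so a text-processing tool can still insist on exact matches where that matters.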
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-30 11:38 -0700 |
| Message-ID | <78fce490-a583-4a0e-845e-73fec6bf705a@googlegroups.com> |
| In reply to | #58076 |
On Wednesday, 30 October 2013 at 18:54:05 UTC+1, Michael Torrie wrote:
> On 10/30/2013 10:08 AM, wxjmfauth@gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; a "ï " and a "i" are "as
> > diferent" as a "a" from a "z". A diacritical mark
> > is more than a simple ornementation.
>
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï " and a "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter. A fuzzy searching treats them all the same. For example, a
> search for "Godel, Escher, Bach" should find "Gödel, Escher, Bach" just
> fine. Even though "o" and "ö" are different characters. And lo and
> behold Google actually does this! Try it. It's nice for those of us
> who want to find something and our US keyboards don't have the right marks.
>
> https://www.google.ca/search?q=godel+escher+bach
>
> After all this nonsense, that's what the original poster is looking for
> (I think... can't be sure since it's been so many days now). Seems to
> me a python module does this quite nicely:
>
> https://pypi.python.org/pypi/Unidecode
Ok. You are right. I recognize my mistake. Independently of the
original poster's task, I did not understand it in that way.
Let's say it depends on the context: for a general search engine,
it's good that diacritics are ignored. For, let's say, a text
processing system, it's good to have only precise matches.
That does not mean other matching possibilities cannot exist.
jmf
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-10-30 19:28 -0400 |
| Message-ID | <roy-445BFB.19284330102013@news.panix.com> |
| In reply to | #58076 |
In article <mailman.1821.1383156703.18130.python-list@python.org>,
Michael Torrie <torriem@gmail.com> wrote:
> On 10/30/2013 10:08 AM, wxjmfauth@gmail.com wrote:
> > My comment had nothing to do with Python, it was a
> > general comment. A diacritical mark just makes a letter
> > a different letter; a "ï " and a "i" are "as
> > diferent" as a "a" from a "z". A diacritical mark
> > is more than a simple ornementation.
>
> That's nice, but you didn't actually read what Ned said (or the OP).
> The OP doesn't care that "ï " and a "i" are as different as "a" and "z".
> For the purposes of his search he wants them treated as the same
> letter. A fuzzy searching treats them all the same.
That's one definition of fuzzy. But, there's nothing that says you
can't build a fuzzy matching algorithm which considers some mismatches
to be worse than others.
For example, it's reasonable to consider any vowel (or string of vowels,
for that matter) to be closer to another vowel than to a consonant. A
great example is the word, "bureaucrat". As far as I'm concerned, it's
spelled {b, vowels, r, vowels, c, r, a, t}. It usually takes me three
or four tries to get auto-correct to even recognize what I'm trying to
type and fix it for me.
Likewise for pairs like {c, s}, {j, g}, {v, w}, and so on.
In that spirit, I would think that a, á, and â would all be considered
more conservative replacements for each other than they would be for k,
x, or z.
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-10-31 06:46 -0500 |
| Message-ID | <mailman.1872.1383219887.18130.python-list@python.org> |
| In reply to | #58141 |
On 2013-10-30 19:28, Roy Smith wrote:
> For example, it's reasonable to consider any vowel (or string of
> vowels, for that matter) to be closer to another vowel than to a
> consonant. A great example is the word, "bureaucrat". As far as
> I'm concerned, it's spelled {b, vowels, r, vowels, c, r, a, t}. It
> usually takes me three or four tries to get auto-correct to even
> recognize what I'm trying to type and fix it for me.
[glad I'm not the only one who has trouble spelling "bureaucrat"]
Steven D'Aprano wisely mentioned elsewhere in the thread that "The
right solution to that is to treat it no differently from other fuzzy
searches. A good search engine should be tolerant of spelling errors
and alternative spellings for any letter, not just those with
diacritics."
Often the Levenshtein distance is used for calculating closeness, and
the off-the-shelf algorithm assigns a cost of one per difference
(addition, change, or removal). It doesn't sound like it would be
that hard[1] to assign varying costs based on what character was
added/changed/removed. A diacritic might have a cost of N while a
similar character (vowel->vowel or consonant->consonant, or
consonant-cluster shift) might have a cost of 2N, and a totally
arbitrary character shift might have a cost of 3N (or higher).
Unfortunately, the Levenshtein algorithm is already O(M*N) slow and
can't be reasonably precalculated without knowing both strings, so
this just ends up heaping additional lookups/comparisons atop
already-slow code.
-tkc
[1]
http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications
.
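A rough sketch of the weighted edit distance outlined above, under stated assumptions: a diacritic-only (or case-only) difference costs N, a vowel-for-vowel or consonant-for-consonant swap costs 2N, anything else costs 3N, and insertions and deletions are charged 3N as well. The function names, the vowel set and the insertion/deletion cost are illustrative choices, not something fixed by the thread. Inputs are assumed to be NFC-normalized so that each accented letter is a single code point.

```python
import unicodedata

VOWELS = set("aeiou")

def base(ch):
    """Lowercased letter with any combining marks stripped."""
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def sub_cost(a, b, n=1):
    """Cost of replacing character a with character b."""
    if a == b:
        return 0
    if base(a) == base(b):
        return n                      # differs only by diacritic or case
    same_class = (base(a) in VOWELS) == (base(b) in VOWELS)
    if a.isalpha() and b.isalpha() and same_class:
        return 2 * n                  # vowel->vowel or consonant->consonant
    return 3 * n                      # arbitrary substitution

def weighted_distance(s, t, n=1):
    """Levenshtein distance with the substitution costs above; O(len(s)*len(t))."""
    indel = 3 * n                     # assumed cost of an insertion or deletion
    prev = [j * indel for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        cur = [i * indel]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + indel,                 # delete a
                           cur[j - 1] + indel,              # insert b
                           prev[j - 1] + sub_cost(a, b, n)))  # substitute
        prev = cur
    return prev[-1]

print(weighted_distance("naive", "na\u00efve"))   # 1: only a diacritic differs
```

The extra work per cell is a couple of lookups, so the running time stays O(M*N) with a larger constant, which matches the concern about heaping lookups on top of already-slow code.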
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2013-10-30 17:56 -0400 |
| Message-ID | <mailman.1858.1383170226.18130.python-list@python.org> |
| In reply to | #58058 |
On 10/30/2013 12:08 PM, wxjmfauth@gmail.com wrote:
> From a unicode perspective.
> Unicode.org "knows", these chars a very important, that's
> the reason why they exist in two forms, precomposed and
> composed forms.
Only some chars have both forms. I believe the precomposed forms are
partly a historical accident of what precomposed forms were in the
various latin-1 sets.
--
Terry Jan Reedy
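The two forms mentioned above can be inspected with the standard library's unicodedata module. This is only an illustrative sketch; the choice of "q" plus a combining diaeresis as a letter without a precomposed form is my own example.

```python
import unicodedata

precomposed = "\u00ef"     # ï as a single code point
decomposed = "i\u0308"     # i followed by COMBINING DIAERESIS

# The two spellings render the same but compare unequal as raw strings.
print(precomposed == decomposed)                                  # False

# Normalizing to a common form makes them comparable.
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # True

# Not every base letter plus mark has a precomposed code point:
# NFC leaves "q" + diaeresis as two code points.
print(len(unicodedata.normalize("NFC", "q\u0308")))               # 2
```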
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2013-10-31 07:10 +0000 |
| Message-ID | <5272025a$0$29862$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #58012 |
On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
>> The right solution to that is to treat it no differently from other
>> fuzzy
>> searches. A good search engine should be tolerant of spelling errors
>> and
>> alternative spellings for any letter, not just those with diacritics.
>> Ideally, a good search engine would successfully match all three of
>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>> of diacritics.
>
> This is a non sense. The purpose of a diacritical mark is to make a
> letter a different letter. If a tool is supposed to match an ô, there is
> absolutely no reason to match something else.
I'm glad that you know so much better than Google, Bing, Yahoo, and other
search engines. When I search for "mispealled" Google gives me:
Showing results for misspelled
Search instead for mispealled
But I see now that this is nonsense and there is *absolutely no reason*
to match something other than the ecaxt wrods I typed.
Perhaps you should submit a bug report to Google:
"When I mistype a word, Google correctly gives me the search results I
wanted, instead of the wrong results I didn't want."
--
Steven
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-31 07:23 +0000 |
| Message-ID | <mailman.1867.1383204246.18130.python-list@python.org> |
| In reply to | #58152 |
On 31/10/2013 07:10, Steven D'Aprano wrote:
> On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
>
>>> The right solution to that is to treat it no differently from other
>>> fuzzy
>>> searches. A good search engine should be tolerant of spelling errors
>>> and
>>> alternative spellings for any letter, not just those with diacritics.
>>> Ideally, a good search engine would successfully match all three of
>>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>>> of diacritics.
>>
>> This is a non sense. The purpose of a diacritical mark is to make a
>> letter a different letter. If a tool is supposed to match an ô, there is
>> absolutely no reason to match something else.
>
> I'm glad that you know so much better than Google, Bing, Yahoo, and other
> search engines. When I search for "mispealled" Google gives me:
>
> Showing results for misspelled
> Search instead for mispealled
>
> But I see now that this is nonsense and there is *absolutely no reason*
> to match something other than the ecaxt wrods I typed.
>
> Perhaps you should submit a bug report to Google:
>
> "When I mistype a word, Google correctly gives me the search results I
> wanted, instead of the wrong results I didn't want."
>
I'm sorry Steven but you're completely out of your depth here. Please
bow down to the superior intellect of jmf, where jm is for Joseph McCarthy.
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-31 03:33 -0700 |
| Message-ID | <4460346f-c715-42fb-8e94-e20b46f1bbf8@googlegroups.com> |
| In reply to | #58152 |
On Thursday, 31 October 2013 at 08:10:18 UTC+1, Steven D'Aprano wrote:
> On Wed, 30 Oct 2013 01:49:28 -0700, wxjmfauth wrote:
>
> >> The right solution to that is to treat it no differently from other
> >> fuzzy
> >> searches. A good search engine should be tolerant of spelling errors
> >> and
> >> alternative spellings for any letter, not just those with diacritics.
> >> Ideally, a good search engine would successfully match all three of
> >> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> >> of diacritics.
> >
> > This is a non sense. The purpose of a diacritical mark is to make a
> > letter a different letter. If a tool is supposed to match an ô, there is
> > absolutely no reason to match something else.
>
> I'm glad that you know so much better than Google, Bing, Yahoo, and other
> search engines. When I search for "mispealled" Google gives me:
>
> Showing results for misspelled
> Search instead for mispealled
>
> But I see now that this is nonsense and there is *absolutely no reason*
> to match something other than the ecaxt wrods I typed.
>
> Perhaps you should submit a bug report to Google:
>
> "When I mistype a word, Google correctly gives me the search results I
> wanted, instead of the wrong results I didn't want."
>
> --
> Steven
As far as I know, I recognized my mistake. I had more text processing
systems in mind, than search engines.
I can even tell you, I am really stupid. I wrote pure Unicode software
to sort French or German strings.
Pure unicode == independent from any locale.
jmf
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-11-01 07:16 +0000 |
| Message-ID | <52735554$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #58161 |
On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
> On Thursday, 31 October 2013 at 08:10:18 UTC+1, Steven D'Aprano wrote:
>> I'm glad that you know so much better than Google, Bing, Yahoo, and
>> other
>> search engines. When I search for "mispealled" Google gives me:
[...]
> As far as I know, I recognized my mistake. I had more text processing
> systems in mind, than search engines.
Yes, you have, I acknowledge that now. I see now that at the time I made
my response to you, you had already replied recognising your error.
Unfortunately I had not seen that. So in that case, I withdraw my
comments and apologize.
> I can even tell you, I am really stupid. I wrote pure Unicode software
> to sort French or German strings.
>
> Pure unicode == independent from any locale.
Unfortunately it is not that simple. The same code point can have
different meanings in different languages, and should be treated
differently when sorting. The natural Unicode sort order satisfies very
few European languages, including English. A few examples:
* Swedish ä is a distinct letter of the alphabet, appearing
  after z: "a b c z ä" is sorted according to Swedish rules.
  But in German ä is considered to be the letter 'a' plus an
  umlaut, and is collated after 'a': "a ä b c z" is sorted
  according to German rules.
* In German ö is considered to be a variant of o, equivalent
  to 'oe', while in Finnish ö is a distinct letter which
  cannot be expanded to 'oe', and which appears at the end
  of the alphabet.
* Similarly, in modern English æ is a ligature of ae, while in
  Danish and Norwegian it is a distinct letter of the alphabet
  appearing after z: in English dictionaries, "Æsir" will be
  found with other "A" words, often expanded to "Aesir", while
  in Norwegian it will be found after "Z" words.
* Most European languages convert uppercase I to lowercase i,
  but Turkish has distinct letters for dotted and dotless I.
  According to Turkish rules, lowercase(I) is ı and uppercase(i)
  is İ.
While it is true that the Unicode character set is independent of locale,
for natural processing of characters, it isn't enough to just use Unicode.
--
Steven
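To make the Swedish and German examples above concrete, here is a toy sketch comparing plain code point order with two hand-written sort keys. The key functions and the hard-coded alphabet are hypothetical illustrations, not real collation rules; for production use, something like locale.strxfrm or an ICU binding would be the usual route.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def german_like_key(word):
    """Toy key: file ä under a, breaking ties by the original spelling."""
    return (strip_marks(word), word)

# a..z followed by å, ä, ö (written as escapes to keep the source ASCII).
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

def swedish_like_key(word):
    """Toy key: rank each letter by its position in the Swedish alphabet.
    Raises ValueError for characters outside that alphabet."""
    return [SWEDISH_ALPHABET.index(ch) for ch in word.lower()]

letters = ["z", "\u00e4", "c", "a", "b"]          # z ä c a b
print(sorted(letters, key=german_like_key))       # a ä b c z
print(sorted(letters, key=swedish_like_key))      # a b c z ä

tail = ["\u00f6", "\u00e4", "\u00e5", "z"]        # ö ä å z
print(sorted(tail))                               # code points: z ä å ö
print(sorted(tail, key=swedish_like_key))         # Swedish:     z å ä ö

# Code point order is not even right for English once case is involved.
print(sorted(["apple", "Zebra", "banana"]))       # Zebra sorts first

# str.lower() ignores locale, so the Turkish dotless ı is never produced.
print("I".lower())                                # i
```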
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-11-01 02:00 -0700 |
| Message-ID | <39f9588e-d60d-4e34-8b61-33de32a99d08@googlegroups.com> |
| In reply to | #58243 |
On Friday, 1 November 2013 at 08:16:36 UTC+1, Steven D'Aprano wrote:
> On Thu, 31 Oct 2013 03:33:15 -0700, wxjmfauth wrote:
>
> > On Thursday, 31 October 2013 at 08:10:18 UTC+1, Steven D'Aprano wrote:
> >> I'm glad that you know so much better than Google, Bing, Yahoo, and
> >> other
> >> search engines. When I search for "mispealled" Google gives me:
> [...]
> > As far as I know, I recognized my mistake. I had more text processing
> > systems in mind, than search engines.
>
> Yes, you have, I acknowledge that now. I see now that at the time I made
> my response to you, you had already replied recognising your error.
> Unfortunately I had not seen that. So in that case, I withdraw my
> comments and apologize.
>
> > I can even tell you, I am really stupid. I wrote pure Unicode software
> > to sort French or German strings.
> >
> > Pure unicode == independent from any locale.
>
> Unfortunately it is not that simple. The same code point can have
> different meanings in different languages, and should be treated
> differently when sorting. The natural Unicode sort order satisfies very
> few European languages, including English. A few examples:
>
> * Swedish ä is a distinct letter of the alphabet, appearing
>   after z: "a b c z ä" is sorted according to Swedish rules.
>   But in German ä is considered to be the letter 'a' plus an
>   umlaut, and is collated after 'a': "a ä b c z" is sorted
>   according to German rules.
>
> * In German ö is considered to be a variant of o, equivalent
>   to 'oe', while in Finnish ö is a distinct letter which
>   cannot be expanded to 'oe', and which appears at the end
>   of the alphabet.
>
> * Similarly, in modern English æ is a ligature of ae, while in
>   Danish and Norwegian it is a distinct letter of the alphabet
>   appearing after z: in English dictionaries, "Æsir" will be
>   found with other "A" words, often expanded to "Aesir", while
>   in Norwegian it will be found after "Z" words.
>
> * Most European languages convert uppercase I to lowercase i,
>   but Turkish has distinct letters for dotted and dotless I.
>   According to Turkish rules, lowercase(I) is ı and uppercase(i)
>   is İ.
>
> While it is true that the Unicode character set is independent of locale,
> for natural processing of characters, it isn't enough to just use Unicode.
>
> --
> Steven
I'm aware of all the points you gave. That's why I wrote "French
or German strings". The hard task is not on the side of Unicode or
sorting, it is on the creation of key(s) used for sorting.
E.g., cote, côte, coté, côté. French editors are not all sorting
these words in the same way (diacritics).
jmf
PS A *real* case to test the FSR.
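The point about key creation can be sketched with two hypothetical key functions for the four words mentioned above. Both compare base letters first; they differ only in whether accents are compared from the start of the word or from the end, the latter being a convention some French dictionaries are commonly described as using. The function names are made up, and this is an illustration rather than a complete collation algorithm.

```python
import unicodedata

def split_marks(word):
    """Split a word into its base letters and a per-letter list of marks."""
    bases, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and marks:
            marks[-1] += ch          # attach the mark to the preceding letter
        else:
            bases.append(ch)
            marks.append("")
    return "".join(bases), marks

def forward_key(word):
    """Base letters first, then accents compared left to right."""
    base, marks = split_marks(word)
    return (base, marks)

def backward_key(word):
    """Base letters first, then accents compared right to left."""
    base, marks = split_marks(word)
    return (base, marks[::-1])

# côté, coté, cote, côte (escapes keep the source ASCII-only)
words = ["c\u00f4t\u00e9", "cot\u00e9", "cote", "c\u00f4te"]

print(sorted(words, key=forward_key))    # cote, coté, côte, côté
print(sorted(words, key=backward_key))   # cote, côte, coté, côté
```

Either order is produced by plain tuple comparison on the keys; the sorting itself needs nothing language-specific.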
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-11-01 09:18 +0000 |
| Message-ID | <mailman.1913.1383297511.18130.python-list@python.org> |
| In reply to | #58247 |
On 01/11/2013 09:00, wxjmfauth@gmail.com wrote:
I'll ask again, would you please read, digest and action this
https://wiki.python.org/moin/GoogleGroupsPython
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2013-10-29 05:22 +0000 |
| Message-ID | <526f4612$0$6512$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #57823 |
On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
> And of course, logically, they are very, very badly handled with the
> Flexible String Representation.
I'm reminded of Cato the Elder, the Roman senator who would end every
speech, no matter the topic, with "Ceterum censeo Carthaginem esse
delendam" ("Furthermore, I consider that Carthage must be destroyed").
But at least he had the good grace to present that as an opinion, instead
of repeating a falsehood as if it were a fact.
--
Steven
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-29 08:38 -0700 |
| Message-ID | <63fa9fcd-6445-41ee-8873-e1ee046e2031@googlegroups.com> |
| In reply to | #57881 |
On Tuesday, 29 October 2013 at 06:22:27 UTC+1, Steven D'Aprano wrote:
> On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
>
> > And of course, logically, they are very, very badly handled with the
> > Flexible String Representation.
>
> I'm reminded of Cato the Elder, the Roman senator who would end every
> speech, no matter the topic, with "Ceterum censeo Carthaginem esse
> delendam" ("Furthermore, I consider that Carthage must be destroyed").
>
> But at least he had the good grace to present that as an opinion, instead
> of repeating a falsehood as if it were a fact.
>
> --
> Steven
------
>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.12621293837694095
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.26411553466961735
If you are understanding the coding of characters, Unicode
and what this FSR does, it is a child play to produce gazillion
of examples like this.
(Notice the usage of a Dutch character instead of a boring €).
jmf
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-10-29 10:52 -0500 |
| Message-ID | <mailman.1761.1383061878.18130.python-list@python.org> |
| In reply to | #57923 |
On 2013-10-29 08:38, wxjmfauth@gmail.com wrote:
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12621293837694095
> >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
> 0.26411553466961735
That reads to me as "If things were purely UCS4 internally, Python
would normally take 0.264... seconds to execute this test, but core
devs managed to optimize a particular (lower 127 ASCII characters
only) case so that it runs in less than half the time."
Is this not what you intended to demonstrate? 'cuz that sounds
like a pretty awesome optimization to me.
-tkc
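For context on the optimization being discussed: the Flexible String Representation (PEP 393, CPython 3.3 and later) stores each string using 1, 2 or 4 bytes per character depending on the widest character it contains. A small sketch, assuming CPython, that estimates the per-character storage with sys.getsizeof; the helper name bytes_per_char is made up for this example.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by comparing two string lengths,
    which cancels out the fixed object header."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) / n

samples = [
    ("ASCII", "d"),
    ("Latin-1", "\u00e9"),        # é
    ("BMP", "\u0133"),            # ĳ, the Dutch ligature from the timing test
    ("astral", "\U0001F600"),     # a character outside the BMP
]

for label, ch in samples:
    # On CPython 3.3+ this is expected to show 1, 1, 2 and 4 bytes.
    print(label, bytes_per_char(ch))
```

A string containing ĳ therefore uses the 2-byte layout, while a pure ASCII string uses the 1-byte layout.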
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-29 12:16 -0700 |
| Message-ID | <9319e982-4628-4f32-b5cc-60eadca121fc@googlegroups.com> |
| In reply to | #57924 |
On Tuesday, 29 October 2013 at 16:52:49 UTC+1, Tim Chase wrote:
> On 2013-10-29 08:38, wxjmfauth@gmail.com wrote:
>
> > >>> import timeit
> > >>> timeit.timeit("a = 'hundred'; 'x' in a")
> > 0.12621293837694095
> > >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
> > 0.26411553466961735
>
> That reads to me as "If things were purely UCS4 internally, Python
> would normally take 0.264... seconds to execute this test, but core
> devs managed to optimize a particular (lower 127 ASCII characters
> only) case so that it runs in less than half the time."
>
> Is this not what you intended to demonstrate? 'cuz that sounds
> like a pretty awesome optimization to me.
>
> -tkc
--------
That's very naive. In fact, what happens is just the opposite.
The "best case" with the FSR is worse than the "worst case"
without the FSR.
And this is just without counting the effect that this poor
Python is spending its time in switching from one internal
representation to one another, without forgetting the fact
that this has to be tested every time.
The more unicode manipulations one applies, the more time
it demands.
Two tasks, that come to my mind: re and normalization.
It's very interesting to observe what happens when one
normalizes latin text and polytonic Greek text, both with
plenty of diacritics.
----
Something different, based on my previous example.
What a European user is supposed to think, when she/he
sees, she/he can be "penalized" by such an amount,
simply by using non ascii characters for a product
which is supposed to be "unicode compliant" ?
jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-29 19:54 +0000 |
| Message-ID | <mailman.1773.1383076460.18130.python-list@python.org> |
| In reply to | #57961 |
On 29/10/2013 19:16, wxjmfauth@gmail.com wrote:
> On Tuesday, 29 October 2013 at 16:52:49 UTC+1, Tim Chase wrote:
>> On 2013-10-29 08:38, wxjmfauth@gmail.com wrote:
>>
>>> >>> import timeit
>>> >>> timeit.timeit("a = 'hundred'; 'x' in a")
>>> 0.12621293837694095
>>> >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
>>> 0.26411553466961735
>>
>> That reads to me as "If things were purely UCS4 internally, Python
>> would normally take 0.264... seconds to execute this test, but core
>> devs managed to optimize a particular (lower 127 ASCII characters
>> only) case so that it runs in less than half the time."
>>
>> Is this not what you intended to demonstrate? 'cuz that sounds
>> like a pretty awesome optimization to me.
>>
>> -tkc
>
> --------
>
> That's very naive. In fact, what happens is just the opposite.
> The "best case" with the FSR is worse than the "worst case"
> without the FSR.
>
> And this is just without counting the effect that this poor
> Python is spending its time in switching from one internal
> representation to one another, without forgetting the fact
> that this has to be tested every time.
> The more unicode manipulations one applies, the more time
> it demands.
>
> Two tasks, that come to my mind: re and normalization.
> It's very interesting to observe what happens when one
> normalizes latin text and polytonic Greek text, both with
> plenty of diacritics.
>
> ----
>
> Something different, based on my previous example.
>
> What a European user is supposed to think, when she/he
> sees, she/he can be "penalized" by such an amount,
> simply by using non ascii characters for a product
> which is supposed to be "unicode compliant" ?
>
> jmf
>
Please provide hard evidence to support your claims or stop posting this
ridiculous nonsense. Give us real world problems that can be reported
on the bug tracker, investigated and resolved.
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Piet van Oostrum <piet@vanoostrum.org> |
|---|---|
| Date | 2013-10-29 21:33 -0400 |
| Message-ID | <m27gcv1r0x.fsf@cochabamba.vanoostrum.org> |
| In reply to | #57964 |
Mark Lawrence <breamoreboy@yahoo.co.uk> writes:
> Please provide hard evidence to support your claims or stop posting this
> ridiculous nonsense. Give us real world problems that can be reported
> on the bug tracker, investigated and resolved.
I think it is much better just to ignore this nonsense instead of asking
for evidence you know you will never get.
--
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-30 09:19 +0000 |
| Message-ID | <mailman.1797.1383124791.18130.python-list@python.org> |
| In reply to | #57988 |
On 30/10/2013 01:33, Piet van Oostrum wrote:
> Mark Lawrence <breamoreboy@yahoo.co.uk> writes:
>
>> Please provide hard evidence to support your claims or stop posting this
>> ridiculous nonsense. Give us real world problems that can be reported
>> on the bug tracker, investigated and resolved.
>
> I think it is much better just to ignore this nonsense instead of asking
> for evidence you know you will never get.
>
A good point, but note he doesn't have the courage to reply to me but
always to others. I guess he spends a lot of time clucking, not because
he's run out of supplies, but because he's simply a chicken.
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-29 15:56 +0000 |
| Message-ID | <mailman.1762.1383062192.18130.python-list@python.org> |
| In reply to | #57923 |
On 29/10/2013 15:38, wxjmfauth@gmail.com wrote:
It's okay folks I'll snip all the double spaced google crap as the
poster is clearly too bone idle to follow the instructions that have
been repeatedly posted here asking for people not to post double spaced
google crap.
> On Tuesday, 29 October 2013 at 06:22:27 UTC+1, Steven D'Aprano wrote:
>> On Mon, 28 Oct 2013 07:01:16 -0700, wxjmfauth wrote:
>>> And of course, logically, they are very, very badly handled with the
>>> Flexible String Representation.
>>
>> I'm reminded of Cato the Elder, the Roman senator who would end every
>> speech, no matter the topic, with "Ceterum censeo Carthaginem esse
>> delendam" ("Furthermore, I consider that Carthage must be destroyed").
>>
>> But at least he had the good grace to present that as an opinion, instead
>> of repeating a falsehood as if it were a fact.
>>
>> --
>>
>> Steven
>
> ------
>
> >>> import timeit
> >>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.12621293837694095
> >>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
> 0.26411553466961735
>
> If you are understanding the coding of characters, Unicode
> and what this FSR does, it is a child play to produce gazillion
> of examples like this.
>
> (Notice the usage of a Dutch character instead of a boring €).
>
> jmf
>
You've stated above that logically unicode is badly handled by the fsr.
You then provide a trivial timing example. WTF???
--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer
Mark Lawrence
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-10-30 13:17 +1100 |
| Message-ID | <mailman.1787.1383099445.18130.python-list@python.org> |
| In reply to | #57923 |
On Wed, Oct 30, 2013 at 2:56 AM, Mark Lawrence <breamoreboy@yahoo.co.uk> wrote:
> You've stated above that logically unicode is badly handled by the fsr. You
> then provide a trivial timing example. WTF???
His idea of bad handling is "oh how terrible, ASCII and BMP have
optimizations". He hates the idea that it could be better in some
areas instead of even timings all along. But the FSR actually has some
distinct benefits even in the areas he's citing - watch this:
>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.3625614428649451
>>> timeit.timeit("a = 'hundreĳ'; 'x' in a")
0.6753936603674484
>>> timeit.timeit("a = 'hundred'; 'ģ' in a")
0.25663261671525106
>>> timeit.timeit("a = 'hundreĳ'; 'ģ' in a")
0.3582399439035271
The first two examples are his examples done on my computer, so you
can see how all four figures compare. Note how testing for the
presence of a non-Latin1 character in an 8-bit string is very fast.
Same goes for testing for non-BMP character in a 16-bit string. The
difference gets even larger if the string is longer:
>>> timeit.timeit("a = 'hundred'*1000; 'x' in a")
10.083378194714726
>>> timeit.timeit("a = 'hundreĳ'*1000; 'x' in a")
18.656413035735
>>> timeit.timeit("a = 'hundreĳ'*1000; 'ģ' in a")
18.436268855399135
>>> timeit.timeit("a = 'hundred'*1000; 'ģ' in a")
2.8308718007456264
Wow! The FSR speeds up searches immensely! It's obviously the best
thing since sliced bread!
ChrisA
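For anyone wanting to rerun comparisons like these, a hedged sketch of a slightly more careful harness: it builds the strings in the timeit setup so that only the `in` test is timed, and takes the best of several repeats. The function name time_membership and the loop counts are arbitrary choices for this example, and the absolute numbers will differ by machine and Python version.

```python
import timeit

def time_membership(haystack, needle, number=20000, repeat=3):
    """Best-of-`repeat` time, in seconds, for `number` runs of `needle in haystack`."""
    setup = "a = {!r}; c = {!r}".format(haystack, needle)
    return min(timeit.repeat("c in a", setup=setup, number=number, repeat=repeat))

ASCII_TEXT = "hundred"
BMP_TEXT = "hundre\u0133"            # ends in ĳ, so it is stored 2 bytes per char

for text in (ASCII_TEXT, BMP_TEXT, ASCII_TEXT * 1000, BMP_TEXT * 1000):
    for needle in ("x", "\u0123"):   # an ASCII needle and ģ (non-Latin-1)
        seconds = time_membership(text, needle)
        print(ascii(text[:7]), len(text), ascii(needle), round(seconds, 4))
```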