| Started by | bruce <badouglas@gmail.com> |
|---|---|
| First post | 2013-10-26 16:11 -0400 |
| Last post | 2013-10-30 15:25 +0000 |
| Articles | 20 on this page of 42 — 14 participants |
trying to strip out non ascii.. or rather convert non ascii bruce <badouglas@gmail.com> - 2013-10-26 16:11 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-26 22:24 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-10-26 20:51 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-26 21:11 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-10-27 02:05 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Chris Angelico <rosuav@gmail.com> - 2013-10-27 13:15 +1100
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-27 09:21 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-26 20:41 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-26 21:54 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-26 21:17 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Nobody <nobody@nowhere.com> - 2013-10-27 03:21 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-28 07:01 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-28 14:13 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-28 09:23 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-29 05:24 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 01:49 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Ned Batchelder <ned@nedbatchelder.com> - 2013-10-30 08:44 -0400
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 09:08 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 16:24 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Ned Batchelder <ned@nedbatchelder.com> - 2013-10-30 13:10 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Michael Torrie <torriem@gmail.com> - 2013-10-30 11:54 -0600
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 11:38 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Roy Smith <roy@panix.com> - 2013-10-30 19:28 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-31 06:46 -0500
Re: trying to strip out non ascii.. or rather convert non ascii Terry Reedy <tjreedy@udel.edu> - 2013-10-30 17:56 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-31 07:10 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-31 07:23 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-31 03:33 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-11-01 07:16 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-11-01 02:00 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-11-01 09:18 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Steven D'Aprano <steve@pearwood.info> - 2013-10-29 05:22 +0000
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-29 08:38 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Tim Chase <python.list@tim.thechases.com> - 2013-10-29 10:52 -0500
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-29 12:16 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-29 19:54 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Piet van Oostrum <piet@vanoostrum.org> - 2013-10-29 21:33 -0400
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 09:19 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-29 15:56 +0000
Re: trying to strip out non ascii.. or rather convert non ascii Chris Angelico <rosuav@gmail.com> - 2013-10-30 13:17 +1100
Re: trying to strip out non ascii.. or rather convert non ascii wxjmfauth@gmail.com - 2013-10-30 01:13 -0700
Re: trying to strip out non ascii.. or rather convert non ascii Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-10-30 15:25 +0000
| From | bruce <badouglas@gmail.com> |
|---|---|
| Date | 2013-10-26 16:11 -0400 |
| Subject | trying to strip out non ascii.. or rather convert non ascii |
| Message-ID | <mailman.1604.1382818293.18130.python-list@python.org> |
hi..

getting some files via curl, and want to convert them from what i'm guessing to be unicode.

I'd like to convert a string like this::

<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana</a></div>

to::

<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar, Iliana</a></div>

where I convert the " á " to " a" which appears to be a shift of 128, but I'm not sure how to accomplish this..

I've tested using the different decode/encode functions using utf-8/ascii with no luck.

I've reviewed stack overflow, as well as a few other sites, but haven't hit the aha moment.

pointers/comments would be welcome.

thanks
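One possible way to do the conversion asked about here, as a minimal Python 3 sketch. It assumes the bytes fetched by curl are UTF-8 (the post only guesses at the encoding), and `strip_accents` is a made-up helper name. The idea is to decode the bytes to text first, decompose each accented letter into a base letter plus a combining mark, and then drop the marks. The last line shows that the "shift of 128" holds for this particular letter but is not a general rule.

```python
import unicodedata

def strip_accents(text):
    """Decompose the text (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Bytes as curl might deliver them, ASSUMING the page is served as UTF-8.
raw = "Alc\u00e1ntar, Iliana".encode("utf-8")

# Decode to text before doing anything else; bytes are not characters.
text = raw.decode("utf-8")
plain = strip_accents(text)
print(plain)

# The code point difference is 128 for "a with acute" but not for "e with acute".
print(ord("\u00e1") - ord("a"), ord("\u00e9") - ord("e"))
```

If the server actually sends Latin-1 or another encoding, the `decode` call needs that encoding name instead; the stripping step itself does not change.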
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-10-26 22:24 +0000 |
| Message-ID | <526c412a$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #57656 |
On Sat, 26 Oct 2013 16:11:25 -0400, bruce wrote:
> hi..
>
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
>
> I'd like to convert a string like this:: <div class="profName"><a
> href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana</a></div>
>
> to::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
> Iliana</a></div>
>
> where I convert the
> " á " to " a"

Why on earth would you want to throw away perfectly good information? It's 2013, not 1953, and if you're still unable to cope with languages other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the characters used in American English, let alone British English. ASCII was broken from the day it was invented.)

Start by getting some understanding:
http://www.joelonsoftware.com/articles/Unicode.html

Then read this post from just over a week ago:
https://mail.python.org/pipermail/python-list/2013-October/657827.html

--
Steven
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2013-10-26 20:51 -0400 |
| Message-ID | <mailman.1626.1382835129.18130.python-list@python.org> |
| In reply to | #57676 |
On 26 Oct 2013 22:24:43 GMT, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> declaimed the following:
>Why on earth would you want to throw away perfectly good information?
>It's 2013, not 1953, and if you're still unable to cope with languages
>other than English, you need to learn new skills.
>
>(Actually, not even English, since ASCII doesn't even support all the
>characters used in American English, let alone British English. ASCII was
>broken from the day it was invented.)
>
Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-10-26 21:11 -0400 |
| Message-ID | <roy-2E63D2.21115526102013@news.panix.com> |
| In reply to | #57690 |
In article <mailman.1626.1382835129.18130.python-list@python.org>,
Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
> Compared to Baudot, both ASCII and EBCDIC were probably considered
> wondrous.

Wondrous, indeed. Why would anybody ever need more than one case of the alphabet? It's almost as absurd as somebody wanting to put funny little marks on top of their vowels.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-10-27 02:05 +0000 |
| Message-ID | <526c74fa$0$29972$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #57691 |
On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote:
> In article <mailman.1626.1382835129.18130.python-list@python.org>,
> Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>> wondrous.
>
> Wondrous, indeed. Why would anybody ever need more than one case of
> the alphabet? It's almost as absurd as somebody wanting to put funny
> little marks on top of their vowels.

Vwls? Wh wst tm wrtng dwn th vwls?

--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-10-27 13:15 +1100 |
| Message-ID | <mailman.1631.1382840117.18130.python-list@python.org> |
| In reply to | #57700 |
On Sun, Oct 27, 2013 at 1:05 PM, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote:
>
>> In article <mailman.1626.1382835129.18130.python-list@python.org>,
>> Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>>
>>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>>> wondrous.
>>
>> Wondrous, indeed. Why would anybody ever need more than one case of
>> the alphabet? It's almost as absurd as somebody wanting to put funny
>> little marks on top of their vowels.
>
> Vwls? Wh wst tm wrtng dwn th vwls?

There's really no reason to; you can always provide them by their entities!

ChrisA
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-27 09:21 +0000 |
| Message-ID | <mailman.1643.1382866927.18130.python-list@python.org> |
| In reply to | #57691 |
On 27/10/2013 01:11, Roy Smith wrote:
> In article <mailman.1626.1382835129.18130.python-list@python.org>,
> Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>> wondrous.
>
> Wondrous, indeed. Why would anybody ever need more than one case of
> the alphabet? It's almost as absurd as somebody wanting to put funny
> little marks on top of their vowels.
>

True indeed but it gets worse. For example those silly Spanish speaking types consider this ñ a letter in its own right and not a funny little mark on top of a consonant :)

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-10-26 20:41 -0500 |
| Message-ID | <mailman.1628.1382838024.18130.python-list@python.org> |
| In reply to | #57676 |
On 2013-10-26 22:24, Steven D'Aprano wrote:
> Why on earth would you want to throw away perfectly good
> information?
The main reason I've needed to do it in the past is for normalization
of search queries. When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.
For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about
    if term.upper() in datum.upper():
        it_matches()
I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.
    unicode_haystack1 = u"pingüino"
    unicode_haystack2 = u"¡Miré un pingüino!"
    needle = u"pinguino"
    if unicode_haystack1.sloppy_equals(needle):
        it_matches()
    if unicode_haystack2.sloppy_contains(needle):
        it_contains()
As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :-)
-tkc
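The `sloppy_equals` and `sloppy_contains` methods wished for above do not exist on Python strings. A rough sketch of how they could be approximated as plain functions in Python 3.3+, using only the standard library, follows. The helper name `fold` is made up, and the choice to fold case with `str.casefold` and to drop combining marks after NFKD decomposition is one assumption among several possible definitions of "sloppy".

```python
import unicodedata

def fold(text):
    """Case-fold, decompose (NFKD), and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def sloppy_equals(haystack, needle):
    """True if the two strings are equal once case and accents are ignored."""
    return fold(haystack) == fold(needle)

def sloppy_contains(haystack, needle):
    """True if needle occurs in haystack once case and accents are ignored."""
    return fold(needle) in fold(haystack)

unicode_haystack1 = "ping\u00fcino"
unicode_haystack2 = "\u00a1Mir\u00e9 un ping\u00fcino!"
needle = "Pinguino"

print(sloppy_equals(unicode_haystack1, needle))
print(sloppy_contains(unicode_haystack2, needle))
```

Folding both sides the same way means the caller never has to decide between upper and lower case for the comparison.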
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-10-26 21:54 -0400 |
| Message-ID | <roy-4807F0.21542726102013@news.panix.com> |
| In reply to | #57695 |
In article <mailman.1628.1382838024.18130.python-list@python.org>,
Tim Chase <python.list@tim.thechases.com> wrote:
> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

The problem with putting fuzzy matching in the core language is that there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching. One popular one is jellyfish (https://pypi.python.org/pypi/jellyfish/0.1.2). Don't expect you can just download and use it right out of the box, however. You'll need to do a little thinking about which of the several algorithms it includes makes sense for your application.

So, for example, you probably expect U+004E (Latin Capital Letter N) to match U+006E (Latin Small Letter N). But, what about these (all cribbed from Wikipedia):

U+00D1  Ñ  Latin Capital Letter N with tilde
U+00F1  ñ  Latin Small Letter N with tilde
U+0143  Ń  Latin Capital Letter N with acute
U+0144  ń  Latin Small Letter N with acute
U+0145  Ņ  Latin Capital Letter N with cedilla
U+0146  ņ  Latin Small Letter N with cedilla
U+0147  Ň  Latin Capital Letter N with caron
U+0148  ň  Latin Small Letter N with caron
U+0149  ŉ  Latin Small Letter N preceded by apostrophe
U+014A  Ŋ  Latin Capital Letter Eng
U+014B  ŋ  Latin Small Letter Eng
U+019D  Ɲ  Latin Capital Letter N with left hook
U+019E  ƞ  Latin Small Letter N with long right leg
U+01CA  Ǌ  Latin Capital Letter NJ
U+01CB  ǋ  Latin Capital Letter N with Small Letter J
U+01CC  ǌ  Latin Small Letter NJ
U+0235  ȵ  Latin Small Letter N with curl

I can't even begin to guess if they should match for your application.
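To get a feel for how the characters listed above behave, here is a small sketch using the standard `unicodedata` module. It prints what is left of each one after NFKD decomposition with combining marks removed; `base_letters` is a made-up helper name. Whether the result should count as a match for a plain N is still an application decision; the sketch only shows what Unicode's own decomposition data says.

```python
import unicodedata

# The code points from the list above, written as escapes.
CANDIDATES = (
    "\u00d1\u00f1\u0143\u0144\u0145\u0146\u0147\u0148\u0149"
    "\u014a\u014b\u019d\u019e\u01ca\u01cb\u01cc\u0235"
)

def base_letters(ch):
    """Decompose a character (NFKD) and drop any combining marks."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

for ch in CANDIDATES:
    # ascii() shows any non-ASCII remainder as an escape sequence.
    print("U+%04X" % ord(ch), unicodedata.name(ch), "->", ascii(base_letters(ch)))
```

Letters such as the tilde, acute, cedilla, and caron forms reduce to a bare N or n, the NJ digraphs expand to two letters, and characters like Eng have no decomposition at all and come back unchanged.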
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-10-26 21:17 -0500 |
| Message-ID | <mailman.1632.1382840149.18130.python-list@python.org> |
| In reply to | #57697 |
On 2013-10-26 21:54, Roy Smith wrote:
> In article <mailman.1628.1382838024.18130.python-list@python.org>,
> Tim Chase <python.list@tim.thechases.com> wrote:
>> I'd be just as happy if Python provided a "sloppy string compare"
>> that ignored case, diacritical marks, and the like.
>
> The problem with putting fuzzy matching in the core language is
> that there is no general agreement on how it's supposed to work.
>
> There are, however, third-party libraries which do fuzzy matching.
> One popular one is jellyfish
> (https://pypi.python.org/pypi/jellyfish/0.1.2).
Bookmarking and archiving your email for future reference.
> Don't expect you can just download and use it right out of the box,
> however. You'll need to do a little thinking about which of the
> several algorithms it includes makes sense for your application.
I'd be content with a baseline that denormalizes and then strips out
combining diacritical marks, something akin to MRAB's
    from unicodedata import normalize
    "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)
and tweaking it if that was insufficient.
Thanks for the link to Jellyfish.
-tkc
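For reference, the expression quoted above can be wrapped into a runnable function as in the sketch below (Python 3). The second function, `drop_marks`, is not from the post; it is an assumed alternative that removes only combining marks, included to show where the two baselines differ: the `ord(c) < 0x80` filter also deletes any letter that has no ASCII decomposition.

```python
from unicodedata import combining, normalize

def to_ascii(s):
    """The expression from the post: decompose, keep only ASCII characters."""
    return "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

def drop_marks(s):
    """Assumed variant: decompose, remove only the combining marks."""
    return "".join(c for c in normalize("NFKD", s) if not combining(c))

for word in ("ping\u00fcino", "Stra\u00dfe"):
    # ascii() makes any surviving non-ASCII character visible as an escape.
    print(ascii(word), ascii(to_ascii(word)), ascii(drop_marks(word)))
```

For the German word in the loop, the ASCII filter silently loses the sharp s, while the mark-only filter leaves it in place for later handling.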
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2013-10-27 03:21 +0000 |
| Message-ID | <pan.2013.10.27.03.21.57.202000@nowhere.com> |
| In reply to | #57695 |
On Sat, 26 Oct 2013 20:41:58 -0500, Tim Chase wrote:
> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

Simply ignoring diacritics won't get you very far. Most languages which use diacritics have standard conversions, e.g. ö -> oe, which are likely to be used by anyone familiar with the language e.g. when using software (or a keyboard) which can't handle diacritics.

OTOH, others (particularly native English speakers) may simply discard the diacritic. So to be of much use, a fuzzy match needs to handle either possibility.
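A sketch of what "handle either possibility" might look like in Python 3: for each string, compute one key with the diacritics simply discarded and one with a language-specific transliteration applied, then treat two strings as matching if they share any key. The `TRANSLITERATIONS` table is a hypothetical, deliberately tiny German-only example, and all function names here are made up for illustration.

```python
import unicodedata

# Hypothetical, deliberately tiny transliteration table (German only).
TRANSLITERATIONS = {"\u00e4": "ae", "\u00f6": "oe", "\u00fc": "ue", "\u00df": "ss"}

def strip_marks(text):
    """Decompose the text (NFKD) and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_keys(text):
    """Return the set of plausible accent-free spellings of the text."""
    # NFC first, so "o" + combining diaeresis is looked up as one character.
    lowered = unicodedata.normalize("NFC", text).lower()
    transliterated = "".join(TRANSLITERATIONS.get(ch, ch) for ch in lowered)
    return {strip_marks(lowered), strip_marks(transliterated)}

def sloppy_match(a, b):
    """True if the two strings share at least one search key."""
    return bool(search_keys(a) & search_keys(b))

print(sorted(search_keys("Schr\u00f6der")))
print(sloppy_match("Schr\u00f6der", "Schroeder"))
print(sloppy_match("Schr\u00f6der", "Schroder"))
```

A real table would have to be chosen per language, since the same letter can have different conventional replacements in different languages.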
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-28 07:01 -0700 |
| Message-ID | <d205042e-29cd-49df-9f6e-600e123f8483@googlegroups.com> |
| In reply to | #57709 |
On Sunday 27 October 2013 04:21:46 UTC+1, Nobody wrote:
> Simply ignoring diacritics won't get you very far.

Right. As an example, these four French words: cote, côte, coté, côté.

> Most languages which use diacritics have standard conversions, e.g.
> ö -> oe, which are likely to be used by anyone familiar with the
> language e.g. when using software (or a keyboard) which can't handle
> diacritics.

I'm quite comfortable with Unicode, esp. with the Latin blocks. Apart from this German case (I remember very old typewriters), what other languages allow this kind of substitution?

Just as a reminder: there are 1272 characters considered to be Latin characters (how to count them is not a simple task), and if my knowledge is correct, they cover and/or are here to cover the 17 languages, to be exact, the 17 European languages based on a Latin alphabet which cannot be covered with iso-8859-1.

And of course, logically, they are very, very badly handled with the Flexible String Representation.

jmf
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-28 14:13 +0000 |
| Message-ID | <mailman.1701.1382969640.18130.python-list@python.org> |
| In reply to | #57823 |
On 28/10/2013 14:01, wxjmfauth@gmail.com wrote:
>
> Just as a reminder: there are 1272 characters considered
> to be Latin characters (how to count them is not a simple
> task), and if my knowledge is correct, they cover
> and/or are here to cover the 17 languages, to be exact,
> the 17 European languages based on a Latin alphabet which
> cannot be covered with iso-8859-1.
>
> And of course, logically, they are very, very badly handled
> with the Flexible String Representation.
>
> jmf
>

Please provide us with evidence to back up your statement.

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-10-28 09:23 -0500 |
| Message-ID | <mailman.1702.1382970129.18130.python-list@python.org> |
| In reply to | #57823 |
On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>> Simply ignoring diacritics won't get you very far.
>
> Right. As an example, these four French words :
> cote, côte, coté, côté .

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability to enter characters with diacritics searches for "cote", I want to return possible matches containing any of your 4 examples. It's slightly fuzzier if they search for "coté", in which case they may mean "coté" or they might be unable to figure out how to add a hat and want to type "côté". Though I'd rather get more results, even if it has some that only match fuzzily.

Circumflexually-circumspectly-yers,

-tkc
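The search behaviour described above (a query for "cote" or "coté" returning all four words) could be sketched as an index keyed on an accent-free form of each word. This is a Python 3 illustration only; the word list, the `fold` helper, and the `search` function are all made up for the example, and a real search box would index documents rather than single words.

```python
import unicodedata
from collections import defaultdict

def fold(text):
    """Case-fold, decompose (NFKD), and drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Hypothetical data: the four French words from the thread plus one other.
WORDS = ["cote", "c\u00f4te", "cot\u00e9", "c\u00f4t\u00e9", "chat"]

# Map each folded key to every original spelling that folds to it.
index = defaultdict(list)
for word in WORDS:
    index[fold(word)].append(word)

def search(query):
    """Return every indexed word whose folded form equals the folded query."""
    return index.get(fold(query), [])

print(search("cote"))
print(search("cot\u00e9"))
```

Because the original spellings are kept as values, results could still be ranked so that exact matches for an accented query come first.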
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2013-10-29 05:24 +0000 |
| Message-ID | <526f46a2$0$6512$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #57826 |
On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>> Simply ignoring diacritics won't get you very far.
>>
>> Right. As an example, these four French words : cote, côte, coté, côté
>> .
>
> Distinct words with distinct meanings, sure.
>
> But when a naïve (naive? ☺) person or one without the easy ability to
> enter characters with diacritics searches for "cote", I want to return
> possible matches containing any of your 4 examples. It's slightly
> fuzzier if they search for "coté", in which case they may mean "coté" or
> they might be unable to figure out how to add a hat and want to
> type "côté". Though I'd rather get more results, even if it has some
> that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy searches. A good search engine should be tolerant of spelling errors and alternative spellings for any letter, not just those with diacritics. Ideally, a good search engine would successfully match all three of "naïve", "naive" and "niave", and it shouldn't rely on special handling of diacritics.

--
Steven
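As a rough illustration of matching without any special handling of diacritics, the sketch below scores strings with `difflib.SequenceMatcher` from the Python 3 standard library. This is a stand-in technique chosen for the example, not what a real search engine does, and the `cutoff` of 0.6 and the candidate word list are arbitrary assumptions.

```python
import difflib

def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0, ignoring case."""
    return difflib.SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

def fuzzy_search(query, candidates, cutoff=0.6):
    """Return the candidates whose similarity to the query reaches the cutoff."""
    return [c for c in candidates if similarity(query, c) >= cutoff]

# Hypothetical candidates: the three spellings from the post plus others.
words = ["na\u00efve", "naive", "niave", "python", "unicode"]

for word in words:
    print(ascii(word), round(similarity("naive", word), 2))

print([ascii(w) for w in fuzzy_search("naive", words)])
```

Here the accented spelling and the transposed spelling are both treated as ordinary near-misses of the query, which is the point being made above.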
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-30 01:49 -0700 |
| Message-ID | <e018a4c6-e7a5-4356-8929-e26a3fdcb75d@googlegroups.com> |
| In reply to | #57882 |
On Tuesday 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>
> > On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
> >>> Simply ignoring diacritics won't get you very far.
> >>
> >> Right. As an example, these four French words : cote, côte, coté, côté
> >> .
> >
> > Distinct words with distinct meanings, sure.
> >
> > But when a naïve (naive? ☺) person or one without the easy ability to
> > enter characters with diacritics searches for "cote", I want to return
> > possible matches containing any of your 4 examples. It's slightly
> > fuzzier if they search for "coté", in which case they may mean "coté" or
> > they might be unable to figure out how to add a hat and want to
> > type "côté". Though I'd rather get more results, even if it has some
> > that only match fuzzily.
>
> The right solution to that is to treat it no differently from other fuzzy
> searches. A good search engine should be tolerant of spelling errors and
> alternative spellings for any letter, not just those with diacritics.
> Ideally, a good search engine would successfully match all three of
> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> of diacritics.

------

This is nonsense. The purpose of a diacritical mark is to make a letter a different letter. If a tool is supposed to match an ô, there is absolutely no reason to match something else.

jmf
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-10-30 08:44 -0400 |
| Message-ID | <mailman.1806.1383137592.18130.python-list@python.org> |
| In reply to | #58012 |
On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
> On Tuesday 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
>> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>>
>>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>>>> Simply ignoring diacritics won't get you very far.
>>>> Right. As an example, these four French words : cote, côte, coté, côté
>>>> .
>>> Distinct words with distinct meanings, sure.
>>> But when a naïve (naive? ☺) person or one without the easy ability to
>>> enter characters with diacritics searches for "cote", I want to return
>>> possible matches containing any of your 4 examples. It's slightly
>>> fuzzier if they search for "coté", in which case they may mean "coté" or
>>> they might be unable to figure out how to add a hat and want to
>>> type "côté". Though I'd rather get more results, even if it has some
>>> that only match fuzzily.
>>
>> The right solution to that is to treat it no differently from other fuzzy
>> searches. A good search engine should be tolerant of spelling errors and
>> alternative spellings for any letter, not just those with diacritics.
>> Ideally, a good search engine would successfully match all three of
>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>> of diacritics.
>>
> ------
>
> This is nonsense. The purpose of a diacritical mark is to
> make a letter a different letter. If a tool is supposed to
> match an ô, there is absolutely no reason to match something
> else.
>
> jmf
>

jmf, Tim Chase described his use case, and it seems reasonable to me. I'm not sure why you would describe it as nonsense.

--Ned.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2013-10-30 09:08 -0700 |
| Message-ID | <d4e620ab-e42c-4939-92b4-0c5c62c0bc8b@googlegroups.com> |
| In reply to | #58033 |
On Wednesday 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote:
> On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
> > On Tuesday 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
> >> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
> >>
> >>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
> >>>>> Simply ignoring diacritics won't get you very far.
> >>>> Right. As an example, these four French words : cote, côte, coté, côté
> >>>> .
> >>> Distinct words with distinct meanings, sure.
> >>> But when a naïve (naive? ☺) person or one without the easy ability to
> >>> enter characters with diacritics searches for "cote", I want to return
> >>> possible matches containing any of your 4 examples. It's slightly
> >>> fuzzier if they search for "coté", in which case they may mean "coté" or
> >>> they might be unable to figure out how to add a hat and want to
> >>> type "côté". Though I'd rather get more results, even if it has some
> >>> that only match fuzzily.
> >>
> >> The right solution to that is to treat it no differently from other fuzzy
> >> searches. A good search engine should be tolerant of spelling errors and
> >> alternative spellings for any letter, not just those with diacritics.
> >> Ideally, a good search engine would successfully match all three of
> >> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> >> of diacritics.
> >>
> > ------
> >
> > This is nonsense. The purpose of a diacritical mark is to
> > make a letter a different letter. If a tool is supposed to
> > match an ô, there is absolutely no reason to match something
> > else.
> >
> > jmf
> >
>
> jmf, Tim Chase described his use case, and it seems reasonable to me.
> I'm not sure why you would describe it as nonsense.
>
> --Ned.

--------

My comment had nothing to do with Python, it was a general comment. A diacritical mark just makes a letter a different letter; a "ï " and a "i" are "as different" as a "a" from a "z". A diacritical mark is more than a simple ornamentation.

From a unicode perspective. Unicode.org "knows" these chars are very important; that's the reason why they exist in two forms, precomposed and composed forms.

From a software perspective. Luckily for the end users, all serious software treats all these chars in an equal way. They all belong to the BMP plane. An "Ą" is treated as an "ê", same memory consumption, same performance, ==> very smooth software.

jmf
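The "two forms" mentioned above can be seen directly from Python 3 with the standard `unicodedata` module. The sketch below builds the same visible letter once as a single precomposed code point and once as a base letter followed by a combining mark, and shows that the two only compare equal after normalization; the variable names are chosen for the example.

```python
import unicodedata

precomposed = "\u00ea"   # LATIN SMALL LETTER E WITH CIRCUMFLEX
combined = "e\u0302"     # "e" followed by COMBINING CIRCUMFLEX ACCENT

# Same letter on screen, but different code point sequences.
print(len(precomposed), len(combined))
print(precomposed == combined)

# NFC composes, NFD decomposes; after either, the two forms agree.
print(unicodedata.normalize("NFC", combined) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == combined)
```

This is why text from different sources is usually normalized to one form before it is compared or searched.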
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-10-30 16:24 +0000 |
| Message-ID | <mailman.1816.1383150305.18130.python-list@python.org> |
| In reply to | #58058 |
On 30/10/2013 16:08, wxjmfauth@gmail.com wrote:

Would you please read, digest and action this https://wiki.python.org/moin/GoogleGroupsPython

TIA.

--
Python is the second best programming language in the world.
But the best has yet to be invented. Christian Tismer

Mark Lawrence
| From | Ned Batchelder <ned@nedbatchelder.com> |
|---|---|
| Date | 2013-10-30 13:10 -0400 |
| Message-ID | <mailman.1819.1383153035.18130.python-list@python.org> |
| In reply to | #58058 |
On 10/30/13 12:08 PM, wxjmfauth@gmail.com wrote:
> On Wednesday 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote:
>> On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
>>
>>> On Tuesday 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
>>>> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>>>>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>>>>>> Simply ignoring diacritics won't get you very far.
>>>>>> Right. As an example, these four French words : cote, côte, coté, côté
>>>>>> .
>>>>> Distinct words with distinct meanings, sure.
>>>>> But when a naïve (naive? ☺) person or one without the easy ability to
>>>>> enter characters with diacritics searches for "cote", I want to return
>>>>> possible matches containing any of your 4 examples. It's slightly
>>>>> fuzzier if they search for "coté", in which case they may mean "coté" or
>>>>> they might be unable to figure out how to add a hat and want to
>>>>> type "côté". Though I'd rather get more results, even if it has some
>>>>> that only match fuzzily.
>>>> The right solution to that is to treat it no differently from other fuzzy
>>>> searches. A good search engine should be tolerant of spelling errors and
>>>> alternative spellings for any letter, not just those with diacritics.
>>>> Ideally, a good search engine would successfully match all three of
>>>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>>>> of diacritics.
>>> ------
>>> This is nonsense. The purpose of a diacritical mark is to
>>> make a letter a different letter. If a tool is supposed to
>>> match an ô, there is absolutely no reason to match something
>>> else.
>>> jmf
>>
>> jmf, Tim Chase described his use case, and it seems reasonable to me.
>> I'm not sure why you would describe it as nonsense.
>>
>> --Ned.
> --------
>
> My comment had nothing to do with Python, it was a
> general comment. A diacritical mark just makes a letter
> a different letter; a "ï " and a "i" are "as
> different" as a "a" from a "z". A diacritical mark
> is more than a simple ornamentation.

Yes, we understand that. Tim outlined a need that had to do with users' informal typing. In his case, he needs to deal with that sloppiness. You can't simply insist that users be more precise.

Unicode is a way to represent text, and text gets used in many different ways. Each of us has to acknowledge that our text needs may be different than someone else's.

jmf, I'm guessing from your comments over the last few months that you are doing detailed linguistic work with corpora in many languages. That work leads to one style of Unicode use. In your domain, it is "nonsense" to ignore diacriticals.

Other people do different kinds of work with Unicode, and that leads to different needs. In Tim's system, it is important to ignore diacriticals. You might not have a use personally for Tim's system. That doesn't make it nonsense.

--Ned.

> From a unicode perspective.
> Unicode.org "knows" these chars are very important; that's
> the reason why they exist in two forms, precomposed and
> composed forms.
>
> From a software perspective.
> Luckily for the end users, all serious software
> treats all these chars in an equal way. They
> all belong to the BMP plane. An "Ą" is treated
> as an "ê", same memory consumption, same performance,
> ==> very smooth software.
>
> jmf
>