trying to strip out non ascii.. or rather convert non ascii

Started by	bruce <badouglas@gmail.com>
First post	2013-10-26 16:11 -0400
Last post	2013-10-30 15:25 +0000
Articles	20 on this page of 42 — 14 participants

#57656 — trying to strip out non ascii.. or rather convert non ascii

From	bruce <badouglas@gmail.com>
Date	2013-10-26 16:11 -0400
Subject	trying to strip out non ascii.. or rather convert non ascii
Message-ID	<mailman.1604.1382818293.18130.python-list@python.org>

hi..

getting some files via curl, and want to convert them from what i'm
guessing to be unicode.

I'd like to convert a string like this::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcántar,
Iliana</a></div>

to::
<div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
Iliana</a></div>

where I convert the
" á " to " a"

which appears to be a shift of 128, but I'm not sure how to accomplish this..

I've tested using the different decode/encode functions using
utf-8/ascii with no luck.

I've reviewed stack overflow, as well as a few other sites, but
haven't hit the aha moment.

pointers/comments would be welcome.

thanks
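
A minimal sketch of the conversion being asked about, assuming Python 3 and assuming the bytes fetched by curl are UTF-8 (the post does not say which encoding the site uses). It also shows that the "shift of 128" only holds for the á/a pair and is not a general rule. The helper name `strip_accents` is hypothetical.

```python
import unicodedata

# Assumption: the downloaded bytes are UTF-8 encoded.
raw = 'Alcántar, Iliana'.encode('utf-8')
text = raw.decode('utf-8')

# The "shift of 128" is specific to this pair of characters.
shift_a = ord('á') - ord('a')   # 0xE1 - 0x61
shift_e = ord('é') - ord('e')   # 0xE9 - 0x65

def strip_accents(s):
    """Decompose to NFKD, then drop the combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

ascii_ish = strip_accents(text)
```

Decomposing first and then dropping the marks avoids any per-character arithmetic, but it throws information away, which is what the replies below take issue with.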

#57676

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-26 22:24 +0000
Message-ID	<526c412a$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#57656

On Sat, 26 Oct 2013 16:11:25 -0400, bruce wrote:

> hi..
> 
> getting some files via curl, and want to convert them from what i'm
> guessing to be unicode.
> 
> I'd like to convert a string like this:: <div class="profName"><a
> href="ShowRatings.jsp?tid=1312168">Alcántar, Iliana</a></div>
> 
> to::
> <div class="profName"><a href="ShowRatings.jsp?tid=1312168">Alcantar,
> Iliana</a></div>
> 
> where I convert the
> " á " to " a"

Why on earth would you want to throw away perfectly good information? 
It's 2013, not 1953, and if you're still unable to cope with languages 
other than English, you need to learn new skills.

(Actually, not even English, since ASCII doesn't even support all the 
characters used in American English, let alone British English. ASCII was 
broken from the day it was invented.)

Start by getting some understanding:

http://www.joelonsoftware.com/articles/Unicode.html

Then read this post from just over a week ago:

https://mail.python.org/pipermail/python-list/2013-October/657827.html

-- 
Steven

#57690

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2013-10-26 20:51 -0400
Message-ID	<mailman.1626.1382835129.18130.python-list@python.org>
In reply to	#57676

On 26 Oct 2013 22:24:43 GMT, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> declaimed the following:


>Why on earth would you want to throw away perfectly good information? 
>It's 2013, not 1953, and if you're still unable to cope with languages 
>other than English, you need to learn new skills.
>
>(Actually, not even English, since ASCII doesn't even support all the 
>characters used in American English, let alone British English. ASCII was 
>broken from the day it was invented.)
>

	Compared to Baudot, both ASCII and EBCDIC were probably considered
wondrous.
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/

#57691

From	Roy Smith <roy@panix.com>
Date	2013-10-26 21:11 -0400
Message-ID	<roy-2E63D2.21115526102013@news.panix.com>
In reply to	#57690

In article <mailman.1626.1382835129.18130.python-list@python.org>,
 Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:

> Compared to Baudot, both ASCII and EBCDIC were probably considered
> wondrous.

Wondrous, indeed.  Why would anybody ever need more than one case of
the alphabet?  It's almost as absurd as somebody wanting to put funny 
little marks on top of their vowels.

#57700

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-10-27 02:05 +0000
Message-ID	<526c74fa$0$29972$c3e8da3$5496439d@news.astraweb.com>
In reply to	#57691

On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote:

> In article <mailman.1626.1382835129.18130.python-list@python.org>,
>  Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
> 
>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>> wondrous.
> 
> Wondrous, indeed.  Why would anybody ever need more than one case of
> the alphabet?  It's almost as absurd as somebody wanting to put funny
> little marks on top of their vowels.

Vwls? Wh wst tm wrtng dwn th vwls?



-- 
Steven

#57702

From	Chris Angelico <rosuav@gmail.com>
Date	2013-10-27 13:15 +1100
Message-ID	<mailman.1631.1382840117.18130.python-list@python.org>
In reply to	#57700

On Sun, Oct 27, 2013 at 1:05 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> On Sat, 26 Oct 2013 21:11:55 -0400, Roy Smith wrote:
>
>> In article <mailman.1626.1382835129.18130.python-list@python.org>,
>>  Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>>
>>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>>> wondrous.
>>
>> Wondrous, indeed.  Why would anybody ever need more than one case of
>> the alphabet?  It's almost as absurd as somebody wanting to put funny
>> little marks on top of their vowels.
>
> Vwls? Wh wst tm wrtng dwn th vwls?

There&apos;s really no reason to&#59; you can always provide them by
their entities&#33;

ChrisA

#57733

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-27 09:21 +0000
Message-ID	<mailman.1643.1382866927.18130.python-list@python.org>
In reply to	#57691

On 27/10/2013 01:11, Roy Smith wrote:
> In article <mailman.1626.1382835129.18130.python-list@python.org>,
>   Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
>> Compared to Baudot, both ASCII and EBCDIC were probably considered
>> wondrous.
>
> Wondrous, indeed.  Why would anybody ever need more than one case of
> the alphabet?  It's almost as absurd as somebody wanting to put funny
> little marks on top of their vowels.
>

True indeed but it gets worse.  For example those silly Spanish speaking 
types consider this ñ a letter in its own right and not a funny little 
mark on top of a consonant :)

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#57695

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-10-26 20:41 -0500
Message-ID	<mailman.1628.1382838024.18130.python-list@python.org>
In reply to	#57676

On 2013-10-26 22:24, Steven D'Aprano wrote:
> Why on earth would you want to throw away perfectly good
> information? 

The main reason I've needed to do it in the past is for normalization
of search queries.  When a user wants to find something containing
"pingüino", I want to have those results come back even if they type
"pinguino" in the search box.

For the same reason searches are often normalized to ignore case.
The difference between "Polish" and "polish" is visually just
capitalization, but most folks don't think twice about

  if term.upper() in datum.upper():
    it_matches()

I'd be just as happy if Python provided a "sloppy string compare"
that ignored case, diacritical marks, and the like.

  unicode_haystack1 = u"pingüino"
  unicode_haystack2 = u"¡Miré un pingüino!"
  needle = u"pinguino"
  if unicode_haystack1.sloppy_equals(needle):
    it_matches()
  if unicode_haystack2.sloppy_contains(needle):
    it_contains()

As a matter of fact, I'd even be happier if Python did the heavy
lifting, since I wouldn't have to think about whether I want my code
to force upper-vs-lower for the comparison. :-)
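
A rough sketch of what such a "sloppy" comparison could look like as plain functions, assuming Python 3.3 or later (for `str.casefold`). The names `sloppy_key`, `sloppy_equals` and `sloppy_contains` are hypothetical, mirroring the wished-for methods above; nothing like them exists on `str`.

```python
import unicodedata

def sloppy_key(s):
    """Fold case, decompose, and drop combining marks (hypothetical helper)."""
    decomposed = unicodedata.normalize('NFKD', s.casefold())
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def sloppy_equals(a, b):
    # Equal once case and diacritical marks are ignored.
    return sloppy_key(a) == sloppy_key(b)

def sloppy_contains(haystack, needle):
    # Substring test on the folded forms.
    return sloppy_key(needle) in sloppy_key(haystack)
```

With these, `sloppy_equals("pingüino", "pinguino")` and `sloppy_contains("¡Miré un pingüino!", "pinguino")` both hold, and the caller no longer has to choose between upper and lower case.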

-tkc

#57697

From	Roy Smith <roy@panix.com>
Date	2013-10-26 21:54 -0400
Message-ID	<roy-4807F0.21542726102013@news.panix.com>
In reply to	#57695

In article <mailman.1628.1382838024.18130.python-list@python.org>,
 Tim Chase <python.list@tim.thechases.com> wrote:

> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

The problem with putting fuzzy matching in the core language is that 
there is no general agreement on how it's supposed to work.

There are, however, third-party libraries which do fuzzy matching.  One 
popular one is jellyfish (https://pypi.python.org/pypi/jellyfish/0.1.2).  
Don't expect you can just download and use it right out of the box, 
however. You'll need to do a little thinking about which of the several 
algorithms it includes makes sense for your application.

So, for example, you probably expect U+004E (Latin Capital Letter N) to 
match U+006E (Latin Small Letter N).  But, what about these (all cribbed 
from Wikipedia):

U+00D1   Ñ   &#209;   &Ntilde;   Latin Capital Letter N with tilde
U+00F1   ñ   &#241;   &ntilde;   Latin Small Letter N with tilde
U+0143   Ń   &#323;              Latin Capital Letter N with acute
U+0144   ń   &#324;              Latin Small Letter N with acute
U+0145   Ņ   &#325;              Latin Capital Letter N with cedilla
U+0146   ņ   &#326;              Latin Small Letter N with cedilla
U+0147   Ň   &#327;              Latin Capital Letter N with caron
U+0148   ň   &#328;              Latin Small Letter N with caron
U+0149   ŉ   &#329;              Latin Small Letter N preceded by apostrophe
U+014A   Ŋ   &#330;              Latin Capital Letter Eng
U+014B   ŋ   &#331;              Latin Small Letter Eng
U+019D   Ɲ   &#413;              Latin Capital Letter N with left hook
U+019E   ƞ   &#414;              Latin Small Letter N with long right leg
U+01CA   Ǌ   &#458;              Latin Capital Letter NJ
U+01CB   ǋ   &#459;              Latin Capital Letter N with Small Letter J
U+01CC   ǌ   &#460;              Latin Small Letter NJ
U+0235   ȵ   &#565;              Latin Small Letter N with curl
I can't even begin to guess if they should match for your application.
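
One way to see how uneven this list is: ask the Unicode database what each character decomposes to. The sketch below assumes Python 3; `base_after_nfkd` is a hypothetical helper, and the code points are a sample from the list above. Some of them reduce to a plain N or n, some to a digraph, and some do not decompose at all.

```python
import unicodedata

# A sample of the code points listed above.
code_points = [0x00D1, 0x00F1, 0x0143, 0x0149, 0x014A, 0x019D, 0x01CA, 0x0235]

def base_after_nfkd(ch):
    """Return the NFKD form of ch with combining marks removed."""
    decomposed = unicodedata.normalize('NFKD', ch)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

for cp in code_points:
    ch = chr(cp)
    # Print the code point, its official name, and what is left after folding.
    print('U+%04X' % cp, unicodedata.name(ch), repr(base_after_nfkd(ch)))
```

Characters such as Eng (U+014A) or N with left hook (U+019D) come back unchanged, so a decomposition-based matcher will never equate them with N unless it carries an extra table of its own.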

#57703

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-10-26 21:17 -0500
Message-ID	<mailman.1632.1382840149.18130.python-list@python.org>
In reply to	#57697

On 2013-10-26 21:54, Roy Smith wrote:
> In article <mailman.1628.1382838024.18130.python-list@python.org>,
>  Tim Chase <python.list@tim.thechases.com> wrote:
>> I'd be just as happy if Python provided a "sloppy string compare"
>> that ignored case, diacritical marks, and the like.
> 
> The problem with putting fuzzy matching in the core language is
> that there is no general agreement on how it's supposed to work.
> 
> There are, however, third-party libraries which do fuzzy matching.
> One popular one is jellyfish
> (https://pypi.python.org/pypi/jellyfish/0.1.2).

Bookmarking and archiving your email for future reference.

> Don't expect you can just download and use it right out of the box,
> however. You'll need to do a little thinking about which of the
> several algorithms it includes makes sense for your application.

I'd be content with a baseline that denormalizes and then strips out
combining diacritical marks, something akin to MRAB's

  from unicodedata import normalize
  "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

and tweaking it if that was insufficient.
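
For reference, a small sketch (Python 3 assumed) of how that baseline behaves. Note that the `ord(c) < 0x80` filter drops every non-ASCII character that survives decomposition, not only combining marks. The second function, `drop_marks`, is an assumption added here for contrast and is not from the thread.

```python
from unicodedata import normalize, combining

def ascii_only(s):
    # The filter quoted above: keep only code points below 0x80.
    return "".join(c for c in normalize("NFKD", s) if ord(c) < 0x80)

def drop_marks(s):
    # Variant (not from the thread): remove combining marks, keep the rest.
    return "".join(c for c in normalize("NFKD", s) if not combining(c))

folded = ascii_only("pingüino")     # the accent is removed
lossy = ascii_only("Søren")         # 'ø' has no decomposition and is dropped
kept = drop_marks("Søren")          # 'ø' is left in place
```

Which of the two is the better baseline depends on whether silently losing letters such as "ø" or "ß" is acceptable for the search in question.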

Thanks for the link to Jellyfish.

-tkc

#57709

From	Nobody <nobody@nowhere.com>
Date	2013-10-27 03:21 +0000
Message-ID	<pan.2013.10.27.03.21.57.202000@nowhere.com>
In reply to	#57695

On Sat, 26 Oct 2013 20:41:58 -0500, Tim Chase wrote:

> I'd be just as happy if Python provided a "sloppy string compare"
> that ignored case, diacritical marks, and the like.

Simply ignoring diacritics won't get you very far.

Most languages which use diacritics have standard conversions, e.g.
ö -> oe, which are likely to be used by anyone familiar with the
language e.g. when using software (or a keyboard) which can't handle
diacritics.

OTOH, others (particularly native English speakers) may simply discard the
diacritic. So to be of much use, a fuzzy match needs to handle either
possibility.
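
A sketch of handling both possibilities, assuming Python 3.3 or later and a small German-only transliteration table. The table, `strip_marks` and `search_keys` are all hypothetical names for illustration; other languages would need their own rules.

```python
import unicodedata

# Assumption: German conventions only (ö -> oe and friends).
GERMAN = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss',
                        'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue'})

def strip_marks(s):
    """Decompose and drop combining marks."""
    decomposed = unicodedata.normalize('NFKD', s)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

def search_keys(word):
    """Return both plausible ASCII spellings of word (hypothetical helper)."""
    # Compose first so that 'o' + combining diaeresis is seen as 'ö'.
    w = unicodedata.normalize('NFC', word)
    return {strip_marks(w.translate(GERMAN)).casefold(),
            strip_marks(w).casefold()}
```

Two words can then be treated as a fuzzy match when their key sets intersect, so "Köln" matches both "Koeln" and "Koln".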

#57823

From	wxjmfauth@gmail.com
Date	2013-10-28 07:01 -0700
Message-ID	<d205042e-29cd-49df-9f6e-600e123f8483@googlegroups.com>
In reply to	#57709

Le dimanche 27 octobre 2013 04:21:46 UTC+1, Nobody a écrit :
> Simply ignoring diacritics won't get you very far.

Right. As an example, these four French words :
cote, côte, coté, côté .

> Most languages which use diacritics have standard conversions, e.g.
> ö -> oe, which are likely to be used by anyone familiar with the
> language e.g. when using software (or a keyboard) which can't handle
> diacritics.

I'm quite comfortable with Unicode, esp. with the
Latin blocks.
Except this German case (I remember very old typewriters),
what are the other languages presenting this kind of
allowed feature ?

Just as a reminder. There are 1272 characters considered
as Latin characters (how to count them is not a simple
task), and if my knowledge is correct, they are covering
and/or are here to cover the 17 languages, to be exact,
the 17 European languages based on a Latin alphabet which
can not be covered with iso-8859-1.

And of course, logically, they are very, very badly handled
with the Flexible String Representation.

jmf

#57824

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-28 14:13 +0000
Message-ID	<mailman.1701.1382969640.18130.python-list@python.org>
In reply to	#57823

On 28/10/2013 14:01, wxjmfauth@gmail.com wrote:
>
> Just as a reminder. There are 1272 characters considered
> as Latin characters (how to count them is not a simple
> task), and if my knowledge is correct, they are covering
> and/or are here to cover the 17 languages, to be exact,
> the 17 European languages based on a Latin alphabet which
> can not be covered with iso-8859-1.
>
> And of course, logically, they are very, very badly handled
> with the Flexible String Representation.
>
> jmf
>

Please provide us with evidence to back up your statement.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#57826

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-10-28 09:23 -0500
Message-ID	<mailman.1702.1382970129.18130.python-list@python.org>
In reply to	#57823

On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>> Simply ignoring diacritics won't get you very far.
> 
> Right. As an example, these four French words :
> cote, côte, coté, côté .

Distinct words with distinct meanings, sure.

But when a naïve (naive? ☺) person or one without the easy ability
to enter characters with diacritics searches for "cote", I want to
return possible matches containing any of your 4 examples.  It's
slightly fuzzier if they search for "coté", in which case they may
mean "coté" or they might be unable to figure out how to
add a hat and want to type "côté". Though I'd rather get more
results, even if it has some that only match fuzzily.

Circumflexually-circumspectly-yers,

-tkc

#57882

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-10-29 05:24 +0000
Message-ID	<526f46a2$0$6512$c3e8da3$5496439d@news.astraweb.com>
In reply to	#57826

On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:

> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>> Simply ignoring diacritics won't get you very far.
>> 
>> Right. As an example, these four French words : cote, côte, coté, côté
>> .
> 
> Distinct words with distinct meanings, sure.
> 
> But when a naïve (naive? ☺) person or one without the easy ability to
> enter characters with diacritics searches for "cote", I want to return
> possible matches containing any of your 4 examples.  It's slightly
> fuzzier if they search for "coté", in which case they may mean "coté" or
> they might be unable to figure out how to add a hat and want to
> type "côté". Though I'd rather get more results, even if it has some
> that only match fuzzily.

The right solution to that is to treat it no differently from other fuzzy 
searches. A good search engine should be tolerant of spelling errors and 
alternative spellings for any letter, not just those with diacritics. 
Ideally, a good search engine would successfully match all three of 
"naïve", "naive" and "niave", and it shouldn't rely on special handling 
of diacritics.
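
As an illustration of that idea, here is a minimal sketch using the standard library's `difflib`, assuming Python 3. The word list and the `fuzzy_lookup` helper are made up for the example; a real search engine would use an index and a tuned similarity measure rather than a linear scan.

```python
import difflib

# Hypothetical vocabulary for the example.
words = ['naïve', 'naive', 'native', 'python']

def fuzzy_lookup(query, candidates, cutoff=0.6):
    """Return candidates whose similarity ratio to query is at least cutoff."""
    return difflib.get_close_matches(query, candidates,
                                     n=len(candidates), cutoff=cutoff)

# The misspelling is matched with no diacritic-specific handling at all.
matches = fuzzy_lookup('niave', words)
```

Here "naïve" and "naive" are both returned for the misspelled query, while an unrelated word such as "python" is not, and nothing in the code knows what a diacritic is.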

-- 
Steven

#58012

From	wxjmfauth@gmail.com
Date	2013-10-30 01:49 -0700
Message-ID	<e018a4c6-e7a5-4356-8929-e26a3fdcb75d@googlegroups.com>
In reply to	#57882

Le mardi 29 octobre 2013 06:24:50 UTC+1, Steven D'Aprano a écrit :
> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
> 
> > On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
> >>> Simply ignoring diacritics won't get you very far.
> >> 
> >> Right. As an example, these four French words : cote, côte, coté, côté
> >> .
> > 
> > Distinct words with distinct meanings, sure.
> > 
> > But when a naïve (naive? ☺) person or one without the easy ability to
> > enter characters with diacritics searches for "cote", I want to return
> > possible matches containing any of your 4 examples.  It's slightly
> > fuzzier if they search for "coté", in which case they may mean "coté" or
> > they might be unable to figure out how to add a hat and want to
> > type "côté". Though I'd rather get more results, even if it has some
> > that only match fuzzily.
> 
> The right solution to that is to treat it no differently from other fuzzy
> searches. A good search engine should be tolerant of spelling errors and
> alternative spellings for any letter, not just those with diacritics.
> Ideally, a good search engine would successfully match all three of
> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> of diacritics.
------

This is nonsense. The purpose of a diacritical mark is to
make a letter a different letter. If a tool is supposed to
match an ô, there is absolutely no reason to match something
else.

jmf

#58033

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-10-30 08:44 -0400
Message-ID	<mailman.1806.1383137592.18130.python-list@python.org>
In reply to	#58012

On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
> Le mardi 29 octobre 2013 06:24:50 UTC+1, Steven D'Aprano a écrit :
>> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>>
>>
>>
>>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>>>> Simply ignoring diacritics won't get you very far.
>>>> Right. As an example, these four French words : cote, côte, coté, côté
>>>> .
>>> Distinct words with distinct meanings, sure.
>>> But when a naïve (naive? ☺) person or one without the easy ability to
>>> enter characters with diacritics searches for "cote", I want to return
>>> possible matches containing any of your 4 examples.  It's slightly
>>> fuzzier if they search for "coté", in which case they may mean "coté" or
>>> they might be unable to figure out how to add a hat and want to
>>> type "côté". Though I'd rather get more results, even if it has some
>>> that only match fuzzily.
>>
>> The right solution to that is to treat it no differently from other fuzzy
>> searches. A good search engine should be tolerant of spelling errors and
>> alternative spellings for any letter, not just those with diacritics.
>> Ideally, a good search engine would successfully match all three of
>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>> of diacritics.
>>
> ------
>
> This is nonsense. The purpose of a diacritical mark is to
> make a letter a different letter. If a tool is supposed to
> match an ô, there is absolutely no reason to match something
> else.
>
> jmf
>

jmf, Tim Chase described his use case, and it seems reasonable to me.  
I'm not sure why you would describe it as nonsense.

--Ned.

#58058

From	wxjmfauth@gmail.com
Date	2013-10-30 09:08 -0700
Message-ID	<d4e620ab-e42c-4939-92b4-0c5c62c0bc8b@googlegroups.com>
In reply to	#58033

Le mercredi 30 octobre 2013 13:44:47 UTC+1, Ned Batchelder a écrit :
> On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
> > Le mardi 29 octobre 2013 06:24:50 UTC+1, Steven D'Aprano a écrit :
> >> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
> >>
> >>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
> >>>>> Simply ignoring diacritics won't get you very far.
> >>>> Right. As an example, these four French words : cote, côte, coté, côté
> >>>> .
> >>> Distinct words with distinct meanings, sure.
> >>> But when a naïve (naive? ☺) person or one without the easy ability to
> >>> enter characters with diacritics searches for "cote", I want to return
> >>> possible matches containing any of your 4 examples.  It's slightly
> >>> fuzzier if they search for "coté", in which case they may mean "coté" or
> >>> they might be unable to figure out how to add a hat and want to
> >>> type "côté". Though I'd rather get more results, even if it has some
> >>> that only match fuzzily.
> >>
> >> The right solution to that is to treat it no differently from other fuzzy
> >> searches. A good search engine should be tolerant of spelling errors and
> >> alternative spellings for any letter, not just those with diacritics.
> >> Ideally, a good search engine would successfully match all three of
> >> "naïve", "naive" and "niave", and it shouldn't rely on special handling
> >> of diacritics.
> >>
> > ------
> >
> > This is nonsense. The purpose of a diacritical mark is to
> > make a letter a different letter. If a tool is supposed to
> > match an ô, there is absolutely no reason to match something
> > else.
> >
> > jmf
> >
> 
> jmf, Tim Chase described his use case, and it seems reasonable to me.
> I'm not sure why you would describe it as nonsense.
> 
> --Ned.

--------

My comment had nothing to do with Python, it was a
general comment. A diacritical mark just makes a letter
a different letter; a "ï" and an "i" are "as
different" as an "a" from a "z". A diacritical mark
is more than a simple ornamentation.

From a unicode perspective.
Unicode.org "knows" these chars are very important; that's
the reason why they exist in two forms, precomposed and
composed forms.
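
The two forms mentioned here can be seen directly in Python 3: a single precomposed code point versus a base letter followed by a combining mark (what Unicode calls the decomposed form). This is a small sketch using the standard `unicodedata` module; the variable names are made up for the example.

```python
import unicodedata

precomposed = '\u00f4'       # ô as a single code point
combining_seq = 'o\u0302'    # o followed by COMBINING CIRCUMFLEX ACCENT

# The raw strings differ, even though they render the same.
same_raw = (precomposed == combining_seq)

# After normalizing both to one form, they compare equal.
same_nfc = (unicodedata.normalize('NFC', combining_seq) == precomposed)
same_nfd = (unicodedata.normalize('NFD', precomposed) == combining_seq)
```

Any comparison of text from mixed sources, strict or sloppy, therefore usually wants a normalization step first.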

From a software perspective.
Luckily for the end users, all serious software
treats all these chars in an equal way. They
all belong to the BMP. An "Ą" is treated
like an "ê": same memory consumption, same performance,
==> very smooth software.

jmf

#58065

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-10-30 16:24 +0000
Message-ID	<mailman.1816.1383150305.18130.python-list@python.org>
In reply to	#58058

On 30/10/2013 16:08, wxjmfauth@gmail.com wrote:

Would you please read, digest and action this 
https://wiki.python.org/moin/GoogleGroupsPython

TIA.

-- 
Python is the second best programming language in the world.
But the best has yet to be invented.  Christian Tismer

Mark Lawrence

#58072

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-10-30 13:10 -0400
Message-ID	<mailman.1819.1383153035.18130.python-list@python.org>
In reply to	#58058

On 10/30/13 12:08 PM, wxjmfauth@gmail.com wrote:
> Le mercredi 30 octobre 2013 13:44:47 UTC+1, Ned Batchelder a écrit :
>> On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
>>
>>> Le mardi 29 octobre 2013 06:24:50 UTC+1, Steven D'Aprano a écrit :
>>>> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>>>>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>>>>>> Simply ignoring diacritics won't get you very far.
>>>>>> Right. As an example, these four French words : cote, côte, coté, côté
>>>>>> .
>>>>> Distinct words with distinct meanings, sure.
>>>>> But when a naïve (naive? ☺) person or one without the easy ability to
>>>>> enter characters with diacritics searches for "cote", I want to return
>>>>> possible matches containing any of your 4 examples.  It's slightly
>>>>> fuzzier if they search for "coté", in which case they may mean "coté" or
>>>>> they might be unable to figure out how to add a hat and want to
>>>>> type "côté". Though I'd rather get more results, even if it has some
>>>>> that only match fuzzily.
>>>> The right solution to that is to treat it no differently from other fuzzy
>>>> searches. A good search engine should be tolerant of spelling errors and
>>>> alternative spellings for any letter, not just those with diacritics.
>>>> Ideally, a good search engine would successfully match all three of
>>>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>>>> of diacritics.
>>> ------
>>> This is nonsense. The purpose of a diacritical mark is to
>>> make a letter a different letter. If a tool is supposed to
>>> match an ô, there is absolutely no reason to match something
>>> else.
>>> jmf
>>
>> jmf, Tim Chase described his use case, and it seems reasonable to me.
>> I'm not sure why you would describe it as nonsense.
>>
>> --Ned.
> --------
>
> My comment had nothing to do with Python, it was a
> general comment. A diacritical mark just makes a letter
> a different letter; a "ï" and an "i" are "as
> different" as an "a" from a "z". A diacritical mark
> is more than a simple ornamentation.

Yes, we understand that.  Tim outlined a need that had to do with users' 
informal typing.  In his case, he needs to deal with that sloppiness.  
You can't simply insist that users be more precise.

Unicode is a way to represent text, and text gets used in many different 
ways.  Each of us has to acknowledge that our text needs may be 
different than someone else's.  jmf, I'm guessing from your comments 
over the last few months that you are doing detailed linguistic work 
with corpora in many languages.  That work leads to one style of Unicode 
use.  In your domain, it is "nonsense" to ignore diacriticals.

Other people do different kinds of work with Unicode, and that leads to 
different needs.  In Tim's system, it is important to ignore 
diacriticals.  You might not have a use personally for Tim's system.  
That doesn't make it nonsense.

--Ned.