From: Ned Batchelder
Date: Wed, 30 Oct 2013 13:10:23 -0400
To: wxjmfauth@gmail.com, python-list@python.org
Subject: Re: trying to strip out non ascii.. or rather convert non ascii
Newsgroups: comp.lang.python

On 10/30/13 12:08 PM, wxjmfauth@gmail.com wrote:
> On Wednesday, 30 October 2013 13:44:47 UTC+1, Ned Batchelder wrote:
>> On 10/30/13 4:49 AM, wxjmfauth@gmail.com wrote:
>>
>>> On Tuesday, 29 October 2013 06:24:50 UTC+1, Steven D'Aprano wrote:
>>>> On Mon, 28 Oct 2013 09:23:41 -0500, Tim Chase wrote:
>>>>> On 2013-10-28 07:01, wxjmfauth@gmail.com wrote:
>>>>>>> Simply ignoring diacritics won't get you very far.
>>>>>> Right. As an example, these four French words: cote, côte, coté, côté.
>>>>> Distinct words with distinct meanings, sure.
>>>>> But when a naïve (naive? ☺) person or one without the easy ability to
>>>>> enter characters with diacritics searches for "cote", I want to return
>>>>> possible matches containing any of your 4 examples. It's slightly
>>>>> fuzzier if they search for "coté", in which case they may mean "coté" or
>>>>> they might be unable to figure out how to add a hat and want to
>>>>> type "côté". Though I'd rather get more results, even if it has some
>>>>> that only match fuzzily.
>>>> The right solution to that is to treat it no differently from other fuzzy
>>>> searches. A good search engine should be tolerant of spelling errors and
>>>> alternative spellings for any letter, not just those with diacritics.
>>>> Ideally, a good search engine would successfully match all three of
>>>> "naïve", "naive" and "niave", and it shouldn't rely on special handling
>>>> of diacritics.
>>> ------
>>> This is nonsense. The purpose of a diacritical mark is to
>>> make a letter a different letter. If a tool is supposed to
>>> match an ô, there is absolutely no reason to match something
>>> else.
>>> jmf
>>
>>
>> jmf, Tim Chase described his use case, and it seems reasonable to me.
>>
>> I'm not sure why you would describe it as nonsense.
>>
>>
>>
>> --Ned.
> --------
>
> My comment had nothing to do with Python, it was a
> general comment. A diacritical mark just makes a letter
> a different letter; an "ï" and an "i" are "as
> different" as an "a" from a "z". A diacritical mark
> is more than a simple ornamentation.

Yes, we understand that. Tim outlined a need that had to do with users'
informal typing. In his case, he needs to deal with that sloppiness.
You can't simply insist that users be more precise.

Unicode is a way to represent text, and text gets used in many different
ways. Each of us has to acknowledge that our text needs may be
different than someone else's.

jmf, I'm guessing from your comments over the last few months that you
are doing detailed linguistic work with corpora in many languages. That
work leads to one style of Unicode use. In your domain, it is
"nonsense" to ignore diacriticals.

Other people do different kinds of work with Unicode, and that leads to
different needs. In Tim's system, it is important to ignore
diacriticals. You might not have a use personally for Tim's system.
That doesn't make it nonsense.

--Ned.

> From a unicode perspective.
> Unicode.org "knows" these chars are very important; that's
> the reason why they exist in two forms, precomposed and
> composed forms.
>
> From a software perspective.
> Luckily for the end users, all serious software
> treats all these chars in an equal way. They
> all belong to the BMP. An "Ą" is treated
> as an "ê", same memory consumption, same performance,
> ==> very smooth software.
>
> jmf
>
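
For what it's worth, the kind of loose matching Tim described can be sketched in Python roughly as follows. This is only an illustration, not code from anyone in this thread: `fold_diacritics` and `loose_match` are hypothetical names, and the sketch assumes that dropping combining marks after NFD normalization is an acceptable meaning of "ignoring diacriticals" for the application.

```python
import unicodedata


def fold_diacritics(text):
    """Return text with combining marks removed (hypothetical helper)."""
    # NFD splits a precomposed character such as "ô" into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks, so this
    # keeps only the base characters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def loose_match(query, candidates):
    """Return the candidates equal to query once diacritics and case are ignored."""
    key = fold_diacritics(query).casefold()
    return [word for word in candidates if fold_diacritics(word).casefold() == key]


words = ["cote", "côte", "coté", "côté", "chat"]
matches = loose_match("cote", words)
print(matches)
```

With this sample list, searching for "cote" returns the four French words and leaves out "chat". Because both sides are folded, searching for "coté" also returns all four, which is the "rather get more results" behaviour Tim mentioned. A system like the one jmf describes would simply compare the strings unfolded (ideally after normalizing both to the same form).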