Re: Comparing inputs with source strings

Newsgroups perl.unicode
Subject Re: Comparing inputs with source strings
To perl-unicode@perl.org
References <87y47j1dwk.fsf@hati.baby-gnu.org> <5732B99E.9090302@khwilliamson.com> <87d1otyqa6.fsf@hati.baby-gnu.org>
Message-ID <57339B49.9060205@khwilliamson.com>
Date Wed, 11 May 2016 14:51:21 -0600
From public@khwilliamson.com (Karl Williamson)


On 05/11/2016 02:04 AM, Daniel Dehennin wrote:
> Karl Williamson <public@khwilliamson.com> writes:
>
>> On 05/09/2016 08:53 AM, Daniel Dehennin wrote:
>>> Hello,
>>>
>>> I tried to make my Perl5 code unicode compliant after reading a post on
>>> stackoverflow[1].
>>>
>>> As suggested in the post:
>>>
>>>       “always run incoming stuff through NFD and outbound stuff from NFC.”
>>>
>>> I had a hard time working out why my Test::More test was failing while
>>> displaying exactly the same strings for “got” and “expected”.
>>>
>>> I finally checked how UTF-8 sources are handled and found that they are in
>>> NFC form. I ran the following script:
>
> [...]
>
>> I'm afraid that when it comes to normalization in Perl5, you have to
>> do it yourself.  I hear that Perl6 is much friendlier in this regard,
>> but I have no personal experience with it.  Your $unistring is in
>> whatever normalization you made it when you typed it into your editor,
>> or whatever your editor did with it as you were typing.  You could
>> have typed it in NFD, but probably the most natural way to enter
>> things on your keyboard will, underneath it all, produce NFC.
>
> That's what I finally found out in another post. Normally all my inputs
> are NFD, but my tests used static strings to match, so I declared them
> in NFD to make it explicit.
>
> I added a note in my POD to signal that the sub returns NFD strings.

I forgot to mention that if you're just dealing with collation, it may 
be that comparisons actually work properly regardless of normalization, 
if you are doing the comparisons within the scope of 'use locale' and 
the locale is recognized by Perl5 to be a UTF-8 locale.  It depends on 
the libc implementation for your platform.  There are bugs in Perl5's 
handling of these, however, which I have fixes for, and expect to put 
into the latest development version, called blead, within the next week 
or two.
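
Where the locale route is not available or not reliable on a given
platform, the do-it-yourself fallback discussed earlier in this thread is
to bring both strings to the same normalization form before comparing
them. A minimal sketch, assuming the core Unicode::Normalize module;
same_text is a hypothetical helper name used only for illustration:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

# Hypothetical helper (not from this thread): compare two character
# strings after converting both to NFD, so the form they arrived in
# does not affect the result.
sub same_text {
    my ($left, $right) = @_;
    return NFD($left) eq NFD($right);
}

my $composed   = "\x{E9}";     # "é" as a single code point (NFC)
my $decomposed = "e\x{301}";   # "e" + COMBINING ACUTE ACCENT (NFD)

# Plain eq compares code points, so the two spellings differ.
my $raw_equal  = $composed eq $decomposed ? 1 : 0;

# After normalizing both sides, they compare equal.
my $norm_equal = same_text($composed, $decomposed) ? 1 : 0;

print "raw eq: $raw_equal, normalized eq: $norm_equal\n";
```

This prints "raw eq: 0, normalized eq: 1". It only covers equality;
ordering according to a locale's collation rules is a separate question
and still depends on 'use locale' and the platform's libc as described
above.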
>
>> Normalization is tricky, and the Unicode Consortium has had to modify
>> things years after they were first specified, because no one could
>> reasonably implement what was expected.  I may tackle getting
>> normalization to be more developer friendly in future Perl5 versions,
>> but not in the next couple of years.
>
> Thanks, as soon as my little work project is working well I'll try to
> redo it in Perl6.
>
> Regards.
>


Thread

Comparing inputs with source strings daniel.dehennin@baby-gnu.org (Daniel Dehennin) - 2016-05-09 16:53 +0200
  Re: Comparing inputs with source strings daniel.dehennin@baby-gnu.org (Daniel Dehennin) - 2016-05-10 13:21 +0200
    Re: Comparing inputs with source strings daniel.dehennin@baby-gnu.org (Daniel Dehennin) - 2016-05-10 13:45 +0200
  Re: Comparing inputs with source strings public@khwilliamson.com (Karl Williamson) - 2016-05-10 22:48 -0600
    Re: Comparing inputs with source strings daniel.dehennin@baby-gnu.org (Daniel Dehennin) - 2016-05-11 10:04 +0200
      Re: Comparing inputs with source strings public@khwilliamson.com (Karl Williamson) - 2016-05-11 14:51 -0600
