Groups > perl.unicode > #202 > unrolled thread

Re: Encode UTF-8 optimizations

Started by	pali@cpan.org
First post	2016-08-12 17:31 +0200
Last post	2016-11-01 10:53 +0100
Articles	20 on this page of 25 — 3 participants

Back to article view | Back to perl.unicode

  Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-12 17:31 +0200
    Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-18 23:06 -0600
      Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-19 10:42 +0200
        Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-20 19:10 -0600
          Re: Encode UTF-8 optimizations pagaltzis@gmx.de (Aristotle Pagaltzis) - 2016-08-21 04:33 +0200
            Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-20 20:55 -0600
          Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-21 10:34 +0200
            Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-21 08:49 -0600
              Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 15:05 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 13:43 -0600
                  Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 22:47 +0200
                    Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 15:19 -0600
                      Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 15:38 -0600
                        Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 23:45 +0200
                      Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 23:39 +0200
                    Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-24 22:49 -0600
                      Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-25 09:48 +0200
                        Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-29 09:00 -0600
                          Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-31 23:43 +0200
                            Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-31 21:27 -0600
                              Re: Encode UTF-8 optimizations pali@cpan.org - 2016-09-01 09:30 +0200
                                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-09-25 12:06 +0200
                                  Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-09-25 10:49 -0600
                                    Re: Encode UTF-8 optimizations pali@cpan.org - 2016-10-27 10:25 +0200
                                      Re: Encode UTF-8 optimizations pali@cpan.org - 2016-11-01 10:53 +0100

Page 1 of 2 [1] 2 Next page →

#202 — Re: Encode UTF-8 optimizations

From	pali@cpan.org
Date	2016-08-12 17:31 +0200
Subject	Re: Encode UTF-8 optimizations
Message-ID	<201608121731.32716@pali>

On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> On 07/09/2016 05:12 PM, pali@cpan.org wrote:
> >Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> >and Encode::encode("UTF-8", ...) should be used instead. Also opening
> >file should be done by :encoding(UTF-8) layer instead :utf8.
> >
> >But UTF-8 strict implementation in Encode module is horrible slow when
> >comparing to utf8::encode(). It is implemented in Encode.xs file and for
> >benchmarking can be this XS implementation called directly by:
> >
> >  use Encode;
> >  my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
> >
> >(without overhead of Encode module...)
> >
> >Here are my results on 160 bytes long input string:
> >
> >  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  
0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
> >  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  
0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
> >  utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 
10256410.26/s (n=4000000)
> >
> >I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> >and did some optimizations. Final results are:
> >
> >  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  
0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
> >  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  
0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
> >  utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 
10000000.00/s (n=4000000)
> >
> >Patches are on github at pull request:
> >https://github.com/dankogai/p5-encode/pull/56
> >
> >I would like if somebody review my patches and tell if this is the
> >right way for optimizations...
> >
> 
> I'm sorry that this slipped off my radar until I saw it in the new Encode
> release
> 
> There are a couple of things I see wrong with your patch.
> 
> 1) It does not catch the malformation of an overlong sequence.  This is a
> serious malformation which has been used for attacks.  Basically, after you
> get the result, you need to check that it is the expected length for that
> result.  For example, \xC2\x80 will have an input length of 2, and evaluates
> to \x00, whose expected length is 1, and so the input is overlong.  In
> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> passed-in length.  This can be rewritten for perls back to 5.8 using
> UNI_SKIP and UNI_TO_NATIVE

I do not see where can be a problem. At least I think my patches should
be compatible with previous implementation of Encode.xs...

First UTF8_IS_INVARIANT is checked and one character processed.

Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
sequence. And then len-1 characters are checked if they pass test for
UTF8_IS_CONTINUATION.

If there are less characters then following does not
UTF8_IS_CONTINUATION and error is reported. If there are more, then next
iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
UTF8_IS_START.

Can you describe in details what do you think it wrong and how to do
that attack?

> 2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
> start, but the result uv needs to be transformed back to native, using
> UNI_TO_NATIVE(uv).

uv is used just to check if it is valid Unicode code point. Real value
is used only for error/warn message. Previous implementation used
utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.

> 3) The assumptions the subroutine runs under need to be documented for
> future maintainers and code readers.  For example, it assumes that there is
> enough space in the input to hold all the bytes.

Function process_utf8 does not assume that. It calls SvGROW to increase
buffer size when needed.

> Other than that, it looks ok to me.  But, to be sure, I think you should run
> it on the tests included in the core t/op/utf8decode.t which came from an
> internet repository of edge cases.

How to use and run that test with Encode?

[toc] | [next] | [standalone]

#204

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-18 23:06 -0600
Message-ID	<8a4fb1c9-615e-2cf7-2fa2-28e3ff637406@khwilliamson.com>
In reply to	#202

On 08/12/2016 09:31 AM, pali@cpan.org wrote:
> On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
>> On 07/09/2016 05:12 PM, pali@cpan.org wrote:
>>> Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
>>> and Encode::encode("UTF-8", ...) should be used instead. Also opening
>>> file should be done by :encoding(UTF-8) layer instead :utf8.
>>>
>>> But UTF-8 strict implementation in Encode module is horrible slow when
>>> comparing to utf8::encode(). It is implemented in Encode.xs file and for
>>> benchmarking can be this XS implementation called directly by:
>>>
>>>  use Encode;
>>>  my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
>>>
>>> (without overhead of Encode module...)
>>>
>>> Here are my results on 160 bytes long input string:
>>>
>>>  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +
> 0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
>>>  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +
> 0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
>>>  utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @
> 10256410.26/s (n=4000000)
>>>
>>> I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
>>> and did some optimizations. Final results are:
>>>
>>>  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +
> 0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
>>>  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +
> 0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
>>>  utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @
> 10000000.00/s (n=4000000)
>>>
>>> Patches are on github at pull request:
>>> https://github.com/dankogai/p5-encode/pull/56
>>>
>>> I would like if somebody review my patches and tell if this is the
>>> right way for optimizations...
>>>
>>
>> I'm sorry that this slipped off my radar until I saw it in the new Encode
>> release
>>
>> There are a couple of things I see wrong with your patch.
>>
>> 1) It does not catch the malformation of an overlong sequence.  This is a
>> serious malformation which has been used for attacks.  Basically, after you
>> get the result, you need to check that it is the expected length for that
>> result.  For example, \xC2\x80 will have an input length of 2, and evaluates
>> to \x00, whose expected length is 1, and so the input is overlong.  In
>> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
>> passed-in length.  This can be rewritten for perls back to 5.8 using
>> UNI_SKIP and UNI_TO_NATIVE
>
> I do not see where can be a problem. At least I think my patches should
> be compatible with previous implementation of Encode.xs...
>
> First UTF8_IS_INVARIANT is checked and one character processed.
>
> Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
> sequence. And then len-1 characters are checked if they pass test for
> UTF8_IS_CONTINUATION.
>
> If there are less characters then following does not
> UTF8_IS_CONTINUATION and error is reported. If there are more, then next
> iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
> UTF8_IS_START.
>
> Can you describe in details what do you think it wrong and how to do
> that attack?

This discussion has been active at
https://github.com/dankogai/p5-encode/issues/64

For the curious out there, please refer to that discussion.  My bottom 
line is that I have come to believe the security risks are too high to 
have modules do their own security checking for UTF-8 correctness.
>
>> 2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
>> start, but the result uv needs to be transformed back to native, using
>> UNI_TO_NATIVE(uv).
>
> uv is used just to check if it is valid Unicode code point. Real value
> is used only for error/warn message. Previous implementation used
> utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.

As I said on that other thread, if this is really true, then it's faster 
to use a boolean function to verify the inputs.  Also, performance 
should not be a consideration for errors or warnings.  One can check 
validity fast; and then spend the time getting the message right in the 
rare cases where a message is generated.
>
>> 3) The assumptions the subroutine runs under need to be documented for
>> future maintainers and code readers.  For example, it assumes that there is
>> enough space in the input to hold all the bytes.
>
> Function process_utf8 does not assume that. It calls SvGROW to increase
> buffer size when needed.

You misunderstand what I meant here.  The bottom line is your patch adds 
a significant amount of code without any comments in a risky area.  The 
name of the function does not indicate that its value is to be thrown 
away, and even after looking at the code that calls it some more, it's 
not obvious to me that the value isn't kept.  All subtleties in code 
should be commented in that code.  To do otherwise is a disservice to 
future maintainers.  I personally will never push to blead someone's 
commit that I think  unfairly burdens future maintainers.  One of the 
subtleties of this function is that it doesn't check that it is not 
running off the end of the buffer.  It relies on the caller to do that 
check, but someone coming along might see that function and think from 
its name that it's more general purpose than it actually is.  Someone 
looking at its name and return value would likely think it generates a 
valid code point from UTF-8 input.
>
>> Other than that, it looks ok to me.  But, to be sure, I think you should run
>> it on the tests included in the core t/op/utf8decode.t which came from an
>> internet repository of edge cases.

I later realized that under non-strict calls, this can overflow, and 
there is some code in your amendments to this patch that check that.  I 
have not looked those over.

But again, I don't think Encode should undertake its own security 
checking.  I'm willing to work with you to get something in core that 
adequately meets Encode's needs.
>
> How to use and run that test with Encode?

It looks like you figured that out for the most part in your amended 
patches.
>

[toc] | [prev] | [next] | [standalone]

#205

From	pali@cpan.org
Date	2016-08-19 10:42 +0200
Message-ID	<20160819084221.GA5236@atrey.karlin.mff.cuni.cz>
In reply to	#204

On Thursday 18 August 2016 23:06:27 Karl Williamson wrote:
> On 08/12/2016 09:31 AM, pali@cpan.org wrote:
> >On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> >>On 07/09/2016 05:12 PM, pali@cpan.org wrote:
> >>>Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> >>>and Encode::encode("UTF-8", ...) should be used instead. Also opening
> >>>file should be done by :encoding(UTF-8) layer instead :utf8.
> >>>
> >>>But UTF-8 strict implementation in Encode module is horrible slow when
> >>>comparing to utf8::encode(). It is implemented in Encode.xs file and for
> >>>benchmarking can be this XS implementation called directly by:
> >>>
> >>> use Encode;
> >>> my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
> >>>
> >>>(without overhead of Encode module...)
> >>>
> >>>Here are my results on 160 bytes long input string:
> >>>
> >>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +
> >0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
> >>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +
> >0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
> >>> utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @
> >10256410.26/s (n=4000000)
> >>>
> >>>I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> >>>and did some optimizations. Final results are:
> >>>
> >>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +
> >0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
> >>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +
> >0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
> >>> utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @
> >10000000.00/s (n=4000000)
> >>>
> >>>Patches are on github at pull request:
> >>>https://github.com/dankogai/p5-encode/pull/56
> >>>
> >>>I would like if somebody review my patches and tell if this is the
> >>>right way for optimizations...
> >>>
> >>
> >>I'm sorry that this slipped off my radar until I saw it in the new Encode
> >>release
> >>
> >>There are a couple of things I see wrong with your patch.
> >>
> >>1) It does not catch the malformation of an overlong sequence.  This is a
> >>serious malformation which has been used for attacks.  Basically, after you
> >>get the result, you need to check that it is the expected length for that
> >>result.  For example, \xC2\x80 will have an input length of 2, and evaluates
> >>to \x00, whose expected length is 1, and so the input is overlong.  In
> >>modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> >>passed-in length.  This can be rewritten for perls back to 5.8 using
> >>UNI_SKIP and UNI_TO_NATIVE
> >
> >I do not see where can be a problem. At least I think my patches should
> >be compatible with previous implementation of Encode.xs...
> >
> >First UTF8_IS_INVARIANT is checked and one character processed.
> >
> >Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
> >sequence. And then len-1 characters are checked if they pass test for
> >UTF8_IS_CONTINUATION.
> >
> >If there are less characters then following does not
> >UTF8_IS_CONTINUATION and error is reported. If there are more, then next
> >iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
> >UTF8_IS_START.
> >
> >Can you describe in details what do you think it wrong and how to do
> >that attack?
> 
> This discussion has been active at
> https://github.com/dankogai/p5-encode/issues/64
> 
> For the curious out there, please refer to that discussion.  My bottom line
> is that I have come to believe the security risks are too high to have
> modules do their own security checking for UTF-8 correctness.
> >
> >>2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
> >>start, but the result uv needs to be transformed back to native, using
> >>UNI_TO_NATIVE(uv).
> >
> >uv is used just to check if it is valid Unicode code point. Real value
> >is used only for error/warn message. Previous implementation used
> >utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.
> 
> As I said on that other thread, if this is really true, then it's faster to
> use a boolean function to verify the inputs.

Value of uv is used only in warn/error message.

> Also, performance should not
> be a consideration for errors or warnings.  One can check validity fast; and
> then spend the time getting the message right in the rare cases where a
> message is generated.

Yes, fully I agree.

> >>3) The assumptions the subroutine runs under need to be documented for
> >>future maintainers and code readers.  For example, it assumes that there is
> >>enough space in the input to hold all the bytes.
> >
> >Function process_utf8 does not assume that. It calls SvGROW to increase
> >buffer size when needed.
> 
> You misunderstand what I meant here.  The bottom line is your patch adds a
> significant amount of code without any comments in a risky area.  The name
> of the function does not indicate that its value is to be thrown away, and
> even after looking at the code that calls it some more, it's not obvious to
> me that the value isn't kept.  All subtleties in code should be commented in
> that code.  To do otherwise is a disservice to future maintainers.  I
> personally will never push to blead someone's commit that I think  unfairly
> burdens future maintainers.  One of the subtleties of this function is that
> it doesn't check that it is not running off the end of the buffer.  It
> relies on the caller to do that check, but someone coming along might see
> that function and think from its name that it's more general purpose than it
> actually is.  Someone looking at its name and return value would likely
> think it generates a valid code point from UTF-8 input.

It is intended to do not add another non-needed checks as it again slow
down function performance. Checks are done before on another place and
that function is designed to be used only in Encode's process_utf8.

So rather write comments/description about that function. But on the
other hand, I dislike comments which just write what is function doing
in case that comments are longer then function code itself. Then it is
easier to read function code itself and so whole comment is useless...

> >>Other than that, it looks ok to me.  But, to be sure, I think you should run
> >>it on the tests included in the core t/op/utf8decode.t which came from an
> >>internet repository of edge cases.
> 
> I later realized that under non-strict calls, this can overflow, and there
> is some code in your amendments to this patch that check that.  I have not
> looked those over.

Yes, check for overlong and overflow is there...

> But again, I don't think Encode should undertake its own security checking.
> I'm willing to work with you to get something in core that adequately meets
> Encode's needs.

Ok. We can discuss about it. First I see there big problems:

1) Encode module is for Perl 5.8+. In 5.8+ perl's versions will never be
   your (new) functions, so Encode module needs to have at least copy of
   them. And this does not solve problem which you want to prevent :-(

2) Functions must be defined and declared in some header file. So C
   compiler can inline and optimize them in Encode module. Calling such
   hot function via shared library must be avoided.

3) Such function needs to be designed in way that it do only what is
   needed in case for Encode module. On the other hand I do not think it
   is good idea to create special function just for Encode module living
   in core perl... But maybe general-useful function can be designed.

> >How to use and run that test with Encode?
> 
> It looks like you figured that out for the most part in your amended
> patches.

Yes, meanwhile you wrote reply, I figured out about those problems and
also copied those tests :-) Anyway, test for EBCDIC are missing in core.

[toc] | [prev] | [next] | [standalone]

#206

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-20 19:10 -0600
Message-ID	<bcfddfeb-5045-95b8-b283-7a243542bcab@khwilliamson.com>
In reply to	#205

[Multipart message — attachments visible in raw view] — view raw

Top posting.

Attached is my alternative patch.  It effectively uses a different 
algorithm to avoid decoding the input into code points, and to copy all 
spans of valid input at once, instead of character at a time.

And it uses only currently available functions.  Any of these that are 
missing or buggy in previous perls can and will be dealt with by the 
Devel::PPPort mechanism.

On 08/19/2016 02:42 AM, pali@cpan.org wrote:
> On Thursday 18 August 2016 23:06:27 Karl Williamson wrote:
>> On 08/12/2016 09:31 AM, pali@cpan.org wrote:
>>> On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
>>>> On 07/09/2016 05:12 PM, pali@cpan.org wrote:
>>>>> Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
>>>>> and Encode::encode("UTF-8", ...) should be used instead. Also opening
>>>>> file should be done by :encoding(UTF-8) layer instead :utf8.
>>>>>
>>>>> But UTF-8 strict implementation in Encode module is horrible slow when
>>>>> comparing to utf8::encode(). It is implemented in Encode.xs file and for
>>>>> benchmarking can be this XS implementation called directly by:
>>>>>
>>>>> use Encode;
>>>>> my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
>>>>>
>>>>> (without overhead of Encode module...)
>>>>>
>>>>> Here are my results on 160 bytes long input string:
>>>>>
>>>>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +
>>> 0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
>>>>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +
>>> 0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
>>>>> utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @
>>> 10256410.26/s (n=4000000)
>>>>>
>>>>> I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
>>>>> and did some optimizations. Final results are:
>>>>>
>>>>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +
>>> 0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
>>>>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +
>>> 0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
>>>>> utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @
>>> 10000000.00/s (n=4000000)
>>>>>
>>>>> Patches are on github at pull request:
>>>>> https://github.com/dankogai/p5-encode/pull/56
>>>>>
>>>>> I would like if somebody review my patches and tell if this is the
>>>>> right way for optimizations...
>>>>>
>>>>
>>>> I'm sorry that this slipped off my radar until I saw it in the new Encode
>>>> release
>>>>
>>>> There are a couple of things I see wrong with your patch.
>>>>
>>>> 1) It does not catch the malformation of an overlong sequence.  This is a
>>>> serious malformation which has been used for attacks.  Basically, after you
>>>> get the result, you need to check that it is the expected length for that
>>>> result.  For example, \xC2\x80 will have an input length of 2, and evaluates
>>>> to \x00, whose expected length is 1, and so the input is overlong.  In
>>>> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
>>>> passed-in length.  This can be rewritten for perls back to 5.8 using
>>>> UNI_SKIP and UNI_TO_NATIVE
>>>
>>> I do not see where can be a problem. At least I think my patches should
>>> be compatible with previous implementation of Encode.xs...
>>>
>>> First UTF8_IS_INVARIANT is checked and one character processed.
>>>
>>> Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
>>> sequence. And then len-1 characters are checked if they pass test for
>>> UTF8_IS_CONTINUATION.
>>>
>>> If there are less characters then following does not
>>> UTF8_IS_CONTINUATION and error is reported. If there are more, then next
>>> iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
>>> UTF8_IS_START.
>>>
>>> Can you describe in details what do you think it wrong and how to do
>>> that attack?
>>
>> This discussion has been active at
>> https://github.com/dankogai/p5-encode/issues/64
>>
>> For the curious out there, please refer to that discussion.  My bottom line
>> is that I have come to believe the security risks are too high to have
>> modules do their own security checking for UTF-8 correctness.
>>>
>>>> 2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
>>>> start, but the result uv needs to be transformed back to native, using
>>>> UNI_TO_NATIVE(uv).
>>>
>>> uv is used just to check if it is valid Unicode code point. Real value
>>> is used only for error/warn message. Previous implementation used
>>> utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.
>>
>> As I said on that other thread, if this is really true, then it's faster to
>> use a boolean function to verify the inputs.
>
> Value of uv is used only in warn/error message.
>
>> Also, performance should not
>> be a consideration for errors or warnings.  One can check validity fast; and
>> then spend the time getting the message right in the rare cases where a
>> message is generated.
>
> Yes, fully I agree.
>
>>>> 3) The assumptions the subroutine runs under need to be documented for
>>>> future maintainers and code readers.  For example, it assumes that there is
>>>> enough space in the input to hold all the bytes.
>>>
>>> Function process_utf8 does not assume that. It calls SvGROW to increase
>>> buffer size when needed.
>>
>> You misunderstand what I meant here.  The bottom line is your patch adds a
>> significant amount of code without any comments in a risky area.  The name
>> of the function does not indicate that its value is to be thrown away, and
>> even after looking at the code that calls it some more, it's not obvious to
>> me that the value isn't kept.  All subtleties in code should be commented in
>> that code.  To do otherwise is a disservice to future maintainers.  I
>> personally will never push to blead someone's commit that I think  unfairly
>> burdens future maintainers.  One of the subtleties of this function is that
>> it doesn't check that it is not running off the end of the buffer.  It
>> relies on the caller to do that check, but someone coming along might see
>> that function and think from its name that it's more general purpose than it
>> actually is.  Someone looking at its name and return value would likely
>> think it generates a valid code point from UTF-8 input.
>
> It is intended to do not add another non-needed checks as it again slow
> down function performance. Checks are done before on another place and
> that function is designed to be used only in Encode's process_utf8.
>
> So rather write comments/description about that function. But on the
> other hand, I dislike comments which just write what is function doing
> in case that comments are longer then function code itself. Then it is
> easier to read function code itself and so whole comment is useless...
>
>>>> Other than that, it looks ok to me.  But, to be sure, I think you should run
>>>> it on the tests included in the core t/op/utf8decode.t which came from an
>>>> internet repository of edge cases.
>>
>> I later realized that under non-strict calls, this can overflow, and there
>> is some code in your amendments to this patch that check that.  I have not
>> looked those over.
>
> Yes, check for overlong and overflow is there...
>
>> But again, I don't think Encode should undertake its own security checking.
>> I'm willing to work with you to get something in core that adequately meets
>> Encode's needs.
>
> Ok. We can discuss about it. First I see there big problems:
>
> 1) Encode module is for Perl 5.8+. In 5.8+ perl's versions will never be
>    your (new) functions, so Encode module needs to have at least copy of
>    them. And this does not solve problem which you want to prevent :-(
>
> 2) Functions must be defined and declared in some header file. So C
>    compiler can inline and optimize them in Encode module. Calling such
>    hot function via shared library must be avoided.
>
> 3) Such function needs to be designed in way that it do only what is
>    needed in case for Encode module. On the other hand I do not think it
>    is good idea to create special function just for Encode module living
>    in core perl... But maybe general-useful function can be designed.
>
>>> How to use and run that test with Encode?
>>
>> It looks like you figured that out for the most part in your amended
>> patches.
>
> Yes, meanwhile you wrote reply, I figured out about those problems and
> also copied those tests :-) Anyway, test for EBCDIC are missing in core.
>

[toc] | [prev] | [next] | [standalone]

#207

From	pagaltzis@gmx.de (Aristotle Pagaltzis)
Date	2016-08-21 04:33 +0200
Message-ID	<20160821023319.GA33610@plasmasturm.org>
In reply to	#206

* Karl Williamson <public@khwilliamson.com> [2016-08-21 03:12]:
> That should be done anyway to make sure we've got less buggy Unicode
> handling code available to older modules.

I think you meant “available to older perls”?

[toc] | [prev] | [next] | [standalone]

#208

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-20 20:55 -0600
Message-ID	<15c10fed-2476-a308-1081-8fc4de292885@khwilliamson.com>
In reply to	#207

On 08/20/2016 08:33 PM, Aristotle Pagaltzis wrote:
> * Karl Williamson <public@khwilliamson.com> [2016-08-21 03:12]:
>> That should be done anyway to make sure we've got less buggy Unicode
>> handling code available to older modules.
>
> I think you meant “available to older perls”?
>

Yes, thanks

[toc] | [prev] | [next] | [standalone]

#209

From	pali@cpan.org
Date	2016-08-21 10:34 +0200
Message-ID	<201608211034.35281@pali>
In reply to	#206

On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> Top posting.
> 
> Attached is my alternative patch.  It effectively uses a different
> algorithm to avoid decoding the input into code points, and to copy
> all spans of valid input at once, instead of character at a time.
> 
> And it uses only currently available functions.

And that's the problem. As already wrote in previous email, calling 
function from shared library cannot be heavy optimized as inlined 
function and cause slow down. You are calling is_utf8_string_loc for 
non-strict mode which is not inlined and so encode/decode of non-strict 
mode will be slower...

And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which 
is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which 
cannot be inlined too...

Therefore I think this is not good approach...

[toc] | [prev] | [next] | [standalone]

#210

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-21 08:49 -0600
Message-ID	<5ddbd58f-4d4c-c58e-71d3-188f46a4e052@khwilliamson.com>
In reply to	#209

On 08/21/2016 02:34 AM, pali@cpan.org wrote:
> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
>> Top posting.
>>
>> Attached is my alternative patch.  It effectively uses a different
>> algorithm to avoid decoding the input into code points, and to copy
>> all spans of valid input at once, instead of character at a time.
>>
>> And it uses only currently available functions.
>
> And that's the problem. As already wrote in previous email, calling
> function from shared library cannot be heavy optimized as inlined
> function and cause slow down. You are calling is_utf8_string_loc for
> non-strict mode which is not inlined and so encode/decode of non-strict
> mode will be slower...
>
> And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which
> is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which
> cannot be inlined too...
>
> Therefore I think this is not good approach...
>

Then you should run your benchmarks to find out the performance.

On valid input, is_utf8_string_loc() is called once per string.  The 
function call overhead and non-inlining should be not noticeable.

On valid input, is_utf8_char_slow() is never called.  The used-parts can 
be inlined.

On invalid input, performance should be a minor consideration.

The inner loop is much tighter in both functions; likely it can be held 
in the cache.  The algorithm avoids a bunch of work compared to the 
previous one.  I doubt that it will be slower than that.  The only way 
to know in any performance situation is to actually test.  And know that 
things will be different depending on the underlying hardware, so only 
large differences are really significant.

[toc] | [prev] | [next] | [standalone]

#211

From	pali@cpan.org
Date	2016-08-22 15:05 +0200
Message-ID	<20160822130518.GA9176@pali>
In reply to	#210

On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> On 08/21/2016 02:34 AM, pali@cpan.org wrote:
> >On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>Top posting.
> >>
> >>Attached is my alternative patch.  It effectively uses a different
> >>algorithm to avoid decoding the input into code points, and to copy
> >>all spans of valid input at once, instead of character at a time.
> >>
> >>And it uses only currently available functions.
> >
> >And that's the problem. As already wrote in previous email, calling
> >function from shared library cannot be heavy optimized as inlined
> >function and cause slow down. You are calling is_utf8_string_loc for
> >non-strict mode which is not inlined and so encode/decode of non-strict
> >mode will be slower...
> >
> >And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which
> >is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which
> >cannot be inlined too...
> >
> >Therefore I think this is not good approach...
> >
> 
> Then you should run your benchmarks to find out the performance.

You are right, benchmarks are needed to show final results.

> On valid input, is_utf8_string_loc() is called once per string.  The
> function call overhead and non-inlining should be not noticeable.

Ah right, I misread it as it is called per one valid sequence, not for
whole string. You are right.

> On valid input, is_utf8_char_slow() is never called.  The used-parts can be
> inlined.

Yes, but this function is there to be called primary on unknown input
which does not have to be valid. If I know that input is valid then
utf8::encode/decode is enough :-)

> On invalid input, performance should be a minor consideration.

See below...

> The inner loop is much tighter in both functions; likely it can be held in
> the cache.  The algorithm avoids a bunch of work compared to the previous
> one.

Right, for valid input algorithm is really faster. If it is because of
less case misses... maybe... I can play with perf or another tool to
look what is bottle neck now.

> I doubt that it will be slower than that.  The only way to know in any
> performance situation is to actually test.  And know that things will be
> different depending on the underlying hardware, so only large differences
> are really significant.

So, here are my test results. You can say that they are subjective, but
I would be happy if somebody provide better input for performance tests.

Abbreviations:
strict = Encode::encode/decode "UTF-8"
lax = Encode::encode/decode "utf8"
int = utf8::encode/decode
orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d

Test cases:
all = join "", map { chr } 0 .. 0x10FFFF
short = "žluťoučký kůň pěl ďábelské ódy " x 45
long = $short x 1000
invalid-short = "\xA0" x 1000
invalid-long = "\xA0" x 1000000

Encoding was called on string with Encode::_utf8_on() flag.

Rates:

encode:
                   all       short     long  invalid-short invalid-long
orig - strict      41/s    124533/s    132/s     115197/s        172/s
your - strict     176/s    411523/s    427/s      54813/s         66/s
my   - strict      80/s    172712/s    186/s     113787/s        138/s

orig - lax       1010/s   3225806/s   6250/s     546800/s       5151/s
your - lax        952/s   3225806/s   5882/s     519325/s       4919/s
my   - lax       1060/s   3125000/s   6250/s     645119/s       5009/s

orig - int    8154604/s  10000000/s    infty    9787566/s    9748151/s
your - int    9135243/s  11111111/s    infty    8922821/s    9737657/s
my   - int    9779395/s  10000000/s    infty    9822046/s    8949861/s

decode:
                   all       short     long  invalid-short invalid-long
orig - strict      39/s    119048/s    131/s     108574/s        171/s
your - strict     173/s    353357/s    442/s      42440/s         55/s
my   - strict      69/s    166667/s    182/s     117291/s        135/s

orig - lax         39/s    123609/s    137/s     127302/s        172/s
your - lax        230/s    393701/s    495/s      37346/s         65/s
my   - lax         79/s    158983/s    180/s     121456/s        138/s

orig - int        274/s    546448/s    565/s    8219513/s      12357/s
your - int        273/s    540541/s    562/s    7226066/s      12948/s
my   - int        274/s    543478/s    562/s    8502902/s      12421/s

int is there just for verifications of tests as utf8::encode/decode
functions was not changed.

Results are: your patch is faster for valid sequences (as you wrote
above), but slower for invalid (in some cases radically).

So I would propose two optimizations:

1) Change macro isUTF8_CHAR in is_strict_utf8_string_loc() with some new
   which does not call utf8n_to_uvchr. That call is not needed as in
   that case sequence is already invalid.

2) Try to make inline version of function is_utf8_string_loc(). Maybe
   merge with is_strict_utf8_string_loc()? That should boost non strict
   decoder for invalid sequences (now it is slower then strict one).

And maybe it could make sense make all needed functions as part of
public API.

Are you going to prepare pull request for Encode module?

Anyway, how it behave on EBCDIC platforms? And maybe another question
what should  Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
UTF-8 or to UTF-EBCDIC?

[toc] | [prev] | [next] | [standalone]

#213

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-22 13:43 -0600
Message-ID	<62e8d4d6-b474-b037-5d77-5f67d3e20371@khwilliamson.com>
In reply to	#211

On 08/22/2016 07:05 AM, pali@cpan.org wrote:
> On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
>> On 08/21/2016 02:34 AM, pali@cpan.org wrote:
>>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
>>>> Top posting.
>>>>
>>>> Attached is my alternative patch.  It effectively uses a different
>>>> algorithm to avoid decoding the input into code points, and to copy
>>>> all spans of valid input at once, instead of character at a time.
>>>>
>>>> And it uses only currently available functions.
>>>
>>> And that's the problem. As already wrote in previous email, calling
>>> function from shared library cannot be heavy optimized as inlined
>>> function and cause slow down. You are calling is_utf8_string_loc for
>>> non-strict mode which is not inlined and so encode/decode of non-strict
>>> mode will be slower...
>>>
>>> And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which
>>> is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which
>>> cannot be inlined too...
>>>
>>> Therefore I think this is not good approach...
>>>
>>
>> Then you should run your benchmarks to find out the performance.
>
> You are right, benchmarks are needed to show final results.
>
>> On valid input, is_utf8_string_loc() is called once per string.  The
>> function call overhead and non-inlining should be not noticeable.
>
> Ah right, I misread it as it is called per one valid sequence, not for
> whole string. You are right.

It is called once per valid sequence.  See below.

>
>> On valid input, is_utf8_char_slow() is never called.  The used-parts can be
>> inlined.
>
> Yes, but this function is there to be called primary on unknown input
> which does not have to be valid. If I know that input is valid then
> utf8::encode/decode is enough :-)

What process_utf8() does is to copy the alleged UTF-8 input to the 
output, verifying along the way that it actually is legal UTF-8 (with 2 
levels of strictness, depending on the input parameter), and taking some 
actions (exactly what depends on other input parameters) if and when it 
finds invalid UTF-8.

The way it works after my patch is like an instruction pipeline.  You 
start it up, and it stays in the pipeline as long as the next character 
in the input is legal or it reaches the end.  When it finds illegal 
input, it drops out of the pipeline, handles that, and starts up the 
pipeline to process any remaining input.  If the entire input string is 
valid, a single instance of the pipeline is all that gets invoked.

The use-case I envision is that the input is supposed to be valid UTF-8, 
and the purpose of process_utf8() is to verify that that is in fact 
true, and to take specified actions when it isn't.  Under that use-case, 
taking longer to deal with invalid input is not a problem.  If that is 
not your use-case, please explain what yours is.

And I think you misunderstand when is_utf8_char_slow() is called.  It is 
called only when the next byte in the input indicates that the only 
legal UTF-8 that might follow would be for a code point that is at least 
U+200000, almost twice as high as the highest legal Unicode code point. 
It is a Perl extension to handle such code points, unlike other 
languages.  But the Perl core is not optimized for them, nor will it be. 
  My point is that is_utf8_char_slow() will only be called in very 
specialized cases, and we need not make those cases have as good a 
performance as normal ones.

>> On invalid input, performance should be a minor consideration.
>
> See below...

See above. :)

>
>> The inner loop is much tighter in both functions; likely it can be held in
>> the cache.  The algorithm avoids a bunch of work compared to the previous
>> one.
>
> Right, for valid input algorithm is really faster. If it is because of
> less case misses... maybe... I can play with perf or another tool to
> look what is bottle neck now.
>
>> I doubt that it will be slower than that.  The only way to know in any
>> performance situation is to actually test.  And know that things will be
>> different depending on the underlying hardware, so only large differences
>> are really significant.
>
> So, here are my test results. You can say that they are subjective, but
> I would be happy if somebody provide better input for performance tests.
>
> Abbreviations:
> strict = Encode::encode/decode "UTF-8"
> lax = Encode::encode/decode "utf8"
> int = utf8::encode/decode
> orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
>
> Test cases:
> all = join "", map { chr } 0 .. 0x10FFFF
> short = "žluťoučký kůň pěl ďábelské ódy " x 45
> long = $short x 1000
> invalid-short = "\xA0" x 1000
> invalid-long = "\xA0" x 1000000
>
> Encoding was called on string with Encode::_utf8_on() flag.
>
>
> Rates:
>
> encode:
>                    all       short     long  invalid-short invalid-long
> orig - strict      41/s    124533/s    132/s     115197/s        172/s
> your - strict     176/s    411523/s    427/s      54813/s         66/s
> my   - strict      80/s    172712/s    186/s     113787/s        138/s
>
> orig - lax       1010/s   3225806/s   6250/s     546800/s       5151/s
> your - lax        952/s   3225806/s   5882/s     519325/s       4919/s
> my   - lax       1060/s   3125000/s   6250/s     645119/s       5009/s
>
> orig - int    8154604/s  10000000/s    infty    9787566/s    9748151/s
> your - int    9135243/s  11111111/s    infty    8922821/s    9737657/s
> my   - int    9779395/s  10000000/s    infty    9822046/s    8949861/s
>
>
> decode:
>                    all       short     long  invalid-short invalid-long
> orig - strict      39/s    119048/s    131/s     108574/s        171/s
> your - strict     173/s    353357/s    442/s      42440/s         55/s
> my   - strict      69/s    166667/s    182/s     117291/s        135/s
>
> orig - lax         39/s    123609/s    137/s     127302/s        172/s
> your - lax        230/s    393701/s    495/s      37346/s         65/s
> my   - lax         79/s    158983/s    180/s     121456/s        138/s
>
> orig - int        274/s    546448/s    565/s    8219513/s      12357/s
> your - int        273/s    540541/s    562/s    7226066/s      12948/s
> my   - int        274/s    543478/s    562/s    8502902/s      12421/s
>
>
> int is there just for verifications of tests as utf8::encode/decode
> functions was not changed.
>
> Results are: your patch is faster for valid sequences (as you wrote
> above), but slower for invalid (in some cases radically).
>
> So I would propose two optimizations:
>
> 1) Change macro isUTF8_CHAR in is_strict_utf8_string_loc() with some new
>    which does not call utf8n_to_uvchr. That call is not needed as in
>    that case sequence is already invalid.

And an optimizing compiler should figure that out and omit the call, so 
we shouldn't have to manually.  In the next few months I will be working 
on some fixes to the :utf8 layer.  That could lead to a core function 
similar to this, but without relying on the compiler to figure things out.
>
> 2) Try to make inline version of function is_utf8_string_loc(). Maybe
>    merge with is_strict_utf8_string_loc()? That should boost non strict
>    decoder for invalid sequences (now it is slower then strict one).

I'll look at which functions in utf8.c should be inlined.  This is a 
candidate.  But I doubt that that is why this is slower in this case. 
Read blead's perlhack.pod for performance testing tool hints.

Your comments caused me to realize that the call to utf8n_to_uvchr() 
when the input is syntactically valid, but is an illegal code point 
(like a surrogate, etc) could be replaced by the faster 
valid_utf8_to_uvchr(), which has not been in the public API.  I have 
pushed to

http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode

a version of blead which has this function made public, and inlined, in 
case you want to test with it.

There are some things in the error handling code that could be improved 
upon.  For example, memcpy() of a single character is slower than just 
copying a byte at a time, as the function call overhead dwarfs the savings.

>
> And maybe it could make sense make all needed functions as part of
> public API.

Yes

>
> Are you going to prepare pull request for Encode module?

I will, after everything is settled.  This may include changes to 
Devel::PPPort to make sure this works on all perls that Encode supports.

However, I think it would be good to get the extra tests of malformed 
inputs into the distribution.  (I haven't looked at your PR yet for 
flaws.  I'll do that, and hopefully will find none, so will recommend 
your PR be pulled.)
>
>
> Anyway, how it behave on EBCDIC platforms? And maybe another question
> what should  Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> UTF-8 or to UTF-EBCDIC?

It works fine on EBCDIC platforms.  There are other bugs in Encode on 
EBCDIC that I plan on investigating as time permits.  Doing this has 
fixed some of these for free.  The uvuni() functions should in almost 
all instances be uvchr(), and my patch does that.

On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if 
you prefer), so $str will effectively be in the version of UTF-EBCDIC 
valid for the platform it is running on (there are differences depending 
on the platform's underlying code page).
>

[toc] | [prev] | [next] | [standalone]

#214

From	pali@cpan.org
Date	2016-08-22 22:47 +0200
Message-ID	<201608222247.46077@pali>
In reply to	#213

On Monday 22 August 2016 21:43:59 Karl Williamson wrote:
> On 08/22/2016 07:05 AM, pali@cpan.org wrote:
> > On Sunday 21 August 2016 08:49:08 Karl Williamson wrote:
> >> On 08/21/2016 02:34 AM, pali@cpan.org wrote:
> >>> On Sunday 21 August 2016 03:10:40 Karl Williamson wrote:
> >>>> Top posting.
> >>>>
> >>>> Attached is my alternative patch.  It effectively uses a different
> >>>> algorithm to avoid decoding the input into code points, and to copy
> >>>> all spans of valid input at once, instead of character at a time.
> >>>>
> >>>> And it uses only currently available functions.
> >>>
> >>> And that's the problem. As already wrote in previous email, calling
> >>> function from shared library cannot be heavy optimized as inlined
> >>> function and cause slow down. You are calling is_utf8_string_loc for
> >>> non-strict mode which is not inlined and so encode/decode of non-strict
> >>> mode will be slower...
> >>>
> >>> And also in is_strict_utf8_string_loc you are calling isUTF8_CHAR which
> >>> is calling _is_utf8_char_slow and which is calling utf8n_to_uvchr which
> >>> cannot be inlined too...
> >>>
> >>> Therefore I think this is not good approach...
> >>>
> >>
> >> Then you should run your benchmarks to find out the performance.
> >
> > You are right, benchmarks are needed to show final results.
> >
> >> On valid input, is_utf8_string_loc() is called once per string.  The
> >> function call overhead and non-inlining should be not noticeable.
> >
> > Ah right, I misread it as it is called per one valid sequence, not for
> > whole string. You are right.
> 
> It is called once per valid sequence.  See below.
> 
> >
> >> On valid input, is_utf8_char_slow() is never called.  The used-parts can be
> >> inlined.
> >
> > Yes, but this function is there to be called primary on unknown input
> > which does not have to be valid. If I know that input is valid then
> > utf8::encode/decode is enough :-)
> 
> What process_utf8() does is to copy the alleged UTF-8 input to the 
> output, verifying along the way that it actually is legal UTF-8 (with 2 
> levels of strictness, depending on the input parameter), and taking some 
> actions (exactly what depends on other input parameters) if and when it 
> finds invalid UTF-8.
> 
> The way it works after my patch is like an instruction pipeline.  You 
> start it up, and it stays in the pipeline as long as the next character 
> in the input is legal or it reaches the end.  When it finds illegal 
> input, it drops out of the pipeline, handles that, and starts up the 
> pipeline to process any remaining input.  If the entire input string is 
> valid, a single instance of the pipeline is all that gets invoked.

Yes, I figured out how it works.

> The use-case I envision is that the input is supposed to be valid UTF-8, 
> and the purpose of process_utf8() is to verify that that is in fact 
> true, and to take specified actions when it isn't.

Right!

> Under that use-case, 
> taking longer to deal with invalid input is not a problem.  If that is 
> not your use-case, please explain what yours is.

Basically Encode::decode("UTF-8", $input) is used for converting "untrusted" input 
sequence (e.g. from network or local file) to perl Unicode scalar. And if input 
contains something invalid, then Encode::decode do anything needed to return valid 
Unicode string (= replace invalid subsequences by Unicode replacement character). So 
Encode::decode("UTF-8", $input) is there for processing any input sequences, not only 
valid, also broken or totally invalid.

> And I think you misunderstand when is_utf8_char_slow() is called.  It is 
> called only when the next byte in the input indicates that the only 
> legal UTF-8 that might follow would be for a code point that is at least 
> U+200000, almost twice as high as the highest legal Unicode code point. 
> It is a Perl extension to handle such code points, unlike other 
> languages.  But the Perl core is not optimized for them, nor will it be. 
>   My point is that is_utf8_char_slow() will only be called in very 
> specialized cases, and we need not make those cases have as good a 
> performance as normal ones.

In strict mode, there is absolutely no need to call is_utf8_char_slow(). As in strict 
mode such sequence must be always invalid (it is above last valid Unicode character) 
This is what I tried to tell.

And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR() (which could call 
is_utf8_char_slow()) and after that is check for UTF8_IS_SUPER().

So maybe it could make sense to provide some "strict" version of isUTF8_CHAR() macro as 
it such strict version does not have to call is_utf8_char_slow().

> >> On invalid input, performance should be a minor consideration.
> >
> > See below...
> 
> See above. :)
> 
> >
> >> The inner loop is much tighter in both functions; likely it can be held in
> >> the cache.  The algorithm avoids a bunch of work compared to the previous
> >> one.
> >
> > Right, for valid input algorithm is really faster. If it is because of
> > less case misses... maybe... I can play with perf or another tool to
> > look what is bottle neck now.
> >
> >> I doubt that it will be slower than that.  The only way to know in any
> >> performance situation is to actually test.  And know that things will be
> >> different depending on the underlying hardware, so only large differences
> >> are really significant.
> >
> > So, here are my test results. You can say that they are subjective, but
> > I would be happy if somebody provide better input for performance tests.
> >
> > Abbreviations:
> > strict = Encode::encode/decode "UTF-8"
> > lax = Encode::encode/decode "utf8"
> > int = utf8::encode/decode
> > orig = commit 92d73bfab7792718f9ad5c5dc54013176ed9c76b
> > your = orig + 0001-Speed-up-Encode-UTF-8-validation-checking.patch
> > my = orig + revert commit c8247c27c13d1cf152398e453793a91916d2185d
> >
> > Test cases:
> > all = join "", map { chr } 0 .. 0x10FFFF
> > short = "žluťoučký kůň pěl ďábelské ódy " x 45
> > long = $short x 1000
> > invalid-short = "\xA0" x 1000
> > invalid-long = "\xA0" x 1000000
> >
> > Encoding was called on string with Encode::_utf8_on() flag.
> >
> >
> > Rates:
> >
> > encode:
> >                    all       short     long  invalid-short invalid-long
> > orig - strict      41/s    124533/s    132/s     115197/s        172/s
> > your - strict     176/s    411523/s    427/s      54813/s         66/s
> > my   - strict      80/s    172712/s    186/s     113787/s        138/s
> >
> > orig - lax       1010/s   3225806/s   6250/s     546800/s       5151/s
> > your - lax        952/s   3225806/s   5882/s     519325/s       4919/s
> > my   - lax       1060/s   3125000/s   6250/s     645119/s       5009/s
> >
> > orig - int    8154604/s  10000000/s    infty    9787566/s    9748151/s
> > your - int    9135243/s  11111111/s    infty    8922821/s    9737657/s
> > my   - int    9779395/s  10000000/s    infty    9822046/s    8949861/s
> >
> >
> > decode:
> >                    all       short     long  invalid-short invalid-long
> > orig - strict      39/s    119048/s    131/s     108574/s        171/s
> > your - strict     173/s    353357/s    442/s      42440/s         55/s
> > my   - strict      69/s    166667/s    182/s     117291/s        135/s
> >
> > orig - lax         39/s    123609/s    137/s     127302/s        172/s
> > your - lax        230/s    393701/s    495/s      37346/s         65/s
> > my   - lax         79/s    158983/s    180/s     121456/s        138/s
> >
> > orig - int        274/s    546448/s    565/s    8219513/s      12357/s
> > your - int        273/s    540541/s    562/s    7226066/s      12948/s
> > my   - int        274/s    543478/s    562/s    8502902/s      12421/s
> >
> >
> > int is there just for verifications of tests as utf8::encode/decode
> > functions was not changed.
> >
> > Results are: your patch is faster for valid sequences (as you wrote
> > above), but slower for invalid (in some cases radically).
> >
> > So I would propose two optimizations:
> >
> > 1) Change macro isUTF8_CHAR in is_strict_utf8_string_loc() with some new
> >    which does not call utf8n_to_uvchr. That call is not needed as in
> >    that case sequence is already invalid.
> 
> And an optimizing compiler should figure that out and omit the call, so 
> we shouldn't have to manually.  In the next few months I will be working 
> on some fixes to the :utf8 layer.  That could lead to a core function 
> similar to this, but without relying on the compiler to figure things out.

Great! But :utf8 layer is cannot be used for reading "untrusted" arbitrary files... 

> > 2) Try to make inline version of function is_utf8_string_loc(). Maybe
> >    merge with is_strict_utf8_string_loc()? That should boost non strict
> >    decoder for invalid sequences (now it is slower then strict one).
> 
> I'll look at which functions in utf8.c should be inlined.  This is a 
> candidate.  But I doubt that that is why this is slower in this case. 
> Read blead's perlhack.pod for performance testing tool hints.
> 
> Your comments caused me to realize that the call to utf8n_to_uvchr() 
> when the input is syntactically valid, but is an illegal code point 
> (like a surrogate, etc) could be replaced by the faster 
> valid_utf8_to_uvchr(), which has not been in the public API.  I have 
> pushed to
> 
> http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode
> 
> a version of blead which has this function made public, and inlined, in 
> case you want to test with it.
> 
> There are some things in the error handling code that could be improved 
> upon.  For example, memcpy() of a single character is slower than just 
> copying a byte at a time, as the function call overhead dwarfs the savings.

This one memcpy in OK. It is called on constant string and I checked that gcc 4.6 with 
-O2 on x86_64 already replace that call by instructions. No memcpy function call is 
used.

But maybe it could be rewritten... no idea if other compilers have optimizations for 
strlen/strcpy/memcpy functions on constant strings and in case when length is known.

> >
> > And maybe it could make sense make all needed functions as part of
> > public API.
> 
> Yes
> 
> >
> > Are you going to prepare pull request for Encode module?
> 
> I will, after everything is settled.  This may include changes to 
> Devel::PPPort to make sure this works on all perls that Encode supports.
> 
> However, I think it would be good to get the extra tests of malformed 
> inputs into the distribution.  (I haven't looked at your PR yet for 
> flaws.  I'll do that, and hopefully will find none, so will recommend 
> your PR be pulled.)

I added some tests for overlong sequences. Only for ASCII platforms, tests for EBCDIC 
are missing (sorry, I do not have access to any EBCDIC platform for testing).

> > Anyway, how it behave on EBCDIC platforms? And maybe another question
> > what should  Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> > UTF-8 or to UTF-EBCDIC?
> 
> It works fine on EBCDIC platforms.  There are other bugs in Encode on 
> EBCDIC that I plan on investigating as time permits.  Doing this has 
> fixed some of these for free.  The uvuni() functions should in almost 
> all instances be uvchr(), and my patch does that.

Now I'm thinking if FBCHAR_UTF8 define is working also on EBCDIC... I think that it 
should be different for UTF-EBCDIC.

> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if 
> you prefer), so $str will effectively be in the version of UTF-EBCDIC 
> valid for the platform it is running on (there are differences depending 
> on the platform's underlying code page).

So it means that on EBCDIC platforms you cannot process file which is encoded in UTF-8? 
As Encode::decode("UTF-8", $str) expect $str to be in UTF-EBCDIC and not in UTF-8 (as I 
understood).

[toc] | [prev] | [next] | [standalone]

#215

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-22 15:19 -0600
Message-ID	<4e654c06-2b51-31a8-c2ab-f98c4dcf421d@khwilliamson.com>
In reply to	#214

On 08/22/2016 02:47 PM, pali@cpan.org wrote:
>> > And I think you misunderstand when is_utf8_char_slow() is called.  It is
>> > called only when the next byte in the input indicates that the only
>> > legal UTF-8 that might follow would be for a code point that is at least
>> > U+200000, almost twice as high as the highest legal Unicode code point.
>> > It is a Perl extension to handle such code points, unlike other
>> > languages.  But the Perl core is not optimized for them, nor will it be.
>> >   My point is that is_utf8_char_slow() will only be called in very
>> > specialized cases, and we need not make those cases have as good a
>> > performance as normal ones.
> In strict mode, there is absolutely no need to call is_utf8_char_slow(). As in strict
> mode such sequence must be always invalid (it is above last valid Unicode character)
> This is what I tried to tell.
>
> And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR() (which could call
> is_utf8_char_slow()) and after that is check for UTF8_IS_SUPER().

I only have time to respond to this portion just now.

The code could be tweaked to call UTF8_IS_SUPER first, but I'm asserting 
that an optimizing compiler will see that any call to 
is_utf8_char_slow() is pointless, and will optimize it out.

[toc] | [prev] | [next] | [standalone]

#216

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-22 15:38 -0600
Message-ID	<1d1eb897-c5d7-ff15-5ba0-44c9a4a6393e@khwilliamson.com>
In reply to	#215

On 08/22/2016 03:19 PM, Karl Williamson wrote:
> On 08/22/2016 02:47 PM, pali@cpan.org wrote:
>>> > And I think you misunderstand when is_utf8_char_slow() is called.
>>> It is
>>> > called only when the next byte in the input indicates that the only
>>> > legal UTF-8 that might follow would be for a code point that is at
>>> least
>>> > U+200000, almost twice as high as the highest legal Unicode code
>>> point.
>>> > It is a Perl extension to handle such code points, unlike other
>>> > languages.  But the Perl core is not optimized for them, nor will
>>> it be.
>>> >   My point is that is_utf8_char_slow() will only be called in very
>>> > specialized cases, and we need not make those cases have as good a
>>> > performance as normal ones.
>> In strict mode, there is absolutely no need to call
>> is_utf8_char_slow(). As in strict
>> mode such sequence must be always invalid (it is above last valid
>> Unicode character)
>> This is what I tried to tell.
>>
>> And currently is_strict_utf8_string_loc() first calls isUTF8_CHAR()
>> (which could call
>> is_utf8_char_slow()) and after that is check for UTF8_IS_SUPER().
>
> I only have time to respond to this portion just now.
>
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm asserting
> that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.
>

Now, I'm realizing I'm wrong.  It can't be optimized out by the compiler 
because it is not declared (nor can it be) to be a pure function.  And, 
I'd rather not tweak it to call UTF8_IS_SUPER first, because that relies 
on knowing what the current internal implementation is.

But I still argue that it is fine the way it is.  It will only get 
called for code points much higher than Unicode, and the performance on 
those should not affect our decisions in any way.

[toc] | [prev] | [next] | [standalone]

#218

From	pali@cpan.org
Date	2016-08-22 23:45 +0200
Message-ID	<201608222345.07572@pali>
In reply to	#216

On Monday 22 August 2016 23:38:05 Karl Williamson wrote:
> And, I'd rather not tweak it to call UTF8_IS_SUPER first,
> because that relies on knowing what the current internal
> implementation is.

Then maybe add new macro isUTF8_CHAR_STRICT which only check if 
character is strictly valid UTF-8? I think that such macro can be 
useful...

[toc] | [prev] | [next] | [standalone]

#217

From	pali@cpan.org
Date	2016-08-22 23:39 +0200
Message-ID	<201608222339.51155@pali>
In reply to	#215

(this only applies for strict UTF-8)

On Monday 22 August 2016 23:19:51 Karl Williamson wrote:
> The code could be tweaked to call UTF8_IS_SUPER first, but I'm
> asserting that an optimizing compiler will see that any call to
> is_utf8_char_slow() is pointless, and will optimize it out.

Such optimization cannot be done and compiler cannot know such thing...

You have this code:

+        const STRLEN char_len = isUTF8_CHAR(x, send);
+
+        if (    UNLIKELY(! char_len)
+            || (    UNLIKELY(isUTF8_POSSIBLY_PROBLEMATIC(*x))
+                && (   UNLIKELY(UTF8_IS_SURROGATE(x, send))
+                    || UNLIKELY(UTF8_IS_SUPER(x, send))
+                    || UNLIKELY(UTF8_IS_NONCHAR(x, send)))))
+        {
+            *ep = x;
+            return FALSE;
+        }

Here isUTF8_CHAR() macro will call function is_utf8_char_slow() if 
condition IS_UTF8_CHAR_FAST(UTF8SKIP(x))) is truth. And because 
is_utf8_char_slow() is external library function compiler has absolutely 
no idea what that function is doing. In non-functional world such 
function could have side effect, etc and compiler really cannot 
eliminate that call.

Moving UTF8_IS_SUPER before isUTF8_CHAR maybe could help, but I'm septic 
if gcc really can propagate constant from PL_utf8skip[] array back and 
prove that IS_UTF8_CHAR_FAST must be always true when UTF8_IS_SUPER is 
true too...

Rather add IS_UTF8_CHAR_FAST(UTF8SKIP(s))) check (or similar) before 
isUTF8_CHAR() call. That should totally eliminate generating code with 
call to is_utf8_char_slow() function.

With UTF8_IS_SUPER there can be branch in binary code which never will 
be evaluated.

[toc] | [prev] | [next] | [standalone]

#219

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-24 22:49 -0600
Message-ID	<40d6edc2-8dd7-d869-2387-577d67783ef3@khwilliamson.com>
In reply to	#214

On 08/22/2016 02:47 PM, pali@cpan.org wrote:

snip

> I added some tests for overlong sequences. Only for ASCII platforms, tests for EBCDIC
> are missing (sorry, I do not have access to any EBCDIC platform for testing).

It's fine to skip those tests on EBCDIC.
>
>>> > > Anyway, how it behave on EBCDIC platforms? And maybe another question
>>> > > what should  Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
>>> > > UTF-8 or to UTF-EBCDIC?
>> >
>> > It works fine on EBCDIC platforms.  There are other bugs in Encode on
>> > EBCDIC that I plan on investigating as time permits.  Doing this has
>> > fixed some of these for free.  The uvuni() functions should in almost
>> > all instances be uvchr(), and my patch does that.
> Now I'm thinking if FBCHAR_UTF8 define is working also on EBCDIC... I think that it
> should be different for UTF-EBCDIC.

I'll fix that
>
>> > On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
>> > you prefer), so $str will effectively be in the version of UTF-EBCDIC
>> > valid for the platform it is running on (there are differences depending
>> > on the platform's underlying code page).
> So it means that on EBCDIC platforms you cannot process file which is encoded in UTF-8?
> As Encode::decode("UTF-8", $str) expect $str to be in UTF-EBCDIC and not in UTF-8 (as I
> understood).
>
Yes.  The two worlds do not meet.  If you are on an EBCDIC platform, the 
native encoding is UTF-EBCDIC tailored to the code page the platform 
runs on.

In searching, I did not find anything that converts between the two, so 
I wrote a Perl script to do so.  Our OS/390 man, Yaroslav, wrote one in C.

[toc] | [prev] | [next] | [standalone]

#220

From	pali@cpan.org
Date	2016-08-25 09:48 +0200
Message-ID	<20160825074833.GA27810@pali>
In reply to	#219

On Wednesday 24 August 2016 22:49:21 Karl Williamson wrote:
> On 08/22/2016 02:47 PM, pali@cpan.org wrote:
> 
> snip
> 
> >I added some tests for overlong sequences. Only for ASCII platforms, tests for EBCDIC
> >are missing (sorry, I do not have access to any EBCDIC platform for testing).
> 
> It's fine to skip those tests on EBCDIC.

Ok.

> >>>> > Anyway, how it behave on EBCDIC platforms? And maybe another question
> >>>> > what should  Encode::encode('UTF-8', $str) do on EBCDIC? Encode $str to
> >>>> > UTF-8 or to UTF-EBCDIC?
> >>>
> >>> It works fine on EBCDIC platforms.  There are other bugs in Encode on
> >>> EBCDIC that I plan on investigating as time permits.  Doing this has
> >>> fixed some of these for free.  The uvuni() functions should in almost
> >>> all instances be uvchr(), and my patch does that.
> >Now I'm thinking if FBCHAR_UTF8 define is working also on EBCDIC... I think that it
> >should be different for UTF-EBCDIC.
> 
> I'll fix that
> >
> >>> On EBCDIC platforms, UTF-8 is defined to be UTF-EBCDIC (or vice versa if
> >>> you prefer), so $str will effectively be in the version of UTF-EBCDIC
> >>> valid for the platform it is running on (there are differences depending
> >>> on the platform's underlying code page).
> >So it means that on EBCDIC platforms you cannot process file which is encoded in UTF-8?
> >As Encode::decode("UTF-8", $str) expect $str to be in UTF-EBCDIC and not in UTF-8 (as I
> >understood).
> >
> Yes.  The two worlds do not meet.  If you are on an EBCDIC platform, the
> native encoding is UTF-EBCDIC tailored to the code page the platform runs
> on.
> 
> In searching, I did not find anything that converts between the two, so I
> wrote a Perl script to do so.  Our OS/390 man, Yaroslav, wrote one in C.

Thank you for information! I though that "UTF-8" encoding (with hyphen)
is that strict and correct UTF-8 version on both ASCII & EBCDIC
platforms as in Encode documentation is nothing written that on EBCDIC
is is different...

Anyway, if you need some help with Encode module or something different,
let me know. As I want to have UTF-8 support in Encode correctly
working...

[toc] | [prev] | [next] | [standalone]

#221

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-29 09:00 -0600
Message-ID	<e3b394fe-77a8-51f8-7793-318f50a88f69@khwilliamson.com>
In reply to	#220

On 08/25/2016 01:48 AM, pali@cpan.org wrote:
> Anyway, if you need some help with Encode module or something different,
> let me know. As I want to have UTF-8 support in Encode correctly
> working...

I now have a branch with my proposed changes at:
http://perl5.git.perl.org/perl.git/shortlog/refs/heads/smoke-me/khw-encode

If you'd be willing to test this out, especially the performance parts 
that would be great!

There are a bunch of commits.  Most significantly, I inlined several of 
the short functions formerly in utf8.c.  I realized that isUTF8_CHAR 
could be implemented differently for large code points that allowed it 
to never have to decode the UTF-8 into a UV.  I presume that is faster.
The function valid_utf8_to_uvchr() is now documented, inlined, and used 
in the Encode code where appropriate.  This avoids the overhead of 
checking for errors while decoding when they've already been ruled out.

There are 2 experimental performance commits.  If you want to see if 
they actually improve performance by doing a before/after compare that 
would be nice.

One unrolls the loop in valid_utf8_to_uvchr().  I tried doing this, but 
not this way some time ago, and it made no appreciable difference.  But 
now that it is inlined, that may be different.  On the other hand, the 
unrolling makes the function bigger, and the compiler may now refuse to 
inline it.

And the other commit changes utf8n_to_uvchr() to first call 
isUTF8_CHAR(), and if it passes, call valid_utf8_to_uvchr().  Again, 
there should be less overhead for the normal case where the input is 
well-formed, at the expense of slowing down somewhat malformed input. 
It's unclear if this is worth changing or not.

[toc] | [prev] | [next] | [standalone]

#222

From	pali@cpan.org
Date	2016-08-31 23:43 +0200
Message-ID	<201608312343.43936@pali>
In reply to	#221

On Monday 29 August 2016 17:00:00 Karl Williamson wrote:
> If you'd be willing to test this out, especially the performance
> parts that would be great!
[snip]
> There are 2 experimental performance commits.  If you want to see if
> they actually improve performance by doing a before/after compare
> that would be nice.

So here are my results:

strict = bless({strict_utf8 => 1}, "Encode::utf8")->encode_xs/decode_xs
lax    = bless({strict_utf8 => 0}, "Encode::utf8")->encode_xs/decode_xs
int    = utf8::encode/decode

all    = join "", map { chr } 0 .. 0x10FFFF
short  = "žluťoučký kůň pěl ďábelské ódy " x 45
long   = $short x 1000
ishort = "\xA0" x 1000
ilong  = "\xA0" x 1000000

your   = 9c03449800417dd02cc1af613951a1002490a52a
orig   = f16e7fa35c1302aa056db5d8d022b7861c1dd2e8
my     = orig without c8247c27c13d1cf152398e453793a91916d2185d
your1  = your without b65e9a52d8b428146ee554d724b9274f8e77286c
your2  = your without 9ccc3ecd1119ccdb64e91b1f03376916aa8cc6f7


decode
                          all            ilong          ishort         long           short 
    my: - int           285.94/s     14988.61/s   4694109.54/s       704.15/s    599678.93/s
  orig: - int           292.41/s     15121.98/s   4782883.50/s       494.33/s    553182.28/s
 your1: - int           271.21/s     14232.25/s   4706722.93/s       599.68/s    554941.90/s
 your2: - int           280.85/s     14090.33/s   4210573.40/s       593.93/s    558487.86/s
  your: - int           283.23/s     15121.98/s   4500252.51/s       691.95/s    678859.55/s

                          all            ilong          ishort         long           short 
    my: - lax            83.28/s       202.22/s    142049.67/s       181.82/s    163352.41/s
  orig: - lax            53.49/s       201.58/s    152422.11/s       147.13/s    133974.37/s
 your1: - lax           255.13/s        53.75/s     47590.82/s       560.34/s    431447.77/s
 your2: - lax           281.71/s        48.41/s     43260.19/s       634.16/s    445365.29/s
  your: - lax           286.96/s        46.35/s     42848.40/s       632.20/s    442546.52/s

                          all            ilong          ishort         long           short 
    my: - strict         90.48/s       200.00/s    143081.15/s       197.53/s    175800.00/s
  orig: - strict         49.21/s       202.22/s    149447.34/s       142.81/s    128290.63/s
 your1: - strict        154.94/s        48.16/s     44237.93/s       191.36/s    169228.16/s
 your2: - strict        158.75/s        40.06/s     37244.06/s       195.95/s    173588.68/s
  your: - strict        158.26/s        38.54/s     36898.14/s       195.95/s    172504.61/s


encode
                          all            ilong          ishort         long           short 
    my: - int       5197722.67/s   5227338.26/s   5210583.97/s   5163520.62/s   5227338.26/s
  orig: - int       5449888.54/s   5381336.48/s   5370254.05/s   5449888.54/s   5301624.60/s
 your1: - int       5244200.62/s   5293830.28/s   5277183.02/s   5361483.07/s   5260640.13/s
 your2: - int       5435994.67/s   5432587.30/s   5398312.30/s   5487602.22/s   5606457.74/s
  your: - int       5261172.17/s   5327441.90/s   5310582.91/s   5310582.91/s   5361483.07/s

                          all            ilong          ishort         long           short 
    my: - lax          2442.24/s     15084.08/s   2882995.00/s      7993.15/s   2716293.65/s
  orig: - lax          2438.39/s     15121.98/s   2933419.33/s      7965.22/s   2665521.81/s
 your1: - lax          2229.94/s     14908.60/s   2117316.51/s      7428.89/s   2011133.75/s
 your2: - lax          2400.92/s     15121.98/s   3046739.87/s      8065.41/s   2742961.18/s
  your: - lax          2368.00/s     15168.94/s   2862328.67/s      8090.85/s   2685694.50/s

                          all            ilong          ishort         long           short 
    my: - strict         92.16/s       204.81/s    157772.05/s       200.00/s    190344.59/s
  orig: - strict         49.04/s       202.22/s    160767.72/s       142.81/s    133548.90/s
 your1: - strict        147.75/s        46.91/s     46095.57/s       194.36/s    176949.84/s
 your2: - strict        159.25/s        40.19/s     38034.59/s       196.20/s    185166.45/s
  your: - strict        158.26/s        38.54/s     37012.73/s       196.20/s    186357.23/s


So looks like that experimental commits did not speed up encoder or decoder.

What is relevant from these tests is that your patches slow down encoding
and decoding of illegal sequences like "\xA0" x 1000000 about 4-5 times.

[toc] | [prev] | [next] | [standalone]

#223

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-31 21:27 -0600
Message-ID	<d5301b47-c970-54ad-144a-95c31655e828@khwilliamson.com>
In reply to	#222

On 08/31/2016 03:43 PM, pali@cpan.org wrote:
> On Monday 29 August 2016 17:00:00 Karl Williamson wrote:
>> If you'd be willing to test this out, especially the performance
>> parts that would be great!
> [snip]
>> There are 2 experimental performance commits.  If you want to see if
>> they actually improve performance by doing a before/after compare
>> that would be nice.
>
> So here are my results:
>
> strict = bless({strict_utf8 => 1}, "Encode::utf8")->encode_xs/decode_xs
> lax    = bless({strict_utf8 => 0}, "Encode::utf8")->encode_xs/decode_xs
> int    = utf8::encode/decode
>
> all    = join "", map { chr } 0 .. 0x10FFFF
> short  = "žluťoučký kůň pěl ďábelské ódy " x 45
> long   = $short x 1000
> ishort = "\xA0" x 1000
> ilong  = "\xA0" x 1000000
>
> your   = 9c03449800417dd02cc1af613951a1002490a52a
> orig   = f16e7fa35c1302aa056db5d8d022b7861c1dd2e8
> my     = orig without c8247c27c13d1cf152398e453793a91916d2185d
> your1  = your without b65e9a52d8b428146ee554d724b9274f8e77286c
> your2  = your without 9ccc3ecd1119ccdb64e91b1f03376916aa8cc6f7
>
>
> decode
>                           all            ilong          ishort         long           short
>     my: - int           285.94/s     14988.61/s   4694109.54/s       704.15/s    599678.93/s
>   orig: - int           292.41/s     15121.98/s   4782883.50/s       494.33/s    553182.28/s
>  your1: - int           271.21/s     14232.25/s   4706722.93/s       599.68/s    554941.90/s
>  your2: - int           280.85/s     14090.33/s   4210573.40/s       593.93/s    558487.86/s
>   your: - int           283.23/s     15121.98/s   4500252.51/s       691.95/s    678859.55/s
>
>                           all            ilong          ishort         long           short
>     my: - lax            83.28/s       202.22/s    142049.67/s       181.82/s    163352.41/s
>   orig: - lax            53.49/s       201.58/s    152422.11/s       147.13/s    133974.37/s
>  your1: - lax           255.13/s        53.75/s     47590.82/s       560.34/s    431447.77/s
>  your2: - lax           281.71/s        48.41/s     43260.19/s       634.16/s    445365.29/s
>   your: - lax           286.96/s        46.35/s     42848.40/s       632.20/s    442546.52/s
>
>                           all            ilong          ishort         long           short
>     my: - strict         90.48/s       200.00/s    143081.15/s       197.53/s    175800.00/s
>   orig: - strict         49.21/s       202.22/s    149447.34/s       142.81/s    128290.63/s
>  your1: - strict        154.94/s        48.16/s     44237.93/s       191.36/s    169228.16/s
>  your2: - strict        158.75/s        40.06/s     37244.06/s       195.95/s    173588.68/s
>   your: - strict        158.26/s        38.54/s     36898.14/s       195.95/s    172504.61/s
>
>
> encode
>                           all            ilong          ishort         long           short
>     my: - int       5197722.67/s   5227338.26/s   5210583.97/s   5163520.62/s   5227338.26/s
>   orig: - int       5449888.54/s   5381336.48/s   5370254.05/s   5449888.54/s   5301624.60/s
>  your1: - int       5244200.62/s   5293830.28/s   5277183.02/s   5361483.07/s   5260640.13/s
>  your2: - int       5435994.67/s   5432587.30/s   5398312.30/s   5487602.22/s   5606457.74/s
>   your: - int       5261172.17/s   5327441.90/s   5310582.91/s   5310582.91/s   5361483.07/s
>
>                           all            ilong          ishort         long           short
>     my: - lax          2442.24/s     15084.08/s   2882995.00/s      7993.15/s   2716293.65/s
>   orig: - lax          2438.39/s     15121.98/s   2933419.33/s      7965.22/s   2665521.81/s
>  your1: - lax          2229.94/s     14908.60/s   2117316.51/s      7428.89/s   2011133.75/s
>  your2: - lax          2400.92/s     15121.98/s   3046739.87/s      8065.41/s   2742961.18/s
>   your: - lax          2368.00/s     15168.94/s   2862328.67/s      8090.85/s   2685694.50/s
>
>                           all            ilong          ishort         long           short
>     my: - strict         92.16/s       204.81/s    157772.05/s       200.00/s    190344.59/s
>   orig: - strict         49.04/s       202.22/s    160767.72/s       142.81/s    133548.90/s
>  your1: - strict        147.75/s        46.91/s     46095.57/s       194.36/s    176949.84/s
>  your2: - strict        159.25/s        40.19/s     38034.59/s       196.20/s    185166.45/s
>   your: - strict        158.26/s        38.54/s     37012.73/s       196.20/s    186357.23/s
>
>
> So looks like that experimental commits did not speed up encoder or decoder.
>
> What is relevant from these tests is that your patches slow down encoding
> and decoding of illegal sequences like "\xA0" x 1000000 about 4-5 times.
>

Thanks for your efforts.  Also relevant is that this speeds up 
validation under decode by over a factor of 5.  Given that most inputs 
will be mostly valid, this outweighs the slowdown, and so I have pushed 
the non-experimental non-Encode-changes portions to blead.

We may change Encode in blead too, since it already differs from cpan. 
I'll have to get Sawyer's opinion on that.  But the next step is for me 
to fix Devel::PPPort to handle the things that Encode needs, issue a 
pull request there, and after that is resolved issue an Encode PR.

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

Re: Encode UTF-8 optimizations

Contents

#202 — Re: Encode UTF-8 optimizations

#204

#205

#206

#207

#208

#209

#210

#211

#213

#214

#215

#216

#218

#217

#219

#220

#221

#222

#223