Groups | Search | Server Info | Login | Register

Re: Encode UTF-8 optimizations

Newsgroups	perl.unicode
Subject	Re: Encode UTF-8 optimizations
References	<201608121731.32716@pali> <8a4fb1c9-615e-2cf7-2fa2-28e3ff637406@khwilliamson.com> <20160819084221.GA5236@atrey.karlin.mff.cuni.cz>
Message-ID	<bcfddfeb-5045-95b8-b283-7a243542bcab@khwilliamson.com> (permalink)
Date	2016-08-20 19:10 -0600
From	public@khwilliamson.com (Karl Williamson)

Show all headers | View raw

[Multipart message — attachments visible in raw view] - view raw

Top posting.

Attached is my alternative patch.  It effectively uses a different 
algorithm to avoid decoding the input into code points, and to copy all 
spans of valid input at once, instead of character at a time.

And it uses only currently available functions.  Any of these that are 
missing or buggy in previous perls can and will be dealt with by the 
Devel::PPPort mechanism.

On 08/19/2016 02:42 AM, pali@cpan.org wrote:
> On Thursday 18 August 2016 23:06:27 Karl Williamson wrote:
>> On 08/12/2016 09:31 AM, pali@cpan.org wrote:
>>> On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
>>>> On 07/09/2016 05:12 PM, pali@cpan.org wrote:
>>>>> Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
>>>>> and Encode::encode("UTF-8", ...) should be used instead. Also opening
>>>>> file should be done by :encoding(UTF-8) layer instead :utf8.
>>>>>
>>>>> But UTF-8 strict implementation in Encode module is horrible slow when
>>>>> comparing to utf8::encode(). It is implemented in Encode.xs file and for
>>>>> benchmarking can be this XS implementation called directly by:
>>>>>
>>>>> use Encode;
>>>>> my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
>>>>>
>>>>> (without overhead of Encode module...)
>>>>>
>>>>> Here are my results on 160 bytes long input string:
>>>>>
>>>>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +
>>> 0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
>>>>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +
>>> 0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
>>>>> utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @
>>> 10256410.26/s (n=4000000)
>>>>>
>>>>> I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
>>>>> and did some optimizations. Final results are:
>>>>>
>>>>> Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +
>>> 0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
>>>>> Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +
>>> 0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
>>>>> utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @
>>> 10000000.00/s (n=4000000)
>>>>>
>>>>> Patches are on github at pull request:
>>>>> https://github.com/dankogai/p5-encode/pull/56
>>>>>
>>>>> I would like if somebody review my patches and tell if this is the
>>>>> right way for optimizations...
>>>>>
>>>>
>>>> I'm sorry that this slipped off my radar until I saw it in the new Encode
>>>> release
>>>>
>>>> There are a couple of things I see wrong with your patch.
>>>>
>>>> 1) It does not catch the malformation of an overlong sequence.  This is a
>>>> serious malformation which has been used for attacks.  Basically, after you
>>>> get the result, you need to check that it is the expected length for that
>>>> result.  For example, \xC2\x80 will have an input length of 2, and evaluates
>>>> to \x00, whose expected length is 1, and so the input is overlong.  In
>>>> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
>>>> passed-in length.  This can be rewritten for perls back to 5.8 using
>>>> UNI_SKIP and UNI_TO_NATIVE
>>>
>>> I do not see where can be a problem. At least I think my patches should
>>> be compatible with previous implementation of Encode.xs...
>>>
>>> First UTF8_IS_INVARIANT is checked and one character processed.
>>>
>>> Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
>>> sequence. And then len-1 characters are checked if they pass test for
>>> UTF8_IS_CONTINUATION.
>>>
>>> If there are less characters then following does not
>>> UTF8_IS_CONTINUATION and error is reported. If there are more, then next
>>> iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
>>> UTF8_IS_START.
>>>
>>> Can you describe in details what do you think it wrong and how to do
>>> that attack?
>>
>> This discussion has been active at
>> https://github.com/dankogai/p5-encode/issues/64
>>
>> For the curious out there, please refer to that discussion.  My bottom line
>> is that I have come to believe the security risks are too high to have
>> modules do their own security checking for UTF-8 correctness.
>>>
>>>> 2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
>>>> start, but the result uv needs to be transformed back to native, using
>>>> UNI_TO_NATIVE(uv).
>>>
>>> uv is used just to check if it is valid Unicode code point. Real value
>>> is used only for error/warn message. Previous implementation used
>>> utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.
>>
>> As I said on that other thread, if this is really true, then it's faster to
>> use a boolean function to verify the inputs.
>
> Value of uv is used only in warn/error message.
>
>> Also, performance should not
>> be a consideration for errors or warnings.  One can check validity fast; and
>> then spend the time getting the message right in the rare cases where a
>> message is generated.
>
> Yes, fully I agree.
>
>>>> 3) The assumptions the subroutine runs under need to be documented for
>>>> future maintainers and code readers.  For example, it assumes that there is
>>>> enough space in the input to hold all the bytes.
>>>
>>> Function process_utf8 does not assume that. It calls SvGROW to increase
>>> buffer size when needed.
>>
>> You misunderstand what I meant here.  The bottom line is your patch adds a
>> significant amount of code without any comments in a risky area.  The name
>> of the function does not indicate that its value is to be thrown away, and
>> even after looking at the code that calls it some more, it's not obvious to
>> me that the value isn't kept.  All subtleties in code should be commented in
>> that code.  To do otherwise is a disservice to future maintainers.  I
>> personally will never push to blead someone's commit that I think  unfairly
>> burdens future maintainers.  One of the subtleties of this function is that
>> it doesn't check that it is not running off the end of the buffer.  It
>> relies on the caller to do that check, but someone coming along might see
>> that function and think from its name that it's more general purpose than it
>> actually is.  Someone looking at its name and return value would likely
>> think it generates a valid code point from UTF-8 input.
>
> It is intended to do not add another non-needed checks as it again slow
> down function performance. Checks are done before on another place and
> that function is designed to be used only in Encode's process_utf8.
>
> So rather write comments/description about that function. But on the
> other hand, I dislike comments which just write what is function doing
> in case that comments are longer then function code itself. Then it is
> easier to read function code itself and so whole comment is useless...
>
>>>> Other than that, it looks ok to me.  But, to be sure, I think you should run
>>>> it on the tests included in the core t/op/utf8decode.t which came from an
>>>> internet repository of edge cases.
>>
>> I later realized that under non-strict calls, this can overflow, and there
>> is some code in your amendments to this patch that check that.  I have not
>> looked those over.
>
> Yes, check for overlong and overflow is there...
>
>> But again, I don't think Encode should undertake its own security checking.
>> I'm willing to work with you to get something in core that adequately meets
>> Encode's needs.
>
> Ok. We can discuss about it. First I see there big problems:
>
> 1) Encode module is for Perl 5.8+. In 5.8+ perl's versions will never be
>    your (new) functions, so Encode module needs to have at least copy of
>    them. And this does not solve problem which you want to prevent :-(
>
> 2) Functions must be defined and declared in some header file. So C
>    compiler can inline and optimize them in Encode module. Calling such
>    hot function via shared library must be avoided.
>
> 3) Such function needs to be designed in way that it do only what is
>    needed in case for Encode module. On the other hand I do not think it
>    is good idea to create special function just for Encode module living
>    in core perl... But maybe general-useful function can be designed.
>
>>> How to use and run that test with Encode?
>>
>> It looks like you figured that out for the most part in your amended
>> patches.
>
> Yes, meanwhile you wrote reply, I figured out about those problems and
> also copied those tests :-) Anyway, test for EBCDIC are missing in core.
>

Back to perl.unicode | Previous | Next — Previous in thread | Next in thread | Find similar

Thread

Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-12 17:31 +0200
  Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-18 23:06 -0600
    Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-19 10:42 +0200
      Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-20 19:10 -0600
        Re: Encode UTF-8 optimizations pagaltzis@gmx.de (Aristotle Pagaltzis) - 2016-08-21 04:33 +0200
          Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-20 20:55 -0600
        Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-21 10:34 +0200
          Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-21 08:49 -0600
            Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 15:05 +0200
              Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 13:43 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 22:47 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 15:19 -0600
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 15:38 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 23:45 +0200
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 23:39 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-24 22:49 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-25 09:48 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-29 09:00 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-31 23:43 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-31 21:27 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-09-01 09:30 +0200
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-09-25 12:06 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-09-25 10:49 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-10-27 10:25 +0200
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-11-01 10:53 +0100

csiph-web