Groups | Search | Server Info | Login | Register

Re: Encode UTF-8 optimizations

Newsgroups	perl.unicode
Subject	Re: Encode UTF-8 optimizations
Date	2016-08-12 17:31 +0200
Message-ID	<201608121731.32716@pali> (permalink)
From	pali@cpan.org

Show all headers | View raw

On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> On 07/09/2016 05:12 PM, pali@cpan.org wrote:
> >Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> >and Encode::encode("UTF-8", ...) should be used instead. Also opening
> >file should be done by :encoding(UTF-8) layer instead :utf8.
> >
> >But UTF-8 strict implementation in Encode module is horrible slow when
> >comparing to utf8::encode(). It is implemented in Encode.xs file and for
> >benchmarking can be this XS implementation called directly by:
> >
> >  use Encode;
> >  my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
> >
> >(without overhead of Encode module...)
> >
> >Here are my results on 160 bytes long input string:
> >
> >  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  
0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
> >  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  
0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
> >  utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 
10256410.26/s (n=4000000)
> >
> >I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> >and did some optimizations. Final results are:
> >
> >  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  
0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
> >  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  
0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
> >  utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 
10000000.00/s (n=4000000)
> >
> >Patches are on github at pull request:
> >https://github.com/dankogai/p5-encode/pull/56
> >
> >I would like if somebody review my patches and tell if this is the
> >right way for optimizations...
> >
> 
> I'm sorry that this slipped off my radar until I saw it in the new Encode
> release
> 
> There are a couple of things I see wrong with your patch.
> 
> 1) It does not catch the malformation of an overlong sequence.  This is a
> serious malformation which has been used for attacks.  Basically, after you
> get the result, you need to check that it is the expected length for that
> result.  For example, \xC2\x80 will have an input length of 2, and evaluates
> to \x00, whose expected length is 1, and so the input is overlong.  In
> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> passed-in length.  This can be rewritten for perls back to 5.8 using
> UNI_SKIP and UNI_TO_NATIVE

I do not see where can be a problem. At least I think my patches should
be compatible with previous implementation of Encode.xs...

First UTF8_IS_INVARIANT is checked and one character processed.

Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get length of
sequence. And then len-1 characters are checked if they pass test for
UTF8_IS_CONTINUATION.

If there are less characters then following does not
UTF8_IS_CONTINUATION and error is reported. If there are more, then next
iteration of loop starts and it fail on both UTF8_IS_CONTINUATION and
UTF8_IS_START.

Can you describe in details what do you think it wrong and how to do
that attack?

> 2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a good
> start, but the result uv needs to be transformed back to native, using
> UNI_TO_NATIVE(uv).

uv is used just to check if it is valid Unicode code point. Real value
is used only for error/warn message. Previous implementation used
utf8n_to_uvuni which convert return value with NATIVE_TO_UNI.

> 3) The assumptions the subroutine runs under need to be documented for
> future maintainers and code readers.  For example, it assumes that there is
> enough space in the input to hold all the bytes.

Function process_utf8 does not assume that. It calls SvGROW to increase
buffer size when needed.

> Other than that, it looks ok to me.  But, to be sure, I think you should run
> it on the tests included in the core t/op/utf8decode.t which came from an
> internet repository of edge cases.

How to use and run that test with Encode?

Back to perl.unicode | Previous | Next — Next in thread | Find similar

Thread

Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-12 17:31 +0200
  Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-18 23:06 -0600
    Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-19 10:42 +0200
      Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-20 19:10 -0600
        Re: Encode UTF-8 optimizations pagaltzis@gmx.de (Aristotle Pagaltzis) - 2016-08-21 04:33 +0200
          Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-20 20:55 -0600
        Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-21 10:34 +0200
          Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-21 08:49 -0600
            Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 15:05 +0200
              Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 13:43 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 22:47 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 15:19 -0600
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-22 15:38 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 23:45 +0200
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-22 23:39 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-24 22:49 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-25 09:48 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-29 09:00 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-08-31 23:43 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-31 21:27 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-09-01 09:30 +0200
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-09-25 12:06 +0200
                Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-09-25 10:49 -0600
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-10-27 10:25 +0200
                Re: Encode UTF-8 optimizations pali@cpan.org - 2016-11-01 10:53 +0100

csiph-web