| Newsgroups | perl.unicode |
|---|---|
| Subject | Re: Encode UTF-8 optimizations |
| Date | 2016-08-12 17:31 +0200 |
| Message-ID | <201608121731.32716@pali> |
| From | pali@cpan.org |
On Thursday 11 August 2016 17:41:23 Karl Williamson wrote:
> On 07/09/2016 05:12 PM, pali@cpan.org wrote:
> >Hi! As we know, utf8::encode() does not provide correct UTF-8 encoding
> >and Encode::encode("UTF-8", ...) should be used instead. Also, opening
> >a file should be done with the :encoding(UTF-8) layer instead of :utf8.
> >
> >But the strict UTF-8 implementation in the Encode module is horribly slow
> >compared to utf8::encode(). It is implemented in the Encode.xs file, and
> >for benchmarking this XS implementation can be called directly by:
> >
> > use Encode;
> > my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input);
> >
> >(without overhead of Encode module...)
> >
> >Here are my results on 160 bytes long input string:
> >
> > Encode::utf8::encode_xs({strict_utf8 => 1}, ...): 8 wallclock secs ( 8.56 usr + 0.00 sys = 8.56 CPU) @ 467289.72/s (n=4000000)
> > Encode::utf8::encode_xs({strict_utf8 => 0}, ...): 1 wallclock secs ( 1.66 usr + 0.00 sys = 1.66 CPU) @ 2409638.55/s (n=4000000)
> > utf8::encode: 1 wallclock secs ( 0.39 usr + 0.00 sys = 0.39 CPU) @ 10256410.26/s (n=4000000)
> >
> >I found two bottlenecks (the slow sv_catpv* and utf8n_to_uvuni functions)
> >and did some optimizations. The final results are:
> >
> > Encode::utf8::encode_xs({strict_utf8 => 1}, ...): 2 wallclock secs ( 3.27 usr + 0.00 sys = 3.27 CPU) @ 1223241.59/s (n=4000000)
> > Encode::utf8::encode_xs({strict_utf8 => 0}, ...): 1 wallclock secs ( 1.68 usr + 0.00 sys = 1.68 CPU) @ 2380952.38/s (n=4000000)
> > utf8::encode: 1 wallclock secs ( 0.40 usr + 0.00 sys = 0.40 CPU) @ 10000000.00/s (n=4000000)
> >
> >Patches are on github at pull request:
> >https://github.com/dankogai/p5-encode/pull/56
> >
> >I would appreciate it if somebody reviewed my patches and told me
> >whether this is the right way to optimize...
> >
>
> I'm sorry that this slipped off my radar until I saw it in the new Encode
> release.
>
> There are a couple of things I see wrong with your patch.
>
> 1) It does not catch the malformation of an overlong sequence. This is a
> serious malformation which has been used for attacks. Basically, after you
> get the result, you need to check that it is the expected length for that
> result. For example, \xC0\x80 will have an input length of 2, and evaluates
> to \x00, whose expected length is 1, and so the input is overlong. In
> modern perls, you can just do an OFFUNISKIP(uv) and compare that with the
> passed-in length. This can be rewritten for perls back to 5.8 using
> UNI_SKIP and UNI_TO_NATIVE.

I do not see where the problem could be. At the very least, I think my
patches are compatible with the previous implementation in Encode.xs.

First, UTF8_IS_INVARIANT is checked, and if it matches, one character is
processed. Otherwise UTF8_IS_START is checked and UTF8SKIP is used to get
the length of the sequence. Then the next len-1 bytes are checked to see
whether they pass the UTF8_IS_CONTINUATION test.

If there are fewer continuation bytes than expected, one of the following
bytes fails UTF8_IS_CONTINUATION and an error is reported. If there are
more, the next iteration of the loop starts and fails on both
UTF8_IS_CONTINUATION and UTF8_IS_START.

Can you describe in detail what you think is wrong and how such an attack
would be carried out?

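To make the overlong case from point 1 concrete, here is a minimal C sketch of a decoder with an optional shortest-form check. It is not the Encode.xs code and does not use Perl's macros: `seq_len_from_lead`, `shortest_len` and `decode_one` are hypothetical helpers, the lead-byte classification is a simplified ASCII-platform model, and the range checks for surrogates and code points above U+10FFFF are left out. It illustrates that checking only the lead byte and the continuation bytes accepts the bytes E0 80 80 as U+0000, while comparing the shortest length of the decoded value with the number of bytes consumed rejects them.

```c
#include <stddef.h>
#include <stdint.h>

/* Sequence length implied by a lead byte (simplified model, up to 4 bytes).
   Returns 0 if the byte cannot start a sequence. Not Perl's UTF8SKIP. */
static size_t seq_len_from_lead(uint8_t b)
{
    if (b < 0x80) return 1;                 /* invariant (ASCII) byte */
    if (b >= 0xC0 && b <= 0xDF) return 2;
    if (b >= 0xE0 && b <= 0xEF) return 3;
    if (b >= 0xF0 && b <= 0xF7) return 4;
    return 0;                               /* continuation or invalid lead */
}

/* Number of bytes in the shortest UTF-8 encoding of a code point. */
static size_t shortest_len(uint32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* Decode one sequence starting at s, with n bytes available.
   Returns the number of bytes consumed, or 0 on malformation.
   With check_overlong == 0 only the structural checks are performed. */
static size_t decode_one(const uint8_t *s, size_t n, int check_overlong,
                         uint32_t *cp_out)
{
    static const uint8_t lead_mask[5] = {0, 0x7F, 0x1F, 0x0F, 0x07};
    size_t len, i;
    uint32_t cp;

    if (n == 0) return 0;
    len = seq_len_from_lead(s[0]);
    if (len == 0 || len > n) return 0;      /* bad lead byte or truncated */

    cp = s[0] & lead_mask[len];
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* The extra step: the value must need exactly as many bytes as were used. */
    if (check_overlong && shortest_len(cp) != len) return 0;

    *cp_out = cp;
    return len;
}
```

With `check_overlong` set to 0, the bytes E0 80 80 decode to 0 and consume 3 bytes; with it set to 1 the same input is reported as malformed, because U+0000 needs only one byte.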
> 2) It does not work on EBCDIC platforms. The NATIVE_TO_UTF() call is a good
> start, but the result uv needs to be transformed back to native, using
> UNI_TO_NATIVE(uv).

uv is used only to check whether it is a valid Unicode code point. The
actual value is used only in the error/warning message. The previous
implementation used utf8n_to_uvuni, which converts its return value with
NATIVE_TO_UNI.

> 3) The assumptions the subroutine runs under need to be documented for
> future maintainers and code readers. For example, it assumes that there is
> enough space in the input to hold all the bytes.

The function process_utf8 does not assume that. It calls SvGROW to
increase the buffer size when needed.

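As an illustration of the grow-when-needed pattern mentioned above, here is a minimal sketch in plain C. It is only an analogy: it does not use SvGROW or any Perl API, and `outbuf`, `outbuf_reserve` and `outbuf_append` are hypothetical names invented for this example. The idea is that every append first reserves enough room, so the copying code never has to assume the buffer is already large enough.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer (not a Perl SV). */
typedef struct {
    char  *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
} outbuf;

/* Make sure there is room for `extra` more bytes.
   Returns 0 on success, -1 on overflow or allocation failure. */
static int outbuf_reserve(outbuf *b, size_t extra)
{
    size_t need, new_cap;
    char *p;

    if (extra > (size_t)-1 - b->len) return -1;  /* len + extra would wrap */
    need = b->len + extra;
    if (need <= b->cap) return 0;                /* already big enough */

    new_cap = b->cap ? b->cap : 16;
    while (new_cap < need) {
        if (new_cap > (size_t)-1 / 2) { new_cap = need; break; }
        new_cap *= 2;                            /* grow geometrically */
    }

    p = realloc(b->data, new_cap);
    if (p == NULL) return -1;                    /* old buffer stays valid */
    b->data = p;
    b->cap = new_cap;
    return 0;
}

/* Append n bytes, growing the buffer first if required. */
static int outbuf_append(outbuf *b, const char *src, size_t n)
{
    if (outbuf_reserve(b, n) != 0) return -1;
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}
```

A caller starts with a zero-initialized `outbuf`, calls `outbuf_append` for each chunk of output, and frees `data` at the end.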
> Other than that, it looks ok to me. But, to be sure, I think you should run
> it on the tests included in the core t/op/utf8decode.t which came from an
> internet repository of edge cases.

How can I use and run that test with Encode?