Groups > perl.unicode > #200 > unrolled thread

Encode UTF-8 optimizations

Started by	pali@cpan.org
First post	2016-07-10 01:12 +0200
Last post	2016-08-11 17:41 -0600
Articles	2 — 2 participants

Back to article view | Back to perl.unicode

  Encode UTF-8 optimizations pali@cpan.org - 2016-07-10 01:12 +0200
    Re: Encode UTF-8 optimizations public@khwilliamson.com (Karl Williamson) - 2016-08-11 17:41 -0600

#200 — Encode UTF-8 optimizations

From	pali@cpan.org
Date	2016-07-10 01:12 +0200
Subject	Encode UTF-8 optimizations
Message-ID	<201607100112.45201@pali>

Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
and Encode::encode("UTF-8", ...) should be used instead. Also opening
file should be done by :encoding(UTF-8) layer instead :utf8.

But UTF-8 strict implementation in Encode module is horrible slow when
comparing to utf8::encode(). It is implemented in Encode.xs file and for
benchmarking can be this XS implementation called directly by:

  use Encode;
  my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)

(without overhead of Encode module...)

Here are my results on 160 bytes long input string:

  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
  utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 10256410.26/s (n=4000000)

I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
and did some optimizations. Final results are:

  Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
  Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
  utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 10000000.00/s (n=4000000)

Patches are on github at pull request:
https://github.com/dankogai/p5-encode/pull/56

I would like if somebody review my patches and tell if this is the
right way for optimizations...

[toc] | [next] | [standalone]

#201

From	public@khwilliamson.com (Karl Williamson)
Date	2016-08-11 17:41 -0600
Message-ID	<076f2b15-ccf9-80bf-bdbe-383fcd64e7bf@khwilliamson.com>
In reply to	#200

On 07/09/2016 05:12 PM, pali@cpan.org wrote:
> Hi! As we know utf8::encode() does not provide correct UTF-8 encoding
> and Encode::encode("UTF-8", ...) should be used instead. Also opening
> file should be done by :encoding(UTF-8) layer instead :utf8.
>
> But UTF-8 strict implementation in Encode module is horrible slow when
> comparing to utf8::encode(). It is implemented in Encode.xs file and for
> benchmarking can be this XS implementation called directly by:
>
>   use Encode;
>   my $output = Encode::utf8::encode_xs({strict_utf8 => 1}, $input)
>
> (without overhead of Encode module...)
>
> Here are my results on 160 bytes long input string:
>
>   Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  8 wallclock secs ( 8.56 usr +  0.00 sys =  8.56 CPU) @ 467289.72/s (n=4000000)
>   Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.66 usr +  0.00 sys =  1.66 CPU) @ 2409638.55/s (n=4000000)
>   utf8::encode:  1 wallclock secs ( 0.39 usr +  0.00 sys =  0.39 CPU) @ 10256410.26/s (n=4000000)
>
> I found two bottle necks (slow sv_catpv* and utf8n_to_uvuni functions)
> and did some optimizations. Final results are:
>
>   Encode::utf8::encode_xs({strict_utf8 => 1}, ...):  2 wallclock secs ( 3.27 usr +  0.00 sys =  3.27 CPU) @ 1223241.59/s (n=4000000)
>   Encode::utf8::encode_xs({strict_utf8 => 0}, ...):  1 wallclock secs ( 1.68 usr +  0.00 sys =  1.68 CPU) @ 2380952.38/s (n=4000000)
>   utf8::encode:  1 wallclock secs ( 0.40 usr +  0.00 sys =  0.40 CPU) @ 10000000.00/s (n=4000000)
>
> Patches are on github at pull request:
> https://github.com/dankogai/p5-encode/pull/56
>
> I would like if somebody review my patches and tell if this is the
> right way for optimizations...
>

I'm sorry that this slipped off my radar until I saw it in the new 
Encode release

There are a couple of things I see wrong with your patch.

1) It does not catch the malformation of an overlong sequence.  This is 
a serious malformation which has been used for attacks.  Basically, 
after you get the result, you need to check that it is the expected 
length for that result.  For example, \xC2\x80 will have an input length 
of 2, and evaluates to \x00, whose expected length is 1, and so the 
input is overlong.  In modern perls, you can just do an OFFUNISKIP(uv) 
and compare that with the passed-in length.  This can be rewritten for 
perls back to 5.8 using UNI_SKIP and UNI_TO_NATIVE

2) It does not work on EBCDIC platforms.  The NATIVE_TO_UTF() call is a 
good start, but the result uv needs to be transformed back to native, 
using UNI_TO_NATIVE(uv).

3) The assumptions the subroutine runs under need to be documented for 
future maintainers and code readers.  For example, it assumes that there 
is enough space in the input to hold all the bytes.

Other than that, it looks ok to me.  But, to be sure, I think you should 
run it on the tests included in the core t/op/utf8decode.t which came 
from an internet repository of edge cases.

[toc] | [prev] | [standalone]

csiph-web

Encode UTF-8 optimizations

Contents

#200 — Encode UTF-8 optimizations

#201