Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #85737 > unrolled thread

python implementation of a new integer encoding algorithm.

Started byjanhein.vanderburg@gmail.com
First post2015-02-17 03:22 -0800
Last post2015-02-20 00:58 -0800
Articles 20 on this page of 53 — 15 participants

Back to article view | Back to comp.lang.python


Contents

  python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-17 03:22 -0800
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 00:16 +1100
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:55 -0800
        Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 20:36 +1100
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:29 -0800
        Re: python implementation of a new integer encoding algorithm. Laura Creighton <lac@openend.se> - 2015-02-18 11:32 +0100
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:48 -0800
        People hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.) Ben Finney <ben+python@benfinney.id.au> - 2015-02-18 21:57 +1100
    Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:12 -0500
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:59 -0800
        Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 11:46 -0500
          Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 17:30 +0000
            Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 18:12 +0000
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:55 -0800
            Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-18 23:54 +0200
              Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 00:08 +0200
              Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 22:58 +0000
            Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 17:19 -0500
              Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:45 -0800
                Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:04 -0700
                Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:16 -0700
                Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:24 -0500
                Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:34 +1100
                Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:32 -0700
                Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:41 -0500
                Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:46 -0500
                Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:49 +1100
        Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 17:00 +0000
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:34 +1100
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:04 -0800
        Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 08:54 -0500
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:52 -0800
        Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 01:16 +1100
    Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:50 -0500
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:58 +1100
    Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 10:18 -0500
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 02:25 +1100
    Re: python implementation of a new integer encoding algorithm. Paul Rubin <no.email@nospam.invalid> - 2015-02-17 08:43 -0800
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:06 -0800
    Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 08:44 +0100
      Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 08:06 +0000
        Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 10:36 +0200
          Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 09:33 +0000
          Re: python implementation of a new integer encoding algorithm. Terry Reedy <tjreedy@udel.edu> - 2015-02-19 14:50 -0500
            Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 21:55 +0200
      Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 19:36 +1100
      Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 10:42 +0100
      Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 10:28 +0000
      Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 14:27 +0100
    Re: python implementation of a new integer encoding algorithm. Jonas Wielicki <jonas@wielicki.name> - 2015-02-19 09:38 +0100
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:58 -0800
    Re: python implementation of a new integer encoding algorithm. Denis McMahon <denismfmcmahon@gmail.com> - 2015-02-20 02:46 +0000
      Re: python implementation of a new integer encoding algorithm. wxjmfauth@gmail.com - 2015-02-20 00:58 -0800

Page 2 of 3 — ← Prev page 1 [2] 3  Next page →


#85917

FromIan Kelly <ian.g.kelly@gmail.com>
Date2015-02-19 11:16 -0700
Message-ID<mailman.18893.1424369820.18130.python-list@python.org>
In reply to#85910
On Thu, Feb 19, 2015 at 11:04 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> There's also an optimization that can be added here if we wish to
> inject a bit of cleverness. Notice that all values with more than one
> group start with 11, never 10. We can borrow a trick from IEEE
> floating point and make the leading 1 bit of the mantissa implicit for
> all values greater than 3 (we can't do it for 2 and 3 because then we
> couldn't distinguish them from 0 and 1). Applying this optimization
> removes one full group from the representation of all values greater
> than 3, which appears to make the stop-bit representation as short as
> or shorter than the "difficult" one for all the values that have been
> enumerated above.

I made a mistake. This trick only works if the second group is 10, so
it only applies to half the numbers. That makes 12 and 13 a group
longer than the difficult representation, but on the other hand 18-23
are still a group shorter.

[toc] | [prev] | [next] | [standalone]


#85918

FromDave Angel <davea@davea.name>
Date2015-02-19 13:24 -0500
Message-ID<mailman.18894.1424370292.18130.python-list@python.org>
In reply to#85910
On 02/19/2015 10:45 AM, janhein.vanderburg@gmail.com wrote:
> On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote:
>> I'm not necessarily doubting it, just challenging you to provide a data
>> sample that actually shows it.  And of course, I'm not claiming that
>> 7bit is in any way optimal.  You cannot define optimal without first
>> defining the distribution.
>
> Weird results.

In what way weird?

> For a character size 2 the growth processes are shown below.
> I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length:
> 0:  00				00				0
> 1:  01				01				0
> 2:  10, 00			10, 00				0
> 3:  10, 01			10, 01				0
> 4:  10, 10			11, 00				0
> 5:  10, 11			11, 01				0
> 6:  11, 00.00			11, 10, 00			0
> 7:  11, 00.01			11, 10, 01			0
> 8:  11, 00.10			11, 11, 00			0
> 9:  11, 00.11			11, 11, 01			0
> 10: 11, 01.00			11, 11, 10, 00			1
> 11: 11, 01.01			11, 11, 10, 01			1
> 12: 11, 01.10			11, 11, 11, 00			1
> 13: 11, 01.11			11, 11, 11, 01			1
> 14: 11, 10.00, 00		11, 11, 11, 10, 00		1
> 15: 11, 10.00, 01		11, 11, 11, 10, 01		1
> 16: 11, 10.00, 10		11, 11, 11, 11, 00		1
> 17: 11, 10.00, 11		11, 11, 11, 11, 01		1
> 18: 11, 10.01, 00.00		11, 11, 11, 11, 10, 00		1
> 19: 11, 10.01, 00.01		11, 11, 11, 11, 10, 01		1
> 20: 11, 10.01, 00.10		11, 11, 11, 11, 11, 00		1
> 21: 11, 10.01, 00.11		11, 11, 11, 11, 11, 01		1
> 22: 11, 10.01, 01.00		11, 11, 11, 11, 11, 10, 00	2
> 23: 11, 10.01, 01.01		11, 11, 11, 11, 11, 10, 01	2
> 24: 11, 10.01, 01.10		11, 11, 11, 11, 11, 11, 00	2
> 25: 11, 10.01, 01.11		11, 11, 11, 11, 11, 11, 01	2
> 26: 11, 10.01, 10.00		11, 11, 11, 11, 11, 11, 10, 00	3
>
> I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding.
>

Clearly, one wouldn't use what you call "stop bit encoding" with a 2bit 
character.  And since the Python code you posted uses an 8bit character, 
I can't validate the "difficult" column either.

I wrote the following pair of functions:

def seven_code(n):
     acc = bytearray()
     if n == 0:
         acc.append(0)
     while n > 0:
         quotient, remainder = divmod(n, 128)
         acc.append(remainder)
         n = quotient
     acc[-1] |= 0x80            #add a stop bit to last byte
     return acc

def seven_decode(sequ):
     acc = 0
     multiplier = 1
     for by in sequ:
         acc += (by & 0x7f) * multiplier
         if by > 0x7f:
             break
         multiplier *= 128
     return acc

Here's a couple of ranges of output, showing that the 7bit scheme does 
better for values between 384 and 16379.

382 2 80fe --- 2 7e82
383 2 80ff --- 2 7f82
384 3 810000 --- 2 0083
384  jan grew 3 810000
385 3 810001 --- 2 0183
386 3 810002 --- 2 0283
387 3 810003 --- 2 0383
388 3 810004 --- 2 0483
389 3 810005 --- 2 0583

16380 3 813e7c --- 2 7cff
16380  jan grew 3 813e7c
16380 7bit grew 2 7cff
16381 3 813e7d --- 2 7dff
16382 3 813e7e --- 2 7eff
16383 3 813e7f --- 2 7fff
16384 3 813e80 --- 3 000081
16384 7bit grew 3 000081
16385 3 813e81 --- 3 010081
16386 3 813e82 --- 3 020081
16387 3 813e83 --- 3 030081
16388 3 813e84 --- 3 040081
16389 3 813e85 --- 3 050081

In all my experimenting, I haven't found any values where the 7bit 
scheme does worse.  It seems likely that for extremely large integers, 
it will, but if those are to be the intended distribution, the 7bit 
scheme could be replaced by something else, like just encoding a length 
at the beginning, and using raw bytes after that.


--
DaveA

[toc] | [prev] | [next] | [standalone]


#85919

FromChris Angelico <rosuav@gmail.com>
Date2015-02-20 05:34 +1100
Message-ID<mailman.18895.1424370900.18130.python-list@python.org>
In reply to#85910
On Fri, Feb 20, 2015 at 5:24 AM, Dave Angel <davea@davea.name> wrote:
> In all my experimenting, I haven't found any values where the 7bit scheme
> does worse.  It seems likely that for extremely large integers, it will, but
> if those are to be the intended distribution, the 7bit scheme could be
> replaced by something else, like just encoding a length at the beginning,
> and using raw bytes after that.

Encoding a length (as varlen) and then using eight bits to the byte
thereafter is worse for small numbers, breaks even around 2**56, and
then is better. So unless your numbers are mainly going to be above
2**56, it's better to just use varlen for the entire number. On the
other hand, if you have to stream this without over-reading (imagine
streaming from a TCP/IP socket; you want to block until you have the
whole number, but not block after that), it may be more efficient to
take the length, and then do a blocking read for the main data,
instead of a large number of single-byte reads. But on the gripping
hand, you can probably just do those one-byte reads and rely on (or
implement) lower-level buffering.

Ask not the python-list for advice, because they will say both "yes"
and "no" and "maybe"... because they will say all three of "yes",
"no", "maybe", and "you don't need to do that"... erm, AMONG our
responses will be such diverse elements as...

ChrisA

[toc] | [prev] | [next] | [standalone]


#85920

FromIan Kelly <ian.g.kelly@gmail.com>
Date2015-02-19 11:32 -0700
Message-ID<mailman.18896.1424371191.18130.python-list@python.org>
In reply to#85910
On Thu, Feb 19, 2015 at 11:24 AM, Dave Angel <davea@davea.name> wrote:
> Here's a couple of ranges of output, showing that the 7bit scheme does
> better for values between 384 and 16379.
>
> 382 2 80fe --- 2 7e82
> 383 2 80ff --- 2 7f82
> 384 3 810000 --- 2 0083
> 384  jan grew 3 810000
> 385 3 810001 --- 2 0183
> 386 3 810002 --- 2 0283
> 387 3 810003 --- 2 0383
> 388 3 810004 --- 2 0483
> 389 3 810005 --- 2 0583
>
> 16380 3 813e7c --- 2 7cff
> 16380  jan grew 3 813e7c
> 16380 7bit grew 2 7cff
> 16381 3 813e7d --- 2 7dff
> 16382 3 813e7e --- 2 7eff
> 16383 3 813e7f --- 2 7fff
> 16384 3 813e80 --- 3 000081
> 16384 7bit grew 3 000081
> 16385 3 813e81 --- 3 010081
> 16386 3 813e82 --- 3 020081
> 16387 3 813e83 --- 3 030081
> 16388 3 813e84 --- 3 040081
> 16389 3 813e85 --- 3 050081
>
> In all my experimenting, I haven't found any values where the 7bit scheme
> does worse.  It seems likely that for extremely large integers, it will, but
> if those are to be the intended distribution, the 7bit scheme could be
> replaced by something else, like just encoding a length at the beginning,
> and using raw bytes after that.

It looks like you're counting whole bytes, not bits. That would be
important since the "difficult" encoding uses fractional bytes.

[toc] | [prev] | [next] | [standalone]


#85921

FromDave Angel <davea@davea.name>
Date2015-02-19 13:41 -0500
Message-ID<mailman.18897.1424371314.18130.python-list@python.org>
In reply to#85910
On 02/19/2015 01:34 PM, Chris Angelico wrote:
> On Fri, Feb 20, 2015 at 5:24 AM, Dave Angel <davea@davea.name> wrote:
>> In all my experimenting, I haven't found any values where the 7bit scheme
>> does worse.  It seems likely that for extremely large integers, it will, but
>> if those are to be the intended distribution, the 7bit scheme could be
>> replaced by something else, like just encoding a length at the beginning,
>> and using raw bytes after that.
>
> Encoding a length (as varlen) and then using eight bits to the byte
> thereafter is worse for small numbers,

I only suggested this if it turns out that the distribution is primarily 
extremely large numbers, large enough that 7bit isn't good enough.

As I (and others) have said many times, making it optimal means making 
some assumptions about the distribution of likely values.

> breaks even around 2**56, and
> then is better. So unless your numbers are mainly going to be above
> 2**56, it's better to just use varlen for the entire number. On the
> other hand, if you have to stream this without over-reading (imagine
> streaming from a TCP/IP socket; you want to block until you have the
> whole number, but not block after that), it may be more efficient to
> take the length, and then do a blocking read for the main data,
> instead of a large number of single-byte reads. But on the gripping
> hand, you can probably just do those one-byte reads and rely on (or
> implement) lower-level buffering.
>
> Ask not the python-list for advice, because they will say both "yes"
> and "no" and "maybe"... because they will say all three of "yes",
> "no", "maybe", and "you don't need to do that"... erm, AMONG our
> responses will be such diverse elements as...
>
> ChrisA
>


-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85922

FromDave Angel <davea@davea.name>
Date2015-02-19 13:46 -0500
Message-ID<mailman.18898.1424371589.18130.python-list@python.org>
In reply to#85910
On 02/19/2015 01:32 PM, Ian Kelly wrote:
> On Thu, Feb 19, 2015 at 11:24 AM, Dave Angel <davea@davea.name> wrote:
>> Here's a couple of ranges of output, showing that the 7bit scheme does
>> better for values between 384 and 16379.
>>
>> 382 2 80fe --- 2 7e82
>> 383 2 80ff --- 2 7f82
>> 384 3 810000 --- 2 0083
>> 384  jan grew 3 810000
>> 385 3 810001 --- 2 0183
>> 386 3 810002 --- 2 0283
>> 387 3 810003 --- 2 0383
>> 388 3 810004 --- 2 0483
>> 389 3 810005 --- 2 0583
>>
>> 16380 3 813e7c --- 2 7cff
>> 16380  jan grew 3 813e7c
>> 16380 7bit grew 2 7cff
>> 16381 3 813e7d --- 2 7dff
>> 16382 3 813e7e --- 2 7eff
>> 16383 3 813e7f --- 2 7fff
>> 16384 3 813e80 --- 3 000081
>> 16384 7bit grew 3 000081
>> 16385 3 813e81 --- 3 010081
>> 16386 3 813e82 --- 3 020081
>> 16387 3 813e83 --- 3 030081
>> 16388 3 813e84 --- 3 040081
>> 16389 3 813e85 --- 3 050081
>>
>> In all my experimenting, I haven't found any values where the 7bit scheme
>> does worse.  It seems likely that for extremely large integers, it will, but
>> if those are to be the intended distribution, the 7bit scheme could be
>> replaced by something else, like just encoding a length at the beginning,
>> and using raw bytes after that.
>
> It looks like you're counting whole bytes, not bits. That would be
> important since the "difficult" encoding uses fractional bytes.
>

Not the implementation that was shared.  I've only seen one set of 
Python code for "difficult", and it was strictly bytes.  As i said 
earlier in the message you quoted from.

Naturally, I question whether the original description makes sense for 
sub-bytes, since it was claimed that these are NOT for lists or 
sequences of arbitrary integers, but only for a single one at a time. 
Presumably mixed with other data which may or may not like bit encoding.




-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85923

FromChris Angelico <rosuav@gmail.com>
Date2015-02-20 05:49 +1100
Message-ID<mailman.18899.1424371770.18130.python-list@python.org>
In reply to#85910
On Fri, Feb 20, 2015 at 5:41 AM, Dave Angel <davea@davea.name> wrote:
> As I (and others) have said many times, making it optimal means making some
> assumptions about the distribution of likely values.

In fact, the very word "optimal" implies that. You have to have a set
of criteria on which you base your evaluation, and those criteria will
include target data.

ChrisA

[toc] | [prev] | [next] | [standalone]


#85801

FromMark Lawrence <breamoreboy@yahoo.co.uk>
Date2015-02-18 17:00 +0000
Message-ID<mailman.18825.1424278837.18130.python-list@python.org>
In reply to#85777
On 18/02/2015 16:46, Dave Angel wrote:
> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>
>
>> encoding individual integers optimally without any assumptions about
>> their values.
>>
>
> Contradiction in terms.
>

I'm just pleased to see new blood coming through for my dream team, it's 
been a bit quiet recently.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

[toc] | [prev] | [next] | [standalone]


#85740

FromChris Angelico <rosuav@gmail.com>
Date2015-02-18 01:34 +1100
Message-ID<mailman.18785.1424183690.18130.python-list@python.org>
In reply to#85737
On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <davea@davea.name> wrote:
> They had a field type called a "compressed integer."  It could vary between
> one byte and I think about six.  And the theory was that it took less space
> than the equivalent format of fixed size integers.

Oh, incidentally: If you want a decent binary format for
variable-sized integer, check out the MIDI spec. Its varlen integer is
simple: each byte carries seven bits of payload, and the high bit is
set on all except the last byte. Anything under 128 is represented by
the byte with that value; after that, you get a series of continuation
bytes followed by a final byte. It could scale to infinity if you
wanted to, though I think the MIDI spec doesn't have anything that can
go quite that high, and it's a worst-case wastage of 12.5% (or,
looking the other way, worst-case expansion of 14.2%), compared to an
optimal eight-bit encoding. Given that MIDI often works with very
small numbers, but occasionally could have to cope with much larger
ones (eg "time to next event", which is often zero or very low, but
once in a while could be anything at all), this is a good trade-off.
If you know you're frequently going to work with numbers up to 2**31
but only occasionally larger than that, it might be better to take
four bytes at a time, and use the high bit as a continuation bit, in
the same way. It's all trade-offs.

Decimal digits give you roughly 10/3 bits per character (== per byte,
if you encode UTF-8). If you consider that your baseline, you can then
try to justify your algorithm on the basis of "it's twice as efficient
as decimal". Hexadecimal gives you 4 bits per character, at a
relatively small cost; if you can't get better than about 6 bits per
byte average throughput, I would say your algorithm is completely
valueless - the difference will so rarely even matter, and you lose
all the benefits I mentioned in my last email.

I've tried to read through the original algorithm description, but I'm
not entirely sure: How many payload bits per transmitted byte does it
actually achieve?

ChrisA

[toc] | [prev] | [next] | [standalone]


#85778

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 01:04 -0800
Message-ID<65b209a0-33fc-4c1f-8af6-de3a4626623c@googlegroups.com>
In reply to#85740
On Tuesday, February 17, 2015 at 3:35:16 PM UTC+1, Chris Angelico wrote:
> Oh, incidentally: If you want a decent binary format for
> variable-sized integer, check out the MIDI spec.

I did some time ago, thanks, and it is indeed a decent format.
I also looked at variations of that approach.
None of them beats "my" concept of two counters that cooperatively specify field lengths and represented integer values.

> I've tried to read through the original algorithm description, but I'm
> not entirely sure: How many payload bits per transmitted byte does it
> actually achieve?

I don't think that payload bits per byte makes sense in this concept.

[toc] | [prev] | [next] | [standalone]


#85792

FromDave Angel <davea@davea.name>
Date2015-02-18 08:54 -0500
Message-ID<mailman.18818.1424267695.18130.python-list@python.org>
In reply to#85778
On 02/18/2015 04:04 AM, janhein.vanderburg@gmail.com wrote:
> On Tuesday, February 17, 2015 at 3:35:16 PM UTC+1, Chris Angelico wrote:
>> Oh, incidentally: If you want a decent binary format for
>> variable-sized integer, check out the MIDI spec.
>
> I did some time ago, thanks, and it is indeed a decent format.
> I also looked at variations of that approach.
> None of them beats

Define "beats."  You might mean beats in simplicity, or in elegance, or 
in clarity of code.  But you probably mean in space efficiency, or 
"compression."  But that's meaningless without a target distribution of 
values that you expect to encode.

For example, if 99.9% of your values are going to be less than 255, then 
the most efficient byte encoding would be one that simply stores a value 
less than 255, and starts with an FF for larger values.  It's almost 
irrelevant how it encodes those larger values.

On the other hand, if most values are going to be in the 10,000 to 
20,000 bit size range, and a few will be much smaller, and a few will be 
very much larger, then it would be very practical to start with a size 
field, say 16 bits, followed by the raw packed data.  Naturally, the 
size field would need to have an escape value that indicates a larger 
field was needed.  In fact, the size field could be encoded in a 
7bits-per-byte manner, so it would encode an arbitrary sized number as well.


> "my" concept of two counters that cooperatively specify field lengths and represented integer values.
>
>> I've tried to read through the original algorithm description, but I'm
>> not entirely sure: How many payload bits per transmitted byte does it
>> actually achieve?
>
> I don't think that payload bits per byte makes sense in this concept.
>

Correct.  Presumably one means average payload bits per byte.

First one would have to define what the "standard" unencoded variable 
length integer format was.  Then one could call that size the payload 
size.  Then, in order to compute an average, one would have to specify 
an expected, or target distribution of values.  One then compares and 
averages the payload size for each typical value with the encoded size.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85814

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 11:52 -0800
Message-ID<dbfbf5b9-6d5a-4738-ab04-779a15045d68@googlegroups.com>
In reply to#85792
Op woensdag 18 februari 2015 14:55:07 UTC+1 schreef Dave Angel:
> Define "beats."  You might mean beats in simplicity, or in elegance, or 
> in clarity of code.  But you probably mean in space efficiency, or 
> "compression."  But that's meaningless without a target distribution of 
> values that you expect to encode.
I do indeed mean space efficiency, not compression.
And it  is not meaningless and I don't do distribution tricks.
Don't take this personally.

[toc] | [prev] | [next] | [standalone]


#85794

FromChris Angelico <rosuav@gmail.com>
Date2015-02-19 01:16 +1100
Message-ID<mailman.18820.1424268970.18130.python-list@python.org>
In reply to#85778
On Thu, Feb 19, 2015 at 12:54 AM, Dave Angel <davea@davea.name> wrote:
>>> I've tried to read through the original algorithm description, but I'm
>>> not entirely sure: How many payload bits per transmitted byte does it
>>> actually achieve?
>>
>>
>> I don't think that payload bits per byte makes sense in this concept.
>>
>
> Correct.  Presumably one means average payload bits per byte.
>
> First one would have to define what the "standard" unencoded variable length
> integer format was.  Then one could call that size the payload size.  Then,
> in order to compute an average, one would have to specify an expected, or
> target distribution of values.  One then compares and averages the payload
> size for each typical value with the encoded size.

Average, or separately for different ranges. Alternatively, identify
which ranges of numbers can be encoded with how many bytes. For
instance, if you have a varlen length (in bytes) followed by that many
packed bytes, you would have:

2: up to 1<<8-1
3: up to 1<<16-1
4: up to 1<<24-1
...
128: up to 1<<1016-1
130: up to 1<<1024-1
131: up to 1<<1032-1
etc

Using varlen directly gives:

1: up to 1<<7-1
2: up to 1<<14-1
3: up to 1<<21-1
etc

So if your number is around about 1<<20, you know it's going to take 3
bytes to encode as varlen, or 4 to encode with the meta-length. Easy
comparisons.

I'm not sure how this algorithm compares, because it's extremely clever.

ChrisA

[toc] | [prev] | [next] | [standalone]


#85741

FromDave Angel <davea@davea.name>
Date2015-02-17 09:50 -0500
Message-ID<mailman.18786.1424184618.18130.python-list@python.org>
In reply to#85737
On 02/17/2015 09:34 AM, Chris Angelico wrote:
> On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <davea@davea.name> wrote:
>> They had a field type called a "compressed integer."  It could vary between
>> one byte and I think about six.  And the theory was that it took less space
>> than the equivalent format of fixed size integers.
>
> Oh, incidentally: If you want a decent binary format for
> variable-sized integer, check out the MIDI spec. Its varlen integer is
> simple: each byte carries seven bits of payload, and the high bit is
> set on all except the last byte. Anything under 128 is represented by
> the byte with that value; after that, you get a series of continuation
> bytes followed by a final byte. It could scale to infinity if you
> wanted to, though I think the MIDI spec doesn't have anything that can
> go quite that high, and it's a worst-case wastage of 12.5% (or,
> looking the other way, worst-case expansion of 14.2%), compared to an
> optimal eight-bit encoding. Given that MIDI often works with very
> small numbers, but occasionally could have to cope with much larger
> ones (eg "time to next event", which is often zero or very low, but
> once in a while could be anything at all), this is a good trade-off.
> If you know you're frequently going to work with numbers up to 2**31
> but only occasionally larger than that, it might be better to take
> four bytes at a time, and use the high bit as a continuation bit, in
> the same way. It's all trade-offs.

I have used that 7 bits/byte encoding myself for variable length 
integers.  It worked well enough for its purpose.

>
> Decimal digits give you roughly 10/3 bits per character (== per byte,
> if you encode UTF-8). If you consider that your baseline, you can then
> try to justify your algorithm on the basis of "it's twice as efficient
> as decimal". Hexadecimal gives you 4 bits per character, at a
> relatively small cost; if you can't get better than about 6 bits per
> byte average throughput, I would say your algorithm is completely
> valueless - the difference will so rarely even matter, and you lose
> all the benefits I mentioned in my last email.
>
> I've tried to read through the original algorithm description, but I'm
> not entirely sure: How many payload bits per transmitted byte does it
> actually achieve?

Somehow when I first read the message, I missed the fact that there was 
a link to a web page with source code.

But the first thing I'd expect to see would be a target estimate of the 
anticipated distribution of number values/magnitudes.  For example, if a 
typical integer is 1500 bits, plus/minus 200 bits, I'd probably try 
encoding in base 65535, and have a terminator character of 0xffff.

Without such a target, it's not a compression scheme, but an encoding one.

An interesting point of history.  At the time of Huffman's paper, it was 
provable that it was the best possible lossless compression.  But there 
were some unwritten assumptions, such as that each character would be 
encoded with a whole number of bits.  Changing that assumption makes 
arithmetic coding possible.  Next assumption was that probabilities of 
each character didn't depend on context.  Changing that assumption 
enables LZ.  And so on.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85742

FromChris Angelico <rosuav@gmail.com>
Date2015-02-18 01:58 +1100
Message-ID<mailman.18787.1424185098.18130.python-list@python.org>
In reply to#85737
On Wed, Feb 18, 2015 at 1:50 AM, Dave Angel <davea@davea.name> wrote:
> But the first thing I'd expect to see would be a target estimate of the
> anticipated distribution of number values/magnitudes.  For example, if a
> typical integer is 1500 bits, plus/minus 200 bits, I'd probably try encoding
> in base 65535, and have a terminator character of 0xffff.

Sure. Not sure how you'd cope with an interior FFFF in the stream
without drastically losing efficiency, though.

> An interesting point of history.  At the time of Huffman's paper, it was
> provable that it was the best possible lossless compression.  But there were
> some unwritten assumptions, such as that each character would be encoded
> with a whole number of bits.  Changing that assumption makes arithmetic
> coding possible.  Next assumption was that probabilities of each character
> didn't depend on context.  Changing that assumption enables LZ.  And so on.

Didn't LZ predate arithmetic coding? But yes, removing those
assumptions has enabled some pretty amazing alternatives. Just
describing arithmetic coding to them is enough to blow most people's
minds. (Especially my sister. She hates infinity, mainly - I think -
because it doesn't fit inside her brain. I troll her occasionally with
Hilbert's Hotel or anything involving the notion of different sizes of
infinity.)

ChrisA

[toc] | [prev] | [next] | [standalone]


#85743

FromDave Angel <davea@davea.name>
Date2015-02-17 10:18 -0500
Message-ID<mailman.18788.1424186328.18130.python-list@python.org>
In reply to#85737
On 02/17/2015 09:58 AM, Chris Angelico wrote:
> On Wed, Feb 18, 2015 at 1:50 AM, Dave Angel <davea@davea.name> wrote:
>> But the first thing I'd expect to see would be a target estimate of the
>> anticipated distribution of number values/magnitudes.  For example, if a
>> typical integer is 1500 bits, plus/minus 200 bits, I'd probably try encoding
>> in base 65535, and have a terminator character of 0xffff.
>
> Sure. Not sure how you'd cope with an interior FFFF in the stream
> without drastically losing efficiency, though.

That's why it was base 65535, not 65536.

>
>> An interesting point of history.  At the time of Huffman's paper, it was
>> provable that it was the best possible lossless compression.  But there were
>> some unwritten assumptions, such as that each character would be encoded
>> with a whole number of bits.  Changing that assumption makes arithmetic
>> coding possible.  Next assumption was that probabilities of each character
>> didn't depend on context.  Changing that assumption enables LZ.  And so on.
>
> Didn't LZ predate arithmetic coding?

I really don't know.  I looked up Huffman's article (before Internet, I 
found it in a library) from the 1952 IRE journal (I believe that's where 
it was).  I probably read it about 1980.

But arithmetic coding and LZ both came later to me, and that's the order 
I saw them in.

> But yes, removing those
> assumptions has enabled some pretty amazing alternatives. Just
> describing arithmetic coding to them is enough to blow most people's
> minds. (Especially my sister. She hates infinity, mainly - I think -
> because it doesn't fit inside her brain. I troll her occasionally with
> Hilbert's Hotel or anything involving the notion of different sizes of
> infinity.)


-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85744

FromChris Angelico <rosuav@gmail.com>
Date2015-02-18 02:25 +1100
Message-ID<mailman.18789.1424186722.18130.python-list@python.org>
In reply to#85737
On Wed, Feb 18, 2015 at 2:18 AM, Dave Angel <davea@davea.name> wrote:
>> Sure. Not sure how you'd cope with an interior FFFF in the stream
>> without drastically losing efficiency, though.
>
>
> That's why it was base 65535, not 65536.

Doh. Yeah. I autocorrected in my head, but yes, base 65535 is safe.

ChrisA

[toc] | [prev] | [next] | [standalone]


#85746

FromPaul Rubin <no.email@nospam.invalid>
Date2015-02-17 08:43 -0800
Message-ID<87a90cfpzh.fsf@jester.gateway.pace.com>
In reply to#85737
janhein.vanderburg@gmail.com writes:
> The next step is the development of the python code that minimizes
> processor requirements without compromising the algorithm.

This is a reasonable place to ask specific python questions.  The
algorithm description itself is pretty confusing though, and it seems to
address a problem that doesn't particularly seem to need a solution.
It's pretty typical for applications needing compact representations to
encode smallish variable-length integers in the VInt format, basically
using 7 bits from each byte for data, and the 8th bit as a terminator
flag.  So VInt encodes numbers 0-127 in one byte, 128 to 16383 in 2
bytes, etc.  Bignum applications generally use a length cell and a data
array.

[toc] | [prev] | [next] | [standalone]


#85779

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 01:06 -0800
Message-ID<b0aa5dd2-78bb-491f-bf08-b278cd9b9934@googlegroups.com>
In reply to#85746
On Tuesday, February 17, 2015 at 5:43:43 PM UTC+1, Paul Rubin wrote:

> This is a reasonable place to ask specific python questions.  The
> algorithm description itself is pretty confusing though, and it seems to
> address a problem that doesn't particularly seem to need a solution.
> It's pretty typical for applications needing compact representations to
> encode smallish variable-length integers in the VInt format, basically
> using 7 bits from each byte for data, and the 8th bit as a terminator
> flag.  So VInt encodes numbers 0-127 in one byte, 128 to 16383 in 2
> bytes, etc.  Bignum applications generally use a length cell and a data
> array.

Thanks Paul,

The algorithm sort of combines the Vint  and Bignum approaches.
I'm sorry to read that my description is confusing; is the illustration with 2 bit characters helpful or should it be extended?

[toc] | [prev] | [next] | [standalone]


#85876

FromMario Figueiredo <marfig@gmail.com>
Date2015-02-19 08:44 +0100
Message-ID<m15bea9qn2332b4mcvr192ju7e6in1s44n@4ax.com>
In reply to#85737
A lot of patronizing egos running around in these groups. This is a
sad thread...

What is being asked is for help, not whether this is useful or needed.
Jan-Hein is after some directions, not whether your bloody opinion on
how he should use his free time.

If the interest and usability of a project would somehow become a
problem, then boy, oh, boy; Most everyone in here, including the
patronizing posters, would probably be left without anything to code.

[toc] | [prev] | [next] | [standalone]


Page 2 of 3 — ← Prev page 1 [2] 3  Next page →

Back to top | Article view | comp.lang.python


csiph-web