Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #85737 > unrolled thread
| Started by | janhein.vanderburg@gmail.com |
|---|---|
| First post | 2015-02-17 03:22 -0800 |
| Last post | 2015-02-20 00:58 -0800 |
| Articles | 20 on this page of 53 — 15 participants |
Back to article view | Back to comp.lang.python
python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-17 03:22 -0800
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 00:16 +1100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:55 -0800
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 20:36 +1100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:29 -0800
Re: python implementation of a new integer encoding algorithm. Laura Creighton <lac@openend.se> - 2015-02-18 11:32 +0100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:48 -0800
People hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.) Ben Finney <ben+python@benfinney.id.au> - 2015-02-18 21:57 +1100
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:12 -0500
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:59 -0800
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 11:46 -0500
Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 17:30 +0000
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 18:12 +0000
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:55 -0800
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-18 23:54 +0200
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 00:08 +0200
Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 22:58 +0000
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 17:19 -0500
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:45 -0800
Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:04 -0700
Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:16 -0700
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:24 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:34 +1100
Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:32 -0700
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:41 -0500
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:46 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:49 +1100
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 17:00 +0000
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:34 +1100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:04 -0800
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 08:54 -0500
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:52 -0800
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 01:16 +1100
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:50 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:58 +1100
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 10:18 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 02:25 +1100
Re: python implementation of a new integer encoding algorithm. Paul Rubin <no.email@nospam.invalid> - 2015-02-17 08:43 -0800
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:06 -0800
Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 08:44 +0100
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 08:06 +0000
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 10:36 +0200
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 09:33 +0000
Re: python implementation of a new integer encoding algorithm. Terry Reedy <tjreedy@udel.edu> - 2015-02-19 14:50 -0500
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 21:55 +0200
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 19:36 +1100
Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 10:42 +0100
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 10:28 +0000
Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 14:27 +0100
Re: python implementation of a new integer encoding algorithm. Jonas Wielicki <jonas@wielicki.name> - 2015-02-19 09:38 +0100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:58 -0800
Re: python implementation of a new integer encoding algorithm. Denis McMahon <denismfmcmahon@gmail.com> - 2015-02-20 02:46 +0000
Re: python implementation of a new integer encoding algorithm. wxjmfauth@gmail.com - 2015-02-20 00:58 -0800
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-02-19 11:16 -0700 |
| Message-ID | <mailman.18893.1424369820.18130.python-list@python.org> |
| In reply to | #85910 |
On Thu, Feb 19, 2015 at 11:04 AM, Ian Kelly <ian.g.kelly@gmail.com> wrote: > There's also an optimization that can be added here if we wish to > inject a bit of cleverness. Notice that all values with more than one > group start with 11, never 10. We can borrow a trick from IEEE > floating point and make the leading 1 bit of the mantissa implicit for > all values greater than 3 (we can't do it for 2 and 3 because then we > couldn't distinguish them from 0 and 1). Applying this optimization > removes one full group from the representation of all values greater > than 3, which appears to make the stop-bit representation as short as > or shorter than the "difficult" one for all the values that have been > enumerated above. I made a mistake. This trick only works if the second group is 10, so it only applies to half the numbers. That makes 12 and 13 a group longer than the difficult representation, but on the other hand 18-23 are still a group shorter.
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-19 13:24 -0500 |
| Message-ID | <mailman.18894.1424370292.18130.python-list@python.org> |
| In reply to | #85910 |
On 02/19/2015 10:45 AM, janhein.vanderburg@gmail.com wrote:
> On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote:
>> I'm not necessarily doubting it, just challenging you to provide a data
>> sample that actually shows it. And of course, I'm not claiming that
>> 7bit is in any way optimal. You cannot define optimal without first
>> defining the distribution.
>
> Weird results.
In what way weird?
> For a character size 2 the growth processes are shown below.
> I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length:
> 0: 00 00 0
> 1: 01 01 0
> 2: 10, 00 10, 00 0
> 3: 10, 01 10, 01 0
> 4: 10, 10 11, 00 0
> 5: 10, 11 11, 01 0
> 6: 11, 00.00 11, 10, 00 0
> 7: 11, 00.01 11, 10, 01 0
> 8: 11, 00.10 11, 11, 00 0
> 9: 11, 00.11 11, 11, 01 0
> 10: 11, 01.00 11, 11, 10, 00 1
> 11: 11, 01.01 11, 11, 10, 01 1
> 12: 11, 01.10 11, 11, 11, 00 1
> 13: 11, 01.11 11, 11, 11, 01 1
> 14: 11, 10.00, 00 11, 11, 11, 10, 00 1
> 15: 11, 10.00, 01 11, 11, 11, 10, 01 1
> 16: 11, 10.00, 10 11, 11, 11, 11, 00 1
> 17: 11, 10.00, 11 11, 11, 11, 11, 01 1
> 18: 11, 10.01, 00.00 11, 11, 11, 11, 10, 00 1
> 19: 11, 10.01, 00.01 11, 11, 11, 11, 10, 01 1
> 20: 11, 10.01, 00.10 11, 11, 11, 11, 11, 00 1
> 21: 11, 10.01, 00.11 11, 11, 11, 11, 11, 01 1
> 22: 11, 10.01, 01.00 11, 11, 11, 11, 11, 10, 00 2
> 23: 11, 10.01, 01.01 11, 11, 11, 11, 11, 10, 01 2
> 24: 11, 10.01, 01.10 11, 11, 11, 11, 11, 11, 00 2
> 25: 11, 10.01, 01.11 11, 11, 11, 11, 11, 11, 01 2
> 26: 11, 10.01, 10.00 11, 11, 11, 11, 11, 11, 10, 00 3
>
> I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding.
>
Clearly, one wouldn't use what you call "stop bit encoding" with a 2bit
character. And since the Python code you posted uses an 8bit character,
I can't validate the "difficult" column either.
I wrote the following pair of functions:
def seven_code(n):
acc = bytearray()
if n == 0:
acc.append(0)
while n > 0:
quotient, remainder = divmod(n, 128)
acc.append(remainder)
n = quotient
acc[-1] |= 0x80 #add a stop bit to last byte
return acc
def seven_decode(sequ):
acc = 0
multiplier = 1
for by in sequ:
acc += (by & 0x7f) * multiplier
if by > 0x7f:
break
multiplier *= 128
return acc
Here's a couple of ranges of output, showing that the 7bit scheme does
better for values between 384 and 16379.
382 2 80fe --- 2 7e82
383 2 80ff --- 2 7f82
384 3 810000 --- 2 0083
384 jan grew 3 810000
385 3 810001 --- 2 0183
386 3 810002 --- 2 0283
387 3 810003 --- 2 0383
388 3 810004 --- 2 0483
389 3 810005 --- 2 0583
16380 3 813e7c --- 2 7cff
16380 jan grew 3 813e7c
16380 7bit grew 2 7cff
16381 3 813e7d --- 2 7dff
16382 3 813e7e --- 2 7eff
16383 3 813e7f --- 2 7fff
16384 3 813e80 --- 3 000081
16384 7bit grew 3 000081
16385 3 813e81 --- 3 010081
16386 3 813e82 --- 3 020081
16387 3 813e83 --- 3 030081
16388 3 813e84 --- 3 040081
16389 3 813e85 --- 3 050081
In all my experimenting, I haven't found any values where the 7bit
scheme does worse. It seems likely that for extremely large integers,
it will, but if those are to be the intended distribution, the 7bit
scheme could be replaced by something else, like just encoding a length
at the beginning, and using raw bytes after that.
--
DaveA
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-20 05:34 +1100 |
| Message-ID | <mailman.18895.1424370900.18130.python-list@python.org> |
| In reply to | #85910 |
On Fri, Feb 20, 2015 at 5:24 AM, Dave Angel <davea@davea.name> wrote: > In all my experimenting, I haven't found any values where the 7bit scheme > does worse. It seems likely that for extremely large integers, it will, but > if those are to be the intended distribution, the 7bit scheme could be > replaced by something else, like just encoding a length at the beginning, > and using raw bytes after that. Encoding a length (as varlen) and then using eight bits to the byte thereafter is worse for small numbers, breaks even around 2**56, and then is better. So unless your numbers are mainly going to be above 2**56, it's better to just use varlen for the entire number. On the other hand, if you have to stream this without over-reading (imagine streaming from a TCP/IP socket; you want to block until you have the whole number, but not block after that), it may be more efficient to take the length, and then do a blocking read for the main data, instead of a large number of single-byte reads. But on the gripping hand, you can probably just do those one-byte reads and rely on (or implement) lower-level buffering. Ask not the python-list for advice, because they will say both "yes" and "no" and "maybe"... because they will say all three of "yes", "no", "maybe", and "you don't need to do that"... erm, AMONG our responses will be such diverse elements as... ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-02-19 11:32 -0700 |
| Message-ID | <mailman.18896.1424371191.18130.python-list@python.org> |
| In reply to | #85910 |
On Thu, Feb 19, 2015 at 11:24 AM, Dave Angel <davea@davea.name> wrote: > Here's a couple of ranges of output, showing that the 7bit scheme does > better for values between 384 and 16379. > > 382 2 80fe --- 2 7e82 > 383 2 80ff --- 2 7f82 > 384 3 810000 --- 2 0083 > 384 jan grew 3 810000 > 385 3 810001 --- 2 0183 > 386 3 810002 --- 2 0283 > 387 3 810003 --- 2 0383 > 388 3 810004 --- 2 0483 > 389 3 810005 --- 2 0583 > > 16380 3 813e7c --- 2 7cff > 16380 jan grew 3 813e7c > 16380 7bit grew 2 7cff > 16381 3 813e7d --- 2 7dff > 16382 3 813e7e --- 2 7eff > 16383 3 813e7f --- 2 7fff > 16384 3 813e80 --- 3 000081 > 16384 7bit grew 3 000081 > 16385 3 813e81 --- 3 010081 > 16386 3 813e82 --- 3 020081 > 16387 3 813e83 --- 3 030081 > 16388 3 813e84 --- 3 040081 > 16389 3 813e85 --- 3 050081 > > In all my experimenting, I haven't found any values where the 7bit scheme > does worse. It seems likely that for extremely large integers, it will, but > if those are to be the intended distribution, the 7bit scheme could be > replaced by something else, like just encoding a length at the beginning, > and using raw bytes after that. It looks like you're counting whole bytes, not bits. That would be important since the "difficult" encoding uses fractional bytes.
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-19 13:41 -0500 |
| Message-ID | <mailman.18897.1424371314.18130.python-list@python.org> |
| In reply to | #85910 |
On 02/19/2015 01:34 PM, Chris Angelico wrote: > On Fri, Feb 20, 2015 at 5:24 AM, Dave Angel <davea@davea.name> wrote: >> In all my experimenting, I haven't found any values where the 7bit scheme >> does worse. It seems likely that for extremely large integers, it will, but >> if those are to be the intended distribution, the 7bit scheme could be >> replaced by something else, like just encoding a length at the beginning, >> and using raw bytes after that. > > Encoding a length (as varlen) and then using eight bits to the byte > thereafter is worse for small numbers, I only suggested this if it turns out that the distribution is primarily extremely large numbers, large enough that 7bit isn't good enough. As I (and others) have said many times, making it optimal means making some assumptions about the distribution of likely values. > breaks even around 2**56, and > then is better. So unless your numbers are mainly going to be above > 2**56, it's better to just use varlen for the entire number. On the > other hand, if you have to stream this without over-reading (imagine > streaming from a TCP/IP socket; you want to block until you have the > whole number, but not block after that), it may be more efficient to > take the length, and then do a blocking read for the main data, > instead of a large number of single-byte reads. But on the gripping > hand, you can probably just do those one-byte reads and rely on (or > implement) lower-level buffering. > > Ask not the python-list for advice, because they will say both "yes" > and "no" and "maybe"... because they will say all three of "yes", > "no", "maybe", and "you don't need to do that"... erm, AMONG our > responses will be such diverse elements as... > > ChrisA > -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-19 13:46 -0500 |
| Message-ID | <mailman.18898.1424371589.18130.python-list@python.org> |
| In reply to | #85910 |
On 02/19/2015 01:32 PM, Ian Kelly wrote: > On Thu, Feb 19, 2015 at 11:24 AM, Dave Angel <davea@davea.name> wrote: >> Here's a couple of ranges of output, showing that the 7bit scheme does >> better for values between 384 and 16379. >> >> 382 2 80fe --- 2 7e82 >> 383 2 80ff --- 2 7f82 >> 384 3 810000 --- 2 0083 >> 384 jan grew 3 810000 >> 385 3 810001 --- 2 0183 >> 386 3 810002 --- 2 0283 >> 387 3 810003 --- 2 0383 >> 388 3 810004 --- 2 0483 >> 389 3 810005 --- 2 0583 >> >> 16380 3 813e7c --- 2 7cff >> 16380 jan grew 3 813e7c >> 16380 7bit grew 2 7cff >> 16381 3 813e7d --- 2 7dff >> 16382 3 813e7e --- 2 7eff >> 16383 3 813e7f --- 2 7fff >> 16384 3 813e80 --- 3 000081 >> 16384 7bit grew 3 000081 >> 16385 3 813e81 --- 3 010081 >> 16386 3 813e82 --- 3 020081 >> 16387 3 813e83 --- 3 030081 >> 16388 3 813e84 --- 3 040081 >> 16389 3 813e85 --- 3 050081 >> >> In all my experimenting, I haven't found any values where the 7bit scheme >> does worse. It seems likely that for extremely large integers, it will, but >> if those are to be the intended distribution, the 7bit scheme could be >> replaced by something else, like just encoding a length at the beginning, >> and using raw bytes after that. > > It looks like you're counting whole bytes, not bits. That would be > important since the "difficult" encoding uses fractional bytes. > Not the implementation that was shared. I've only seen one set of Python code for "difficult", and it was strictly bytes. As i said earlier in the message you quoted from. Naturally, I question whether the original description makes sense for sub-bytes, since it was claimed that these are NOT for lists or sequences of arbitrary integers, but only for a single one at a time. Presumably mixed with other data which may or may not like bit encoding. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-20 05:49 +1100 |
| Message-ID | <mailman.18899.1424371770.18130.python-list@python.org> |
| In reply to | #85910 |
On Fri, Feb 20, 2015 at 5:41 AM, Dave Angel <davea@davea.name> wrote: > As I (and others) have said many times, making it optimal means making some > assumptions about the distribution of likely values. In fact, the very word "optimal" implies that. You have to have a set of criteria on which you base your evaluation, and those criteria will include target data. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-02-18 17:00 +0000 |
| Message-ID | <mailman.18825.1424278837.18130.python-list@python.org> |
| In reply to | #85777 |
On 18/02/2015 16:46, Dave Angel wrote: > On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote: > > >> encoding individual integers optimally without any assumptions about >> their values. >> > > Contradiction in terms. > I'm just pleased to see new blood coming through for my dream team, it's been a bit quiet recently. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-18 01:34 +1100 |
| Message-ID | <mailman.18785.1424183690.18130.python-list@python.org> |
| In reply to | #85737 |
On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <davea@davea.name> wrote: > They had a field type called a "compressed integer." It could vary between > one byte and I think about six. And the theory was that it took less space > than the equivalent format of fixed size integers. Oh, incidentally: If you want a decent binary format for variable-sized integer, check out the MIDI spec. Its varlen integer is simple: each byte carries seven bits of payload, and the high bit is set on all except the last byte. Anything under 128 is represented by the byte with that value; after that, you get a series of continuation bytes followed by a final byte. It could scale to infinity if you wanted to, though I think the MIDI spec doesn't have anything that can go quite that high, and it's a worst-case wastage of 12.5% (or, looking the other way, worst-case expansion of 14.2%), compared to an optimal eight-bit encoding. Given that MIDI often works with very small numbers, but occasionally could have to cope with much larger ones (eg "time to next event", which is often zero or very low, but once in a while could be anything at all), this is a good trade-off. If you know you're frequently going to work with numbers up to 2**31 but only occasionally larger than that, it might be better to take four bytes at a time, and use the high bit as a continuation bit, in the same way. It's all trade-offs. Decimal digits give you roughly 10/3 bits per character (== per byte, if you encode UTF-8). If you consider that your baseline, you can then try to justify your algorithm on the basis of "it's twice as efficient as decimal". Hexadecimal gives you 4 bits per character, at a relatively small cost; if you can't get better than about 6 bits per byte average throughput, I would say your algorithm is completely valueless - the difference will so rarely even matter, and you lose all the benefits I mentioned in my last email. I've tried to read through the original algorithm description, but I'm not entirely sure: How many payload bits per transmitted byte does it actually achieve? ChrisA
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 01:04 -0800 |
| Message-ID | <65b209a0-33fc-4c1f-8af6-de3a4626623c@googlegroups.com> |
| In reply to | #85740 |
On Tuesday, February 17, 2015 at 3:35:16 PM UTC+1, Chris Angelico wrote: > Oh, incidentally: If you want a decent binary format for > variable-sized integer, check out the MIDI spec. I did some time ago, thanks, and it is indeed a decent format. I also looked at variations of that approach. None of them beats "my" concept of two counters that cooperatively specify field lengths and represented integer values. > I've tried to read through the original algorithm description, but I'm > not entirely sure: How many payload bits per transmitted byte does it > actually achieve? I don't think that payload bits per byte makes sense in this concept.
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-18 08:54 -0500 |
| Message-ID | <mailman.18818.1424267695.18130.python-list@python.org> |
| In reply to | #85778 |
On 02/18/2015 04:04 AM, janhein.vanderburg@gmail.com wrote: > On Tuesday, February 17, 2015 at 3:35:16 PM UTC+1, Chris Angelico wrote: >> Oh, incidentally: If you want a decent binary format for >> variable-sized integer, check out the MIDI spec. > > I did some time ago, thanks, and it is indeed a decent format. > I also looked at variations of that approach. > None of them beats Define "beats." You might mean beats in simplicity, or in elegance, or in clarity of code. But you probably mean in space efficiency, or "compression." But that's meaningless without a target distribution of values that you expect to encode. For example, if 99.9% of your values are going to be less than 255, then the most efficient byte encoding would be one that simply stores a value less than 255, and starts with an FF for larger values. It's almost irrelevant how it encodes those larger values. On the other hand, if most values are going to be in the 10,000 to 20,000 bit size range, and a few will be much smaller, and a few will be very much larger, then it would be very practical to start with a size field, say 16 bits, followed by the raw packed data. Naturally, the size field would need to have an escape value that indicates a larger field was needed. In fact, the size field could be encoded in a 7bits-per-byte manner, so it would encode an arbitrary sized number as well. > "my" concept of two counters that cooperatively specify field lengths and represented integer values. > >> I've tried to read through the original algorithm description, but I'm >> not entirely sure: How many payload bits per transmitted byte does it >> actually achieve? > > I don't think that payload bits per byte makes sense in this concept. > Correct. Presumably one means average payload bits per byte. First one would have to define what the "standard" unencoded variable length integer format was. Then one could call that size the payload size. Then, in order to compute an average, one would have to specify an expected, or target distribution of values. One then compares and averages the payload size for each typical value with the encoded size. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 11:52 -0800 |
| Message-ID | <dbfbf5b9-6d5a-4738-ab04-779a15045d68@googlegroups.com> |
| In reply to | #85792 |
Op woensdag 18 februari 2015 14:55:07 UTC+1 schreef Dave Angel: > Define "beats." You might mean beats in simplicity, or in elegance, or > in clarity of code. But you probably mean in space efficiency, or > "compression." But that's meaningless without a target distribution of > values that you expect to encode. I do indeed mean space efficiency, not compression. And it is not meaningless and I don't do distribution tricks. Don't take this personally.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-19 01:16 +1100 |
| Message-ID | <mailman.18820.1424268970.18130.python-list@python.org> |
| In reply to | #85778 |
On Thu, Feb 19, 2015 at 12:54 AM, Dave Angel <davea@davea.name> wrote: >>> I've tried to read through the original algorithm description, but I'm >>> not entirely sure: How many payload bits per transmitted byte does it >>> actually achieve? >> >> >> I don't think that payload bits per byte makes sense in this concept. >> > > Correct. Presumably one means average payload bits per byte. > > First one would have to define what the "standard" unencoded variable length > integer format was. Then one could call that size the payload size. Then, > in order to compute an average, one would have to specify an expected, or > target distribution of values. One then compares and averages the payload > size for each typical value with the encoded size. Average, or separately for different ranges. Alternatively, identify which ranges of numbers can be encoded with how many bytes. For instance, if you have a varlen length (in bytes) followed by that many packed bytes, you would have: 2: up to 1<<8-1 3: up to 1<<16-1 4: up to 1<<24-1 ... 128: up to 1<<1016-1 130: up to 1<<1024-1 131: up to 1<<1032-1 etc Using varlen directly gives: 1: up to 1<<7-1 2: up to 1<<14-1 3: up to 1<<21-1 etc So if your number is around about 1<<20, you know it's going to take 3 bytes to encode as varlen, or 4 to encode with the meta-length. Easy comparisons. I'm not sure how this algorithm compares, because it's extremely clever. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-17 09:50 -0500 |
| Message-ID | <mailman.18786.1424184618.18130.python-list@python.org> |
| In reply to | #85737 |
On 02/17/2015 09:34 AM, Chris Angelico wrote: > On Wed, Feb 18, 2015 at 1:12 AM, Dave Angel <davea@davea.name> wrote: >> They had a field type called a "compressed integer." It could vary between >> one byte and I think about six. And the theory was that it took less space >> than the equivalent format of fixed size integers. > > Oh, incidentally: If you want a decent binary format for > variable-sized integer, check out the MIDI spec. Its varlen integer is > simple: each byte carries seven bits of payload, and the high bit is > set on all except the last byte. Anything under 128 is represented by > the byte with that value; after that, you get a series of continuation > bytes followed by a final byte. It could scale to infinity if you > wanted to, though I think the MIDI spec doesn't have anything that can > go quite that high, and it's a worst-case wastage of 12.5% (or, > looking the other way, worst-case expansion of 14.2%), compared to an > optimal eight-bit encoding. Given that MIDI often works with very > small numbers, but occasionally could have to cope with much larger > ones (eg "time to next event", which is often zero or very low, but > once in a while could be anything at all), this is a good trade-off. > If you know you're frequently going to work with numbers up to 2**31 > but only occasionally larger than that, it might be better to take > four bytes at a time, and use the high bit as a continuation bit, in > the same way. It's all trade-offs. I have used that 7 bits/byte encoding myself for variable length integers. It worked well enough for its purpose. > > Decimal digits give you roughly 10/3 bits per character (== per byte, > if you encode UTF-8). If you consider that your baseline, you can then > try to justify your algorithm on the basis of "it's twice as efficient > as decimal". Hexadecimal gives you 4 bits per character, at a > relatively small cost; if you can't get better than about 6 bits per > byte average throughput, I would say your algorithm is completely > valueless - the difference will so rarely even matter, and you lose > all the benefits I mentioned in my last email. > > I've tried to read through the original algorithm description, but I'm > not entirely sure: How many payload bits per transmitted byte does it > actually achieve? Somehow when I first read the message, I missed the fact that there was a link to a web page with source code. But the first thing I'd expect to see would be a target estimate of the anticipated distribution of number values/magnitudes. For example, if a typical integer is 1500 bits, plus/minus 200 bits, I'd probably try encoding in base 65535, and have a terminator character of 0xffff. Without such a target, it's not a compression scheme, but an encoding one. An interesting point of history. At the time of Huffman's paper, it was provable that it was the best possible lossless compression. But there were some unwritten assumptions, such as that each character would be encoded with a whole number of bits. Changing that assumption makes arithmetic coding possible. Next assumption was that probabilities of each character didn't depend on context. Changing that assumption enables LZ. And so on. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-18 01:58 +1100 |
| Message-ID | <mailman.18787.1424185098.18130.python-list@python.org> |
| In reply to | #85737 |
On Wed, Feb 18, 2015 at 1:50 AM, Dave Angel <davea@davea.name> wrote: > But the first thing I'd expect to see would be a target estimate of the > anticipated distribution of number values/magnitudes. For example, if a > typical integer is 1500 bits, plus/minus 200 bits, I'd probably try encoding > in base 65535, and have a terminator character of 0xffff. Sure. Not sure how you'd cope with an interior FFFF in the stream without drastically losing efficiency, though. > An interesting point of history. At the time of Huffman's paper, it was > provable that it was the best possible lossless compression. But there were > some unwritten assumptions, such as that each character would be encoded > with a whole number of bits. Changing that assumption makes arithmetic > coding possible. Next assumption was that probabilities of each character > didn't depend on context. Changing that assumption enables LZ. And so on. Didn't LZ predate arithmetic coding? But yes, removing those assumptions has enabled some pretty amazing alternatives. Just describing arithmetic coding to them is enough to blow most people's minds. (Especially my sister. She hates infinity, mainly - I think - because it doesn't fit inside her brain. I troll her occasionally with Hilbert's Hotel or anything involving the notion of different sizes of infinity.) ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-17 10:18 -0500 |
| Message-ID | <mailman.18788.1424186328.18130.python-list@python.org> |
| In reply to | #85737 |
On 02/17/2015 09:58 AM, Chris Angelico wrote: > On Wed, Feb 18, 2015 at 1:50 AM, Dave Angel <davea@davea.name> wrote: >> But the first thing I'd expect to see would be a target estimate of the >> anticipated distribution of number values/magnitudes. For example, if a >> typical integer is 1500 bits, plus/minus 200 bits, I'd probably try encoding >> in base 65535, and have a terminator character of 0xffff. > > Sure. Not sure how you'd cope with an interior FFFF in the stream > without drastically losing efficiency, though. That's why it was base 65535, not 65536. > >> An interesting point of history. At the time of Huffman's paper, it was >> provable that it was the best possible lossless compression. But there were >> some unwritten assumptions, such as that each character would be encoded >> with a whole number of bits. Changing that assumption makes arithmetic >> coding possible. Next assumption was that probabilities of each character >> didn't depend on context. Changing that assumption enables LZ. And so on. > > Didn't LZ predate arithmetic coding? I really don't know. I looked up Huffman's article (before Internet, I found it in a library) from the 1952 IRE journal (I believe that's where it was). I probably read it about 1980. But arithmetic coding and LZ both came later to me, and that's the order I saw them in. > But yes, removing those > assumptions has enabled some pretty amazing alternatives. Just > describing arithmetic coding to them is enough to blow most people's > minds. (Especially my sister. She hates infinity, mainly - I think - > because it doesn't fit inside her brain. I troll her occasionally with > Hilbert's Hotel or anything involving the notion of different sizes of > infinity.) -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-18 02:25 +1100 |
| Message-ID | <mailman.18789.1424186722.18130.python-list@python.org> |
| In reply to | #85737 |
On Wed, Feb 18, 2015 at 2:18 AM, Dave Angel <davea@davea.name> wrote: >> Sure. Not sure how you'd cope with an interior FFFF in the stream >> without drastically losing efficiency, though. > > > That's why it was base 65535, not 65536. Doh. Yeah. I autocorrected in my head, but yes, base 65535 is safe. ChrisA
[toc] | [prev] | [next] | [standalone]
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2015-02-17 08:43 -0800 |
| Message-ID | <87a90cfpzh.fsf@jester.gateway.pace.com> |
| In reply to | #85737 |
janhein.vanderburg@gmail.com writes: > The next step is the development of the python code that minimizes > processor requirements without compromising the algorithm. This is a reasonable place to ask specific python questions. The algorithm description itself is pretty confusing though, and it seems to address a problem that doesn't particularly seem to need a solution. It's pretty typical for applications needing compact representations to encode smallish variable-length integers in the VInt format, basically using 7 bits from each byte for data, and the 8th bit as a terminator flag. So VInt encodes numbers 0-127 in one byte, 128 to 16383 in 2 bytes, etc. Bignum applications generally use a length cell and a data array.
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 01:06 -0800 |
| Message-ID | <b0aa5dd2-78bb-491f-bf08-b278cd9b9934@googlegroups.com> |
| In reply to | #85746 |
On Tuesday, February 17, 2015 at 5:43:43 PM UTC+1, Paul Rubin wrote: > This is a reasonable place to ask specific python questions. The > algorithm description itself is pretty confusing though, and it seems to > address a problem that doesn't particularly seem to need a solution. > It's pretty typical for applications needing compact representations to > encode smallish variable-length integers in the VInt format, basically > using 7 bits from each byte for data, and the 8th bit as a terminator > flag. So VInt encodes numbers 0-127 in one byte, 128 to 16383 in 2 > bytes, etc. Bignum applications generally use a length cell and a data > array. Thanks Paul, The algorithm sort of combines the Vint and Bignum approaches. I'm sorry to read that my description is confusing; is the illustration with 2 bit characters helpful or should it be extended?
[toc] | [prev] | [next] | [standalone]
| From | Mario Figueiredo <marfig@gmail.com> |
|---|---|
| Date | 2015-02-19 08:44 +0100 |
| Message-ID | <m15bea9qn2332b4mcvr192ju7e6in1s44n@4ax.com> |
| In reply to | #85737 |
A lot of patronizing egos running around in these groups. This is a sad thread... What is being asked is for help, not whether this is useful or needed. Jan-Hein is after some directions, not whether your bloody opinion on how he should use his free time. If the interest and usability of a project would somehow become a problem, then boy, oh, boy; Most everyone in here, including the patronizing posters, would probably be left without anything to code.
[toc] | [prev] | [next] | [standalone]
Page 2 of 3 — ← Prev page 1 [2] 3 Next page →
Back to top | Article view | comp.lang.python
csiph-web