Path: csiph.com!usenet.pasdenom.info!gegeweb.org!usenet-fr.net!nerim.net!novso.com!newsfeed.xs4all.nl!newsfeed4a.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.012 X-Spam-Evidence: '*H*': 0.98; '*S*': 0.00; 'encoding': 0.05; 'column': 0.07; 'wednesday,': 0.07; '128': 0.09; 'character,': 0.09; 'output,': 0.09; 'will,': 0.09; 'python': 0.11; 'def': 0.12; 'jan': 0.12; 'suggest': 0.14; 'wrote': 0.14; "wouldn't": 0.14; 'posted': 0.15; '10:45': 0.16; '18:': 0.16; '384': 0.16; '386': 0.16; 'character.': 0.16; 'encoding.': 0.16; 'integers,': 0.16; 'length:': 0.16; 'remainder': 0.16; 'worse.': 0.16; 'subject:python': 0.16; 'weird': 0.16; 'wrote:': 0.18; 'bit': 0.19; 'differ': 0.19; 'else,': 0.19; 'seems': 0.21; 'header:User- Agent:1': 0.23; '(by': 0.24; 'byte': 0.24; 'bytes': 0.24; 'either.': 0.24; "haven't": 0.24; '---': 0.24; 'define': 0.26; 'shown': 0.26; 'values': 0.27; 'header:In-Reply-To:1': 0.27; 'am,': 0.29; 'character': 0.29; 'characters': 0.30; "i'm": 0.30; 'code': 0.31; 'that.': 0.31; 'claiming': 0.31; 'decimal': 0.31; 'subject:skip:i 10': 0.31; 'raw': 0.33; 'could': 0.34; "can't": 0.35; 'something': 0.35; 'but': 0.35; 'scheme': 0.36; "didn't": 0.36; 'shows': 0.36; 'subject:new': 0.38; 'challenging': 0.38; 'to:addr:python-list': 0.38; 'does': 0.39; 'extremely': 0.39; 'to:addr:python.org': 0.39; 'dave': 0.60; 'results.': 0.60; 'break': 0.61; 'length': 0.61; 'first': 0.61; 'email addr:gmail.com': 0.63; 'provide': 0.64; 'charset:windows-1252': 0.65; 'sample': 0.67; 'between': 0.67; 'received:74.208': 0.68; 'results': 0.69; 'below.': 0.71; 'ranges': 0.74; '10:': 0.84; '11:': 0.84; '12:': 0.84; '13:': 0.84; '14:': 0.84; '15:': 0.84; '2015': 0.84; 'beats': 0.84; 'functions:': 0.84; 'grew': 0.84; '8bit': 0.91; 'angel': 0.91 Date: Thu, 19 Feb 2015 13:24:40 -0500 From: Dave Angel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.4.0 MIME-Version: 1.0 To: python-list@python.org Subject: Re: python implementation of a new integer encoding algorithm. References: <515047c1-a20d-430e-a029-1c2d77db2f1a@googlegroups.com> <2a717ffb-d61d-4407-9082-1c17cd7ee573@googlegroups.com> <993f64c3-fc85-48a7-9d02-a4f12ecb33c6@googlegroups.com> In-Reply-To: <993f64c3-fc85-48a7-9d02-a4f12ecb33c6@googlegroups.com> Content-Type: text/plain; charset=windows-1252; format=flowed Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:6i0cPCpO4vD5t1gJZIu2y9S2MbWDX1RDgLMkMBfZ3uU YYXLjjX47PdXolnvVl+puZHKC6B07aZxBPnmDXzIWfhmla6455 WxdCNKfuRTvU+K2y7Dlg5feWqy8u66v6M6kV0DmJiiVG1xFtKc lm7gRt+Z6XbsGwyGnxIigplk7qWid1qly/Tv7MVQvgA756OJGs +19ji1oJZN0WA+Dw+1h3A3u3TruE+twlZz16VmunLSnAQJGzpS C+SSxE1RmWs4WZf/nVOP5AezJSkxdHNHJ7GUlkykHVFa8FJIu3 NQzFBpiM6NKo2+tMQbC/pOefFwLKplcUfAjR/x4U7G888aCmmw 2e8fO9YIN+6BuHqcMZLw= X-UI-Out-Filterresults: notjunk:1; X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 107 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1424370292 news.xs4all.nl 2934 [2001:888:2000:d::a6]:48125 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:85918 On 02/19/2015 10:45 AM, janhein.vanderburg@gmail.com wrote: > On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote: >> I'm not necessarily doubting it, just challenging you to provide a data >> sample that actually shows it. And of course, I'm not claiming that >> 7bit is in any way optimal. You cannot define optimal without first >> defining the distribution. > > Weird results. In what way weird? > For a character size 2 the growth processes are shown below. > I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length: > 0: 00 00 0 > 1: 01 01 0 > 2: 10, 00 10, 00 0 > 3: 10, 01 10, 01 0 > 4: 10, 10 11, 00 0 > 5: 10, 11 11, 01 0 > 6: 11, 00.00 11, 10, 00 0 > 7: 11, 00.01 11, 10, 01 0 > 8: 11, 00.10 11, 11, 00 0 > 9: 11, 00.11 11, 11, 01 0 > 10: 11, 01.00 11, 11, 10, 00 1 > 11: 11, 01.01 11, 11, 10, 01 1 > 12: 11, 01.10 11, 11, 11, 00 1 > 13: 11, 01.11 11, 11, 11, 01 1 > 14: 11, 10.00, 00 11, 11, 11, 10, 00 1 > 15: 11, 10.00, 01 11, 11, 11, 10, 01 1 > 16: 11, 10.00, 10 11, 11, 11, 11, 00 1 > 17: 11, 10.00, 11 11, 11, 11, 11, 01 1 > 18: 11, 10.01, 00.00 11, 11, 11, 11, 10, 00 1 > 19: 11, 10.01, 00.01 11, 11, 11, 11, 10, 01 1 > 20: 11, 10.01, 00.10 11, 11, 11, 11, 11, 00 1 > 21: 11, 10.01, 00.11 11, 11, 11, 11, 11, 01 1 > 22: 11, 10.01, 01.00 11, 11, 11, 11, 11, 10, 00 2 > 23: 11, 10.01, 01.01 11, 11, 11, 11, 11, 10, 01 2 > 24: 11, 10.01, 01.10 11, 11, 11, 11, 11, 11, 00 2 > 25: 11, 10.01, 01.11 11, 11, 11, 11, 11, 11, 01 2 > 26: 11, 10.01, 10.00 11, 11, 11, 11, 11, 11, 10, 00 3 > > I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding. > Clearly, one wouldn't use what you call "stop bit encoding" with a 2bit character. And since the Python code you posted uses an 8bit character, I can't validate the "difficult" column either. I wrote the following pair of functions: def seven_code(n): acc = bytearray() if n == 0: acc.append(0) while n > 0: quotient, remainder = divmod(n, 128) acc.append(remainder) n = quotient acc[-1] |= 0x80 #add a stop bit to last byte return acc def seven_decode(sequ): acc = 0 multiplier = 1 for by in sequ: acc += (by & 0x7f) * multiplier if by > 0x7f: break multiplier *= 128 return acc Here's a couple of ranges of output, showing that the 7bit scheme does better for values between 384 and 16379. 382 2 80fe --- 2 7e82 383 2 80ff --- 2 7f82 384 3 810000 --- 2 0083 384 jan grew 3 810000 385 3 810001 --- 2 0183 386 3 810002 --- 2 0283 387 3 810003 --- 2 0383 388 3 810004 --- 2 0483 389 3 810005 --- 2 0583 16380 3 813e7c --- 2 7cff 16380 jan grew 3 813e7c 16380 7bit grew 2 7cff 16381 3 813e7d --- 2 7dff 16382 3 813e7e --- 2 7eff 16383 3 813e7f --- 2 7fff 16384 3 813e80 --- 3 000081 16384 7bit grew 3 000081 16385 3 813e81 --- 3 010081 16386 3 813e82 --- 3 020081 16387 3 813e83 --- 3 030081 16388 3 813e84 --- 3 040081 16389 3 813e85 --- 3 050081 In all my experimenting, I haven't found any values where the 7bit scheme does worse. It seems likely that for extremely large integers, it will, but if those are to be the intended distribution, the 7bit scheme could be replaced by something else, like just encoding a length at the beginning, and using raw bytes after that. -- DaveA