Path: csiph.com!news.swapon.de!eternal-september.org!feeder.eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail From: Paul Rubin Newsgroups: comp.lang.python Subject: Re: shorten "compress" long integer to short ascii. Date: Thu, 19 Nov 2015 20:27:07 -0800 Organization: A noiseless patient Spider Lines: 51 Message-ID: <871tbll2hw.fsf@nightsong.com> References: Mime-Version: 1.0 Content-Type: text/plain Injection-Info: mx02.eternal-september.org; posting-host="560a36ee31cc4bcf69e115b311f0cc5c"; logging-data="8488"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18IoEaq9BUjH3/js1eD6gzZ" User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/24.3 (gnu/linux) Cancel-Lock: sha1:RPmIWLxk0QQLlz6BI0xvZs9jBm4= sha1:oBrF9qArvCt4FxhUaqCI3FTq4U0= Xref: csiph.com comp.lang.python:99123 Vincent Davis writes: > My goal is to shorten a long integer into a shorter set of characters. > Below is what I have which gets me about a 45-50% reduction. Any suggestion > on how to improve upon this? You can't improve much. A decimal digit carries log(10,2)=3.32 bits of information. A reasonable character set for Twitter-style links might have 80 or so characters (upper/lower alphabetic, digits, and a dozen or so punctuation characters), or log(80,2)= > I not limited to ascii but I didn't see how going to utf8 would help. If you could use Unicode characters like Chinese ideographs, that gives you a much larger alphabet to work with, so you'd need fewer chars displayed in the link, but they'd be hard for most people to type. > l = string.ascii_lowercase + string.ascii_uppercase + > '!"#$%&\'()*+,-./:;<=>?@[]^_`{|}~' OK, 83 chars, but it may not be ok to use some of them like #, /, and ?, since they can have special meanings in urls. Your algorithm looks basically ok though I didn't examine it closely. Here is my shortened version: import string # alphabet here is 83 chars alphabet = string.ascii_lowercase + \ string.ascii_uppercase +'!"#$%&\'()*+,-./:;<=>?@[]^_`{|}~' alphabet_size = len(alphabet) decoderdict = dict((b,a) for a,b in enumerate(alphabet)) def encoder(integer): a,b = divmod(integer, alphabet_size) if a == 0: return alphabet[b] return encoder(a) + alphabet[b] def decoder(code): return reduce(lambda n,d: n*alphabet_size + decoderdict[d], code, 0) def test(): n = 92928729379271 short = encoder(n) backagain = decoder(short) nlen = len(str(n)) print (nlen, len(short), float(len(short))/nlen) assert n==backagain, (n,short,b) test()