Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #85737 > unrolled thread

python implementation of a new integer encoding algorithm.

Started byjanhein.vanderburg@gmail.com
First post2015-02-17 03:22 -0800
Last post2015-02-20 00:58 -0800
Articles 20 on this page of 53 — 15 participants

Back to article view | Back to comp.lang.python


Contents

  python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-17 03:22 -0800
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 00:16 +1100
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:55 -0800
        Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 20:36 +1100
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:29 -0800
        Re: python implementation of a new integer encoding algorithm. Laura Creighton <lac@openend.se> - 2015-02-18 11:32 +0100
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:48 -0800
        People hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.) Ben Finney <ben+python@benfinney.id.au> - 2015-02-18 21:57 +1100
    Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:12 -0500
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:59 -0800
        Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 11:46 -0500
          Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 17:30 +0000
            Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 18:12 +0000
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:55 -0800
            Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-18 23:54 +0200
              Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 00:08 +0200
              Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 22:58 +0000
            Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 17:19 -0500
              Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:45 -0800
                Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:04 -0700
                Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:16 -0700
                Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:24 -0500
                Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:34 +1100
                Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:32 -0700
                Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:41 -0500
                Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:46 -0500
                Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:49 +1100
        Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 17:00 +0000
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:34 +1100
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:04 -0800
        Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 08:54 -0500
          Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:52 -0800
        Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 01:16 +1100
    Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:50 -0500
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:58 +1100
    Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 10:18 -0500
    Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 02:25 +1100
    Re: python implementation of a new integer encoding algorithm. Paul Rubin <no.email@nospam.invalid> - 2015-02-17 08:43 -0800
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:06 -0800
    Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 08:44 +0100
      Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 08:06 +0000
        Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 10:36 +0200
          Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 09:33 +0000
          Re: python implementation of a new integer encoding algorithm. Terry Reedy <tjreedy@udel.edu> - 2015-02-19 14:50 -0500
            Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 21:55 +0200
      Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 19:36 +1100
      Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 10:42 +0100
      Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 10:28 +0000
      Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 14:27 +0100
    Re: python implementation of a new integer encoding algorithm. Jonas Wielicki <jonas@wielicki.name> - 2015-02-19 09:38 +0100
      Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:58 -0800
    Re: python implementation of a new integer encoding algorithm. Denis McMahon <denismfmcmahon@gmail.com> - 2015-02-20 02:46 +0000
      Re: python implementation of a new integer encoding algorithm. wxjmfauth@gmail.com - 2015-02-20 00:58 -0800

Page 1 of 3  [1] 2 3  Next page →


#85737 — python implementation of a new integer encoding algorithm.

Fromjanhein.vanderburg@gmail.com
Date2015-02-17 03:22 -0800
Subjectpython implementation of a new integer encoding algorithm.
Message-ID<e45a71b2-7ec0-4c7d-88ae-c48aebe154b7@googlegroups.com>
In http://optarbvalintenc.blogspot.nl/ I propose a new way to encode arbitrarily valued integers and the python code that can be used as a reference for practical implementations of codecs.

The encoding algorithm itself is optimized for transmission and storage requirements without any processor load consideration whatsoever.

The next step is the development of the python code that minimizes processor requirements without compromising the algorithm.

Is this the right forum to ask for help with this step or to obtain references to more suitable platforms?

Thanks in advance,
Jan-Hein.

[toc] | [next] | [standalone]


#85738

FromChris Angelico <rosuav@gmail.com>
Date2015-02-18 00:16 +1100
Message-ID<mailman.18783.1424178970.18130.python-list@python.org>
In reply to#85737
On Tue, Feb 17, 2015 at 10:22 PM,  <janhein.vanderburg@gmail.com> wrote:
> In http://optarbvalintenc.blogspot.nl/ I propose a new way to encode arbitrarily valued integers and the python code that can be used as a reference for practical implementations of codecs.
>
> The encoding algorithm itself is optimized for transmission and storage requirements without any processor load consideration whatsoever.
>
> The next step is the development of the python code that minimizes processor requirements without compromising the algorithm.
>
> Is this the right forum to ask for help with this step or to obtain references to more suitable platforms?

This is a fine forum to ask in. However, you may find that the advice
you get isn't quite what you were asking for. In my case, the advice
I'm offering is: Don't do this. In over 99% of cases, the benefit from
this kind of packed transmission format is negligible; it's much MUCH
better to make your protocol simple readable text. Have a look at
internet protocols like SMTP/POP/IMAP (the email trio), HTTP, FTP, and
so on; all of them follow a basic pattern of connecting to a
well-known port on a destination server, sending textual commands
terminated by end-of-line, and receiving textual responses terminated
by end-of-line. Most of the world wide web consists of HTTP requests
(possibly using SSL/TLS, but that doesn't change the protocol), but
even so, the advantage of packing the headers into a binary format
just isn't worth the cost of making everything harder to debug.

Because debugging is *hugely* easier when you have a text protocol.
All you need to do is telnet or netcat to the appropriate port, type
some commands, and eyeball the responses. I have a MUD client called
Gypsum [1] which does that, with a few extra features, including (like
netcat, but unlike telnet) listening on a port, so you can test a
client; I've used it and its predecessor for testing myriad networking
programs, both my own and other people's, to try to figure out what's
going on.

So if you want to develop a brand new protocol, here's what I'd suggest:

1) Mandate UTF-8 encoding everywhere
2) Stipulate \n as the end of line, and strip/ignore all \r found
3) Follow the basic model laid down by SMTP and POP: the server sends
a greeting on startup, then everything's done with simple commands and
responses. The server may also send unilateral messages, as long as
they're unambiguously detectable (eg if it's notifying you of
something that just happened).

Sure, you might be able to pack your integers into something more
compact than their decimal representations... but how much will you
really gain? Most of the internet works on the basis of packets, and
you'll find that the difference between a 200-byte packet and a
210-byte packet probably isn't even measurable; at very best, you
might see an advantage when you transfer a huge file as a coherent
blob, but that would mean dealing with a large number of these
packetized integers. In the meantime, you make your protocol fragile
and hard to read by eye, and you spend a lot of time developing your
protocol, instead of just blatting simple text down the wire. Take the
easy option; you can always make things more complicated later.

[1] https://github.com/Rosuav/Gypsum

ChrisA

[toc] | [prev] | [next] | [standalone]


#85776

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 00:55 -0800
Message-ID<85554c47-0957-4a5a-8ac7-513cd0f436ff@googlegroups.com>
In reply to#85738
On Tuesday, February 17, 2015 at 2:17:02 PM UTC+1, Chris Angelico wrote:
> This is a fine forum to ask in. However, you may find that the advice
> you get isn't quite what you were asking for. In my case, the advice
> I'm offering is: Don't do this.
Thanks Chris; let me explain why I want this.

As you rightly point out, human readable encoding enables debugging without the need for specialized tools that assemble/disassemble character streams.
I also agree that saving space in a partition of a character stream like a communication packet or persistent memory sector won't do much good if it does not save additional partitions.

The human centered encoding versus machine oriented encoding discussion dates from some decades ago now, and I am personally still not convinced that human readable text is all that sensible in most applications.
The main problem I see is that we are trying to make computers behave like human beings, and that introduces interpretation risks and resource requirements that should be avoided.
I'll be happy to discuss this subject again, but let's do that in a different thread.

The difference between two encoded stream lengths within a single partition may not be relevant, but the difference between needing 1 or 2 partitions is quite significant.
Especially in applications that depend on very limited bandwidth communication channels.
Since all messages in my applications are primarily composed of  integer values, finding the algorithm that fundamentally minimizes the number of bits to encode such a value will increase my chances to save partitions needed to encode the message, even without adding compression techniques at the message level.
Note that I don't assume any properties of the integer values to be encoded.

> Take the
> easy option; you can always make things more complicated later.
That makes sense alright.
No offense, but I still believe that human readable text encoding complicates things right now and shouldn't be tried until "my way" has proven unpractical in its first application.
Consider it to be a theoretical challenge: how do I find the general encoding of an arbitrary integer value that minimizes the number of bits needed and given that algorithm, find the python code that minimizes the processor load inflicted by the codec implementation.

[toc] | [prev] | [next] | [standalone]


#85780

FromChris Angelico <rosuav@gmail.com>
Date2015-02-18 20:36 +1100
Message-ID<mailman.18811.1424252180.18130.python-list@python.org>
In reply to#85776
On Wed, Feb 18, 2015 at 7:55 PM,  <janhein.vanderburg@gmail.com> wrote:
>> Take the
>> easy option; you can always make things more complicated later.
> That makes sense alright.
> No offense, but I still believe that human readable text encoding complicates things right now and shouldn't be tried until "my way" has proven unpractical in its first application.
> Consider it to be a theoretical challenge: how do I find the general encoding of an arbitrary integer value that minimizes the number of bits needed and given that algorithm, find the python code that minimizes the processor load inflicted by the codec implementation.
>

I would actually look at it the other way: a human-readable encoding
is the easy way (you can simply print() your numbers and int() them on
the way back in), and until you can prove that it's impractical, don't
write a single line of code towards this algorithm. But let's get some
figures, though: How many payload bits per byte can you achieve?
What's your average going to be? You have two baselines to beat: 4
bits per byte (hexadecimal), and 7 bits per byte (MIDI varlen, or
whatever you want to call it; I first met it with MIDI, but it's been
used in so many places that other people have other names for it).

ChrisA

[toc] | [prev] | [next] | [standalone]


#85811

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 11:29 -0800
Message-ID<a07f5f39-d08a-4eb2-9a03-d04b37f80063@googlegroups.com>
In reply to#85780
Op woensdag 18 februari 2015 10:36:37 UTC+1 schreef Chris Angelico:
> I would actually look at it the other way:
I'm aware of that, since you already warned me with "This is a fine forum to ask in. However, you may find that the advice you get isn't quite what you were asking for. In my case, ..."
Again: no offense, but I agree to disagree.

There will be no assumptions about the integers to be encoded.

Jan-Hein.

[toc] | [prev] | [next] | [standalone]


#85782

FromLaura Creighton <lac@openend.se>
Date2015-02-18 11:32 +0100
Message-ID<mailman.18813.1424255578.18130.python-list@python.org>
In reply to#85776
Hi Jan.

I'm an old fart.  In the late 1970s, when I started programming these
things, and memory was non-existant, we came up with all sorts of data
compression algorithms which were absolutely necessary to get any work
done whatsoever.  Should you ever need an assembler programmer for
quick and dirty hacks for the PDP-11 line (11/20 and 11/05 preferred
as it is harder) I am still the woman for the job.  Indeed, I spent
most of my 20s finding better and better ways to fit programs into
smaller and smaller memory footprints.

I perfectly understand the intellectual thrill of doing such things.
As puzzles go, it is about as cool a one as exists, and its all for
things that matter -- for real.

However, in the matter of financial compensation and world recognition,
you have just laid a very large goose-egg.   The VAX-11/780 was introduced
on October 25, 1977, according to wikipedia.  But in my world, it was 1982
before I got to see the first one.  And it was godly more expensive than
a pdp-11, but the writing was on the wall.  The thing could page, and
so all the techniques we learned for making our code concise -- let alone
the dirty tricks I specialised in -- were no longer relevant.

>From the mid 1980s onward I have been telling people 'your code is
ugly, please tighten it up by refactoring it <here> and <here>' and
when I am their instructor they grumble and do it, and otherwise they
flip me the bird.  In their eyes, it doesn't matter how the code
_looks_ as long as it does the job.  And I deeply sympathise.  But
what I am going for is not a 'death - by looking unfashionable' but
rather a demand that good code is clear to understand.  Because what
I have learned, that Brian Kernighan expressed a long time ago is
that:

	Debugging is twice as hard as writing the code in the
	first place. Therefore, if you write the code as cleverly
	as possible, you are, by definition, not smart enough to debug it.

Your proposed encoding scheme (if it does as you say, I have not
analysed it) scores very, very high in the _cleverness_ department.
Enough that a lot of people, who aren't as clever as you do, have
no hope in hell of ever being clever enough to debug something that
uses it.  Therefore, you will never see widespread adoption of your
scheme, no matter how brilliantly it does as you say, because we all
need things that are easier to debug more than we need better compression.

So now you are sad.  I was sad, too, but the sooner I learned this the
sooner I could stop wasting my time creating algorithms that provided
cool functionality that people hated for the same reasons I found them
cool.

You need to find a different sort of algorithm that people like to use
if you want to get widespread success in the world of widely used
algorithms.  If you have found a way to improve on Lemel-Ziv, then this
will count.

But it may be that your next step is 'how to encode things that
are not phonetic language'.  Go look -- for the next few months --
at how MIDI stores sounds.  You will find plenty of places for
improvement, but the idea is not to improve the standard but to learn
it well enough that you can see things in the non-alphabetic world.

So then, now what?

If you are still fired up with the desire to compress things, then
there is a huge, _very well paying_ market I want to introduce you
to.  And this is _tech support for porn sites_.  Porn sites make a
ton of money, indeed the numbers are scary.  And here the idea of
'I saved 2% of time/bandwidth/disk space/' really matters.  You
can really save money for them, and it really matters to them.
Since I have never found sex 'dirty' and indeed consider it one
of the great joys in life, I have never found anything wrong with
working for porn sites.

And, hell, out of male porn Jimmy Wales made wikipedia.  Haven't
we all benefitted?

But, right now, you are, alas for you, about 45 years too late for the
ideas you are sprouting.  I had similar ones about 30 years too late
and, well, they only worked for me for about 3-5 years.  Sucks to be
you, friend -- you needed to be your grandfather, I fear.


Laura Creighton

[toc] | [prev] | [next] | [standalone]


#85813

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 11:48 -0800
Message-ID<4efc991a-947f-415f-9999-2f490aef0927@googlegroups.com>
In reply to#85782
Op woensdag 18 februari 2015 11:33:18 UTC+1 schreef Laura Creighton:
> Hi Jan.
Hi Laura, thanks for your comments; let me explain my why:


> Should you ever need an assembler programmer for
> quick and dirty hacks for the PDP-11 line (11/20 and 11/05 preferred
> as it is harder) I am still the woman for the job.  Indeed, I spent
> most of my 20s finding better and better ways to fit programs into
> smaller and smaller memory footprints.
I once wrote a bootstrap loader for a PDP-11, but I forget when and what model exactly, for it was some decades ago and not all that spectacular.

> I perfectly understand the intellectual thrill of doing such things.
> As puzzles go, it is about as cool a one as exists, and its all for
> things that matter -- for real.
> 
> However, in the matter of financial compensation and world recognition,
> you have just laid a very large goose-egg.
I don't want financial compensation and don't need world recognition, so I will accept the egg.

> so all the techniques we learned for making our code concise -- let alone
> the dirty tricks I specialised in -- were no longer relevant.
If dirty tricks means shortcuts I personally wouldn't have excepted them even at that time of resource shortages.
I hate shortcuts.

> >From the mid 1980s onward I have been telling people 'your code is
> ugly, please tighten it up by refactoring it <here> and <here>' and
> when I am their instructor they grumble and do it, and otherwise they
> flip me the bird.  In their eyes, it doesn't matter how the code
> _looks_ as long as it does the job.
Do not get intimidated and simply refuse to follow the lemmings.

> what I am going for is not a 'death - by looking unfashionable' but
> rather a demand that good code is clear to understand.
Sorry that my code is all that confusing.
I have had several comparable complaints from other contributors to this discussion.
I will try to improve that part of my proposal, but don't hold your breath, for it may take some time.
In fact I was initially asking for help with that code, but I ended up defending my credentials.

> Your proposed encoding scheme (if it does as you say, I have not
> analysed it) scores very, very high in the _cleverness_ department.
> Enough that a lot of people, who aren't as clever as you do, have
> no hope in hell of ever being clever enough to debug something that
> uses it.  Therefore, you will never see widespread adoption of your
> scheme, no matter how brilliantly it does as you say, because we all
> need things that are easier to debug more than we need better compression.
I don't need widespread adaption outside my own applications, and I don't see any problem for a python programmer that uses an object that flawlessly encodes and decodes an arbitrary integer value.
To me debugging is not synonymous to making things human readable or even translatable.

So what's in it for other programmers then me myself and I?: the no brainer optimizing of resource requirements.

Compression is different from optimal individual integer encoding, since the first needs a context of surrounding  integers in a sequence of integer images.

> So now you are sad.
Not yet.

> So then, now what?
> 
> If you are still fired up with the desire to compress things, then
> there is a huge, _very well paying_ market I want to introduce you
> to.  And this is _tech support for porn sites_.  Porn sites make a
> ton of money, indeed the numbers are scary.  And here the idea of
> 'I saved 2% of time/bandwidth/disk space/' really matters.  You
> can really save money for them, and it really matters to them.
> Since I have never found sex 'dirty' and indeed consider it one
> of the great joys in life, I have never found anything wrong with
> working for porn sites.
Frankly a have a first application in mind that is less sexy and won't be all that profitable.
There is nothing wrong with porn indeed, but at our age it is just no longer on the top of one's mind every second is it?

> And, hell, out of male porn Jimmy Wales made wikipedia.  Haven't
> we all benefitted?
I really don't know what you are talking about here, and only interested in a reply if it seriously contributes to my  personal objective to get "my" way of integer encoding implemented as proposed.

> But, right now, you are, alas for you, about 45 years too late for the
> ideas you are sprouting.  I had similar ones about 30 years too late
> and, well, they only worked for me for about 3-5 years.  Sucks to be
> you, friend -- you needed to be your grandfather, I fear.

Thanks for making me feel so mature.
Let's go back to the future.

[toc] | [prev] | [next] | [standalone]


#85783 — People hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.)

FromBen Finney <ben+python@benfinney.id.au>
Date2015-02-18 21:57 +1100
SubjectPeople hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.)
Message-ID<mailman.18814.1424257067.18130.python-list@python.org>
In reply to#85776
Laura Creighton <lac@openend.se> writes:

> So now you are sad. I was sad, too, but the sooner I learned this the
> sooner I could stop wasting my time creating algorithms that provided
> cool functionality that people hated for the same reasons I found them
> cool.

+1 QotW

-- 
 \            “Human reason is snatching everything to itself, leaving |
  `\           nothing for faith.” —Bernard of Clairvaux, 1090–1153 CE |
_o__)                                                                  |
Ben Finney

[toc] | [prev] | [next] | [standalone]


#85739

FromDave Angel <davea@davea.name>
Date2015-02-17 09:12 -0500
Message-ID<mailman.18784.1424182393.18130.python-list@python.org>
In reply to#85737
On 02/17/2015 06:22 AM, janhein.vanderburg@gmail.com wrote:
> In http://optarbvalintenc.blogspot.nl/ I propose a new way to encode arbitrarily valued integers and the python code that can be used as a reference for practical implementations of codecs.
>
> The encoding algorithm itself is optimized for transmission and storage requirements without any processor load consideration whatsoever.
>
> The next step is the development of the python code that minimizes processor requirements without compromising the algorithm.
>
> Is this the right forum to ask for help with this step or to obtain references to more suitable platforms?
>

This is a fine forum for such a discussion.  I for one would love to 
participate.  However, note that it isn't necessary true that "the 
smaller the better" is a good algorithm.  In context, there are 
frequently a number of tradeoffs, even ignoring processor time (as you 
imply).

Many years ago I worked on some code that manipulated the intermediate 
files for a compiler.  I had no say in the format, I just had to deal 
with it.

They had a field type called a "compressed integer."  It could vary 
between one byte and I think about six.  And the theory was that it took 
less space than the equivalent format of fixed size integers.  The catch 
from my point was that these integers could appear in the middle of a 
struct, and thus access to the later fields of the struct required a 
dynamic calculation.  This put a huge onus on my code to read or write 
the data serially.  I ended up solving it by writing code that generated 
40 thousand lines of C++ header and body code, so that the rest of the 
code didn't care.

Was it worth it?  To reduce the size of some files that only lived a few 
seconds on disk?  I seriously doubt it.  But I learned a lot in the process.

On another project, the goal was to be able to recover data from 
archives in spite of physical damage to some files.  So I had to add 
selective redundancy for that.  In the process, I also compress the 
data, but confine the compression algorithm to relatively small pieces 
of data, and label those pieces independently, so that any single 
decompression error affects only one unit of data.

So going back to your problem, and assuming that the other issues are 
moot, what's your proposal?  Are you compressing relative to a straight 
binary form of storage?  Are you assuming anything about the relative 
likelihood of various values of integers?  Do you provide anything to 
allow for the possibility that your prediction for probability 
distribution isn't met?

-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85777

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 00:59 -0800
Message-ID<515047c1-a20d-430e-a029-1c2d77db2f1a@googlegroups.com>
In reply to#85739
On Tuesday, February 17, 2015 at 3:13:41 PM UTC+1, Dave Angel wrote:
> This is a fine forum for such a discussion.  I for one would love to 
> participate.  However, note that it isn't necessary true that "the 
> smaller the better" is a good algorithm.  In context, there are 
> frequently a number of tradeoffs, even ignoring processor time (as you 
> imply).
Thanks Dave; about those trade offs:

I agree that allowing variable length encoding in general forces the programmer to process records in a file sequentially.
Random access to individual data items in such a stream is simply not an option, unless all items have been indexed in a directory that accompanies the data character stream.
So if you need random access to individual items put limits to each individual data item.
My applications do not allow me to do that, because I would always be to restrictive and be caught by the "640k should be enough for any programmer" trap.

> So going back to your problem, and assuming that the other issues are 
> moot, what's your proposal?  Are you compressing relative to a straight 
> binary form of storage?  Are you assuming anything about the relative 
> likelihood of various values of integers?  Do you provide anything to 
> allow for the possibility that your prediction for probability 
> distribution isn't met?

I'm not compressing sequences of integer encodings but encoding individual integers optimally without any assumptions about their values.

[toc] | [prev] | [next] | [standalone]


#85798

FromDave Angel <davea@davea.name>
Date2015-02-18 11:46 -0500
Message-ID<mailman.18823.1424278025.18130.python-list@python.org>
In reply to#85777
On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:


> encoding individual integers optimally without any assumptions about their values.
>

Contradiction in terms.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85802

FromGrant Edwards <invalid@invalid.invalid>
Date2015-02-18 17:30 +0000
Message-ID<mc2i7g$eof$2@reader1.panix.com>
In reply to#85798
On 2015-02-18, Dave Angel <davea@davea.name> wrote:
> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>
>
>> encoding individual integers optimally without any assumptions about
>> their values.
>
> Contradiction in terms.

Ah, that depends on whether the encoding has to be lossless or not.

For example:

  def encode_integer(i):
    return 0;

Voila! Optimal encoding with no assumptions about intut values.

It is, however, a bit lossy.
    
-- 
Grant Edwards               grant.b.edwards        Yow! Did you move a lot of
                                  at               KOREAN STEAK KNIVES this
                              gmail.com            trip, Dingy?

[toc] | [prev] | [next] | [standalone]


#85804

FromMark Lawrence <breamoreboy@yahoo.co.uk>
Date2015-02-18 18:12 +0000
Message-ID<mailman.18827.1424283186.18130.python-list@python.org>
In reply to#85802
On 18/02/2015 17:30, Grant Edwards wrote:
> On 2015-02-18, Dave Angel <davea@davea.name> wrote:
>> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>>
>>
>>> encoding individual integers optimally without any assumptions about
>>> their values.
>>
>> Contradiction in terms.
>
> Ah, that depends on whether the encoding has to be lossless or not.
>
> For example:
>
>    def encode_integer(i):
>      return 0;
>
> Voila! Optimal encoding with no assumptions about intut values.
>
> It is, however, a bit lossy.
>

Is it this kind of coding that makes the PEP 393 FSR so poor?

-- 
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.

Mark Lawrence

[toc] | [prev] | [next] | [standalone]


#85815

Fromjanhein.vanderburg@gmail.com
Date2015-02-18 11:55 -0800
Message-ID<2a717ffb-d61d-4407-9082-1c17cd7ee573@googlegroups.com>
In reply to#85798
Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel:
> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
> 
> 
> > encoding individual integers optimally without any assumptions about their values.
> >
> 
> Contradiction in terms.
> 
> -- 
> DaveA

Not.
Jan-Hein.

[toc] | [prev] | [next] | [standalone]


#85820

FromMarko Rauhamaa <marko@pacujo.net>
Date2015-02-18 23:54 +0200
Message-ID<87fva2ophe.fsf@elektro.pacujo.net>
In reply to#85815
janhein.vanderburg@gmail.com:

> Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel:
>> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>> > encoding individual integers optimally without any assumptions
>> > about their values.
>> 
>> Contradiction in terms.
>
> Not.

Out of curiosity, could you give me an example of an integer, not
assuming anything about its value.

I mean, any integer you could mention would be very close to zero
compared with virtually all other integers.


Marko

[toc] | [prev] | [next] | [standalone]


#85824

FromMarko Rauhamaa <marko@pacujo.net>
Date2015-02-19 00:08 +0200
Message-ID<87bnkqooth.fsf@elektro.pacujo.net>
In reply to#85820
Marko Rauhamaa <marko@pacujo.net>:

> Out of curiosity, could you give me an example of an integer, not
> assuming anything about its value.
>
> I mean, any integer you could mention would be very close to zero
> compared with virtually all other integers.

And I don't mean to be snide, either. I'm just saying that your
"optimal encoding" necessarily depends on the expected distribution of
the integers. There is no such thing as an even distribution of
integers.


Marko

[toc] | [prev] | [next] | [standalone]


#85831

FromGrant Edwards <invalid@invalid.invalid>
Date2015-02-18 22:58 +0000
Message-ID<mc35f7$hqo$1@reader1.panix.com>
In reply to#85820
On 2015-02-18, Marko Rauhamaa <marko@pacujo.net> wrote:
> janhein.vanderburg@gmail.com:
>
>> Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel:
>>> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>>> > encoding individual integers optimally without any assumptions
>>> > about their values.
>>> 
>>> Contradiction in terms.
>>
>> Not.
>
> Out of curiosity, could you give me an example of an integer, not
> assuming anything about its value.
>
> I mean, any integer you could mention would be very close to zero
> compared with virtually all other integers.

Thus demonstrating the fitness of my previously posted "lossy integer
encoding" algorithm.

-- 
Grant Edwards               grant.b.edwards        Yow! Hmmm ... an arrogant
                                  at               bouquet with a subtle
                              gmail.com            suggestion of POLYVINYL
                                                   CHLORIDE ...

[toc] | [prev] | [next] | [standalone]


#85827

FromDave Angel <davea@davea.name>
Date2015-02-18 17:19 -0500
Message-ID<mailman.18839.1424298000.18130.python-list@python.org>
In reply to#85815
On 02/18/2015 02:55 PM, janhein.vanderburg@gmail.com wrote:
> Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel:
>> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>>
>>
>>> encoding individual integers optimally without any assumptions about their values.
>>>
>>
>> Contradiction in terms.
>>
>> --
>> DaveA
>
> Not.
> Jan-Hein.
>

Then you had better define your new word "optimal" to us.  I decided to 
try your algorithm for all the values between 0 and 999999.  One million 
values, and the 7bit encoding[1] beat yours for 950081 of them.  The 
rest were tied.  Yours never was shorter.

For a uniform distribution of those particular values, the average 
number of bytes used by yours was 3.933568 bytes, and by 7bit encoding 
was 2.983487

For the second and third million, yours are all 4 bytes, while 7bit uses 
3. Beyond 2097152, 7bit uses 4 bytes, same as you.  Between 16 and 17 
million, you average 4.156865, while 7bit is a constant 4.0.

After that, I started spot-checking.  I went up to 100 billion, and for 
none of those I tried did your algorithm take fewer bytes than 7bit.


So how is yours optimal?  Over what range of values?

I'm not necessarily doubting it, just challenging you to provide a data 
sample that actually shows it.  And of course, I'm not claiming that 
7bit is in any way optimal.  You cannot define optimal without first 
defining the distribution.

[1] by 7bit, I'm referring to the one apparently used in MIDI encoding, 
where 7 bits of each byte hold the value, and the 8th bit is zero, 
except for the last byte, where the 8th bit is one.  So 3 bytes can 
encode 21 bits, or up to 2**21 - 1.

-- 
DaveA

[toc] | [prev] | [next] | [standalone]


#85910

Fromjanhein.vanderburg@gmail.com
Date2015-02-19 07:45 -0800
Message-ID<993f64c3-fc85-48a7-9d02-a4f12ecb33c6@googlegroups.com>
In reply to#85827
On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote:
> I'm not necessarily doubting it, just challenging you to provide a data 
> sample that actually shows it.  And of course, I'm not claiming that 
> 7bit is in any way optimal.  You cannot define optimal without first 
> defining the distribution.

Weird results.
For a character size 2 the growth processes are shown below.
I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length:
0:  00				00				0
1:  01				01				0
2:  10, 00			10, 00				0
3:  10, 01			10, 01				0
4:  10, 10			11, 00				0
5:  10, 11			11, 01				0
6:  11, 00.00			11, 10, 00			0
7:  11, 00.01			11, 10, 01			0
8:  11, 00.10			11, 11, 00			0
9:  11, 00.11			11, 11, 01			0
10: 11, 01.00			11, 11, 10, 00			1
11: 11, 01.01			11, 11, 10, 01			1
12: 11, 01.10			11, 11, 11, 00			1
13: 11, 01.11			11, 11, 11, 01			1
14: 11, 10.00, 00		11, 11, 11, 10, 00		1
15: 11, 10.00, 01		11, 11, 11, 10, 01		1
16: 11, 10.00, 10		11, 11, 11, 11, 00		1
17: 11, 10.00, 11		11, 11, 11, 11, 01		1
18: 11, 10.01, 00.00		11, 11, 11, 11, 10, 00		1
19: 11, 10.01, 00.01		11, 11, 11, 11, 10, 01		1
20: 11, 10.01, 00.10		11, 11, 11, 11, 11, 00		1
21: 11, 10.01, 00.11		11, 11, 11, 11, 11, 01		1
22: 11, 10.01, 01.00		11, 11, 11, 11, 11, 10, 00	2
23: 11, 10.01, 01.01		11, 11, 11, 11, 11, 10, 01	2
24: 11, 10.01, 01.10		11, 11, 11, 11, 11, 11, 00	2
25: 11, 10.01, 01.11		11, 11, 11, 11, 11, 11, 01	2
26: 11, 10.01, 10.00		11, 11, 11, 11, 11, 11, 10, 00	3

I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding.

[toc] | [prev] | [next] | [standalone]


#85915

FromIan Kelly <ian.g.kelly@gmail.com>
Date2015-02-19 11:04 -0700
Message-ID<mailman.18892.1424369100.18130.python-list@python.org>
In reply to#85910
On Thu, Feb 19, 2015 at 8:45 AM,  <janhein.vanderburg@gmail.com> wrote:
> On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote:
>> I'm not necessarily doubting it, just challenging you to provide a data
>> sample that actually shows it.  And of course, I'm not claiming that
>> 7bit is in any way optimal.  You cannot define optimal without first
>> defining the distribution.
>
> Weird results.
> For a character size 2 the growth processes are shown below.
> I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length:
> 0:  00                          00                              0
> 1:  01                          01                              0
> 2:  10, 00                      10, 00                          0
> 3:  10, 01                      10, 01                          0
> 4:  10, 10                      11, 00                          0
> 5:  10, 11                      11, 01                          0
> 6:  11, 00.00                   11, 10, 00                      0
> 7:  11, 00.01                   11, 10, 01                      0
> 8:  11, 00.10                   11, 11, 00                      0
> 9:  11, 00.11                   11, 11, 01                      0
> 10: 11, 01.00                   11, 11, 10, 00                  1
> 11: 11, 01.01                   11, 11, 10, 01                  1
> 12: 11, 01.10                   11, 11, 11, 00                  1
> 13: 11, 01.11                   11, 11, 11, 01                  1
> 14: 11, 10.00, 00               11, 11, 11, 10, 00              1
> 15: 11, 10.00, 01               11, 11, 11, 10, 01              1
> 16: 11, 10.00, 10               11, 11, 11, 11, 00              1
> 17: 11, 10.00, 11               11, 11, 11, 11, 01              1
> 18: 11, 10.01, 00.00            11, 11, 11, 11, 10, 00          1
> 19: 11, 10.01, 00.01            11, 11, 11, 11, 10, 01          1
> 20: 11, 10.01, 00.10            11, 11, 11, 11, 11, 00          1
> 21: 11, 10.01, 00.11            11, 11, 11, 11, 11, 01          1
> 22: 11, 10.01, 01.00            11, 11, 11, 11, 11, 10, 00      2
> 23: 11, 10.01, 01.01            11, 11, 11, 11, 11, 10, 01      2
> 24: 11, 10.01, 01.10            11, 11, 11, 11, 11, 11, 00      2
> 25: 11, 10.01, 01.11            11, 11, 11, 11, 11, 11, 01      2
> 26: 11, 10.01, 10.00            11, 11, 11, 11, 11, 11, 10, 00  3
>
> I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding.

That stop-bit variant looks extremely inefficient (and wrong) to me.
First, 2 bits per group is probably a bad choice for a stop-bit
encoding. It saves some space for very small integers, but it won't
scale well at all. Fully half of the bits are stop bits! Secondly, I
don't understand why the leading groups are all 11s and only the later
groups introduce variability. In fact, that's practically a unary
encoding with just a small amount of binary at the end. This is what I
would expect a 2-bit stop-bit encoding to look like:

0: 00
1: 01
2: 11, 00
3: 11, 01
4: 11, 10, 00
5: 11, 10, 01
6: 11, 11, 00
7: 11, 11, 01
8: 11, 10, 10, 00
9: 11, 10, 10, 01
10: 11, 10, 11, 00
11: 11, 10, 11, 01
12: 11, 11, 10, 00
13: 11, 11, 10, 01
14: 11, 11, 11, 00
15: 11, 11, 11, 01
16: 11, 10, 10, 10, 00
17: 11, 10, 10, 10, 01
18: 11, 10, 10, 11, 00
19: 11, 10, 10, 11, 01
20: 11, 10, 11, 10, 00
21: 11, 10, 11, 10, 01
22: 11, 10, 11, 11, 00
23: 11, 10, 11, 11, 01
24: 11, 11, 10, 10, 00
25: 11, 11, 10, 10, 01
26: 11, 11, 10, 11, 00
27: 11, 11, 10, 11, 01
28: 11, 11, 11, 10, 00
29: 11, 11, 11, 10, 01
30: 11, 11, 11, 11, 00
31: 11, 11, 11, 11, 01
etc.

Notice that the size grows as O(log n), not O(n) as above. Notice also
that the only values here for which this saves space over the 7-bit
version are 0-7. Unless you expect those values to be very common, the
7-bit encoding that needs only one byte all the way up to 127 makes a
lot of sense.

There's also an optimization that can be added here if we wish to
inject a bit of cleverness. Notice that all values with more than one
group start with 11, never 10. We can borrow a trick from IEEE
floating point and make the leading 1 bit of the mantissa implicit for
all values greater than 3 (we can't do it for 2 and 3 because then we
couldn't distinguish them from 0 and 1). Applying this optimization
removes one full group from the representation of all values greater
than 3, which appears to make the stop-bit representation as short as
or shorter than the "difficult" one for all the values that have been
enumerated above.

[toc] | [prev] | [next] | [standalone]


Page 1 of 3  [1] 2 3  Next page →

Back to top | Article view | comp.lang.python


csiph-web