Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #85737 > unrolled thread
| Started by | janhein.vanderburg@gmail.com |
|---|---|
| First post | 2015-02-17 03:22 -0800 |
| Last post | 2015-02-20 00:58 -0800 |
| Articles | 20 on this page of 53 — 15 participants |
Back to article view | Back to comp.lang.python
python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-17 03:22 -0800
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 00:16 +1100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:55 -0800
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 20:36 +1100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:29 -0800
Re: python implementation of a new integer encoding algorithm. Laura Creighton <lac@openend.se> - 2015-02-18 11:32 +0100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:48 -0800
People hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.) Ben Finney <ben+python@benfinney.id.au> - 2015-02-18 21:57 +1100
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:12 -0500
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 00:59 -0800
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 11:46 -0500
Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 17:30 +0000
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 18:12 +0000
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:55 -0800
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-18 23:54 +0200
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 00:08 +0200
Re: python implementation of a new integer encoding algorithm. Grant Edwards <invalid@invalid.invalid> - 2015-02-18 22:58 +0000
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 17:19 -0500
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:45 -0800
Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:04 -0700
Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:16 -0700
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:24 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:34 +1100
Re: python implementation of a new integer encoding algorithm. Ian Kelly <ian.g.kelly@gmail.com> - 2015-02-19 11:32 -0700
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:41 -0500
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-19 13:46 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-20 05:49 +1100
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-18 17:00 +0000
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:34 +1100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:04 -0800
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-18 08:54 -0500
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 11:52 -0800
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 01:16 +1100
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 09:50 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 01:58 +1100
Re: python implementation of a new integer encoding algorithm. Dave Angel <davea@davea.name> - 2015-02-17 10:18 -0500
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-18 02:25 +1100
Re: python implementation of a new integer encoding algorithm. Paul Rubin <no.email@nospam.invalid> - 2015-02-17 08:43 -0800
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-18 01:06 -0800
Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 08:44 +0100
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 08:06 +0000
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 10:36 +0200
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 09:33 +0000
Re: python implementation of a new integer encoding algorithm. Terry Reedy <tjreedy@udel.edu> - 2015-02-19 14:50 -0500
Re: python implementation of a new integer encoding algorithm. Marko Rauhamaa <marko@pacujo.net> - 2015-02-19 21:55 +0200
Re: python implementation of a new integer encoding algorithm. Chris Angelico <rosuav@gmail.com> - 2015-02-19 19:36 +1100
Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 10:42 +0100
Re: python implementation of a new integer encoding algorithm. Mark Lawrence <breamoreboy@yahoo.co.uk> - 2015-02-19 10:28 +0000
Re: python implementation of a new integer encoding algorithm. Mario Figueiredo <marfig@gmail.com> - 2015-02-19 14:27 +0100
Re: python implementation of a new integer encoding algorithm. Jonas Wielicki <jonas@wielicki.name> - 2015-02-19 09:38 +0100
Re: python implementation of a new integer encoding algorithm. janhein.vanderburg@gmail.com - 2015-02-19 07:58 -0800
Re: python implementation of a new integer encoding algorithm. Denis McMahon <denismfmcmahon@gmail.com> - 2015-02-20 02:46 +0000
Re: python implementation of a new integer encoding algorithm. wxjmfauth@gmail.com - 2015-02-20 00:58 -0800
Page 1 of 3 [1] 2 3 Next page →
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-17 03:22 -0800 |
| Subject | python implementation of a new integer encoding algorithm. |
| Message-ID | <e45a71b2-7ec0-4c7d-88ae-c48aebe154b7@googlegroups.com> |
In http://optarbvalintenc.blogspot.nl/ I propose a new way to encode arbitrarily valued integers and the python code that can be used as a reference for practical implementations of codecs. The encoding algorithm itself is optimized for transmission and storage requirements without any processor load consideration whatsoever. The next step is the development of the python code that minimizes processor requirements without compromising the algorithm. Is this the right forum to ask for help with this step or to obtain references to more suitable platforms? Thanks in advance, Jan-Hein.
[toc] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-18 00:16 +1100 |
| Message-ID | <mailman.18783.1424178970.18130.python-list@python.org> |
| In reply to | #85737 |
On Tue, Feb 17, 2015 at 10:22 PM, <janhein.vanderburg@gmail.com> wrote: > In http://optarbvalintenc.blogspot.nl/ I propose a new way to encode arbitrarily valued integers and the python code that can be used as a reference for practical implementations of codecs. > > The encoding algorithm itself is optimized for transmission and storage requirements without any processor load consideration whatsoever. > > The next step is the development of the python code that minimizes processor requirements without compromising the algorithm. > > Is this the right forum to ask for help with this step or to obtain references to more suitable platforms? This is a fine forum to ask in. However, you may find that the advice you get isn't quite what you were asking for. In my case, the advice I'm offering is: Don't do this. In over 99% of cases, the benefit from this kind of packed transmission format is negligible; it's much MUCH better to make your protocol simple readable text. Have a look at internet protocols like SMTP/POP/IMAP (the email trio), HTTP, FTP, and so on; all of them follow a basic pattern of connecting to a well-known port on a destination server, sending textual commands terminated by end-of-line, and receiving textual responses terminated by end-of-line. Most of the world wide web consists of HTTP requests (possibly using SSL/TLS, but that doesn't change the protocol), but even so, the advantage of packing the headers into a binary format just isn't worth the cost of making everything harder to debug. Because debugging is *hugely* easier when you have a text protocol. All you need to do is telnet or netcat to the appropriate port, type some commands, and eyeball the responses. I have a MUD client called Gypsum [1] which does that, with a few extra features, including (like netcat, but unlike telnet) listening on a port, so you can test a client; I've used it and its predecessor for testing myriad networking programs, both my own and other people's, to try to figure out what's going on. So if you want to develop a brand new protocol, here's what I'd suggest: 1) Mandate UTF-8 encoding everywhere 2) Stipulate \n as the end of line, and strip/ignore all \r found 3) Follow the basic model laid down by SMTP and POP: the server sends a greeting on startup, then everything's done with simple commands and responses. The server may also send unilateral messages, as long as they're unambiguously detectable (eg if it's notifying you of something that just happened). Sure, you might be able to pack your integers into something more compact than their decimal representations... but how much will you really gain? Most of the internet works on the basis of packets, and you'll find that the difference between a 200-byte packet and a 210-byte packet probably isn't even measurable; at very best, you might see an advantage when you transfer a huge file as a coherent blob, but that would mean dealing with a large number of these packetized integers. In the meantime, you make your protocol fragile and hard to read by eye, and you spend a lot of time developing your protocol, instead of just blatting simple text down the wire. Take the easy option; you can always make things more complicated later. [1] https://github.com/Rosuav/Gypsum ChrisA
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 00:55 -0800 |
| Message-ID | <85554c47-0957-4a5a-8ac7-513cd0f436ff@googlegroups.com> |
| In reply to | #85738 |
On Tuesday, February 17, 2015 at 2:17:02 PM UTC+1, Chris Angelico wrote: > This is a fine forum to ask in. However, you may find that the advice > you get isn't quite what you were asking for. In my case, the advice > I'm offering is: Don't do this. Thanks Chris; let me explain why I want this. As you rightly point out, human readable encoding enables debugging without the need for specialized tools that assemble/disassemble character streams. I also agree that saving space in a partition of a character stream like a communication packet or persistent memory sector won't do much good if it does not save additional partitions. The human centered encoding versus machine oriented encoding discussion dates from some decades ago now, and I am personally still not convinced that human readable text is all that sensible in most applications. The main problem I see is that we are trying to make computers behave like human beings, and that introduces interpretation risks and resource requirements that should be avoided. I'll be happy to discuss this subject again, but let's do that in a different thread. The difference between two encoded stream lengths within a single partition may not be relevant, but the difference between needing 1 or 2 partitions is quite significant. Especially in applications that depend on very limited bandwidth communication channels. Since all messages in my applications are primarily composed of integer values, finding the algorithm that fundamentally minimizes the number of bits to encode such a value will increase my chances to save partitions needed to encode the message, even without adding compression techniques at the message level. Note that I don't assume any properties of the integer values to be encoded. > Take the > easy option; you can always make things more complicated later. That makes sense alright. No offense, but I still believe that human readable text encoding complicates things right now and shouldn't be tried until "my way" has proven unpractical in its first application. Consider it to be a theoretical challenge: how do I find the general encoding of an arbitrary integer value that minimizes the number of bits needed and given that algorithm, find the python code that minimizes the processor load inflicted by the codec implementation.
[toc] | [prev] | [next] | [standalone]
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-02-18 20:36 +1100 |
| Message-ID | <mailman.18811.1424252180.18130.python-list@python.org> |
| In reply to | #85776 |
On Wed, Feb 18, 2015 at 7:55 PM, <janhein.vanderburg@gmail.com> wrote: >> Take the >> easy option; you can always make things more complicated later. > That makes sense alright. > No offense, but I still believe that human readable text encoding complicates things right now and shouldn't be tried until "my way" has proven unpractical in its first application. > Consider it to be a theoretical challenge: how do I find the general encoding of an arbitrary integer value that minimizes the number of bits needed and given that algorithm, find the python code that minimizes the processor load inflicted by the codec implementation. > I would actually look at it the other way: a human-readable encoding is the easy way (you can simply print() your numbers and int() them on the way back in), and until you can prove that it's impractical, don't write a single line of code towards this algorithm. But let's get some figures, though: How many payload bits per byte can you achieve? What's your average going to be? You have two baselines to beat: 4 bits per byte (hexadecimal), and 7 bits per byte (MIDI varlen, or whatever you want to call it; I first met it with MIDI, but it's been used in so many places that other people have other names for it). ChrisA
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 11:29 -0800 |
| Message-ID | <a07f5f39-d08a-4eb2-9a03-d04b37f80063@googlegroups.com> |
| In reply to | #85780 |
Op woensdag 18 februari 2015 10:36:37 UTC+1 schreef Chris Angelico: > I would actually look at it the other way: I'm aware of that, since you already warned me with "This is a fine forum to ask in. However, you may find that the advice you get isn't quite what you were asking for. In my case, ..." Again: no offense, but I agree to disagree. There will be no assumptions about the integers to be encoded. Jan-Hein.
[toc] | [prev] | [next] | [standalone]
| From | Laura Creighton <lac@openend.se> |
|---|---|
| Date | 2015-02-18 11:32 +0100 |
| Message-ID | <mailman.18813.1424255578.18130.python-list@python.org> |
| In reply to | #85776 |
Hi Jan. I'm an old fart. In the late 1970s, when I started programming these things, and memory was non-existant, we came up with all sorts of data compression algorithms which were absolutely necessary to get any work done whatsoever. Should you ever need an assembler programmer for quick and dirty hacks for the PDP-11 line (11/20 and 11/05 preferred as it is harder) I am still the woman for the job. Indeed, I spent most of my 20s finding better and better ways to fit programs into smaller and smaller memory footprints. I perfectly understand the intellectual thrill of doing such things. As puzzles go, it is about as cool a one as exists, and its all for things that matter -- for real. However, in the matter of financial compensation and world recognition, you have just laid a very large goose-egg. The VAX-11/780 was introduced on October 25, 1977, according to wikipedia. But in my world, it was 1982 before I got to see the first one. And it was godly more expensive than a pdp-11, but the writing was on the wall. The thing could page, and so all the techniques we learned for making our code concise -- let alone the dirty tricks I specialised in -- were no longer relevant. >From the mid 1980s onward I have been telling people 'your code is ugly, please tighten it up by refactoring it <here> and <here>' and when I am their instructor they grumble and do it, and otherwise they flip me the bird. In their eyes, it doesn't matter how the code _looks_ as long as it does the job. And I deeply sympathise. But what I am going for is not a 'death - by looking unfashionable' but rather a demand that good code is clear to understand. Because what I have learned, that Brian Kernighan expressed a long time ago is that: Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. Your proposed encoding scheme (if it does as you say, I have not analysed it) scores very, very high in the _cleverness_ department. Enough that a lot of people, who aren't as clever as you do, have no hope in hell of ever being clever enough to debug something that uses it. Therefore, you will never see widespread adoption of your scheme, no matter how brilliantly it does as you say, because we all need things that are easier to debug more than we need better compression. So now you are sad. I was sad, too, but the sooner I learned this the sooner I could stop wasting my time creating algorithms that provided cool functionality that people hated for the same reasons I found them cool. You need to find a different sort of algorithm that people like to use if you want to get widespread success in the world of widely used algorithms. If you have found a way to improve on Lemel-Ziv, then this will count. But it may be that your next step is 'how to encode things that are not phonetic language'. Go look -- for the next few months -- at how MIDI stores sounds. You will find plenty of places for improvement, but the idea is not to improve the standard but to learn it well enough that you can see things in the non-alphabetic world. So then, now what? If you are still fired up with the desire to compress things, then there is a huge, _very well paying_ market I want to introduce you to. And this is _tech support for porn sites_. Porn sites make a ton of money, indeed the numbers are scary. And here the idea of 'I saved 2% of time/bandwidth/disk space/' really matters. You can really save money for them, and it really matters to them. Since I have never found sex 'dirty' and indeed consider it one of the great joys in life, I have never found anything wrong with working for porn sites. And, hell, out of male porn Jimmy Wales made wikipedia. Haven't we all benefitted? But, right now, you are, alas for you, about 45 years too late for the ideas you are sprouting. I had similar ones about 30 years too late and, well, they only worked for me for about 3-5 years. Sucks to be you, friend -- you needed to be your grandfather, I fear. Laura Creighton
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 11:48 -0800 |
| Message-ID | <4efc991a-947f-415f-9999-2f490aef0927@googlegroups.com> |
| In reply to | #85782 |
Op woensdag 18 februari 2015 11:33:18 UTC+1 schreef Laura Creighton: > Hi Jan. Hi Laura, thanks for your comments; let me explain my why: > Should you ever need an assembler programmer for > quick and dirty hacks for the PDP-11 line (11/20 and 11/05 preferred > as it is harder) I am still the woman for the job. Indeed, I spent > most of my 20s finding better and better ways to fit programs into > smaller and smaller memory footprints. I once wrote a bootstrap loader for a PDP-11, but I forget when and what model exactly, for it was some decades ago and not all that spectacular. > I perfectly understand the intellectual thrill of doing such things. > As puzzles go, it is about as cool a one as exists, and its all for > things that matter -- for real. > > However, in the matter of financial compensation and world recognition, > you have just laid a very large goose-egg. I don't want financial compensation and don't need world recognition, so I will accept the egg. > so all the techniques we learned for making our code concise -- let alone > the dirty tricks I specialised in -- were no longer relevant. If dirty tricks means shortcuts I personally wouldn't have excepted them even at that time of resource shortages. I hate shortcuts. > >From the mid 1980s onward I have been telling people 'your code is > ugly, please tighten it up by refactoring it <here> and <here>' and > when I am their instructor they grumble and do it, and otherwise they > flip me the bird. In their eyes, it doesn't matter how the code > _looks_ as long as it does the job. Do not get intimidated and simply refuse to follow the lemmings. > what I am going for is not a 'death - by looking unfashionable' but > rather a demand that good code is clear to understand. Sorry that my code is all that confusing. I have had several comparable complaints from other contributors to this discussion. I will try to improve that part of my proposal, but don't hold your breath, for it may take some time. In fact I was initially asking for help with that code, but I ended up defending my credentials. > Your proposed encoding scheme (if it does as you say, I have not > analysed it) scores very, very high in the _cleverness_ department. > Enough that a lot of people, who aren't as clever as you do, have > no hope in hell of ever being clever enough to debug something that > uses it. Therefore, you will never see widespread adoption of your > scheme, no matter how brilliantly it does as you say, because we all > need things that are easier to debug more than we need better compression. I don't need widespread adaption outside my own applications, and I don't see any problem for a python programmer that uses an object that flawlessly encodes and decodes an arbitrary integer value. To me debugging is not synonymous to making things human readable or even translatable. So what's in it for other programmers then me myself and I?: the no brainer optimizing of resource requirements. Compression is different from optimal individual integer encoding, since the first needs a context of surrounding integers in a sequence of integer images. > So now you are sad. Not yet. > So then, now what? > > If you are still fired up with the desire to compress things, then > there is a huge, _very well paying_ market I want to introduce you > to. And this is _tech support for porn sites_. Porn sites make a > ton of money, indeed the numbers are scary. And here the idea of > 'I saved 2% of time/bandwidth/disk space/' really matters. You > can really save money for them, and it really matters to them. > Since I have never found sex 'dirty' and indeed consider it one > of the great joys in life, I have never found anything wrong with > working for porn sites. Frankly a have a first application in mind that is less sexy and won't be all that profitable. There is nothing wrong with porn indeed, but at our age it is just no longer on the top of one's mind every second is it? > And, hell, out of male porn Jimmy Wales made wikipedia. Haven't > we all benefitted? I really don't know what you are talking about here, and only interested in a reply if it seriously contributes to my personal objective to get "my" way of integer encoding implemented as proposed. > But, right now, you are, alas for you, about 45 years too late for the > ideas you are sprouting. I had similar ones about 30 years too late > and, well, they only worked for me for about 3-5 years. Sucks to be > you, friend -- you needed to be your grandfather, I fear. Thanks for making me feel so mature. Let's go back to the future.
[toc] | [prev] | [next] | [standalone]
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2015-02-18 21:57 +1100 |
| Subject | People hated it for the same reasons I found them cool (was: python implementation of a new integer encoding algorithm.) |
| Message-ID | <mailman.18814.1424257067.18130.python-list@python.org> |
| In reply to | #85776 |
Laura Creighton <lac@openend.se> writes: > So now you are sad. I was sad, too, but the sooner I learned this the > sooner I could stop wasting my time creating algorithms that provided > cool functionality that people hated for the same reasons I found them > cool. +1 QotW -- \ “Human reason is snatching everything to itself, leaving | `\ nothing for faith.” —Bernard of Clairvaux, 1090–1153 CE | _o__) | Ben Finney
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-17 09:12 -0500 |
| Message-ID | <mailman.18784.1424182393.18130.python-list@python.org> |
| In reply to | #85737 |
On 02/17/2015 06:22 AM, janhein.vanderburg@gmail.com wrote: > In http://optarbvalintenc.blogspot.nl/ I propose a new way to encode arbitrarily valued integers and the python code that can be used as a reference for practical implementations of codecs. > > The encoding algorithm itself is optimized for transmission and storage requirements without any processor load consideration whatsoever. > > The next step is the development of the python code that minimizes processor requirements without compromising the algorithm. > > Is this the right forum to ask for help with this step or to obtain references to more suitable platforms? > This is a fine forum for such a discussion. I for one would love to participate. However, note that it isn't necessary true that "the smaller the better" is a good algorithm. In context, there are frequently a number of tradeoffs, even ignoring processor time (as you imply). Many years ago I worked on some code that manipulated the intermediate files for a compiler. I had no say in the format, I just had to deal with it. They had a field type called a "compressed integer." It could vary between one byte and I think about six. And the theory was that it took less space than the equivalent format of fixed size integers. The catch from my point was that these integers could appear in the middle of a struct, and thus access to the later fields of the struct required a dynamic calculation. This put a huge onus on my code to read or write the data serially. I ended up solving it by writing code that generated 40 thousand lines of C++ header and body code, so that the rest of the code didn't care. Was it worth it? To reduce the size of some files that only lived a few seconds on disk? I seriously doubt it. But I learned a lot in the process. On another project, the goal was to be able to recover data from archives in spite of physical damage to some files. So I had to add selective redundancy for that. In the process, I also compress the data, but confine the compression algorithm to relatively small pieces of data, and label those pieces independently, so that any single decompression error affects only one unit of data. So going back to your problem, and assuming that the other issues are moot, what's your proposal? Are you compressing relative to a straight binary form of storage? Are you assuming anything about the relative likelihood of various values of integers? Do you provide anything to allow for the possibility that your prediction for probability distribution isn't met? -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 00:59 -0800 |
| Message-ID | <515047c1-a20d-430e-a029-1c2d77db2f1a@googlegroups.com> |
| In reply to | #85739 |
On Tuesday, February 17, 2015 at 3:13:41 PM UTC+1, Dave Angel wrote: > This is a fine forum for such a discussion. I for one would love to > participate. However, note that it isn't necessary true that "the > smaller the better" is a good algorithm. In context, there are > frequently a number of tradeoffs, even ignoring processor time (as you > imply). Thanks Dave; about those trade offs: I agree that allowing variable length encoding in general forces the programmer to process records in a file sequentially. Random access to individual data items in such a stream is simply not an option, unless all items have been indexed in a directory that accompanies the data character stream. So if you need random access to individual items put limits to each individual data item. My applications do not allow me to do that, because I would always be to restrictive and be caught by the "640k should be enough for any programmer" trap. > So going back to your problem, and assuming that the other issues are > moot, what's your proposal? Are you compressing relative to a straight > binary form of storage? Are you assuming anything about the relative > likelihood of various values of integers? Do you provide anything to > allow for the possibility that your prediction for probability > distribution isn't met? I'm not compressing sequences of integer encodings but encoding individual integers optimally without any assumptions about their values.
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-18 11:46 -0500 |
| Message-ID | <mailman.18823.1424278025.18130.python-list@python.org> |
| In reply to | #85777 |
On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote: > encoding individual integers optimally without any assumptions about their values. > Contradiction in terms. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2015-02-18 17:30 +0000 |
| Message-ID | <mc2i7g$eof$2@reader1.panix.com> |
| In reply to | #85798 |
On 2015-02-18, Dave Angel <davea@davea.name> wrote:
> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>
>
>> encoding individual integers optimally without any assumptions about
>> their values.
>
> Contradiction in terms.
Ah, that depends on whether the encoding has to be lossless or not.
For example:
def encode_integer(i):
return 0;
Voila! Optimal encoding with no assumptions about intut values.
It is, however, a bit lossy.
--
Grant Edwards grant.b.edwards Yow! Did you move a lot of
at KOREAN STEAK KNIVES this
gmail.com trip, Dingy?
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2015-02-18 18:12 +0000 |
| Message-ID | <mailman.18827.1424283186.18130.python-list@python.org> |
| In reply to | #85802 |
On 18/02/2015 17:30, Grant Edwards wrote: > On 2015-02-18, Dave Angel <davea@davea.name> wrote: >> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote: >> >> >>> encoding individual integers optimally without any assumptions about >>> their values. >> >> Contradiction in terms. > > Ah, that depends on whether the encoding has to be lossless or not. > > For example: > > def encode_integer(i): > return 0; > > Voila! Optimal encoding with no assumptions about intut values. > > It is, however, a bit lossy. > Is it this kind of coding that makes the PEP 393 FSR so poor? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-18 11:55 -0800 |
| Message-ID | <2a717ffb-d61d-4407-9082-1c17cd7ee573@googlegroups.com> |
| In reply to | #85798 |
Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel: > On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote: > > > > encoding individual integers optimally without any assumptions about their values. > > > > Contradiction in terms. > > -- > DaveA Not. Jan-Hein.
[toc] | [prev] | [next] | [standalone]
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-02-18 23:54 +0200 |
| Message-ID | <87fva2ophe.fsf@elektro.pacujo.net> |
| In reply to | #85815 |
janhein.vanderburg@gmail.com: > Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel: >> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote: >> > encoding individual integers optimally without any assumptions >> > about their values. >> >> Contradiction in terms. > > Not. Out of curiosity, could you give me an example of an integer, not assuming anything about its value. I mean, any integer you could mention would be very close to zero compared with virtually all other integers. Marko
[toc] | [prev] | [next] | [standalone]
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2015-02-19 00:08 +0200 |
| Message-ID | <87bnkqooth.fsf@elektro.pacujo.net> |
| In reply to | #85820 |
Marko Rauhamaa <marko@pacujo.net>: > Out of curiosity, could you give me an example of an integer, not > assuming anything about its value. > > I mean, any integer you could mention would be very close to zero > compared with virtually all other integers. And I don't mean to be snide, either. I'm just saying that your "optimal encoding" necessarily depends on the expected distribution of the integers. There is no such thing as an even distribution of integers. Marko
[toc] | [prev] | [next] | [standalone]
| From | Grant Edwards <invalid@invalid.invalid> |
|---|---|
| Date | 2015-02-18 22:58 +0000 |
| Message-ID | <mc35f7$hqo$1@reader1.panix.com> |
| In reply to | #85820 |
On 2015-02-18, Marko Rauhamaa <marko@pacujo.net> wrote:
> janhein.vanderburg@gmail.com:
>
>> Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel:
>>> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote:
>>> > encoding individual integers optimally without any assumptions
>>> > about their values.
>>>
>>> Contradiction in terms.
>>
>> Not.
>
> Out of curiosity, could you give me an example of an integer, not
> assuming anything about its value.
>
> I mean, any integer you could mention would be very close to zero
> compared with virtually all other integers.
Thus demonstrating the fitness of my previously posted "lossy integer
encoding" algorithm.
--
Grant Edwards grant.b.edwards Yow! Hmmm ... an arrogant
at bouquet with a subtle
gmail.com suggestion of POLYVINYL
CHLORIDE ...
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2015-02-18 17:19 -0500 |
| Message-ID | <mailman.18839.1424298000.18130.python-list@python.org> |
| In reply to | #85815 |
On 02/18/2015 02:55 PM, janhein.vanderburg@gmail.com wrote: > Op woensdag 18 februari 2015 17:47:49 UTC+1 schreef Dave Angel: >> On 02/18/2015 03:59 AM, janhein.vanderburg@gmail.com wrote: >> >> >>> encoding individual integers optimally without any assumptions about their values. >>> >> >> Contradiction in terms. >> >> -- >> DaveA > > Not. > Jan-Hein. > Then you had better define your new word "optimal" to us. I decided to try your algorithm for all the values between 0 and 999999. One million values, and the 7bit encoding[1] beat yours for 950081 of them. The rest were tied. Yours never was shorter. For a uniform distribution of those particular values, the average number of bytes used by yours was 3.933568 bytes, and by 7bit encoding was 2.983487 For the second and third million, yours are all 4 bytes, while 7bit uses 3. Beyond 2097152, 7bit uses 4 bytes, same as you. Between 16 and 17 million, you average 4.156865, while 7bit is a constant 4.0. After that, I started spot-checking. I went up to 100 billion, and for none of those I tried did your algorithm take fewer bytes than 7bit. So how is yours optimal? Over what range of values? I'm not necessarily doubting it, just challenging you to provide a data sample that actually shows it. And of course, I'm not claiming that 7bit is in any way optimal. You cannot define optimal without first defining the distribution. [1] by 7bit, I'm referring to the one apparently used in MIDI encoding, where 7 bits of each byte hold the value, and the 8th bit is zero, except for the last byte, where the 8th bit is one. So 3 bytes can encode 21 bits, or up to 2**21 - 1. -- DaveA
[toc] | [prev] | [next] | [standalone]
| From | janhein.vanderburg@gmail.com |
|---|---|
| Date | 2015-02-19 07:45 -0800 |
| Message-ID | <993f64c3-fc85-48a7-9d02-a4f12ecb33c6@googlegroups.com> |
| In reply to | #85827 |
On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote: > I'm not necessarily doubting it, just challenging you to provide a data > sample that actually shows it. And of course, I'm not claiming that > 7bit is in any way optimal. You cannot define optimal without first > defining the distribution. Weird results. For a character size 2 the growth processes are shown below. I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length: 0: 00 00 0 1: 01 01 0 2: 10, 00 10, 00 0 3: 10, 01 10, 01 0 4: 10, 10 11, 00 0 5: 10, 11 11, 01 0 6: 11, 00.00 11, 10, 00 0 7: 11, 00.01 11, 10, 01 0 8: 11, 00.10 11, 11, 00 0 9: 11, 00.11 11, 11, 01 0 10: 11, 01.00 11, 11, 10, 00 1 11: 11, 01.01 11, 11, 10, 01 1 12: 11, 01.10 11, 11, 11, 00 1 13: 11, 01.11 11, 11, 11, 01 1 14: 11, 10.00, 00 11, 11, 11, 10, 00 1 15: 11, 10.00, 01 11, 11, 11, 10, 01 1 16: 11, 10.00, 10 11, 11, 11, 11, 00 1 17: 11, 10.00, 11 11, 11, 11, 11, 01 1 18: 11, 10.01, 00.00 11, 11, 11, 11, 10, 00 1 19: 11, 10.01, 00.01 11, 11, 11, 11, 10, 01 1 20: 11, 10.01, 00.10 11, 11, 11, 11, 11, 00 1 21: 11, 10.01, 00.11 11, 11, 11, 11, 11, 01 1 22: 11, 10.01, 01.00 11, 11, 11, 11, 11, 10, 00 2 23: 11, 10.01, 01.01 11, 11, 11, 11, 11, 10, 01 2 24: 11, 10.01, 01.10 11, 11, 11, 11, 11, 11, 00 2 25: 11, 10.01, 01.11 11, 11, 11, 11, 11, 11, 01 2 26: 11, 10.01, 10.00 11, 11, 11, 11, 11, 11, 10, 00 3 I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding.
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2015-02-19 11:04 -0700 |
| Message-ID | <mailman.18892.1424369100.18130.python-list@python.org> |
| In reply to | #85910 |
On Thu, Feb 19, 2015 at 8:45 AM, <janhein.vanderburg@gmail.com> wrote: > On Wednesday, February 18, 2015 at 11:20:12 PM UTC+1, Dave Angel wrote: >> I'm not necessarily doubting it, just challenging you to provide a data >> sample that actually shows it. And of course, I'm not claiming that >> 7bit is in any way optimal. You cannot define optimal without first >> defining the distribution. > > Weird results. > For a character size 2 the growth processes are shown below. > I listed the decimal representations, the difficult representation, a stop bit encoding, and the number of characters they differ in length: > 0: 00 00 0 > 1: 01 01 0 > 2: 10, 00 10, 00 0 > 3: 10, 01 10, 01 0 > 4: 10, 10 11, 00 0 > 5: 10, 11 11, 01 0 > 6: 11, 00.00 11, 10, 00 0 > 7: 11, 00.01 11, 10, 01 0 > 8: 11, 00.10 11, 11, 00 0 > 9: 11, 00.11 11, 11, 01 0 > 10: 11, 01.00 11, 11, 10, 00 1 > 11: 11, 01.01 11, 11, 10, 01 1 > 12: 11, 01.10 11, 11, 11, 00 1 > 13: 11, 01.11 11, 11, 11, 01 1 > 14: 11, 10.00, 00 11, 11, 11, 10, 00 1 > 15: 11, 10.00, 01 11, 11, 11, 10, 01 1 > 16: 11, 10.00, 10 11, 11, 11, 11, 00 1 > 17: 11, 10.00, 11 11, 11, 11, 11, 01 1 > 18: 11, 10.01, 00.00 11, 11, 11, 11, 10, 00 1 > 19: 11, 10.01, 00.01 11, 11, 11, 11, 10, 01 1 > 20: 11, 10.01, 00.10 11, 11, 11, 11, 11, 00 1 > 21: 11, 10.01, 00.11 11, 11, 11, 11, 11, 01 1 > 22: 11, 10.01, 01.00 11, 11, 11, 11, 11, 10, 00 2 > 23: 11, 10.01, 01.01 11, 11, 11, 11, 11, 10, 01 2 > 24: 11, 10.01, 01.10 11, 11, 11, 11, 11, 11, 00 2 > 25: 11, 10.01, 01.11 11, 11, 11, 11, 11, 11, 01 2 > 26: 11, 10.01, 10.00 11, 11, 11, 11, 11, 11, 10, 00 3 > > I didn't take the time to prove it mathematically, but these results suggest to me that the complicated encoding beats the stop bit encoding. That stop-bit variant looks extremely inefficient (and wrong) to me. First, 2 bits per group is probably a bad choice for a stop-bit encoding. It saves some space for very small integers, but it won't scale well at all. Fully half of the bits are stop bits! Secondly, I don't understand why the leading groups are all 11s and only the later groups introduce variability. In fact, that's practically a unary encoding with just a small amount of binary at the end. This is what I would expect a 2-bit stop-bit encoding to look like: 0: 00 1: 01 2: 11, 00 3: 11, 01 4: 11, 10, 00 5: 11, 10, 01 6: 11, 11, 00 7: 11, 11, 01 8: 11, 10, 10, 00 9: 11, 10, 10, 01 10: 11, 10, 11, 00 11: 11, 10, 11, 01 12: 11, 11, 10, 00 13: 11, 11, 10, 01 14: 11, 11, 11, 00 15: 11, 11, 11, 01 16: 11, 10, 10, 10, 00 17: 11, 10, 10, 10, 01 18: 11, 10, 10, 11, 00 19: 11, 10, 10, 11, 01 20: 11, 10, 11, 10, 00 21: 11, 10, 11, 10, 01 22: 11, 10, 11, 11, 00 23: 11, 10, 11, 11, 01 24: 11, 11, 10, 10, 00 25: 11, 11, 10, 10, 01 26: 11, 11, 10, 11, 00 27: 11, 11, 10, 11, 01 28: 11, 11, 11, 10, 00 29: 11, 11, 11, 10, 01 30: 11, 11, 11, 11, 00 31: 11, 11, 11, 11, 01 etc. Notice that the size grows as O(log n), not O(n) as above. Notice also that the only values here for which this saves space over the 7-bit version are 0-7. Unless you expect those values to be very common, the 7-bit encoding that needs only one byte all the way up to 127 makes a lot of sense. There's also an optimization that can be added here if we wish to inject a bit of cleverness. Notice that all values with more than one group start with 11, never 10. We can borrow a trick from IEEE floating point and make the leading 1 bit of the mantissa implicit for all values greater than 3 (we can't do it for 2 and 3 because then we couldn't distinguish them from 0 and 1). Applying this optimization removes one full group from the representation of all values greater than 3, which appears to make the stop-bit representation as short as or shorter than the "difficult" one for all the values that have been enumerated above.
[toc] | [prev] | [next] | [standalone]
Page 1 of 3 [1] 2 3 Next page →
Back to top | Article view | comp.lang.python
csiph-web