| Started by | random joe <pywin32@gmail.com> |
|---|---|
| First post | 2012-01-05 14:44 -0800 |
| Last post | 2012-01-05 17:02 -0700 |
| Articles | 17 — 6 participants |
Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 14:44 -0800
Re: Help with python-list archives Miki Tebeka <miki.tebeka@gmail.com> - 2012-01-05 15:39 -0800
Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 15:52 -0800
Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-05 17:10 -0700
Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 16:45 -0800
Re: Help with python-list archives MRAB <python@mrabarnett.plus.com> - 2012-01-06 01:27 +0000
Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 18:14 -0800
Re: Help with python-list archives MRAB <python@mrabarnett.plus.com> - 2012-01-06 03:00 +0000
Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 20:01 -0800
Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 20:08 -0800
Re: Help with python-list archives Chris Angelico <rosuav@gmail.com> - 2012-01-06 15:12 +1100
Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-06 00:45 -0700
Re: Help with python-list archives Anssi Saari <as@sci.fi> - 2012-01-10 16:47 +0200
Re: Help with python-list archives Chris Angelico <rosuav@gmail.com> - 2012-01-06 15:11 +1100
Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-06 00:41 -0700
Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-06 16:55 -0800
Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-05 17:02 -0700
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-05 14:44 -0800 |
| Subject | Help with python-list archives |
| Message-ID | <78484055-dd01-4237-9217-9eb038fc744f@p16g2000yqd.googlegroups.com> |
Hi. I am new to Python and wanted to search the python-list archives for answers to my many questions, but I can't seem to get the archive files to uncompress. What gives? From what I understand they are gzip files, so I assumed the gzip module would work, but no! The best I could do was to get a ton of Chinese chars using gzip and zlib.decompress(). I would like to be courteous and search for my answers before asking so as not to waste anyone's time. Does anyone know how to uncompress these files into a readable text form?
| From | Miki Tebeka <miki.tebeka@gmail.com> |
|---|---|
| Date | 2012-01-05 15:39 -0800 |
| Message-ID | <14749754.624.1325806776674.JavaMail.geo-discussion-forums@vbgw2> |
| In reply to | #18566 |
Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python) Also, can you give an example of the code and an input file?
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-05 15:52 -0800 |
| Message-ID | <8f3b98e1-3b21-4f06-8456-0a555a7ee523@u32g2000yqe.googlegroups.com> |
| In reply to | #18567 |
On Jan 5, 5:39 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> Is the Google groups search not good enough?
That works but i would like to do some regexes and set up some
defaults.
> Also, can you give an example of the code and an input file?
Sure. Take the most recent file as example. "2012 - January.txt.gz".
If you use the python doc example this is the result. If i use "r" or
"rb" the result is the same.
>>> import gzip
>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>> data = f1.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>> data = f2.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
The docs and google provide no clear answer. I even tried 7zip and
ended up with nothing but gibberish characters. There must be levels
of compression or something. Why could they not simply use the tar
format? Is there anywhere else one can download the archives?
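A note on the output above: the leading bytes '\x1f\x8b\x08' match the gzip magic number followed by the deflate method byte, which hints that the data returned by the first read is itself a gzip stream. A minimal Python 3 sketch of that check follows. The helper name looks_gzipped and the sample bytes are hypothetical stand-ins, not the real archive.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)


def looks_gzipped(data):
    """Return True if data begins with the gzip magic number."""
    return data[:2] == GZIP_MAGIC


# Stand-in for the downloaded archive: some text compressed twice.
text = b"From python-list archive\n"
twice = gzip.compress(gzip.compress(text))

once = gzip.decompress(twice)
print(looks_gzipped(once))                   # is the first result still gzip?
print(looks_gzipped(gzip.decompress(once)))  # and after a second pass?
```

If the first check reports a gzip stream, one more decompression pass is worth trying before blaming the module.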
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-01-05 17:10 -0700 |
| Message-ID | <mailman.4463.1325808641.27778.python-list@python.org> |
| In reply to | #18568 |
On Thu, Jan 5, 2012 at 4:52 PM, random joe <pywin32@gmail.com> wrote:
> Sure. Take the most recent file as example. "2012 - January.txt.gz".
> If you use the python doc example this is the result. If i use "r" or
> "rb" the result is the same.
>
>>>> import gzip
>>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>>> data = f1.read()
>>>> data[:100]
> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>>> data = f2.read()
>>>> data[:100]
> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>
> The docs and google provide no clear answer. I even tried 7zip and
> ended up with nothing but gibberish characters. There must be levels
> of compression or something. Why could they not simply use the tar
> format? Is there anywhere else one can download the archives?
Interesting. I tried this on a Linux system using both gunzip and
your code, and both worked fine to extract that file. I also tried
your code on a Windows system, and I get the same result that you do.
This appears to be a bug in the gzip module under Windows.
I think there may be something peculiar about the archive files that
the module is not handling correctly. If I gunzip the file locally
and then gzip it again before trying to open it in Python, then
everything seems to be fine.
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-05 16:45 -0800 |
| Message-ID | <0cbcc2ce-5dff-4cd9-89f5-833956dc8e37@q7g2000yqn.googlegroups.com> |
| In reply to | #18570 |
On Jan 5, 6:10 pm, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> Interesting. I tried this on a Linux system using both gunzip and
> your code, and both worked fine to extract that file. I also tried
> your code on a Windows system, and I get the same result that you do.
> This appears to be a bug in the gzip module under Windows.
>
> I think there may be something peculiar about the archive files that
> the module is not handling correctly. If I gunzip the file locally
> and then gzip it again before trying to open it in Python, then
> everything seems to be fine.
That is interesting. I wonder if anyone else has had the same issue?
Just to be thorough I tried to uncompress using both python 2.x and
3.x and the results are unreadable text files in both cases. I have no
idea what the problem could be. Especially without some way to compare
my files to the gunzip'ed files on a linux machine.
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-01-06 01:27 +0000 |
| Message-ID | <mailman.4464.1325813187.27778.python-list@python.org> |
| In reply to | #18568 |
On 06/01/2012 00:10, Ian Kelly wrote:
> On Thu, Jan 5, 2012 at 4:52 PM, random joe<pywin32@gmail.com> wrote:
>> Sure. Take the most recent file as example. "2012 - January.txt.gz".
>> If you use the python doc example this is the result. If i use "r" or
>> "rb" the result is the same.
>>
>>>>> import gzip
>>>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>>>> data = f1.read()
>>>>> data[:100]
>> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
>> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
>> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>>>> data = f2.read()
>>>>> data[:100]
>> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
>> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
>> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>
>> The docs and google provide no clear answer. I even tried 7zip and
>> ended up with nothing but gibberish characters. There must be levels
>> of compression or something. Why could they not simply use the tar
>> format? Is there anywhere else one can download the archives?
>
> Interesting. I tried this on a Linux system using both gunzip and
> your code, and both worked fine to extract that file. I also tried
> your code on a Windows system, and I get the same result that you do.
> This appears to be a bug in the gzip module under Windows.
>
> I think there may be something peculiar about the archive files that
> the module is not handling correctly. If I gunzip the file locally
> and then gzip it again before trying to open it in Python, then
> everything seems to be fine.
I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-05 18:14 -0800 |
| Message-ID | <fa7b56e3-68b1-4b37-834d-48ab73177baf@k28g2000yqn.googlegroups.com> |
| In reply to | #18573 |
On Jan 5, 7:27 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> I've found that if I gunzip it twice (gunzip it and then gunzip the
> result) using the gzip module I get the text file.
On a windows machine? If so, can you post a code snippet please?
Thanks
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-01-06 03:00 +0000 |
| Message-ID | <mailman.4466.1325818829.27778.python-list@python.org> |
| In reply to | #18574 |
On 06/01/2012 02:14, random joe wrote:
> On Jan 5, 7:27 pm, MRAB<pyt...@mrabarnett.plus.com> wrote:
>
>> I've found that if I gunzip it twice (gunzip it and then gunzip the
>> result) using the gzip module I get the text file.
>
> On a windows machine? If so, can you post a code snippet please?
> Thanks
import gzip

in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()
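The same two-pass idea can be written without a temporary file and without hard-coding the number of passes. This is only a sketch for Python 3: gunzip_fully, extract and the max_layers cap are hypothetical names chosen for illustration, and the loop assumes that real archive text never starts with the two gzip magic bytes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def gunzip_fully(data, max_layers=5):
    """Strip gzip layers until the data no longer starts with the gzip magic.

    Returns the decompressed bytes and the number of layers removed.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers


def extract(src_path, dst_path):
    """Read src_path, remove every gzip layer, write the result to dst_path."""
    with open(src_path, "rb") as src:
        data, layers = gunzip_fully(src.read())
    with open(dst_path, "wb") as dst:
        dst.write(data)
    return layers
```

Calling extract(r"C:\2012-January.txt.gz", r"C:\2012-January.txt") would then cover both the singly and the doubly compressed case. The with blocks also close the files if an exception is raised part way through.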
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-05 20:01 -0800 |
| Message-ID | <cfdb189b-a480-42c5-beac-2b9855c439ae@t13g2000yqg.googlegroups.com> |
| In reply to | #18577 |
On Jan 5, 9:00 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 06/01/2012 02:14, random joe wrote:
>
> > On Jan 5, 7:27 pm, MRAB<pyt...@mrabarnett.plus.com> wrote:
>
> >> I've found that if I gunzip it twice (gunzip it and then gunzip the
> >> result) using the gzip module I get the text file.
>
> > On a windows machine? If so, can you post a code snippet please?
> > Thanks
>
> import gzip
>
> in_file = gzip.open(r"C:\2012-January.txt.gz")
> out_file = open(r"C:\2012-January.txt.tmp", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
>
> in_file = gzip.open(r"C:\2012-January.txt.tmp")
> out_file = open(r"C:\2012-January.txt", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
EXCELLENT! Thanks.
This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended or is this a
result of another bug?
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-05 20:08 -0800 |
| Message-ID | <d89d0b04-3f89-4c65-a329-3ca2be88b0ad@d9g2000yqg.googlegroups.com> |
| In reply to | #18579 |
On Jan 5, 10:01 pm, random joe <pywi...@gmail.com> wrote:
> On Jan 5, 9:00 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> > import gzip
>
> > in_file = gzip.open(r"C:\2012-January.txt.gz")
> > out_file = open(r"C:\2012-January.txt.tmp", "wb")
> > out_file.write(in_file.read())
> > in_file.close()
> > out_file.close()
>
> > in_file = gzip.open(r"C:\2012-January.txt.tmp")
> > out_file = open(r"C:\2012-January.txt", "wb")
> > out_file.write(in_file.read())
> > in_file.close()
> > out_file.close()
>
> EXCELLENT! Thanks.
>
> This works; however, there is one more tiny hiccup. The text has lost
> all significant indentation and newlines. Was this intended or is this a
> result of another bug?
Never mind. Notepad was the problem. After using a real editor the text
is displayed correctly! Thanks for the help, everyone!
PS: I wonder why no one has added a note to the Python-list archives
to advise people about the bug?
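The post does not say why Notepad showed the text without line breaks. One plausible cause, and it is only an assumption here, is that the archive uses Unix-style LF line endings, which older Notepad versions do not render as new lines. If so, a small Python 3 conversion like the following would make the file readable there; to_crlf is a hypothetical helper name.

```python
import re


def to_crlf(data):
    """Convert bare LF line endings to CRLF, leaving existing CRLF pairs alone."""
    return re.sub(rb"(?<!\r)\n", b"\r\n", data)


sample = b"first line\nsecond line\r\nthird line\n"
print(to_crlf(sample))
```

The negative lookbehind keeps lines that already end in CRLF from being turned into CR CR LF.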
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-01-06 15:12 +1100 |
| Message-ID | <mailman.4470.1325823127.27778.python-list@python.org> |
| In reply to | #18580 |
On Fri, Jan 6, 2012 at 3:08 PM, random joe <pywin32@gmail.com> wrote:
> Nevermind. Notepad was the problem. After using a real editor the text
> is displayed correctly! Thanks for help everyone!
... or that could be your problem :)
ChrisA
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-01-06 00:45 -0700 |
| Message-ID | <mailman.4473.1325835957.27778.python-list@python.org> |
| In reply to | #18580 |
On Thu, Jan 5, 2012 at 9:08 PM, random joe <pywin32@gmail.com> wrote:
> PS: I wonder why no one has added a note to the Python-list archives
> to advise people about the bug?
Probably nobody has noticed it until now. It seems to be a quirk of
the archive files that they are double-gzipped, and most people
probably just use gunzip or gzcat (or a higher-level tool that invokes
those) to extract them, which seems to be smart enough to handle it.
| From | Anssi Saari <as@sci.fi> |
|---|---|
| Date | 2012-01-10 16:47 +0200 |
| Message-ID | <vg3sjjnzayv.fsf@sci.fi> |
| In reply to | #18588 |
Ian Kelly <ian.g.kelly@gmail.com> writes:
> Probably nobody has noticed it until now. It seems to be a quirk of
> the archive files that they are double-gzipped...
Interesting, but I don't think the files are actually double-gzipped.
If I download
http://mail.python.org/pipermail/python-list/2012-January.txt.gz with
wget in Cygwin or Unix, the file is 226753 bytes and singly gzipped.
However, if I download the same file with Firefox in Windows, then
it's 226782 bytes and double gzipped. So maybe it's something in the
browser or server setup?
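The small size difference reported above (226782 versus 226753 bytes) fits the pattern of a second gzip layer wrapped around data that is already compressed: the second pass cannot shrink it further and only adds its own header, trailer and a little block overhead. The Python 3 sketch below illustrates that pattern with made-up text. The word list and seed are arbitrary assumptions, and the exact byte counts will differ from the real archive.

```python
import gzip
import random

# Hypothetical stand-in for the archive text: pseudo-random words, fixed seed.
rng = random.Random(0)
words = ["python", "list", "gzip", "archive", "windows", "linux", "module"]
text = " ".join(rng.choice(words) for _ in range(50000)).encode("ascii")

once = gzip.compress(text)   # a single gzip layer, like the wget download
twice = gzip.compress(once)  # a second layer on top, like the browser download

print(len(text), len(once), len(twice))
print("size change from the second layer:", len(twice) - len(once), "bytes")
```

This shows only that the sizes are consistent with double compression. It does not say whether the browser or the server added the extra layer.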
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-01-06 15:11 +1100 |
| Message-ID | <mailman.4469.1325823094.27778.python-list@python.org> |
| In reply to | #18579 |
On Fri, Jan 6, 2012 at 3:01 PM, random joe <pywin32@gmail.com> wrote:
> THis works however there is one more tiny hiccup. The text has lost
> all significant indention and newlines. Was this intended or is this a
> result of another bug?
I'm seeing it as plain text, with proper newlines. There's no
indentation as it just runs straight through, top-to-bottom; but you
should be able to see line breaks. Check your mail reader in case
something's getting botched there.
ChrisA
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-01-06 00:41 -0700 |
| Message-ID | <mailman.4472.1325835693.27778.python-list@python.org> |
| In reply to | #18574 |
On Thu, Jan 5, 2012 at 8:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
> import gzip
>
> in_file = gzip.open(r"C:\2012-January.txt.gz")
> out_file = open(r"C:\2012-January.txt.tmp", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
>
> in_file = gzip.open(r"C:\2012-January.txt.tmp")
> out_file = open(r"C:\2012-January.txt", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:
import gzip
from cStringIO import StringIO
in_file = gzip.open('2012-January.txt.gz')
tmp_file = StringIO(in_file.read())
in_file.close()
in_file = gzip.GzipFile(fileobj=tmp_file)
out_file = open('2012-January.txt', 'wb')
out_file.write(in_file.read())
in_file.close()
out_file.close()
Sadly, GzipFile won't read directly from another GzipFile instance
(ValueError: Seek from end not supported), so some sort of
intermediate is necessary.
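The cStringIO module used above exists only in Python 2. In Python 3 the in-memory role is played by io.BytesIO, so the same approach might look like the sketch below. The function name read_double_gzipped is hypothetical, and the code assumes the file really has exactly two gzip layers.

```python
import gzip
import io


def read_double_gzipped(path):
    """Return the text bytes from a file that was gzipped twice."""
    # First pass: strip the outer layer and keep the result in memory.
    with gzip.open(path, "rb") as outer:
        inner_bytes = io.BytesIO(outer.read())
    # Second pass: treat the in-memory bytes as another gzip file.
    with gzip.GzipFile(fileobj=inner_bytes) as inner:
        return inner.read()
```

A caller would then write the returned bytes to '2012-January.txt' opened in 'wb' mode, as in the original snippet.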
| From | random joe <pywin32@gmail.com> |
|---|---|
| Date | 2012-01-06 16:55 -0800 |
| Message-ID | <0616bd58-bb3d-4e75-b142-fc3979b20aac@cf6g2000vbb.googlegroups.com> |
| In reply to | #18587 |
On Jan 6, 1:41 am, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> One could also avoid creating the intermediate file by using a
> StringIO to keep it in memory instead:
Yes StringIO is perfect for this. Many thanks to all who replied.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-01-05 17:02 -0700 |
| Message-ID | <mailman.4462.1325808206.27778.python-list@python.org> |
| In reply to | #18567 |
On Thu, Jan 5, 2012 at 4:39 PM, Miki Tebeka <miki.tebeka@gmail.com> wrote:
> Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)
My experience with the Google Groups search (and Google Groups in
general) in the past has been terrible. If you're looking for a
specific thread, it can actually be quite hard to find.