Groups > comp.lang.python > #18566 > unrolled thread

Help with python-list archives

Started by	random joe <pywin32@gmail.com>
First post	2012-01-05 14:44 -0800
Last post	2012-01-05 17:02 -0700
Articles	17 — 6 participants

  Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 14:44 -0800
    Re: Help with python-list archives Miki Tebeka <miki.tebeka@gmail.com> - 2012-01-05 15:39 -0800
      Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 15:52 -0800
        Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-05 17:10 -0700
          Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 16:45 -0800
        Re: Help with python-list archives MRAB <python@mrabarnett.plus.com> - 2012-01-06 01:27 +0000
          Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 18:14 -0800
            Re: Help with python-list archives MRAB <python@mrabarnett.plus.com> - 2012-01-06 03:00 +0000
              Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 20:01 -0800
                Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 20:08 -0800
                  Re: Help with python-list archives Chris Angelico <rosuav@gmail.com> - 2012-01-06 15:12 +1100
                  Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-06 00:45 -0700
                    Re: Help with python-list archives Anssi Saari <as@sci.fi> - 2012-01-10 16:47 +0200
                Re: Help with python-list archives Chris Angelico <rosuav@gmail.com> - 2012-01-06 15:11 +1100
            Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-06 00:41 -0700
              Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-06 16:55 -0800
      Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-05 17:02 -0700

#18566 — Help with python-list archives

From	random joe <pywin32@gmail.com>
Date	2012-01-05 14:44 -0800
Subject	Help with python-list archives
Message-ID	<78484055-dd01-4237-9217-9eb038fc744f@p16g2000yqd.googlegroups.com>

Hi. I am new to Python and wanted to search the python-list archives
for answers to my many questions, but I can't seem to get the archive
files to uncompress. What gives? From what I understand they are
gzip files, so I assumed the gzip module would work, but no! The best I
could do was to get a ton of Chinese characters using gzip and
zlib.decompress(). I would like to be courteous and search for my
answers before asking so as not to waste anyone's time. Does anyone
know how to uncompress these files into a readable text form?

#18567

From	Miki Tebeka <miki.tebeka@gmail.com>
Date	2012-01-05 15:39 -0800
Message-ID	<14749754.624.1325806776674.JavaMail.geo-discussion-forums@vbgw2>
In reply to	#18566

Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

Also, can you give an example of the code and an input file?

#18568

From	random joe <pywin32@gmail.com>
Date	2012-01-05 15:52 -0800
Message-ID	<8f3b98e1-3b21-4f06-8456-0a555a7ee523@u32g2000yqe.googlegroups.com>
In reply to	#18567

On Jan 5, 5:39 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> Is the Google groups search not good enough?

That works, but I would like to do some regexes and set up some
defaults.

> Also, can you give an example of the code and an input file?

Sure. Take the most recent file as an example: "2012-January.txt.gz".
If you use the Python doc example, this is the result. If I use "r" or
"rb" the result is the same.

>>> import gzip
>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>> data = f1.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>> data = f2.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'

The docs and google provide no clear answer. I even tried 7zip and
ended up with nothing but gibberish characters. There must be levels
of compression or something. Why could they not simply use the tar
format? Is there anywhere else one can download the archives?

#18570

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-01-05 17:10 -0700
Message-ID	<mailman.4463.1325808641.27778.python-list@python.org>
In reply to	#18568

On Thu, Jan 5, 2012 at 4:52 PM, random joe <pywin32@gmail.com> wrote:
> Sure. Take the most recent file as example. "2012 - January.txt.gz".
> If you use the python doc example this is the result. If i use "r" or
> "rb" the result is the same.
>
>>>> import gzip
>>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>>> data = f1.read()
>>>> data[:100]
> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>>> data = f2.read()
>>>> data[:100]
> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>
> The docs and google provide no clear answer. I even tried 7zip and
> ended up with nothing but gibberish characters. There must be levels
> of compression or something. Why could they not simply use the tar
> format? Is there anywhere else one can download the archives?

Interesting.  I tried this on a Linux system using both gunzip and
your code, and both worked fine to extract that file.  I also tried
your code on a Windows system, and I get the same result that you do.
This appears to be a bug in the gzip module under Windows.

I think there may be something peculiar about the archive files that
the module is not handling correctly.  If I gunzip the file locally
and then gzip it again before trying to open it in Python, then
everything seems to be fine.

#18572

From	random joe <pywin32@gmail.com>
Date	2012-01-05 16:45 -0800
Message-ID	<0cbcc2ce-5dff-4cd9-89f5-833956dc8e37@q7g2000yqn.googlegroups.com>
In reply to	#18570

On Jan 5, 6:10 pm, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> Interesting.  I tried this on a Linux system using both gunzip and
> your code, and both worked fine to extract that file.  I also tried
> your code on a Windows system, and I get the same result that you do.
> This appears to be a bug in the gzip module under Windows.
>
> I think there may be something peculiar about the archive files that
> the module is not handling correctly.  If I gunzip the file locally
> and then gzip it again before trying to open it in Python, then
> everything seems to be fine.

That is interesting. I wonder if anyone else has had the same issue?

Just to be thorough, I tried to uncompress using both Python 2.x and
3.x, and the results are unreadable text files in both cases. I have no
idea what the problem could be, especially without some way to compare
my files to the gunzip'ed files on a Linux machine.

#18573

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-06 01:27 +0000
Message-ID	<mailman.4464.1325813187.27778.python-list@python.org>
In reply to	#18568

On 06/01/2012 00:10, Ian Kelly wrote:
> On Thu, Jan 5, 2012 at 4:52 PM, random joe<pywin32@gmail.com>  wrote:
>>  Sure. Take the most recent file as example. "2012 - January.txt.gz".
>>  If you use the python doc example this is the result. If i use "r" or
>>  "rb" the result is the same.
>>
>>>>>  import gzip
>>>>>  f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>>>>  data = f1.read()
>>>>>  data[:100]
>>  '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
>>  list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
>>  \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>>>>  f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>>>>  data = f2.read()
>>>>>  data[:100]
>>  '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
>>  list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
>>  \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>
>>  The docs and google provide no clear answer. I even tried 7zip and
>>  ended up with nothing but gibberish characters. There must be levels
>>  of compression or something. Why could they not simply use the tar
>>  format? Is there anywhere else one can download the archives?
>
> Interesting.  I tried this on a Linux system using both gunzip and
> your code, and both worked fine to extract that file.  I also tried
> your code on a Windows system, and I get the same result that you do.
> This appears to be a bug in the gzip module under Windows.
>
> I think there may be something peculiar about the archive files that
> the module is not handling correctly.  If I gunzip the file locally
> and then gzip it again before trying to open it in Python, then
> everything seems to be fine.

I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.
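
The session output earlier in the thread already hints at this: the bytes returned after one pass begin with '\x1f\x8b', which is the gzip magic number. A minimal sketch of that check follows, assuming Python 3.2 or later (for gzip.compress and gzip.decompress). The helper name looks_gzipped and the sample text are hypothetical and are not taken from the thread.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream (RFC 1952)

def looks_gzipped(data):
    """Return True if the bytes start with the gzip magic number."""
    return data[:2] == GZIP_MAGIC

# Self-contained demonstration: compress some sample text twice, the way
# the downloaded archive appears to be, then peel off one layer.
text = b"From: someone@example.com\nSubject: test\n"
double = gzip.compress(gzip.compress(text))
once = gzip.decompress(double)

print(looks_gzipped(once))                   # one pass still yields gzip data
print(looks_gzipped(gzip.decompress(once)))  # a second pass yields the text
```

If the check is true for the output of the first pass, a second pass is needed.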

#18574

From	random joe <pywin32@gmail.com>
Date	2012-01-05 18:14 -0800
Message-ID	<fa7b56e3-68b1-4b37-834d-48ab73177baf@k28g2000yqn.googlegroups.com>
In reply to	#18573

On Jan 5, 7:27 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:

> I've found that if I gunzip it twice (gunzip it and then gunzip the
> result) using the gzip module I get the text file.

On a Windows machine? If so, can you post a code snippet, please?
Thanks.

#18577

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-06 03:00 +0000
Message-ID	<mailman.4466.1325818829.27778.python-list@python.org>
In reply to	#18574

On 06/01/2012 02:14, random joe wrote:
> On Jan 5, 7:27 pm, MRAB<pyt...@mrabarnett.plus.com>  wrote:
>
>>  I've found that if I gunzip it twice (gunzip it and then gunzip the
>>  result) using the gzip module I get the text file.
>
> On a windows machine? If so, can you post a code snippet please?
> Thanks

import gzip

# First pass: strip the outer gzip layer into a temporary file.
in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

# Second pass: strip the inner gzip layer to get the text file.
in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()
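
The same two passes can be sketched with context managers and chunked copying, so the files are closed even on error and the whole archive is never held in memory. This is only a sketch, assuming Python 3; the function names gunzip_file and gunzip_twice are hypothetical.

```python
import gzip
import shutil

def gunzip_file(src_path, dst_path):
    """Strip one gzip layer from src_path and write the result to dst_path."""
    # copyfileobj copies in chunks rather than reading everything at once.
    with gzip.open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        shutil.copyfileobj(src, dst)

def gunzip_twice(src_path, tmp_path, dst_path):
    """Two passes, as in the snippet above: .gz -> .tmp -> text."""
    gunzip_file(src_path, tmp_path)
    gunzip_file(tmp_path, dst_path)
```

With the paths from the snippet above, the call would be gunzip_twice(r"C:\2012-January.txt.gz", r"C:\2012-January.txt.tmp", r"C:\2012-January.txt").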

#18579

From	random joe <pywin32@gmail.com>
Date	2012-01-05 20:01 -0800
Message-ID	<cfdb189b-a480-42c5-beac-2b9855c439ae@t13g2000yqg.googlegroups.com>
In reply to	#18577

On Jan 5, 9:00 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 06/01/2012 02:14, random joe wrote:
>
> > On Jan 5, 7:27 pm, MRAB<pyt...@mrabarnett.plus.com>  wrote:
>
> >>  I've found that if I gunzip it twice (gunzip it and then gunzip the
> >>  result) using the gzip module I get the text file.
>
> > On a windows machine? If so, can you post a code snippet please?
> > Thanks
>
> import gzip
>
> in_file = gzip.open(r"C:\2012-January.txt.gz")
> out_file = open(r"C:\2012-January.txt.tmp", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
>
> in_file = gzip.open(r"C:\2012-January.txt.tmp")
> out_file = open(r"C:\2012-January.txt", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()

EXCELLENT! Thanks.

This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended, or is this
the result of another bug?

#18580

From	random joe <pywin32@gmail.com>
Date	2012-01-05 20:08 -0800
Message-ID	<d89d0b04-3f89-4c65-a329-3ca2be88b0ad@d9g2000yqg.googlegroups.com>
In reply to	#18579

On Jan 5, 10:01 pm, random joe <pywi...@gmail.com> wrote:
> On Jan 5, 9:00 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:

> > import gzip
>
> > in_file = gzip.open(r"C:\2012-January.txt.gz")
> > out_file = open(r"C:\2012-January.txt.tmp", "wb")
> > out_file.write(in_file.read())
> > in_file.close()
> > out_file.close()
>
> > in_file = gzip.open(r"C:\2012-January.txt.tmp")
> > out_file = open(r"C:\2012-January.txt", "wb")
> > out_file.write(in_file.read())
> > in_file.close()
> > out_file.close()
>
> EXCELLENT! Thanks.
>
> THis works however there is one more tiny hiccup. The text has lost
> all significant indention and newlines. Was this intended or is this a
> result of another bug?

Never mind. Notepad was the problem. After using a real editor the text
is displayed correctly! Thanks for the help, everyone!

PS: I wonder why no one has added a note to the Python-list archives
to advise people about the bug?
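
The thread does not say why Notepad mangled the text. One plausible cause, stated here as an assumption, is that the extracted file uses Unix (LF) line endings, which older versions of Notepad do not display as line breaks. If so, converting to CRLF before opening the file would be one workaround. The helper name to_crlf is hypothetical.

```python
def to_crlf(data):
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    # Normalise first so that lines already ending in CRLF are not
    # turned into CR CR LF by the second replace.
    normalized = data.replace(b"\r\n", b"\n")
    return normalized.replace(b"\n", b"\r\n")

print(to_crlf(b"line one\nline two\r\n"))  # b'line one\r\nline two\r\n'
```

The function works on bytes, so it can be applied to the output of the second gunzip pass before writing it to disk.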

#18582

From	Chris Angelico <rosuav@gmail.com>
Date	2012-01-06 15:12 +1100
Message-ID	<mailman.4470.1325823127.27778.python-list@python.org>
In reply to	#18580

On Fri, Jan 6, 2012 at 3:08 PM, random joe <pywin32@gmail.com> wrote:
> Nevermind. Notepad was the problem. After using a real editor the text
> is displayed correctly! Thanks for help everyone!

... or that could be your problem :)

ChrisA

#18588

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-01-06 00:45 -0700
Message-ID	<mailman.4473.1325835957.27778.python-list@python.org>
In reply to	#18580

On Thu, Jan 5, 2012 at 9:08 PM, random joe <pywin32@gmail.com> wrote:
> PS: I wonder why no one has added a note to the Python-list archives
> to advise people about the bug?

Probably nobody has noticed it until now.  It seems to be a quirk of
the archive files that they are double-gzipped, and most people
probably just use gunzip or gzcat (or a higher-level tool that invokes
those) to extract them, which seems to be smart enough to handle it.

#18776

From	Anssi Saari <as@sci.fi>
Date	2012-01-10 16:47 +0200
Message-ID	<vg3sjjnzayv.fsf@sci.fi>
In reply to	#18588

Ian Kelly <ian.g.kelly@gmail.com> writes:

> Probably nobody has noticed it until now.  It seems to be a quirk of
> the archive files that they are double-gzipped...

Interesting, but I don't think the files are actually double-gzipped. If
I download
http://mail.python.org/pipermail/python-list/2012-January.txt.gz with
wget in Cygwin or Unix, the file is 226753 bytes and singly gzipped.

However, if I download the same file with Firefox in Windows, then it's
226782 bytes and double gzipped. So maybe it's something in the browser
or server setup?
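
To compare two downloads of the same file, it may help to count how many gzip layers each one has. The following is a rough sketch, assuming Python 3.2 or later; the name count_gzip_layers and the max_layers safety limit are hypothetical. It only tests for the gzip magic number, so plain data that happens to start with those two bytes would raise an error in gzip.decompress.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # gzip magic number (RFC 1952)

def count_gzip_layers(data, max_layers=5):
    """Return (layers, payload): the number of gzip layers peeled off
    and the bytes that remain once no gzip magic number is left."""
    layers = 0
    while layers < max_layers and data[:2] == GZIP_MAGIC:
        data = gzip.decompress(data)
        layers += 1
    return layers, data

# Demonstration with in-memory samples standing in for the two downloads.
text = b"sample archive text\n"
single = gzip.compress(text)
double = gzip.compress(single)
print(count_gzip_layers(single)[0], len(single))
print(count_gzip_layers(double)[0], len(double))
```

For a file on disk, read it in binary mode first, for example open(path, "rb").read(), and pass the bytes to the function.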

#18581

From	Chris Angelico <rosuav@gmail.com>
Date	2012-01-06 15:11 +1100
Message-ID	<mailman.4469.1325823094.27778.python-list@python.org>
In reply to	#18579

On Fri, Jan 6, 2012 at 3:01 PM, random joe <pywin32@gmail.com> wrote:
> THis works however there is one more tiny hiccup. The text has lost
> all significant indention and newlines. Was this intended or is this a
> result of another bug?

I'm seeing it as plain text, with proper newlines. There's no
indentation as it just runs straight through, top-to-bottom; but you
should be able to see line breaks. Check your mail reader in case
something's getting botched there.

ChrisA

#18587

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-01-06 00:41 -0700
Message-ID	<mailman.4472.1325835693.27778.python-list@python.org>
In reply to	#18574

On Thu, Jan 5, 2012 at 8:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
> import gzip
>
> in_file = gzip.open(r"C:\2012-January.txt.gz")
> out_file = open(r"C:\2012-January.txt.tmp", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
>
> in_file = gzip.open(r"C:\2012-January.txt.tmp")
> out_file = open(r"C:\2012-January.txt", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()

One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:

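import gzip
from cStringIO import StringIO  # Python 2 only

# First pass: strip the outer gzip layer and keep the result in memory.
in_file = gzip.open('2012-January.txt.gz')
tmp_file = StringIO(in_file.read())
in_file.close()
# Second pass: strip the inner gzip layer and write the text to disk.
in_file = gzip.GzipFile(fileobj=tmp_file)
out_file = open('2012-January.txt', 'wb')
out_file.write(in_file.read())
in_file.close()
out_file.close()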

Sadly, GzipFile won't read directly from another GzipFile instance
(ValueError: Seek from end not supported), so some sort of
intermediate is necessary.
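
The cStringIO module exists only in Python 2. A possible Python 3 counterpart of the same idea uses io.BytesIO as the in-memory intermediate. This is a sketch under that assumption; the function name read_double_gzipped is hypothetical.

```python
import gzip
import io

def read_double_gzipped(path):
    """Return the text bytes from a file that has been gzipped twice."""
    # First pass: the result is still gzip data, so keep it in memory.
    with gzip.open(path, "rb") as outer:
        inner_bytes = outer.read()
    # Second pass: wrap the bytes in a file-like object for GzipFile.
    with gzip.GzipFile(fileobj=io.BytesIO(inner_bytes)) as inner:
        return inner.read()
```

Calling read_double_gzipped('2012-January.txt.gz') would return the archive text as bytes, which can then be written out or decoded as needed.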

#18628

From	random joe <pywin32@gmail.com>
Date	2012-01-06 16:55 -0800
Message-ID	<0616bd58-bb3d-4e75-b142-fc3979b20aac@cf6g2000vbb.googlegroups.com>
In reply to	#18587

On Jan 6, 1:41 am, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> One could also avoid creating the intermediate file by using a
> StringIO to keep it in memory instead:

Yes, StringIO is perfect for this. Many thanks to all who replied.

#18569

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-01-05 17:02 -0700
Message-ID	<mailman.4462.1325808206.27778.python-list@python.org>
In reply to	#18567

On Thu, Jan 5, 2012 at 4:39 PM, Miki Tebeka <miki.tebeka@gmail.com> wrote:
> Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

My experience with the Google groups search (and Google groups in
general) in the past has been terrible.  If you're looking for a
specific thread, it can actually be quite hard to find.
