Help with python-list archives

Started by: random joe <pywin32@gmail.com>
First post: 2012-01-05 14:44 -0800
Last post: 2012-01-05 17:02 -0700
Articles: 17 (6 participants)

Contents

  Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 14:44 -0800
    Re: Help with python-list archives Miki Tebeka <miki.tebeka@gmail.com> - 2012-01-05 15:39 -0800
      Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 15:52 -0800
        Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-05 17:10 -0700
          Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 16:45 -0800
        Re: Help with python-list archives MRAB <python@mrabarnett.plus.com> - 2012-01-06 01:27 +0000
          Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 18:14 -0800
            Re: Help with python-list archives MRAB <python@mrabarnett.plus.com> - 2012-01-06 03:00 +0000
              Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 20:01 -0800
                Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-05 20:08 -0800
                  Re: Help with python-list archives Chris Angelico <rosuav@gmail.com> - 2012-01-06 15:12 +1100
                  Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-06 00:45 -0700
                    Re: Help with python-list archives Anssi Saari <as@sci.fi> - 2012-01-10 16:47 +0200
                Re: Help with python-list archives Chris Angelico <rosuav@gmail.com> - 2012-01-06 15:11 +1100
            Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-06 00:41 -0700
              Re: Help with python-list archives random joe <pywin32@gmail.com> - 2012-01-06 16:55 -0800
      Re: Help with python-list archives Ian Kelly <ian.g.kelly@gmail.com> - 2012-01-05 17:02 -0700

#18566 — Help with python-list archives

From: random joe <pywin32@gmail.com>
Date: 2012-01-05 14:44 -0800
Subject: Help with python-list archives
Message-ID: <78484055-dd01-4237-9217-9eb038fc744f@p16g2000yqd.googlegroups.com>

Hi. I am new to Python and wanted to search the python-list archives
for answers to my many questions, but I can't seem to get the archive
files to uncompress. What gives? From what I understand they are
gzip files, so I assumed the gzip module would work, but no! The best I
could do was to get a ton of Chinese characters using gzip and
zlib.decompress(). I would like to be courteous and search for my
answers before asking so as not to waste anyone's time. Does anyone
know how to uncompress these files into a readable text form?


#18567

From: Miki Tebeka <miki.tebeka@gmail.com>
Date: 2012-01-05 15:39 -0800
Message-ID: <14749754.624.1325806776674.JavaMail.geo-discussion-forums@vbgw2>
In reply to: #18566

Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

Also, can you give an example of the code and an input file?


#18568

From: random joe <pywin32@gmail.com>
Date: 2012-01-05 15:52 -0800
Message-ID: <8f3b98e1-3b21-4f06-8456-0a555a7ee523@u32g2000yqe.googlegroups.com>
In reply to: #18567

On Jan 5, 5:39 pm, Miki Tebeka <miki.teb...@gmail.com> wrote:
> Is the Google groups search not good enough?

That works, but I would like to run some regexes and set up some
defaults.

> Also, can you give an example of the code and an input file?

Sure. Take the most recent file as an example: "2012-January.txt.gz".
If you use the Python doc example, this is the result. Whether I use "r"
or "rb", the result is the same.

>>> import gzip
>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>> data = f1.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>> data = f2.read()
>>> data[:100]
'\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+\xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'

The docs and Google provide no clear answer. I even tried 7-Zip and
ended up with nothing but gibberish characters. There must be levels
of compression or something. Why could they not simply use the tar
format? Is there anywhere else one can download the archives?


#18570

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-01-05 17:10 -0700
Message-ID: <mailman.4463.1325808641.27778.python-list@python.org>
In reply to: #18568

On Thu, Jan 5, 2012 at 4:52 PM, random joe <pywin32@gmail.com> wrote:
> Sure. Take the most recent file as example. "2012 - January.txt.gz".
> If you use the python doc example this is the result. If i use "r" or
> "rb" the result is the same.
>
>>>> import gzip
>>>> f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>>> data = f1.read()
>>>> data[:100]
> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>>> f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>>> data = f2.read()
>>>> data[:100]
> '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
> list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
> \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>
> The docs and google provide no clear answer. I even tried 7zip and
> ended up with nothing but gibberish characters. There must be levels
> of compression or something. Why could they not simply use the tar
> format? Is there anywhere else one can download the archives?

Interesting.  I tried this on a Linux system using both gunzip and
your code, and both worked fine to extract that file.  I also tried
your code on a Windows system, and I get the same result that you do.
This appears to be a bug in the gzip module under Windows.

I think there may be something peculiar about the archive files that
the module is not handling correctly.  If I gunzip the file locally
and then gzip it again before trying to open it in Python, then
everything seems to be fine.


#18572

From: random joe <pywin32@gmail.com>
Date: 2012-01-05 16:45 -0800
Message-ID: <0cbcc2ce-5dff-4cd9-89f5-833956dc8e37@q7g2000yqn.googlegroups.com>
In reply to: #18570

On Jan 5, 6:10 pm, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> Interesting.  I tried this on a Linux system using both gunzip and
> your code, and both worked fine to extract that file.  I also tried
> your code on a Windows system, and I get the same result that you do.
> This appears to be a bug in the gzip module under Windows.
>
> I think there may be something peculiar about the archive files that
> the module is not handling correctly.  If I gunzip the file locally
> and then gzip it again before trying to open it in Python, then
> everything seems to be fine.

That is interesting. I wonder if anyone else has had the same issue?

Just to be thorough, I tried to uncompress using both Python 2.x and
3.x, and the results are unreadable text files in both cases. I have no
idea what the problem could be, especially without some way to compare
my files to the gunzip'ed files on a Linux machine.


#18573

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-01-06 01:27 +0000
Message-ID: <mailman.4464.1325813187.27778.python-list@python.org>
In reply to: #18568

On 06/01/2012 00:10, Ian Kelly wrote:
> On Thu, Jan 5, 2012 at 4:52 PM, random joe<pywin32@gmail.com>  wrote:
>>  Sure. Take the most recent file as example. "2012 - January.txt.gz".
>>  If you use the python doc example this is the result. If i use "r" or
>>  "rb" the result is the same.
>>
>>>>>  import gzip
>>>>>  f1 = gzip.open('C:\\2012-January.txt.gz', 'rb')
>>>>>  data = f1.read()
>>>>>  data[:100]
>>  '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
>>  list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
>>  \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>>>>  f2 = gzip.open('C:\\2012-January.txt.gz', 'r')
>>>>>  data = f2.read()
>>>>>  data[:100]
>>  '\x1f\x8b\x08\x08x\n\x05O\x02\xff/srv/mailman/archives/private/python-
>>  list/2012-January.txt\x00\xec\xbdy\x7f\xdb\xc6\xb50\xfcw\xf0)\xa6z|+
>>  \xaa!!l\xdc\x14[\x8b-;V\xe2-\x92\x12'
>>
>>  The docs and google provide no clear answer. I even tried 7zip and
>>  ended up with nothing but gibberish characters. There must be levels
>>  of compression or something. Why could they not simply use the tar
>>  format? Is there anywhere else one can download the archives?
>
> Interesting.  I tried this on a Linux system using both gunzip and
> your code, and both worked fine to extract that file.  I also tried
> your code on a Windows system, and I get the same result that you do.
> This appears to be a bug in the gzip module under Windows.
>
> I think there may be something peculiar about the archive files that
> the module is not handling correctly.  If I gunzip the file locally
> and then gzip it again before trying to open it in Python, then
> everything seems to be fine.

I've found that if I gunzip it twice (gunzip it and then gunzip the
result) using the gzip module I get the text file.


#18574

From: random joe <pywin32@gmail.com>
Date: 2012-01-05 18:14 -0800
Message-ID: <fa7b56e3-68b1-4b37-834d-48ab73177baf@k28g2000yqn.googlegroups.com>
In reply to: #18573

On Jan 5, 7:27 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:

> I've found that if I gunzip it twice (gunzip it and then gunzip the
> result) using the gzip module I get the text file.

On a Windows machine? If so, can you post a code snippet, please?
Thanks


#18577

From: MRAB <python@mrabarnett.plus.com>
Date: 2012-01-06 03:00 +0000
Message-ID: <mailman.4466.1325818829.27778.python-list@python.org>
In reply to: #18574

On 06/01/2012 02:14, random joe wrote:
> On Jan 5, 7:27 pm, MRAB<pyt...@mrabarnett.plus.com>  wrote:
>
>>  I've found that if I gunzip it twice (gunzip it and then gunzip the
>>  result) using the gzip module I get the text file.
>
> On a windows machine? If so, can you post a code snippet please?
> Thanks

import gzip

# First pass: strip the outer gzip layer into a temporary file.
in_file = gzip.open(r"C:\2012-January.txt.gz")
out_file = open(r"C:\2012-January.txt.tmp", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()

# Second pass: the temporary file is itself gzipped, so gunzip it again.
in_file = gzip.open(r"C:\2012-January.txt.tmp")
out_file = open(r"C:\2012-January.txt", "wb")
out_file.write(in_file.read())
in_file.close()
out_file.close()
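
The same two-pass approach can be sketched with context managers and chunked copying, so the files are closed even if an error occurs and the whole archive is never held in memory at once. This is only an illustrative variant, not the code posted above; the helper names gunzip_file and gunzip_twice are made up for the example, and it assumes the input really has two gzip layers.

```python
import gzip
import shutil


def gunzip_file(src, dst):
    """Strip one gzip layer from src and write the result to dst."""
    # gzip.open yields decompressed bytes; copyfileobj streams them in chunks.
    with gzip.open(src, "rb") as in_file, open(dst, "wb") as out_file:
        shutil.copyfileobj(in_file, out_file)


def gunzip_twice(src, tmp, dst):
    """Hypothetical helper: undo two gzip layers via an intermediate file."""
    gunzip_file(src, tmp)   # outer layer -> still-gzipped temporary file
    gunzip_file(tmp, dst)   # inner layer -> plain text
```

With the paths from the post, the call would be gunzip_twice(r"C:\2012-January.txt.gz", r"C:\2012-January.txt.tmp", r"C:\2012-January.txt").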


#18579

From: random joe <pywin32@gmail.com>
Date: 2012-01-05 20:01 -0800
Message-ID: <cfdb189b-a480-42c5-beac-2b9855c439ae@t13g2000yqg.googlegroups.com>
In reply to: #18577

On Jan 5, 9:00 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 06/01/2012 02:14, random joe wrote:
>
> > On Jan 5, 7:27 pm, MRAB<pyt...@mrabarnett.plus.com>  wrote:
>
> >>  I've found that if I gunzip it twice (gunzip it and then gunzip the
> >>  result) using the gzip module I get the text file.
>
> > On a windows machine? If so, can you post a code snippet please?
> > Thanks
>
> import gzip
>
> in_file = gzip.open(r"C:\2012-January.txt.gz")
> out_file = open(r"C:\2012-January.txt.tmp", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
>
> in_file = gzip.open(r"C:\2012-January.txt.tmp")
> out_file = open(r"C:\2012-January.txt", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()

EXCELLENT! Thanks.

This works; however, there is one more tiny hiccup. The text has lost
all significant indentation and newlines. Was this intended, or is this
the result of another bug?


#18580

From: random joe <pywin32@gmail.com>
Date: 2012-01-05 20:08 -0800
Message-ID: <d89d0b04-3f89-4c65-a329-3ca2be88b0ad@d9g2000yqg.googlegroups.com>
In reply to: #18579

On Jan 5, 10:01 pm, random joe <pywi...@gmail.com> wrote:
> On Jan 5, 9:00 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:

> > import gzip
>
> > in_file = gzip.open(r"C:\2012-January.txt.gz")
> > out_file = open(r"C:\2012-January.txt.tmp", "wb")
> > out_file.write(in_file.read())
> > in_file.close()
> > out_file.close()
>
> > in_file = gzip.open(r"C:\2012-January.txt.tmp")
> > out_file = open(r"C:\2012-January.txt", "wb")
> > out_file.write(in_file.read())
> > in_file.close()
> > out_file.close()
>
> EXCELLENT! Thanks.
>
> This works; however, there is one more tiny hiccup. The text has lost
> all significant indentation and newlines. Was this intended, or is this
> the result of another bug?

Never mind. Notepad was the problem. After using a real editor, the text
is displayed correctly! Thanks for the help, everyone!

PS: I wonder why no one has added a note to the python-list archives
to advise people about the bug?


#18582

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-01-06 15:12 +1100
Message-ID: <mailman.4470.1325823127.27778.python-list@python.org>
In reply to: #18580

On Fri, Jan 6, 2012 at 3:08 PM, random joe <pywin32@gmail.com> wrote:
> Nevermind. Notepad was the problem. After using a real editor the text
> is displayed correctly! Thanks for help everyone!

... or that could be your problem :)

ChrisA


#18588

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-01-06 00:45 -0700
Message-ID: <mailman.4473.1325835957.27778.python-list@python.org>
In reply to: #18580

On Thu, Jan 5, 2012 at 9:08 PM, random joe <pywin32@gmail.com> wrote:
> PS: I wonder why no one has added a note to the Python-list archives
> to advise people about the bug?

Probably nobody has noticed it until now.  It seems to be a quirk of
the archive files that they are double-gzipped, and most people
probably just use gunzip or gzcat (or a higher-level tool that invokes
those) to extract them, which seem to be smart enough to handle it.


#18776

From: Anssi Saari <as@sci.fi>
Date: 2012-01-10 16:47 +0200
Message-ID: <vg3sjjnzayv.fsf@sci.fi>
In reply to: #18588

Ian Kelly <ian.g.kelly@gmail.com> writes:

> Probably nobody has noticed it until now.  It seems to be a quirk of
> the archive files that they are double-gzipped...

Interesting, but I don't think the files are actually double-gzipped. If
I download
http://mail.python.org/pipermail/python-list/2012-January.txt.gz with
wget in Cygwin or Unix, the file is 226753 bytes and singly gzipped.

However, if I download the same file with Firefox in Windows, then it's
226782 bytes and double gzipped. So maybe it's something in the browser
or server setup?
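
Because the same URL can apparently arrive singly or doubly gzipped depending on how it was downloaded, one option is to keep decompressing for as long as the data still starts with the gzip magic bytes (0x1f 0x8b, the same two bytes visible at the start of the output posted earlier in the thread). The following is a minimal sketch, assuming Python 3.2 or later for gzip.decompress; the names gunzip_all and max_layers are made up for illustration, and the check would be fooled by a payload that legitimately begins with those two bytes, which a plain-text mail archive should not.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def gunzip_all(data, max_layers=5):
    """Strip gzip layers until the data no longer looks gzipped.

    max_layers is an arbitrary safety cap chosen for this sketch.
    """
    for _ in range(max_layers):
        if not data.startswith(GZIP_MAGIC):
            break
        data = gzip.decompress(data)
    return data
```

Reading the downloaded file in binary mode and passing its bytes to gunzip_all should then give the text regardless of which download tool produced the file.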


#18581

From: Chris Angelico <rosuav@gmail.com>
Date: 2012-01-06 15:11 +1100
Message-ID: <mailman.4469.1325823094.27778.python-list@python.org>
In reply to: #18579

On Fri, Jan 6, 2012 at 3:01 PM, random joe <pywin32@gmail.com> wrote:
> This works; however, there is one more tiny hiccup. The text has lost
> all significant indentation and newlines. Was this intended, or is this
> the result of another bug?

I'm seeing it as plain text, with proper newlines. There's no
indentation as it just runs straight through, top-to-bottom; but you
should be able to see line breaks. Check your mail reader in case
something's getting botched there.

ChrisA


#18587

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-01-06 00:41 -0700
Message-ID: <mailman.4472.1325835693.27778.python-list@python.org>
In reply to: #18574

On Thu, Jan 5, 2012 at 8:00 PM, MRAB <python@mrabarnett.plus.com> wrote:
> import gzip
>
> in_file = gzip.open(r"C:\2012-January.txt.gz")
> out_file = open(r"C:\2012-January.txt.tmp", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()
>
> in_file = gzip.open(r"C:\2012-January.txt.tmp")
> out_file = open(r"C:\2012-January.txt", "wb")
> out_file.write(in_file.read())
> in_file.close()
> out_file.close()

One could also avoid creating the intermediate file by using a
StringIO to keep it in memory instead:

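import gzip
from cStringIO import StringIO  # Python 2 only; on Python 3 use io.BytesIO

# First pass: decompress the outer layer into an in-memory buffer.
in_file = gzip.open('2012-January.txt.gz')
tmp_file = StringIO(in_file.read())
in_file.close()
# Second pass: decompress the buffer and write out the plain text.
in_file = gzip.GzipFile(fileobj=tmp_file)
out_file = open('2012-January.txt', 'wb')
out_file.write(in_file.read())
in_file.close()
out_file.close()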

Sadly, GzipFile won't read directly from another GzipFile instance
(ValueError: Seek from end not supported), so some sort of
intermediate is necessary.
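
For readers on Python 3, where cStringIO no longer exists, the same in-memory idea can be sketched with io.BytesIO as the intermediate. This is an assumption-laden illustration rather than the code from the post: the function name read_double_gzipped is hypothetical, and it assumes the file has exactly two gzip layers.

```python
import gzip
import io


def read_double_gzipped(path):
    """Return the plain bytes of a file that was gzipped twice."""
    # First pass: the outer layer, read fully into memory.
    with gzip.open(path, "rb") as outer:
        inner_bytes = outer.read()
    # Second pass: wrap the still-gzipped bytes in a seekable buffer.
    with gzip.GzipFile(fileobj=io.BytesIO(inner_bytes), mode="rb") as inner:
        return inner.read()
```

The result is bytes, so it would still need to be decoded (or written out in binary mode, as in the original snippet) before searching it as text.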


#18628

From: random joe <pywin32@gmail.com>
Date: 2012-01-06 16:55 -0800
Message-ID: <0616bd58-bb3d-4e75-b142-fc3979b20aac@cf6g2000vbb.googlegroups.com>
In reply to: #18587

On Jan 6, 1:41 am, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> One could also avoid creating the intermediate file by using a
> StringIO to keep it in memory instead:

Yes, StringIO is perfect for this. Many thanks to all who replied.


#18569

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2012-01-05 17:02 -0700
Message-ID: <mailman.4462.1325808206.27778.python-list@python.org>
In reply to: #18567

On Thu, Jan 5, 2012 at 4:39 PM, Miki Tebeka <miki.tebeka@gmail.com> wrote:
> Is the Google Groups search not good enough? (groups.google.com/group/comp.lang.python)

My experience with the Google groups search (and Google groups in
general) in the past has been terrible.  If you're looking for a
specific thread, it can actually be quite hard to find.
