fseek In Compressed Files

Started by	Ayushi Dalmia <ayushidalmia2604@gmail.com>
First post	2014-01-30 02:50 -0800
Last post	2014-02-03 20:32 +0200
Articles	15 — 5 participants

  fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-30 02:50 -0800
    Re: fseek In Compressed Files Peter Otten <__peter__@web.de> - 2014-01-30 12:28 +0100
    Re:fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-01-30 06:55 -0500
    Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-30 05:34 -0800
      Re: fseek In Compressed Files Chris Angelico <rosuav@gmail.com> - 2014-01-31 00:49 +1100
      Re: fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-02-03 15:57 -0500
        Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 04:39 -0800
    Re: fseek In Compressed Files Serhiy Storchaka <storchaka@gmail.com> - 2014-01-30 17:02 +0200
    Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-30 07:37 -0800
      Re: fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-01-30 13:46 -0500
        Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-31 21:52 -0800
          Re: fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-02-01 04:38 -0500
    Re: fseek In Compressed Files Peter Otten <__peter__@web.de> - 2014-01-30 17:21 +0100
      Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-31 21:50 -0800
    Re: fseek In Compressed Files Serhiy Storchaka <storchaka@gmail.com> - 2014-02-03 20:32 +0200

#64979 — fseek In Compressed Files

From	Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date	2014-01-30 02:50 -0800
Subject	fseek In Compressed Files
Message-ID	<68316d1a-e52e-48b5-87df-7119f46ebabc@googlegroups.com>

Hello,

I need to randomly access a bzip2 or gzip file. How can I record the offset of a line and later retrieve that line from the file using the offset? Pointers in this direction will help.

#64980

From	Peter Otten <__peter__@web.de>
Date	2014-01-30 12:28 +0100
Message-ID	<mailman.6123.1391081306.18130.python-list@python.org>
In reply to	#64979

Ayushi Dalmia wrote:

> I need to randomly access a bzip2 or gzip file. How can I set the offset
> for a line and later retreive the line from the file using the offset.
> Pointers in this direction will help.

with gzip.open(filename) as f:
    f.seek(some_pos)
    print(f.readline())
    f.seek(some_pos)
    print(f.readline())

seems to work as expected. Can you tell us a bit more about your use case (if it
isn't covered by that basic example)?

#64985

From	Dave Angel <davea@davea.name>
Date	2014-01-30 06:55 -0500
Message-ID	<mailman.6125.1391082792.18130.python-list@python.org>
In reply to	#64979

 Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> Hello,
> 
> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.
> 

Start with the zlib module. Note that it doesn't handle all possible
compression types, like compress and pack.

I don't imagine that seeking to a line in a compressed text file would
be any easier than a non compressed one. Try using gzip.open in a text
mode to get a handle, then loop through it line by line. If you save
all the offsets in a list, you should subsequently be able to seek to
a remembered offset. But realize it'll be horribly slow, compared to a
non compressed one.

Consider using readlines and referencing the lines from there. Or
building a temp file if too big for ram.

If this is not enough, tell us your Python version and your os, and
show what you've tried and what went wrong.

-- 
DaveA
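
A minimal sketch of the "remember the offsets, then seek" idea from this reply, assuming Python 3 and binary mode (the thread itself uses Python 2.7). The helper names build_line_offsets and read_line_at and the demo file are hypothetical. The offsets are positions in the uncompressed stream, as reported by tell(), not positions in the .gz file on disk.

```python
import gzip
import os
import tempfile


def build_line_offsets(path):
    """Return the uncompressed byte offset at which each line starts."""
    offsets = []
    with gzip.open(path, "rb") as f:
        while True:
            pos = f.tell()          # position in the uncompressed stream
            line = f.readline()
            if not line:            # empty bytes means end of file
                break
            offsets.append(pos)
    return offsets


def read_line_at(path, offset):
    """Seek to a remembered offset and return the line that starts there."""
    with gzip.open(path, "rb") as f:
        f.seek(offset)
        return f.readline()


# Tiny demonstration file (made-up data, not from the thread).
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(b"1 alpha\n2 beta\n3 gamma\n")

demo_offsets = build_line_offsets(demo_path)
third_line = read_line_at(demo_path, demo_offsets[2])
```

Each seek still has to decompress everything before the target offset, so this is simple rather than fast, which matches the warning above about speed.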

#64996

From	Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date	2014-01-30 05:34 -0800
Message-ID	<78213f6b-3311-4487-a611-ecd3de33a168@googlegroups.com>
In reply to	#64979

On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> Hello,
> 
> 
> 
> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.

This is what I have done:

import bz2
import sys
from random import randint

index={}

data=[]
f=open('temp.txt','r')
for line in f:
    data.append(line)

filename='temp1.txt.bz2'
with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
    f.writelines(data)

prevsize=0
list1=[]
offset={}
with bz2.BZ2File(filename, 'rb') as f:
    for line in f:
        words=line.strip().split(' ')
        list1.append(words[0])
        offset[words[0]]= prevsize
        prevsize = sys.getsizeof(line)+prevsize


data=[]
count=0

with bz2.BZ2File(filename, 'rb') as f:
    while count<20:
        y=randint(1,25)
        print y
        print offset[str(y)]
        count+=1
        f.seek(int(offset[str(y)]))
        x= f.readline()
        data.append(x)

f=open('b.txt','w')
f.write(''.join(data))
f.close()

Here temp.txt is the posting-list file, which is first written in a compressed format and then read later. I am trying to build the index for the entire Wikipedia dump, which needs to be done in a space- and time-optimised way. temp.txt looks like this:

1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
5 90 t0b1c0i0e0
6 727 t0b2c0i0e0
7 431 t0b2c0i1e0
8 532 t0b1c0i0e0:652 t0b1c0i0e0:727 t0b2c0i0e0
9 378 t0b1c0i0e0
10 666 t0b2c0i0e0
11 405 t0b1c0i0e0
12 702 t0b1c0i0e0
13 755 t0b1c0i0e0
14 781 t0b1c0i0e0
15 593 t0b1c0i0e0
16 725 t0b1c0i0e0
17 989 t0b2c0i1e0
18 221 t0b1c0i0e0:402 t0b1c0i0e0:842 t0b1c0i0e0
19 405 t0b1c0i0e0
20 200 t0b1c0i0e0:300 t0b1c0i0e0:398 t0b1c0i0e0:649 t0b1c0i0e0
21 66 t0b1c0i0e0
22 30 t0b1c0i0e0
23 126 t0b1c0i0e0:895 t0b1c0i0e0
24 355 t0b1c0i0e0:374 t0b1c0i0e0:378 t0b1c0i0e0:431 t0b3c0i0e0:482 t0b1c0i0e0:546 t0b3c0i0e0:578 t0b1c0i0e0
25 198 t0b1c0i0e0

#65001

From	Chris Angelico <rosuav@gmail.com>
Date	2014-01-31 00:49 +1100
Message-ID	<mailman.6137.1391089752.18130.python-list@python.org>
In reply to	#64996

On Fri, Jan 31, 2014 at 12:34 AM, Ayushi Dalmia
<ayushidalmia2604@gmail.com> wrote:
> where temp.txt is the posting list file which is first written in a compressed format and then read  later.

Unless you specify otherwise, a compressed file is likely to have
sub-byte boundaries. It might not be possible to seek to a specific
line.

What you could do, though, is explicitly compress each line, then
write out separately-compressed blocks. You can then seek to any one
that you want, read it, and decompress it. But at this point, you're
probably going to do better with a database; PostgreSQL, for instance,
will automatically compress any content that it believes it's
worthwhile to compress (as long as it's in a VARCHAR field or similar
and the table hasn't been configured to prevent that, yada yada). All
you have to do is tell Postgres to store this, retrieve that, and
it'll worry about the details of compression and decompression. As an
added benefit, you can divide the text up and let it do the hard work
of indexing, filtering, sorting, etc. I suspect you'll find that
deploying a database is a much more efficient use of your development
time than recreating all of that.

ChrisA
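
The "compress each line separately" suggestion can be sketched as follows, assuming Python 3 and zlib; the function names, the file name postings.bin and the in-memory (offset, length) index are assumptions made for illustration. Here the offsets are real byte positions in the file on disk, so seek() is cheap.

```python
import os
import tempfile
import zlib


def write_records(path, lines):
    """Compress each line on its own; return a list of (offset, length) pairs."""
    index = []
    with open(path, "wb") as out:
        for line in lines:
            blob = zlib.compress(line)
            index.append((out.tell(), len(blob)))   # where the record sits on disk
            out.write(blob)
    return index


def read_record(path, index, n):
    """Read and decompress only record n."""
    offset, length = index[n]
    with open(path, "rb") as f:
        f.seek(offset)
        return zlib.decompress(f.read(length))


records_path = os.path.join(tempfile.mkdtemp(), "postings.bin")
sample = [b"5 90 t0b1c0i0e0\n", b"6 727 t0b2c0i0e0\n", b"7 431 t0b2c0i1e0\n"]
record_index = write_records(records_path, sample)
second = read_record(records_path, record_index, 1)
```

Very short lines compress poorly on their own, so grouping several lines into each record is likely to give a better size trade-off than one record per line.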

#65376

From	Dave Angel <davea@davea.name>
Date	2014-02-03 15:57 -0500
Message-ID	<mailman.6363.1391460874.18130.python-list@python.org>
In reply to	#64996

 Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
>> Hello,
>> 
>> 
>> 
>> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.
> 
> This is what I have done:
> 
> import bz2
> import sys
> from random import randint
> 
> index={}
> 
> data=[]
> f=open('temp.txt','r')
> for line in f:
>     data.append(line)
> 
> filename='temp1.txt.bz2'
> with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
>     f.writelines(data)
> 
> prevsize=0
> list1=[]
> offset={}
> with bz2.BZ2File(filename, 'rb') as f:
>     for line in f:
>         words=line.strip().split(' ')
>         list1.append(words[0])
>         offset[words[0]]= prevsize
>         prevsize = sys.getsizeof(line)+prevsize

sys.getsizeof reports the in-memory size of a Python object and is
totally unrelated to the size of a text line on disk. len() might
come closer, unless you're on Windows. You really should be using
tell() to define the offsets for a later seek(). In text mode any
other calculation is not legal, i.e. undefined.
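
To illustrate the difference, here is a small sketch assuming CPython 3 and lines read in binary mode; the sample line and the helper offsets_from_lengths are hypothetical. In binary mode len() of the raw bytes is the number of bytes the line occupies in the uncompressed stream, whereas sys.getsizeof() also counts the object's own overhead.

```python
import sys

line = b"5 90 t0b1c0i0e0\n"

object_size = sys.getsizeof(line)   # memory used by the bytes object, overhead included
byte_length = len(line)             # bytes the line occupies in the uncompressed stream


def offsets_from_lengths(lines):
    """Start offset of each line, accumulated from len() of the raw bytes."""
    offsets, position = [], 0
    for raw in lines:
        offsets.append(position)
        position += len(raw)
    return offsets


lengths_demo = offsets_from_lengths([b"1 a\n", b"22 bb\n", b"3 c\n"])
```

Accumulating sys.getsizeof() values therefore makes every offset after the first too large, which is why the later seek() calls land in the wrong place.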

> 
> 
> data=[]
> count=0
> 
> with bz2.BZ2File(filename, 'rb') as f:
>     while count<20:
>         y=randint(1,25)
>         print y
>         print offset[str(y)]
>         count+=1
>         f.seek(int(offset[str(y)]))
>         x= f.readline()
>         data.append(x)
> 
> f=open('b.txt','w')
> f.write(''.join(data))
> f.close()
> 
> where temp.txt is the posting list file which is first written in a compressed format and then read  later. 

I thought you were starting with a compressed file. If you're being
given an uncompressed file, just deal with it directly.

> I am trying to build the index for the entire wikipedia dump which needs to be done in a space and time optimised way. The temp.txt is as follows:
> 
> 1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
> 2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
> 3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
> 4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
> 5 90 t0b1c0i0e0

So every line begins with its line number in ASCII form? If true,
the dict above called offset should just be a list.

Maybe you should just quote the entire assignment. You're probably
adding way too much complication to it.

-- 
DaveA

#65417

From	Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date	2014-02-04 04:39 -0800
Message-ID	<a4aa581b-01d7-4095-9649-e5973d031b2d@googlegroups.com>
In reply to	#65376

On Tuesday, February 4, 2014 2:27:38 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> > On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> >> Hello,
> >>
> >> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.
> >
> > This is what I have done:
> >
> > import bz2
> > import sys
> > from random import randint
> >
> > index={}
> >
> > data=[]
> > f=open('temp.txt','r')
> > for line in f:
> >     data.append(line)
> >
> > filename='temp1.txt.bz2'
> > with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
> >     f.writelines(data)
> >
> > prevsize=0
> > list1=[]
> > offset={}
> > with bz2.BZ2File(filename, 'rb') as f:
> >     for line in f:
> >         words=line.strip().split(' ')
> >         list1.append(words[0])
> >         offset[words[0]]= prevsize
> >         prevsize = sys.getsizeof(line)+prevsize
>
> sys.getsizeof looks at internal size of a python object, and is
>  totally unrelated to a size on disk of a text line. len () might
>  come closer, unless you're on Windows. You really should be using
>  tell to define the offsets for later seek. In text mode any other
>  calculation is not legal,  ie undefined.
>
> > data=[]
> > count=0
> >
> > with bz2.BZ2File(filename, 'rb') as f:
> >     while count<20:
> >         y=randint(1,25)
> >         print y
> >         print offset[str(y)]
> >         count+=1
> >         f.seek(int(offset[str(y)]))
> >         x= f.readline()
> >         data.append(x)
> >
> > f=open('b.txt','w')
> > f.write(''.join(data))
> > f.close()
> >
> > where temp.txt is the posting list file which is first written in a compressed format and then read  later.
>
> I thought you were starting with a compressed file.  If you're
>  being given an uncompressed file, just deal with it directly.
>
> >I am trying to build the index for the entire wikipedia dump which needs to be done in a space and time optimised way. The temp.txt is as follows:
> >
> > 1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
> > 2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
> > 3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
> > 4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
> > 5 90 t0b1c0i0e0
>
> So every line begins with its line number in ascii form?  If true,
>  the dict above called offsets should just be a list.
>
> Maybe you should just quote the entire assignment.  You're
>  probably adding way too much complication to it.
>
> --
> DaveA

Hey! I am new here. Sorry about the incorrect posts; I didn't understand the protocol then.

Although I have the uncompressed text, I cannot start right away with it

#65017

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2014-01-30 17:02 +0200
Message-ID	<mailman.6146.1391094182.18130.python-list@python.org>
In reply to	#64979

30.01.14 13:28, Peter Otten wrote:
> Ayushi Dalmia wrote:
>
>> I need to randomly access a bzip2 or gzip file. How can I set the offset
>> for a line and later retreive the line from the file using the offset.
>> Pointers in this direction will help.
>
> with gzip.open(filename) as f:
>      f.seek(some_pos)
>      print(f.readline())
>      f.seek(some_pos)
>      print(f.readline())
>
> seems to work as expected. Can you tell a bit more about your usecase (if it
> isn't covered by that basic example)?

I don't recommend seeking backward in a compressed file. This is a very
inefficient operation.
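
One way to avoid backward seeks when several lines are wanted at once is to visit the offsets in ascending order. The sketch below assumes Python 3, a gzip file, and offsets that are line starts; the helper read_lines_forward and the demo data are hypothetical.

```python
import gzip
import os
import tempfile


def read_lines_forward(path, offsets):
    """Fetch the lines at the given line-start offsets using forward seeks only."""
    result = {}
    with gzip.open(path, "rb") as f:
        for offset in sorted(set(offsets)):
            f.seek(offset)            # offsets ascend, so this never moves backward
            result[offset] = f.readline()
    return [result[offset] for offset in offsets]   # restore the caller's order


forward_path = os.path.join(tempfile.mkdtemp(), "forward.txt.gz")
with gzip.open(forward_path, "wb") as f:
    f.write(b"1 alpha\n2 beta\n3 gamma\n")

fetched = read_lines_forward(forward_path, [15, 0, 8, 0])
```

This only helps for batched lookups. A single lookup still has to decompress from the start of the stream up to the requested offset.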

#65025

From	Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date	2014-01-30 07:37 -0800
Message-ID	<aa3a6a6d-0c5e-4e1a-95e6-96a5ba257b3e@googlegroups.com>
In reply to	#64979

On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> Hello,
> 
> 
> 
> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.

We are not allowed to use databases! I need to do this without a database.

#65034

From	Dave Angel <davea@davea.name>
Date	2014-01-30 13:46 -0500
Message-ID	<mailman.6156.1391107470.18130.python-list@python.org>
In reply to	#65025

 Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
>> Hello,
>> 
>> 
>> 
>> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.
> 
> We are not allowed to use databases! I need to do this without database.
> 

Why do you reply to your own message? Makes it hard for people to
make sense of your post.

Have you any answers to earlier questions? How big is this file, what
python version, do you care about performance, code you've tried, ...

-- 
DaveA

#65200

From	Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date	2014-01-31 21:52 -0800
Message-ID	<897c196f-0812-43cf-9829-07993263207e@googlegroups.com>
In reply to	#65034

On Friday, January 31, 2014 12:16:59 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> > On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> >> Hello,
> >>
> >> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retreive the line from the file using the offset. Pointers in this direction will help.
> >
> > We are not allowed to use databases! I need to do this without database.
>
> Why do you reply to your own message?  Makes it hard for people to
>  make sense of your post.
>
> Have you any answers to earlier questions? How big is this file,
>  what python version,  do you care about performance, code you've
>  tried,  ...
>
> --
> DaveA

The size of this file will be 10 GB. The version of Python I am using is 2.7.2. Yes, performance is an important issue.

#65208

From	Dave Angel <davea@davea.name>
Date	2014-02-01 04:38 -0500
Message-ID	<mailman.6274.1391247311.18130.python-list@python.org>
In reply to	#65200

 Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
>
> 
> The size of this file will be 10 GB. The version of Python I am using is 2.7.2. Yes, performance is an important issue. 
> 

Then the only viable option is to extract the entire file and write it
to a temp location. Perhaps as you extract it, you could also build a
list of offsets, so the seeking by line number can be efficient.
-- 
DaveA
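
The "extract once and build the offsets on the way" suggestion could look roughly like this, assuming Python 3; the function names and demo paths are hypothetical. The offsets refer to the uncompressed temp file, so each later seek() is an ordinary file seek.

```python
import bz2
import os
import tempfile


def extract_with_offsets(compressed_path, plain_path):
    """Decompress once, writing plain bytes and recording where each line starts."""
    offsets = []
    with bz2.BZ2File(compressed_path, "rb") as src, open(plain_path, "wb") as dst:
        for line in src:
            offsets.append(dst.tell())   # start of this line in the plain file
            dst.write(line)
    return offsets


def get_line(plain_file, offsets, line_number):
    """Return line `line_number` (0-based) from an already-open plain file."""
    plain_file.seek(offsets[line_number])
    return plain_file.readline()


work_dir = tempfile.mkdtemp()
bz2_path = os.path.join(work_dir, "temp1.txt.bz2")
plain_path = os.path.join(work_dir, "temp1.txt")
with bz2.BZ2File(bz2_path, "wb") as f:
    f.write(b"1 456 t0b3c0i0e0\n2 221 t0b1c0i0e0\n3 455 t0b7c0i0e0\n")

line_starts = extract_with_offsets(bz2_path, plain_path)
with open(plain_path, "rb") as plain:
    last_line = get_line(plain, line_starts, 2)
    first_line = get_line(plain, line_starts, 0)
```

For a file with a very large number of lines, a plain Python list of offsets may itself be sizeable; a more compact container such as array.array would be one option.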

#65028

From	Peter Otten <__peter__@web.de>
Date	2014-01-30 17:21 +0100
Message-ID	<mailman.6152.1391098879.18130.python-list@python.org>
In reply to	#64979

Serhiy Storchaka wrote:

> 30.01.14 13:28, Peter Otten wrote:
>> Ayushi Dalmia wrote:
>>
>>> I need to randomly access a bzip2 or gzip file. How can I set the offset
>>> for a line and later retreive the line from the file using the offset.
>>> Pointers in this direction will help.
>>
>> with gzip.open(filename) as f:
>>      f.seek(some_pos)
>>      print(f.readline())
>>      f.seek(some_pos)
>>      print(f.readline())
>>
>> seems to work as expected. Can you tell a bit more about your usecase (if
>> it isn't covered by that basic example)?
> 
> I don't recommend to seek backward in compressed file. This is very
> inefficient operation.

Do you know an efficient way to implement random access for a bzip2 or gzip 
file?

#65198

From	Ayushi Dalmia <ayushidalmia2604@gmail.com>
Date	2014-01-31 21:50 -0800
Message-ID	<06185e63-7f49-48b6-a86e-bfa96ed84248@googlegroups.com>
In reply to	#65028

On Thursday, January 30, 2014 9:51:28 PM UTC+5:30, Peter Otten wrote:
> Serhiy Storchaka wrote:
>
> > 30.01.14 13:28, Peter Otten wrote:
> >> Ayushi Dalmia wrote:
> >>
> >>> I need to randomly access a bzip2 or gzip file. How can I set the offset
> >>> for a line and later retreive the line from the file using the offset.
> >>> Pointers in this direction will help.
> >>
> >> with gzip.open(filename) as f:
> >>      f.seek(some_pos)
> >>      print(f.readline())
> >>      f.seek(some_pos)
> >>      print(f.readline())
> >>
> >> seems to work as expected. Can you tell a bit more about your usecase (if
> >> it isn't covered by that basic example)?
> >
> > I don't recommend to seek backward in compressed file. This is very
> > inefficient operation.
>
> Do you know an efficient way to implement random access for a bzip2 or gzip
> file?

Nothing that I know of.

#65358

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2014-02-03 20:32 +0200
Message-ID	<mailman.6352.1391452395.18130.python-list@python.org>
In reply to	#64979

30.01.14 18:21, Peter Otten wrote:
> Do you know an efficient way to implement random access for a bzip2 or gzip
> file?

See dictzip and BGZF. Unfortunately, the Python stdlib doesn't support them.
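
The general idea behind such formats, independently compressed blocks that can each be decompressed on their own, can be approximated with the stdlib alone. The sketch below is not BGZF or dictzip: it assumes Python 3, writes groups of lines as concatenated gzip members, and keeps the member offsets in a separate in-memory list. The names write_blocks, read_block and lines_per_block are hypothetical.

```python
import gzip
import os
import tempfile
import zlib


def write_blocks(path, lines, lines_per_block=2):
    """Write groups of lines as independent gzip members; return each member's file offset."""
    block_offsets = []
    with open(path, "wb") as out:
        for start in range(0, len(lines), lines_per_block):
            block_offsets.append(out.tell())
            out.write(gzip.compress(b"".join(lines[start:start + lines_per_block])))
    return block_offsets


def read_block(path, block_offset):
    """Decompress exactly one gzip member starting at block_offset."""
    decoder = zlib.decompressobj(zlib.MAX_WBITS | 16)   # the extra 16 selects gzip framing
    chunks = []
    with open(path, "rb") as f:
        f.seek(block_offset)
        while not decoder.eof:          # eof becomes True at the end of the member
            data = f.read(4096)
            if not data:
                break
            chunks.append(decoder.decompress(data))
    return b"".join(chunks)


blocks_path = os.path.join(tempfile.mkdtemp(), "blocks.gz")
block_lines = [b"1 a\n", b"2 b\n", b"3 c\n", b"4 d\n", b"5 e\n"]
member_offsets = write_blocks(blocks_path, block_lines)
second_block = read_block(blocks_path, member_offsets[1])
with gzip.open(blocks_path, "rb") as f:
    whole_file = f.read()               # still readable as one ordinary .gz file
```

Finding a single line then means decompressing only the block that contains it. Smaller blocks give faster lookups at the cost of a worse compression ratio.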
