| Newsgroups | comp.lang.python |
|---|---|
| Date | 2014-02-04 04:39 -0800 |
| References | <68316d1a-e52e-48b5-87df-7119f46ebabc@googlegroups.com> <78213f6b-3311-4487-a611-ecd3de33a168@googlegroups.com> <mailman.6363.1391460874.18130.python-list@python.org> |
| Message-ID | <a4aa581b-01d7-4095-9649-e5973d031b2d@googlegroups.com> (permalink) |
| Subject | Re: fseek In Compressed Files |
| From | Ayushi Dalmia <ayushidalmia2604@gmail.com> |
On Tuesday, February 4, 2014 2:27:38 AM UTC+5:30, Dave Angel wrote:
> Ayushi Dalmia <ayushidalmia2604@gmail.com> Wrote in message:
> > On Thursday, January 30, 2014 4:20:26 PM UTC+5:30, Ayushi Dalmia wrote:
> >> Hello,
> >>
> >> I need to randomly access a bzip2 or gzip file. How can I set the offset for a line and later retrieve the line from the file using the offset? Pointers in this direction will help.
> >
> > This is what I have done:
> >
> > import bz2
> > import sys
> > from random import randint
> >
> > index={}
> >
> > data=[]
> > f=open('temp.txt','r')
> > for line in f:
> >     data.append(line)
> >
> > filename='temp1.txt.bz2'
> > with bz2.BZ2File(filename, 'wb', compresslevel=9) as f:
> >     f.writelines(data)
> >
> > prevsize=0
> > list1=[]
> > offset={}
> > with bz2.BZ2File(filename, 'rb') as f:
> >     for line in f:
> >         words=line.strip().split(' ')
> >         list1.append(words[0])
> >         offset[words[0]]= prevsize
> >         prevsize = sys.getsizeof(line)+prevsize
>
> sys.getsizeof looks at the internal size of a Python object, and is
> totally unrelated to the size on disk of a text line. len() might
> come closer, unless you're on Windows. You really should be using
> tell to define the offsets for later seek. In text mode any other
> calculation is not legal, i.e. undefined.
>
> > data=[]
> > count=0
> >
> > with bz2.BZ2File(filename, 'rb') as f:
> >     while count<20:
> >         y=randint(1,25)
> >         print y
> >         print offset[str(y)]
> >         count+=1
> >         f.seek(int(offset[str(y)]))
> >         x= f.readline()
> >         data.append(x)
> >
> > f=open('b.txt','w')
> > f.write(''.join(data))
> > f.close()
> >
> > where temp.txt is the posting list file which is first written in a compressed format and then read later.
>
> I thought you were starting with a compressed file. If you're
> being given an uncompressed file, just deal with it directly.
>
> > I am trying to build the index for the entire Wikipedia dump, which needs to be done in a space- and time-optimised way. The temp.txt is as follows:
> >
> > 1 456 t0b3c0i0e0:784 t0b2c0i0e0:801 t0b2c0i0e0
> > 2 221 t0b1c0i0e0:774 t0b1c0i0e0:801 t0b2c0i0e0
> > 3 455 t0b7c0i0e0:456 t0b1c0i0e0:459 t0b2c0i0e0:669 t0b10c11i3e0:673 t0b1c0i0e0:678 t0b2c0i1e0:854 t0b1c0i0e0
> > 4 410 t0b4c0i0e0:553 t0b1c0i0e0:609 t0b1c0i0e0
> > 5 90 t0b1c0i0e0
>
> So every line begins with its line number in ASCII form? If true,
> the dict above called offset should just be a list.
>
> Maybe you should just quote the entire assignment. You're
> probably adding way too much complication to it.
>
> --
> DaveA
Hey! I am new here. Sorry about the incorrect posts; I didn't understand the protocol then.
Although I have the uncompressed text, I cannot start working with it right away
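
For reference, here is a minimal sketch of the tell-based approach suggested in the quoted reply: record f.tell() before each readline() and pass that same value to f.seek() later. It assumes Python 3 (the code quoted above is Python 2) and a file laid out like temp.txt, with the key as the first whitespace-separated field of each line. The names build_offsets and read_line_at are made up for illustration and do not come from the thread. The offsets are positions in the uncompressed stream, not in the .bz2 file on disk.

```python
import bz2
import os
import tempfile


def build_offsets(path):
    """Map the first field of each line to the offset where that line starts.

    Offsets come from tell(), so they refer to the uncompressed stream.
    (Hypothetical helper, not part of the original post.)
    """
    offsets = {}
    with bz2.open(path, "rb") as f:
        while True:
            pos = f.tell()        # position before reading the line
            line = f.readline()
            if not line:          # empty bytes means end of file
                break
            parts = line.split(None, 1)
            if parts:             # skip blank lines
                offsets[parts[0].decode("ascii")] = pos
    return offsets


def read_line_at(path, offset):
    """Return the line starting at a given uncompressed offset.

    (Hypothetical helper, not part of the original post.)
    """
    with bz2.open(path, "rb") as f:
        f.seek(offset)
        return f.readline()


if __name__ == "__main__":
    # Small stand-in for the posting list file described in the thread.
    sample = [
        b"1 456 t0b3c0i0e0:784 t0b2c0i0e0\n",
        b"2 221 t0b1c0i0e0:774 t0b1c0i0e0\n",
        b"3 455 t0b7c0i0e0\n",
    ]
    path = os.path.join(tempfile.mkdtemp(), "temp1.txt.bz2")
    with bz2.open(path, "wb") as f:
        f.writelines(sample)

    offsets = build_offsets(path)
    print(offsets)
    print(read_line_at(path, offsets["2"]))
```

Note that the bz2 module emulates seek() by decompressing from the start of the stream (or rewinding and decompressing again), so each lookup can be slow on a large file. Whether that is acceptable for a full Wikipedia index is a separate question from getting the offsets right.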
fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-30 02:50 -0800
Re: fseek In Compressed Files Peter Otten <__peter__@web.de> - 2014-01-30 12:28 +0100
Re:fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-01-30 06:55 -0500
Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-30 05:34 -0800
Re: fseek In Compressed Files Chris Angelico <rosuav@gmail.com> - 2014-01-31 00:49 +1100
Re: fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-02-03 15:57 -0500
Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-02-04 04:39 -0800
Re: fseek In Compressed Files Serhiy Storchaka <storchaka@gmail.com> - 2014-01-30 17:02 +0200
Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-30 07:37 -0800
Re: fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-01-30 13:46 -0500
Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-31 21:52 -0800
Re: fseek In Compressed Files Dave Angel <davea@davea.name> - 2014-02-01 04:38 -0500
Re: fseek In Compressed Files Peter Otten <__peter__@web.de> - 2014-01-30 17:21 +0100
Re: fseek In Compressed Files Ayushi Dalmia <ayushidalmia2604@gmail.com> - 2014-01-31 21:50 -0800
Re: fseek In Compressed Files Serhiy Storchaka <storchaka@gmail.com> - 2014-02-03 20:32 +0200