| Started by | "Jadhav, Alok" <alok.jadhav@credit-suisse.com> |
|---|---|
| First post | 2012-09-17 10:28 +0800 |
| Last post | 2012-11-15 12:20 +0100 |
| Articles | 10 — 8 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
RE: Python garbage collector/memory manager behaving strangely "Jadhav, Alok" <alok.jadhav@credit-suisse.com> - 2012-09-17 10:28 +0800
Re: Python garbage collector/memory manager behaving strangely alex23 <wuwei23@gmail.com> - 2012-09-16 20:25 -0700
Re: Python garbage collector/memory manager behaving strangely 88888 Dihedral <dihedral88888@googlemail.com> - 2012-09-16 21:39 -0700
Re: Python garbage collector/memory manager behaving strangely Dave Angel <d@davea.name> - 2012-09-17 06:46 -0400
Re: Python garbage collector/memory manager behaving strangely Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-17 11:47 +0000
Re: Python garbage collector/memory manager behaving strangely Dave Angel <d@davea.name> - 2012-09-17 08:03 -0400
Re: Python garbage collector/memory manager behaving strangely aahz@pythoncraft.com (Aahz) - 2012-11-14 06:19 -0800
Re: Python garbage collector/memory manager behaving strangely Dieter Maurer <dieter@handshake.de> - 2012-11-15 08:31 +0100
RE: Python garbage collector/memory manager behaving strangely "Jadhav, Alok" <alok.jadhav@credit-suisse.com> - 2012-09-17 19:00 +0800
Re: Python garbage collector/memory manager behaving strangely Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-11-15 12:20 +0100
| From | "Jadhav, Alok" <alok.jadhav@credit-suisse.com> |
|---|---|
| Date | 2012-09-17 10:28 +0800 |
| Subject | RE: Python garbage collector/memory manager behaving strangely |
| Message-ID | <mailman.818.1347849124.27098.python-list@python.org> |
Thanks, Dave, for the clean explanation. I clearly understand what is going on
now. I still need some suggestions from you on this.
There are 2 reasons why I was using self.rawfile.read().split('|\n')
instead of self.rawfile.readlines():
- As you have seen, the line separator is not '\n' but '|\n'.
Sometimes the data itself has '\n' characters in the middle of the line,
and the only way to find the true end of the line is that the previous
character should be a bar '|'. I was not able to specify the end of line
using the readlines() function, but I could do it using the split() function.
(One hack would be to readlines and combine them until I find '|\n'. Is
there a cleaner way to do this?)
- Reading the whole file at once and processing line by line was much
faster. Though speed is not a very important issue here, I think the
time it took to parse the complete file was reduced to one third of the
original time.
Regards,
Alok
-----Original Message-----
From: Dave Angel [mailto:d@davea.name]
Sent: Monday, September 17, 2012 10:13 AM
To: Jadhav, Alok
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely
On 09/16/2012 09:07 PM, Jadhav, Alok wrote:
> Hi Everyone,
>
> I have a simple program which reads a large file containing a few million
> rows, parses each row (`numpy array`) and converts it into an array of
> doubles (`python array`) and later writes it into an `hdf5 file`. I repeat
> this loop for multiple days. After reading each file, I delete all the
> objects and call the garbage collector. When I run the program, the first
> day is parsed without any error, but on the second day I get `MemoryError`.
> I monitored the memory usage of my program: during the first day of parsing,
> memory usage is around **1.5 GB**. When the first day's parsing is
> finished, memory usage goes down to **50 MB**. Now when the 2nd day starts
> and I try to read the lines from the file, I get `MemoryError`. Following
> is the output of the program.
>
> source file extracted at C:\rfadump\au\2012.08.07.txt
> parsing started
> current time: 2012-09-16 22:40:16.829000
> 500000 lines parsed
> 1000000 lines parsed
> 1500000 lines parsed
> 2000000 lines parsed
> 2500000 lines parsed
> 3000000 lines parsed
> 3500000 lines parsed
> 4000000 lines parsed
> 4500000 lines parsed
> 5000000 lines parsed
> parsing done.
> end time is 2012-09-16 23:34:19.931000
> total time elapsed 0:54:03.102000
> repacking file
> done
> > s:\users\aaj\projects\pythonhf\rfadumptohdf.py(132)generateFiles()
> -> while single_date <= self.end_date:
> (Pdb) c
> *** 2012-08-08 ***
> source file extracted at C:\rfadump\au\2012.08.08.txt
> cought an exception while generating file for day 2012-08-08.
> Traceback (most recent call last):
>   File "rfaDumpToHDF.py", line 175, in generateFile
>     lines = self.rawfile.read().split('|\n')
> MemoryError
>
> I am very sure that the Windows system task manager shows the memory usage
> as **50 MB** for this process. It looks like the garbage collector or
> memory manager for Python is not calculating the free memory correctly.
> There should be a lot of free memory, but it thinks there is not enough.
>
> Any idea?
>
> Thanks.
>
> Alok Jadhav
> CREDIT SUISSE AG
> GAT IT Hong Kong, KVAG 67
> International Commerce Centre | Hong Kong | Hong Kong
> Phone +852 2101 6274 | Mobile +852 9169 7172
> alok.jadhav@credit-suisse.com | www.credit-suisse.com
> <http://www.credit-suisse.com/>
>
Don't blame CPython. You're trying to do a read() of a large file,
which will result in a single large string. Then you split it into
lines. Why not just read it in as lines, in which case the large string
isn't necessary? Take a look at the readlines() function. Chances are
that even that is unnecessary, but I can't tell without seeing more of
the code.
    lines = self.rawfile.read().split('|\n')
    lines = self.rawfile.readlines()
When a single large item is being allocated, it's not enough to have
sufficient free space; the space also has to be contiguous. After a
program runs for a while, its space naturally gets more and more
fragmented. It's the nature of the C runtime, and CPython is stuck with it.
--
DaveA
===============================================================================
Please access the attached hyperlink for an important electronic communications disclaimer:
http://www.credit-suisse.com/legal/en/disclaimer_email_ib.html
===============================================================================
| From | alex23 <wuwei23@gmail.com> |
|---|---|
| Date | 2012-09-16 20:25 -0700 |
| Message-ID | <59f8c664-8f11-439e-8002-ca76ee24a632@g7g2000pbh.googlegroups.com> |
| In reply to | #29357 |
On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com>
wrote:
> - As you have seen, the line separator is not '\n' but '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line,
> and the only way to find the true end of the line is that the previous
> character should be a bar '|'. I was not able to specify the end of line
> using the readlines() function, but I could do it using the split() function.
> (One hack would be to readlines and combine them until I find '|\n'. Is
> there a cleaner way to do this?)
You can use a generator to take care of your readlines requirements:
    def readlines(f):
        lines = []
        while "f is not empty":
            line = f.readline()
            if not line: break
            if len(line) > 2 and line[-2:] == '|\n':
                lines.append(line)
                yield ''.join(lines)
                lines = []
            else:
                lines.append(line)
> - Reading the whole file at once and processing line by line was much
> faster. Though speed is not a very important issue here, I think the
> time it took to parse the complete file was reduced to one third of the
> original time.
With the readlines generator above, it'll read lines from the file
until it has a complete "line" by your requirement, at which point
it'll yield it. If you don't need the entire file in memory for the
end result, you'll be able to process each "line" one at a time and
perform whatever you need against it before asking for the next.
    with open(u'infile.txt','r') as infile:
        for line in readlines(infile):
            ...
Generators are a very efficient way of processing large amounts of
data. You can chain them together very easily:
    real_lines = readlines(infile)
    marker_lines = (l for l in real_lines if l.startswith('#'))
    every_second_marker = (l for i,l in enumerate(marker_lines)
                           if (i+1) % 2 == 0)
    map(some_function, every_second_marker)
The real_lines generator returns your definition of a line. The
marker_lines generator filters out everything that doesn't start with
#, while every_second_marker returns only half of those. (Yes, these
could all be written as a single generator, but this is very useful
for more complex pipelines).
The big advantage of this approach is that nothing is read from the
file into memory until map is called, and given the way they're
chained together, only one of your lines should be in memory at any
given time.
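
A minimal, self-contained sketch of the pipeline described above, written for Python 3 and run against an in-memory `io.StringIO` instead of a real file. The sample data and the names `read_records`, `position_before` and `seen` are assumptions made for illustration; `read_records` plays the role of the `readlines` generator but checks the terminator with `str.endswith`. One further assumption: in Python 3 `map` is itself lazy, so the sketch drives the pipeline with a `for` loop rather than a bare `map` call.

```python
import io

def read_records(f):
    # Yield logical records that end in a bar followed by a newline,
    # joining physical lines until the terminator is seen.
    lines = []
    for line in f:
        lines.append(line)
        if line.endswith('|\n'):
            yield ''.join(lines)
            lines = []
    # Unterminated trailing data is dropped here, as in the generator above.

# Hypothetical sample data: four records, one with an embedded newline.
infile = io.StringIO('#a|\nx\ny|\n#b|\n#c|\n')

real_lines = read_records(infile)
marker_lines = (l for l in real_lines if l.startswith('#'))
every_second_marker = (l for i, l in enumerate(marker_lines)
                       if (i + 1) % 2 == 0)

# Building the generators has not consumed anything from the file yet.
position_before = infile.tell()

# Iterating the last generator pulls records through the chain one at a time.
seen = []
for record in every_second_marker:
    seen.append(record)
```

Only the final loop causes any reading; each record flows through all three stages before the next one is fetched.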
| From | 88888 Dihedral <dihedral88888@googlemail.com> |
|---|---|
| Date | 2012-09-16 21:39 -0700 |
| Message-ID | <f0370abf-303a-47b7-81ea-a3d8e4f012bc@googlegroups.com> |
| In reply to | #29359 |
On Monday, September 17, 2012 at 11:25:06 AM UTC+8, alex23 wrote:
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.
The basic problem is whether the output items really need
all lines of the input text file to be buffered to
produce the results.
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-09-17 06:46 -0400 |
| Message-ID | <mailman.830.1347878839.27098.python-list@python.org> |
| In reply to | #29359 |
On 09/16/2012 11:25 PM, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com>
> wrote:
>> - As you have seen, the line separator is not '\n' but '|\n'.
>> Sometimes the data itself has '\n' characters in the middle of the line,
>> and the only way to find the true end of the line is that the previous
>> character should be a bar '|'. I was not able to specify the end of line
>> using the readlines() function, but I could do it using the split() function.
>> (One hack would be to readlines and combine them until I find '|\n'. Is
>> there a cleaner way to do this?)
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if len(line) > 2 and line[-2:] == '|\n':
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)
There's a few changes I'd make:
I'd change the name to something else, so as not to shadow the built-in,
and to make it clear in caller's code that it's not the built-in one.
I'd replace that compound if statement with
    if line.endswith("|\n"):
I'd add a comment saying that partial lines at the end of file are ignored.
>> - Reading the whole file at once and processing line by line was much
>> faster. Though speed is not a very important issue here, I think the
>> time it took to parse the complete file was reduced to one third of the
>> original time.
You don't say what it was faster than. Chances are you went to the
other extreme, of doing a read() of 1 byte at a time. Try Alex's
approach of a generator which in turn uses readline().
> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open(u'infile.txt','r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i,l in enumerate(marker_lines)
>                            if (i+1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line. The
> marker_lines generator filters out everything that doesn't start with
> #, while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines).
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.
--
DaveA
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-17 11:47 +0000 |
| Message-ID | <50570de3$0$29981$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #29373 |
On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
> On 09/16/2012 11:25 PM, alex23 wrote:
>>     def readlines(f):
>>         lines = []
>>         while "f is not empty":
>>             line = f.readline()
>>             if not line: break
>>             if len(line) > 2 and line[-2:] == '|\n':
>>                 lines.append(line)
>>                 yield ''.join(lines)
>>                 lines = []
>>             else:
>>                 lines.append(line)
>
> There's a few changes I'd make:
> I'd change the name to something else, so as not to shadow the built-in,
Which built-in are you referring to? There is no readlines built-in.
py> readlines
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'readlines' is not defined
There is a file.readlines method, but that lives in a different namespace
to the function readlines so there should be no confusion. At least not
for a moderately experienced programmer; beginners can be confused by the
littlest things sometimes.
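
A small Python 3 sketch of the namespace point, using `io.StringIO` as a stand-in for a file. The module-level `readlines` here is a hypothetical placeholder that simply returns a list of lines, and the variable names are illustrative only.

```python
import builtins
import io

def readlines(f):
    # Hypothetical module-level function: returns all lines as a list.
    return list(f)

f = io.StringIO('a\nb\n')

# Name lookup in the module namespace finds the function defined above.
via_function = readlines(f)

# Attribute lookup on the file object finds its own readlines method,
# unaffected by the module-level name.
f.seek(0)
via_method = f.readlines()

# There is no built-in called readlines to shadow.
has_builtin = hasattr(builtins, 'readlines')
```

Both calls return the same list here, and neither name interferes with the other.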
> and to make it clear in caller's code that it's not the built-in one.
> I'd replace that compound if statement with
>     if line.endswith("|\n"):
> I'd add a comment saying that partial lines at the end of file are
> ignored.
Or fix the generator so that it doesn't ignore partial lines, or raises
an exception, whichever is more appropriate.
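
One way that suggestion could look, as a sketch only and assuming Python 3. The name `iter_records` and the `terminator` and `on_partial` parameters are hypothetical and not from the thread; the caller chooses whether an unterminated trailing record is yielded or treated as an error.

```python
import io

def iter_records(f, terminator='|\n', on_partial='yield'):
    # Join physical lines until one ends with the terminator, then yield
    # the joined record. on_partial controls what happens to leftover
    # data at end of file: 'yield' emits it, 'raise' raises ValueError.
    if on_partial not in ('yield', 'raise'):
        raise ValueError("on_partial must be 'yield' or 'raise'")
    parts = []
    for line in f:
        parts.append(line)
        if line.endswith(terminator):
            yield ''.join(parts)
            parts = []
    if parts:
        if on_partial == 'raise':
            raise ValueError('unterminated record at end of file')
        yield ''.join(parts)

# Hypothetical sample: two complete records and an unterminated tail.
sample = 'a|\nb\nc|\ntail'
complete = list(iter_records(io.StringIO(sample)))
```

Because this is a generator, the argument check and any `ValueError` only surface once iteration starts.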
--
Steven
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-09-17 08:03 -0400 |
| Message-ID | <mailman.832.1347883451.27098.python-list@python.org> |
| In reply to | #29375 |
On 09/17/2012 07:47 AM, Steven D'Aprano wrote:
> On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
>
>> On 09/16/2012 11:25 PM, alex23 wrote:
>>>     def readlines(f):
>>>         lines = []
>>>         while "f is not empty":
>>>             line = f.readline()
>>>             if not line: break
>>>             if len(line) > 2 and line[-2:] == '|\n':
>>>                 lines.append(line)
>>>                 yield ''.join(lines)
>>>                 lines = []
>>>             else:
>>>                 lines.append(line)
>> There's a few changes I'd make:
>> I'd change the name to something else, so as not to shadow the built-in,
> Which built-in are you referring to? There is no readlines built-in.
>
> py> readlines
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> NameError: name 'readlines' is not defined
>
>
> There is a file.readlines method, but that lives in a different namespace
> to the function readlines so there should be no confusion. At least not
> for a moderately experienced programmer; beginners can be confused by the
> littlest things sometimes.
You're right of course, and that's not restricted to beginners. I've
been at this for over 40 years, and I make that kind of mistake once in
a while. Fortunately, when I make such a mistake on this forum, you
usually pop in to keep me honest. When I make it in code, I either get
a runtime error, or no harm is done.
>
>> and to make it clear in caller's code that it's not the built-in one.
>> I'd replace that compound if statement with
>>     if line.endswith("|\n"):
>> I'd add a comment saying that partial lines at the end of file are
>> ignored.
> Or fix the generator so that it doesn't ignore partial lines, or raises
> an exception, whichever is more appropriate.
>
>
>
--
DaveA
| From | aahz@pythoncraft.com (Aahz) |
|---|---|
| Date | 2012-11-14 06:19 -0800 |
| Message-ID | <k8098p$b2e$1@panix5.panix.com> |
| In reply to | #29375 |
In article <50570de3$0$29981$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>On Mon, 17 Sep 2012 06:46:55 -0400, Dave Angel wrote:
>> On 09/16/2012 11:25 PM, alex23 wrote:
>>>
>>> def readlines(f):
>>>     lines = []
>>>     while "f is not empty":
>>>         line = f.readline()
>>>         if not line: break
>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>             lines.append(line)
>>>             yield ''.join(lines)
>>>             lines = []
>>>         else:
>>>             lines.append(line)
>>
>> There's a few changes I'd make:
>> I'd change the name to something else, so as not to shadow the built-in,
>
>Which built-in are you referring to? There is no readlines built-in.
>
>py> readlines
>Traceback (most recent call last):
>  File "<stdin>", line 1, in <module>
>NameError: name 'readlines' is not defined
>
>There is a file.readlines method, but that lives in a different namespace
>to the function readlines so there should be no confusion. At least not
>for a moderately experienced programmer, beginners can be confused by the
>littlest things sometimes.

Actually, as an experienced programmer, I *do* think it is confusing as
evidenced by the mistake Dave made! Segregated namespaces are wonderful
(per Zen), but let's not pollute multiple namespaces with same name,
either.

It may not be literally shadowing the built-in, but it definitely
mentally shadows the built-in.
--
Aahz (aahz@pythoncraft.com) <*> http://www.pythoncraft.com/
"....Normal is what cuts off your sixth finger and your tail..." --Siobhan
| From | Dieter Maurer <dieter@handshake.de> |
|---|---|
| Date | 2012-11-15 08:31 +0100 |
| Message-ID | <mailman.3711.1352964690.27098.python-list@python.org> |
| In reply to | #33336 |
aahz@pythoncraft.com (Aahz) writes:
> ...
>>>> def readlines(f):
>>>>     lines = []
>>>>     while "f is not empty":
>>>>         line = f.readline()
>>>>         if not line: break
>>>>         if len(line) > 2 and line[-2:] == '|\n':
>>>>             lines.append(line)
>>>>             yield ''.join(lines)
>>>>             lines = []
>>>>         else:
>>>>             lines.append(line)
>>>
>>> There's a few changes I'd make:
>>> I'd change the name to something else, so as not to shadow the built-in,
> ...
> Actually, as an experienced programmer, I *do* think it is confusing as
> evidenced by the mistake Dave made! Segregated namespaces are wonderful
> (per Zen), but let's not pollute multiple namespaces with same name,
> either.
>
> It may not be literally shadowing the built-in, but it definitely
> mentally shadows the built-in.

I disagree with you. Namespaces are there so that, when working with one
namespace, I do not need to worry much about other namespaces. Therefore,
calling a function "readlines" is very much justified (if it reads lines
from a file), even if there were a module around with the name "readlines".
By the way, the module is named "readline" (not "readlines").
| From | "Jadhav, Alok" <alok.jadhav@credit-suisse.com> |
|---|---|
| Date | 2012-09-17 19:00 +0800 |
| Message-ID | <mailman.831.1347879875.27098.python-list@python.org> |
| In reply to | #29359 |
Thanks for your valuable inputs. This is very helpful.
| From | Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> |
|---|---|
| Date | 2012-11-15 12:20 +0100 |
| Message-ID | <k82j65$qb4$1@r03.glglgl.gl> |
| In reply to | #29357 |
On 17.09.2012 04:28, Jadhav, Alok wrote:
> Thanks, Dave, for the clean explanation. I clearly understand what is going on
> now. I still need some suggestions from you on this.
>
> There are 2 reasons why I was using self.rawfile.read().split('|\n')
> instead of self.rawfile.readlines():
>
> - As you have seen, the line separator is not '\n' but '|\n'.
> Sometimes the data itself has '\n' characters in the middle of the line,
> and the only way to find the true end of the line is that the previous
> character should be a bar '|'. I was not able to specify the end of line
> using the readlines() function, but I could do it using the split() function.
> (One hack would be to readlines and combine them until I find '|\n'. Is
> there a cleaner way to do this?)
> - Reading the whole file at once and processing line by line was much
> faster. Though speed is not a very important issue here, I think the
> time it took to parse the complete file was reduced to one third of the
> original time.
With
    def itersep(f, sep='\0', buffering=1024, keepsep=True):
        if keepsep:
            keepsep=sep
        else: keepsep=''
        data = f.read(buffering)
        next_line = data # empty? -> end.
        while next_line: # -> data is empty as well.
            lines = data.split(sep)
            for line in lines[:-1]:
                yield line+keepsep
            next_line = f.read(buffering)
            data = lines[-1] + next_line
        # keepsep: only if we have something.
        if (not keepsep) or data:
            yield data
you can iterate over everything you want without needing too much
memory. Using a larger "buffering" might improve speed a little bit.
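
A short usage sketch for the '|\n' separator from this thread, assuming Python 3 and an in-memory `io.StringIO` in place of the real file. The function is restated with its indentation so the block runs on its own; the sample data and the deliberately tiny `buffering` values are assumptions chosen so that a separator straddles a chunk boundary.

```python
import io

def itersep(f, sep='\0', buffering=1024, keepsep=True):
    # Restated from the post above; reads fixed-size chunks and splits on sep.
    if keepsep:
        keepsep = sep
    else:
        keepsep = ''
    data = f.read(buffering)
    next_line = data                  # empty means end of input
    while next_line:
        lines = data.split(sep)
        for line in lines[:-1]:
            yield line + keepsep
        next_line = f.read(buffering)
        # Carry the unfinished tail over and append the next chunk to it.
        data = lines[-1] + next_line
    # Emit whatever is left; with keepsep set, only if it is non-empty.
    if (not keepsep) or data:
        yield data

# Hypothetical sample: two complete records and an unterminated tail.
sample = 'a|\nb\nc|\ntail'

records = list(itersep(io.StringIO(sample), sep='|\n', buffering=4))

# With 2-character chunks the '|' and '\n' arrive in different reads.
records_small = list(itersep(io.StringIO(sample), sep='|\n', buffering=2))
```

Both runs produce the same records, since the carried-over tail is rejoined with the next chunk before splitting. Note that the unterminated tail is yielded without a separator.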
Thomas