Subject: RE: Python garbage collector/memory manager behaving strangely
Date: Mon, 17 Sep 2012 19:00:46 +0800
In-Reply-To: <5056FF9F.1020305@davea.name>
Thread-Topic: Python garbage collector/memory manager behaving strangely
Thread-Index: Ac2UwhWSZGTq3DvBSv+ELxGBsZNl3wAAY21w
References: <CEE8C35195DB944D9C75ABB15A04193B14E77085@EHKG17P32001A.csfb.cs-group.com><5056871E.7050206@davea.name><mailman.818.1347849124.27098.python-list@python.org><59f8c664-8f11-439e-8002-ca76ee24a632@g7g2000pbh.googlegroups.com> <5056FF9F.1020305@davea.name>
From: "Jadhav, Alok" <alok.jadhav@credit-suisse.com>
To: <d@davea.name>, "alex23" <wuwei23@gmail.com>
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.831.1347879875.27098.python-list@python.org>

Thanks for your valuable input. This is very helpful.


-----Original Message-----
From: Python-list
[mailto:python-list-bounces+alok.jadhav=credit-suisse.com@python.org] On
Behalf Of Dave Angel
Sent: Monday, September 17, 2012 6:47 PM
To: alex23
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely

On 09/16/2012 11:25 PM, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" <alok.jad...@credit-suisse.com>
> wrote:
>> - As you have seen, the line separator is not '\n' but '|\n'.
>> Sometimes the data itself has '\n' characters in the middle of the
>> line, and the only way to find the true end of the line is that the
>> previous character should be a bar '|'. I was not able to specify the
>> end of line using the readlines() function, but I could do it using
>> the split() function. (One hack would be to readlines and combine
>> them until I find '|\n'. Is there a cleaner way to do this?)
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if len(line) > 2 and line[-2:] == '|\n':
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)

There are a few changes I'd make:
I'd change the name to something else, so that it isn't confused with
the file method readlines(), and to make it clear in the caller's code
that it's not that method.
I'd replace that compound if statement with
      if line.endswith("|\n"):
I'd add a comment saying that partial lines at the end of file are
ignored.
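
Taken together, those three changes might look like the sketch below.
The name iter_bar_terminated is only an illustration, not something from
the thread, and iterating over the file object is swapped in for the
explicit readline() loop. Note that endswith() also accepts a line that
consists of just '|\n', which the original len(line) > 2 test would skip.

```python
import io


def iter_bar_terminated(f):
    # Yield logical records: runs of physical lines ending in '|' + newline.
    # NOTE: a partial record at the end of the file (one that never reaches
    # the bar-newline terminator) is silently ignored.
    parts = []
    for line in f:  # file objects yield one physical line at a time
        parts.append(line)
        if line.endswith("|\n"):
            yield "".join(parts)
            parts = []


# Small in-memory stand-in for a real file (hypothetical sample data).
sample = io.StringIO("a|b\nc|\nd|\npartial")
records = list(iter_bar_terminated(sample))
```

With the sample above, the embedded newline stays inside the first
record and the trailing "partial" text is dropped.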

>> - Reading the whole file at once and processing it line by line was
>> much faster. Speed is not a very important issue here, but I think
>> the time it took to parse the complete file was reduced to one third
>> of the original time.

You don't say what it was faster than.  Chances are you went to the
other extreme, of doing a read() of 1 byte at a time.  Try Alex's
approach of a generator which in turn calls readline().
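
If reading line by line ever turned out to be too slow, one possible
middle ground between a single read() of the whole file and tiny reads
is to read fixed-size chunks and split on the terminator. This is only a
sketch under assumptions: the function name, the chunk_size default and
the sample data are made up for illustration, and no timings are claimed.

```python
import io


def iter_records_chunked(f, chunk_size=64 * 1024, terminator="|\n"):
    # Read the file in blocks and yield terminator-delimited records.
    # A partial record left over at end of file is ignored.
    buffer = ""
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Every piece except the last is a complete record; the last piece
        # stays in the buffer because its terminator may not have arrived
        # yet (or may be split across two chunks).
        pieces = buffer.split(terminator)
        buffer = pieces.pop()
        for piece in pieces:
            yield piece + terminator


# A deliberately tiny chunk size forces terminators to straddle chunks.
sample = io.StringIO("a|b\nc|\nd|\npartial")
chunked_records = list(iter_records_chunked(sample, chunk_size=3))
```

Only the current buffer is held in memory, so the memory use is bounded
by the chunk size plus the longest record.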

> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open(u'infile.txt','r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i,l in enumerate(marker_lines) if (i+1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line. The
> marker_lines generator filters out everything that doesn't start with
> #, while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines).
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.
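
The laziness of that pipeline can be checked with a small self-contained
sketch. It assumes Python 3, where map() is itself lazy, so the chain is
consumed with a list comprehension instead. iter_bar_terminated, the
sample data and the use of upper() as a stand-in for some_function are
all illustrative assumptions.

```python
import io


def iter_bar_terminated(f):
    # Yield records ending in '|' + newline; a trailing partial record is ignored.
    parts = []
    for line in f:
        parts.append(line)
        if line.endswith("|\n"):
            yield "".join(parts)
            parts = []


sample = io.StringIO("#one|\nplain|\n#two|\n#three\nmore|\n#four|\n")

# Building the chain creates generator objects but reads nothing yet.
real_lines = iter_bar_terminated(sample)
marker_lines = (l for l in real_lines if l.startswith("#"))
every_second_marker = (
    l for i, l in enumerate(marker_lines) if (i + 1) % 2 == 0
)

# Stream position before anything has been consumed.
position_before = sample.tell()

# Consuming the last generator pulls records through the whole chain,
# one at a time.
result = [record.upper() for record in every_second_marker]
```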


-- 

DaveA
