Date: Mon, 17 Sep 2012 06:46:55 -0400
From: Dave Angel
To: alex23
Cc: python-list@python.org
Subject: Re: Python garbage collector/memory manager behaving strangely
Newsgroups: comp.lang.python

On 09/16/2012 11:25 PM, alex23 wrote:
> On Sep 17, 12:32 pm, "Jadhav, Alok" wrote:
>> - As you have seen, the line separator is not '\n' but it's '|\n'.
>> Sometimes the data itself has '\n' characters in the middle of the line
>> and only way to find true end of the line is that previous character
>> should be a bar '|'. I was not able to specify end of line using
>> readlines() function, but I could do it using split() function.
>> (One hack would be to readlines and combine them until I find '|\n'.
>> is there a cleaner way to do this?)
> You can use a generator to take care of your readlines requirements:
>
>     def readlines(f):
>         lines = []
>         while "f is not empty":
>             line = f.readline()
>             if not line: break
>             if len(line) > 2 and line[-2:] == '|\n':
>                 lines.append(line)
>                 yield ''.join(lines)
>                 lines = []
>             else:
>                 lines.append(line)

There are a few changes I'd make:

I'd change the name to something else, so it isn't confused with the
built-in file method readlines(), and to make it clear in the caller's
code that it's not the built-in one.

I'd replace that compound if statement with
      if line.endswith("|\n"):

I'd add a comment saying that partial lines at the end of file are ignored.

>> - Reading whole file at once and processing line by line was much
>> faster. Though speed is not of very important issue here but I think the
>> time it took to parse complete file was reduced to one third of original
>> time.

You don't say what it was faster than. Chances are you went to the
other extreme, of doing a read() of 1 byte at a time. Try Alex's
approach of a generator which in turn uses the file's readline() method.

> With the readlines generator above, it'll read lines from the file
> until it has a complete "line" by your requirement, at which point
> it'll yield it. If you don't need the entire file in memory for the
> end result, you'll be able to process each "line" one at a time and
> perform whatever you need against it before asking for the next.
>
>     with open(u'infile.txt','r') as infile:
>         for line in readlines(infile):
>             ...
>
> Generators are a very efficient way of processing large amounts of
> data. You can chain them together very easily:
>
>     real_lines = readlines(infile)
>     marker_lines = (l for l in real_lines if l.startswith('#'))
>     every_second_marker = (l for i,l in enumerate(marker_lines) if (i+1) % 2 == 0)
>     map(some_function, every_second_marker)
>
> The real_lines generator returns your definition of a line.
> The marker_lines generator filters out everything that doesn't start with
> #, while every_second_marker returns only half of those. (Yes, these
> could all be written as a single generator, but this is very useful
> for more complex pipelines).
>
> The big advantage of this approach is that nothing is read from the
> file into memory until map is called, and given the way they're
> chained together, only one of your lines should be in memory at any
> given time.

--
DaveA
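
For reference, here is a minimal sketch of what the generator might look like with the three changes suggested above applied (new name, endswith(), and a comment about partial lines). The name read_bar_terminated, the terminator parameter, and the sample data are made up for illustration; they are not from the original code. It iterates over the file object directly instead of calling readline() in a loop, which should behave the same for text files.

```python
import io


def read_bar_terminated(f, terminator="|\n"):
    """Yield logical records from f, each ending with terminator.

    Hypothetical name and parameter, chosen so the function is not
    mistaken for the file method readlines().
    """
    parts = []
    for line in f:  # physical lines, one at a time
        parts.append(line)
        if line.endswith(terminator):
            # A physical line ending in '|\n' closes the logical record.
            yield "".join(parts)
            parts = []
    # Note: a partial record at end of file (no trailing '|\n')
    # is deliberately ignored.


# Made-up sample data: the first record contains an embedded '\n'.
sample = io.StringIO("a|b\nstill a|\nsecond|\ntrailing partial")
records = list(read_bar_terminated(sample))
print(records)
```

The same function can be dropped into the chained pipeline quoted above in place of readlines(infile); only one logical record should need to be held in memory at a time.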