From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: How to read from a file to an arbitrary delimiter efficiently?
Date: Sat, 27 Feb 2016 23:17:36 +1100

On Sat, Feb 27, 2016 at 8:49 PM, Steven D'Aprano wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>
>> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
>> wrote:
>>>
>>> # Read a chunk of bytes/characters from an open file.
>>> def chunkiter(f, delim):
>>>     buffer = []
>>>     b = f.read(1)
>>>     while b:
>>>         buffer.append(b)
>>>         if b in delim:
>>>             yield ''.join(buffer)
>>>             buffer = []
>>>         b = f.read(1)
>>>     if buffer:
>>>         yield ''.join(buffer)
>>
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.

If those are the *only* two operations, you should be able to maintain
your own buffer. Something like this:

import re

class ChunkIter:
    def __init__(self, f, delim):
        self.f = f
        # Escape the delimiters so characters like "]" or "^" stay literal
        self.delim = re.compile("[" + re.escape(delim) + "]")
        self.buffer = ""

    def read_to_delim(self):
        """Return characters up to the next delim, or remaining chars,
        or "" if at EOF"""
        while "delimiter not found":
            # split() gives [before, after] on a match, else [whole buffer]
            *parts, self.buffer = self.delim.split(self.buffer, 1)
            if parts:
                return parts[0]
            b = self.f.read(256)
            if not b:
                # EOF: hand back the leftovers once, then empty the buffer
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += b

    def read(self, nbytes):
        need = nbytes - len(self.buffer)
        if need > 0:
            self.buffer += self.f.read(need)
        # Slice by nbytes (not need) so buffered chars are returned first
        ret, self.buffer = self.buffer[:nbytes], self.buffer[nbytes:]
        return ret

It still might over-read from the underlying file, but those extra
chars will be available to the read(N) function.

ChrisA
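
As a rough sketch of how the same shared-buffer idea might stretch to cover
the "read the next line" case as well: the class name SharedBufferReader,
the helper _read_until and the chunk_size parameter below are all made up
for illustration, and the code assumes a text-mode file and a non-empty
delimiter string. Like the class above, it drops the delimiter itself, so
an empty field and EOF both come back as "".

```python
import io
import re

_NEWLINE = re.compile("\n")


class SharedBufferReader:
    """Hypothetical wrapper: every read method drains one shared buffer."""

    def __init__(self, f, delims, chunk_size=256):
        self.f = f
        # re.escape keeps characters such as "]" or "^" literal inside [...]
        self.delim = re.compile("[" + re.escape(delims) + "]")
        self.chunk_size = chunk_size
        self.buffer = ""

    def _read_until(self, pattern, keep_match):
        """Return text up to the next match of pattern, refilling as needed."""
        while True:
            m = pattern.search(self.buffer)
            if m:
                end = m.end() if keep_match else m.start()
                ret, self.buffer = self.buffer[:end], self.buffer[m.end():]
                return ret
            chunk = self.f.read(self.chunk_size)
            if not chunk:
                # EOF: return whatever is left and empty the buffer
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += chunk

    def read_to_delim(self):
        # The delimiter is consumed but not returned
        return self._read_until(self.delim, keep_match=False)

    def readline(self):
        # The trailing newline is kept, as with file.readline()
        return self._read_until(_NEWLINE, keep_match=True)

    def read(self, n):
        need = n - len(self.buffer)
        if need > 0:
            self.buffer += self.f.read(need)
        ret, self.buffer = self.buffer[:n], self.buffer[n:]
        return ret


r = SharedBufferReader(io.StringIO("ab,cd;ef\ngh\nij"), ",;")
print(repr(r.read(1)))          # 'a'
print(repr(r.read_to_delim()))  # 'b'
print(repr(r.read_to_delim()))  # 'cd'
print(repr(r.readline()))       # 'ef\n'
print(repr(r.read_to_delim()))  # 'gh\nij' (no delimiter left, so the rest)
```

The underlying file can still be read ahead of what the caller has asked
for, but since all three methods draw from the same buffer, switching
between them should not lose any characters.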