Path: csiph.com!fu-berlin.de!uni-berlin.de!not-for-mail
From: Chris Angelico <rosuav@gmail.com>
Newsgroups: comp.lang.python
Subject: Re: How to read from a file to an arbitrary delimiter efficiently?
Date: Sat, 27 Feb 2016 23:17:36 +1100
Lines: 54
Message-ID: <mailman.173.1456575458.20994.python-list@python.org>
References: <56cea44e$0$11128$c3e8da3@news.astraweb.com> <mailman.116.1456385901.20994.python-list@python.org> <56d17138$0$1605$c3e8da3$5496439d@news.astraweb.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
In-Reply-To: <56d17138$0$1605$c3e8da3$5496439d@news.astraweb.com>
Precedence: list
Xref: csiph.com comp.lang.python:103582

On Sat, Feb 27, 2016 at 8:49 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>
>> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>>
>>> # Read a chunk of bytes/characters from an open file.
>>> def chunkiter(f, delim):
>>>     buffer = []
>>>     b = f.read(1)
>>>     while b:
>>>         buffer.append(b)
>>>         if b in delim:
>>>             yield ''.join(buffer)
>>>             buffer = []
>>>         b = f.read(1)
>>>     if buffer:
>>>         yield ''.join(buffer)
>>
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.

If those are the *only* two operations, you should be able to maintain
your own buffer. Something like this:

import re

class ChunkIter:
    def __init__(self, f, delim):
        self.f = f
        # re.escape keeps chars like "]", "^" or "-" literal inside the class
        self.delim = re.compile("[" + re.escape(delim) + "]")
        self.buffer = ""
    def read_to_delim(self):
        """Return characters up to (not including) the next delim,
        or remaining chars, or "" if at EOF"""
        while "delimiter not found":
            *parts, self.buffer = self.delim.split(self.buffer, 1)
            if parts: return parts[0]
            b = self.f.read(256)
            if not b:
                # EOF: hand back the leftovers once, then "" from then on
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += b
    def read(self, nbytes):
        need = nbytes - len(self.buffer)
        if need > 0: self.buffer += self.f.read(need)
        # Slice by nbytes, not need: need is <= 0 when the buffer suffices
        ret, self.buffer = self.buffer[:nbytes], self.buffer[nbytes:]
        return ret

It can still over-read from the underlying file (up to 256 characters
at a time), but the extra characters stay in self.buffer, so a later
read(N) or read_to_delim() call picks them up and nothing is lost.
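
For what it's worth, here is a self-contained sketch of the same
buffering idea that keeps the delimiter on each chunk, the way your
original generator did. That also means an empty field is not confused
with EOF. The names DelimReader and blocksize are made up for
illustration, io.StringIO stands in for the real file, and I'm assuming
a text-mode file and a small set of single-character delimiters. It
rescans the whole buffer after every block read, so treat it as a
sketch rather than something tuned for huge chunks.

```python
import io
import re


class DelimReader:
    """Hypothetical variant: same buffering idea, but chunks keep their delimiter."""

    def __init__(self, f, delims, blocksize=256):
        self.f = f
        # Character class matching any single delimiter character.
        self.pattern = re.compile("[" + re.escape(delims) + "]")
        self.blocksize = blocksize
        self.buffer = ""

    def read_to_delim(self):
        """Return text up to and including the next delimiter.

        At EOF, return whatever is left, then "" on later calls.
        """
        while True:
            match = self.pattern.search(self.buffer)
            if match:
                end = match.end()
                chunk, self.buffer = self.buffer[:end], self.buffer[end:]
                return chunk
            block = self.f.read(self.blocksize)
            if not block:
                # EOF: drain the buffer so the next call returns "".
                chunk, self.buffer = self.buffer, ""
                return chunk
            self.buffer += block

    def read(self, n):
        """Return up to n characters, draining the buffer first."""
        need = n - len(self.buffer)
        if need > 0:
            self.buffer += self.f.read(need)
        chunk, self.buffer = self.buffer[:n], self.buffer[n:]
        return chunk


reader = DelimReader(io.StringIO("spam;eggs,,ham and more"), ";,")
print(repr(reader.read_to_delim()))  # 'spam;'
print(repr(reader.read(2)))          # 'eg'
print(repr(reader.read_to_delim()))  # 'gs,'
print(repr(reader.read_to_delim()))  # ','
print(repr(reader.read(100)))        # 'ham and more'
print(repr(reader.read_to_delim()))  # ''
```

Switching between read(N) and read_to_delim() never drops anything,
because both methods take from the same buffer before touching the
file.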

ChrisA