How to read from a file to an arbitrary delimiter efficiently?

Started by	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
First post	2016-02-25 17:50 +1100
Last post	2016-02-29 08:00 +1100
Articles	19 — 13 participants

  How to read from a file to an arbitrary delimiter efficiently? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-25 17:50 +1100
    Re: How to read from a file to an arbitrary delimiter efficiently? Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-02-25 08:37 +0100
      Re: How to read from a file to an arbitrary delimiter efficiently? Steven D'Aprano <steve@pearwood.info> - 2016-02-27 21:40 +1100
        Re: How to read from a file to an arbitrary delimiter efficiently? Dan Sommers <dan@tombstonezero.net> - 2016-02-27 14:40 +0000
        Re: How to read from a file to an arbitrary delimiter efficiently? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-02-27 12:03 -0500
          Re: How to read from a file to an arbitrary delimiter efficiently? Marko Rauhamaa <marko@pacujo.net> - 2016-02-27 19:47 +0200
    Re: How to read from a file to an arbitrary delimiter efficiently? Chris Angelico <rosuav@gmail.com> - 2016-02-25 18:30 +1100
      Re: How to read from a file to an arbitrary delimiter efficiently? Steven D'Aprano <steve@pearwood.info> - 2016-02-27 20:49 +1100
        Re: How to read from a file to an arbitrary delimiter efficiently? Chris Angelico <rosuav@gmail.com> - 2016-02-27 23:17 +1100
        Re: How to read from a file to an arbitrary delimiter efficiently? Chris Angelico <rosuav@gmail.com> - 2016-02-27 23:18 +1100
        Re: How to read from a file to an arbitrary delimiter efficiently? Serhiy Storchaka <storchaka@gmail.com> - 2016-02-27 17:23 +0200
    Re: How to read from a file to an arbitrary delimiter efficiently? Paul Rubin <no.email@nospam.invalid> - 2016-02-24 23:48 -0800
      Re: How to read from a file to an arbitrary delimiter efficiently? wxjmfauth@gmail.com - 2016-02-25 06:37 -0800
      Re: How to read from a file to an arbitrary delimiter efficiently? wxjmfauth@gmail.com - 2016-02-25 06:38 -0800
    Re: How to read from a file to an arbitrary delimiter efficiently? BartC <bc@freeuk.com> - 2016-02-27 16:35 +0000
      Re: How to read from a file to an arbitrary delimiter efficiently? BartC <bc@freeuk.com> - 2016-02-27 20:03 +0000
        Re: How to read from a file to an arbitrary delimiter efficiently? BartC <bc@freeuk.com> - 2016-02-27 20:28 +0000
    Re: How to read from a file to an arbitrary delimiter efficiently? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-28 20:28 +0000
    Re: How to read from a file to an arbitrary delimiter efficiently? Tim Delaney <timothy.c.delaney@gmail.com> - 2016-02-29 08:00 +1100

#103480 — How to read from a file to an arbitrary delimiter efficiently?

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2016-02-25 17:50 +1100
Subject	How to read from a file to an arbitrary delimiter efficiently?
Message-ID	<56cea44e$0$11128$c3e8da3@news.astraweb.com>

I have a need to read to an arbitrary delimiter, which might be any of a 
(small) set of characters. For the sake of the exercise, let's say it is 
either ! or ? (for example).

I want to read from files reasonably efficiently. I don't mind if there is a 
little overhead, but my first attempt is 100 times slower than the built-in 
"read to the end of the line" method.

Here is the function I came up with:


# Read a chunk of bytes/characters from an open file.
def chunkiter(f, delim):
    buffer = []
    b = f.read(1)
    while b:
        buffer.append(b)
        if b in delim:
            yield ''.join(buffer)
            buffer = []
        b = f.read(1)
    if buffer:
        yield ''.join(buffer)



And here is some test code showing how slow it is:


# Create a test file.
FILENAME = '/tmp/foo'
s = """\
abcdefghijklmnopqrstuvwxyz!
abcdefghijklmnopqrstuvwxyz?
""" * 500
with open(FILENAME, 'w') as f:
    f.write(s)


# Run some timing tests, comparing to reading lines from a file.

def readlines(f):
    f.seek(0)
    for line in f:
        pass

def readchunks(f):
    f.seek(0)
    for chunk in chunkiter(f, '!?'):
        pass

from timeit import Timer
SETUP = 'from __main__ import readlines, readchunks, FILENAME; '
SETUP += 'f = open(FILENAME)'  # bind the file to f, which the timed statements use

t1 = Timer('readlines(f)', SETUP)
t2 = Timer('readchunks(f)', SETUP)

# Time them.
x = t1.repeat(number=10)  # Ignore the first run, in case of caching issues.
x = min(t1.repeat(number=1000, repeat=9))

y = t2.repeat(number=10)
y = min(t2.repeat(number=1000, repeat=9))

print('reading lines:', x, 'reading chunks:', y)






On my laptop, the results I get are:

reading lines: 0.22584209218621254 reading chunks: 21.716224210336804


Is there a better way to read chunks from a file up to one of a set of 
arbitrary delimiters? Bonus for it working equally well with text and bytes.

(You can assume that the delimiters will be no more than one byte, or 
character, each. E.g. "!" or "?", but never "!?" or "?!".)

-- 
Steve


#103482

From	Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de>
Date	2016-02-25 08:37 +0100
Message-ID	<mailman.115.1456385844.20994.python-list@python.org>
In reply to	#103480

On 25.02.2016 07:50, Steven D'Aprano wrote:
> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, lets say it is
> either ! or ? (for example).
>

You are not alone with your need.

http://bugs.python.org/issue1152248 discusses the problem and has some 
code snippets that you may be interested in. While there is no trivial 
solution there are certainly faster ways than your first attempt.

Wolfgang


#103574

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-02-27 21:40 +1100
Message-ID	<56d17d13$0$1596$c3e8da3$5496439d@news.astraweb.com>
In reply to	#103482

On Thu, 25 Feb 2016 06:37 pm, Wolfgang Maier wrote:

> On 25.02.2016 07:50, Steven D'Aprano wrote:
>> I have a need to read to an arbitrary delimiter, which might be any of a
>> (small) set of characters. For the sake of the exercise, lets say it is
>> either ! or ? (for example).
>>
> 
> You are not alone with your need.
> 
> http://bugs.python.org/issue1152248 discusses the problem and has some
> code snippets that you may be interested in. While there is no trivial
> solution there are certainly faster ways than your first attempt.

Wow. Ten years and still no solution :-(

Thanks for finding the issue, but the solutions given don't suit my use
case. I don't want an iterator that operates on pre-read blocks, I want
something that will read a record from a file, and leave the file pointer
one entry past the end of the record.

Oh, and records are likely fairly short, but there may be a lot of them.

-- 
Steven


#103586

From	Dan Sommers <dan@tombstonezero.net>
Date	2016-02-27 14:40 +0000
Message-ID	<nasch0$3kd$1@dont-email.me>
In reply to	#103574

On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano wrote:

> Thanks for finding the issue, but the solutions given don't suit my
> use case. I don't want an iterator that operates on pre-read blocks, I
> want something that will read a record from a file, and leave the file
> pointer one entry past the end of the record.

A file is a stream of bytes, but you want to view it as a stream of
records.  It sounds like you want an abstraction layer, and it sounds
like you also want to let the file leak through that layer when it's
convenient.  (Yes, I spun that horribly on purpose, and I understand the
use case of imposing some structure on part of a file, and possibly a
different structure on a different part of a file.  MIME messages and
literate programming files spring to mind.)

Perhaps (as I think ChrisA suggested), you could provide your own
buffering/chunking layer between your application and the file itself,
and never let the application see the file directly.


#103594

From	Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date	2016-02-27 12:03 -0500
Message-ID	<mailman.183.1456592632.20994.python-list@python.org>
In reply to	#103574

On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano <steve@pearwood.info>
declaimed the following:

>Wow. Ten years and still no solution :-(
>
>Thanks for finding the issue, but the solutions given don't suit my use
>case. I don't want an iterator that operates on pre-read blocks, I want
>something that will read a record from a file, and leave the file pointer
>one entry past the end of the record.
>
>Oh, and records are likely fairly short, but there may be a lot of them.

	Considering that most of the world has settled on the view that files
are just linear streams (curse you, UNIX) anything working with "records"
has to build the concept on top of the stream. Either by making records
"fixed width" (allowing for fast random access: recNum*recLen => seek
position), though likely giving up the stream access... Or by wrapping the
stream with something that does parsing/buffering.
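
As a rough illustration of the fixed-width case (a sketch only: RECLEN and
read_fixed are made-up names, and the file is assumed to be opened in binary
mode), random access reduces to one multiplication and a seek:

```python
import io

RECLEN = 8  # hypothetical fixed record length, in bytes


def read_fixed(f, recnum, reclen=RECLEN):
    """Return record number recnum (0-based) from a file of fixed-width records."""
    f.seek(recnum * reclen)  # recNum * recLen => seek position
    return f.read(reclen)


# Usage with an in-memory file holding three 8-byte records.
f = io.BytesIO(b"record00record01record02")
print(read_fixed(f, 2))
```

The cost is that every record has to be padded or truncated to RECLEN bytes.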

	Old days, in my world, the first was more common -- after all, the
"common" input method was 80-column Hollerith cards; records consisted of
reading one (or a set) of cards and then handling what was on that multiple
of 80 characters. My college computer system, by default, used an ISAM
structure for editor text files -- but that was a system where the ISAM
overhead was handled transparently by the OS, not a user-level linked
library (how many libraries are there for ISAM access in C?), so even
simple "type"/"print" commands properly displayed the contents.

	The other format is the Pascal style counted-string saved as file
contents, in which each "record" is prefaced with a length code. While not
as fast as fixed-length records, it does allow for rather fast scanning of
a file by reading the length field, then seeking that many bytes further
before reading the next length. But again, the I/O library has to retain
knowledge of what the record length was, and how far into a record one has
advanced (if not doing full record I/O) so that one recognizes the next
length field.
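
A minimal sketch of the counted-string layout, assuming a 4-byte big-endian
length prefix (that width and byte order are an arbitrary choice for the
example, and the function names are hypothetical). Skipping a record needs
only the length field and a relative seek:

```python
import io
import struct

HEADER = struct.Struct(">I")  # assumed: 4-byte big-endian length prefix


def write_counted(f, payload):
    """Write one record: the length field followed by the payload bytes."""
    f.write(HEADER.pack(len(payload)))
    f.write(payload)


def read_counted(f):
    """Return the next record's payload, or None at a clean end of file."""
    header = f.read(HEADER.size)
    if not header:
        return None
    if len(header) < HEADER.size:
        raise EOFError("truncated length field")
    (length,) = HEADER.unpack(header)
    payload = f.read(length)
    if len(payload) < length:
        raise EOFError("truncated record")
    return payload


def skip_counted(f):
    """Skip the next record without reading its payload; False at end of file."""
    header = f.read(HEADER.size)
    if len(header) < HEADER.size:
        return False
    (length,) = HEADER.unpack(header)
    f.seek(length, io.SEEK_CUR)  # jump straight to the next length field
    return True
```

Both readers assume the file position is sitting on a length field when they
are called, which is exactly the bookkeeping burden described above.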

	I will admit that I miss the idea of OS support for higher level file
structures (even the TRS-80 had OS support for fixed length random access
files -- and not by wasting the rest of a disk sector; the OS did the
packing/unpacking of shorter records into the sectors).
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/


#103596

From	Marko Rauhamaa <marko@pacujo.net>
Date	2016-02-27 19:47 +0200
Message-ID	<87bn72f3k9.fsf@elektro.pacujo.net>
In reply to	#103594

Dennis Lee Bieber <wlfraed@ix.netcom.com>:

> On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano <steve@pearwood.info>
> declaimed the following:
>>Thanks for finding the issue, but the solutions given don't suit my
>>use case. I don't want an iterator that operates on pre-read blocks, I
>>want something that will read a record from a file, and leave the file
>>pointer one entry past the end of the record.
>>
>>Oh, and records are likely fairly short, but there may be a lot of them.
>
> 	Considering that most of the world has settled on the view that
> files are just linear streams (curse you, UNIX) anything working with
> "records" has to build the concept on top of the stream. Either by
> making records "fixed width" (allowing for fast random access:
> recNum*recLen => seek position), though likely giving up the stream
> access... Or by wrapping the stream with something that does
> parsing/buffering.

It may be instructive to see how the Linux/UNIX utility head(1)
operates. It actually reads its input greedily but once it has seen
enough, it uses lseek(2) to move the seek position back.

Not all file-like objects can seek so head(1) may fail to operate as
advertised:

========================================================================
$ seq 10000 >/tmp/data.txt
$ {
> head -n 5 >/dev/null
> head -n 5
> } </tmp/data.txt
6
7
8
9
10
$ cat /tmp/data.txt | {
> head -n 5 >/dev/null
> head -n 5
> }

1861
1862
1863
1864
$
========================================================================


Marko


#103483

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-25 18:30 +1100
Message-ID	<mailman.116.1456385901.20994.python-list@python.org>
In reply to	#103480

On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
> # Read a chunk of bytes/characters from an open file.
> def chunkiter(f, delim):
>     buffer = []
>     b = f.read(1)
>     while b:
>         buffer.append(b)
>         if b in delim:
>             yield ''.join(buffer)
>             buffer = []
>         b = f.read(1)
>     if buffer:
>         yield ''.join(buffer)

How bad is it if you over-read? If it's absolutely critical that you
not read anything from the buffer that you shouldn't, then yeah, it's
going to be slow. But if you're never going to read the file using
anything other than this iterator, the best thing to do is to read
more at a time. Simple and naive method:

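import re

def chunkiter(f, delim):
    """Don't use [ or ] as the delimiter, kthx"""
    buffer = ""
    b = f.read(256)
    while b:
        buffer += b
        # Everything before the last delimiter is complete; keep the tail.
        *parts, buffer = re.split("["+delim+"]", buffer)
        yield from parts
        b = f.read(256)  # read the next block, otherwise the loop never ends
    if buffer: yield buffer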

How well does that perform?

ChrisA


#103572

From	Steven D'Aprano <steve@pearwood.info>
Date	2016-02-27 20:49 +1100
Message-ID	<56d17138$0$1605$c3e8da3$5496439d@news.astraweb.com>
In reply to	#103483

On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:

> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>
>> # Read a chunk of bytes/characters from an open file.
>> def chunkiter(f, delim):
>>     buffer = []
>>     b = f.read(1)
>>     while b:
>>         buffer.append(b)
>>         if b in delim:
>>             yield ''.join(buffer)
>>             buffer = []
>>         b = f.read(1)
>>     if buffer:
>>         yield ''.join(buffer)
> 
> How bad is it if you over-read? 

Pretty bad :-)

Ideally, I'd rather not over-read at all. I'd like the user to be able to
swap from "read N bytes" to "read to the next delimiter" (and possibly
even "read the next line") without losing anything.

If there's absolutely no other way to speed this up by at least a factor of
ten, I'll consider reading into a buffer and losing the ability to mix
different kinds of reads.

-- 
Steven


#103582

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-27 23:17 +1100
Message-ID	<mailman.173.1456575458.20994.python-list@python.org>
In reply to	#103572

On Sat, Feb 27, 2016 at 8:49 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>
>> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>>
>>> # Read a chunk of bytes/characters from an open file.
>>> def chunkiter(f, delim):
>>>     buffer = []
>>>     b = f.read(1)
>>>     while b:
>>>         buffer.append(b)
>>>         if b in delim:
>>>             yield ''.join(buffer)
>>>             buffer = []
>>>         b = f.read(1)
>>>     if buffer:
>>>         yield ''.join(buffer)
>>
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.

If those are the *only* two operations, you should be able to maintain
your own buffer. Something like this:

import re

class ChunkIter:
    def __init__(self, f, delim):
        self.f = f
        self.delim = re.compile("["+delim+"]")
        self.buffer = ""
    def read_to_delim(self):
        """Return characters up to the next delim, or remaining chars,
        or "" if at EOF"""
        while "delimiter not found":
            *parts, self.buffer = self.delim.split(self.buffer, 1)
            if parts: return parts[0]
            b = self.f.read(256)
            if not b:
                # EOF: hand back what is left and empty the buffer, so the
                # next call returns "" instead of the same tail again.
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += b
    def read(self, nbytes):
        need = nbytes - len(self.buffer)
        if need > 0: self.buffer += self.f.read(need)
        # Slice by nbytes (not need): return nbytes chars, keep the rest.
        ret, self.buffer = self.buffer[:nbytes], self.buffer[nbytes:]
        return ret

It still might over-read from the underlying file, but those extra
chars will be available to the read(N) function.

ChrisA


#103583

From	Chris Angelico <rosuav@gmail.com>
Date	2016-02-27 23:18 +1100
Message-ID	<mailman.174.1456575532.20994.python-list@python.org>
In reply to	#103572

On Sat, Feb 27, 2016 at 11:17 PM, Chris Angelico <rosuav@gmail.com> wrote:
>> Ideally, I'd rather not over-read at all. I'd like the user to be able to
>> swap from "read N bytes" to "read to the next delimiter" (and possibly
>> even "read the next line") without losing anything.
>
> If those are the *only* two operations, you should be able to maintain
> your own buffer.

And, I started out by thinking "to next delimiter" and "next line"
were the same thing with different delimiters, but then went and coded
the delimiter so that wouldn't work. Whatevs. If those are the only
*three* operations, the same class with one more method could do it.

ChrisA


#103587

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2016-02-27 17:23 +0200
Message-ID	<mailman.178.1456586629.20994.python-list@python.org>
In reply to	#103572

On 27.02.16 11:49, Steven D'Aprano wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.
>
>
> If there's absolutely no other way to speed this up by at least a factor of
> ten, I'll consider reading into a buffer and losing the ability to mix
> different kinds of reads.

If the file is buffered, you can use Chris's recipe, but with peek(). 
Otherwise you should fall back to slow one-byte read.
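
One way the peek() idea might look, as a sketch rather than a tested recipe.
It assumes a binary file object that has peek(), such as the
io.BufferedReader returned by open(name, 'rb'); text-mode files do not offer
it. The function name and the default delimiters are made up for the example:

```python
import io
import re


def read_to_delim(f, delims=b"!?"):
    """Read from buffered binary file f up to and including the next delimiter.

    Only the bytes that are returned get consumed, so f.read(n) and
    f.readline() can be mixed in freely between calls. Returns b"" at EOF.
    """
    pattern = re.compile(b"[" + re.escape(delims) + b"]")
    out = bytearray()
    while True:
        window = f.peek(1)  # buffered bytes, not consumed; may be longer than 1
        if not window:
            return bytes(out)  # EOF: return whatever was collected
        match = pattern.search(window)
        if match:
            out += f.read(match.end())  # consume up to and including the delimiter
            return bytes(out)
        out += f.read(len(window))  # no delimiter yet: consume the whole window


# Usage: mix delimiter reads with ordinary reads.
f = io.BufferedReader(io.BytesIO(b"abc!defghijklmnop?rest"), buffer_size=8)
first = read_to_delim(f)
two_bytes = f.read(2)
second = read_to_delim(f)
```

Because peek() never advances the file position, nothing is lost when the
caller switches between kinds of read.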


#103484

From	Paul Rubin <no.email@nospam.invalid>
Date	2016-02-24 23:48 -0800
Message-ID	<871t81w7pw.fsf@jester.gateway.pace.com>
In reply to	#103480

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
>     while b:
>         buffer.append(b)

This looks bad because of the overhead of list elements, and also the
reading of 1 char at a time.  If it's bytes that you're reading, try
using bytearray instead of list:

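    def chunkiter(f, delim):
        # f must be opened in binary mode; delim is bytes, e.g. b'!?'
        buf = bytearray()
        fread = f.read    # avoid an attribute lookup when calling
        while True:
            c = fread(1)
            if not c:             # EOF: flush any trailing partial chunk
                if buf:
                    yield bytes(buf)
                return
            buf += c              # bytearray.append() wants an int on Python 3
            if c in delim:
                yield bytes(buf)
                del buf[:]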

If that's still not fast enough, you could do a more hacky thing of
reading large chunks of input at once (f.read(4096) or whatever),
splitting on the delimiter set with re.split, and yielding the split
output, refilling the buffer when you don't find more delimiters.  That
doesn't tell you what delimiters actually match: do you need that?
Maybe there is a nicer way to get at it than adding up the lengths of
the chunks to index into the buffer.  How large do you expect the chunks
to be?
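
On the question of which delimiter matched: one possibility, sketched here
under the assumption of a text-mode file and single-character delimiters, is
to put the character class in a capturing group, since re.split() then keeps
the matched delimiters in its result. The (chunk, delimiter) pair format and
the 4096 default are choices made for the example:

```python
import io
import re


def chunkiter(f, delims="!?", blocksize=4096):
    """Yield (chunk, delimiter) pairs from text file f.

    The final pair has "" as its delimiter if the file does not end with one.
    """
    splitter = re.compile("([" + re.escape(delims) + "])")
    buffer = ""
    while True:
        block = f.read(blocksize)
        if not block:
            break
        buffer += block
        # With a capturing group, split() returns
        # [text, delim, text, delim, ..., remainder].
        *pairs, buffer = splitter.split(buffer)
        for i in range(0, len(pairs), 2):
            yield pairs[i], pairs[i + 1]
    if buffer:
        yield buffer, ""


# Usage with a deliberately tiny block size to exercise the refill path.
result = list(chunkiter(io.StringIO("ab!cd?ef"), blocksize=3))
```

This avoids adding up chunk lengths to index back into the buffer.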


#103496

From	wxjmfauth@gmail.com
Date	2016-02-25 06:37 -0800
Message-ID	<08020191-19cd-4c57-a408-b8ce48acee8d@googlegroups.com>
In reply to	#103484

:-)


#103498

From	wxjmfauth@gmail.com
Date	2016-02-25 06:38 -0800
Message-ID	<83674e83-9fe3-4266-bdff-93705dc49e39@googlegroups.com>
In reply to	#103484

:-)


#103590

From	BartC <bc@freeuk.com>
Date	2016-02-27 16:35 +0000
Message-ID	<nasj2p$hec$1@dont-email.me>
In reply to	#103480

On 25/02/2016 06:50, Steven D'Aprano wrote:
> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, lets say it is
> either ! or ? (for example).

>
> # Read a chunk of bytes/characters from an open file.
> def chunkiter(f, delim):
>      buffer = []
>      b = f.read(1)
>      while b:
>          buffer.append(b)
>          if b in delim:
>              yield ''.join(buffer)
>              buffer = []
>          b = f.read(1)
>      if buffer:
>          yield ''.join(buffer)

At first sight, it's not surprising it's slow when you throw generators 
and whatnot in there.

However those aren't the main reasons for the poor speed. The limiting 
factor here is reading one byte at a time. Just a loop like this:

    while f.read(1):
       pass

without doing anything else, seems to take most of the time. (3.6 
seconds, compared with 5.6 seconds of your readchunks() on a 6MB version 
of your test file, on Python 2.7. readlines() took about 0.2 seconds.)

Any faster solutions would need to read more than one byte at a time.

(This bottleneck occurs in C too if you try to read a file using 
only fgetc(), compared with any buffered solutions.)

-- 
bartc


#103606

From	BartC <bc@freeuk.com>
Date	2016-02-27 20:03 +0000
Message-ID	<nasv9k$ij8$1@dont-email.me>
In reply to	#103590

On 27/02/2016 16:35, BartC wrote:
> On 25/02/2016 06:50, Steven D'Aprano wrote:
>> I have a need to read to an arbitrary delimiter, which might be any of a
>> (small) set of characters. For the sake of the exercise, lets say it is
>> either ! or ? (for example).

> However those aren't the main reasons for the poor speed. The limiting
> factor here is reading one byte at a time. Just a loop like this:
>
>     while f.read(1):
>        pass
>
> without doing anything else, seems to take most of the time. (3.6
> seconds, compared with 5.6 seconds of your readchunks() on a 6MB version
> of your test file, on Python 2.7. readlines() took about 0.2 seconds.)
>
> Any faster solutions would need to read more than one byte at a time.

I've done some more tests using Python 3.4, with the same 200,000 line 
6MB test file:

0.25 seconds       Scan the file with 'for line in f'
2.25 seconds       Scan the file with your readlines() routine
4.0  seconds       Scan the file with your readchunks() routine
0.65 seconds       Scan the file using a buffer

This latter test uses a 64-byte buffer, reading not more than an extra 
63 bytes, but resetting the file position to just past the end of 
each identified chunk so that any subsequent read works as expected.

This test (the code is too untidy to post) only checks for two specific 
delimiters (not an arbitrary string full of them). (It also counts EOF 
as a valid delimiter so counts one more chunk.)

Increasing the buffer size doesn't help, and beyond 256 bytes slowed 
things down (for this input) as it spends too long rereading data.
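
The read-ahead-and-seek-back approach might be sketched roughly as follows.
This is not the code that was timed above. It assumes a seekable file opened
in binary mode (in text mode tell() returns opaque values, so the position
arithmetic would not be valid); the function name and defaults are made up
for the example:

```python
import io
import re


def read_to_delim(f, delims=b"!?", blocksize=64):
    """Read from binary file f up to and including the next delimiter byte.

    Over-reads by at most one block, then seeks back so that the file
    position ends up just past the delimiter. Returns b"" at EOF.
    """
    pattern = re.compile(b"[" + re.escape(delims) + b"]")
    start = f.tell()
    pieces = []
    consumed = 0
    while True:
        block = f.read(blocksize)
        if not block:
            break  # EOF: return whatever has been collected
        match = pattern.search(block)
        if match:
            pieces.append(block[:match.end()])
            consumed += match.end()
            f.seek(start + consumed)  # undo the over-read
            break
        pieces.append(block)
        consumed += len(block)
    return b"".join(pieces)


# Usage: an ordinary read after a delimiter read picks up where it left off.
f = io.BytesIO(b"abc!de?tail")
chunk = read_to_delim(f)
next_byte = f.read(1)
```

With short records, a larger blocksize mostly means more bytes read and then
thrown away by the seek, which fits the slowdown reported above.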

-- 
Bartc


#103607

From	BartC <bc@freeuk.com>
Date	2016-02-27 20:28 +0000
Message-ID	<nat0oe$ojt$1@dont-email.me>
In reply to	#103606

On 27/02/2016 20:03, BartC wrote:
> On 27/02/2016 16:35, BartC wrote:

>> Any faster solutions would need to read more than one byte at a time.
>
> I've done some more test using Python 3.4, with the same 200,000 line
> 6MB test file:
>
> 0.25 seconds       Scan the file with 'for line in f'
> 2.25 seconds       Scan the file with your readlines() routine

That's not right. 0.25 seconds was for readlines(). 2.25 for a f.read(1) 
loop.


#103660

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2016-02-28 20:28 +0000
Message-ID	<mailman.24.1456691337.9760.python-list@python.org>
In reply to	#103480

On 25 February 2016 at 06:50, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, lets say it is
> either ! or ? (for example).
>
> I want to read from files reasonably efficiently. I don't mind if there is a
> little overhead, but my first attempt is 100 times slower than the built-in
> "read to the end of the line" method.

You can get something much faster using mmap and searching for a
single delimiter:

import mmap

def readuntil(m, delim):
    start = m.tell()
    index = m.find(delim, start)
    if index == -1:
        return m.read()
    else:
        # Include the delimiter, so the next call starts just past it
        # (otherwise every later call finds it at once and returns b'').
        return m.read(index - start + len(delim))

def readmmap(f):
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    f.seek(0)
    while True:
        chunk = readuntil(m, b'!') # Note byte-string
        if not chunk:
            return
        # Do stuff with chunk
        pass

My timing makes that ~7x slower than iterating over the lines of the
file but still around 100x faster than reading individual characters.
I'm not sure how to generalise it to looking for multiple delimiters
without dropping back to reading individual characters though.

--
Oscar


#103662

From	Tim Delaney <timothy.c.delaney@gmail.com>
Date	2016-02-29 08:00 +1100
Message-ID	<mailman.26.1456693224.9760.python-list@python.org>
In reply to	#103480

On 29 February 2016 at 07:28, Oscar Benjamin <oscar.j.benjamin@gmail.com>
wrote:

> On 25 February 2016 at 06:50, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> >
> > I have a need to read to an arbitrary delimiter, which might be any of a
> > (small) set of characters. For the sake of the exercise, lets say it is
> > either ! or ? (for example).
> >
> > I want to read from files reasonably efficiently. I don't mind if there is a
> > little overhead, but my first attempt is 100 times slower than the built-in
> > "read to the end of the line" method.
>
> You can get something much faster using mmap and searching for a
> single delimiter:
>
> My timing makes that ~7x slower than iterating over the lines of the
> file but still around 100x faster than reading individual characters.
> I'm not sure how to generalise it to looking for multiple delimiters
> without dropping back to reading individual characters though.
>

You can use an mmapped file as the input for regular expressions. May or
may not be particularly efficient.
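
A sketch of the regular-expression-over-mmap idea, with no claim about how
it performs. It relies on the re module accepting bytes-like objects, which
an mmap is; iter_chunks and its default delimiters are hypothetical names
chosen for the example:

```python
import mmap
import re


def iter_chunks(path, delims=b"!?"):
    """Yield chunks of the file at path, each ending with its delimiter byte.

    A trailing chunk without a delimiter is yielded as-is.
    """
    pattern = re.compile(b"[" + re.escape(delims) + b"]")
    with open(path, "rb") as f:
        # Note: mmap.mmap() raises ValueError for an empty file.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            pos, size = 0, len(m)
            while pos < size:
                match = pattern.search(m, pos)  # search the mapping directly
                end = match.end() if match else size
                yield m[pos:end]  # slicing an mmap copies the bytes out
                pos = end
```

This handles any set of single-byte delimiters, but like the mmap version
above it works on the whole file rather than on an open file's current
position.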

Otherwise, if reading from a file, I think reading a chunk and then using
seek() to go back to just after the delimiter is probably going to be most
efficient whilst leaving the file position just after the delimiter.

If reading from a stream, I think Chris' approach is the way to go: read a
chunk, maintain an internal buffer, and don't give access to the underlying
stream.

Tim Delaney
