| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2016-02-25 17:50 +1100 |
| Last post | 2016-02-29 08:00 +1100 |
| Articles | 19 — 13 participants |
How to read from a file to an arbitrary delimiter efficiently? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2016-02-25 17:50 +1100
Re: How to read from a file to an arbitrary delimiter efficiently? Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> - 2016-02-25 08:37 +0100
Re: How to read from a file to an arbitrary delimiter efficiently? Steven D'Aprano <steve@pearwood.info> - 2016-02-27 21:40 +1100
Re: How to read from a file to an arbitrary delimiter efficiently? Dan Sommers <dan@tombstonezero.net> - 2016-02-27 14:40 +0000
Re: How to read from a file to an arbitrary delimiter efficiently? Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2016-02-27 12:03 -0500
Re: How to read from a file to an arbitrary delimiter efficiently? Marko Rauhamaa <marko@pacujo.net> - 2016-02-27 19:47 +0200
Re: How to read from a file to an arbitrary delimiter efficiently? Chris Angelico <rosuav@gmail.com> - 2016-02-25 18:30 +1100
Re: How to read from a file to an arbitrary delimiter efficiently? Steven D'Aprano <steve@pearwood.info> - 2016-02-27 20:49 +1100
Re: How to read from a file to an arbitrary delimiter efficiently? Chris Angelico <rosuav@gmail.com> - 2016-02-27 23:17 +1100
Re: How to read from a file to an arbitrary delimiter efficiently? Chris Angelico <rosuav@gmail.com> - 2016-02-27 23:18 +1100
Re: How to read from a file to an arbitrary delimiter efficiently? Serhiy Storchaka <storchaka@gmail.com> - 2016-02-27 17:23 +0200
Re: How to read from a file to an arbitrary delimiter efficiently? Paul Rubin <no.email@nospam.invalid> - 2016-02-24 23:48 -0800
Re: How to read from a file to an arbitrary delimiter efficiently? wxjmfauth@gmail.com - 2016-02-25 06:37 -0800
Re: How to read from a file to an arbitrary delimiter efficiently? wxjmfauth@gmail.com - 2016-02-25 06:38 -0800
Re: How to read from a file to an arbitrary delimiter efficiently? BartC <bc@freeuk.com> - 2016-02-27 16:35 +0000
Re: How to read from a file to an arbitrary delimiter efficiently? BartC <bc@freeuk.com> - 2016-02-27 20:03 +0000
Re: How to read from a file to an arbitrary delimiter efficiently? BartC <bc@freeuk.com> - 2016-02-27 20:28 +0000
Re: How to read from a file to an arbitrary delimiter efficiently? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2016-02-28 20:28 +0000
Re: How to read from a file to an arbitrary delimiter efficiently? Tim Delaney <timothy.c.delaney@gmail.com> - 2016-02-29 08:00 +1100
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2016-02-25 17:50 +1100 |
| Subject | How to read from a file to an arbitrary delimiter efficiently? |
| Message-ID | <56cea44e$0$11128$c3e8da3@news.astraweb.com> |
I have a need to read to an arbitrary delimiter, which might be any of a
(small) set of characters. For the sake of the exercise, let's say it is
either ! or ? (for example).
I want to read from files reasonably efficiently. I don't mind if there is a
little overhead, but my first attempt is 100 times slower than the built-in
"read to the end of the line" method.
Here is the function I came up with:
# Read a chunk of bytes/characters from an open file.
def chunkiter(f, delim):
    buffer = []
    b = f.read(1)
    while b:
        buffer.append(b)
        if b in delim:
            yield ''.join(buffer)
            buffer = []
        b = f.read(1)
    if buffer:
        yield ''.join(buffer)
And here is some test code showing how slow it is:
# Create a test file.
FILENAME = '/tmp/foo'
s = """\
abcdefghijklmnopqrstuvwxyz!
abcdefghijklmnopqrstuvwxyz?
""" * 500
with open(FILENAME, 'w') as f:
    f.write(s)
# Run some timing tests, comparing to reading lines from a file.
def readlines(f):
    f.seek(0)
    for line in f:
        pass
def readchunks(f):
    f.seek(0)
    for chunk in chunkiter(f, '!?'):
        pass
from timeit import Timer
SETUP = 'from __main__ import readlines, readchunks, FILENAME; '
SETUP += 'f = open(FILENAME)'  # the timed statements refer to f
t1 = Timer('readlines(f)', SETUP)
t2 = Timer('readchunks(f)', SETUP)
# Time them.
x = t1.repeat(number=10)  # Ignore the first run, in case of caching issues.
x = min(t1.repeat(number=1000, repeat=9))
y = t2.repeat(number=10)
y = min(t2.repeat(number=1000, repeat=9))
print('reading lines:', x, 'reading chunks:', y)
On my laptop, the results I get are:
reading lines: 0.22584209218621254 reading chunks: 21.716224210336804
Is there a better way to read chunks from a file up to one of a set of
arbitrary delimiters? Bonus for it working equally well with text and bytes.
(You can assume that the delimiters will be no more than one byte, or
character, each. E.g. "!" or "?", but never "!?" or "?!".)
--
Steve
| From | Wolfgang Maier <wolfgang.maier@biologie.uni-freiburg.de> |
|---|---|
| Date | 2016-02-25 08:37 +0100 |
| Message-ID | <mailman.115.1456385844.20994.python-list@python.org> |
| In reply to | #103480 |
On 25.02.2016 07:50, Steven D'Aprano wrote:
> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, lets say it is
> either ! or ? (for example).
>
You are not alone with your need.
http://bugs.python.org/issue1152248 discusses the problem and has some
code snippets that you may be interested in. While there is no trivial
solution there are certainly faster ways than your first attempt.
Wolfgang
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-02-27 21:40 +1100 |
| Message-ID | <56d17d13$0$1596$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #103482 |
On Thu, 25 Feb 2016 06:37 pm, Wolfgang Maier wrote:
> On 25.02.2016 07:50, Steven D'Aprano wrote:
>> I have a need to read to an arbitrary delimiter, which might be any of a
>> (small) set of characters. For the sake of the exercise, lets say it is
>> either ! or ? (for example).
>>
>
> You are not alone with your need.
>
> http://bugs.python.org/issue1152248 discusses the problem and has some
> code snippets that you may be interested in. While there is no trivial
> solution there are certainly faster ways than your first attempt.
Wow. Ten years and still no solution :-(
Thanks for finding the issue, but the solutions given don't suit my use
case. I don't want an iterator that operates on pre-read blocks, I want
something that will read a record from a file, and leave the file pointer
one entry past the end of the record.
Oh, and records are likely fairly short, but there may be a lot of them.
--
Steven
| From | Dan Sommers <dan@tombstonezero.net> |
|---|---|
| Date | 2016-02-27 14:40 +0000 |
| Message-ID | <nasch0$3kd$1@dont-email.me> |
| In reply to | #103574 |
On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano wrote:
> Thanks for finding the issue, but the solutions given don't suit my
> use case. I don't want an iterator that operates on pre-read blocks, I
> want something that will read a record from a file, and leave the file
> pointer one entry past the end of the record.
A file is a stream of bytes, but you want to view it as a stream of
records. It sounds like you want an abstraction layer, and it sounds
like you also want to let the file leak through that layer when it's
convenient.
(Yes, I spun that horribly on purpose, and I understand the use case of
imposing some structure on part of a file, and possibly a different
structure on a different part of a file. MIME messages and literate
programming files spring to mind.)
Perhaps (as I think ChrisA suggested), you could provide your own
buffering/chunking layer between your application and the file itself,
and never let the application see the file directly.
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2016-02-27 12:03 -0500 |
| Message-ID | <mailman.183.1456592632.20994.python-list@python.org> |
| In reply to | #103574 |
On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano <steve@pearwood.info>
declaimed the following:
>Wow. Ten years and still no solution :-(
>
>Thanks for finding the issue, but the solutions given don't suit my use
>case. I don't want an iterator that operates on pre-read blocks, I want
>something that will read a record from a file, and leave the file pointer
>one entry past the end of the record.
>
>Oh, and records are likely fairly short, but there may be a lot of them.
Considering that most of the world has settled on the view that files
are just linear streams (curse you, UNIX) anything working with "records"
has to build the concept on top of the stream. Either by making records
"fixed width" (allowing for fast random access: recNum*recLen => seek
position), though likely giving up the stream access... Or by wrapping the
stream with something that does parsing/buffering.
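As a minimal sketch of the fixed-width case (the 8-byte record width, the `read_record` name and the in-memory sample data are all illustrative assumptions, not anything from a real file format), the seek position is simply the record number times the record length:

```python
import io

RECORD_LEN = 8  # assumed fixed record width in bytes (illustrative only)

def read_record(f, rec_num, rec_len=RECORD_LEN):
    """Jump straight to record number rec_num and read it."""
    f.seek(rec_num * rec_len)   # recNum * recLen => seek position
    return f.read(rec_len)

# Three 8-byte records held in memory, standing in for a file on disk.
f = io.BytesIO(b"record00record01record02")
third = read_record(f, 2)
first = read_record(f, 0)
```

No scanning is needed to reach any record, which is what makes this layout fast for random access.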
Old days, in my world, the first was more common -- after all, the
"common" input method was 80-column Hollerith cards; records consisted of
reading one (or a set) of cards and then handling what was on that multiple
of 80 characters. My college computer system, by default, used an ISAM
structure for editor text files -- but that was a system where the ISAM
overhead was handled transparently by the OS, not a user-level linked
library (how many libraries are there for ISAM access in C?), so even
simple "type"/"print" commands properly displayed the contents.
The other format is the Pascal style counted-string saved as file
contents, in which each "record" is prefaced with a length code. While not
as fast as fixed-length records, it does allow for rather fast scanning of
a file by reading the length field, then seeking that many bytes further
before reading the next length. But again, the I/O library has to retain
knowledge of what the record length was, and how far into a record one has
advanced (if not doing full record I/O) so that one recognizes the next
length field.
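A rough sketch of the counted-record idea, assuming a 2-byte big-endian length prefix (the prefix format and the helper names `write_record`, `read_record` and `skip_record` are made up for illustration):

```python
import io
import struct

LEN_FMT = ">H"                        # assumed: 2-byte big-endian length prefix
LEN_SIZE = struct.calcsize(LEN_FMT)

def write_record(f, payload):
    """Write one record: its length, then its bytes."""
    f.write(struct.pack(LEN_FMT, len(payload)))
    f.write(payload)

def read_record(f):
    """Return the next record's payload, or None at EOF."""
    header = f.read(LEN_SIZE)
    if len(header) < LEN_SIZE:
        return None
    (length,) = struct.unpack(LEN_FMT, header)
    return f.read(length)

def skip_record(f):
    """Skip the next record without reading its payload; False at EOF."""
    header = f.read(LEN_SIZE)
    if len(header) < LEN_SIZE:
        return False
    (length,) = struct.unpack(LEN_FMT, header)
    f.seek(length, io.SEEK_CUR)       # hop over the payload
    return True

f = io.BytesIO()
for payload in (b"alpha", b"be", b"gamma!"):
    write_record(f, payload)
f.seek(0)
skip_record(f)
skip_record(f)
third = read_record(f)
at_eof = read_record(f)
```

Scanning only touches the length fields, so reaching the Nth record costs N short reads and seeks rather than a pass over every byte.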
I will admit that I miss the idea of OS support for higher level file
structures (even the TRS-80 had OS support for fixed length random access
files -- and not by wasting the rest of a disk sector; the OS did the
packing/unpacking of shorter records into the sectors).
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
| From | Marko Rauhamaa <marko@pacujo.net> |
|---|---|
| Date | 2016-02-27 19:47 +0200 |
| Message-ID | <87bn72f3k9.fsf@elektro.pacujo.net> |
| In reply to | #103594 |
Dennis Lee Bieber <wlfraed@ix.netcom.com>:
> On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano <steve@pearwood.info>
> declaimed the following:
>>Thanks for finding the issue, but the solutions given don't suit my
>>use case. I don't want an iterator that operates on pre-read blocks, I
>>want something that will read a record from a file, and leave the file
>>pointer one entry past the end of the record.
>>
>>Oh, and records are likely fairly short, but there may be a lot of them.
>
> Considering that most of the world has settled on the view that
> files are just linear streams (curse you, UNIX) anything working with
> "records" has to build the concept on top of the stream. Either by
> making records "fixed width" (allowing for fast random access:
> recNum*recLen => seek position), though likely giving up the stream
> access... Or by wrapping the stream with something that does
> parsing/buffering.
It may be instructive to see how the Linux/UNIX utility head(1)
operates. It actually reads its input greedily but once it has seen
enough, it uses lseek(2) to move the seek position back.
Not all file-like objects can seek so head(1) may fail to operate as
advertised:
========================================================================
$ seq 10000 >/tmp/data.txt
$ {
> head -n 5 >/dev/null
> head -n 5
> } </tmp/data.txt
6
7
8
9
10
$ cat /tmp/data.txt | {
> head -n 5 >/dev/null
> head -n 5
> }
1861
1862
1863
1864
$
========================================================================
Marko
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-25 18:30 +1100 |
| Message-ID | <mailman.116.1456385901.20994.python-list@python.org> |
| In reply to | #103480 |
On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
> # Read a chunk of bytes/characters from an open file.
> def chunkiter(f, delim):
>     buffer = []
>     b = f.read(1)
>     while b:
>         buffer.append(b)
>         if b in delim:
>             yield ''.join(buffer)
>             buffer = []
>         b = f.read(1)
>     if buffer:
>         yield ''.join(buffer)
How bad is it if you over-read? If it's absolutely critical that you
not read anything from the buffer that you shouldn't, then yeah, it's
going to be slow. But if you're never going to read the file using
anything other than this iterator, the best thing to do is to read
more at a time. Simple and naive method:
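import re
def chunkiter(f, delim):
    """Don't use [ or ] as the delimiter, kthx"""
    buffer = ""
    b = f.read(256)
    while b:
        buffer += b
        # Everything before the last delimiter is complete; keep the tail.
        # Note that re.split drops the delimiters themselves.
        *parts, buffer = re.split("["+delim+"]", buffer)
        yield from parts
        b = f.read(256)  # fetch the next block, or the loop never ends
    if buffer: yield buffer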
How well does that perform?
ChrisA
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-02-27 20:49 +1100 |
| Message-ID | <56d17138$0$1605$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #103483 |
On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>>
>> # Read a chunk of bytes/characters from an open file.
>> def chunkiter(f, delim):
>>     buffer = []
>>     b = f.read(1)
>>     while b:
>>         buffer.append(b)
>>         if b in delim:
>>             yield ''.join(buffer)
>>             buffer = []
>>         b = f.read(1)
>>     if buffer:
>>         yield ''.join(buffer)
>
> How bad is it if you over-read?
Pretty bad :-)
Ideally, I'd rather not over-read at all. I'd like the user to be able to
swap from "read N bytes" to "read to the next delimiter" (and possibly
even "read the next line") without losing anything.
If there's absolutely no other way to speed this up by at least a factor of
ten, I'll consider reading into a buffer and losing the ability to mix
different kinds of reads.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-27 23:17 +1100 |
| Message-ID | <mailman.173.1456575458.20994.python-list@python.org> |
| In reply to | #103572 |
On Sat, Feb 27, 2016 at 8:49 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>
>> On Thu, Feb 25, 2016 at 5:50 PM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info> wrote:
>>>
>>> # Read a chunk of bytes/characters from an open file.
>>> def chunkiter(f, delim):
>>>     buffer = []
>>>     b = f.read(1)
>>>     while b:
>>>         buffer.append(b)
>>>         if b in delim:
>>>             yield ''.join(buffer)
>>>             buffer = []
>>>         b = f.read(1)
>>>     if buffer:
>>>         yield ''.join(buffer)
>>
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.
If those are the *only* two operations, you should be able to maintain
your own buffer. Something like this:
import re
class ChunkIter:
    def __init__(self, f, delim):
        self.f = f
        self.delim = re.compile("["+delim+"]")
        self.buffer = ""
    def read_to_delim(self):
        """Return characters up to the next delim, or remaining chars,
        or "" if at EOF"""
        while "delimiter not found":
            *parts, self.buffer = self.delim.split(self.buffer, 1)
            if parts: return parts[0]
            b = self.f.read(256)
            if not b:
                # EOF: hand back what is left and empty the buffer, so the
                # next call returns "" instead of the same data again.
                ret, self.buffer = self.buffer, ""
                return ret
            self.buffer += b
    def read(self, nbytes):
        need = nbytes - len(self.buffer)
        if need > 0: self.buffer += self.f.read(need)
        # Slice by nbytes, not need: need can be zero or negative here.
        ret, self.buffer = self.buffer[:nbytes], self.buffer[nbytes:]
        return ret
It still might over-read from the underlying file, but those extra
chars will be available to the read(N) function.
ChrisA
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-27 23:18 +1100 |
| Message-ID | <mailman.174.1456575532.20994.python-list@python.org> |
| In reply to | #103572 |
On Sat, Feb 27, 2016 at 11:17 PM, Chris Angelico <rosuav@gmail.com> wrote:
>> Ideally, I'd rather not over-read at all. I'd like the user to be able to
>> swap from "read N bytes" to "read to the next delimiter" (and possibly
>> even "read the next line") without losing anything.
>
> If those are the *only* two operations, you should be able to maintain
> your own buffer.
And, I started out by thinking "to next delimiter" and "next line"
were the same thing with different delimiters, but then went and coded
the delimiter so that wouldn't work. Whatevs. If those are the only
*three* operations, the same class with one more method could do it.
ChrisA
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2016-02-27 17:23 +0200 |
| Message-ID | <mailman.178.1456586629.20994.python-list@python.org> |
| In reply to | #103572 |
On 27.02.16 11:49, Steven D'Aprano wrote:
> On Thu, 25 Feb 2016 06:30 pm, Chris Angelico wrote:
>> How bad is it if you over-read?
>
> Pretty bad :-)
>
> Ideally, I'd rather not over-read at all. I'd like the user to be able to
> swap from "read N bytes" to "read to the next delimiter" (and possibly
> even "read the next line") without losing anything.
>
>
> If there's absolutely no other way to speed this up by at least a factor of
> ten, I'll consider reading into a buffer and losing the ability to mix
> different kinds of reads.
If the file is buffered, you can use Chris's recipe, but with peek().
Otherwise you should fall back to the slow one-byte read.
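One way the peek() idea might look, as a sketch only: it assumes a binary file object that offers `peek()` (such as `io.BufferedReader`), and the function name `read_to_delim` and its `delims` parameter are illustrative. The delimiter is searched for in the bytes that are already buffered, and only the bytes up to and including it are actually consumed.

```python
import io
import re

def read_to_delim(f, delims):
    """Read from buffered binary file f up to and including the next
    delimiter byte, or to EOF if no delimiter remains."""
    pattern = re.compile(b"[" + re.escape(delims) + b"]")
    parts = []
    while True:
        window = f.peek(1)            # buffered bytes; position is unchanged
        if not window:                # nothing left: EOF
            break
        m = pattern.search(window)
        if m:
            parts.append(f.read(m.end()))   # consume through the delimiter
            break
        parts.append(f.read(len(window)))   # consume the window, look again
    return b"".join(parts)

f = io.BufferedReader(io.BytesIO(b"abc!def?gh"))
first = read_to_delim(f, b"!?")
mixed = f.read(2)                     # ordinary reads can be interleaved
second = read_to_delim(f, b"!?")
tail = read_to_delim(f, b"!?")
```

Because nothing beyond the delimiter is consumed, the file position stays usable for other kinds of reads. For an unbuffered or text-mode file there is no `peek()`, so this approach does not apply.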
| From | Paul Rubin <no.email@nospam.invalid> |
|---|---|
| Date | 2016-02-24 23:48 -0800 |
| Message-ID | <871t81w7pw.fsf@jester.gateway.pace.com> |
| In reply to | #103480 |
Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> while b:
> buffer.append(b)
This looks bad because of the overhead of list elements, and also the
reading of 1 char at a time. If it's bytes that you're reading, try
using bytearray instead of list:
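def chunkiter(f, delim):
    # f is opened in binary mode; delim is a bytes object such as b'!?'
    buf = bytearray()
    bufextend = buf.extend  # avoid an attribute lookup when calling
    fread = f.read          # similar
    while True:
        c = fread(1)
        if not c:           # EOF: stop instead of looping forever
            break
        bufextend(c)
        if c in delim:
            yield bytes(buf)
            del buf[:]
    if buf:                 # trailing data with no final delimiter
        yield bytes(buf)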
If that's still not fast enough, you could do a more hacky thing of
reading large chunks of input at once (f.read(4096) or whatever),
splitting on the delimiter set with re.split, and yielding the split
output, refilling the buffer when you don't find more delimiters. That
doesn't tell you what delimiters actually match: do you need that?
Maybe there is a nicer way to get at it than adding up the lengths of
the chunks to index into the buffer. How large do you expect the chunks
to be?
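A possible sketch of the large-chunk approach that also reports which delimiter matched. It assumes a text-mode file and that over-reading is acceptable; the name `chunks_with_delims` and the `blocksize` parameter are illustrative. It uses `finditer` rather than `re.split`, so each match object supplies the delimiter directly and no lengths need adding up.

```python
import io
import re

def chunks_with_delims(f, delims, blocksize=4096):
    """Yield (text, delimiter) pairs from text file f.
    The delimiter is '' for trailing data with no final delimiter."""
    pattern = re.compile("[" + re.escape(delims) + "]")
    buffer = ""
    while True:
        block = f.read(blocksize)
        if not block:
            break
        buffer += block
        start = 0
        for m in pattern.finditer(buffer):
            yield buffer[start:m.start()], m.group()
            start = m.end()
        buffer = buffer[start:]       # keep the incomplete tail
    if buffer:
        yield buffer, ""

# A tiny block size forces chunks to span several reads.
pairs = list(chunks_with_delims(io.StringIO("abc!de?fg"), "!?", blocksize=4))
```

This reads ahead of the last chunk it has yielded, so it only suits the case where nothing else reads from the same file.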
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-02-25 06:37 -0800 |
| Message-ID | <08020191-19cd-4c57-a408-b8ce48acee8d@googlegroups.com> |
| In reply to | #103484 |
:-)
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2016-02-25 06:38 -0800 |
| Message-ID | <83674e83-9fe3-4266-bdff-93705dc49e39@googlegroups.com> |
| In reply to | #103484 |
:-)
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-02-27 16:35 +0000 |
| Message-ID | <nasj2p$hec$1@dont-email.me> |
| In reply to | #103480 |
On 25/02/2016 06:50, Steven D'Aprano wrote:
> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, lets say it is
> either ! or ? (for example).
>
> # Read a chunk of bytes/characters from an open file.
> def chunkiter(f, delim):
>     buffer = []
>     b = f.read(1)
>     while b:
>         buffer.append(b)
>         if b in delim:
>             yield ''.join(buffer)
>             buffer = []
>         b = f.read(1)
>     if buffer:
>         yield ''.join(buffer)
At first sight, it's not surprising it's slow when you throw in
generators and whatnot in there.
However those aren't the main reasons for the poor speed. The limiting
factor here is reading one byte at a time. Just a loop like this:
while f.read(1):
    pass
without doing anything else, seems to take most of the time. (3.6
seconds, compared with 5.6 seconds of your readchunks() on a 6MB version
of your test file, on Python 2.7. readlines() took about 0.2 seconds.)
Any faster solutions would need to read more than one byte at a time.
(This bottleneck occurs in C too if you try to read a file using
only fgetc(), compared with any buffered solutions.)
--
bartc
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-02-27 20:03 +0000 |
| Message-ID | <nasv9k$ij8$1@dont-email.me> |
| In reply to | #103590 |
On 27/02/2016 16:35, BartC wrote:
> On 25/02/2016 06:50, Steven D'Aprano wrote:
>> I have a need to read to an arbitrary delimiter, which might be any of a
>> (small) set of characters. For the sake of the exercise, lets say it is
>> either ! or ? (for example).
> However those aren't the main reasons for the poor speed. The limiting
> factor here is reading one byte at a time. Just a loop like this:
>
> while f.read(1):
>     pass
>
> without doing anything else, seems to take most of the time. (3.6
> seconds, compared with 5.6 seconds of your readchunks() on a 6MB version
> of your test file, on Python 2.7. readlines() took about 0.2 seconds.)
>
> Any faster solutions would need to read more than one byte at a time.
I've done some more tests using Python 3.4, with the same 200,000 line
6MB test file:
0.25 seconds  Scan the file with 'for line in f'
2.25 seconds  Scan the file with your readlines() routine
4.0 seconds   Scan the file with your readchunks() routine
0.65 seconds  Scan the file using a buffer
This latter test uses a 64-byte buffer, reading not more than an extra
63 bytes, but resetting the file position to just past the end of each
identified chunk so that any subsequent read works as expected.
This test (the code is too untidy to post) only checks for two specific
delimiters (not an arbitrary string full of them). (It also counts EOF
as a valid delimiter so counts one more chunk.)
Increasing the buffer size doesn't help, and beyond 256 bytes slowed
things down (for this input) as it spends too long rereading data.
--
Bartc
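The buffered approach described above was not posted, so the following is only a guess at its general shape, not the code that produced those timings. It assumes a seekable file opened in binary mode and a `bytes` object of delimiters; `read_to_delim` and `bufsize` are illustrative names.

```python
import io

def read_to_delim(f, delims, bufsize=64):
    """Read from seekable binary file f up to and including the next
    delimiter byte, leaving the file position just after it."""
    parts = []
    while True:
        block = f.read(bufsize)
        if not block:                 # EOF
            break
        # Positions of each delimiter's first occurrence in this block.
        hits = [i for i in (block.find(bytes([d])) for d in delims) if i != -1]
        if hits:
            end = min(hits) + 1       # just past the earliest delimiter
            parts.append(block[:end])
            f.seek(end - len(block), io.SEEK_CUR)   # give back the over-read
            break
        parts.append(block)
    return b"".join(parts)

f = io.BytesIO(b"abc!def?gh")
first = read_to_delim(f, b"!?")
position_after_first = f.tell()
mixed = f.read(2)                     # other reads still see the right data
second = read_to_delim(f, b"!?")
tail = read_to_delim(f, b"!?")
```

A larger `bufsize` means fewer reads per chunk but more bytes re-read after each seek back, which fits the observation that bigger buffers stopped helping for short records.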
| From | BartC <bc@freeuk.com> |
|---|---|
| Date | 2016-02-27 20:28 +0000 |
| Message-ID | <nat0oe$ojt$1@dont-email.me> |
| In reply to | #103606 |
On 27/02/2016 20:03, BartC wrote:
> On 27/02/2016 16:35, BartC wrote:
>> Any faster solutions would need to read more than one byte at a time.
>
> I've done some more test using Python 3.4, with the same 200,000 line
> 6MB test file:
>
> 0.25 seconds Scan the file with 'for line in f'
> 2.25 seconds Scan the file with your readlines() routine
That's not right. 0.25 seconds was for readlines(). 2.25 for a f.read(1) loop.
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2016-02-28 20:28 +0000 |
| Message-ID | <mailman.24.1456691337.9760.python-list@python.org> |
| In reply to | #103480 |
On 25 February 2016 at 06:50, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
>
> I have a need to read to an arbitrary delimiter, which might be any of a
> (small) set of characters. For the sake of the exercise, lets say it is
> either ! or ? (for example).
>
> I want to read from files reasonably efficiently. I don't mind if there is a
> little overhead, but my first attempt is 100 times slower than the built-in
> "read to the end of the line" method.
You can get something much faster using mmap and searching for a
single delimiter:
import mmap
def readuntil(m, delim):
    start = m.tell()
    index = m.find(delim, start)
    if index == -1:
        return m.read()
    else:
        # Read through the delimiter; stopping just before it would make
        # the next call find it at once and return an empty chunk.
        return m.read(index - start + len(delim))
def readmmap(f):
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    f.seek(0)
    while True:
        chunk = readuntil(m, b'!')  # Note byte-string
        if not chunk:
            return
        # Do stuff with chunk
        pass
My timing makes that ~7x slower than iterating over the lines of the
file but still around 100x faster than reading individual characters.
I'm not sure how to generalise it to looking for multiple delimiters
without dropping back to reading individual characters though.
--
Oscar
| From | Tim Delaney <timothy.c.delaney@gmail.com> |
|---|---|
| Date | 2016-02-29 08:00 +1100 |
| Message-ID | <mailman.26.1456693224.9760.python-list@python.org> |
| In reply to | #103480 |
On 29 February 2016 at 07:28, Oscar Benjamin <oscar.j.benjamin@gmail.com> wrote:
> On 25 February 2016 at 06:50, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> >
> > I have a need to read to an arbitrary delimiter, which might be any of a
> > (small) set of characters. For the sake of the exercise, lets say it is
> > either ! or ? (for example).
> >
> > I want to read from files reasonably efficiently. I don't mind if there is a
> > little overhead, but my first attempt is 100 times slower than the built-in
> > "read to the end of the line" method.
>
> You can get something much faster using mmap and searching for a
> single delimiter:
>
> My timing makes that ~7x slower than iterating over the lines of the
> file but still around 100x faster than reading individual characters.
> I'm not sure how to generalise it to looking for multiple delimiters
> without dropping back to reading individual characters though.
>
You can use an mmapped file as the input for regular expressions. May or may
not be particularly efficient.
Otherwise, if reading from a file I think read a chunk, and seek() back to
the delimiter is probably going to be most efficient whilst leaving the file
position just after the delimiter.
If reading from a stream, I think Chris' read a chunk and maintain an
internal buffer, and don't give access to the underlying stream.
Tim Delaney
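A small sketch of the mmap-plus-regex suggestion, handling several delimiters at once. It is untimed, so whether it is efficient is left open; `iter_chunks` is an illustrative name, and the temporary file only stands in for real input. The `re` module accepts an mmap object because it supports the buffer protocol.

```python
import mmap
import re
import tempfile

def iter_chunks(buf, delims):
    """Yield successive chunks of the bytes-like object buf, each ending
    with one of the delimiter bytes (except possibly the last)."""
    pattern = re.compile(b"[" + re.escape(delims) + b"]")
    start = 0
    for match in pattern.finditer(buf):
        yield buf[start:match.end()]  # slicing an mmap returns bytes
        start = match.end()
    if start < len(buf):
        yield buf[start:]             # trailing data with no delimiter

with tempfile.TemporaryFile() as f:
    f.write(b"abc!def?gh")
    f.flush()
    m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunks = list(iter_chunks(m, b"!?"))
    m.close()
```

This walks the whole mapping as an iterator, so it does not by itself maintain a file position for mixing with other reads; it also needs a real file with a descriptor, not an arbitrary stream.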