From: Dennis Lee Bieber
Newsgroups: comp.lang.python
Subject: Re: How to read from a file to an arbitrary delimiter efficiently?
Date: Sat, 27 Feb 2016 12:03:58 -0500
Organization: IISS Elusive Unicorn
References: <56cea44e$0$11128$c3e8da3@news.astraweb.com> <56d17d13$0$1596$c3e8da3$5496439d@news.astraweb.com>

On Sat, 27 Feb 2016 21:40:17 +1100, Steven D'Aprano declaimed the following:

>Wow. Ten years and still no solution :-(
>
>Thanks for finding the issue, but the solutions given don't suit my use
>case. I don't want an iterator that operates on pre-read blocks, I want
>something that will read a record from a file, and leave the file pointer
>one entry past the end of the record.
>
>Oh, and records are likely fairly short, but there may be a lot of them.

Considering that most of the world has settled on the view that files
are just linear streams (curse you, UNIX), anything working with "records"
has to build the concept on top of the stream. Either by making records
"fixed width" (allowing for fast random access: recNum*recLen => seek
position), though likely giving up the stream access... Or by wrapping the
stream with something that does parsing/buffering.
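Both approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `read_fixed_record` and `read_until`, the 16-byte record length, and the chunk size are hypothetical and not part of the standard library or of anything proposed in this thread. The second helper assumes a seekable file opened in binary mode; it reads in chunks and then seeks back so the file position ends up just past the delimiter.

```python
import io

REC_LEN = 16  # assumed fixed record length in bytes (illustrative)


def read_fixed_record(f, rec_num, rec_len=REC_LEN):
    """Fixed-width records: record N starts at byte N * rec_len."""
    f.seek(rec_num * rec_len)
    data = f.read(rec_len)
    if len(data) != rec_len:
        raise EOFError("no complete record %d" % rec_num)
    return data


def read_until(f, delim, chunk_size=4096):
    """Read one delimited record from a seekable binary file.

    Returns the record without the delimiter and leaves the file
    position just past the delimiter. If EOF is reached first, the
    remaining bytes (possibly b"") are returned.
    """
    start = f.tell()
    buf = bytearray()
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            return bytes(buf)  # hit EOF; position is at end of file
        # Re-scan the last len(delim) - 1 bytes already buffered so a
        # delimiter split across two chunks is still found.
        search_from = max(0, len(buf) - len(delim) + 1)
        buf += chunk
        idx = buf.find(delim, search_from)
        if idx != -1:
            # Undo the over-read: reposition right after the delimiter.
            f.seek(start + idx + len(delim))
            return bytes(buf[:idx])
```

The seek-back is what keeps the file pointer where the caller expects it, at the cost of re-reading a little data for each record; with many short records a smaller `chunk_size` would probably waste less work.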
In the old days, in my world, the first was more common -- after all,
the "common" input method was 80-column Hollerith cards; records consisted
of reading one (or a set) of cards and then handling what was on that
multiple of 80 characters. My college computer system, by default, used an
ISAM structure for editor text files -- but that was a system where the
ISAM overhead was handled transparently by the OS, not by a user-level
linked library (how many libraries are there for ISAM access in C?), so
even simple "type"/"print" commands properly displayed the contents.

The other format is the Pascal-style counted string saved as file
contents, in which each "record" is prefaced with a length code. While not
as fast as fixed-length records, it does allow for rather fast scanning of
a file by reading the length field, then seeking that many bytes further
before reading the next length. But again, the I/O library has to retain
knowledge of what the record length was, and how far into a record one has
advanced (if not doing full record I/O), so that one recognizes the next
length field.

I will admit that I miss the idea of OS support for higher-level file
structures (even the TRS-80 had OS support for fixed-length random access
files -- and not by wasting the rest of a disk sector; the OS did the
packing/unpacking of shorter records into the sectors).

-- 
Wulfraed    Dennis Lee Bieber    AF6VN
wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/
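For the counted-string layout described above, a rough Python sketch might look like the following. The 4-byte big-endian length prefix and the helper names `write_record`, `read_record`, and `skip_record` are assumptions made for illustration; the post does not specify a prefix size or an API. It only handles whole-record I/O, so it sidesteps the bookkeeping needed when a caller stops partway through a record.

```python
import io
import struct

LEN_FMT = ">I"  # assumed: 4-byte big-endian unsigned length prefix
LEN_SIZE = struct.calcsize(LEN_FMT)


def write_record(f, payload):
    """Write one record as a length field followed by the payload bytes."""
    f.write(struct.pack(LEN_FMT, len(payload)))
    f.write(payload)


def _read_length(f):
    """Return the next length field, or None at a clean end of file."""
    header = f.read(LEN_SIZE)
    if not header:
        return None
    if len(header) != LEN_SIZE:
        raise EOFError("truncated length field")
    (length,) = struct.unpack(LEN_FMT, header)
    return length


def read_record(f):
    """Read the next record; return None at end of file."""
    length = _read_length(f)
    if length is None:
        return None
    payload = f.read(length)
    if len(payload) != length:
        raise EOFError("truncated record")
    return payload


def skip_record(f):
    """Skip a record by seeking past its payload; return False at EOF."""
    length = _read_length(f)
    if length is None:
        return False
    f.seek(length, io.SEEK_CUR)
    return True
```

Scanning to the Nth record is then N calls to `skip_record`, each touching only the length field rather than the payload.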