Record seperator

Started by	greymaus <greymausg@mail.com>
First post	2011-08-26 18:39 +0000
Last post	2011-08-28 10:03 +0000
Articles	11 — 7 participants

  Record seperator greymaus <greymausg@mail.com> - 2011-08-26 18:39 +0000
    Re: Record seperator "D'Arcy J.M. Cain" <darcy@druid.net> - 2011-08-26 15:02 -0400
      Re: Record seperator greymaus <greymausg@mail.com> - 2011-08-27 16:59 +0000
        Re: Record seperator Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-08-28 03:24 +1000
          Re: Record seperator Roy Smith <roy@panix.com> - 2011-08-27 13:45 -0400
            Re: Record seperator ChasBrown <cbrown@cbrownsystems.com> - 2011-08-27 11:40 -0700
            Re: Record seperator Terry Reedy <tjreedy@udel.edu> - 2011-08-27 16:03 -0400
              Re: Record seperator Roy Smith <roy@panix.com> - 2011-08-27 17:07 -0400
                Re: Record seperator Terry Reedy <tjreedy@udel.edu> - 2011-08-27 20:55 -0400
            Re: Record seperator Chris Angelico <rosuav@gmail.com> - 2011-08-28 06:07 +1000
          Re: Record seperator greymaus <greymausg@mail.com> - 2011-08-28 10:03 +0000

#12237 — Record seperator

From	greymaus <greymausg@mail.com>
Date	2011-08-26 18:39 +0000
Subject	Record seperator
Message-ID	<slrnj5fo7u.4ra.greymausg@hmaus.org>

Is there an equivalent for the AWK RS in Python?


as in RS='\n\n'
will separate a file at two blank line intervals


-- 
maus
 .
  .
...   NO CARRIER

#12239

From	"D'Arcy J.M. Cain" <darcy@druid.net>
Date	2011-08-26 15:02 -0400
Message-ID	<mailman.451.1314385354.27778.python-list@python.org>
In reply to	#12237

On 26 Aug 2011 18:39:07 GMT
greymaus <greymausg@mail.com> wrote:
> 
> Is there an equivalent for the AWK RS in Python?
> 
> 
> as in RS='\n\n'
> will separate a file at two blank line intervals

open("file.txt").read().split("\n\n")
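
A minimal sketch of the same one-liner wrapped in a function so the file is closed promptly. The name read_records and its sep parameter are made up for illustration; they are not part of the standard library. Like the one-liner, it reads the whole file into memory before splitting.

```python
def read_records(path, sep="\n\n"):
    """Return the records of a text file as a list of strings.

    Hypothetical helper: reads the entire file, then splits on `sep`.
    """
    with open(path) as f:  # the with-block closes the file when done
        return f.read().split(sep)
```

Note that a trailing newline at the end of the file stays attached to the last record, exactly as it would with the bare split("\n\n").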

-- 
D'Arcy J.M. Cain <darcy@druid.net>         |  Democracy is three wolves
http://www.druid.net/darcy/                |  and a sheep voting on
+1 416 425 1212     (DoD#0082)    (eNTP)   |  what's for dinner.

#12274

From	greymaus <greymausg@mail.com>
Date	2011-08-27 16:59 +0000
Message-ID	<slrnj5i1g9.581.greymausg@hmaus.org>
In reply to	#12239

On 2011-08-26, D'Arcy J.M. Cain <darcy@druid.net> wrote:
> On 26 Aug 2011 18:39:07 GMT
> greymaus <greymausg@mail.com> wrote:
>> 
>> Is there an equivalent for the AWK RS in Python?
>> 
>> 
>> as in RS='\n\n'
>> will separate a file at two blank line intervals
>
> open("file.txt").read().split("\n\n")
>


Ta!.. bit awkard. :))))))


-- 
maus
 .
  .
...   NO CARRIER

#12278

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-08-28 03:24 +1000
Message-ID	<4e592852$0$29965$c3e8da3$5496439d@news.astraweb.com>
In reply to	#12274

greymaus wrote:

> On 2011-08-26, D'Arcy J.M. Cain <darcy@druid.net> wrote:
>> On 26 Aug 2011 18:39:07 GMT
>> greymaus <greymausg@mail.com> wrote:
>>> 
>>> Is there an equivalent for the AWK RS in Python?
>>> 
>>> 
>>> as in RS='\n\n'
>>> will separate a file at two blank line intervals
>>
>> open("file.txt").read().split("\n\n")
>>
> 
> 
> Ta!.. bit awkard. :))))))

Er, is that meant to be a pun? "Awk[w]ard", as in awk-ward?

In any case, no, the Python line might be a handful of characters longer
than the AWK equivalent, but it isn't awkward. It is logical and easy to
understand. It's embarrassingly easy to describe what it does:

open("file.txt")   # opens the file
 .read()           # reads the contents of the file
 .split("\n\n")    # splits the text on double-newlines.

The only tricky part is knowing that \n means newline, but anyone familiar
with C, Perl, AWK etc. should know that.

The Python code might be "long" (but only by the standards of AWK, which can
be painfully concise), but it is simple, obvious and readable. A few extra
characters is the price you pay for making your language readable. At the
cost of a few extra key presses, you get something that you will be able to
understand in 10 years' time.

AWK is a specialist text processing language. Python is a general scripting
and programming language. They have different values: AWK values short,
concise code; Python is willing to pay a little more in source code.

-- 
Steven

#12283

From	Roy Smith <roy@panix.com>
Date	2011-08-27 13:45 -0400
Message-ID	<roy-F7BDDC.13453127082011@news.panix.com>
In reply to	#12278

In article <4e592852$0$29965$c3e8da3$5496439d@news.astraweb.com>,
 Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> open("file.txt")   # opens the file
>  .read()           # reads the contents of the file
>  .split("\n\n")    # splits the text on double-newlines.

The biggest problem with this code is that read() slurps the entire file 
into a string.  That's fine for moderately sized files, but will fail 
(or at least be grossly inefficient) for very large files.

It's always annoyed me a little that while it's easy to iterate over the 
lines of a file, it's more complicated to iterate over a file character 
by character.  You could write your own generator to do that:

for c in getchar(open("file.txt")):
   whatever

def getchar(f):
   for line in f:
      for c in line:
         yield c

but that's annoyingly verbose (and probably not hugely efficient).

Of course, the next problem for the specific problem at hand is that 
even with an iterator over the characters of a file, split() only works 
on strings.  It would be nice to have a version of split which took an 
iterable and returned an iterator over the split components.  Maybe 
there is such a thing and I'm just missing it?
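
As far as I know the standard library has no such function, so here is a rough sketch of what one could look like. The name isplit is invented for this example. It accepts any iterable of strings (single characters or larger chunks) and yields the pieces lazily, with the same results as str.split(sep) on the concatenated text. It is not tuned for speed: the buffer is rescanned on every chunk.

```python
def isplit(chunks, sep):
    """Lazily split an iterable of strings (e.g. single characters) on `sep`."""
    buf = ""
    for chunk in chunks:
        buf += chunk
        # Emit every complete piece currently sitting in the buffer.
        while sep in buf:
            piece, buf = buf.split(sep, 1)
            yield piece
    # Whatever follows the last separator is the final piece.
    yield buf
```

Only one record plus the current chunk is held in memory at a time, so this also sidesteps the read() problem as long as individual records are of reasonable size.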

#12288

From	ChasBrown <cbrown@cbrownsystems.com>
Date	2011-08-27 11:40 -0700
Message-ID	<a116cc8d-cf6e-4643-8712-10c61cc413a1@u6g2000prc.googlegroups.com>
In reply to	#12283

On Aug 27, 10:45 am, Roy Smith <r...@panix.com> wrote:
> In article <4e592852$0$29965$c3e8da3$54964...@news.astraweb.com>,
>  Steven D'Aprano <steve+comp.lang.pyt...@pearwood.info> wrote:
>
> > open("file.txt")   # opens the file
> >  .read()           # reads the contents of the file
> >  .split("\n\n")    # splits the text on double-newlines.
>
> The biggest problem with this code is that read() slurps the entire file
> into a string.  That's fine for moderately sized files, but will fail
> (or at least be grossly inefficient) for very large files.
>
> It's always annoyed me a little that while it's easy to iterate over the
> lines of a file, it's more complicated to iterate over a file character
> by character.  You could write your own generator to do that:
>
> for c in getchar(open("file.txt")):
>    whatever
>
> def getchar(f):
>    for line in f:
>       for c in line:
>          yield c
>
> but that's annoyingly verbose (and probably not hugely efficient).

read() takes an optional size parameter; so f.read(1) is another
option...
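
One way to turn f.read(1) into a character iterator is the two-argument form of the built-in iter(), which keeps calling a function until it returns a sentinel value. This is only a sketch, and it assumes a file opened in text mode (for binary mode the sentinel would be b"" instead of "").

```python
from functools import partial


def getchar(f):
    """Iterate over a text-mode file one character at a time."""
    # iter(callable, sentinel) calls f.read(1) repeatedly and stops
    # when it returns "" (which is what read() returns at end of file).
    return iter(partial(f.read, 1), "")
```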

>
> Of course, the next problem for the specific problem at hand is that
> even with an iterator over the characters of a file, split() only works
> on strings.  It would be nice to have a version of split which took an
> iterable and returned an iterator over the split components.  Maybe
> there is such a thing and I'm just missing it?

I don't know if there is such a thing; but for the OP's problem you
could read the file in chunks, e.g.:

def readgroup(f, delim, buffsize=8192):
    tail=''
    while True:
        s = f.read(buffsize)
        if not s:
            yield tail
            break
        groups = (tail + s).split(delim)
        tail = groups[-1]
        for group in groups[:-1]:
            yield group

with open('file.txt') as f:
    for group in readgroup(f, '\n\n'):
        pass  # do something with each group

Cheers - Chas

#12291

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-08-27 16:03 -0400
Message-ID	<mailman.477.1314475482.27778.python-list@python.org>
In reply to	#12283

On 8/27/2011 1:45 PM, Roy Smith wrote:
> In article<4e592852$0$29965$c3e8da3$5496439d@news.astraweb.com>,
>   Steven D'Aprano<steve+comp.lang.python@pearwood.info>  wrote:
>
>> open("file.txt")   # opens the file
>>   .read()           # reads the contents of the file
>>   .split("\n\n")    # splits the text on double-newlines.
>
> The biggest problem with this code is that read() slurps the entire file
> into a string.  That's fine for moderately sized files, but will fail
> (or at least be grossly inefficient) for very large files.

I read the above as separating the file into paragraphs, as indicated by 
blank lines.

def paragraphs(file):
   para = []
   for line in file:
     if line:
       para.append(line)
     else:
       yield para # or ''.join(para), as desired
       para = []

-- 
Terry Jan Reedy

#12302

From	Roy Smith <roy@panix.com>
Date	2011-08-27 17:07 -0400
Message-ID	<roy-098B9B.17074827082011@news.panix.com>
In reply to	#12291

In article <mailman.477.1314475482.27778.python-list@python.org>,
 Terry Reedy <tjreedy@udel.edu> wrote:

> On 8/27/2011 1:45 PM, Roy Smith wrote:
> > In article<4e592852$0$29965$c3e8da3$5496439d@news.astraweb.com>,
> >   Steven D'Aprano<steve+comp.lang.python@pearwood.info>  wrote:
> >
> >> open("file.txt")   # opens the file
> >>   .read()           # reads the contents of the file
> >>   .split("\n\n")    # splits the text on double-newlines.
> >
> > The biggest problem with this code is that read() slurps the entire file
> > into a string.  That's fine for moderately sized files, but will fail
> > (or at least be grossly inefficient) for very large files.
> 
> I read the above as separating the file into paragraphs, as indicated by 
> blank lines.
> 
> def paragraphs(file):
>    para = []
>    for line in file:
>      if line:
>        para.append(line)
>      else:
>        yield para # or ''.join(para), as desired
>        para = []

Plus or minus the last paragraph in the file :-)

#12323

From	Terry Reedy <tjreedy@udel.edu>
Date	2011-08-27 20:55 -0400
Message-ID	<mailman.496.1314493010.27778.python-list@python.org>
In reply to	#12302

On 8/27/2011 5:07 PM, Roy Smith wrote:
> In article<mailman.477.1314475482.27778.python-list@python.org>,
>   Terry Reedy<tjreedy@udel.edu>  wrote:
>
>> On 8/27/2011 1:45 PM, Roy Smith wrote:
>>> In article<4e592852$0$29965$c3e8da3$5496439d@news.astraweb.com>,
>>>    Steven D'Aprano<steve+comp.lang.python@pearwood.info>   wrote:
>>>
>>>> open("file.txt")   # opens the file
>>>>    .read()           # reads the contents of the file
>>>>    .split("\n\n")    # splits the text on double-newlines.
>>>
>>> The biggest problem with this code is that read() slurps the entire file
>>> into a string.  That's fine for moderately sized files, but will fail
>>> (or at least be grossly inefficient) for very large files.
>>
>> I read the above as separating the file into paragraphs, as indicated by
>> blank lines.
>>
>> def paragraphs(file):
>>     para = []
>>     for line in file:
>>       if line:
>>         para.append(line)
>>       else:
>>         yield para # or ''.join(para), as desired
>>         para = []
>
> Plus or minus the last paragraph in the file :-)

Oh right, I forgot the last line, which is a repeat of the yield after 
the for loop finishes.
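
For reference, a sketch of the generator with that final yield added. It also changes one more thing that the version quoted above does not handle: lines read from a file keep their trailing "\n", so a blank line is "\n" rather than "" and a plain `if line:` is always true. Testing line.strip() is my assumption about the intent. Unlike split("\n\n"), this version treats whitespace-only lines as blank and collapses runs of blank lines.

```python
def paragraphs(lines):
    """Yield each blank-line-separated paragraph as a list of lines."""
    para = []
    for line in lines:
        if line.strip():   # non-blank: lines keep their "\n", so test the stripped text
            para.append(line)
        elif para:         # blank line: close the current paragraph, if there is one
            yield para
            para = []
    if para:               # the last paragraph has no blank line after it
        yield para
```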

-- 
Terry Jan Reedy

#12292

From	Chris Angelico <rosuav@gmail.com>
Date	2011-08-28 06:07 +1000
Message-ID	<mailman.478.1314475630.27778.python-list@python.org>
In reply to	#12283

On Sun, Aug 28, 2011 at 6:03 AM, Terry Reedy <tjreedy@udel.edu> wrote:
>      yield para # or ''.join(para), as desired
>

Or possibly '\n'.join(para) if you want to keep the line breaks inside
paragraphs.
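
A small illustration of the difference, assuming the lines come from iterating over a file (io.StringIO stands in for one here). Such lines already end in "\n", so ''.join keeps the line breaks; '\n'.join is the one to use if the newlines were stripped first.

```python
import io

# Stand-in for lines read from a real file; each keeps its trailing "\n".
lines = list(io.StringIO("first\nsecond\n"))

# The newlines are already present, so a plain join preserves them.
joined_as_is = "".join(lines)

# If the lines are stripped first, "\n".join puts the breaks back between them.
joined_stripped = "\n".join(line.rstrip("\n") for line in lines)
```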

ChrisA

#12336

From	greymaus <greymausg@mail.com>
Date	2011-08-28 10:03 +0000
Message-ID	<slrnj5il7v.8jc.greymausg@hmaus.org>
In reply to	#12278

On 2011-08-27, Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
> greymaus wrote:
>
>> On 2011-08-26, D'Arcy J.M. Cain <darcy@druid.net> wrote:
>>> On 26 Aug 2011 18:39:07 GMT
>>> greymaus <greymausg@mail.com> wrote:
>>>> 
>>>> Is there an equivalent for the AWK RS in Python?
>>>> 
>>>> 
>>>> as in RS='\n\n'
>>>> will separate a file at two blank line intervals
>>>
>>> open("file.txt").read().split("\n\n")
>>>
>> 
>> 
>> Ta!.. bit awkard. :))))))
>
> Er, is that meant to be a pun? "Awk[w]ard", as in awk-ward?

Yup, misspelled it and realized the error :)
>
> In any case, no, the Python line might be a handful of characters longer
> than the AWK equivalent, but it isn't awkward. It is logical and easy to
> understand. It's embarrassingly easy to describe what it does:
>
> open("file.txt")   # opens the file
>  .read()           # reads the contents of the file
>  .split("\n\n")    # splits the text on double-newlines.
>
> The only tricky part is knowing that \n means newline, but anyone familiar
> with C, Perl, AWK etc. should know that.
>
> The Python code might be "long" (but only by the standards of AWK, which can
> be painfully concise), but it is simple, obvious and readable. A few extra
> characters is the price you pay for making your language readable. At the
> cost of a few extra key presses, you get something that you will be able to
> understand in 10 years' time.
>
> AWK is a specialist text processing language. Python is a general scripting
> and programming language. They have different values: AWK values short,
> concise code; Python is willing to pay a little more in source code.
>
>

RS, and its Perl equivalent, which I forget, mean that you can read in
full multiline records. 

(I am coming into Python via Perl from AWK, and trying to get a grip
on the language and its idioms)

Thanks to All

Oh, AWK is far more than a text processing language; it may be old (like me!)
but it is useful (ditto)



-- 
maus
 .
  .
...   NO CARRIER
