
Splitting a file from specific column content

Started by	Yigit Turgut <y.turgut@gmail.com>
First post	2012-01-22 06:32 -0800
Last post	2012-01-22 20:55 +0000
Articles	15 — 6 participants


  Splitting a file from specific column content Yigit Turgut <y.turgut@gmail.com> - 2012-01-22 06:32 -0800
    Re: Splitting a file from specific column content Roy Smith <roy@panix.com> - 2012-01-22 09:45 -0500
      Re: Splitting a file from specific column content Roy Smith <roy@panix.com> - 2012-01-22 14:26 -0500
      Re: Splitting a file from specific column content Tim Chase <python.list@tim.thechases.com> - 2012-01-22 13:34 -0600
      Re: Splitting a file from specific column content Roy Smith <roy@panix.com> - 2012-01-22 14:37 -0500
        Re: Splitting a file from specific column content Yigit Turgut <y.turgut@gmail.com> - 2012-01-22 12:16 -0800
    Re: Splitting a file from specific column content MRAB <python@mrabarnett.plus.com> - 2012-01-22 15:19 +0000
    Re: Splitting a file from specific column content Arnaud Delobelle <arnodel@gmail.com> - 2012-01-22 15:39 +0000
      Re: Splitting a file from specific column content Yigit Turgut <y.turgut@gmail.com> - 2012-01-22 08:17 -0800
        Re: Splitting a file from specific column content MRAB <python@mrabarnett.plus.com> - 2012-01-22 16:56 +0000
          Re: Splitting a file from specific column content Yigit Turgut <y.turgut@gmail.com> - 2012-01-22 09:47 -0800
      Re: Splitting a file from specific column content Eelco <hoogendoorn.eelco@gmail.com> - 2012-01-22 12:43 -0800
    Re: Splitting a file from specific column content MRAB <python@mrabarnett.plus.com> - 2012-01-22 16:09 +0000
    Re: Splitting a file from specific column content Arnaud Delobelle <arnodel@gmail.com> - 2012-01-22 19:58 +0000
    Re: Splitting a file from specific column content MRAB <python@mrabarnett.plus.com> - 2012-01-22 20:55 +0000

#19213 — Splitting a file from specific column content

From	Yigit Turgut <y.turgut@gmail.com>
Date	2012-01-22 06:32 -0800
Subject	Splitting a file from specific column content
Message-ID	<e1f0636a-195c-4fbb-931a-4d619d5f0d18@g27g2000yqa.googlegroups.com>

Hi all,

I have a text file of approximately 20 MB that contains about one
million lines. I was doing some processing on the data, but then the
data rate increased and it now takes a very long time to process. I
import it using numpy.loadtxt; here is a fragment of the data:

0.000006 	 -0.0004
0.000071 	 0.0028
0.000079 	 0.0044
0.000086 	 0.0104
.
.
.

The first column is the timestamp in seconds and the second column is
the data. The file contains 8 seconds of measurement, and I would like
to split it into 3 parts at specific time locations: the first part
containing 3 seconds of data, the second containing 2 seconds and the
third containing 3 seconds. Splitting based on file size doesn't work
accurately for this data; some columns go missing, etc. I need to
split depending on the column content:

1 - read the file until the first character of column1 is 3 (3 seconds)
2 - save this region to another file
3 - read the file where the first character of column1 is between 3 and
5 (2 seconds)
4 - save this region to another file
5 - read the file where the first character of column1 is between 5 and
8 (3 seconds)
6 - save this region to another file

I need to do this exactly because numpy.loadtxt and genfromtxt don't
cope well with missing columns / rows. I even tried the invalid_raise
parameter of genfromtxt but no luck.

I am sure it's a few lines of code for experienced users and I would
appreciate some guidance.


#19214

From	Roy Smith <roy@panix.com>
Date	2012-01-22 09:45 -0500
Message-ID	<roy-125359.09450822012012@news.panix.com>
In reply to	#19213

In article 
<e1f0636a-195c-4fbb-931a-4d619d5f0d18@g27g2000yqa.googlegroups.com>,
 Yigit Turgut <y.turgut@gmail.com> wrote:

> Hi all,
> 
> I have a text file approximately 20mb in size and contains about one
> million lines. I was doing some processing on the data but then the
> data rate increased and it takes very long time to process. I import
> using numpy.loadtxt, here is a fragment of the data ;
> 
> 0.000006 	 -0.0004
> 0.000071 	 0.0028
> 0.000079 	 0.0044
> 0.000086 	 0.0104
> .
> .
> .
> 
> First column is the timestamp in seconds and second column is the
> data. File contains 8seconds of measurement, and I would like to be
> able to split the file into 3 parts seperated from specific time
> locations. For example I want to divide the file into 3 parts, first
> part containing 3 seconds of data, second containing 2 seconds of data
> and third containing 3 seconds.

I would do this with standard unix tools:

grep '^[012]' input.txt > first-three-seconds.txt
grep '^[34]' input.txt > next-two-seconds.txt
grep '^[567]' input.txt > next-three-seconds.txt

Sure, it makes three passes over the data, but for 20 MB of data, you 
could have the whole job done in less time than it took me to type this.

As a sanity check, I would run "wc -l" on each of the files and confirm 
that they add up to the original line count.
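
For readers who would rather stay in Python, the same prefix-based split and the line-count check can be sketched in a single pass. This is only an illustration: the function name `split_by_first_char` and the bucket table are hypothetical, and like the grep patterns it looks only at the first character of each line.

```python
def split_by_first_char(input_path, buckets):
    """Copy each line to the file whose prefix set contains its first character.

    buckets maps an output path to a string of accepted first characters,
    e.g. {"first-three-seconds.txt": "012"} (hypothetical names).
    Returns (total_lines, {output_path: lines_written}).
    """
    total = 0
    counts = {path: 0 for path in buckets}
    outputs = {path: open(path, "w") for path in buckets}
    try:
        with open(input_path) as input_file:
            for line in input_file:
                total += 1
                for path, prefixes in buckets.items():
                    # tuple("012") == ("0", "1", "2"), so this mirrors ^[012]
                    if line.startswith(tuple(prefixes)):
                        outputs[path].write(line)
                        counts[path] += 1
                        break
    finally:
        for handle in outputs.values():
            handle.close()
    return total, counts
```

If `sum(counts.values())` comes out smaller than `total`, some lines matched none of the prefixes, which is exactly what the "wc -l" check is meant to catch.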


#19228

From	Roy Smith <roy@panix.com>
Date	2012-01-22 14:26 -0500
Message-ID	<mailman.4933.1327260402.27778.python-list@python.org>
In reply to	#19214

I stand humbled.

On Jan 22, 2012, at 2:25 PM, Tim Chase wrote:

> On 01/22/12 08:45, Roy Smith wrote:
>> I would do this with standard unix tools:
>> 
>> grep '^[012]' input.txt > first-three-seconds.txt
>> grep '^[34]' input.txt > next-two-seconds.txt
>> grep '^[567]' input.txt > next-three-seconds.txt
>> 
>> Sure, it makes three passes over the data, but for 20 MB of data, you
>> could have the whole job done in less time than it took me to type this.
> 
> 
> If you wanted to do it in one pass using standard unix tools, you can use:
> 
> sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w next-two.txt' -e'/^[5-7]/w next-three.txt'
> 
> -tkc
> 
> 
> 


--
Roy Smith
roy@panix.com


#19229

From	Tim Chase <python.list@tim.thechases.com>
Date	2012-01-22 13:34 -0600
Message-ID	<mailman.4934.1327260861.27778.python-list@python.org>
In reply to	#19214

On 01/22/12 13:26, Roy Smith wrote:
>> If you wanted to do it in one pass using standard unix
>> tools, you can use:
>>
>> sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w
>> next-two.txt' -e'/^[5-7]/w next-three.txt'
>>
> I stand humbled.

In all likelihood, you stand *younger*, not so much humbled ;-)

-tkc


#19230

From	Roy Smith <roy@panix.com>
Date	2012-01-22 14:37 -0500
Message-ID	<mailman.4935.1327261051.27778.python-list@python.org>
In reply to	#19214

On Jan 22, 2012, at 2:34 PM, Tim Chase wrote:

> On 01/22/12 13:26, Roy Smith wrote:
>>> If you wanted to do it in one pass using standard unix
>>> tools, you can use:
>>> 
>>> sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w
>>> next-two.txt' -e'/^[5-7]/w next-three.txt'
>>> 
>> I stand humbled.
> 
> In all likelyhood, you stand *younger*, not so much humbled ;-)


Oh, yeah?  That must explain my grey hair and bifocals.   I go back to Unix v6 in 1977.  Humbled it is.



--
Roy Smith
roy@panix.com


#19233

From	Yigit Turgut <y.turgut@gmail.com>
Date	2012-01-22 12:16 -0800
Message-ID	<61ba8425-a793-4141-adf9-212cc01233f5@cf6g2000vbb.googlegroups.com>
In reply to	#19230

On Jan 22, 9:37 pm, Roy Smith <r...@panix.com> wrote:
> On Jan 22, 2012, at 2:34 PM, Tim Chase wrote:
>
> > On 01/22/12 13:26, Roy Smith wrote:
> >>> If you wanted to do it in one pass using standard unix
> >>> tools, you can use:
>
> >>> sed -n -e'/^[0-2]/w first-three.txt' -e'/^[34]/w
> >>> next-two.txt' -e'/^[5-7]/w next-three.txt'
>
> >> I stand humbled.
>
> > In all likelyhood, you stand *younger*, not so much humbled ;-)
>
> Oh, yeah?  That must explain my grey hair and bifocals.   I go back to Unix v6 in 1977.  Humbled it is.

Those times were much better IMHO (:


#19215

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-22 15:19 +0000
Message-ID	<mailman.4923.1327245574.27778.python-list@python.org>
In reply to	#19213

On 22/01/2012 14:32, Yigit Turgut wrote:
> Hi all,
>
> I have a text file approximately 20mb in size and contains about one
> million lines. I was doing some processing on the data but then the
> data rate increased and it takes very long time to process. I import
> using numpy.loadtxt, here is a fragment of the data ;
>
> 0.000006 	 -0.0004
> 0.000071 	 0.0028
> 0.000079 	 0.0044
> 0.000086 	 0.0104
> .
> .
> .
>
> First column is the timestamp in seconds and second column is the
> data. File contains 8seconds of measurement, and I would like to be
> able to split the file into 3 parts seperated from specific time
> locations. For example I want to divide the file into 3 parts, first
> part containing 3 seconds of data, second containing 2 seconds of data
> and third containing 3 seconds. Splitting based on file size doesn't
> work that accurately for this specific data, some columns become
> missing and etc. I need to split depending on the column content ;
>
> 1 - read file until first character of column1 is 3 (3 seconds)
> 2 - save this region to another file
> 3 - read the file where first characters  of column1 are between 3 to
> 5 (2 seconds)
> 4 - save this region to another file
> 5 - read the file where first characters  of column1 are between 5 to
> 5 (3 seconds)
> 6 - save this region to another file
>
> I need to do this exactly because numpy.loadtxt or genfromtxt doesn't
> get well with missing columns / rows. I even tried the invalidraise
> parameter of genfromtxt but no luck.
>
> I am sure it's a few lines of code for experienced users and I would
> appreciate some guidance.
>
Here's a solution in Python 3:

input_path = "..."
section_1_path = "..."
section_2_path = "..."
section_3_path = "..."

with open(input_path) as input_file:
     try:
         line = next(input_file)

         # Copy section 1.
         with open(section_1_path, "w") as output_file:
             while line[0] < "3":
                 output_file.write(line)
                 line = next(input_file)

         # Copy section 2.
         with open(section_2_path, "w") as output_file:
             while line[5] < "5":
                 output_file.write(line)
                 line = next(input_file)

         # Copy section 3.
         with open(section_3_path, "w") as output_file:
             while True:
                 output_file.write(line)
                 line = next(input_file)
     except StopIteration:
         pass


#19216

From	Arnaud Delobelle <arnodel@gmail.com>
Date	2012-01-22 15:39 +0000
Message-ID	<mailman.4924.1327246798.27778.python-list@python.org>
In reply to	#19213

On 22 January 2012 15:19, MRAB <python@mrabarnett.plus.com> wrote:

> Here's a solution in Python 3:
>
> input_path = "..."
> section_1_path = "..."
> section_2_path = "..."
> section_3_path = "..."
>
> with open(input_path) as input_file:
>    try:
>        line = next(input_file)
>
>        # Copy section 1.
>        with open(section_1_path, "w") as output_file:
>            while line[0] < "3":
>                output_file.write(line)
>                line = next(input_file)
>
>        # Copy section 2.
>        with open(section_2_path, "w") as output_file:
>            while line[5] < "5":
>                output_file.write(line)
>                line = next(input_file)
>
>        # Copy section 3.
>        with open(section_3_path, "w") as output_file:
>            while True:
>                output_file.write(line)
>                line = next(input_file)
>    except StopIteration:
>        pass
> --
> http://mail.python.org/mailman/listinfo/python-list

Or more succinctly (but not tested):


sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    lines = iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)

-- 
Arnaud


#19220

From	Yigit Turgut <y.turgut@gmail.com>
Date	2012-01-22 08:17 -0800
Message-ID	<849e46d1-b3bb-481d-8a8e-17cb51b0523f@cf6g2000vbb.googlegroups.com>
In reply to	#19216

On Jan 22, 4:45 pm, Roy Smith <r...@panix.com> wrote:
> In article
> <e1f0636a-195c-4fbb-931a-4d619d5f0...@g27g2000yqa.googlegroups.com>,
>  Yigit Turgut <y.tur...@gmail.com> wrote:

> > Hi all,
>
> > I have a text file approximately 20mb in size and contains about one
> > million lines. I was doing some processing on the data but then the
> > data rate increased and it takes very long time to process. I import
> > using numpy.loadtxt, here is a fragment of the data ;
>
> > 0.000006    -0.0004
> > 0.000071    0.0028
> > 0.000079    0.0044
> > 0.000086    0.0104
> > .
> > .
> > .
>
> > First column is the timestamp in seconds and second column is the
> > data. File contains 8seconds of measurement, and I would like to be
> > able to split the file into 3 parts seperated from specific time
> > locations. For example I want to divide the file into 3 parts, first
> > part containing 3 seconds of data, second containing 2 seconds of data
> > and third containing 3 seconds.
>
> I would do this with standard unix tools:
>
> grep '^[012]' input.txt > first-three-seconds.txt
> grep '^[34]' input.txt > next-two-seconds.txt
> grep '^[567]' input.txt > next-three-seconds.txt
>
> Sure, it makes three passes over the data, but for 20 MB of data, you
> could have the whole job done in less time than it took me to type this.
>
> As a sanity check, I would run "wc -l" on each of the files and confirm
> that they add up to the original line count.

This works and is very fast but it missed a few hundred lines
unfortunately.
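
One way to see which lines were skipped is to collect everything that none of the three patterns match. A minimal sketch, assuming the same `^[0-7]` prefixes as the grep commands; the function name and the sample lines are made up for illustration. Likely culprits would be lines with leading whitespace or a timestamp of 8 seconds, though that is only a guess.

```python
import re

# Union of the three grep patterns: ^[012], ^[34] and ^[567].
MATCHED = re.compile(r"^[0-7]")

def unmatched_lines(lines):
    """Return (line_number, line) pairs that none of the grep patterns select."""
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if not MATCHED.match(line)]

# Made-up sample: one normal line, one with a leading space, one at 8 seconds.
sample = ["0.000006\t-0.0004\n", " 3.000001\t0.0028\n", "8.000000\t0.0044\n"]
missed = unmatched_lines(sample)
```

Passing an open file object instead of `sample` would list the offending lines of the real data together with their line numbers.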

On Jan 22, 5:19 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 22/01/2012 14:32, Yigit Turgut wrote:
> > Hi all,
>
> > I have a text file approximately 20mb in size and contains about one
> > million lines. I was doing some processing on the data but then the
> > data rate increased and it takes very long time to process. I import
> > using numpy.loadtxt, here is a fragment of the data ;
>
> > 0.000006    -0.0004
> > 0.000071    0.0028
> > 0.000079    0.0044
> > 0.000086    0.0104
> > .
> > .
> > .
>
> > First column is the timestamp in seconds and second column is the
> > data. File contains 8seconds of measurement, and I would like to be
> > able to split the file into 3 parts seperated from specific time
> > locations. For example I want to divide the file into 3 parts, first
> > part containing 3 seconds of data, second containing 2 seconds of data
> > and third containing 3 seconds. Splitting based on file size doesn't
> > work that accurately for this specific data, some columns become
> > missing and etc. I need to split depending on the column content ;
>
> > 1 - read file until first character of column1 is 3 (3 seconds)
> > 2 - save this region to another file
> > 3 - read the file where first characters  of column1 are between 3 to
> > 5 (2 seconds)
> > 4 - save this region to another file
> > 5 - read the file where first characters  of column1 are between 5 to
> > 5 (3 seconds)
> > 6 - save this region to another file
>
> > I need to do this exactly because numpy.loadtxt or genfromtxt doesn't
> > get well with missing columns / rows. I even tried the invalidraise
> > parameter of genfromtxt but no luck.
>
> > I am sure it's a few lines of code for experienced users and I would
> > appreciate some guidance.
>
> Here's a solution in Python 3:
>
> input_path = "..."
> section_1_path = "..."
> section_2_path = "..."
> section_3_path = "..."
>
> with open(input_path) as input_file:
>      try:
>          line = next(input_file)
>
>          # Copy section 1.
>          with open(section_1_path, "w") as output_file:
>              while line[0] < "3":
>                  output_file.write(line)
>                  line = next(input_file)
>
>          # Copy section 2.
>          with open(section_2_path, "w") as output_file:
>              while line[5] < "5":
>                  output_file.write(line)
>                  line = next(input_file)
>
>          # Copy section 3.
>          with open(section_3_path, "w") as output_file:
>              while True:
>                  output_file.write(line)
>                  line = next(input_file)
>      except StopIteration:
>          pass

With the following correction ;

while line[5] < "5":
should be
while line[0] < "5":

This works well.

On Jan 22, 5:39 pm, Arnaud Delobelle <arno...@gmail.com> wrote:
> On 22 January 2012 15:19, MRAB <pyt...@mrabarnett.plus.com> wrote:
> > Here's a solution in Python 3:
>
> > input_path = "..."
> > section_1_path = "..."
> > section_2_path = "..."
> > section_3_path = "..."
>
> > with open(input_path) as input_file:
> >    try:
> >        line = next(input_file)
>
> >        # Copy section 1.
> >        with open(section_1_path, "w") as output_file:
> >            while line[0] < "3":
> >                output_file.write(line)
> >                line = next(input_file)
>
> >        # Copy section 2.
> >        with open(section_2_path, "w") as output_file:
> >            while line[5] < "5":
> >                output_file.write(line)
> >                line = next(input_file)
>
> >        # Copy section 3.
> >        with open(section_3_path, "w") as output_file:
> >            while True:
> >                output_file.write(line)
> >                line = next(input_file)
> >    except StopIteration:
> >        pass
> > --
> >http://mail.python.org/mailman/listinfo/python-list
>
> Or more succintly (but not tested):
>
> sections = [
>     ("3", "section_1")
>     ("5", "section_2")
>     ("\xFF", "section_3")
> ]
>
> with open(input_path) as input_file:
>     lines = iter(input_file)
>     for end, path in sections:
>         with open(path, "w") as output_file:
>             for line in lines:
>                 if line >= end:
>                     break
>                 output_file.write(line)
>
> --
> Arnaud

Good idea, especially when dealing with a variable number of sections.
But somehow I got:

    ("5", "section_2")
TypeError: 'tuple' object is not callable


#19223

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-22 16:56 +0000
Message-ID	<mailman.4928.1327251361.27778.python-list@python.org>
In reply to	#19220

On 22/01/2012 16:17, Yigit Turgut wrote:
[snip]
> On Jan 22, 5:39 pm, Arnaud Delobelle<arno...@gmail.com>  wrote:
[snip]
>>  Or more succintly (but not tested):
>>
>>  sections = [
>>      ("3", "section_1")
>>      ("5", "section_2")
>>      ("\xFF", "section_3")
>>  ]
>>
>>  with open(input_path) as input_file:
>>      lines = iter(input_file)
>>      for end, path in sections:
>>          with open(path, "w") as output_file:
>>              for line in lines:
>>                  if line >= end:
>>                      break
>>                  output_file.write(line)
>>
>>  --
>>  Arnaud
>
> Good idea. Especially when dealing with variable numbers of sections.
> But somehow  I got ;
>
>      ("5", "section_2")
> TypeError: 'tuple' object is not callable
>
That's due to missing commas:

sections = [
     ("3", "section_1"),
     ("5", "section_2"),
     ("\xFF", "section_3")
]
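
The error message makes sense once you see how Python reads the list: without a comma, ("3", "section_1") ("5", "section_2") is parsed as a call of the first tuple with two arguments. A small sketch that reproduces it (Python 3.8 and later also emit a SyntaxWarning for the literal form, which is why the call is spelled out through a variable here):

```python
first = ("3", "section_1")

try:
    # Equivalent to writing ("3", "section_1") ("5", "section_2") with no comma.
    first("5", "section_2")
except TypeError as exc:
    message = str(exc)
```

Here `message` ends up holding the same text as the traceback quoted above.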


#19225

From	Yigit Turgut <y.turgut@gmail.com>
Date	2012-01-22 09:47 -0800
Message-ID	<729dc0f2-6bb8-4331-a13c-1cb5924519e4@o14g2000vbo.googlegroups.com>
In reply to	#19223

On Jan 22, 6:56 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> On 22/01/2012 16:17, Yigit Turgut wrote:
> [snip]
>
>
>
>
>
>
>
> > On Jan 22, 5:39 pm, Arnaud Delobelle<arno...@gmail.com>  wrote:
> [snip]
> >>  Or more succintly (but not tested):
>
> >>  sections = [
> >>      ("3", "section_1")
> >>      ("5", "section_2")
> >>      ("\xFF", "section_3")
> >>  ]
>
> >>  with open(input_path) as input_file:
> >>      lines = iter(input_file)
> >>      for end, path in sections:
> >>          with open(path, "w") as output_file:
> >>              for line in lines:
> >>                  if line >= end:
> >>                      break
> >>                  output_file.write(line)
>
> >>  --
> >>  Arnaud
>
> > Good idea. Especially when dealing with variable numbers of sections.
> > But somehow  I got ;
>
> >      ("5", "section_2")
> > TypeError: 'tuple' object is not callable
>
> That's due to missing commas:
>
> sections = [
>      ("3", "section_1"),
>      ("5", "section_2"),
>      ("\xFF", "section_3")
> ]

Thank you.


#19236

From	Eelco <hoogendoorn.eelco@gmail.com>
Date	2012-01-22 12:43 -0800
Message-ID	<d3725ab2-2a8f-43fc-9381-7bbba30510ac@k6g2000vbz.googlegroups.com>
In reply to	#19216

The grep solution is not cross-platform, and not really an answer to a
question about python.

The by-line iteration examples are inefficient and bad practice from a
numpy/vectorization perspective.

I would advise doing it the numpythonic way (untested code):

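import numpy as np

# Two cut points give three chunks: 0-3 s, 3-5 s and 5-8 s.
breakpoints = [3, 5]
data = np.loadtxt('data.txt')
time = data[:, 0]
# Index of the first row at or after each breakpoint (timestamps must be sorted).
indices = np.searchsorted(time, breakpoints)
chunks = np.split(data, indices, axis=0)
for i, d in enumerate(chunks):
    np.savetxt('data' + str(i) + '.txt', d)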

I'm not sure how it compares to the grep solution in terms of
performance, but that should be a non-issue for 20 MB of data, and
it's sure to blow the by-line iteration out of the water. If you want
to be more efficient, you will have to cut the text-to-numeric parsing
out of the loop, since it accounts for the vast majority of the
computational load here; whether that's possible at all depends on how
structured your timestamps are. In my opinion there must be a really
compelling performance gain to justify throwing the elegance of the
np.split-based solution out of the window.
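
To make the searchsorted/split step concrete, here is the same idea with only the standard library, as a rough sketch: `bisect_left` plays the role of `np.searchsorted` (which also searches from the left by default), and list slicing plays the role of `np.split`. It assumes the timestamps are already sorted; the function name and the sample values are made up.

```python
from bisect import bisect_left

def split_rows(rows, breakpoints):
    """Split sorted (time, value) rows into len(breakpoints) + 1 chunks."""
    times = [time for time, _ in rows]
    # Index of the first row whose time is >= each breakpoint.
    cuts = [bisect_left(times, point) for point in breakpoints]
    edges = [0] + cuts + [len(rows)]
    return [rows[start:stop] for start, stop in zip(edges, edges[1:])]

# Hypothetical sample data: (timestamp in seconds, measured value).
rows = [(0.5, 0.1), (2.9, 0.2), (3.0, 0.3), (4.9, 0.4), (5.0, 0.5), (7.9, 0.6)]
chunks = split_rows(rows, [3, 5])
```

Each breakpoint becomes one cut index, so two breakpoints yield three chunks, and a row whose time equals a breakpoint starts the next chunk.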


#19218

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-22 16:09 +0000
Message-ID	<mailman.4926.1327248722.27778.python-list@python.org>
In reply to	#19213

On 22/01/2012 15:39, Arnaud Delobelle wrote:
> On 22 January 2012 15:19, MRAB<python@mrabarnett.plus.com>  wrote:
>
>>  Here's a solution in Python 3:
>>
>>  input_path = "..."
>>  section_1_path = "..."
>>  section_2_path = "..."
>>  section_3_path = "..."
>>
>>  with open(input_path) as input_file:
>>      try:
>>          line = next(input_file)
>>
>>          # Copy section 1.
>>          with open(section_1_path, "w") as output_file:
>>              while line[0] < "3":
>>                  output_file.write(line)
>>                  line = next(input_file)
>>
>>          # Copy section 2.
>>          with open(section_2_path, "w") as output_file:
>>              while line[5] < "5":
>>                  output_file.write(line)
>>                  line = next(input_file)
>>
>>          # Copy section 3.
>>          with open(section_3_path, "w") as output_file:
>>              while True:
>>                  output_file.write(line)
>>                  line = next(input_file)
>>      except StopIteration:
>>          pass
>>  --
>>  http://mail.python.org/mailman/listinfo/python-list
>
> Or more succintly (but not tested):
>
>
> sections = [
>      ("3", "section_1")
>      ("5", "section_2")
>      ("\xFF", "section_3")
> ]
>
> with open(input_path) as input_file:
>      lines = iter(input_file)
>      for end, path in sections:
>          with open(path, "w") as output_file:
>              for line in lines:
>                  if line >= end:
>                      break
>                  output_file.write(line)
>
Consider the condition "line >= end".

If it's true, then control will break out of the inner loop and start
the inner loop again, getting the next line.

But what of the line which caused it to break out? It'll be lost.
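
A tiny sketch of the effect, using an in-memory list instead of a file (the sample values are made up): the line that triggers the break has already been taken from the iterator, so the next for loop starts one line later.

```python
lines = iter(["1.0\n", "2.0\n", "3.0\n", "4.0\n"])

first_section = []
for line in lines:
    if line >= "3":
        break  # "3.0\n" has already been consumed from the iterator
    first_section.append(line)

# The next loop resumes after the line that caused the break.
second_section = list(lines)
```

After this runs, "3.0\n" appears in neither list; it survives only in the loop variable `line`.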


#19231

From	Arnaud Delobelle <arnodel@gmail.com>
Date	2012-01-22 19:58 +0000
Message-ID	<mailman.4936.1327262289.27778.python-list@python.org>
In reply to	#19213

On 22 January 2012 16:09, MRAB <python@mrabarnett.plus.com> wrote:
> On 22/01/2012 15:39, Arnaud Delobelle wrote:
[...]
>> Or more succintly (but not tested):
>>
>>
>> sections = [
>>     ("3", "section_1")
>>     ("5", "section_2")
>>     ("\xFF", "section_3")
>> ]
>>
>> with open(input_path) as input_file:
>>     lines = iter(input_file)
>>     for end, path in sections:
>>         with open(path, "w") as output_file:
>>             for line in lines:
>>                 if line >= end:
>>                     break
>>                 output_file.write(line)
>>
> Consider the condition "line >= end".
>
> If it's true, then control will break out of the inner loop and start
> the inner loop again, getting the next line.
>
> But what of the line which caused it to break out? It'll be lost.

Of course you're correct - my reply was too rushed.  Here's a
hopefully working version (but still untested :).

sections = [
    ("3", "section_1")
    ("5", "section_2")
    ("\xFF", "section_3")
]

with open(input_path) as input_file:
    line, lines = "", iter(input_file)
    for end, path in sections:
        with open(path, "w") as output_file:
            output_file.write(line)
            for line in lines:
                if line >= end:
                    break
                output_file.write(line)

-- 
Arnaud


#19238

From	MRAB <python@mrabarnett.plus.com>
Date	2012-01-22 20:55 +0000
Message-ID	<mailman.4940.1327265707.27778.python-list@python.org>
In reply to	#19213

On 22/01/2012 19:58, Arnaud Delobelle wrote:
> On 22 January 2012 16:09, MRAB<python@mrabarnett.plus.com>  wrote:
>>  On 22/01/2012 15:39, Arnaud Delobelle wrote:
> [...]
>>>  Or more succintly (but not tested):
>>>
>>>
>>>  sections = [
>>>       ("3", "section_1")
>>>       ("5", "section_2")
>>>       ("\xFF", "section_3")
>>>  ]
>>>
>>>  with open(input_path) as input_file:
>>>       lines = iter(input_file)
>>>       for end, path in sections:
>>>           with open(path, "w") as output_file:
>>>               for line in lines:
>>>                   if line >= end:
>>>                       break
>>>                   output_file.write(line)
>>>
>>  Consider the condition "line >= end".
>>
>>  If it's true, then control will break out of the inner loop and start
>>  the inner loop again, getting the next line.
>>
>>  But what of the line which caused it to break out? It'll be lost.
>
> Of course you're correct - my reply was too rushed.  Here's a
> hopefully working version (but still untested :).
>
> sections = [
>      ("3", "section_1")
>      ("5", "section_2")
>      ("\xFF", "section_3")
> ]
>
[snip]
Missing commas! :-)
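
Pulling the corrections together, a table-driven version with the commas in place might look like the sketch below. Two things in it are assumptions rather than anything posted above: it compares the parsed timestamp instead of the raw string (so a time such as 10.5 would not sort before 3), and it uses `float("inf")` rather than "\xFF" as the final limit. The function name is hypothetical, and every line is assumed to start with a numeric timestamp.

```python
def split_by_time(input_path, sections):
    """Write consecutive time ranges of input_path to separate files.

    sections is a list of (end_time, output_path) pairs in ascending order;
    a line goes to the first section whose end_time exceeds its timestamp.
    Returns the number of lines written to each section.
    """
    counts = []
    with open(input_path) as input_file:
        lines = iter(input_file)
        # The line read but not yet written; None once the input is exhausted.
        pending = next(lines, None)
        for end, path in sections:
            written = 0
            with open(path, "w") as output_file:
                while pending is not None and float(pending.split()[0]) < end:
                    output_file.write(pending)
                    written += 1
                    pending = next(lines, None)
            counts.append(written)
    return counts
```

A call such as `split_by_time("data.txt", [(3, "section_1"), (5, "section_2"), (float("inf"), "section_3")])` would cover the 3 s / 2 s / 3 s split asked for. Because the boundary line stays in `pending` until some section accepts it, it is neither lost nor written twice.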
