Processing a large string

Started by	goldtech <goldtech@worldpost.com>
First post	2011-08-11 19:03 -0700
Last post	2011-08-28 20:18 +0100
Articles	8 — 6 participants

  Processing a large string goldtech <goldtech@worldpost.com> - 2011-08-11 19:03 -0700
    Re: Processing a large string MRAB <python@mrabarnett.plus.com> - 2011-08-12 03:15 +0100
    Re: Processing a large string Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-08-12 12:30 +1000
    Re: Processing a large string Nobody <nobody@nowhere.com> - 2011-08-12 05:11 +0100
    Re: Processing a large string Peter Otten <__peter__@web.de> - 2011-08-12 10:39 +0200
      Re: Processing a large string goldtech <goldtech@worldpost.com> - 2011-08-12 06:36 -0700
      Re: Processing a large string Peter Otten <__peter__@web.de> - 2011-08-12 16:48 +0200
    Re: Processing a large string Paul Rudin <paul.nospam@rudin.co.uk> - 2011-08-28 20:18 +0100

#11245 — Processing a large string

From	goldtech <goldtech@worldpost.com>
Date	2011-08-11 19:03 -0700
Subject	Processing a large string
Message-ID	<b16af723-854c-449d-8b45-565d73579e17@br5g2000vbb.googlegroups.com>

Hi,

Say I have a very big string with a pattern like:

akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....

I want to split the string into separate parts on the "3" and process
each part separately. I might run into memory limitations if I use
"split" and get a big array(?). I wondered if there's a way I could
read (stream?) the string from start to finish and read what's
delimited by the "3" into a variable, process the smaller string
variable then append/build a new string with the processed data?

Would I loop it and read it char by char till a "3"...? Or?

Thanks.

#11246

From	MRAB <python@mrabarnett.plus.com>
Date	2011-08-12 03:15 +0100
Message-ID	<mailman.2201.1313115355.1164.python-list@python.org>
In reply to	#11245

On 12/08/2011 03:03, goldtech wrote:
> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?)  I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
>
You could write a generator like this:

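def split(string, sep):
    """Lazily yield the pieces of string separated by sep."""
    pos = 0
    try:
        while True:
            # index() raises ValueError once no further separator exists.
            next_pos = string.index(sep, pos)
            yield string[pos:next_pos]
            # Skip the whole separator, not just one character.
            pos = next_pos + len(sep)
    except ValueError:
        # Whatever follows the last separator is the final piece.
        yield string[pos:]

string = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn..."

for part in split(string, "3"):
    print(part)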
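
A quick way to gain confidence in a generator like this is to compare it with the built-in str.split on a few edge cases (leading, trailing and doubled separators, and the empty string). The sketch below restates the generator so that it runs on its own; the names lazy_split and samples are made up for illustration, and a non-empty separator is assumed.

```python
def lazy_split(string, sep):
    """Yield the pieces of string separated by sep, one at a time."""
    pos = 0
    try:
        while True:
            next_pos = string.index(sep, pos)  # ValueError when no more sep
            yield string[pos:next_pos]
            pos = next_pos + len(sep)
    except ValueError:
        yield string[pos:]

# Hypothetical edge cases: separators at the ends, doubled, or absent.
samples = ["a3b3c", "3a3", "33", "", "abc"]
matches = [list(lazy_split(s, "3")) == s.split("3") for s in samples]
print(all(matches))
```

Unlike str.split, the generator never holds more than one piece at a time, which is the point of the exercise.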

#11247

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2011-08-12 12:30 +1000
Message-ID	<4e449062$0$29975$c3e8da3$5496439d@news.astraweb.com>
In reply to	#11245

goldtech wrote:

> Hi,
> 
> Say I have a very big string with a pattern like:
> 
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....


Define "big".

What seems big to you is probably not big to your computer.


> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?)  I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
> 
> Would I loop it and read it char by char till a "3"...? Or?

You could, but unless there are a lot of 3s, it will probably be slow. If
the 3s are far apart, it will be better to do this:

# untested
def split(source):
    start = 0
    i = source.find("3")
    while i >= 0:
        yield source[start:i]
        start = i+1
        i = source.find("3", start)
    # find() returned -1: yield the piece after the last "3" as well.
    yield source[start:]

That should give you the pieces of the string one at a time, as efficiently
as possible.
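
Because the pieces are produced lazily, a caller can stop early without scanning the rest of the string. A minimal sketch of that, assuming a generator along the lines of the one above (the names split_on, big and first_two are made up for illustration):

```python
from itertools import islice

def split_on(source, sep="3"):
    """Yield the pieces of source between occurrences of sep."""
    start = 0
    i = source.find(sep)
    while i >= 0:
        yield source[start:i]
        start = i + len(sep)
        i = source.find(sep, start)
    yield source[start:]  # the piece after the last separator

big = "abc3" * 100000

# Only the first two pieces are ever located and sliced out.
first_two = list(islice(split_on(big), 2))
print(first_two)
```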




-- 
Steven

#11251

From	Nobody <nobody@nowhere.com>
Date	2011-08-12 05:11 +0100
Message-ID	<pan.2011.08.12.04.11.40.19000@nowhere.com>
In reply to	#11245

On Thu, 11 Aug 2011 19:03:36 -0700, goldtech wrote:

> Say I have a very big string with a pattern like:
> 
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
> 
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?)  I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
> 
> Would I loop it and read it char by char till a "3"...? Or?

Use the .find() or .index() methods to find the next occurrence of a
character.

Building a large string by concatenation is inefficient, as each append
will copy the original string. If you must have the result as a
single string, using cStringIO would be preferable. But you'd be better
off if you can work with a list of strings.
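
As a rough illustration of both suggestions, the sketch below collects the processed pieces in a list and joins them once, and alternatively writes them to an in-memory buffer. It assumes Python 3, where io.StringIO plays the role that cStringIO had in Python 2; process() is a made-up placeholder for whatever per-part work is needed.

```python
import io

def process(part):
    # Hypothetical per-part processing; here it just upper-cases the part.
    return part.upper()

source = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd"

# Option 1: keep a list of processed pieces and join once at the end.
pieces = [process(part) for part in source.split("3")]
joined = "3".join(pieces)

# Option 2: write each processed piece to an in-memory text buffer.
buffer = io.StringIO()
for index, part in enumerate(source.split("3")):
    if index:
        buffer.write("3")  # put the delimiter back between pieces
    buffer.write(process(part))
buffered = buffer.getvalue()

print(joined == buffered)
```

Either way, no growing string is re-copied on each append; the list version is usually the simpler of the two.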

#11259

From	Peter Otten <__peter__@web.de>
Date	2011-08-12 10:39 +0200
Message-ID	<j22oqv$9ro$1@solani.org>
In reply to	#11245

goldtech wrote:

> Hi,
> 
> Say I have a very big string with a pattern like:
> 
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
> 
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?)  I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
> 
> Would I loop it and read it char by char till a "3"...? Or?

You can read the file in chunks:

from functools import partial

def read_chunks(instream, chunksize=None):
    if chunksize is None:
        chunksize = 2**20
    # iter(callable, sentinel) keeps calling read() until it returns "".
    return iter(partial(instream.read, chunksize), "")

def split_file(instream, delimiter, chunksize=None):
    leftover = ""
    chunk = None
    # Pass chunksize through; otherwise the argument is silently ignored.
    for chunk in read_chunks(instream, chunksize):
        chunk = leftover + chunk
        parts = chunk.split(delimiter)
        # The last part may be incomplete, so carry it into the next chunk.
        leftover = parts.pop()
        for part in parts:
            yield part
    if leftover or chunk is None or chunk.endswith(delimiter):
        yield leftover

I hope I got the corner cases right.
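
One way to check corner cases like these is to run a chunked splitter over an in-memory stream with several chunk sizes and compare the result with str.split. The sketch below is a simplified restatement rather than the code above verbatim: it always yields the final leftover, it assumes a text-mode stream (a binary stream would need b"" as the sentinel), and the names text and results are made up for illustration.

```python
import io
from functools import partial

def read_chunks(instream, chunksize=2**20):
    # Call instream.read(chunksize) repeatedly until it returns "".
    return iter(partial(instream.read, chunksize), "")

def split_file(instream, delimiter, chunksize=2**20):
    leftover = ""
    for chunk in read_chunks(instream, chunksize):
        parts = (leftover + chunk).split(delimiter)
        leftover = parts.pop()  # possibly incomplete; finish it next round
        for part in parts:
            yield part
    yield leftover  # the text after the last delimiter (may be empty)

text = "akakksssk3dhdhdhdbddb3dkdkdkddk3"

# Tiny chunk sizes force pieces to straddle chunk boundaries.
results = {
    size: list(split_file(io.StringIO(text), "3", size))
    for size in (1, 4, 1000)
}
print(all(parts == text.split("3") for parts in results.values()))
```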

PS: This has come up before, but I couldn't find the relevant threads...

#11275

From	goldtech <goldtech@worldpost.com>
Date	2011-08-12 06:36 -0700
Message-ID	<bc665478-94a8-461f-9ea0-602f29fd1d22@h14g2000yqd.googlegroups.com>
In reply to	#11259

Thanks for all this info.

#11279

From	Peter Otten <__peter__@web.de>
Date	2011-08-12 16:48 +0200
Message-ID	<mailman.2220.1313160461.1164.python-list@python.org>
In reply to	#11259

Peter Otten wrote:

> goldtech wrote:

>> Say I have a very big string with a pattern like:
>> 
>> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>> 
>> I want to split the string into separate parts on the "3" and process
>> each part separately. I might run into memory limitations if I use
>> "split" and get a big array(?)  I wondered if there's a way I could
>> read (stream?) the string from start to finish and read what's
>> delimited by the "3" into a variable, process the smaller string
>> variable then append/build a new string with the processed data?

> PS: This has come up before, but I couldn't find the relevant threads...

Alex Martelli a looong time ago:

> from __future__ import generators
> 
> def splitby(fileobj, splitter, bufsize=8192):
>     buf = ''
> 
>     while True:
>         try: 
>             item, buf = buf.split(splitter, 1)
>         except ValueError:
>             more = fileobj.read(bufsize)
>             if not more: break
>             buf += more
>         else:
>             yield item + splitter
> 
>     if buf:
>         yield buf

http://mail.python.org/pipermail/python-list/2002-September/770673.html

#12349

From	Paul Rudin <paul.nospam@rudin.co.uk>
Date	2011-08-28 20:18 +0100
Message-ID	<87y5ydqr9o.fsf@no-fixed-abode.cable.virginmedia.net>
In reply to	#11245

goldtech <goldtech@worldpost.com> writes:

> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?)  I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
>
> Thanks.

import itertools

s = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"
for is_sep, subs in itertools.groupby(s, lambda x: x == "3"):
    if not is_sep:  # skip the groups made up of the "3" delimiters
        print(''.join(subs))


What you actually do in the body of the loop depends on what you want to
do with the bits.
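
One thing to be aware of with the groupby approach: consecutive delimiters fall into a single group, so empty pieces disappear, which differs from str.split. A small sketch of the difference (the helper name parts_by_groupby and the sample string are made up for illustration):

```python
import itertools

def parts_by_groupby(s, sep="3"):
    """Yield the runs of characters between delimiters, skipping the delimiters."""
    for is_sep, group in itertools.groupby(s, lambda ch: ch == sep):
        if not is_sep:
            yield "".join(group)

s = "ab33cd3"
grouped = list(parts_by_groupby(s))  # empty pieces are dropped
plain = s.split("3")                 # empty pieces are kept

print(grouped)
print(plain)
```

Whether dropping the empty pieces is a bug or a feature depends on the data; groupby also walks the string one character at a time, so it is likely to be slower than find() when the delimiters are far apart.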
