Groups > comp.lang.python > #11245 > unrolled thread
| Started by | goldtech <goldtech@worldpost.com> |
|---|---|
| First post | 2011-08-11 19:03 -0700 |
| Last post | 2011-08-28 20:18 +0100 |
| Articles | 8 — 6 participants |
Processing a large string goldtech <goldtech@worldpost.com> - 2011-08-11 19:03 -0700
Re: Processing a large string MRAB <python@mrabarnett.plus.com> - 2011-08-12 03:15 +0100
Re: Processing a large string Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2011-08-12 12:30 +1000
Re: Processing a large string Nobody <nobody@nowhere.com> - 2011-08-12 05:11 +0100
Re: Processing a large string Peter Otten <__peter__@web.de> - 2011-08-12 10:39 +0200
Re: Processing a large string goldtech <goldtech@worldpost.com> - 2011-08-12 06:36 -0700
Re: Processing a large string Peter Otten <__peter__@web.de> - 2011-08-12 16:48 +0200
Re: Processing a large string Paul Rudin <paul.nospam@rudin.co.uk> - 2011-08-28 20:18 +0100
| From | goldtech <goldtech@worldpost.com> |
|---|---|
| Date | 2011-08-11 19:03 -0700 |
| Subject | Processing a large string |
| Message-ID | <b16af723-854c-449d-8b45-565d73579e17@br5g2000vbb.googlegroups.com> |
Hi,

Say I have a very big string with a pattern like:

akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....

I want to split the string into separate parts on the "3" and process each part separately. I might run into memory limitations if I use "split" and get a big array(?) I wondered if there's a way I could read (stream?) the string from start to finish and read what's delimited by the "3" into a variable, process the smaller string variable then append/build a new string with the processed data?

Would I loop it and read it char by char till a "3"...? Or?

Thanks.
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2011-08-12 03:15 +0100 |
| Message-ID | <mailman.2201.1313115355.1164.python-list@python.org> |
| In reply to | #11245 |
On 12/08/2011 03:03, goldtech wrote:
> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
>
You could write a generator like this:
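
    def split(string, sep):
        pos = 0
        try:
            while True:
                # index() raises ValueError when sep does not occur again
                next_pos = string.index(sep, pos)
                yield string[pos : next_pos]
                # skip the whole separator, not just one character
                pos = next_pos + len(sep)
        except ValueError:
            # whatever follows the last separator is the final part
            yield string[pos : ]

    string = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn..."
    for part in split(string, "3"):
        print(part)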
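
One property of a generator like the one above is that it only does as much work as the caller asks for. The following is a minimal sketch of that, not part of the original post: the name `split_lazy` and the test string are made up for illustration, and `itertools.islice` is used to take the first few pieces without scanning the rest of the string.

```python
from itertools import islice

def split_lazy(string, sep):
    """Yield the pieces of `string` between occurrences of `sep`, one at a time."""
    pos = 0
    try:
        while True:
            next_pos = string.index(sep, pos)
            yield string[pos:next_pos]
            pos = next_pos + len(sep)
    except ValueError:
        # No separator left: the rest of the string is the last piece.
        yield string[pos:]

# Made-up input: one million short pieces separated by "3".
big = "abc3" * 1000000

# Only the first three pieces are sliced out of `big`; the generator is
# never advanced past the third separator.
first_three = list(islice(split_lazy(big, "3"), 3))
print(first_three)  # ['abc', 'abc', 'abc']
```

Each yielded piece is still a new string object, so the saving compared with `str.split` is that the full list of pieces never exists at once.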
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2011-08-12 12:30 +1000 |
| Message-ID | <4e449062$0$29975$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #11245 |
goldtech wrote:
> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
Define "big".
What seems big to you is probably not big to your computer.
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
You could, but unless there are a lot of 3s, it will probably be slow. If
the 3s are far apart, it will be better to do this:

    # untested
    def split(source):
        start = 0
        i = source.find("3")
        while i >= 0:
            yield source[start:i]
            start = i+1
            i = source.find("3", start)
        # yield what follows the last "3" (empty if the string ends with "3")
        yield source[start:]

That should give you the pieces of the string one at a time, as efficiently
as possible.
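
The speed claim above can be checked with a rough timing sketch. This is an illustration under stated assumptions, not a benchmark from the thread: the function names `split_by_find` and `split_by_char` and the sample data are made up, and the timings printed will depend on the machine and Python version.

```python
import timeit

def split_by_find(source, sep="3"):
    """Jump from separator to separator with str.find."""
    start = 0
    i = source.find(sep)
    while i >= 0:
        yield source[start:i]
        start = i + len(sep)
        i = source.find(sep, start)
    yield source[start:]  # remainder after the last separator

def split_by_char(source, sep="3"):
    """Inspect every character in a Python-level loop (single-character sep only)."""
    piece = []
    for ch in source:
        if ch == sep:
            yield "".join(piece)
            piece = []
        else:
            piece.append(ch)
    yield "".join(piece)

# Made-up data: separators about 1000 characters apart.
sample = ("x" * 1000 + "3") * 200

for func in (split_by_find, split_by_char):
    seconds = timeit.timeit(lambda: list(func(sample)), number=5)
    print(func.__name__, round(seconds, 4))
```

Both functions produce the same pieces as `sample.split("3")`; only the amount of work done in Python-level code differs.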
--
Steven
| From | Nobody <nobody@nowhere.com> |
|---|---|
| Date | 2011-08-12 05:11 +0100 |
| Message-ID | <pan.2011.08.12.04.11.40.19000@nowhere.com> |
| In reply to | #11245 |
On Thu, 11 Aug 2011 19:03:36 -0700, goldtech wrote:

> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?

Use the .find() or .index() methods to find the next occurrence of a character.

Building a large string by concatenation is inefficient, as each append will copy the original string. If you must have the result as a single string, using cStringIO would be preferable. But you'd be better off if you can work with a list of strings.
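
The advice above can be sketched as follows. This is a minimal illustration, not code from the thread: `process` is a hypothetical stand-in for whatever is done to each piece, `pieces` is a made-up helper name, and `io.StringIO` is used as the Python 3 counterpart of the `cStringIO` module mentioned above, which exists only in Python 2.

```python
import io

def process(piece):
    # Hypothetical per-piece processing; here it just upper-cases the text.
    return piece.upper()

def pieces(source, sep):
    """Yield the parts of `source` between occurrences of `sep`."""
    start = 0
    i = source.find(sep)
    while i >= 0:
        yield source[start:i]
        start = i + len(sep)
        i = source.find(sep, start)
    yield source[start:]

s = "akakksssk3dhdhdhdbddb3dkdkdkddk"

# Option 1: hand the processed pieces to str.join, which assembles the
# result once instead of growing a string piece by piece.
joined = "3".join(process(p) for p in pieces(s, "3"))

# Option 2: write the processed pieces into an in-memory text buffer.
buf = io.StringIO()
for n, p in enumerate(pieces(s, "3")):
    if n:
        buf.write("3")  # put the separator back between pieces
    buf.write(process(p))
from_buffer = buf.getvalue()

print(joined)       # AKAKKSSSK3DHDHDHDBDDB3DKDKDKDDK
print(from_buffer)  # AKAKKSSSK3DHDHDHDBDDB3DKDKDKDDK
```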
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2011-08-12 10:39 +0200 |
| Message-ID | <j22oqv$9ro$1@solani.org> |
| In reply to | #11245 |
goldtech wrote:
> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
You can read the file in chunks:

    from functools import partial

    def read_chunks(instream, chunksize=None):
        if chunksize is None:
            chunksize = 2**20
        # call instream.read(chunksize) until it returns "" (end of file)
        return iter(partial(instream.read, chunksize), "")

    def split_file(instream, delimiter, chunksize=None):
        leftover = ""
        chunk = None
        # pass chunksize through; it was accepted but never used
        for chunk in read_chunks(instream, chunksize):
            chunk = leftover + chunk
            parts = chunk.split(delimiter)
            # the last part may be incomplete; carry it into the next chunk
            leftover = parts.pop()
            for part in parts:
                yield part
        if leftover or chunk is None or chunk.endswith(delimiter):
            yield leftover

I hope I got the corner cases right.
PS: This has come up before, but I couldn't find the relevant threads...
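
One way to gain confidence in the corner cases is to compare the chunked splitter against `str.split` on small inputs with very small chunk sizes. The sketch below is an assumption-laden restatement rather than the code above: it simplifies the final step to an unconditional `yield leftover`, uses `io.StringIO` as a stand-in for a real text file, and the test strings are made up.

```python
import io
from functools import partial

def read_chunks(instream, chunksize=2**20):
    # Call instream.read(chunksize) until it returns "" (end of a text stream).
    return iter(partial(instream.read, chunksize), "")

def split_file(instream, delimiter, chunksize=2**20):
    leftover = ""
    for chunk in read_chunks(instream, chunksize):
        chunk = leftover + chunk
        parts = chunk.split(delimiter)
        leftover = parts.pop()  # may be an incomplete piece; carry it over
        yield from parts
    yield leftover

# Made-up inputs: empty input, leading, trailing and adjacent delimiters.
cases = ["", "3", "a3b", "a3b3", "3a33b", "akakksssk3dhdhdhdbddb3dkdkdkddk"]
for text in cases:
    for chunksize in (1, 2, 5, 100):
        got = list(split_file(io.StringIO(text), "3", chunksize))
        assert got == text.split("3"), (text, chunksize, got)
print("all cases match str.split")
```

Note that `leftover` grows without bound if the delimiter never occurs, so memory use is only limited when delimiters appear reasonably often.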
| From | goldtech <goldtech@worldpost.com> |
|---|---|
| Date | 2011-08-12 06:36 -0700 |
| Message-ID | <bc665478-94a8-461f-9ea0-602f29fd1d22@h14g2000yqd.googlegroups.com> |
| In reply to | #11259 |
Thanks for all this info.
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2011-08-12 16:48 +0200 |
| Message-ID | <mailman.2220.1313160461.1164.python-list@python.org> |
| In reply to | #11259 |
Peter Otten wrote:

> goldtech wrote:
>> Say I have a very big string with a pattern like:
>>
>> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>>
>> I want to split the string into separate parts on the "3" and process
>> each part separately. I might run into memory limitations if I use
>> "split" and get a big array(?) I wondered if there's a way I could
>> read (stream?) the string from start to finish and read what's
>> delimited by the "3" into a variable, process the smaller string
>> variable then append/build a new string with the processed data?
> PS: This has come up before, but I couldn't find the relevant threads...

Alex Martelli a looong time ago:

> from __future__ import generators
>
> def splitby(fileobj, splitter, bufsize=8192):
>     buf = ''
>
>     while True:
>         try:
>             item, buf = buf.split(splitter, 1)
>         except ValueError:
>             more = fileobj.read(bufsize)
>             if not more: break
>             buf += more
>         else:
>             yield item + splitter
>
>     if buf:
>         yield buf

http://mail.python.org/pipermail/python-list/2002-September/770673.html
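
For comparison, here is a sketch of the same function run against an in-memory stream. It assumes a current Python, where the `from __future__ import generators` line (needed only on Python 2.2) is dropped; the `io.StringIO` input and the small `bufsize` are made up to force several reads.

```python
import io

def splitby(fileobj, splitter, bufsize=8192):
    buf = ""
    while True:
        try:
            # Unpacking fails with ValueError when buf holds no splitter.
            item, buf = buf.split(splitter, 1)
        except ValueError:
            # No complete item in the buffer yet: read more, or stop at EOF.
            more = fileobj.read(bufsize)
            if not more:
                break
            buf += more
        else:
            yield item + splitter
    if buf:
        yield buf

stream = io.StringIO("ab3cd3ef")
print(list(splitby(stream, "3", bufsize=4)))  # ['ab3', 'cd3', 'ef']
```

Unlike the other versions in this thread, each item keeps its trailing splitter, much as lines read from a file keep their newline, and no empty final item is produced when the input ends with the splitter.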
| From | Paul Rudin <paul.nospam@rudin.co.uk> |
|---|---|
| Date | 2011-08-28 20:18 +0100 |
| Message-ID | <87y5ydqr9o.fsf@no-fixed-abode.cable.virginmedia.net> |
| In reply to | #11245 |
goldtech <goldtech@worldpost.com> writes:

> Hi,
>
> Say I have a very big string with a pattern like:
>
> akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn.....
>
> I want to split the string into separate parts on the "3" and process
> each part separately. I might run into memory limitations if I use
> "split" and get a big array(?) I wondered if there's a way I could
> read (stream?) the string from start to finish and read what's
> delimited by the "3" into a variable, process the smaller string
> variable then append/build a new string with the processed data?
>
> Would I loop it and read it char by char till a "3"...? Or?
>
> Thanks.

    import itertools

    s = "akakksssk3dhdhdhdbddb3dkdkdkddk3dmdmdmd3dkdkdkdk3asnsn"
    for k, subs in itertools.groupby(s, lambda x: x=="3"):
        if not k:  # k is True for the runs of "3" themselves; skip them
            print(''.join(subs))

What you actually do in the body of the loop depends on what you want to do with the bits.
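
One difference between the `groupby` approach above and `str.split` is worth seeing on a small input. The sketch below uses a made-up string with adjacent and trailing delimiters; the variable names are illustrative only.

```python
import itertools

s = "a3b33c3"

# groupby yields alternating runs: pieces (key False) and runs of "3" (key True).
pieces = ["".join(group)
          for is_sep, group in itertools.groupby(s, lambda ch: ch == "3")
          if not is_sep]

print(pieces)        # ['a', 'b', 'c']
print(s.split("3"))  # ['a', 'b', '', 'c', '']
```

Because adjacent delimiters collapse into a single run, `groupby` never produces the empty pieces that `str.split` reports. It also calls the key function once per character, so it does per-character work in Python-level code.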