Groups > comp.lang.python > #90729 > unrolled thread

Fastest way to remove the first x characters from a very long string

Started by	bruceg113355@gmail.com
First post	2015-05-16 06:28 -0700
Last post	2015-05-16 23:24 +0000
Articles	14 — 9 participants


  Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 06:28 -0700
    Re: Fastest way to remove the first x characters from a very long string Joel Goldstick <joel.goldstick@gmail.com> - 2015-05-16 09:43 -0400
    Re: Fastest way to remove the first x characters from a very long string Chris Angelico <rosuav@gmail.com> - 2015-05-16 23:45 +1000
      Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 07:02 -0700
        Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 09:22 -0700
          Re: Fastest way to remove the first x characters from a very long string Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-16 10:57 -0600
          Re: Fastest way to remove the first x characters from a very long string Chris Angelico <rosuav@gmail.com> - 2015-05-17 02:59 +1000
            Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 10:35 -0700
              Re: Fastest way to remove the first x characters from a very long string Cameron Simpson <cs@zip.com.au> - 2015-05-17 08:41 +1000
    Re: Fastest way to remove the first x characters from a very long string Grant Edwards <invalid@invalid.invalid> - 2015-05-16 14:59 +0000
      Re: Fastest way to remove the first x characters from a very long string Rustom Mody <rustompmody@gmail.com> - 2015-05-16 08:13 -0700
        Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 09:24 -0700
          Re: Fastest way to remove the first x characters from a very long string Irmen de Jong <irmen.NOSPAM@xs4all.nl> - 2015-05-16 18:55 +0200
    Re: Fastest way to remove the first x characters from a very long string Denis McMahon <denismfmcmahon@gmail.com> - 2015-05-16 23:24 +0000

#90729 — Fastest way to remove the first x characters from a very long string

From	bruceg113355@gmail.com
Date	2015-05-16 06:28 -0700
Subject	Fastest way to remove the first x characters from a very long string
Message-ID	<6a383ce2-5975-4225-b4f2-f744c9d7a516@googlegroups.com>

I have a string that contains 10 million characters.

The string is formatted as:

"0000001 : some hexadecimal text ... \n
0000002 : some hexadecimal text ... \n
0000003 : some hexadecimal text ... \n
...
0100000 : some hexadecimal text ... \n
0100001 : some hexadecimal text ... \n"

and I need the string to look like:

"some hexadecimal text ... \n
some hexadecimal text ... \n
some hexadecimal text ... \n
...
some hexadecimal text ... \n
some hexadecimal text ... \n"

I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.

Any tricks to remove the first n characters of each line in a string faster?

Thanks,
Bruce


#90730

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2015-05-16 09:43 -0400
Message-ID	<mailman.71.1431783817.17265.python-list@python.org>
In reply to	#90729

On Sat, May 16, 2015 at 9:28 AM,  <bruceg113355@gmail.com> wrote:
> I have a string that contains 10 million characters.
>
> The string is formatted as:
>
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
>
> and I need the string to look like:
>
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"
>
> I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.
>
> Any tricks to remove the first n characters of each line in a string faster?
>
Slicing each line at a fixed offset might be faster than searching for the ":".

Do you need to do this all at once?  If not, use a generator.
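
A minimal sketch of the generator idea, assuming a fixed-width prefix of 8 characters as mentioned in the original post. The name strip_prefix and the sample data are made up for illustration; the sample lines were chosen so that the prefix is exactly 8 characters wide.

```python
def strip_prefix(text, offset=8):
    """Yield each line of text with its first `offset` characters removed."""
    # keepends=True keeps the "\n" on every line, so the pieces can be
    # written out or joined again without re-adding line terminators.
    for line in text.splitlines(keepends=True):
        yield line[offset:]


sample = "0000001 AB CD\n0000002 EF 01\n"  # hypothetical data, 8-char prefix
for stripped in strip_prefix(sample):
    print(stripped, end="")
```

Nothing is built up front here: each stripped line is produced only when the consumer asks for it, which helps if the lines are processed one at a time.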

> Thanks,
> Bruce



-- 
Joel Goldstick
http://joelgoldstick.com


#90731

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-16 23:45 +1000
Message-ID	<mailman.72.1431783967.17265.python-list@python.org>
In reply to	#90729

On Sat, May 16, 2015 at 11:28 PM,  <bruceg113355@gmail.com> wrote:
> I have a string that contains 10 million characters.
>
> The string is formatted as:
>
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
>
> and I need the string to look like:
>
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"
>
> I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.
>
> Any tricks to remove the first n characters of each line in a string faster?

Given that your definition is "each line", what I'd advise is first
splitting the string into lines, then changing each line, and then
rejoining them into a single string.

lines = original_text.split("\n")
new_text = "\n".join(line[8:] for line in lines)

Would that work?

ChrisA


#90733

From	bruceg113355@gmail.com
Date	2015-05-16 07:02 -0700
Message-ID	<4272d9d9-3d5b-4b8b-9875-6b66634b490c@googlegroups.com>
In reply to	#90731

On Saturday, May 16, 2015 at 9:46:17 AM UTC-4, Chris Angelico wrote:
> On Sat, May 16, 2015 at 11:28 PM,  <bruceg113355@gmail.com> wrote:
> > I have a string that contains 10 million characters.
> >
> > The string is formatted as:
> >
> > "0000001 : some hexadecimal text ... \n
> > 0000002 : some hexadecimal text ... \n
> > 0000003 : some hexadecimal text ... \n
> > ...
> > 0100000 : some hexadecimal text ... \n
> > 0100001 : some hexadecimal text ... \n"
> >
> > and I need the string to look like:
> >
> > "some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > ...
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n"
> >
> > I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.
> >
> > Any tricks to remove the first n characters of each line in a string faster?
> 
> Given that your definition is "each line", what I'd advise is first
> splitting the string into lines, then changing each line, and then
> rejoining them into a single string.
> 
> lines = original_text.split("\n")
> new_text = "\n".join(line[8:] for line in lines)
> 
> Would that work?
> 
> ChrisA


Hi Chris,

I meant to say I can split the string at the \n.

Your approach using .join is what I was looking for.
Thank you,

Bruce


#90740

From	bruceg113355@gmail.com
Date	2015-05-16 09:22 -0700
Message-ID	<f3836ca1-3f87-4c5c-9839-4d8a35aa77e4@googlegroups.com>
In reply to	#90733

On Saturday, May 16, 2015 at 10:06:31 AM UTC-4, Stefan Ram wrote:
> bruceg113355@gmail.com writes:
> >Your approach using .join is what I was looking for.
> 
>   I'd appreciate a report of your measurements.

# Original Approach
# -----------------
ss = ss.split("\n")
ss1 = ""
for sdata in ss:
    ss1 = ss1 + (sdata[OFFSET:] + "\n")


# Chris's Approach
# ----------------
lines = ss.split("\n")
new_text = "\n".join(line[8:] for line in lines)  


Test #1, Number of Characters: 165110
Original Approach: 18ms
Chris's Approach:   1ms

Test #2, Number of Characters: 470763
Original Approach: 593ms
Chris's Approach:   16ms

Test #3, Number of Characters: 944702
Original Approach: 2.824s
Chris's Approach:    47ms

Test #4, Number of Characters: 5557394
Original Approach: 122s
Chris's Approach:   394ms
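
The growth in the figures above is consistent with the loop copying an ever longer string on each iteration, while join builds the result once. A rough way to repeat the comparison is sketched below. The data generator make_text and the function names are assumptions made for this sketch, not the code that was actually measured, and timings will vary with the Python version and the data.

```python
import timeit

OFFSET = 8  # prefix width used in the thread


def make_text(n_lines):
    # Synthetic stand-in for the real data: 7-digit counter, a space, payload.
    return "".join("%07d DE AD BE EF\n" % i for i in range(1, n_lines + 1))


def concat_loop(ss):
    # Original approach: grow the result one line at a time.
    out = ""
    for sdata in ss.split("\n"):
        out = out + (sdata[OFFSET:] + "\n")
    return out


def join_slices(ss):
    # Chris's approach: slice every line, then join once.
    return "\n".join(line[OFFSET:] for line in ss.split("\n"))


text = make_text(20000)
for func in (concat_loop, join_slices):
    seconds = timeit.timeit(lambda: func(text), number=3)
    print("%-12s %.4f s for 3 runs" % (func.__name__, seconds))
```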


#90744

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2015-05-16 10:57 -0600
Message-ID	<mailman.77.1431795491.17265.python-list@python.org>
In reply to	#90740

On Sat, May 16, 2015 at 10:22 AM,  <bruceg113355@gmail.com> wrote:
> # Chris's Approach
> # ----------------
> lines = ss.split("\n")
> new_text = "\n".join(line[8:] for line in lines)

Looks like the approach you have may be fast enough already, but I'd
wager the generator expression could be replaced with:

    map(operator.itemgetter(slice(8, None)), lines)

for a modest speed-up. On the downside, this is less readable.
Substitute itertools.imap for map if using Python 2.x.
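
Spelled out for Python 3, the suggestion might look like the sketch below. The sample lines are made up, and whether this is actually faster than the generator expression is something to measure rather than assume.

```python
import operator

lines = ["0000001 AB CD", "0000002 EF 01"]  # hypothetical data, 8-char prefix

# itemgetter(slice(8, None)) builds a callable equivalent to: lambda s: s[8:]
strip8 = operator.itemgetter(slice(8, None))

# map() is lazy in Python 3; str.join consumes it directly.
new_text = "\n".join(map(strip8, lines))
print(new_text)
```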


#90745

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-17 02:59 +1000
Message-ID	<mailman.78.1431795548.17265.python-list@python.org>
In reply to	#90740

On Sun, May 17, 2015 at 2:22 AM,  <bruceg113355@gmail.com> wrote:
> # Original Approach
> # -----------------
> ss = ss.split("\n")
> ss1 = ""
> for sdata in ss:
>     ss1 = ss1 + (sdata[OFFSET:] + "\n")
>
>
> # Chris's Approach
> # ----------------
> lines = ss.split("\n")
> new_text = "\n".join(line[8:] for line in lines)

Ah, yep. This is exactly what str.join() exists for :) Though do make
sure the results are the same for each - there are two noteworthy
differences between these two. Your version has a customizable OFFSET,
where mine is hard-coded; I'm sure you know how to change that part.
The subtler one is that "\n".join(...) won't put a \n after the final
string - your version ends up adding one more newline. If that's
important to you, you'll have to add one explicitly. (I suspect
probably not, though; ss.split("\n") won't expect a final newline, so
you'll get a blank entry in the list if there is one, and then you'll
end up reinstating the newline when that blank gets joined in.) Just
remember to check correctness before performance, and you should be
safe.
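
The trailing-newline difference is easy to see on a tiny input. This is only an illustration with made-up data that ends in a newline; OFFSET is set to 8 as in the thread.

```python
OFFSET = 8
ss = "0000001 AB CD\n0000002 EF 01\n"  # hypothetical data ending in "\n"

lines = ss.split("\n")
print(lines)  # the final element is an empty string

# join approach: the empty last entry restores exactly one final newline.
joined = "\n".join(line[OFFSET:] for line in lines)

# loop approach: every entry, including the empty one, gets its own "\n".
looped = ""
for sdata in lines:
    looped = looped + (sdata[OFFSET:] + "\n")

print(repr(joined))
print(repr(looped))
```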

ChrisA


#90746

From	bruceg113355@gmail.com
Date	2015-05-16 10:35 -0700
Message-ID	<8354e4ba-0a80-48a1-b2e9-6edb4b67ff36@googlegroups.com>
In reply to	#90745

On Saturday, May 16, 2015 at 12:59:19 PM UTC-4, Chris Angelico wrote:
> On Sun, May 17, 2015 at 2:22 AM,  <bruceg113355@gmail.com> wrote:
> > # Original Approach
> > # -----------------
> > ss = ss.split("\n")
> > ss1 = ""
> > for sdata in ss:
> >     ss1 = ss1 + (sdata[OFFSET:] + "\n")
> >
> >
> > # Chris's Approach
> > # ----------------
> > lines = ss.split("\n")
> > new_text = "\n".join(line[8:] for line in lines)
> 
> Ah, yep. This is exactly what str.join() exists for :) Though do make
> sure the results are the same for each - there are two noteworthy
> differences between these two. Your version has a customizable OFFSET,
> where mine is hard-coded; I'm sure you know how to change that part.
> The subtler one is that "\n".join(...) won't put a \n after the final
> string - your version ends up adding one more newline. If that's
> important to you, you'll have to add one explicitly. (I suspect
> probably not, though; ss.split("\n") won't expect a final newline, so
> you'll get a blank entry in the list if there is one, and then you'll
> end up reinstating the newline when that blank gets joined in.) Just
> remember to check correctness before performance, and you should be
> safe.
> 
> ChrisA

Hi Chris,

Your approach more than meets my requirements.
Data is formatted correctly and performance is simply amazing. 
OFFSET and \n are small details.

Thank you again,
Bruce


#90760

From	Cameron Simpson <cs@zip.com.au>
Date	2015-05-17 08:41 +1000
Message-ID	<mailman.87.1431817664.17265.python-list@python.org>
In reply to	#90746

On 16May2015 10:35, bruceg113355@gmail.com <bruceg113355@gmail.com> wrote:
>On Saturday, May 16, 2015 at 12:59:19 PM UTC-4, Chris Angelico wrote:
>> On Sun, May 17, 2015 at 2:22 AM,  <bruceg113355@gmail.com> wrote:
>> > # Original Approach
>> > # -----------------
>> > ss = ss.split("\n")
>> > ss1 = ""
>> > for sdata in ss:
>> >     ss1 = ss1 + (sdata[OFFSET:] + "\n")
>> >
>> > # Chris's Approach
>> > # ----------------
>> > lines = ss.split("\n")
>> > new_text = "\n".join(line[8:] for line in lines)
[...]
>
>Your approach more than meets my requirements.
>Data is formatted correctly and performance is simply amazing.
>OFFSET and \n are small details.

The only comment I'd make at this point is to consider if you really need a 
single string at the end. Keeping it as a list of lines may be more flexible.  
(It will consume more memory.) If you're doing more stuff with the string as 
lines then you'd need to re-split it, and so forth.
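
As a sketch of that idea, with made-up sample data and a hypothetical follow-up step (counting the fields on each line), the lines can stay in a list and be joined only when a single string is really needed:

```python
OFFSET = 8
ss = "0000001 AB CD\n0000002 EF 01\n"  # hypothetical data, 8-char prefix

# splitlines() drops the line terminators and gives no trailing empty entry.
stripped_lines = [line[OFFSET:] for line in ss.splitlines()]

# Further per-line work needs no re-splitting.
field_counts = [len(line.split()) for line in stripped_lines]

# Join only at the point where one string is actually required.
new_text = "\n".join(stripped_lines)
print(stripped_lines, field_counts)
```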

Cheers,
Cameron Simpson <cs@zip.com.au>


#90737

From	Grant Edwards <invalid@invalid.invalid>
Date	2015-05-16 14:59 +0000
Message-ID	<mj7m19$qn$1@reader1.panix.com>
In reply to	#90729

On 2015-05-16, bruceg113355@gmail.com <bruceg113355@gmail.com> wrote:

> I have a string that contains 10 million characters.
>
> The string is formatted as:
>
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
>
> and I need the string to look like:
>
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"
>
> I can split the string at the ":" then iterate through the list
> removing the first 8 characters then convert back to a string. This
> method works, but it takes too long to execute.
>
> Any tricks to remove the first n characters of each line in a string faster?

Well, if the strings are all in a file, I'd probably just use sed:

$ sed 's/^........//g' file1.txt >file2.txt

or

$ sed 's/^.*://g' file1.txt >file2.txt
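
For anyone who would rather stay in Python, the fixed-width variant can be done line by line on a file as well. This is a sketch only: strip_file is a made-up helper, and the temporary files stand in for file1.txt and file2.txt.

```python
import os
import tempfile


def strip_file(src_path, dst_path, offset=8):
    """Copy src_path to dst_path, dropping the first `offset` chars of each line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:  # reads one line at a time, not the whole file
            dst.write(line[offset:])


# Demonstration with throwaway files (hypothetical data, 8-char prefix).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "file1.txt")
dst_path = os.path.join(workdir, "file2.txt")
with open(src_path, "w") as f:
    f.write("0000001 AB CD\n0000002 EF 01\n")

strip_file(src_path, dst_path)
with open(dst_path) as f:
    result = f.read()
print(result, end="")
```

Note that a line shorter than the offset loses its newline as well in this sketch, so it assumes every line carries the full prefix.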


#90738

From	Rustom Mody <rustompmody@gmail.com>
Date	2015-05-16 08:13 -0700
Message-ID	<3faf260f-7777-4a60-8212-981340e478b3@googlegroups.com>
In reply to	#90737

On Saturday, May 16, 2015 at 8:30:02 PM UTC+5:30, Grant Edwards wrote:
> On 2015-05-16, bruceg113355 wrote:
> 
> > I have a string that contains 10 million characters.
> >
> > The string is formatted as:
> >
> > "0000001 : some hexadecimal text ... \n
> > 0000002 : some hexadecimal text ... \n
> > 0000003 : some hexadecimal text ... \n
> > ...
> > 0100000 : some hexadecimal text ... \n
> > 0100001 : some hexadecimal text ... \n"
> >
> > and I need the string to look like:
> >
> > "some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > ...
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n"
> >
> > I can split the string at the ":" then iterate through the list
> > removing the first 8 characters then convert back to a string. This
> > method works, but it takes too long to execute.
> >
> > Any tricks to remove the first n characters of each line in a string faster?
> 
> Well, if the strings are all in a file, I'd probably just use sed:
> 
> $ sed 's/^........//g' file1.txt >file2.txt
> 
> or
> 
> $ sed 's/^.*://g' file1.txt >file2.txt


And if they are not in a file you could start by putting them (it) there :-)

Seriously... How does your 'string' come into existence?
How/when do you get hold of it?


#90741

From	bruceg113355@gmail.com
Date	2015-05-16 09:24 -0700
Message-ID	<5b815f6c-d639-4193-a644-ea2e1a1759a6@googlegroups.com>
In reply to	#90738

On Saturday, May 16, 2015 at 11:13:45 AM UTC-4, Rustom Mody wrote:
> On Saturday, May 16, 2015 at 8:30:02 PM UTC+5:30, Grant Edwards wrote:
> > On 2015-05-16, bruceg113355 wrote:
> > 
> > > I have a string that contains 10 million characters.
> > >
> > > The string is formatted as:
> > >
> > > "0000001 : some hexadecimal text ... \n
> > > 0000002 : some hexadecimal text ... \n
> > > 0000003 : some hexadecimal text ... \n
> > > ...
> > > 0100000 : some hexadecimal text ... \n
> > > 0100001 : some hexadecimal text ... \n"
> > >
> > > and I need the string to look like:
> > >
> > > "some hexadecimal text ... \n
> > > some hexadecimal text ... \n
> > > some hexadecimal text ... \n
> > > ...
> > > some hexadecimal text ... \n
> > > some hexadecimal text ... \n"
> > >
> > > I can split the string at the ":" then iterate through the list
> > > removing the first 8 characters then convert back to a string. This
> > > method works, but it takes too long to execute.
> > >
> > > Any tricks to remove the first n characters of each line in a string faster?
> > 
> > Well, if the strings are all in a file, I'd probably just use sed:
> > 
> > $ sed 's/^........//g' file1.txt >file2.txt
> > 
> > or
> > 
> > $ sed 's/^.*://g' file1.txt >file2.txt
> 
> 
> And if they are not in a file you could start by putting them (it) there :-)
> 
> Seriously... How does your 'string' come into existence?
> How/when do you get hold of it?

Data is coming from a wxPython TextCtrl widget.
The widget is displaying data received on a serial port for a user to analyze.


#90743

From	Irmen de Jong <irmen.NOSPAM@xs4all.nl>
Date	2015-05-16 18:55 +0200
Message-ID	<55577688$0$2821$e4fe514c@news.xs4all.nl>
In reply to	#90741

On 16-5-2015 18:24, bruceg113355@gmail.com wrote:
> Data is coming from a wxPython TextCtrl widget.

Hm, there should be a better source of the data before it ends up in the textctrl widget.

> The widget is displaying data received on a serial port for a user to analyze.

If this is read from a serial port, can't you process the data directly when it arrives?
This may give you the chance to simply operate on the line as soon as it arrives from
the port, before pasting it all in the textctrl.
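
One way that could look is sketched below. The chunks are plain strings standing in for whatever the serial-reading code delivers; no particular serial or wxPython API is assumed, and the name consume is made up for this sketch.

```python
OFFSET = 8


def consume(chunks, offset=OFFSET):
    """Yield prefix-free lines from an iterable of arbitrary text chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # A chunk may hold part of a line, one line, or several lines.
        while "\n" in pending:
            line, pending = pending.split("\n", 1)
            yield line[offset:] + "\n"
    # Any incomplete final line is left unprocessed in this sketch.


# Hypothetical data arriving in uneven pieces, 8-char prefix per line.
chunks = ["0000001 AB", " CD\n00000", "02 EF 01\n"]
for clean in consume(chunks):
    print(clean, end="")
```

Each cleaned line could then be appended to the widget as it arrives, so the large string never has to be post-processed.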

Irmen


#90761

From	Denis McMahon <denismfmcmahon@gmail.com>
Date	2015-05-16 23:24 +0000
Message-ID	<mj8jik$f1c$5@dont-email.me>
In reply to	#90729

On Sat, 16 May 2015 06:28:19 -0700, bruceg113355 wrote:

> I have a string that contains 10 million characters.
> 
> The string is formatted as:
> 
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
> 
> and I need the string to look like:
> 
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"

Looks to me as if you have a 10 Mbyte encoded file with line numbers as
ASCII text and you're trying to strip the line numbers before decoding
the file.

Are you looking for a one-off solution, or do you have a lot of these 
files?

If you have a lot of files to process, you could try using something like 
sed.

sed -E -i.old 's/^[0-9]+ : //' *.ext
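
If the string is already in memory, the same pattern-based strip can be sketched with Python's re module, which does understand \d. The sample data follows the "number : text" layout from the original post and is otherwise made up.

```python
import re

# MULTILINE makes ^ match at the start of every line, not just the string.
PREFIX = re.compile(r"^\d+ : ", re.MULTILINE)

ss = "0000001 : AB CD\n0000002 : EF 01\n"  # hypothetical data
new_text = PREFIX.sub("", ss)
print(new_text, end="")
```

Unlike fixed-offset slicing, this does not depend on the line numbers having a constant width, though a regular expression pass may well be slower than slicing on large inputs.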

-- 
Denis McMahon, denismfmcmahon@gmail.com

