Fastest way to remove the first x characters from a very long string

Started by: bruceg113355@gmail.com
First post: 2015-05-16 06:28 -0700
Last post: 2015-05-16 23:24 +0000
Articles 14 — 9 participants


Contents

  Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 06:28 -0700
    Re: Fastest way to remove the first x characters from a very long string Joel Goldstick <joel.goldstick@gmail.com> - 2015-05-16 09:43 -0400
    Re: Fastest way to remove the first x characters from a very long string Chris Angelico <rosuav@gmail.com> - 2015-05-16 23:45 +1000
      Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 07:02 -0700
        Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 09:22 -0700
          Re: Fastest way to remove the first x characters from a very long string Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-16 10:57 -0600
          Re: Fastest way to remove the first x characters from a very long string Chris Angelico <rosuav@gmail.com> - 2015-05-17 02:59 +1000
            Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 10:35 -0700
              Re: Fastest way to remove the first x characters from a very long string Cameron Simpson <cs@zip.com.au> - 2015-05-17 08:41 +1000
    Re: Fastest way to remove the first x characters from a very long string Grant Edwards <invalid@invalid.invalid> - 2015-05-16 14:59 +0000
      Re: Fastest way to remove the first x characters from a very long string Rustom Mody <rustompmody@gmail.com> - 2015-05-16 08:13 -0700
        Re: Fastest way to remove the first x characters from a very long string bruceg113355@gmail.com - 2015-05-16 09:24 -0700
          Re: Fastest way to remove the first x characters from a very long string Irmen de Jong <irmen.NOSPAM@xs4all.nl> - 2015-05-16 18:55 +0200
    Re: Fastest way to remove the first x characters from a very long string Denis McMahon <denismfmcmahon@gmail.com> - 2015-05-16 23:24 +0000

#90729 — Fastest way to remove the first x characters from a very long string

From: bruceg113355@gmail.com
Date: 2015-05-16 06:28 -0700
Subject: Fastest way to remove the first x characters from a very long string
Message-ID: <6a383ce2-5975-4225-b4f2-f744c9d7a516@googlegroups.com>
I have a string that contains 10 million characters.

The string is formatted as:

"0000001 : some hexadecimal text ... \n
0000002 : some hexadecimal text ... \n
0000003 : some hexadecimal text ... \n
...
0100000 : some hexadecimal text ... \n
0100001 : some hexadecimal text ... \n"

and I need the string to look like:

"some hexadecimal text ... \n
some hexadecimal text ... \n
some hexadecimal text ... \n
...
some hexadecimal text ... \n
some hexadecimal text ... \n"

I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.

Any tricks to remove the first n characters of each line in a string faster?

Thanks,
Bruce


#90730

From: Joel Goldstick <joel.goldstick@gmail.com>
Date: 2015-05-16 09:43 -0400
Message-ID: <mailman.71.1431783817.17265.python-list@python.org>
In reply to: #90729
On Sat, May 16, 2015 at 9:28 AM,  <bruceg113355@gmail.com> wrote:
> I have a string that contains 10 million characters.
>
> The string is formatted as:
>
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
>
> and I need the string to look like:
>
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"
>
> I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.
>
> Any tricks to remove the first n characters of each line in a string faster?
>
slicing might be faster than searching for :

Do you need to do this all at once?  If not, use a generator

> Thanks,
> Bruce
> --
> https://mail.python.org/mailman/listinfo/python-list



-- 
Joel Goldstick
http://joelgoldstick.com
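
A minimal sketch of the generator suggestion above, assuming the prefix is a fixed 8 characters wide. The function name `strip_prefix` and the sample data are made up for illustration; they do not appear in the thread.

```python
def strip_prefix(lines, offset=8):
    """Yield each line with its first `offset` characters removed."""
    for line in lines:
        yield line[offset:]


# Made-up sample: a 5-digit counter plus " : " gives an 8-character prefix
sample = "00001 : DE AD\n00002 : BE EF\n"

# Nothing is sliced until the generator is iterated
stripped = strip_prefix(sample.splitlines())
first = next(stripped)      # only the first line has been processed here
rest = list(stripped)       # the remaining lines are processed on demand
```

This only pays off if the consumer can work line by line; if a single big string is needed at the end, the lines still have to be joined.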


#90731

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-05-16 23:45 +1000
Message-ID: <mailman.72.1431783967.17265.python-list@python.org>
In reply to: #90729
On Sat, May 16, 2015 at 11:28 PM,  <bruceg113355@gmail.com> wrote:
> I have a string that contains 10 million characters.
>
> The string is formatted as:
>
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
>
> and I need the string to look like:
>
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"
>
> I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.
>
> Any tricks to remove the first n characters of each line in a string faster?

Given that your definition is "each line", what I'd advise is first
splitting the string into lines, then changing each line, and then
rejoining them into a single string.

lines = original_text.split("\n")
new_text = "\n".join(line[8:] for line in lines)

Would that work?

ChrisA


#90733

From: bruceg113355@gmail.com
Date: 2015-05-16 07:02 -0700
Message-ID: <4272d9d9-3d5b-4b8b-9875-6b66634b490c@googlegroups.com>
In reply to: #90731
On Saturday, May 16, 2015 at 9:46:17 AM UTC-4, Chris Angelico wrote:
> On Sat, May 16, 2015 at 11:28 PM,  <bruceg113355@gmail.com> wrote:
> > I have a string that contains 10 million characters.
> >
> > The string is formatted as:
> >
> > "0000001 : some hexadecimal text ... \n
> > 0000002 : some hexadecimal text ... \n
> > 0000003 : some hexadecimal text ... \n
> > ...
> > 0100000 : some hexadecimal text ... \n
> > 0100001 : some hexadecimal text ... \n"
> >
> > and I need the string to look like:
> >
> > "some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > ...
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n"
> >
> > I can split the string at the ":" then iterate through the list removing the first 8 characters then convert back to a string. This method works, but it takes too long to execute.
> >
> > Any tricks to remove the first n characters of each line in a string faster?
> 
> Given that your definition is "each line", what I'd advise is first
> splitting the string into lines, then changing each line, and then
> rejoining them into a single string.
> 
> lines = original_text.split("\n")
> new_text = "\n".join(line[8:] for line in lines)
> 
> Would that work?
> 
> ChrisA


Hi Chris,

I meant to say I can split the string at the \n.

Your approach using .join is what I was looking for.
Thank you,

Bruce


#90740

From: bruceg113355@gmail.com
Date: 2015-05-16 09:22 -0700
Message-ID: <f3836ca1-3f87-4c5c-9839-4d8a35aa77e4@googlegroups.com>
In reply to: #90733
On Saturday, May 16, 2015 at 10:06:31 AM UTC-4, Stefan Ram wrote:
> bruceg113355@gmail.com writes:
> >Your approach using .join is what I was looking for.
> 
>   I'd appreciate a report of your measurements.

# Original Approach
# -----------------
ss = ss.split("\n")
ss1 = ""
for sdata in ss:
    ss1 = ss1 + (sdata[OFFSET:] + "\n")


# Chris's Approach
# ----------------
lines = ss.split("\n")
new_text = "\n".join(line[8:] for line in lines)  


Test #1, Number of Characters: 165110
Original Approach: 18ms
Chris's Approach:   1ms

Test #2, Number of Characters: 470763
Original Approach: 593ms
Chris's Approach:   16ms

Test #3, Number of Characters: 944702
Original Approach: 2.824s
Chris's Approach:    47ms

Test #4, Number of Characters: 5557394
Original Approach: 122s
Chris's Approach:   394ms
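
The measurements above could be reproduced with a harness along these lines. This is a sketch under assumptions: the function names, the generated test data, and the run count are all made up, and the absolute numbers will differ by machine, Python version, and implementation (some interpreters optimize repeated string concatenation, which narrows the gap).

```python
import timeit

OFFSET = 8


def concat_loop(text):
    # Repeated concatenation, mirroring the "Original Approach"
    result = ""
    for line in text.split("\n"):
        result = result + (line[OFFSET:] + "\n")
    return result


def split_join(text):
    # Slice each line, then build the result once with str.join
    return "\n".join(line[OFFSET:] for line in text.split("\n"))


# Made-up test data: 20,000 lines, each with an 8-character prefix
text = "\n".join("%05d : DE AD BE EF" % n for n in range(20000))

for func in (concat_loop, split_join):
    seconds = timeit.timeit(lambda: func(text), number=3)
    print("%-12s %.4f s for 3 runs" % (func.__name__, seconds))
```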


#90744

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2015-05-16 10:57 -0600
Message-ID: <mailman.77.1431795491.17265.python-list@python.org>
In reply to: #90740
On Sat, May 16, 2015 at 10:22 AM,  <bruceg113355@gmail.com> wrote:
> # Chris's Approach
> # ----------------
> lines = ss.split("\n")
> new_text = "\n".join(line[8:] for line in lines)

Looks like the approach you have may be fast enough already, but I'd
wager the generator expression could be replaced with:

    map(operator.itemgetter(slice(8, None)), lines)

for a modest speed-up. On the downside, this is less readable.
Substitute itertools.imap for map if using Python 2.x.
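
For completeness, the itemgetter variant with its import and the join around it, written for Python 3. The sample lines and the name `drop_prefix` are made up; whether this is actually faster than the generator expression would need to be measured on real data.

```python
import operator

lines = ["00001 : DE AD", "00002 : BE EF"]  # made-up sample lines

# itemgetter(slice(8, None)) builds a callable equivalent to: lambda s: s[8:]
drop_prefix = operator.itemgetter(slice(8, None))

# In Python 3, map() is lazy, so no intermediate list is built before the join
new_text = "\n".join(map(drop_prefix, lines))
```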


#90745

From: Chris Angelico <rosuav@gmail.com>
Date: 2015-05-17 02:59 +1000
Message-ID: <mailman.78.1431795548.17265.python-list@python.org>
In reply to: #90740
On Sun, May 17, 2015 at 2:22 AM,  <bruceg113355@gmail.com> wrote:
> # Original Approach
> # -----------------
> ss = ss.split("\n")
> ss1 = ""
> for sdata in ss:
>     ss1 = ss1 + (sdata[OFFSET:] + "\n")
>
>
> # Chris's Approach
> # ----------------
> lines = ss.split("\n")
> new_text = "\n".join(line[8:] for line in lines)

Ah, yep. This is exactly what str.join() exists for :) Though do make
sure the results are the same for each - there are two noteworthy
differences between these two. Your version has a customizable OFFSET,
where mine is hard-coded; I'm sure you know how to change that part.
The subtler one is that "\n".join(...) won't put a \n after the final
string - your version ends up adding one more newline. If that's
important to you, you'll have to add one explicitly. (I suspect
probably not, though; ss.split("\n") won't expect a final newline, so
you'll get a blank entry in the list if there is one, and then you'll
end up reinstating the newline when that blank gets joined in.) Just
remember to check correctness before performance, and you should be
safe.

ChrisA
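
The trailing-newline difference described above can be checked on a tiny input. The sample string is made up and ends in a newline, which is the case where split("\n") produces a final empty entry.

```python
ss = "00001 : DE AD\n00002 : BE EF\n"  # made-up input ending in a newline

parts = ss.split("\n")  # the trailing newline yields a final empty string

# join: the empty last entry puts the final newline back, and no more
joined = "\n".join(line[8:] for line in parts)

# loop: a newline is appended after every entry, including the empty one
looped = ""
for sdata in parts:
    looped = looped + (sdata[8:] + "\n")
```

On this input the loop result has exactly one more newline than the join result, so the two are not byte-for-byte interchangeable.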


#90746

From: bruceg113355@gmail.com
Date: 2015-05-16 10:35 -0700
Message-ID: <8354e4ba-0a80-48a1-b2e9-6edb4b67ff36@googlegroups.com>
In reply to: #90745
On Saturday, May 16, 2015 at 12:59:19 PM UTC-4, Chris Angelico wrote:
> On Sun, May 17, 2015 at 2:22 AM,  <bruceg113355@gmail.com> wrote:
> > # Original Approach
> > # -----------------
> > ss = ss.split("\n")
> > ss1 = ""
> > for sdata in ss:
> >     ss1 = ss1 + (sdata[OFFSET:] + "\n")
> >
> >
> > # Chris's Approach
> > # ----------------
> > lines = ss.split("\n")
> > new_text = "\n".join(line[8:] for line in lines)
> 
> Ah, yep. This is exactly what str.join() exists for :) Though do make
> sure the results are the same for each - there are two noteworthy
> differences between these two. Your version has a customizable OFFSET,
> where mine is hard-coded; I'm sure you know how to change that part.
> The subtler one is that "\n".join(...) won't put a \n after the final
> string - your version ends up adding one more newline. If that's
> important to you, you'll have to add one explicitly. (I suspect
> probably not, though; ss.split("\n") won't expect a final newline, so
> you'll get a blank entry in the list if there is one, and then you'll
> end up reinstating the newline when that blank gets joined in.) Just
> remember to check correctness before performance, and you should be
> safe.
> 
> ChrisA

Hi Chris,

Your approach more than meets my requirements.
Data is formatted correctly and performance is simply amazing. 
OFFSET and \n are small details.

Thank you again,
Bruce


#90760

From: Cameron Simpson <cs@zip.com.au>
Date: 2015-05-17 08:41 +1000
Message-ID: <mailman.87.1431817664.17265.python-list@python.org>
In reply to: #90746
On 16May2015 10:35, bruceg113355@gmail.com <bruceg113355@gmail.com> wrote:
>On Saturday, May 16, 2015 at 12:59:19 PM UTC-4, Chris Angelico wrote:
>> On Sun, May 17, 2015 at 2:22 AM,  <bruceg113355@gmail.com> wrote:
>> > # Original Approach
>> > # -----------------
>> > ss = ss.split("\n")
>> > ss1 = ""
>> > for sdata in ss:
>> >     ss1 = ss1 + (sdata[OFFSET:] + "\n")
>> >
>> > # Chris's Approach
>> > # ----------------
>> > lines = ss.split("\n")
>> > new_text = "\n".join(line[8:] for line in lines)
[...]
>
>Your approach more than meets my requirements.
>Data is formatted correctly and performance is simply amazing.
>OFFSET and \n are small details.

The only comment I'd make at this point is to consider if you really need a 
single string at the end. Keeping it as a list of lines may be more flexible.  
(It will consume more memory.) If you're doing more stuff with the string as 
lines then you'd need to re-split it, and so forth.

Cheers,
Cameron Simpson <cs@zip.com.au>


#90737

From: Grant Edwards <invalid@invalid.invalid>
Date: 2015-05-16 14:59 +0000
Message-ID: <mj7m19$qn$1@reader1.panix.com>
In reply to: #90729
On 2015-05-16, bruceg113355@gmail.com <bruceg113355@gmail.com> wrote:

> I have a string that contains 10 million characters.
>
> The string is formatted as:
>
> "0000001 : some hexadecimal text ... \n
> 0000002 : some hexadecimal text ... \n
> 0000003 : some hexadecimal text ... \n
> ...
> 0100000 : some hexadecimal text ... \n
> 0100001 : some hexadecimal text ... \n"
>
> and I need the string to look like:
>
> "some hexadecimal text ... \n
> some hexadecimal text ... \n
> some hexadecimal text ... \n
> ...
> some hexadecimal text ... \n
> some hexadecimal text ... \n"
>
> I can split the string at the ":" then iterate through the list
> removing the first 8 characters then convert back to a string. This
> method works, but it takes too long to execute.
>
> Any tricks to remove the first n characters of each line in a string faster?

Well, if the strings are all in a file, I'd probably just use sed:

$ sed 's/^........//g' file1.txt >file2.txt

or

$ sed 's/^.*://g' file1.txt >file2.txt
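
If the data is in a file but sed is not available, a rough Python counterpart of the first sed command might look like the sketch below. The function name `strip_file` is hypothetical. Unlike the sed pattern, which leaves lines shorter than 8 characters untouched, slicing empties such a line, newline included.

```python
def strip_file(src_path, dst_path, offset=8):
    """Copy src to dst, dropping the first `offset` characters of each line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Iterating a file yields one line at a time, newline included,
        # so the whole input never has to sit in memory at once
        for line in src:
            dst.write(line[offset:])
```

A call such as strip_file("file1.txt", "file2.txt") would mirror the file names used in the sed commands.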


#90738

From: Rustom Mody <rustompmody@gmail.com>
Date: 2015-05-16 08:13 -0700
Message-ID: <3faf260f-7777-4a60-8212-981340e478b3@googlegroups.com>
In reply to: #90737
On Saturday, May 16, 2015 at 8:30:02 PM UTC+5:30, Grant Edwards wrote:
> On 2015-05-16, bruceg113355 wrote:
> 
> > I have a string that contains 10 million characters.
> >
> > The string is formatted as:
> >
> > "0000001 : some hexadecimal text ... \n
> > 0000002 : some hexadecimal text ... \n
> > 0000003 : some hexadecimal text ... \n
> > ...
> > 0100000 : some hexadecimal text ... \n
> > 0100001 : some hexadecimal text ... \n"
> >
> > and I need the string to look like:
> >
> > "some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n
> > ...
> > some hexadecimal text ... \n
> > some hexadecimal text ... \n"
> >
> > I can split the string at the ":" then iterate through the list
> > removing the first 8 characters then convert back to a string. This
> > method works, but it takes too long to execute.
> >
> > Any tricks to remove the first n characters of each line in a string faster?
> 
> Well, if the strings are all in a file, I'd probably just use sed:
> 
> $ sed 's/^........//g' file1.txt >file2.txt
> 
> or
> 
> $ sed 's/^.*://g' file1.txt >file2.txt


And if they are not in a file you could start by putting them (it) there :-)

Seriously... How does your 'string' come into existence?
How/when do you get hold of it?


#90741

From: bruceg113355@gmail.com
Date: 2015-05-16 09:24 -0700
Message-ID: <5b815f6c-d639-4193-a644-ea2e1a1759a6@googlegroups.com>
In reply to: #90738
On Saturday, May 16, 2015 at 11:13:45 AM UTC-4, Rustom Mody wrote:
> On Saturday, May 16, 2015 at 8:30:02 PM UTC+5:30, Grant Edwards wrote:
> > On 2015-05-16, bruceg113355 wrote:
> > 
> > > I have a string that contains 10 million characters.
> > >
> > > The string is formatted as:
> > >
> > > "0000001 : some hexadecimal text ... \n
> > > 0000002 : some hexadecimal text ... \n
> > > 0000003 : some hexadecimal text ... \n
> > > ...
> > > 0100000 : some hexadecimal text ... \n
> > > 0100001 : some hexadecimal text ... \n"
> > >
> > > and I need the string to look like:
> > >
> > > "some hexadecimal text ... \n
> > > some hexadecimal text ... \n
> > > some hexadecimal text ... \n
> > > ...
> > > some hexadecimal text ... \n
> > > some hexadecimal text ... \n"
> > >
> > > I can split the string at the ":" then iterate through the list
> > > removing the first 8 characters then convert back to a string. This
> > > method works, but it takes too long to execute.
> > >
> > > Any tricks to remove the first n characters of each line in a string faster?
> > 
> > Well, if the strings are all in a file, I'd probably just use sed:
> > 
> > $ sed 's/^........//g' file1.txt >file2.txt
> > 
> > or
> > 
> > $ sed 's/^.*://g' file1.txt >file2.txt
> 
> 
> And if they are not in a file you could start by putting them (it) there :-)
> 
> Seriously... How does your 'string' come into existence?
> How/when do you get hold of it?

Data is coming from a wxPython TextCtrl widget.
The widget is displaying data received on a serial port for a user to analyze.


#90743

From: Irmen de Jong <irmen.NOSPAM@xs4all.nl>
Date: 2015-05-16 18:55 +0200
Message-ID: <55577688$0$2821$e4fe514c@news.xs4all.nl>
In reply to: #90741
On 16-5-2015 18:24, bruceg113355@gmail.com wrote:
> Data is coming from a wxPython TextCtrl widget.

Hm, there should be a better source of the data before it ends up in the textctrl widget.

> The widget is displaying data received on a serial port for a user to analyze.

If this is read from a serial port, can't you process the data directly when it arrives?
This may give you the chance to simply operate on the line as soon as it arrives from
the port, before pasting it all into the textctrl.



Irmen
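
One way to picture processing the data as it arrives: buffer the incoming chunks and strip the prefix from each line once it is complete. This is only a sketch. The class name `LineStripper` is hypothetical, the chunks are made-up strings standing in for whatever the serial-port read returns, and a fixed 8-character prefix is assumed.

```python
class LineStripper:
    """Buffer incoming text chunks and emit complete lines minus a prefix."""

    def __init__(self, offset=8):
        self.offset = offset
        self._buffer = ""

    def feed(self, chunk):
        """Add a chunk; return the stripped lines completed by it."""
        self._buffer += chunk
        # Everything before the last "\n" is complete; the tail stays buffered
        *complete, self._buffer = self._buffer.split("\n")
        return [line[self.offset:] for line in complete]


stripper = LineStripper()
received = []
# Made-up chunks that split lines at arbitrary points, as a port might
for chunk in ["00001 : DE", " AD\n00002 ", ": BE EF\n"]:
    received.extend(stripper.feed(chunk))
```

Each stripped line could then be appended to the widget instead of the raw one, so no bulk clean-up is needed afterwards.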


#90761

From: Denis McMahon <denismfmcmahon@gmail.com>
Date: 2015-05-16 23:24 +0000
Message-ID: <mj8jik$f1c$5@dont-email.me>
In reply to: #90729
On Sat, 16 May 2015 06:28:19 -0700, bruceg113355 wrote:

> I have a string that contains 10 million characters.
> 
> The string is formatted as:
> 
> "0000001 : some hexadecimal text ... \n 0000002 : some hexadecimal text
> ... \n 0000003 : some hexadecimal text ... \n ...
> 0100000 : some hexadecimal text ... \n 0100001 : some hexadecimal text
> ... \n"
> 
> and I need the string to look like:
> 
> "some hexadecimal text ... \n some hexadecimal text ... \n some
> hexadecimal text ... \n ...
> some hexadecimal text ... \n some hexadecimal text ... \n"

Looks to me as if you have a 10 Mbyte encoded file with line numbers as 
ascii text and you're trying to strip the line numbers before decoding 
the file.

Are you looking for a one-off solution, or do you have a lot of these 
files?

If you have a lot of files to process, you could try using something like 
sed.

sed -E -i.old 's/^[0-9]+ : //' *.ext

-- 
Denis McMahon, denismfmcmahon@gmail.com
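
For a string already held in memory, the same "digits, space, colon, space" prefix can be removed with Python's re module. A small sketch on made-up sample text; re.MULTILINE makes ^ match at the start of every line rather than only at the start of the string.

```python
import re

text = "0000001 : DE AD\n0000002 : BE EF\n"  # made-up sample

# Remove a run of digits followed by " : " at the start of each line
stripped = re.sub(r"^\d+ : ", "", text, flags=re.MULTILINE)
```

Because the pattern matches the prefix itself rather than a fixed width, this still works if the line numbers grow by a digit, at the cost of a regex match per line.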
