Groups > microsoft.public.scripting.vbscript > #12321 > unrolled thread

Why does a loop using instr() and mid() get slower on larger data ?

Started by	"R.Wieser" <address@not.available>
First post	2021-01-02 12:41 +0100
Last post	2021-01-02 21:47 +0100
Articles	14 — 3 participants

Back to article view | Back to microsoft.public.scripting.vbscript

  Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 12:41 +0100
    Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 08:39 -0500
      Re: Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 16:46 +0100
        Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 15:26 -0500
          Re: Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 22:11 +0100
            Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 17:27 -0500
              Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "R.Wieser" <address@not.available> - 2021-01-03 08:58 +0100
                Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "Mayayana" <mayayana@invalid.nospam> - 2021-01-03 08:54 -0500
                  Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "R.Wieser" <address@not.available> - 2021-01-03 16:01 +0100
                    Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "Mayayana" <mayayana@invalid.nospam> - 2021-01-03 12:51 -0500
                      Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "R.Wieser" <address@not.available> - 2021-01-03 19:26 +0100
            Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 17:43 -0500
    Re: Why does a loop using instr() and mid() get slower on larger data ? JJ <jj4public@gmail.com> - 2021-01-03 02:47 +0700
      Re: Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 21:47 +0100

#12321 — Why does a loop using instr() and mid() get slower on larger data ?

From	"R.Wieser" <address@not.available>
Date	2021-01-02 12:41 +0100
Subject	Why does a loop using instr() and mid() get slower on larger data ?
Message-ID	<rspm4t$1pm1$1@gioia.aioe.org>

Hello all,

I'm using InStr() and mid() in a loop on large data strings.   What I've 
noticed is that the larger the datastring, the slower the process seems to 
become.   The time difference between results being returned for a few KByte 
and even 5 MByte of data is rather noticable (and I've got a 25 MByte 
file...).

Does anyone know what is causing it (and perhaps how to fix it) ?

Regards,
Rudy Wieser

[toc] | [next] | [standalone]

#12322

From	"Mayayana" <mayayana@invalid.nospam>
Date	2021-01-02 08:39 -0500
Message-ID	<rspt2m$613$1@dont-email.me>
In reply to	#12321

"R.Wieser" <address@not.available> wrote
|
| I'm using InStr() and mid() in a loop on large data strings.   What I've
| noticed is that the larger the datastring, the slower the process seems to
| become.   The time difference between results being returned for a few 
KByte
| and even 5 MByte of data is rather noticable (and I've got a 25 MByte
| file...).
|

For concatenation or adding to a string:

   Memory allocation. When you add a character to a 25 MB
string you have to allocate new memory of 25 MB + 1 byte.
According to Matthew Curland, VB6 has a string cache to
work from, up to a point. After that point it gets slow.
Presumably WScript and other programs also set aside a minor
cache, but I don't know about that.

  People end up doing things like writing string builder classes
to get around the problem. The fastest solution I know of in
VB is to allocate a string bigger than might be needed, then
point an array at it so that it can be treated as an array for
reading. Using the Mid *statement* then allows writing to
the string pretty much as fast as CopyMemory.

  VBS doesn't have the Mid statement, only the function.
But it is extremely fast to do Join. So whenever I need to
build a string I'll do it that way:

For i = 1 to 10000
  s = "something" & x & "something else"
  A(i) = s
Next

s = Join(A, "")

For InStr:

  If you see a lag just reading
a string or doing InStr then I don't know the problem,
assuming that you're not counting the time to allocate the
string. VBS is just generally slow. But I do find that searching
case-insensitive is much slower. So much so that if you have
a lot to do it's worth allocating a second string UCase(s)
and then search that: InStr(pt1, s, "SOMETHING", 0)

  I just tried loading a 100 MB text file and looking for a
string near the end. It took .14 seconds. I'd expect a few
ms in VB, but VBS does a lot of clunky wrapping. Everything
has to be converted to/from variants. Direct memory access
is wrapped in layers. I just
assume anything big will be slow. On the other hand, if
you need to deal with 25 MB files enough to matter then
maybe something else is wrong. No *typical* text file should
need to be that big.

  I have found a big difference with my server logs. I parse
server logs to extract all entries with a particular
date, so that I can create one file for each day. That's thousands
of InStr calls for each day. With a 5 MB file I can extract
and write files for 5 days in about 1-2 seconds. With a 300
MB server log I have to leave it to work. Probably more than
a minute for each day. When the logs get big I use InStrRev,
but it's still very slow. It's possible WScript is not keeping
that string in RAM. I don't know.

[toc] | [prev] | [next] | [standalone]

#12323

From	"R.Wieser" <address@not.available>
Date	2021-01-02 16:46 +0100
Message-ID	<rsq4fr$2tr$1@gioia.aioe.org>
In reply to	#12322

Mayayana,

> For concatenation or adding to a string:

I know that that can get rather slow quicly.  But I'm not adding anything to 
the origional data, just instr() to find something and using mid() to copy a 
small part following it (and than work on it).

>  VBS doesn't have the Mid statement, only the function.

:-) Yup.  Nice aint it ?   You have to grab-and-store the (huge) strings 
before and after what you want to replace and than glue them together with 
the new string in between.

>  If you see a lag just reading

That what I have, just while reading.

And the odd thing is that just finding a string in the first few lines slows 
down depending on the size of the to-be-searched string.    I have no 
explanation for that, other than that it does /something/ involving the size 
of the string.   I would not know what though, as variant strings are 
size-prefixed (they can have embedded NULs)

> But I do find that searching case-insensitive is much slower.

Thats to be expected.   Thanks for reminding me though.  I don't think that 
the file contents are case sensitive, so I could just lcase() the lot of it 
and have some speed gain.

>  I just tried loading a 100 MB text file and looking
> for a string near the end. It took .14 seconds

:-)  To many variables.   Loading and searching together ?  ReadLine or 
ReadAll ?

> On the other hand, if you need to deal with 25 MB files
> enough to matter then maybe something else is wrong.
> No *typical* text file should need to be that big.

It definitily is a typical textfile ... for an OpenStreetMap .OSM datafile 
that is. :-)

I'm trying to convert such an dataset into a usable image.   The first step 
was to go thru all the "node" elements and plot their coordinates.

Just to check I rewrote the script to read the file line-by-line and see 
what would happen.  The end result is a few times faster than reading all of 
it into memory and than have a go at it.   I did not quite expect that ...

Rgeards,
Rudy Wieser

[toc] | [prev] | [next] | [standalone]

#12325

From	"Mayayana" <mayayana@invalid.nospam>
Date	2021-01-02 15:26 -0500
Message-ID	<rsqku7$ksc$1@dont-email.me>
In reply to	#12323

"R.Wieser" <address@not.available> wrote

| >  I just tried loading a 100 MB text file and looking
| > for a string near the end. It took .14 seconds
|
| :-)  To many variables.   Loading and searching together ?  ReadLine or
| ReadAll ?
|

  No, dummy. :) I loaded the whole file as a string.
Then I put a Timer call on either side of InStr, so that
I'd just be testing the actual InStr time.

| I'm trying to convert such an dataset into a usable image.   The first 
step
| was to go thru all the "node" elements and plot their coordinates.
|
| Just to check I rewrote the script to read the file line-by-line and see
| what would happen.  The end result is a few times faster than reading all 
of
| it into memory and than have a go at it.   I did not quite expect that ...
|

   That is surprising. I rarely read by line. But I
rarely deal with big files.

   Open Street Map... I
don't suppose you've done anything with weather?
The US service, NOAA, made a big mess of their
radar maps in trying to update them. I'v been puttering
around with options. I can easily get a forecast in JSON,
but can't find a source of simple radar maps that show
where the rain is right now, and how hard it's coming
down.

[toc] | [prev] | [next] | [standalone]

#12327

From	"R.Wieser" <address@not.available>
Date	2021-01-02 22:11 +0100
Message-ID	<rsqnho$157m$2@gioia.aioe.org>
In reply to	#12325

Mayayana,

>  No, dummy. :)

Hey ! Watch who are you calling dummy, dummy. :-)

> I loaded the whole file as a string.
> Then I put a Timer call on either side of InStr,
> so that I'd just be testing the actual InStr time.

OK, understood.

And I noticed you mentioned 0.14 seconds on a 100MB file (I first thought 
you said 14 sec).   Here a repeated search on just a 5MB file gave me 
between 2 and 5 results a second, or between 200 and 500 mSec.     I would 
have expected that with a much larger datastring it would become a multiple 
of that.

> I don't suppose you've done anything with weather?

Other than viewing an rain probability image or two ? No, not really. Sorry.

Regards,
Rudy Wieser

[toc] | [prev] | [next] | [standalone]

#12328

From	"Mayayana" <mayayana@invalid.nospam>
Date	2021-01-02 17:27 -0500
Message-ID	<rsqs1f$5le$1@dont-email.me>
In reply to	#12327

"R.Wieser" <address@not.available> wrote

| And I noticed you mentioned 0.14 seconds on a 100MB file (I first thought
| you said 14 sec).   Here a repeated search on just a 5MB file gave me
| between 2 and 5 results a second, or between 200 and 500 mSec.     I would
| have expected that with a much larger datastring it would become a 
multiple
| of that.
|

  I don't know. InStr is incredibly fast. I suspect the
wrapper bloat is much of the time. You have to tell
WScript to do it. WScript has to run the operation.
The string must be converted to BSTR. It all adds up.
And I suppose our systems may vary.

 I just tried loading the bible. 4.29 MB of plain text.
I added "wordyyy" near the end and looked for that.
Position: 4,497,580  Time: 0
  It didn't register as being enough time to measure.
But if you're doing multiple searches that introduces lots of other
variables.

 So I tried this loop with the same Bible text:

pt2 = 1
pt1 = 1

x = 0
t1 = Timer
Do Until pt1 = 0
  pt1 = InStr(pt2, s1, "John", 0)
  pt2 = pt1 + 1
  x = x + 1
Loop

t2 = timer
MsgBox x - 1 & vbCrLf & (t2 - t1)

 It returns 140 instances, found in 0 seconds. Interesting.
So once all the overhead slop is dealt with, the looping
code seems to take almost no time at all.

[toc] | [prev] | [next] | [standalone]

#12330 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

From	"R.Wieser" <address@not.available>
Date	2021-01-03 08:58 +0100
Subject	Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)
Message-ID	<rsrtfa$1v8c$1@gioia.aioe.org>
In reply to	#12328

Mayayana,

>  I don't know. InStr is incredibly fast. I suspect the
> wrapper bloat is much of the time.

Yesterday, ofcourse just after I called it a night and switched my 'puter 
off, I got a brainfart.   Just now I tested it.

You mentioned caseless searching, and that is what I have been doing (and 
forgot to mention, sorry).   Changing that last InStr argument back to 
casefull searching sped the whole process up considerably.

What I think that happens is that InStr() first converts both strings to 
either upper or lower case, and only than does the search.    While that 
isn't a problem for relativly short strings, the conversion does take some 
time when you throw 25 MB at it ...

... especially when you do that in a loop which finds over 40,000 .

Hence that reading the file line-by-line and than doing the caseless InStr() 
sped the process up.

The whole process using the 25 MB string still takes several seconds to 
complete though.  Oh well.

Regards,
Rudy Wieser

[toc] | [prev] | [next] | [standalone]

#12331 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

From	"Mayayana" <mayayana@invalid.nospam>
Date	2021-01-03 08:54 -0500
Subject	Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)
Message-ID	<rssian$hjo$1@dont-email.me>
In reply to	#12330

"R.Wieser" <address@not.available> wrote

| The whole process using the 25 MB string still takes several seconds to
| complete though.  Oh well.
|

  Did you try a UCase or LCase first and then
ReadAll? I don't see the point of reading by line
unless you expect to be quitting near the
beginning.

  I don't know how case works. It wouldn't be
difficult to just search for "65 or 97" (A or a).
But either way it's a more complex search.

 I imagine it might also depend on the length
of the search string and how many similarities
there are. If you have 25 MB of "orange is the
new black" and one instance of "orange is the
new green" near the end, I would expect all those
near matches to bog things down. But I've never
actually tested it.

[toc] | [prev] | [next] | [standalone]

#12332 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

From	"R.Wieser" <address@not.available>
Date	2021-01-03 16:01 +0100
Subject	Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)
Message-ID	<rssm7m$5ai$1@gioia.aioe.org>
In reply to	#12331

Mayayana,

> | The whole process using the 25 MB string still takes several
> | seconds to complete though.  Oh well.

>  Did you try a UCase or LCase first and then ReadAll?

Whut ?   How do you suppose I would be able to do that ?    I would think I 
would need to read the data before being able to act upon it ....

> I don't see the point of reading by line

The drawback of reading everything into memory and doing an initial lcase or 
ucase is that you have to assume that none of the files data is in a 
mixed-case form thats important to the user.   In my case the OSM file 
contains name of streets, etc.   Those I would like to keep the casing.

> I imagine it might also depend on the length
> of the search string and how many similarities
> there are.

Ofcourse!  And as such not something I would think twice about.

Regards,
Rudy Wieser

[toc] | [prev] | [next] | [standalone]

#12333 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

From	"Mayayana" <mayayana@invalid.nospam>
Date	2021-01-03 12:51 -0500
Subject	Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)
Message-ID	<rst07m$pa8$1@dont-email.me>
In reply to	#12332

"R.Wieser" <address@not.available> wrote

| >  Did you try a UCase or LCase first and then ReadAll?
|
| Whut ?   How do you suppose I would be able to do that ?    I would think 
I
| would need to read the data before being able to act upon it ....
|

   I just meant using both of those methods as one
option to test.
s = TS.ReadAll
s = UCase(s)
Pt1 = Instr(s, "findstring")

| > I don't see the point of reading by line
|
| The drawback of reading everything into memory and doing an initial lcase 
or
| ucase is that you have to assume that none of the files data is in a
| mixed-case form thats important to the user.   In my case the OSM file
| contains name of streets, etc.   Those I would like to keep the casing.
|

    I've run into that before. In that case I just
have two:

s2 = UCase(s)
pt1 = InStr(s2, "MAPLE ST.")
if Pt1 > 0 then x = Mid(s, pt1, 9)

  The speed difference is worth it. Ive written a lot of
scripts that do intensive work, such as converting
an HXS to CHM and heavily editing every HTML file
before doing the CHM compile.
  I've found the two things that make a very notable
difference in speed are concatenating with arrays
and case-sensitive search. When I'm doing it for 100
files, and rebuilding each file, that makes a big difference.

[toc] | [prev] | [next] | [standalone]

#12334 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

From	"R.Wieser" <address@not.available>
Date	2021-01-03 19:26 +0100
Subject	Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)
Message-ID	<rst27r$vb$1@gioia.aioe.org>
In reply to	#12333

Mayayana,

>  I just meant using both of those methods as one
> option to test.
> s = TS.ReadAll
> s = UCase(s)

I would do the same.   Its just that it looked like you suggested the 
reverse order ...

>    I've run into that before. In that case I just
> have two:

I was also considereing using that.  Didn't like to have a duplicate of the 
same, rather large string though.   But that may still be my "640 KB is 
enough for everyone" memories talking though. Programming (small) 
microcontrollers might also have something to do with it.  :-)

>  The speed difference is worth it.

I will most likely try it out, if only to see the difference.

>  I've found the two things that make a very notable
> difference in speed are concatenating with arrays
> and case-sensitive search

:-) I've just ran into the latter, but can't say I've ever used the former.

Thanks.

Regards,
Rudy Wieser

[toc] | [prev] | [next] | [standalone]

#12329

From	"Mayayana" <mayayana@invalid.nospam>
Date	2021-01-02 17:43 -0500
Message-ID	<rsqsu1$bjt$1@dont-email.me>
In reply to	#12327

   One other note: You probably know this, but
the time measuring is not so accurate at very low levels.
In my experience VBScript timer is pretty good. But
if you're going to measure small things you need to
take multiple readings.

 To mix things up even more, I just tried loading my
100 MB file, which is a list of IP addresses and various
other numbers. I did the same loop for "200". It found
641105 instances in 1.109375 seconds! But the whole
operation, including loading the file, was more like 5
or 6 seconds. Nevertheless, the actual InStr loops
seem to be something like 2 nanoseconds each, and
that includes reassigning the value of pt2, as well as
adding 1 to x. Wowza!

[toc] | [prev] | [next] | [standalone]

#12324

From	JJ <jj4public@gmail.com>
Date	2021-01-03 02:47 +0700
Message-ID	<1392nds00okcb$.17b0a2wzh38wa$.dlg@40tude.net>
In reply to	#12321

On Sat, 2 Jan 2021 12:41:16 +0100, R.Wieser wrote:
> Hello all,
> 
> I'm using InStr() and mid() in a loop on large data strings.   What I've 
> noticed is that the larger the datastring, the slower the process seems to 
> become.   The time difference between results being returned for a few KByte 
> and even 5 MByte of data is rather noticable (and I've got a 25 MByte 
> file...).
> 
> Does anyone know what is causing it (and perhaps how to fix it) ?
> 
> Regards,
> Rudy Wieser

Because InStr() is basically a loop which enumerates the characters within 
the source string, until the matching substring is found.

The process will be fast if the matching substring happens to be at the 
start or near the start of the source string. Otherwise, it can be a slow 
process depending on the source string length.

If the contents of the source string is a plain text, string indexing may be 
used to increase substring search. But it won't be noticably beneficial if 
the substring search needs to be done less than 3 times (for each source 
string), considering that the string index generation requires processing 
time also.

[toc] | [prev] | [next] | [standalone]

#12326

From	"R.Wieser" <address@not.available>
Date	2021-01-02 21:47 +0100
Message-ID	<rsqnho$157m$1@gioia.aioe.org>
In reply to	#12324

JJ,

> The process will be fast if the matching substring happens to
> be at the start or near the start of the source string.

:-) I'm aware of that.

I ran the same instr() in a loop over a small and a large datastring, and I 
got (lots of) results back at different speeds - for the same data (the 
larger datastring just contained more of it).

Regards,
Rudy Wieser

[toc] | [prev] | [standalone]

csiph-web

Why does a loop using instr() and mid() get slower on larger data ?

Contents

#12321 — Why does a loop using instr() and mid() get slower on larger data ?

#12322

#12323

#12325

#12327

#12328

#12330 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

#12331 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

#12332 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

#12333 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

#12334 — Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think)

#12329

#12324

#12326