Groups > microsoft.public.scripting.vbscript > #12321 > unrolled thread
| Started by | "R.Wieser" <address@not.available> |
|---|---|
| First post | 2021-01-02 12:41 +0100 |
| Last post | 2021-01-02 21:47 +0100 |
| Articles | 14 — 3 participants |
Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 12:41 +0100
Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 08:39 -0500
Re: Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 16:46 +0100
Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 15:26 -0500
Re: Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 22:11 +0100
Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 17:27 -0500
Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "R.Wieser" <address@not.available> - 2021-01-03 08:58 +0100
Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "Mayayana" <mayayana@invalid.nospam> - 2021-01-03 08:54 -0500
Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "R.Wieser" <address@not.available> - 2021-01-03 16:01 +0100
Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "Mayayana" <mayayana@invalid.nospam> - 2021-01-03 12:51 -0500
Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) "R.Wieser" <address@not.available> - 2021-01-03 19:26 +0100
Re: Why does a loop using instr() and mid() get slower on larger data ? "Mayayana" <mayayana@invalid.nospam> - 2021-01-02 17:43 -0500
Re: Why does a loop using instr() and mid() get slower on larger data ? JJ <jj4public@gmail.com> - 2021-01-03 02:47 +0700
Re: Why does a loop using instr() and mid() get slower on larger data ? "R.Wieser" <address@not.available> - 2021-01-02 21:47 +0100
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-02 12:41 +0100 |
| Subject | Why does a loop using instr() and mid() get slower on larger data ? |
| Message-ID | <rspm4t$1pm1$1@gioia.aioe.org> |
Hello all,

I'm using InStr() and Mid() in a loop on large data strings. What I've noticed is that the larger the data string, the slower the process seems to become. The time difference between results being returned for a few KByte and even 5 MByte of data is rather noticeable (and I've got a 25 MByte file...).

Does anyone know what is causing it (and perhaps how to fix it)?

Regards,
Rudy Wieser
| From | "Mayayana" <mayayana@invalid.nospam> |
|---|---|
| Date | 2021-01-02 08:39 -0500 |
| Message-ID | <rspt2m$613$1@dont-email.me> |
| In reply to | #12321 |
"R.Wieser" <address@not.available> wrote
|
| I'm using InStr() and Mid() in a loop on large data strings. What I've
| noticed is that the larger the data string, the slower the process seems to
| become. The time difference between results being returned for a few KByte
| and even 5 MByte of data is rather noticeable (and I've got a 25 MByte
| file...).
|

For concatenation or adding to a string:

Memory allocation. When you add a character to a 25 MB string you have to allocate new memory of 25 MB + 1 byte. According to Matthew Curland, VB6 has a string cache to work from, up to a point. After that point it gets slow. Presumably WScript and other programs also set aside a minor cache, but I don't know about that. People end up doing things like writing string builder classes to get around the problem.

The fastest solution I know of in VB is to allocate a string bigger than might be needed, then point an array at it so that it can be treated as an array for reading. Using the Mid *statement* then allows writing to the string pretty much as fast as CopyMemory.

VBS doesn't have the Mid statement, only the function. But it is extremely fast to do Join. So whenever I need to build a string I'll do it that way:

For i = 1 to 10000
  s = "something" & x & "something else"
  A(i) = s
Next
s = Join(A, "")

For InStr:

If you see a lag just reading a string or doing InStr then I don't know the problem, assuming that you're not counting the time to allocate the string. VBS is just generally slow. But I do find that searching case-insensitive is much slower. So much so that if you have a lot to do it's worth allocating a second string, s2 = UCase(s), and then searching that:

InStr(pt1, s2, "SOMETHING", 0)

I just tried loading a 100 MB text file and looking for a string near the end. It took .14 seconds. I'd expect a few ms in VB, but VBS does a lot of clunky wrapping. Everything has to be converted to/from variants. Direct memory access is wrapped in layers. I just assume anything big will be slow.

On the other hand, if you need to deal with 25 MB files enough to matter then maybe something else is wrong. No *typical* text file should need to be that big.

I have found a big difference with my server logs. I parse server logs to extract all entries with a particular date, so that I can create one file for each day. That's thousands of InStr calls for each day. With a 5 MB file I can extract and write files for 5 days in about 1-2 seconds. With a 300 MB server log I have to leave it to work. Probably more than a minute for each day. When the logs get big I use InStrRev, but it's still very slow. It's possible WScript is not keeping that string in RAM. I don't know.
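A minimal sketch of the array-and-Join approach described in the post above, written out as a complete script. The function name `BuildWithJoin` and the "item" text are made up for illustration, and the sketch assumes `count` is at least 1. It only shows the shape of the technique; how much faster it is than repeated `&` concatenation will depend on the host and the data.

```vbscript
Option Explicit

' Build a large string by filling an array and joining it once,
' instead of growing a single string with & inside the loop.
' BuildWithJoin is a hypothetical helper name; count must be >= 1.
Function BuildWithJoin(count)
    Dim parts(), i
    ReDim parts(count - 1)                  ' zero-based: indexes 0 .. count-1
    For i = 0 To count - 1
        ' Only small, short-lived concatenations happen per element.
        parts(i) = "item " & i & vbCrLf
    Next
    BuildWithJoin = Join(parts, "")         ' the big string is produced by one Join call
End Function

Dim t1, s
t1 = Timer
s = BuildWithJoin(10000)
WScript.Echo Len(s) & " characters built in " & (Timer - t1) & " seconds"
```

Run it with `cscript` so that `WScript.Echo` prints to the console instead of opening a message box.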
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-02 16:46 +0100 |
| Message-ID | <rsq4fr$2tr$1@gioia.aioe.org> |
| In reply to | #12322 |
Mayayana,

> For concatenation or adding to a string:

I know that that can get rather slow quickly. But I'm not adding anything to the original data, just using InStr() to find something and Mid() to copy a small part following it (and then work on it).

> VBS doesn't have the Mid statement, only the function.

:-) Yup. Nice, ain't it? You have to grab-and-store the (huge) strings before and after what you want to replace and then glue them together with the new string in between.

> If you see a lag just reading

That's what I have, just while reading. And the odd thing is that just finding a string in the first few lines slows down depending on the size of the to-be-searched string. I have no explanation for that, other than that it does /something/ involving the size of the string. I would not know what though, as variant strings are size-prefixed (they can have embedded NULs).

> But I do find that searching case-insensitive is much slower.

That's to be expected. Thanks for reminding me though. I don't think that the file contents are case sensitive, so I could just LCase() the lot of it and have some speed gain.

> I just tried loading a 100 MB text file and looking
> for a string near the end. It took .14 seconds

:-) Too many variables. Loading and searching together? ReadLine or ReadAll?

> On the other hand, if you need to deal with 25 MB files
> enough to matter then maybe something else is wrong.
> No *typical* text file should need to be that big.

It definitely is a typical text file ... for an OpenStreetMap .OSM data file, that is. :-)

I'm trying to convert such a dataset into a usable image. The first step was to go through all the "node" elements and plot their coordinates.

Just to check, I rewrote the script to read the file line-by-line to see what would happen. The end result is a few times faster than reading all of it into memory and then having a go at it. I did not quite expect that ...

Regards,
Rudy Wieser
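For readers who want to try the line-by-line variant mentioned above, here is a rough sketch. It is not Rudy's script: the file name `map.osm`, the helper names `AttrValue` and `CountNodes`, and the assumption that each `<node ...>` element sits on its own line with `lat="..."` and `lon="..."` attributes are all illustrative guesses.

```vbscript
Option Explicit

Const ForReading = 1

' Return the value of an attribute such as lat="52.1" from one line,
' or "" when the attribute is not present. AttrValue is a hypothetical helper.
Function AttrValue(textLine, attrName)
    Dim key, p1, p2
    AttrValue = ""
    key = attrName & "="""                          ' e.g. lat="
    p1 = InStr(1, textLine, key, vbBinaryCompare)
    If p1 = 0 Then Exit Function
    p1 = p1 + Len(key)                              ' first character of the value
    p2 = InStr(p1, textLine, """", vbBinaryCompare) ' closing quote
    If p2 = 0 Then Exit Function
    AttrValue = Mid(textLine, p1, p2 - p1)
End Function

' Read the file one line at a time, so every InStr call only ever
' sees a short string. Returns the number of lines containing "<node ".
Function CountNodes(path)
    Dim fso, ts, textLine, n
    Set fso = CreateObject("Scripting.FileSystemObject")
    Set ts = fso.OpenTextFile(path, ForReading)
    n = 0
    Do Until ts.AtEndOfStream
        textLine = ts.ReadLine
        If InStr(1, textLine, "<node ", vbBinaryCompare) > 0 Then
            n = n + 1
            ' AttrValue(textLine, "lat") and AttrValue(textLine, "lon")
            ' would supply the coordinates to plot here.
        End If
    Loop
    ts.Close
    CountNodes = n
End Function

Dim fsoCheck
Set fsoCheck = CreateObject("Scripting.FileSystemObject")
If fsoCheck.FileExists("map.osm") Then              ' hypothetical file name
    WScript.Echo CountNodes("map.osm") & " node elements found"
End If
```

The attribute lookup is deliberately naive (it would also match an attribute whose name merely ends in `lat`), which is acceptable for a quick experiment but not for a general XML reader.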
| From | "Mayayana" <mayayana@invalid.nospam> |
|---|---|
| Date | 2021-01-02 15:26 -0500 |
| Message-ID | <rsqku7$ksc$1@dont-email.me> |
| In reply to | #12323 |
"R.Wieser" <address@not.available> wrote
| > I just tried loading a 100 MB text file and looking
| > for a string near the end. It took .14 seconds
|
| :-) Too many variables. Loading and searching together? ReadLine or
| ReadAll?
|

No, dummy. :) I loaded the whole file as a string. Then I put a Timer call on either side of InStr, so that I'd just be testing the actual InStr time.

| I'm trying to convert such a dataset into a usable image. The first step
| was to go through all the "node" elements and plot their coordinates.
|
| Just to check, I rewrote the script to read the file line-by-line to see
| what would happen. The end result is a few times faster than reading all of
| it into memory and then having a go at it. I did not quite expect that ...
|

That is surprising. I rarely read by line. But I rarely deal with big files.

Open Street Map... I don't suppose you've done anything with weather? The US service, NOAA, made a big mess of their radar maps in trying to update them. I've been puttering around with options. I can easily get a forecast in JSON, but can't find a source of simple radar maps that show where the rain is right now, and how hard it's coming down.
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-02 22:11 +0100 |
| Message-ID | <rsqnho$157m$2@gioia.aioe.org> |
| In reply to | #12325 |
Mayayana,

> No, dummy. :)

Hey! Watch who you are calling dummy, dummy. :-)

> I loaded the whole file as a string.
> Then I put a Timer call on either side of InStr,
> so that I'd just be testing the actual InStr time.

OK, understood.

And I noticed you mentioned 0.14 seconds on a 100 MB file (I first thought you said 14 sec). Here a repeated search on just a 5 MB file gave me between 2 and 5 results a second, or between 200 and 500 mSec per result. I would have expected that with a much larger data string it would become a multiple of that.

> I don't suppose you've done anything with weather?

Other than viewing a rain probability image or two? No, not really. Sorry.

Regards,
Rudy Wieser
| From | "Mayayana" <mayayana@invalid.nospam> |
|---|---|
| Date | 2021-01-02 17:27 -0500 |
| Message-ID | <rsqs1f$5le$1@dont-email.me> |
| In reply to | #12327 |
"R.Wieser" <address@not.available> wrote
| And I noticed you mentioned 0.14 seconds on a 100MB file (I first thought
| you said 14 sec). Here a repeated search on just a 5MB file gave me
| between 2 and 5 results a second, or between 200 and 500 mSec. I would
| have expected that with a much larger datastring it would become a multiple
| of that.
|

I don't know. InStr is incredibly fast. I suspect the wrapper bloat is much of the time. You have to tell WScript to do it. WScript has to run the operation. The string must be converted to BSTR. It all adds up. And I suppose our systems may vary.

I just tried loading the Bible. 4.29 MB of plain text. I added "wordyyy" near the end and looked for that.

Position: 4,497,580
Time: 0

It didn't register as being enough time to measure. But if you're doing multiple searches that introduces lots of other variables. So I tried this loop with the same Bible text:

pt2 = 1
pt1 = 1
x = 0
t1 = Timer
Do Until pt1 = 0
  pt1 = InStr(pt2, s1, "John", 0)
  pt2 = pt1 + 1
  x = x + 1
Loop
t2 = Timer
MsgBox x - 1 & vbCrLf & (t2 - t1)

It returns 140 instances, found in 0 seconds. Interesting. So once all the overhead slop is dealt with, the looping code seems to take almost no time at all.
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-03 08:58 +0100 |
| Subject | Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) |
| Message-ID | <rsrtfa$1v8c$1@gioia.aioe.org> |
| In reply to | #12328 |
Mayayana,

> I don't know. InStr is incredibly fast. I suspect the
> wrapper bloat is much of the time.

Yesterday, of course just after I called it a night and switched my 'puter off, I got a brainfart. Just now I tested it.

You mentioned caseless searching, and that is what I have been doing (and forgot to mention, sorry). Changing that last InStr argument back to case-sensitive searching sped the whole process up considerably.

What I think happens is that InStr() first converts both strings to either upper or lower case, and only then does the search. While that isn't a problem for relatively short strings, the conversion does take some time when you throw 25 MB at it ...

... especially when you do that in a loop which finds over 40,000.

Hence reading the file line-by-line and then doing the caseless InStr() sped the process up.

The whole process using the 25 MB string still takes several seconds to complete though. Oh well.

Regards,
Rudy Wieser
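One way to check the observation above on your own machine is to time the same search loop in both compare modes. This is only a measuring sketch: `CountMatches` and `MakeData` are hypothetical helpers, the test data is synthetic, and the idea that text compare converts the whole string on every call is Rudy's hypothesis, not something the script proves.

```vbscript
Option Explicit

' Count non-overlapping occurrences of needle in haystack using the
' given compare mode (vbBinaryCompare = 0, vbTextCompare = 1).
Function CountMatches(haystack, needle, compareMode)
    Dim pos, n
    n = 0
    pos = InStr(1, haystack, needle, compareMode)
    Do While pos > 0
        n = n + 1
        ' Continue just past the previous match.
        pos = InStr(pos + Len(needle), haystack, needle, compareMode)
    Loop
    CountMatches = n
End Function

' Synthetic stand-in for a real data file: lineCount similar lines.
Function MakeData(lineCount)
    Dim parts(), i
    ReDim parts(lineCount - 1)
    For i = 0 To lineCount - 1
        parts(i) = "<node id=""" & i & """ lat=""52.0"" lon=""5.0""/>"
    Next
    MakeData = Join(parts, vbCrLf)
End Function

Dim data, t1, n
data = MakeData(2000)       ' raise this number to see how the timings scale

t1 = Timer
n = CountMatches(data, "<node ", vbBinaryCompare)
WScript.Echo "binary compare: " & n & " matches in " & (Timer - t1) & " s"

t1 = Timer
n = CountMatches(data, "<NODE ", vbTextCompare)
WScript.Echo "text compare:   " & n & " matches in " & (Timer - t1) & " s"
```

If the text-compare time grows much faster than the binary-compare time as `MakeData` is given larger counts, that would be consistent with the explanation given in the post.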
| From | "Mayayana" <mayayana@invalid.nospam> |
|---|---|
| Date | 2021-01-03 08:54 -0500 |
| Subject | Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) |
| Message-ID | <rssian$hjo$1@dont-email.me> |
| In reply to | #12330 |
"R.Wieser" <address@not.available> wrote
| The whole process using the 25 MB string still takes several seconds to
| complete though. Oh well.
|

Did you try a UCase or LCase first and then ReadAll? I don't see the point of reading by line unless you expect to be quitting near the beginning.

I don't know how case works. It wouldn't be difficult to just search for "65 or 97" (A or a). But either way it's a more complex search.

I imagine it might also depend on the length of the search string and how many similarities there are. If you have 25 MB of "orange is the new black" and one instance of "orange is the new green" near the end, I would expect all those near matches to bog things down. But I've never actually tested it.
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-03 16:01 +0100 |
| Subject | Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) |
| Message-ID | <rssm7m$5ai$1@gioia.aioe.org> |
| In reply to | #12331 |
Mayayana,

> | The whole process using the 25 MB string still takes several
> | seconds to complete though. Oh well.

> Did you try a UCase or LCase first and then ReadAll?

Whut? How do you suppose I would be able to do that? I would think I would need to read the data before being able to act upon it ....

> I don't see the point of reading by line

The drawback of reading everything into memory and doing an initial LCase or UCase is that you have to assume that none of the file's data is in a mixed-case form that's important to the user. In my case the OSM file contains names of streets, etc. For those I would like to keep the casing.

> I imagine it might also depend on the length
> of the search string and how many similarities
> there are.

Of course! And as such not something I would think twice about.

Regards,
Rudy Wieser
| From | "Mayayana" <mayayana@invalid.nospam> |
|---|---|
| Date | 2021-01-03 12:51 -0500 |
| Subject | Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) |
| Message-ID | <rst07m$pa8$1@dont-email.me> |
| In reply to | #12332 |
"R.Wieser" <address@not.available> wrote
| > Did you try a UCase or LCase first and then ReadAll?
|
| Whut ? How do you suppose I would be able to do that ? I would think I
| would need to read the data before being able to act upon it ....
|
I just meant using both of those methods as one
option to test.
s = TS.ReadAll
s = UCase(s)
Pt1 = InStr(s, "FINDSTRING")
| > I don't see the point of reading by line
|
| The drawback of reading everything into memory and doing an initial lcase or
| ucase is that you have to assume that none of the file's data is in a
| mixed-case form that's important to the user. In my case the OSM file
| contains names of streets, etc. For those I would like to keep the casing.
|
I've run into that before. In that case I just
have two:
s2 = UCase(s)
pt1 = InStr(s2, "MAPLE ST.")
if Pt1 > 0 then x = Mid(s, pt1, 9)
The speed difference is worth it. I've written a lot of
scripts that do intensive work, such as converting
an HXS to CHM and heavily editing every HTML file
before doing the CHM compile.
I've found the two things that make a very notable
difference in speed are concatenating with arrays
and case-sensitive search. When I'm doing it for 100
files, and rebuilding each file, that makes a big difference.
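The two-string idea from the post above (search an upper-cased copy, read from the original) can be wrapped into a small loop. This is a sketch, not Mayayana's code: `FindAllKeepCase` is a made-up name, and it assumes that `UCase` never changes the length of the string, so a position in the copy refers to the same character in the original.

```vbscript
Option Explicit

' Find every occurrence of needle regardless of case, but return the
' matched text in its original casing.
' Returns a Scripting.Dictionary: key = position, item = text as found.
' Assumption: UCase(original) has the same length as original.
Function FindAllKeepCase(original, needle)
    Dim upperCopy, upperNeedle, pos, found
    Set found = CreateObject("Scripting.Dictionary")
    upperCopy = UCase(original)             ' one conversion, done up front
    upperNeedle = UCase(needle)
    pos = InStr(1, upperCopy, upperNeedle, vbBinaryCompare)
    Do While pos > 0
        ' Binary search on the copy, Mid() on the untouched original.
        found.Add pos, Mid(original, pos, Len(needle))
        pos = InStr(pos + Len(needle), upperCopy, upperNeedle, vbBinaryCompare)
    Loop
    Set FindAllKeepCase = found
End Function

Dim hits, k
Set hits = FindAllKeepCase("Maple St. meets MAPLE ST. near maple st.", "maple st.")
For Each k In hits.Keys
    WScript.Echo k & ": " & hits(k)
Next
```

The cost is the one Rudy raises later in the thread: the upper-cased copy doubles the memory used for the data while the search runs.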
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-03 19:26 +0100 |
| Subject | Re: Why does a loop using instr() and mid() get slower on larger data ? - found it (I think) |
| Message-ID | <rst27r$vb$1@gioia.aioe.org> |
| In reply to | #12333 |
Mayayana,

> I just meant using both of those methods as one
> option to test.
> s = TS.ReadAll
> s = UCase(s)

I would do the same. It's just that it looked like you suggested the reverse order ...

> I've run into that before. In that case I just
> have two:

I was also considering using that. Didn't like to have a duplicate of the same, rather large string though. But that may still be my "640 KB is enough for everyone" memories talking. Programming (small) microcontrollers might also have something to do with it. :-)

> The speed difference is worth it.

I will most likely try it out, if only to see the difference.

> I've found the two things that make a very notable
> difference in speed are concatenating with arrays
> and case-sensitive search

:-) I've just run into the latter, but can't say I've ever used the former. Thanks.

Regards,
Rudy Wieser
| From | "Mayayana" <mayayana@invalid.nospam> |
|---|---|
| Date | 2021-01-02 17:43 -0500 |
| Message-ID | <rsqsu1$bjt$1@dont-email.me> |
| In reply to | #12327 |
One other note: You probably know this, but the time measuring is not so accurate at very low levels. In my experience the VBScript Timer is pretty good. But if you're going to measure small things you need to take multiple readings.

To mix things up even more, I just tried loading my 100 MB file, which is a list of IP addresses and various other numbers. I did the same loop for "200". It found 641105 instances in 1.109375 seconds! But the whole operation, including loading the file, was more like 5 or 6 seconds. Nevertheless, the actual InStr loops seem to be something like 2 microseconds each (1.109375 s / 641105), and that includes reassigning the value of pt2, as well as adding 1 to x. Wowza!
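As a sketch of the "take multiple readings" advice: rather than timing a single call, time a whole batch of identical calls and divide by the count. `AverageSeconds` is a hypothetical helper, the test string is synthetic, and the sketch ignores the case where `Timer` wraps around at midnight.

```vbscript
Option Explicit

' Run the same InStr call "runs" times and return the average time per
' call in seconds. Timing the whole batch avoids reading 0 for a single
' call that is shorter than the timer's resolution.
Function AverageSeconds(haystack, needle, runs)
    Dim i, t1, pos
    t1 = Timer
    For i = 1 To runs
        pos = InStr(1, haystack, needle, vbBinaryCompare)
    Next
    AverageSeconds = (Timer - t1) / runs
End Function

Dim s
s = String(1000000, "x") & "needle"     ' match sits at the very end
WScript.Echo "average per call: " & AverageSeconds(s, "needle", 100) & " s"
```

The average still includes the loop overhead, so it is an upper bound on the cost of the call itself.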
| From | JJ <jj4public@gmail.com> |
|---|---|
| Date | 2021-01-03 02:47 +0700 |
| Message-ID | <1392nds00okcb$.17b0a2wzh38wa$.dlg@40tude.net> |
| In reply to | #12321 |
On Sat, 2 Jan 2021 12:41:16 +0100, R.Wieser wrote:
> Hello all,
>
> I'm using InStr() and mid() in a loop on large data strings. What I've
> noticed is that the larger the datastring, the slower the process seems to
> become. The time difference between results being returned for a few KByte
> and even 5 MByte of data is rather noticeable (and I've got a 25 MByte
> file...).
>
> Does anyone know what is causing it (and perhaps how to fix it) ?
>
> Regards,
> Rudy Wieser

Because InStr() is basically a loop which enumerates the characters within the source string until the matching substring is found. The process will be fast if the matching substring happens to be at the start or near the start of the source string. Otherwise, it can be a slow process, depending on the source string length.

If the source string contains plain text, string indexing may be used to speed up substring searches. But it won't be noticeably beneficial if the substring search needs to be done fewer than 3 times (for each source string), considering that generating the string index requires processing time as well.
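To make the "InStr() is basically a loop" description concrete, here is a deliberately naive model of such a search written in VBScript itself. `NaiveInStr` is a made-up function; the real InStr is native code and its actual algorithm is not documented in this thread, so treat this purely as an illustration of why the cost grows with the distance to the first match.

```vbscript
Option Explicit

' Naive substring search: try every start position in turn until the
' needle matches. Returns the 1-based position, or 0 when not found.
' The = operator compares strings case-sensitively here.
Function NaiveInStr(start, haystack, needle)
    Dim i, lastStart
    NaiveInStr = 0
    lastStart = Len(haystack) - Len(needle) + 1     ' last position that can still fit
    For i = start To lastStart
        If Mid(haystack, i, Len(needle)) = needle Then
            NaiveInStr = i
            Exit Function                           ' stop at the first match
        End If
    Next
End Function

WScript.Echo NaiveInStr(1, "hello world", "world")
WScript.Echo NaiveInStr(1, "hello world", "xyz")
```

A match near the start returns after a handful of comparisons, while a match near the end (or no match at all) forces a pass over nearly the whole string.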
| From | "R.Wieser" <address@not.available> |
|---|---|
| Date | 2021-01-02 21:47 +0100 |
| Message-ID | <rsqnho$157m$1@gioia.aioe.org> |
| In reply to | #12324 |
JJ,

> The process will be fast if the matching substring happens to
> be at the start or near the start of the source string.

:-) I'm aware of that.

I ran the same InStr() in a loop over a small and a large data string, and I got (lots of) results back at different speeds - for the same data (the larger data string just contained more of it).

Regards,
Rudy Wieser