Groups > comp.lang.java.programmer > #10804 > unrolled thread

Fast String search algorithm

Started by	Roedy Green <see_website@mindprod.com.invalid>
First post	2011-12-16 10:08 -0800
Last post	2011-12-18 09:17 -0800
Articles	12 — 11 participants

Back to article view | Back to comp.lang.java.programmer

  Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 10:08 -0800
    Re: Fast String search algorithm Tom Anderson <twic@urchin.earth.li> - 2011-12-16 18:58 +0000
      Re: Fast String search algorithm markspace <-@.> - 2011-12-17 09:26 -0800
        Re: Fast String search algorithm Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-17 21:07 -0800
    Re: Fast String search algorithm Tassilo Horn <tassilo@member.fsf.org> - 2011-12-16 20:07 +0100
      Re: Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 13:32 -0800
    Re: Fast String search algorithm Gene Wirchenko <genew@ocis.net> - 2011-12-16 12:12 -0800
      Re: Fast String search algorithm glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-12-16 21:22 +0000
    Re: Fast String search algorithm Arne Vajhøj <arne@vajhoej.dk> - 2011-12-16 22:20 -0500
      Re: Fast String search algorithm Lew <lewbloch@gmail.com> - 2011-12-16 21:39 -0800
    Re: Fast String search algorithm Wanja Gayk <brixomatic@yahoo.com> - 2011-12-17 16:15 +0100
    Re: Fast String search algorithm Travers Naran <tnaran@gmail.com> - 2011-12-18 09:17 -0800

#10804 — Fast String search algorithm

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-12-16 10:08 -0800
Subject	Fast String search algorithm
Message-ID	<i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com>

Let's say I had a million records each with a text field.  I wanted to
find all records that contained a given substring.  Are there fast
algorithms to do that or do you have to scan the whole thing linearly?

I imagine you could build an index of words.  Then maybe you could
figure out which words contain the string, and narrow your search to
records with those words.

Similarly, I was wondering how you might build an index to all the
files on a computer so that you could find one just by giving a wild
card or partial filename.
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.

[toc] | [next] | [standalone]

#10808

From	Tom Anderson <twic@urchin.earth.li>
Date	2011-12-16 18:58 +0000
Message-ID	<alpine.DEB.2.00.1112161852030.13913@urchin.earth.li>
In reply to	#10804

On Fri, 16 Dec 2011, Roedy Green wrote:

> Let's say I had a million records each with a text field.  I wanted to 
> find all records that contained a given substring.  Are there fast 
> algorithms to do that or do you have to scan the whole thing linearly?

http://en.wikipedia.org/wiki/Inverted_index

> I imagine you could build an index of words.  Then maybe you could 
> figure out which words contain the string, and narrow your search to 
> records with those words.

Exactly that!

> Similarly, I was wondering how you might build an index to all the files 
> on a computer so that you could find one just by giving a wild card or 
> partial filename.

There is a standard utility on unix that does just that:

http://en.wikipedia.org/wiki/Locate_%28Unix%29

The source is in C, but might be informative (this is for the 'mlocate' 
version):

https://fedorahosted.org/mlocate/browser/src

tom

-- 
Get a fucking hobby that isn't breathing, browsing 4chan, or fapping. --
The Well Cultured Anonymous, on Manners

[toc] | [prev] | [next] | [standalone]

#10828

From	markspace <-@.>
Date	2011-12-17 09:26 -0800
Message-ID	<jcijcm$lpg$1@dont-email.me>
In reply to	#10808

On 12/16/2011 10:58 AM, Tom Anderson wrote:
> On Fri, 16 Dec 2011, Roedy Green wrote:
>
>> Let's say I had a million records each with a text field. I wanted to
>> find all records that contained a given substring. Are there fast
>> algorithms to do that or do you have to scan the whole thing linearly?
>
> http://en.wikipedia.org/wiki/Inverted_index
>

An implementation of an inverted index is often called a search engine.

<http://en.wikipedia.org/wiki/Search_engine_%28computing%29>

One of the best known implementations is a web search engine.  Alt 
Vista, Yahoo, and Google are well known examples.

There's a Java based search engine I happen to know about, Lucene.

<http://lucene.apache.org/java/docs/index.html>

This is a roll-your-own library, but it has a lot of features.  Beyond a 
cursory reading of the documentation I can't really say much about it 
unfortunately.

[toc] | [prev] | [next] | [standalone]

#10845

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2011-12-17 21:07 -0800
Message-ID	<NyeHq.20142$c27.19773@newsfe22.iad>
In reply to	#10828

On 12/17/11 9:26 AM, markspace wrote:
> On 12/16/2011 10:58 AM, Tom Anderson wrote:
>> On Fri, 16 Dec 2011, Roedy Green wrote:
>>
>>> Let's say I had a million records each with a text field. I wanted to
>>> find all records that contained a given substring. Are there fast
>>> algorithms to do that or do you have to scan the whole thing linearly?
>>
>> http://en.wikipedia.org/wiki/Inverted_index
>>
>
> An implementation of an inverted index is often called a search engine.
>
> <http://en.wikipedia.org/wiki/Search_engine_%28computing%29>
>
> One of the best known implementations is a web search engine. Alt Vista,
> Yahoo, and Google are well known examples.
>
> There's a Java based search engine I happen to know about, Lucene.
>
> <http://lucene.apache.org/java/docs/index.html>
>
> This is a roll-your-own library, but it has a lot of features. Beyond a
> cursory reading of the documentation I can't really say much about it
> unfortunately.
>
Solr is a webservice(ish) program which is based around Lucene.

<http://lucene.apache.org/solr/>

[toc] | [prev] | [next] | [standalone]

#10809

From	Tassilo Horn <tassilo@member.fsf.org>
Date	2011-12-16 20:07 +0100
Message-ID	<87r504s5ci.fsf@thinkpad.tsdh.de>
In reply to	#10804

Roedy Green <see_website@mindprod.com.invalid> writes:

> Let's say I had a million records each with a text field.  I wanted to
> find all records that contained a given substring.  Are there fast
> algorithms to do that or do you have to scan the whole thing linearly?

Sure.  See http://en.wikipedia.org/wiki/Substring_index

> Similarly, I was wondering how you might build an index to all the
> files on a computer so that you could find one just by giving a wild
> card or partial filename.

On UNIXes, that's exactly what the locate tool does.  Then you can
search with globbing wildcards

  % # all files containing the substring foo
  % locate *foo*

or even with regular expressions

  % # all files that end with "r.c" or "r.cpp"
  % locate -r '.*r\.\(c\|cpp\)$'

Bye,
Tassilo

[toc] | [prev] | [next] | [standalone]

#10814

From	Roedy Green <see_website@mindprod.com.invalid>
Date	2011-12-16 13:32 -0800
Message-ID	<14ene75g20le2r5fc5ajrtj71flhbq488o@4ax.com>
In reply to	#10809

On Fri, 16 Dec 2011 20:07:25 +0100, Tassilo Horn
<tassilo@member.fsf.org> wrote, quoted or indirectly quoted someone
who said :

>On UNIXes, that's exactly what the locate tool does.  Then you can
>search with globbing wildcards

i am trying to talk the JPSoft people into adding a locate feature
into Take Command.  I wanted to be able to tell them roughly how to
implement it.

They file descriptions, and fast change directory give partial
directory names.  I hoped they could come up with an integrated system
that was thread safe.  (The current ones locks everybody out during
refresh).
-- 
Roedy Green Canadian Mind Products
http://mindprod.com
For me, the appeal of computer programming is that
even though I am quite a klutz,
I can still produce something, in a sense
perfect, because the computer gives me as many
chances as I please to get it right.

[toc] | [prev] | [next] | [standalone]

#10810

From	Gene Wirchenko <genew@ocis.net>
Date	2011-12-16 12:12 -0800
Message-ID	<eg9ne7d8tt9vf7brjrpo7d5cgrf0p10bnu@4ax.com>
In reply to	#10804

On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green
<see_website@mindprod.com.invalid> wrote:

>Let's say I had a million records each with a text field.  I wanted to
>find all records that contained a given substring.  Are there fast
>algorithms to do that or do you have to scan the whole thing linearly?

     Look up the Boyer-Moore string search algorithm.

[snip]

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#10812

From	glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date	2011-12-16 21:22 +0000
Message-ID	<jcgcqr$17p$1@speranza.aioe.org>
In reply to	#10810

Gene Wirchenko <genew@ocis.net> wrote:
> On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green
> <see_website@mindprod.com.invalid> wrote:

>>Let's say I had a million records each with a text field.  I wanted to
>>find all records that contained a given substring.  Are there fast
>>algorithms to do that or do you have to scan the whole thing linearly?

>     Look up the Boyer-Moore string search algorithm.

Well, Boyer-Moore still scans linearly, but just does it faster.

Still, you could use something Boyer-Moore related if you know which
characters were in each record.  If a record didn't have a character
in your substring, then it couldn't have the substring.

Still, understand Boyer-Moore before you get much farther 
with string searching.

-- glen

[toc] | [prev] | [next] | [standalone]

#10816

From	Arne Vajhøj <arne@vajhoej.dk>
Date	2011-12-16 22:20 -0500
Message-ID	<4eec0a99$0$286$14726298@news.sunsite.dk>
In reply to	#10804

On 12/16/2011 1:08 PM, Roedy Green wrote:
> Let's say I had a million records each with a text field.  I wanted to
> find all records that contained a given substring.  Are there fast
> algorithms to do that or do you have to scan the whole thing linearly?
>
> I imagine you could build an index of words.  Then maybe you could
> figure out which words contain the string, and narrow your search to
> records with those words.
>
> Similarly, I was wondering how you might build an index to all the
> files on a computer so that you could find one just by giving a wild
> card or partial filename.

Use a database that support full text search.

Arne

[toc] | [prev] | [next] | [standalone]

#10817

From	Lew <lewbloch@gmail.com>
Date	2011-12-16 21:39 -0800
Message-ID	<30004790.458.1324100378032.JavaMail.geo-discussion-forums@pruu23>
In reply to	#10816

Arne Vajhøj wrote:
> Roedy Green wrote:
>> Let's say I had a million records each with a text field.  I wanted to
>> find all records that contained a given substring.  Are there fast
>> algorithms to do that or do you have to scan the whole thing linearly?
>>
>> I imagine you could build an index of words.  Then maybe you could
>> figure out which words contain the string, and narrow your search to
>> records with those words.
>>
>> Similarly, I was wondering how you might build an index to all the
>> files on a computer so that you could find one just by giving a wild
>> card or partial filename.
> 
> Use a database that support full text search.

Perhaps 
<http://www.google.com/search?q=latent+semantic+indexing>

Latent semantic indexing, or LSI, is an algorithm that maps word associations 
culled from document bases into cartographic clusters in n-space, then projects 
those clusters into m-space, m < n.  The Wikipedia article (found in the 
referenced search) describes it in terms of the underlying matrix operations, 
which is great fun for those who like such things but loses the forest for the 
trees.  I think of it as dusty blobs floating in infospace (like cyberspace but 
doesn't need computers) with bright spots where terms (words, phrases, other 
semantic molecules) live near each other, like "driver" not far from "wedge".  
With lots of documents and terms, you have a lot of points of light, m times n 
dimensions.  The reduction brings it down to about one to three hundred 
dimensions, but it's still a hyperspatial map (in the cartographic sense).

It is compute-intensive to map out all those correlations but makes for not 
only a fast search, but one that seems almost magical in its ability to pull 
meaningful answers out of the morass.  The term "latent semantic indexing" 
refers to how all that dusty matrix math ("indexing") pulls out the meaning 
("semantic") lurking in the way words coalesce in meaningful documents 
("latent").  The correlations capture the correlations in the content that 
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.

To a person fond of maths, information science, library science, linguistics, 
semiotics, cartography, AI or document management LSI is a delight.

-- 
Lew

-- 
Lew

[toc] | [prev] | [next] | [standalone]

#10824

From	Wanja Gayk <brixomatic@yahoo.com>
Date	2011-12-17 16:15 +0100
Message-ID	<MPG.2956d080b7b7c0c69896d9@202.177.16.121>
In reply to	#10804

In article <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com>, 
see_website@mindprod.com.invalid says...
 
> Let's say I had a million records each with a text field.  I wanted to
> find all records that contained a given substring.  Are there fast
> algorithms to do that or do you have to scan the whole thing linearly?

To match strings there is the Boyer-Moore-Horspool-matcher and the 
Knuth-Morris-Pratt matcher. I'd try both on real data and see which one 
performs better in this case.

Kind regards,
Wanja

-- 
..Alesi's problem was that the back of the car was jumping up and down 
dangerously - and I can assure you from having been teammate to 
Jean Alesi and knowing what kind of cars that he can pull up with, 
when Jean Alesi says that a car is dangerous - it is. [Jonathan Palmer]

--- Posted via news://freenews.netfront.net/ - Complaints to news@netfront.net ---

[toc] | [prev] | [next] | [standalone]

#10853

From	Travers Naran <tnaran@gmail.com>
Date	2011-12-18 09:17 -0800
Message-ID	<jcl77s$g27$1@dont-email.me>
In reply to	#10804

On 16/12/2011 10:08 AM, Roedy Green wrote:
> Let's say I had a million records each with a text field.  I wanted to
> find all records that contained a given substring.  Are there fast
> algorithms to do that or do you have to scan the whole thing linearly?
>
> I imagine you could build an index of words.  Then maybe you could
> figure out which words contain the string, and narrow your search to
> records with those words.
>
> Similarly, I was wondering how you might build an index to all the
> files on a computer so that you could find one just by giving a wild
> card or partial filename.

In addition to the other links mentioned, here are some other links I 
found helpful for the same problem:

http://en.wikipedia.org/wiki/Bloom_filter
http://en.wikipedia.org/wiki/Bitap_algorithm
http://johannburkard.de/software/stringsearch/

The Bloom Filter can be used to help reduce the set of files to search.

[toc] | [prev] | [standalone]

csiph-web

Fast String search algorithm

Contents

#10804 — Fast String search algorithm

#10808

#10828

#10845

#10809

#10814

#10810

#10812

#10816

#10817

#10824

#10853