Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #10804 > unrolled thread
| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2011-12-16 10:08 -0800 |
| Last post | 2011-12-18 09:17 -0800 |
| Articles | 12 — 11 participants |
Back to article view | Back to comp.lang.java.programmer
Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 10:08 -0800
Re: Fast String search algorithm Tom Anderson <twic@urchin.earth.li> - 2011-12-16 18:58 +0000
Re: Fast String search algorithm markspace <-@.> - 2011-12-17 09:26 -0800
Re: Fast String search algorithm Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-17 21:07 -0800
Re: Fast String search algorithm Tassilo Horn <tassilo@member.fsf.org> - 2011-12-16 20:07 +0100
Re: Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 13:32 -0800
Re: Fast String search algorithm Gene Wirchenko <genew@ocis.net> - 2011-12-16 12:12 -0800
Re: Fast String search algorithm glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-12-16 21:22 +0000
Re: Fast String search algorithm Arne Vajhøj <arne@vajhoej.dk> - 2011-12-16 22:20 -0500
Re: Fast String search algorithm Lew <lewbloch@gmail.com> - 2011-12-16 21:39 -0800
Re: Fast String search algorithm Wanja Gayk <brixomatic@yahoo.com> - 2011-12-17 16:15 +0100
Re: Fast String search algorithm Travers Naran <tnaran@gmail.com> - 2011-12-18 09:17 -0800
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-12-16 10:08 -0800 |
| Subject | Fast String search algorithm |
| Message-ID | <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com> |
Let's say I had a million records each with a text field. I wanted to find all records that contained a given substring. Are there fast algorithms to do that or do you have to scan the whole thing linearly? I imagine you could build an index of words. Then maybe you could figure out which words contain the string, and narrow your search to records with those words. Similarly, I was wondering how you might build an index to all the files on a computer so that you could find one just by giving a wild card or partial filename. -- Roedy Green Canadian Mind Products http://mindprod.com For me, the appeal of computer programming is that even though I am quite a klutz, I can still produce something, in a sense perfect, because the computer gives me as many chances as I please to get it right.
[toc] | [next] | [standalone]
| From | Tom Anderson <twic@urchin.earth.li> |
|---|---|
| Date | 2011-12-16 18:58 +0000 |
| Message-ID | <alpine.DEB.2.00.1112161852030.13913@urchin.earth.li> |
| In reply to | #10804 |
On Fri, 16 Dec 2011, Roedy Green wrote: > Let's say I had a million records each with a text field. I wanted to > find all records that contained a given substring. Are there fast > algorithms to do that or do you have to scan the whole thing linearly? http://en.wikipedia.org/wiki/Inverted_index > I imagine you could build an index of words. Then maybe you could > figure out which words contain the string, and narrow your search to > records with those words. Exactly that! > Similarly, I was wondering how you might build an index to all the files > on a computer so that you could find one just by giving a wild card or > partial filename. There is a standard utility on unix that does just that: http://en.wikipedia.org/wiki/Locate_%28Unix%29 The source is in C, but might be informative (this is for the 'mlocate' version): https://fedorahosted.org/mlocate/browser/src tom -- Get a fucking hobby that isn't breathing, browsing 4chan, or fapping. -- The Well Cultured Anonymous, on Manners
[toc] | [prev] | [next] | [standalone]
| From | markspace <-@.> |
|---|---|
| Date | 2011-12-17 09:26 -0800 |
| Message-ID | <jcijcm$lpg$1@dont-email.me> |
| In reply to | #10808 |
On 12/16/2011 10:58 AM, Tom Anderson wrote: > On Fri, 16 Dec 2011, Roedy Green wrote: > >> Let's say I had a million records each with a text field. I wanted to >> find all records that contained a given substring. Are there fast >> algorithms to do that or do you have to scan the whole thing linearly? > > http://en.wikipedia.org/wiki/Inverted_index > An implementation of an inverted index is often called a search engine. <http://en.wikipedia.org/wiki/Search_engine_%28computing%29> One of the best known implementations is a web search engine. Alt Vista, Yahoo, and Google are well known examples. There's a Java based search engine I happen to know about, Lucene. <http://lucene.apache.org/java/docs/index.html> This is a roll-your-own library, but it has a lot of features. Beyond a cursory reading of the documentation I can't really say much about it unfortunately.
[toc] | [prev] | [next] | [standalone]
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2011-12-17 21:07 -0800 |
| Message-ID | <NyeHq.20142$c27.19773@newsfe22.iad> |
| In reply to | #10828 |
On 12/17/11 9:26 AM, markspace wrote: > On 12/16/2011 10:58 AM, Tom Anderson wrote: >> On Fri, 16 Dec 2011, Roedy Green wrote: >> >>> Let's say I had a million records each with a text field. I wanted to >>> find all records that contained a given substring. Are there fast >>> algorithms to do that or do you have to scan the whole thing linearly? >> >> http://en.wikipedia.org/wiki/Inverted_index >> > > An implementation of an inverted index is often called a search engine. > > <http://en.wikipedia.org/wiki/Search_engine_%28computing%29> > > One of the best known implementations is a web search engine. Alt Vista, > Yahoo, and Google are well known examples. > > There's a Java based search engine I happen to know about, Lucene. > > <http://lucene.apache.org/java/docs/index.html> > > This is a roll-your-own library, but it has a lot of features. Beyond a > cursory reading of the documentation I can't really say much about it > unfortunately. > Solr is a webservice(ish) program which is based around Lucene. <http://lucene.apache.org/solr/>
[toc] | [prev] | [next] | [standalone]
| From | Tassilo Horn <tassilo@member.fsf.org> |
|---|---|
| Date | 2011-12-16 20:07 +0100 |
| Message-ID | <87r504s5ci.fsf@thinkpad.tsdh.de> |
| In reply to | #10804 |
Roedy Green <see_website@mindprod.com.invalid> writes: > Let's say I had a million records each with a text field. I wanted to > find all records that contained a given substring. Are there fast > algorithms to do that or do you have to scan the whole thing linearly? Sure. See http://en.wikipedia.org/wiki/Substring_index > Similarly, I was wondering how you might build an index to all the > files on a computer so that you could find one just by giving a wild > card or partial filename. On UNIXes, that's exactly what the locate tool does. Then you can search with globbing wildcards % # all files containing the substring foo % locate *foo* or even with regular expressions % # all files that end with "r.c" or "r.cpp" % locate -r '.*r\.\(c\|cpp\)$' Bye, Tassilo
[toc] | [prev] | [next] | [standalone]
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-12-16 13:32 -0800 |
| Message-ID | <14ene75g20le2r5fc5ajrtj71flhbq488o@4ax.com> |
| In reply to | #10809 |
On Fri, 16 Dec 2011 20:07:25 +0100, Tassilo Horn <tassilo@member.fsf.org> wrote, quoted or indirectly quoted someone who said : >On UNIXes, that's exactly what the locate tool does. Then you can >search with globbing wildcards i am trying to talk the JPSoft people into adding a locate feature into Take Command. I wanted to be able to tell them roughly how to implement it. They file descriptions, and fast change directory give partial directory names. I hoped they could come up with an integrated system that was thread safe. (The current ones locks everybody out during refresh). -- Roedy Green Canadian Mind Products http://mindprod.com For me, the appeal of computer programming is that even though I am quite a klutz, I can still produce something, in a sense perfect, because the computer gives me as many chances as I please to get it right.
[toc] | [prev] | [next] | [standalone]
| From | Gene Wirchenko <genew@ocis.net> |
|---|---|
| Date | 2011-12-16 12:12 -0800 |
| Message-ID | <eg9ne7d8tt9vf7brjrpo7d5cgrf0p10bnu@4ax.com> |
| In reply to | #10804 |
On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green
<see_website@mindprod.com.invalid> wrote:
>Let's say I had a million records each with a text field. I wanted to
>find all records that contained a given substring. Are there fast
>algorithms to do that or do you have to scan the whole thing linearly?
Look up the Boyer-Moore string search algorithm.
[snip]
Sincerely,
Gene Wirchenko
[toc] | [prev] | [next] | [standalone]
| From | glen herrmannsfeldt <gah@ugcs.caltech.edu> |
|---|---|
| Date | 2011-12-16 21:22 +0000 |
| Message-ID | <jcgcqr$17p$1@speranza.aioe.org> |
| In reply to | #10810 |
Gene Wirchenko <genew@ocis.net> wrote: > On Fri, 16 Dec 2011 10:08:02 -0800, Roedy Green > <see_website@mindprod.com.invalid> wrote: >>Let's say I had a million records each with a text field. I wanted to >>find all records that contained a given substring. Are there fast >>algorithms to do that or do you have to scan the whole thing linearly? > Look up the Boyer-Moore string search algorithm. Well, Boyer-Moore still scans linearly, but just does it faster. Still, you could use something Boyer-Moore related if you know which characters were in each record. If a record didn't have a character in your substring, then it couldn't have the substring. Still, understand Boyer-Moore before you get much farther with string searching. -- glen
[toc] | [prev] | [next] | [standalone]
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-12-16 22:20 -0500 |
| Message-ID | <4eec0a99$0$286$14726298@news.sunsite.dk> |
| In reply to | #10804 |
On 12/16/2011 1:08 PM, Roedy Green wrote: > Let's say I had a million records each with a text field. I wanted to > find all records that contained a given substring. Are there fast > algorithms to do that or do you have to scan the whole thing linearly? > > I imagine you could build an index of words. Then maybe you could > figure out which words contain the string, and narrow your search to > records with those words. > > Similarly, I was wondering how you might build an index to all the > files on a computer so that you could find one just by giving a wild > card or partial filename. Use a database that support full text search. Arne
[toc] | [prev] | [next] | [standalone]
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2011-12-16 21:39 -0800 |
| Message-ID | <30004790.458.1324100378032.JavaMail.geo-discussion-forums@pruu23> |
| In reply to | #10816 |
Arne Vajhøj wrote:
> Roedy Green wrote:
>> Let's say I had a million records each with a text field. I wanted to
>> find all records that contained a given substring. Are there fast
>> algorithms to do that or do you have to scan the whole thing linearly?
>>
>> I imagine you could build an index of words. Then maybe you could
>> figure out which words contain the string, and narrow your search to
>> records with those words.
>>
>> Similarly, I was wondering how you might build an index to all the
>> files on a computer so that you could find one just by giving a wild
>> card or partial filename.
>
> Use a database that support full text search.
Perhaps
<http://www.google.com/search?q=latent+semantic+indexing>
Latent semantic indexing, or LSI, is an algorithm that maps word associations
culled from document bases into cartographic clusters in n-space, then projects
those clusters into m-space, m < n. The Wikipedia article (found in the
referenced search) describes it in terms of the underlying matrix operations,
which is great fun for those who like such things but loses the forest for the
trees. I think of it as dusty blobs floating in infospace (like cyberspace but
doesn't need computers) with bright spots where terms (words, phrases, other
semantic molecules) live near each other, like "driver" not far from "wedge".
With lots of documents and terms, you have a lot of points of light, m times n
dimensions. The reduction brings it down to about one to three hundred
dimensions, but it's still a hyperspatial map (in the cartographic sense).
It is compute-intensive to map out all those correlations but makes for not
only a fast search, but one that seems almost magical in its ability to pull
meaningful answers out of the morass. The term "latent semantic indexing"
refers to how all that dusty matrix math ("indexing") pulls out the meaning
("semantic") lurking in the way words coalesce in meaningful documents
("latent"). The correlations capture the correlations in the content that
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.
To a person fond of maths, information science, library science, linguistics,
semiotics, cartography, AI or document management LSI is a delight.
--
Lew
--
Lew
[toc] | [prev] | [next] | [standalone]
| From | Wanja Gayk <brixomatic@yahoo.com> |
|---|---|
| Date | 2011-12-17 16:15 +0100 |
| Message-ID | <MPG.2956d080b7b7c0c69896d9@202.177.16.121> |
| In reply to | #10804 |
In article <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com>, see_website@mindprod.com.invalid says... > Let's say I had a million records each with a text field. I wanted to > find all records that contained a given substring. Are there fast > algorithms to do that or do you have to scan the whole thing linearly? To match strings there is the Boyer-Moore-Horspool-matcher and the Knuth-Morris-Pratt matcher. I'd try both on real data and see which one performs better in this case. Kind regards, Wanja -- ..Alesi's problem was that the back of the car was jumping up and down dangerously - and I can assure you from having been teammate to Jean Alesi and knowing what kind of cars that he can pull up with, when Jean Alesi says that a car is dangerous - it is. [Jonathan Palmer] --- Posted via news://freenews.netfront.net/ - Complaints to news@netfront.net ---
[toc] | [prev] | [next] | [standalone]
| From | Travers Naran <tnaran@gmail.com> |
|---|---|
| Date | 2011-12-18 09:17 -0800 |
| Message-ID | <jcl77s$g27$1@dont-email.me> |
| In reply to | #10804 |
On 16/12/2011 10:08 AM, Roedy Green wrote: > Let's say I had a million records each with a text field. I wanted to > find all records that contained a given substring. Are there fast > algorithms to do that or do you have to scan the whole thing linearly? > > I imagine you could build an index of words. Then maybe you could > figure out which words contain the string, and narrow your search to > records with those words. > > Similarly, I was wondering how you might build an index to all the > files on a computer so that you could find one just by giving a wild > card or partial filename. In addition to the other links mentioned, here are some other links I found helpful for the same problem: http://en.wikipedia.org/wiki/Bloom_filter http://en.wikipedia.org/wiki/Bitap_algorithm http://johannburkard.de/software/stringsearch/ The Bloom Filter can be used to help reduce the set of files to search.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.java.programmer
csiph-web