Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #10817
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: Fast String search algorithm |
| Date | 2011-12-16 21:39 -0800 |
| Organization | http://groups.google.com |
| Message-ID | <30004790.458.1324100378032.JavaMail.geo-discussion-forums@pruu23> (permalink) |
| References | <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com> <4eec0a99$0$286$14726298@news.sunsite.dk> |
Arne Vajhøj wrote:
> Roedy Green wrote:
>> Let's say I had a million records each with a text field. I wanted to
>> find all records that contained a given substring. Are there fast
>> algorithms to do that or do you have to scan the whole thing linearly?
>>
>> I imagine you could build an index of words. Then maybe you could
>> figure out which words contain the string, and narrow your search to
>> records with those words.
>>
>> Similarly, I was wondering how you might build an index to all the
>> files on a computer so that you could find one just by giving a wild
>> card or partial filename.
>
> Use a database that support full text search.
Perhaps
<http://www.google.com/search?q=latent+semantic+indexing>
Latent semantic indexing, or LSI, is an algorithm that maps word associations
culled from document bases into cartographic clusters in n-space, then projects
those clusters into m-space, m < n. The Wikipedia article (found in the
referenced search) describes it in terms of the underlying matrix operations,
which is great fun for those who like such things but loses the forest for the
trees. I think of it as dusty blobs floating in infospace (like cyberspace but
doesn't need computers) with bright spots where terms (words, phrases, other
semantic molecules) live near each other, like "driver" not far from "wedge".
With lots of documents and terms, you have a lot of points of light, m times n
dimensions. The reduction brings it down to about one to three hundred
dimensions, but it's still a hyperspatial map (in the cartographic sense).
It is compute-intensive to map out all those correlations but makes for not
only a fast search, but one that seems almost magical in its ability to pull
meaningful answers out of the morass. The term "latent semantic indexing"
refers to how all that dusty matrix math ("indexing") pulls out the meaning
("semantic") lurking in the way words coalesce in meaningful documents
("latent"). The correlations capture the correlations in the content that
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.
To a person fond of maths, information science, library science, linguistics,
semiotics, cartography, AI or document management LSI is a delight.
--
Lew
--
Lew
Back to comp.lang.java.programmer | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 10:08 -0800
Re: Fast String search algorithm Tom Anderson <twic@urchin.earth.li> - 2011-12-16 18:58 +0000
Re: Fast String search algorithm markspace <-@.> - 2011-12-17 09:26 -0800
Re: Fast String search algorithm Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-17 21:07 -0800
Re: Fast String search algorithm Tassilo Horn <tassilo@member.fsf.org> - 2011-12-16 20:07 +0100
Re: Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 13:32 -0800
Re: Fast String search algorithm Gene Wirchenko <genew@ocis.net> - 2011-12-16 12:12 -0800
Re: Fast String search algorithm glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-12-16 21:22 +0000
Re: Fast String search algorithm Arne Vajhøj <arne@vajhoej.dk> - 2011-12-16 22:20 -0500
Re: Fast String search algorithm Lew <lewbloch@gmail.com> - 2011-12-16 21:39 -0800
Re: Fast String search algorithm Wanja Gayk <brixomatic@yahoo.com> - 2011-12-17 16:15 +0100
Re: Fast String search algorithm Travers Naran <tnaran@gmail.com> - 2011-12-18 09:17 -0800
csiph-web