Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #10817

Re: Fast String search algorithm

From Lew <lewbloch@gmail.com>
Newsgroups comp.lang.java.programmer
Subject Re: Fast String search algorithm
Date 2011-12-16 21:39 -0800
Organization http://groups.google.com
Message-ID <30004790.458.1324100378032.JavaMail.geo-discussion-forums@pruu23> (permalink)
References <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com> <4eec0a99$0$286$14726298@news.sunsite.dk>

Show all headers | View raw


Arne Vajhøj wrote:
> Roedy Green wrote:
>> Let's say I had a million records each with a text field.  I wanted to
>> find all records that contained a given substring.  Are there fast
>> algorithms to do that or do you have to scan the whole thing linearly?
>>
>> I imagine you could build an index of words.  Then maybe you could
>> figure out which words contain the string, and narrow your search to
>> records with those words.
>>
>> Similarly, I was wondering how you might build an index to all the
>> files on a computer so that you could find one just by giving a wild
>> card or partial filename.
> 
> Use a database that support full text search.

Perhaps 
<http://www.google.com/search?q=latent+semantic+indexing>

Latent semantic indexing, or LSI, is an algorithm that maps word associations 
culled from document bases into cartographic clusters in n-space, then projects 
those clusters into m-space, m < n.  The Wikipedia article (found in the 
referenced search) describes it in terms of the underlying matrix operations, 
which is great fun for those who like such things but loses the forest for the 
trees.  I think of it as dusty blobs floating in infospace (like cyberspace but 
doesn't need computers) with bright spots where terms (words, phrases, other 
semantic molecules) live near each other, like "driver" not far from "wedge".  
With lots of documents and terms, you have a lot of points of light, m times n 
dimensions.  The reduction brings it down to about one to three hundred 
dimensions, but it's still a hyperspatial map (in the cartographic sense).

It is compute-intensive to map out all those correlations but makes for not 
only a fast search, but one that seems almost magical in its ability to pull 
meaningful answers out of the morass.  The term "latent semantic indexing" 
refers to how all that dusty matrix math ("indexing") pulls out the meaning 
("semantic") lurking in the way words coalesce in meaningful documents 
("latent").  The correlations capture the correlations in the content that 
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.

To a person fond of maths, information science, library science, linguistics, 
semiotics, cartography, AI or document management LSI is a delight.

-- 
Lew






-- 
Lew

Back to comp.lang.java.programmer | Previous | NextPrevious in thread | Next in thread | Find similar | Unroll thread


Thread

Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 10:08 -0800
  Re: Fast String search algorithm Tom Anderson <twic@urchin.earth.li> - 2011-12-16 18:58 +0000
    Re: Fast String search algorithm markspace <-@.> - 2011-12-17 09:26 -0800
      Re: Fast String search algorithm Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-12-17 21:07 -0800
  Re: Fast String search algorithm Tassilo Horn <tassilo@member.fsf.org> - 2011-12-16 20:07 +0100
    Re: Fast String search algorithm Roedy Green <see_website@mindprod.com.invalid> - 2011-12-16 13:32 -0800
  Re: Fast String search algorithm Gene Wirchenko <genew@ocis.net> - 2011-12-16 12:12 -0800
    Re: Fast String search algorithm glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2011-12-16 21:22 +0000
  Re: Fast String search algorithm Arne Vajhøj <arne@vajhoej.dk> - 2011-12-16 22:20 -0500
    Re: Fast String search algorithm Lew <lewbloch@gmail.com> - 2011-12-16 21:39 -0800
  Re: Fast String search algorithm Wanja Gayk <brixomatic@yahoo.com> - 2011-12-17 16:15 +0100
  Re: Fast String search algorithm Travers Naran <tnaran@gmail.com> - 2011-12-18 09:17 -0800

csiph-web