Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.glorb.com!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail
From: Lew <lewbloch@gmail.com>
Newsgroups: comp.lang.java.programmer
Subject: Re: Fast String search algorithm
Date: Fri, 16 Dec 2011 21:39:38 -0800 (PST)
Organization: http://groups.google.com
Lines: 67
Message-ID: <30004790.458.1324100378032.JavaMail.geo-discussion-forums@pruu23>
References: <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com> <4eec0a99$0$286$14726298@news.sunsite.dk>
Reply-To: comp.lang.java.programmer@googlegroups.com
NNTP-Posting-Host: 173.164.137.213
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
X-Trace: posting.google.com 1324100380 5216 127.0.0.1 (17 Dec 2011 05:39:40 GMT)
X-Complaints-To: groups-abuse@google.com
NNTP-Posting-Date: Sat, 17 Dec 2011 05:39:40 +0000 (UTC)
In-Reply-To: <4eec0a99$0$286$14726298@news.sunsite.dk>
Complaints-To: groups-abuse@google.com
Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=173.164.137.213; posting-account=CP-lKQoAAAAGtB5diOuGlDQk0jIwmH0T
User-Agent: G2/1.0
X-Google-Web-Client: true
Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:10817

Arne Vajh=F8j wrote:
> Roedy Green wrote:
>> Let's say I had a million records each with a text field.  I wanted to
>> find all records that contained a given substring.  Are there fast
>> algorithms to do that or do you have to scan the whole thing linearly?
>>
>> I imagine you could build an index of words.  Then maybe you could
>> figure out which words contain the string, and narrow your search to
>> records with those words.
>>
>> Similarly, I was wondering how you might build an index to all the
>> files on a computer so that you could find one just by giving a wild
>> card or partial filename.
>=20
> Use a database that support full text search.

Perhaps=20
<http://www.google.com/search?q=3Dlatent+semantic+indexing>

Latent semantic indexing, or LSI, is an algorithm that maps word associatio=
ns=20
culled from document bases into cartographic clusters in n-space, then proj=
ects=20
those clusters into m-space, m < n.  The Wikipedia article (found in the=20
referenced search) describes it in terms of the underlying matrix operation=
s,=20
which is great fun for those who like such things but loses the forest for =
the=20
trees.  I think of it as dusty blobs floating in infospace (like cyberspace=
 but=20
doesn't need computers) with bright spots where terms (words, phrases, othe=
r=20
semantic molecules) live near each other, like "driver" not far from "wedge=
". =20
With lots of documents and terms, you have a lot of points of light, m time=
s n=20
dimensions.  The reduction brings it down to about one to three hundred=20
dimensions, but it's still a hyperspatial map (in the cartographic sense).

It is compute-intensive to map out all those correlations but makes for not=
=20
only a fast search, but one that seems almost magical in its ability to pul=
l=20
meaningful answers out of the morass.  The term "latent semantic indexing"=
=20
refers to how all that dusty matrix math ("indexing") pulls out the meaning=
=20
("semantic") lurking in the way words coalesce in meaningful documents=20
("latent").  The correlations capture the correlations in the content that=
=20
embody the linguistic meaning, so search results correlate quite well to
semantics in the query's meaning.

To a person fond of maths, information science, library science, linguistic=
s,=20
semiotics, cartography, AI or document management LSI is a delight.

--=20
Lew






--=20
Lew