Path: csiph.com!x330-a1.tempe.blueboxinc.net!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!news.glorb.com!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail From: Lew Newsgroups: comp.lang.java.programmer Subject: Re: Fast String search algorithm Date: Fri, 16 Dec 2011 21:39:38 -0800 (PST) Organization: http://groups.google.com Lines: 67 Message-ID: <30004790.458.1324100378032.JavaMail.geo-discussion-forums@pruu23> References: <4eec0a99$0$286$14726298@news.sunsite.dk> Reply-To: comp.lang.java.programmer@googlegroups.com NNTP-Posting-Host: 173.164.137.213 Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Trace: posting.google.com 1324100380 5216 127.0.0.1 (17 Dec 2011 05:39:40 GMT) X-Complaints-To: groups-abuse@google.com NNTP-Posting-Date: Sat, 17 Dec 2011 05:39:40 +0000 (UTC) In-Reply-To: <4eec0a99$0$286$14726298@news.sunsite.dk> Complaints-To: groups-abuse@google.com Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=173.164.137.213; posting-account=CP-lKQoAAAAGtB5diOuGlDQk0jIwmH0T User-Agent: G2/1.0 X-Google-Web-Client: true Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:10817 Arne Vajh=F8j wrote: > Roedy Green wrote: >> Let's say I had a million records each with a text field. I wanted to >> find all records that contained a given substring. Are there fast >> algorithms to do that or do you have to scan the whole thing linearly? >> >> I imagine you could build an index of words. Then maybe you could >> figure out which words contain the string, and narrow your search to >> records with those words. >> >> Similarly, I was wondering how you might build an index to all the >> files on a computer so that you could find one just by giving a wild >> card or partial filename. >=20 > Use a database that support full text search. Perhaps=20 Latent semantic indexing, or LSI, is an algorithm that maps word associatio= ns=20 culled from document bases into cartographic clusters in n-space, then proj= ects=20 those clusters into m-space, m < n. The Wikipedia article (found in the=20 referenced search) describes it in terms of the underlying matrix operation= s,=20 which is great fun for those who like such things but loses the forest for = the=20 trees. I think of it as dusty blobs floating in infospace (like cyberspace= but=20 doesn't need computers) with bright spots where terms (words, phrases, othe= r=20 semantic molecules) live near each other, like "driver" not far from "wedge= ". =20 With lots of documents and terms, you have a lot of points of light, m time= s n=20 dimensions. The reduction brings it down to about one to three hundred=20 dimensions, but it's still a hyperspatial map (in the cartographic sense). It is compute-intensive to map out all those correlations but makes for not= =20 only a fast search, but one that seems almost magical in its ability to pul= l=20 meaningful answers out of the morass. The term "latent semantic indexing"= =20 refers to how all that dusty matrix math ("indexing") pulls out the meaning= =20 ("semantic") lurking in the way words coalesce in meaningful documents=20 ("latent"). The correlations capture the correlations in the content that= =20 embody the linguistic meaning, so search results correlate quite well to semantics in the query's meaning. To a person fond of maths, information science, library science, linguistic= s,=20 semiotics, cartography, AI or document management LSI is a delight. --=20 Lew --=20 Lew