Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!feeder.erje.net!news-1.dfn.de!news.dfn.de!cache.uni-koblenz.de!not-for-mail
From: Tassilo Horn <tassilo@member.fsf.org>
Newsgroups: comp.lang.java.programmer
Subject: Re: Fast String search algorithm
Date: Fri, 16 Dec 2011 20:07:25 +0100
Organization: University Koblenz-Landau Campus Koblenz
Lines: 25
Message-ID: <87r504s5ci.fsf@thinkpad.tsdh.de>
References: <i02ne7ptvvl8ofv9uracdu2jm3oq61v2h3@4ax.com>
NNTP-Posting-Host: 91-67-11-43-dynip.superkabel.de
Mime-Version: 1.0
Content-Type: text/plain
X-Trace: cache.uni-koblenz.de 1324062446 30966 91.67.11.43 (16 Dec 2011 19:07:26 GMT)
X-Complaints-To: news@cache.uni-koblenz.de
NNTP-Posting-Date: Fri, 16 Dec 2011 19:07:26 +0000 (UTC)
User-Agent: Gnus/5.110018 (No Gnus v0.18) Emacs/24.0.92 (gnu/linux)
Cancel-Lock: sha1:UCJwt5ybRyMv04CnMACCyEyG8jY=
Xref: x330-a1.tempe.blueboxinc.net comp.lang.java.programmer:10809

Roedy Green <see_website@mindprod.com.invalid> writes:

> Let's say I had a million records each with a text field.  I wanted to
> find all records that contained a given substring.  Are there fast
> algorithms to do that or do you have to scan the whole thing linearly?

Sure.  See http://en.wikipedia.org/wiki/Substring_index

> Similarly, I was wondering how you might build an index to all the
> files on a computer so that you could find one just by giving a wild
> card or partial filename.

On UNIXes, that's exactly what the locate tool does.  Then you can
search with globbing wildcards

  % # all files containing the substring foo
  % locate *foo*

or even with regular expressions

  % # all files that end with "r.c" or "r.cpp"
  % locate -r '.*r\.\(c\|cpp\)$'

Bye,
Tassilo