Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #14785

Patricia trie vs binary search.

From markspace <-@.>
Newsgroups comp.lang.java.programmer
Subject Patricia trie vs binary search.
Date 2012-05-24 16:07 -0700
Organization A noiseless patient Spider
Message-ID <jpmev4$63l$1@dont-email.me> (permalink)

Show all headers | View raw


For some reason I was thinking about sub-string searches today.  I 
looked up Patricia tries (a kind of radix tree) to see if they would 
help.  While interesting, the radix tree seems to have a lot of overhead 
for large numbers of entries.

The radix tree uses a bucket at each level to hold all children (and 
there could be quite a lot of children).  Each child if present requires 
a pointer (an object in Java) to hold it.  For the example given, this 
could be as much as one object per character in each string, plus the 
bucket to hold it and its siblings.  If the number strings is very 
large, this could really result in an explosion of memory usage.

<http://en.wikipedia.org/wiki/Radix_tree>

So is there a better way for large data sets?

For the example give, it appears to be a kind of incremental search. 
Those, first we find the "r" which is common to all the words in the 
example.  Then if we're looking for, say, "romane" we'd find that the 
"om" is common to all words that match.  Then we find that "an" matches 
both "romane" and "romanus".  Finally we find the "e" which tells us 
"romane" exist in the tree.

So I implemented a simple "incremental binary search".  Given a string, 
the search tells you if there's any elements that begin with the 
parameter string.  It also remembers the previous match, so that the 
next search picks up from that location.  "ro" could be matched next 
along with "rom".  If the search ever returns a non-match condition, you 
know you've hit the maximum extent.  There are no further matches, 
regardless how many additional characters you append.

This seems useful for "short-circuiting" long string matches, allowing 
you to pick out dictionary matches without inuring average n^2 time.

My question is: does this really buy me anything?  Or did I just 
duplicate some other well known algorithms that works just as well or 
better?

/*
  * Copyright 2012 Brenden Towey.  All rights reserved.
  */
package com.techdarwinia.trie;

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;

/**
  *
  * @author Brenden Towey
  */
public class IncrementalStringSearch
{

    private int low;
    private int high;
    private String[] array;

    public IncrementalStringSearch( String[] array )
    {
       this.array = array;
       high = array.length - 1;
    }

    public boolean find( String findString )
    {
       if( low > high ) return false;
       for( ;; ) {
          int index = ( low + high ) / 2;
          String foundSub = array[index];
          if( array[index].length() > findString.length() )
             foundSub = array[index].substring( 0, findString.length() );
          else
             foundSub = array[index];
          if( foundSub.equals( findString ) )
             return true;
          if( foundSub.compareTo( findString ) > 0 ) {
             high = index - 1;
             if( high < low ) return false;
          } else {
             low = index + 1;
             if( low > high ) return false;
          }
       }
    }

    public static void main( String[] args ) throws Exception
    {
       ArrayList<String> passwords = new ArrayList<>();

       Reader ins = new FileReader( "test/com/techdarwinia/trie/bpw.txt" );
       Scanner scanner = new Scanner( ins );
       scanner.useDelimiter( ",?\\s+" );
       while( scanner.hasNext() ) {
          passwords.add( scanner.next() );
       }
       ins.close();
       Collections.sort( passwords );

       String test = "passxword ";
       for( int i = 0; i < test.length(); i++ ) {
          IncrementalStringSearch inc = new IncrementalStringSearch(
                  passwords.toArray( new String[0] ) );
          for( int j = i + 1; j <= test.length(); j++ ) {
             String string = test.substring( i, j );
             if( !inc.find( string ) ){
                if( j > i+1 && passwords.contains( test.substring( i, 
j-1 ) ) ) {
                   System.out.println( "Found: "+test.substring( i, j-1 ) );
                }
                break;
             }
          }
       }
    }
}

Back to comp.lang.java.programmer | Previous | NextNext in thread | Find similar | Unroll thread


Thread

Patricia trie vs binary search. markspace <-@.> - 2012-05-24 16:07 -0700
  Re: Patricia trie vs binary search. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2012-05-24 23:39 +0000
    Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-24 17:56 -0700
      Password quality (Was: Patricia trie vs binary search.) Lew <lewbloch@gmail.com> - 2012-05-25 09:41 -0700
        Re: Password quality (Was: Patricia trie vs binary search.) markspace <-@.> - 2012-05-25 12:17 -0700
  Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-26 17:30 -0700
    Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-26 18:17 -0700
    Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-27 18:44 -0700
      Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-27 22:00 -0700
        Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 08:20 -0700
          Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 14:38 -0700
        Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-28 09:20 -0700
          Re: Patricia trie vs binary search. Lew <noone@lewscanon.com> - 2012-05-28 21:54 -0700
            Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 09:14 -0700
              Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:55 -0700
                Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 11:17 -0700
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:22 -0700
                Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:44 -0700
                Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 14:03 -0700
                Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:49 -0700
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:23 -0700
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:39 -0700
                Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 16:08 -0700
                Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 18:25 -0700
            Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:16 -0700
              Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:37 -0400
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 10:49 -0700
                Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:58 -0400
                Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:20 -0400
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:21 -0700
                Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:29 -0400
                Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 15:00 -0400
  Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 09:24 -0400

csiph-web