Groups > comp.lang.java.programmer > #14785 > unrolled thread

Patricia trie vs binary search.

Started by	markspace <-@.>
First post	2012-05-24 16:07 -0700
Last post	2012-05-29 09:24 -0400
Articles	20 on this page of 33 — 7 participants

Back to article view | Back to comp.lang.java.programmer

  Patricia trie vs binary search. markspace <-@.> - 2012-05-24 16:07 -0700
    Re: Patricia trie vs binary search. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2012-05-24 23:39 +0000
      Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-24 17:56 -0700
        Password quality (Was: Patricia trie vs binary search.) Lew <lewbloch@gmail.com> - 2012-05-25 09:41 -0700
          Re: Password quality (Was: Patricia trie vs binary search.) markspace <-@.> - 2012-05-25 12:17 -0700
    Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-26 17:30 -0700
      Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-26 18:17 -0700
      Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-27 18:44 -0700
        Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-27 22:00 -0700
          Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 08:20 -0700
            Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 14:38 -0700
          Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-28 09:20 -0700
            Re: Patricia trie vs binary search. Lew <noone@lewscanon.com> - 2012-05-28 21:54 -0700
              Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 09:14 -0700
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:55 -0700
                  Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 11:17 -0700
                    Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:22 -0700
                      Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:44 -0700
                    Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 14:03 -0700
                      Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:49 -0700
                        Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:23 -0700
                          Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:39 -0700
                            Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 16:08 -0700
                              Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 18:25 -0700
              Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:16 -0700
                Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:37 -0400
                  Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 10:49 -0700
                    Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:58 -0400
                      Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:20 -0400
                      Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:21 -0700
                        Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:29 -0400
                          Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 15:00 -0400
    Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 09:24 -0400

Page 1 of 2 [1] 2 Next page →

#14785 — Patricia trie vs binary search.

From	markspace <-@.>
Date	2012-05-24 16:07 -0700
Subject	Patricia trie vs binary search.
Message-ID	<jpmev4$63l$1@dont-email.me>

For some reason I was thinking about sub-string searches today.  I 
looked up Patricia tries (a kind of radix tree) to see if they would 
help.  While interesting, the radix tree seems to have a lot of overhead 
for large numbers of entries.

The radix tree uses a bucket at each level to hold all children (and 
there could be quite a lot of children).  Each child if present requires 
a pointer (an object in Java) to hold it.  For the example given, this 
could be as much as one object per character in each string, plus the 
bucket to hold it and its siblings.  If the number strings is very 
large, this could really result in an explosion of memory usage.

<http://en.wikipedia.org/wiki/Radix_tree>

So is there a better way for large data sets?

For the example give, it appears to be a kind of incremental search. 
Those, first we find the "r" which is common to all the words in the 
example.  Then if we're looking for, say, "romane" we'd find that the 
"om" is common to all words that match.  Then we find that "an" matches 
both "romane" and "romanus".  Finally we find the "e" which tells us 
"romane" exist in the tree.

So I implemented a simple "incremental binary search".  Given a string, 
the search tells you if there's any elements that begin with the 
parameter string.  It also remembers the previous match, so that the 
next search picks up from that location.  "ro" could be matched next 
along with "rom".  If the search ever returns a non-match condition, you 
know you've hit the maximum extent.  There are no further matches, 
regardless how many additional characters you append.

This seems useful for "short-circuiting" long string matches, allowing 
you to pick out dictionary matches without inuring average n^2 time.

My question is: does this really buy me anything?  Or did I just 
duplicate some other well known algorithms that works just as well or 
better?

/*
  * Copyright 2012 Brenden Towey.  All rights reserved.
  */
package com.techdarwinia.trie;

import java.io.FileReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Scanner;

/**
  *
  * @author Brenden Towey
  */
public class IncrementalStringSearch
{

    private int low;
    private int high;
    private String[] array;

    public IncrementalStringSearch( String[] array )
    {
       this.array = array;
       high = array.length - 1;
    }

    public boolean find( String findString )
    {
       if( low > high ) return false;
       for( ;; ) {
          int index = ( low + high ) / 2;
          String foundSub = array[index];
          if( array[index].length() > findString.length() )
             foundSub = array[index].substring( 0, findString.length() );
          else
             foundSub = array[index];
          if( foundSub.equals( findString ) )
             return true;
          if( foundSub.compareTo( findString ) > 0 ) {
             high = index - 1;
             if( high < low ) return false;
          } else {
             low = index + 1;
             if( low > high ) return false;
          }
       }
    }

    public static void main( String[] args ) throws Exception
    {
       ArrayList<String> passwords = new ArrayList<>();

       Reader ins = new FileReader( "test/com/techdarwinia/trie/bpw.txt" );
       Scanner scanner = new Scanner( ins );
       scanner.useDelimiter( ",?\\s+" );
       while( scanner.hasNext() ) {
          passwords.add( scanner.next() );
       }
       ins.close();
       Collections.sort( passwords );

       String test = "passxword ";
       for( int i = 0; i < test.length(); i++ ) {
          IncrementalStringSearch inc = new IncrementalStringSearch(
                  passwords.toArray( new String[0] ) );
          for( int j = i + 1; j <= test.length(); j++ ) {
             String string = test.substring( i, j );
             if( !inc.find( string ) ){
                if( j > i+1 && passwords.contains( test.substring( i, 
j-1 ) ) ) {
                   System.out.println( "Found: "+test.substring( i, j-1 ) );
                }
                break;
             }
          }
       }
    }
}

[toc] | [next] | [standalone]

#14786

From	glen herrmannsfeldt <gah@ugcs.caltech.edu>
Date	2012-05-24 23:39 +0000
Message-ID	<jpmgrb$o5a$1@speranza.aioe.org>
In reply to	#14785

markspace <-@.> wrote:
> For some reason I was thinking about sub-string searches today.  I 
> looked up Patricia tries (a kind of radix tree) to see if they would 
> help.  While interesting, the radix tree seems to have a lot of overhead 
> for large numbers of entries.

I am not so sure which substring problem you are interested in,
but you might look at the Aho-Korasick algorithm.

If you want to allow for insertions and/or deletions, look
at dynamic programming.

-- glen

[toc] | [prev] | [next] | [standalone]

#14787

From	markspace <-@.>
Date	2012-05-24 17:56 -0700
Message-ID	<jpmlbf$a3o$1@dont-email.me>
In reply to	#14786

On 5/24/2012 4:39 PM, glen herrmannsfeldt wrote:
> markspace<-@.>  wrote:
>> For some reason I was thinking about sub-string searches today.  I
>> looked up Patricia tries (a kind of radix tree) to see if they would
>> help.  While interesting, the radix tree seems to have a lot of overhead
>> for large numbers of entries.
>
> I am not so sure which substring problem you are interested in,
> but you might look at the Aho-Korasick algorithm.
>
> If you want to allow for insertions and/or deletions, look
> at dynamic programming.

Thanks for those pointers (pun not intended).  The "sub-string search" 
was referring to was picking words out of a longer, undelimited string.

The example I posted looked for bad passwords within a larger password.

"passxword" contains at least two bad passwords, pass and word, 
according to the dictionary file I download from here (that's the 
bpw.txt file in the sample code):

<http://www.godlikeproductions.com/badpasswords.php>

The code I posted find pass, ass, word, and or in the string 
"passxword", which are all bad passwords according to that link above. 
(I deliberately avoid reporting single letters as bad).

Practically speaking, hashing appears to be just as fast for a lookup, 
but hashing doesn't tell you when to stop, so it's always N^2 for this 
sort of problem.  My algorithm should average less, hopefully much less 
if you have very long input strings.

I really don't know what set me off looking at this; it's just a little 
random experiment I suppose.

[toc] | [prev] | [next] | [standalone]

#14795 — Password quality (Was: Patricia trie vs binary search.)

From	Lew <lewbloch@gmail.com>
Date	2012-05-25 09:41 -0700
Subject	Password quality (Was: Patricia trie vs binary search.)
Message-ID	<a076f3cb-d062-41e1-9a9a-43de60de82fb@googlegroups.com>
In reply to	#14787

markspace wrote:
> Thanks for those pointers (pun not intended).  The "sub-string search" 
> was referring to was picking words out of a longer, undelimited string.
> 
> The example I posted looked for bad passwords within a larger password.
> 
> "passxword" contains at least two bad passwords, pass and word, 
> according to the dictionary file I download from here (that's the 
> bpw.txt file in the sample code):
> 
> <http://www.godlikeproductions.com/badpasswords.php>
> 
> The code I posted find pass, ass, word, and or in the string 
> "passxword", which are all bad passwords according to that link above. 
> (I deliberately avoid reporting single letters as bad).

I wonder about eliminating two-letter combinations. How much entropy does that add (or subtract) from passwords?

It's practicable and arguably more reliable to use passphrases comprising 
all natural words whose entropy exceeds that of a fairly long Mxyzptlk® 
key. (Note: "Mxyzptlk" may well pass all your password checks, yet is highly 
guessable. Equally flawed are other stringlets that pass naive checks, 
like "XB-17", "UB40" and others.) See 
http://world.std.com/~reinhold/diceware.html 
for how to create high-entropy, highly memorable passphrases.

> Practically speaking, hashing appears to be just as fast for a lookup, 
> but hashing doesn't tell you when to stop, so it's always N^2 for this 
> sort of problem.  My algorithm should average less, hopefully much less 
> if you have very long input strings.
> 
> I really don't know what set me off looking at this; it's just a little 
> random experiment I suppose.

Your main question of space- and time-efficient substring matching is 
a fairly well-studied problem. I don't off the top of my head have better 
answers, but your approach to experiment is viable.

You could instrument (profile) your code over different algorithms and 
input dataset profiles (that is, relative proportion of long, medium-length 
and short strings, relative proportion of "bad" substrings in the mix, and 
other factors that affect the "shape" of the data your algorithms process).
Algorithm analysis should give you the big O, but not the constant factors 
or where they break for lower /n/.

A lot of algorithm analysis critically depends on what constitutes "average" 
datasets.

-- 
Lew

[toc] | [prev] | [next] | [standalone]

#14797 — Re: Password quality (Was: Patricia trie vs binary search.)

From	markspace <-@.>
Date	2012-05-25 12:17 -0700
Subject	Re: Password quality (Was: Patricia trie vs binary search.)
Message-ID	<jpolrn$ale$1@dont-email.me>
In reply to	#14795

On 5/25/2012 9:41 AM, Lew wrote:
>
> I wonder about eliminating two-letter combinations. How much entropy
> does that add (or subtract) from passwords?

I was thinking the same thing.  Also searching for the longest possible 
word, and not bactracking, might be sufficient.  Once you find "word", 
why go back and scan for two or three letter combinations?

>
> It's practicable and arguably more reliable to use passphrases
> comprising all natural words whose entropy exceeds that of a fairly
> long Mxyzptlk® key. (Note: "Mxyzptlk" may well pass all your password
> checks, yet is highly guessable. Equally flawed are other stringlets

The idea was that if you increase the required length of a passphrase, 
users may defeat your requirement by just repeating a shorter bad 
password. "birdbirdbirdbird" is a pretty guessable 16 letter password, 
and "bird" appears in the bad password list.  However, 
"birdaliceferretsalut" is a pretty decent password, even though each of 
its four component words appears in the bad password list.  So if you 
can spot the individual component words and make sure they don't repeat, 
you've improved entropy a bit.

> that pass naive checks, like "XB-17", "UB40" and others.) See
> http://world.std.com/~reinhold/diceware.html for how to create
> high-entropy, highly memorable passphrases.

Yes, there should be other checks.  Overall length, and let's say at 
least 5 to 7 different characters.  So even though "UB40UB40UB40UB40" is 
16 characters and no sub-words appear on the bad password list, it only 
uses 4 different characters, which we might not want to allow.

> Your main question of space- and time-efficient substring matching
> is a fairly well-studied problem. I don't off the top of my head have
> better answers, but your approach to experiment is viable.

Right, although comparing algorithms can be hard too.  I would have to 
implement each algorithm such that it was optimal, and I don't always 
have the skill or time to do that.  "Many eyes" on a problem is often 
the more efficient solution.  (This is also know as "research" and "not 
re-inventing the wheel".)  Though certainly it wouldn't hurt to make the 
attempt.

Glen's answer above was very helpful.  Wikipedia has cross-referenced 
their string-matching algorithms so that finding one leads to all the 
others.

<http://en.wikipedia.org/wiki/Category:String_matching_algorithms>

This gives me something to chew on, at least.

[toc] | [prev] | [next] | [standalone]

#14825

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-26 17:30 -0700
Message-ID	<uAewr.45671$On2.38133@newsfe16.iad>
In reply to	#14785

On 5/24/12 4:07 PM, markspace wrote:
> For some reason I was thinking about sub-string searches today. I looked
> up Patricia tries (a kind of radix tree) to see if they would help.
> While interesting, the radix tree seems to have a lot of overhead for
> large numbers of entries.
>
> The radix tree uses a bucket at each level to hold all children (and
> there could be quite a lot of children). Each child if present requires
> a pointer (an object in Java) to hold it. For the example given, this
> could be as much as one object per character in each string, plus the
> bucket to hold it and its siblings. If the number strings is very large,
> this could really result in an explosion of memory usage.

I tend to use a Deterministic Finite State Automata for this.  You can 
load the entire English dictionary fairly easily with that scheme. Yes, 
you use a bit of memory, but unless you're doing this on an embedded 
device, its probably not enough memory to be concerned about.

Most of what I know about "searching" and "parsing", I've learned from 
"Parsing Techniques - A Practical Guide" 
<http://dickgrune.com/Books/PTAPG_1st_Edition/>.  Free PDF or PS 
downloads on that page.

Very worth a read. I'm sure parsing theory has been much extended since 
this book was written, however it is definitely a good introduction to 
the concepts in the space.

HTH,
Daniel.

[toc] | [prev] | [next] | [standalone]

#14826

From	markspace <-@.>
Date	2012-05-26 18:17 -0700
Message-ID	<jprvba$ur$1@dont-email.me>
In reply to	#14825

That looks very interesting.  Thanks!  I'll have to check it out.

[toc] | [prev] | [next] | [standalone]

#14847

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-27 18:44 -0700
Message-ID	<s0m5s7t5gt33plp5m6cn07g4ovehvedfb7@4ax.com>
In reply to	#14825

On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:

[snip]

>I tend to use a Deterministic Finite State Automata for this.  You can 
>load the entire English dictionary fairly easily with that scheme. Yes, 
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>you use a bit of memory, but unless you're doing this on an embedded 
>device, its probably not enough memory to be concerned about.

     Including all affixes?

[snip]

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14852

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-27 22:00 -0700
Message-ID	<ADDwr.2826$9Q6.1871@newsfe18.iad>
In reply to	#14847

On 5/27/12 6:44 PM, Gene Wirchenko wrote:
> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net>  wrote:
>
> [snip]
>
>> I tend to use a Deterministic Finite State Automata for this.  You can
>> load the entire English dictionary fairly easily with that scheme. Yes,
>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> you use a bit of memory, but unless you're doing this on an embedded
>> device, its probably not enough memory to be concerned about.
>
>       Including all affixes?
I suppose it depends on the particular dictionary, but we're only 
talking a few hundred thousand entries, at least with the Moby 
word-lists as a base:

cat compound-words.txt often-mispelled.txt english-most-frequent.txt 
male-names.txt female-names.txt common-names.txt common-dictionary.txt 
official-scrabble-* | sort | uniq | wc -lc
   388997 4801599

That's just over 4MB of strings, if stored as naively as possible. 
Certainly no problem for a modern day computer. It can actually be quite 
compressed when converted it to a DFSA, assuming reasonably optimized 
implementations.
>
> [snip]
>
> Sincerely,
>
> Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14861

From	markspace <-@.>
Date	2012-05-28 08:20 -0700
Message-ID	<jq053r$hr7$1@dont-email.me>
In reply to	#14852

On 5/27/2012 10:00 PM, Daniel Pitts wrote:
> the Moby
> word-lists


Moby word lists are neat, thanks for pointing that out.

[toc] | [prev] | [next] | [standalone]

#14872

From	markspace <-@.>
Date	2012-05-28 14:38 -0700
Message-ID	<jq0r9f$101$1@dont-email.me>
In reply to	#14861

On 5/28/2012 8:20 AM, markspace wrote:
> On 5/27/2012 10:00 PM, Daniel Pitts wrote:
>> the Moby
>> word-lists
>
>
> Moby word lists are neat, thanks for pointing that out.

While I'm thinking about it, here's a link I found from BYU with links 
to other corpus and word lists:

http://corpus.byu.edu/

[toc] | [prev] | [next] | [standalone]

#14864

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-28 09:20 -0700
Message-ID	<hr87s756sdaku47lh8sebnegsmskclu637@4ax.com>
In reply to	#14852

On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:

>On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>> <newsgroup.nospam@virtualinfinity.net>  wrote:
>>
>> [snip]
>>
>>> I tend to use a Deterministic Finite State Automata for this.  You can
>>> load the entire English dictionary fairly easily with that scheme. Yes,
>>        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> you use a bit of memory, but unless you're doing this on an embedded
>>> device, its probably not enough memory to be concerned about.
>>
>>       Including all affixes?
>I suppose it depends on the particular dictionary, but we're only 
>talking a few hundred thousand entries, at least with the Moby 
>word-lists as a base:

     Considering how many affixes can be applied to some words, I find
that very questionable:
          self:
               *ish
               *ishly
               un*ish
               un*ishly
               *less
               *lessly
               un*less
               un*lessly
          position:
               *s
               *ed
               *al
               *ally
               re*
               re*s
               re*ed
               *less
               mis*
               *er
               *ers
          friend
               *s
               *ly
               *liness
               be*
               be*ing
               be*ed
               be*er
               be*ers
These are not particularly extreme examples.

[snip]

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14874

From	Lew <noone@lewscanon.com>
Date	2012-05-28 21:54 -0700
Message-ID	<jq1kpt$1ee$1@news.albasani.net>
In reply to	#14864

On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net>  wrote:
>
>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>>> <newsgroup.nospam@virtualinfinity.net>   wrote:
>>>
>>> [snip]
>>>
>>>> I tend to use a Deterministic Finite State Automata for this.  You can
>>>> load the entire English dictionary fairly easily with that scheme. Yes,
>>>         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>> you use a bit of memory, but unless you're doing this on an embedded
>>>> device, its probably not enough memory to be concerned about.
>>>
>>>        Including all affixes?
>> I suppose it depends on the particular dictionary, but we're only
>> talking a few hundred thousand entries, at least with the Moby
>> word-lists as a base:
>
>       Considering how many affixes can be applied to some words, I find
> that very questionable:
>            self:
>                 *ish
>                 *ishly
>                 un*ish
>                 un*ishly
>                 *less
>                 *lessly
>                 un*less
>                 un*lessly
>            position:
>                 *s
>                 *ed
>                 *al
>                 *ally
>                 re*
>                 re*s
>                 re*ed
>                 *less
>                 mis*
>                 *er
>                 *ers
>            friend
>                 *s
>                 *ly
>                 *liness
>                 be*
>                 be*ing
>                 be*ed
>                 be*er
>                 be*ers
> These are not particularly extreme examples.

It's not a question of how extreme the examples are but how many there are.

Not all words can be legitimately affixed. Many can be affixed by algorithm, 
or by bitmaps as to which affixes apply, so you only store the root, the 
bitmap and perhaps one more form.

I don't know how much memory expansion you think your factors will cause, as 
you only hand wave and say there will be some and act like it's a problem, but 
let's say it doubles the size of the dictionary. By Daniel's experiment 
upthread, that would bring it to around 8 MiB, let's round and say 10MiB. 
Being text and all, that should compress to about 3 MiB or less.

Now I am interested to hear what sort of trouble you assert that 3 MiB or so 
of storage will cause.

-- 
Lew
Honi soit qui mal y pense.
http://upload.wikimedia.org/wikipedia/commons/c/cf/Friz.jpg

[toc] | [prev] | [next] | [standalone]

#14887

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-29 09:14 -0700
Message-ID	<7ss9s7tu1a8v9qbkc3qnhfg8rhck3ov4bi@4ax.com>
In reply to	#14874

On Mon, 28 May 2012 21:54:39 -0700, Lew <noone@lewscanon.com> wrote:

>On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>> <newsgroup.nospam@virtualinfinity.net>  wrote:
>>
>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:

[snip]

>> These are not particularly extreme examples.
>
>It's not a question of how extreme the examples are but how many there are.

     I mentioned extremity just to point out that such base words are
not that unusual.  There are a lot of them.

>Not all words can be legitimately affixed. Many can be affixed by algorithm, 
>or by bitmaps as to which affixes apply, so you only store the root, the 
>bitmap and perhaps one more form.

     There are multiple affixes.  I suppose they could be combined
into aggregate affixes.  e.g. -less + -ness -> -lessness.

>I don't know how much memory expansion you think your factors will cause, as 
>you only hand wave and say there will be some and act like it's a problem, but 
>let's say it doubles the size of the dictionary. By Daniel's experiment 

     Let's not make up data.  I would like to know how much of the
English language actually is in his dataset.

>upthread, that would bring it to around 8 MiB, let's round and say 10MiB. 
>Being text and all, that should compress to about 3 MiB or less.
>
>Now I am interested to hear what sort of trouble you assert that 3 MiB or so 
>of storage will cause.

     Assuming your facts and then calling me on what you made up?

http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language
is short but interesting reading.  Its final paragraph: "This suggests
that there are, at the very least, a quarter of a million distinct
English words, excluding inflections, and words from technical and
regional vocabulary not covered by the OED, or words not yet added to
the published dictionary, of which perhaps 20 per cent are no longer
in current use. If distinct senses were counted, the total would
probably approach three quarters of a million."

     Now, add on what they exclude.  Is one million words out of line?
     Itis one thing to have some of a language encoded.  It is quite
another to try to handle everything.  Exceptional cases tend to be
more difficult to handle.

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14891

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 09:55 -0700
Message-ID	<Gb7xr.31449$TC4.15127@newsfe14.iad>
In reply to	#14887

On 5/29/12 9:14 AM, Gene Wirchenko wrote:
> On Mon, 28 May 2012 21:54:39 -0700, Lew<noone@lewscanon.com>  wrote:
>
>> On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>>> <newsgroup.nospam@virtualinfinity.net>   wrote:
>>>
>>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>
> [snip]
>
>>> These are not particularly extreme examples.
>>
>> It's not a question of how extreme the examples are but how many there are.
>
>       I mentioned extremity just to point out that such base words are
> not that unusual.  There are a lot of them.
>
>> Not all words can be legitimately affixed. Many can be affixed by algorithm,
>> or by bitmaps as to which affixes apply, so you only store the root, the
>> bitmap and perhaps one more form.
>
>       There are multiple affixes.  I suppose they could be combined
> into aggregate affixes.  e.g. -less + -ness ->  -lessness.
>
>> I don't know how much memory expansion you think your factors will cause, as
>> you only hand wave and say there will be some and act like it's a problem, but
>> let's say it doubles the size of the dictionary. By Daniel's experiment
>
>       Let's not make up data.  I would like to know how much of the
> English language actually is in his dataset.
Agreed, let's not make up data. Using the Moby word-list as a guide, it 
includes a significant number of those particular suffixes and prefixes 
you've mentioned. Granted, its not a complete word-list, but then again 
nothing is.  It is "complete enough" for most purposes.

>
>> upthread, that would bring it to around 8 MiB, let's round and say 10MiB.
>> Being text and all, that should compress to about 3 MiB or less.
>>
>> Now I am interested to hear what sort of trouble you assert that 3 MiB or so
>> of storage will cause.
>
>       Assuming your facts and then calling me on what you made up?
>
> http://oxforddictionaries.com/words/how-many-words-are-there-in-the-english-language
> is short but interesting reading.  Its final paragraph: "This suggests
> that there are, at the very least, a quarter of a million distinct
> English words, excluding inflections, and words from technical and
> regional vocabulary not covered by the OED, or words not yet added to
> the published dictionary, of which perhaps 20 per cent are no longer
> in current use. If distinct senses were counted, the total would
> probably approach three quarters of a million."
>
>       Now, add on what they exclude.  Is one million words out of line?
>       Itis one thing to have some of a language encoded.  It is quite
> another to try to handle everything.  Exceptional cases tend to be
> more difficult to handle.

Are you arguing that a modern system can't handle that number of words? 
A modern desktop has more than enough memory to easily handle a quarter 
*billion* words, which is a 100 times greater than your guessed upper limit.

And that's *without* compression.

[toc] | [prev] | [next] | [standalone]

#14895

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-29 11:17 -0700
Message-ID	<pg4as7lvhhlsinnmhncvrne8vflek6agb7@4ax.com>
In reply to	#14891

On Tue, 29 May 2012 09:55:02 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:

[snip]

>Are you arguing that a modern system can't handle that number of words? 

     No.  I simply stating that the real size of the problem is much
bigger.

>A modern desktop has more than enough memory to easily handle a quarter 
>*billion* words, which is a 100 times greater than your guessed upper limit.
>
>And that's *without* compression.

     Sure.  If that is all that it does.  My main (and older) desktop
box has 1.5 GB.  I have trouble with not enough memory at times.
Adding another app might break its back.

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14898

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 11:22 -0700
Message-ID	<Mt8xr.31457$TC4.3795@newsfe14.iad>
In reply to	#14895

On 5/29/12 11:17 AM, Gene Wirchenko wrote:
> On Tue, 29 May 2012 09:55:02 -0700, Daniel Pitts
> <newsgroup.nospam@virtualinfinity.net>  wrote:
>
> [snip]
>
>> Are you arguing that a modern system can't handle that number of words?
>
>       No.  I simply stating that the real size of the problem is much
> bigger.
>
>> A modern desktop has more than enough memory to easily handle a quarter
>> *billion* words, which is a 100 times greater than your guessed upper limit.
>>
>> And that's *without* compression.
>
>       Sure.  If that is all that it does.  My main (and older) desktop
> box has 1.5 GB.  I have trouble with not enough memory at times.
> Adding another app might break its back.
>
> Sincerely,
>
> Gene Wirchenko

Moving the goalpost again are we?

[toc] | [prev] | [next] | [standalone]

#14916

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-29 14:44 -0700
Message-ID	<jmgas7tt9fqdmi2agn7uci7bgm0ofh8nul@4ax.com>
In reply to	#14898

On Tue, 29 May 2012 11:22:35 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:

>On 5/29/12 11:17 AM, Gene Wirchenko wrote:
>> On Tue, 29 May 2012 09:55:02 -0700, Daniel Pitts
>> <newsgroup.nospam@virtualinfinity.net>  wrote:
>>
>> [snip]
>>
>>> Are you arguing that a modern system can't handle that number of words?
>>
>>       No.  I simply stating that the real size of the problem is much
>> bigger.
>>
>>> A modern desktop has more than enough memory to easily handle a quarter
>>> *billion* words, which is a 100 times greater than your guessed upper limit.
>>>
>>> And that's *without* compression.
>>
>>       Sure.  If that is all that it does.  My main (and older) desktop
>> box has 1.5 GB.  I have trouble with not enough memory at times.
>> Adding another app might break its back.

>Moving the goalpost again are we?

     No.  Someone claimed that the amount of memory needed was low and
that modern systems have many times that amount.  Sure, they do, but
they also typically run more than one app.  Assuming that your app can
use all of memory is not very friendly.

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14910

From	Lew <lewbloch@gmail.com>
Date	2012-05-29 14:03 -0700
Message-ID	<5f02c9ad-a414-41dd-8ac9-723d7a8651e5@googlegroups.com>
In reply to	#14895

Gene Wirchenko wrote:
> Daniel Pitts wrote:
> 
> [snip]
> 
> >Are you arguing that a modern system can't handle that number of words? 
> 
>      No.  I simply stating that the real size of the problem is much
> bigger.

With no numbers that differ from Daniel's to back up your claim.

You called my numbers "made up", but it turned out they were 
*larger* than the real numbers.

You cite "a quarter of a million" words. Daniel counted roughly 
*150%* of that in his word base.

My numbers were generous. Yours are not even significantly different, 
and in fact are smaller than his. The numbers do show there is not 
much problem, yet somehow you claim with no logic or reasoning or 
different data that they do show a problem.

Clearly you are mistaken.

Daniel showed evidence from experimentation. His numbers jibe with 
yours. Without compression, his data occupy roughly 5 MiB of memory.

Show the problem or withdraw the claim.

>> A modern desktop has more than enough memory to easily handle a quarter 
>> *billion* words, which is a 100 times greater than your guessed upper limit.
>>
>> And that's *without* compression.
> 
>      Sure.  If that is all that it does.  My main (and older) desktop
> box has 1.5 GB.  I have trouble with not enough memory at times.
> Adding another app might break its back.

Again, how much damage will < 5 MiB of data do to that system?

How about 50 MiB? That's *ten times* the number of words you might need to handle. 
Without any compression.

You haven't shown a problem. You just chant, "The sky is falling!"

-- 
Lew

[toc] | [prev] | [next] | [standalone]

#14917

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-29 14:49 -0700
Message-ID	<gpgas71pk57o0s68ie0qf1pgn9dg5hbkpb@4ax.com>
In reply to	#14910

On Tue, 29 May 2012 14:03:10 -0700 (PDT), Lew <lewbloch@gmail.com>
wrote:

>Gene Wirchenko wrote:
>> Daniel Pitts wrote:
>> 
>> [snip]
>> 
>> >Are you arguing that a modern system can't handle that number of words? 
>> 
>>      No.  I simply stating that the real size of the problem is much
>> bigger.
>
>With no numbers that differ from Daniel's to back up your claim.
>
>You called my numbers "made up", but it turned out they were 
>*larger* than the real numbers.
>
>You cite "a quarter of a million" words. Daniel counted roughly 
>*150%* of that in his word base.

     Ah, selective reading.

     For root forms, it was 1/4 million.  With affixes -- and remember
that my first question was about them -- the figure was 3/4 million.
This is double what Daniel counted, and 3/4 million does not include
technical words, etc.  Take a look at the *full* paragraph that I
quoted, not just the lowest number.

>My numbers were generous. Yours are not even significantly different, 
>and in fact are smaller than his. The numbers do show there is not 
>much problem, yet somehow you claim with no logic or reasoning or 
>different data that they do show a problem.

     Take another look at that paragraph I quoted.  Really.

>Clearly you are mistaken.
>
>Daniel showed evidence from experimentation. His numbers jibe with 
>yours. Without compression, his data occupy roughly 5 MiB of memory.
>
>Show the problem or withdraw the claim.

     Read my statement of the problem.

>>> A modern desktop has more than enough memory to easily handle a quarter 
>>> *billion* words, which is a 100 times greater than your guessed upper limit.
>>>
>>> And that's *without* compression.
>> 
>>      Sure.  If that is all that it does.  My main (and older) desktop
>> box has 1.5 GB.  I have trouble with not enough memory at times.
>> Adding another app might break its back.
>
>Again, how much damage will < 5 MiB of data do to that system?
>
>How about 50 MiB? That's *ten times* the number of words you might need to handle. 
>Without any compression.

     Fine.  My system is currently using 1547 MB of memory.  It only
has 1536 MB.  With some intensive uses, I have seen the commit go to
as high as about 2200 MB.  My system crawls then.  Using 50 MB more in
those circumstances would not be good.

>You haven't shown a problem. You just chant, "The sky is falling!"

     I believe that I have.  Handwaving the possibility away does not
make it go away.

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

Page 1 of 2 [1] 2 Next page →

csiph-web

Patricia trie vs binary search.

Contents

#14785 — Patricia trie vs binary search.

#14786

#14787

#14795 — Password quality (Was: Patricia trie vs binary search.)

#14797 — Re: Password quality (Was: Patricia trie vs binary search.)

#14825

#14826

#14847

#14852

#14861

#14872

#14864

#14874

#14887

#14891

#14895

#14898

#14916

#14910

#14917