Groups > comp.lang.java.programmer > #14785 > unrolled thread

Patricia trie vs binary search.

Started by	markspace <-@.>
First post	2012-05-24 16:07 -0700
Last post	2012-05-29 09:24 -0400
Articles	13 on this page of 33 — 7 participants

Back to article view | Back to comp.lang.java.programmer

  Patricia trie vs binary search. markspace <-@.> - 2012-05-24 16:07 -0700
    Re: Patricia trie vs binary search. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2012-05-24 23:39 +0000
      Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-24 17:56 -0700
        Password quality (Was: Patricia trie vs binary search.) Lew <lewbloch@gmail.com> - 2012-05-25 09:41 -0700
          Re: Password quality (Was: Patricia trie vs binary search.) markspace <-@.> - 2012-05-25 12:17 -0700
    Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-26 17:30 -0700
      Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-26 18:17 -0700
      Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-27 18:44 -0700
        Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-27 22:00 -0700
          Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 08:20 -0700
            Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 14:38 -0700
          Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-28 09:20 -0700
            Re: Patricia trie vs binary search. Lew <noone@lewscanon.com> - 2012-05-28 21:54 -0700
              Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 09:14 -0700
                Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:55 -0700
                  Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 11:17 -0700
                    Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:22 -0700
                      Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:44 -0700
                    Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 14:03 -0700
                      Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:49 -0700
                        Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:23 -0700
                          Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:39 -0700
                            Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 16:08 -0700
                              Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 18:25 -0700
              Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:16 -0700
                Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:37 -0400
                  Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 10:49 -0700
                    Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:58 -0400
                      Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:20 -0400
                      Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:21 -0700
                        Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:29 -0400
                          Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 15:00 -0400
    Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 09:24 -0400

Page 2 of 2 — ← Prev page 1 [2]

#14920

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 15:23 -0700
Message-ID	<X%bxr.36808$6Y6.35155@newsfe19.iad>
In reply to	#14917

On 5/29/12 2:49 PM, Gene Wirchenko wrote:
> On Tue, 29 May 2012 14:03:10 -0700 (PDT), Lew<lewbloch@gmail.com>
> wrote:
>
>> Gene Wirchenko wrote:
>>> Daniel Pitts wrote:
>>>
>>> [snip]
>>>
>>>> Are you arguing that a modern system can't handle that number of words?
>>>
>>>       No.  I simply stating that the real size of the problem is much
>>> bigger.
>>
>> With no numbers that differ from Daniel's to back up your claim.
>>
>> You called my numbers "made up", but it turned out they were
>> *larger* than the real numbers.
>>
>> You cite "a quarter of a million" words. Daniel counted roughly
>> *150%* of that in his word base.
>
>       Ah, selective reading.
>
>       For root forms, it was 1/4 million.  With affixes -- and remember
> that my first question was about them -- the figure was 3/4 million.
> This is double what Daniel counted, and 3/4 million does not include
> technical words, etc.  Take a look at the *full* paragraph that I
> quoted, not just the lowest number.
>
>> My numbers were generous. Yours are not even significantly different,
>> and in fact are smaller than his. The numbers do show there is not
>> much problem, yet somehow you claim with no logic or reasoning or
>> different data that they do show a problem.
>
>       Take another look at that paragraph I quoted.  Really.
>
>> Clearly you are mistaken.
>>
>> Daniel showed evidence from experimentation. His numbers jibe with
>> yours. Without compression, his data occupy roughly 5 MiB of memory.
>>
>> Show the problem or withdraw the claim.
>
>       Read my statement of the problem.
>
>>>> A modern desktop has more than enough memory to easily handle a quarter
>>>> *billion* words, which is a 100 times greater than your guessed upper limit.
>>>>
>>>> And that's *without* compression.
>>>
>>>       Sure.  If that is all that it does.  My main (and older) desktop
>>> box has 1.5 GB.  I have trouble with not enough memory at times.
>>> Adding another app might break its back.
>>
>> Again, how much damage will<  5 MiB of data do to that system?
>>
>> How about 50 MiB? That's *ten times* the number of words you might need to handle.
>> Without any compression.
>
>       Fine.  My system is currently using 1547 MB of memory.  It only
> has 1536 MB.  With some intensive uses, I have seen the commit go to
> as high as about 2200 MB.  My system crawls then.  Using 50 MB more in
> those circumstances would not be good.
>
>> You haven't shown a problem. You just chant, "The sky is falling!"
>
>       I believe that I have.  Handwaving the possibility away does not
> make it go away.
>
> Sincerely,
>
> Gene Wirchenko


Test program:

package net.virtualinfinity.moby;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;

public class WordTree {
     WordNode start = new WordNode(-1);

     public void addWord(String word) {
         start.addWord(word);
     }

     public static void main(String[] args) throws IOException {
         WordTree tree = new WordTree();
         int wordsLoaded = 0;
         for (String filename: args) {
             System.out.println("Loading " + filename);
             BufferedReader reader = new BufferedReader(new 
FileReader(filename), 1024*256);
             String line;
             while ((line = reader.readLine()) != null) {
                 ++wordsLoaded;
                 if ((wordsLoaded & 127) == 0) {
                     printStatus(wordsLoaded);
                 }
                 tree.addWord(line.trim());
             }
             System.gc();
             printStatus(wordsLoaded);
             System.out.println();
         }
     }

     private static void printStatus(int wordsLoaded) {
         System.out.print("\rWords loaded: " + wordsLoaded + ", Total 
Memory used:  " + Runtime.getRuntime().totalMemory() + ".           ");
     }
}


class WordNode {
     private boolean terminal;
     private final int value;
     private WordNode[] next;

     public WordNode(int ch) {
         value = ch;
     }

     public void addWord(String word) {
         if ("".equals(word)) {
             terminal = true;
             return;
         }
         final int ch = word.codePointAt(0);
         getNext(ch).addWord(word.substring(Character.charCount(ch)));
     }

     private WordNode getNext(int ch) {
         if (next == null) {
             next = new WordNode[1];
             next[0] = new WordNode(ch);
         }
         for (WordNode node: next) {
             if (node.value == ch) {
                 return node;
             }
         }
         next = Arrays.copyOf(next, next.length +1);
         final WordNode newNode = new WordNode(ch);
         next[next.length-1] = newNode;
         return newNode;
     }
}

------- Output -----

Loading compound-words.txt
Words loaded: 256772, Total Memory used:  120909824.
Loading often-mispelled.txt
Words loaded: 257138, Total Memory used:  121106432.
Loading english-most-frequent.txt
Words loaded: 258141, Total Memory used:  121106432.
Loading male-names.txt
Words loaded: 262038, Total Memory used:  121106432.
Loading female-names.txt
Words loaded: 266984, Total Memory used:  121106432.
Loading common-names.txt
Words loaded: 288970, Total Memory used:  121106432.
Loading common-dictionary.txt
Words loaded: 363520, Total Memory used:  127373312.
Loading official-scrabble-1st-edition.txt
Words loaded: 477329, Total Memory used:  129957888.
Loading official-scrabble-2nd-edition-delta.txt
Words loaded: 481489, Total Memory used:  129957888.

[toc] | [prev] | [next] | [standalone]

#14922

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 15:39 -0700
Message-ID	<pecxr.10088$br3.3802@newsfe10.iad>
In reply to	#14920

On 5/29/12 3:23 PM, Daniel Pitts wrote:
> On 5/29/12 2:49 PM, Gene Wirchenko wrote:
>> On Tue, 29 May 2012 14:03:10 -0700 (PDT), Lew<lewbloch@gmail.com>
>> wrote:
>>
>>> Gene Wirchenko wrote:
>>>> Daniel Pitts wrote:
>>>>
>>>> [snip]
>>>>
>>>>> Are you arguing that a modern system can't handle that number of
>>>>> words?
>>>>
>>>> No. I simply stating that the real size of the problem is much
>>>> bigger.
>>>
>>> With no numbers that differ from Daniel's to back up your claim.
>>>
>>> You called my numbers "made up", but it turned out they were
>>> *larger* than the real numbers.
>>>
>>> You cite "a quarter of a million" words. Daniel counted roughly
>>> *150%* of that in his word base.
>>
>> Ah, selective reading.
>>
>> For root forms, it was 1/4 million. With affixes -- and remember
>> that my first question was about them -- the figure was 3/4 million.
>> This is double what Daniel counted, and 3/4 million does not include
>> technical words, etc. Take a look at the *full* paragraph that I
>> quoted, not just the lowest number.
>>
>>> My numbers were generous. Yours are not even significantly different,
>>> and in fact are smaller than his. The numbers do show there is not
>>> much problem, yet somehow you claim with no logic or reasoning or
>>> different data that they do show a problem.
>>
>> Take another look at that paragraph I quoted. Really.
>>
>>> Clearly you are mistaken.
>>>
>>> Daniel showed evidence from experimentation. His numbers jibe with
>>> yours. Without compression, his data occupy roughly 5 MiB of memory.
>>>
>>> Show the problem or withdraw the claim.
>>
>> Read my statement of the problem.
>>
>>>>> A modern desktop has more than enough memory to easily handle a
>>>>> quarter
>>>>> *billion* words, which is a 100 times greater than your guessed
>>>>> upper limit.
>>>>>
>>>>> And that's *without* compression.
>>>>
>>>> Sure. If that is all that it does. My main (and older) desktop
>>>> box has 1.5 GB. I have trouble with not enough memory at times.
>>>> Adding another app might break its back.
>>>
>>> Again, how much damage will< 5 MiB of data do to that system?
>>>
>>> How about 50 MiB? That's *ten times* the number of words you might
>>> need to handle.
>>> Without any compression.
>>
>> Fine. My system is currently using 1547 MB of memory. It only
>> has 1536 MB. With some intensive uses, I have seen the commit go to
>> as high as about 2200 MB. My system crawls then. Using 50 MB more in
>> those circumstances would not be good.
>>
>>> You haven't shown a problem. You just chant, "The sky is falling!"
>>
>> I believe that I have. Handwaving the possibility away does not
>> make it go away.
>>
>> Sincerely,
>>
>> Gene Wirchenko
>
>
> Test program:
>
> package net.virtualinfinity.moby;
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Arrays;
>
> public class WordTree {
> WordNode start = new WordNode(-1);
>
> public void addWord(String word) {
> start.addWord(word);
> }
>
> public static void main(String[] args) throws IOException {
> WordTree tree = new WordTree();
> int wordsLoaded = 0;
> for (String filename: args) {
> System.out.println("Loading " + filename);
> BufferedReader reader = new BufferedReader(new FileReader(filename),
> 1024*256);
> String line;
> while ((line = reader.readLine()) != null) {
> ++wordsLoaded;
> if ((wordsLoaded & 127) == 0) {
> printStatus(wordsLoaded);
> }
> tree.addWord(line.trim());
> }
> System.gc();
> printStatus(wordsLoaded);
> System.out.println();
> }
> }
>
> private static void printStatus(int wordsLoaded) {
> System.out.print("\rWords loaded: " + wordsLoaded + ", Total Memory
> used: " + Runtime.getRuntime().totalMemory() + ". ");
> }
> }
>
>
> class WordNode {
> private boolean terminal;
> private final int value;
> private WordNode[] next;
>
> public WordNode(int ch) {
> value = ch;
> }
>
> public void addWord(String word) {
> if ("".equals(word)) {
> terminal = true;
> return;
> }
> final int ch = word.codePointAt(0);
> getNext(ch).addWord(word.substring(Character.charCount(ch)));
> }
>
> private WordNode getNext(int ch) {
> if (next == null) {
> next = new WordNode[1];
> next[0] = new WordNode(ch);
> }
> for (WordNode node: next) {
> if (node.value == ch) {
> return node;
> }
> }
> next = Arrays.copyOf(next, next.length +1);
> final WordNode newNode = new WordNode(ch);
> next[next.length-1] = newNode;
> return newNode;
> }
> }
>
> ------- Output -----
>
> Loading compound-words.txt
> Words loaded: 256772, Total Memory used: 120909824.
> Loading often-mispelled.txt
> Words loaded: 257138, Total Memory used: 121106432.
> Loading english-most-frequent.txt
> Words loaded: 258141, Total Memory used: 121106432.
> Loading male-names.txt
> Words loaded: 262038, Total Memory used: 121106432.
> Loading female-names.txt
> Words loaded: 266984, Total Memory used: 121106432.
> Loading common-names.txt
> Words loaded: 288970, Total Memory used: 121106432.
> Loading common-dictionary.txt
> Words loaded: 363520, Total Memory used: 127373312.
> Loading official-scrabble-1st-edition.txt
> Words loaded: 477329, Total Memory used: 129957888.
> Loading official-scrabble-2nd-edition-delta.txt
> Words loaded: 481489, Total Memory used: 129957888.

BTW, if I check the memory usage before loading words and after, the 
difference is ~ 42MB

So, loading 481k words takes up about 42MB. This is in java, which has a 
fairly high overhead per string. And the implementation of my data 
structure is also fairly naive as well.

Extrapolating that data to an extreme 2 million words, that would be 
less than 200MB in memory.

My gut feeling beats your gut feeling, and my science proves it true. If 
you are going to reply with a counter argument, please provide a 
reproducible experiment to prove your argument.  Otherwise, this 
conversation is over.

[toc] | [prev] | [next] | [standalone]

#14927

From	Gene Wirchenko <genew@ocis.net>
Date	2012-05-29 16:08 -0700
Message-ID	<balas7lf5g3qkmruqvgrneba63mb8deh11@4ax.com>
In reply to	#14922

On Tue, 29 May 2012 15:39:16 -0700, Daniel Pitts
<newsgroup.nospam@virtualinfinity.net> wrote:

[snip]

>BTW, if I check the memory usage before loading words and after, the 
>difference is ~ 42MB
>
>So, loading 481k words takes up about 42MB. This is in java, which has a 
>fairly high overhead per string. And the implementation of my data 
>structure is also fairly naive as well.

     So about 100 bytes per word.

>Extrapolating that data to an extreme 2 million words, that would be 
>less than 200MB in memory.

     Or about 100 MB for my SWAG of one million words.

>My gut feeling beats your gut feeling, and my science proves it true. If 

     It sure does.  Lew's figures (up to 10 MB) were considerably
lower, and that is what I was objecting to.  They seemed too low.

>you are going to reply with a counter argument, please provide a 
>reproducible experiment to prove your argument.  Otherwise, this 
>conversation is over.

     Actually, I started with a question, namely
          Including all affixes?
and I have been trying to get a reasonable answer to it.  I knew that
your word counts were low compared to others that I have seen and
wanted to know a realistic memory use figure for the *whole* English
language.  You have provided a sufficiently good approximation.  Thank
you.

Sincerely,

Gene Wirchenko

[toc] | [prev] | [next] | [standalone]

#14929

From	Lew <lewbloch@gmail.com>
Date	2012-05-29 18:25 -0700
Message-ID	<81e956b0-cec4-40e8-9980-c041b8d3369d@googlegroups.com>
In reply to	#14927

Gene Wirchenko wrote:
> Daniel Pitts wrote:
> 
> [snip]
> 
>>BTW, if I check the memory usage before loading words and after, the 
>>difference is ~ 42MB
>>
>>So, loading 481k words takes up about 42MB. This is in java, which has a 
>>fairly high overhead per string. And the implementation of my data 
>>structure is also fairly naive as well.

We've been talking apples and oranges. You refer here to the memory 
footprint if the entire word list were in memory as naive Strings. I've been 
talking about the footprint in storage (e.g., the SD card of a smartphone). 

So, Gene, you have been misinterpreting my point.

There is little reason to hold the dictionary in memory at all times if you 
are in a memory-constrained environment. If one does need to hold it
in memory, and one were worried about memory footprint, one would not 
hold it in a memory-intensive, naive structure, in RAM all at once.

Compression exceeding 90% is typical with text. Assuming 75% compression,
a very reasonable and safe estimate, the data would occupy about 10 MB, just 
as I said. That's if you keep it in memory in compressed form.

A disk-based solution would likely occupy only a few KB at a time.

Nice try to distort the issue into a scary monster, though.

-- 
Lew

[toc] | [prev] | [next] | [standalone]

#14888

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 09:16 -0700
Message-ID	<WD6xr.7917$0x2.2975@newsfe13.iad>
In reply to	#14874

On 5/28/12 9:54 PM, Lew wrote:
> On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>> <newsgroup.nospam@virtualinfinity.net> wrote:
>>
>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>>>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>>>> <newsgroup.nospam@virtualinfinity.net> wrote:
>>>>
>>>> [snip]
>>>>
>>>>> I tend to use a Deterministic Finite State Automata for this. You can
>>>>> load the entire English dictionary fairly easily with that scheme.
>>>>> Yes,
>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> you use a bit of memory, but unless you're doing this on an embedded
>>>>> device, its probably not enough memory to be concerned about.
>>>>
>>>> Including all affixes?
>>> I suppose it depends on the particular dictionary, but we're only
>>> talking a few hundred thousand entries, at least with the Moby
>>> word-lists as a base:
>>
>> Considering how many affixes can be applied to some words, I find
>> that very questionable:
>> self:
>> *ish
>> *ishly
>> un*ish
>> un*ishly
>> *less
>> *lessly
>> un*less
>> un*lessly
>> position:
>> *s
>> *ed
>> *al
>> *ally
>> re*
>> re*s
>> re*ed
>> *less
>> mis*
>> *er
>> *ers
>> friend
>> *s
>> *ly
>> *liness
>> be*
>> be*ing
>> be*ed
>> be*er
>> be*ers
>> These are not particularly extreme examples.
>
> It's not a question of how extreme the examples are but how many there are.
>
> Not all words can be legitimately affixed. Many can be affixed by
> algorithm, or by bitmaps as to which affixes apply, so you only store
> the root, the bitmap and perhaps one more form.
>
> I don't know how much memory expansion you think your factors will
> cause, as you only hand wave and say there will be some and act like
> it's a problem, but let's say it doubles the size of the dictionary. By
> Daniel's experiment upthread, that would bring it to around 8 MiB, let's
> round and say 10MiB. Being text and all, that should compress to about 3
> MiB or less.
>
> Now I am interested to hear what sort of trouble you assert that 3 MiB
> or so of storage will cause.
>
Actually, the word lists I used to come up with my figures include those 
inflections of the words that it is used for.

For example "ishly" and "ness" suffixes:

 > grep "ishly\b" * | wc -l
      323

Which includes:
wordishly
womanishly
wolfishly
wishly
winterishly
wildishly
whorishly
whoreishly

Most of these entries aren't even in Thunderbirds spell-check dictionary.

 > grep "ness\b" * | wc -l
    11762

Includes entries such as:
woodlessness
wonderfulness
unmysteriousness
unexperiencedness
undeceptiveness
nonvolatileness


Again, most of these aren't spell-check.

My numbers are accurate, but thanks for questioning them, so I could 
further explore and see for myself.

[toc] | [prev] | [next] | [standalone]

#14892

From	Jeff Higgins <jeff@invalid.invalid>
Date	2012-05-29 13:37 -0400
Message-ID	<jq31hd$gof$1@dont-email.me>
In reply to	#14888

On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>
> Most of these entries aren't even in Thunderbirds spell-check dictionary.
>
How many seven letter words can I construct from the twenty-six letters 
A through Z? How many of these words are defined English words? What 
are/are not the common affixes in this set of English words?

[toc] | [prev] | [next] | [standalone]

#14893

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 10:49 -0700
Message-ID	<008xr.10053$br3.9895@newsfe10.iad>
In reply to	#14892

On 5/29/12 10:37 AM, Jeff Higgins wrote:
> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>
>> Most of these entries aren't even in Thunderbirds spell-check dictionary.
>>
> How many seven letter words can I construct from the twenty-six letters
> A through Z? How many of these words are defined English words? What
> are/are not the common affixes in this set of English words?

How about you download the Moby word list yourself and do these 
analyses? It is a very simple grep for the first question.
egrep "\b[a-zA-Z]{7}\b" *.txt

[toc] | [prev] | [next] | [standalone]

#14894

From	Jeff Higgins <jeff@invalid.invalid>
Date	2012-05-29 13:58 -0400
Message-ID	<jq32o1$ov7$1@dont-email.me>
In reply to	#14893

On 05/29/2012 01:49 PM, Daniel Pitts wrote:
> On 5/29/12 10:37 AM, Jeff Higgins wrote:
>> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>>
>>> Most of these entries aren't even in Thunderbirds spell-check
>>> dictionary.
>>>
>> How many seven letter words can I construct from the twenty-six letters
>> A through Z? How many of these words are defined English words? What
>> are/are not the common affixes in this set of English words?
>
> How about you download the Moby word list yourself and do these
> analyses?
I was asking you.
It is a very simple grep for the first question.
> egrep "\b[a-zA-Z]{7}\b" *.txt
Nope.

[toc] | [prev] | [next] | [standalone]

#14896

From	Jeff Higgins <jeff@invalid.invalid>
Date	2012-05-29 14:20 -0400
Message-ID	<jq3412$2bo$1@dont-email.me>
In reply to	#14894

On 05/29/2012 01:58 PM, Jeff Higgins wrote:
> On 05/29/2012 01:49 PM, Daniel Pitts wrote:
>> On 5/29/12 10:37 AM, Jeff Higgins wrote:
>>> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>>>
>>>> Most of these entries aren't even in Thunderbirds spell-check
>>>> dictionary.
>>>>
>>> How many seven letter words can I construct from the twenty-six letters
>>> A through Z? How many of these words are defined English words? What
>>> are/are not the common affixes in this set of English words?
>>
>> How about you download the Moby word list yourself and do these
>> analyses?
> I was asking you.
> It is a very simple grep for the first question.
>> egrep "\b[a-zA-Z]{7}\b" *.txt
> Nope.
>
Well, I had hoped to continue a discussion but you seem to have run out 
of time, patience, etc.

[toc] | [prev] | [next] | [standalone]

#14897

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-05-29 11:21 -0700
Message-ID	<1t8xr.31456$TC4.5163@newsfe14.iad>
In reply to	#14894

On 5/29/12 10:58 AM, Jeff Higgins wrote:
> On 05/29/2012 01:49 PM, Daniel Pitts wrote:
>> On 5/29/12 10:37 AM, Jeff Higgins wrote:
>>> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>>>
>>>> Most of these entries aren't even in Thunderbirds spell-check
>>>> dictionary.
>>>>
>>> How many seven letter words can I construct from the twenty-six letters
>>> A through Z? How many of these words are defined English words? What
>>> are/are not the common affixes in this set of English words?
>>
>> How about you download the Moby word list yourself and do these
>> analyses?
> I was asking you.
> It is a very simple grep for the first question.
>> egrep "\b[a-zA-Z]{7}\b" *.txt
> Nope.
>
I'll be glad to for a consultant fee. $150 per hour or any partial hour.

[toc] | [prev] | [next] | [standalone]

#14899

From	Jeff Higgins <jeff@invalid.invalid>
Date	2012-05-29 14:29 -0400
Message-ID	<jq34is$63e$1@dont-email.me>
In reply to	#14897

On 05/29/2012 02:21 PM, Daniel Pitts wrote:
> On 5/29/12 10:58 AM, Jeff Higgins wrote:
>> On 05/29/2012 01:49 PM, Daniel Pitts wrote:
>>> On 5/29/12 10:37 AM, Jeff Higgins wrote:
>>>> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>>>>
>>>>> Most of these entries aren't even in Thunderbirds spell-check
>>>>> dictionary.
>>>>>
>>>> How many seven letter words can I construct from the twenty-six letters
>>>> A through Z? How many of these words are defined English words? What
>>>> are/are not the common affixes in this set of English words?
>>>
>>> How about you download the Moby word list yourself and do these
>>> analyses?
>> I was asking you.
>> It is a very simple grep for the first question.
>>> egrep "\b[a-zA-Z]{7}\b" *.txt
>> Nope.
>>
> I'll be glad to for a consultant fee. $150 per hour or any partial hour.

bad breath

[toc] | [prev] | [next] | [standalone]

#14900

From	Jeff Higgins <jeff@invalid.invalid>
Date	2012-05-29 15:00 -0400
Message-ID	<jq36c3$h39$1@dont-email.me>
In reply to	#14899

On 05/29/2012 02:29 PM, Jeff Higgins wrote:
> On 05/29/2012 02:21 PM, Daniel Pitts wrote:
>> On 5/29/12 10:58 AM, Jeff Higgins wrote:
>>> On 05/29/2012 01:49 PM, Daniel Pitts wrote:
>>>> On 5/29/12 10:37 AM, Jeff Higgins wrote:
>>>>> On 05/29/2012 12:16 PM, Daniel Pitts wrote:
>>>>>>
>>>>>> Most of these entries aren't even in Thunderbirds spell-check
>>>>>> dictionary.
>>>>>>
>>>>> How many seven letter words can I construct from the twenty-six
>>>>> letters
>>>>> A through Z? How many of these words are defined English words? What
>>>>> are/are not the common affixes in this set of English words?
>>>>
>>>> How about you download the Moby word list yourself and do these
>>>> analyses?
>>> I was asking you.
>>> It is a very simple grep for the first question.
>>>> egrep "\b[a-zA-Z]{7}\b" *.txt
>>> Nope.
>>>
>> I'll be glad to for a consultant fee. $150 per hour or any partial hour.
>
> bad breath
Note to self: Apply sentence commpression to posts from cljp.

[toc] | [prev] | [next] | [standalone]

#14882

From	Jeff Higgins <jeff@invalid.invalid>
Date	2012-05-29 09:24 -0400
Message-ID	<jq2imb$f3h$1@dont-email.me>
In reply to	#14785

On 05/24/2012 07:07 PM, markspace wrote:
> For some reason I was thinking about sub-string searches today.  I
> looked up Patricia tries (a kind of radix tree) to see if they would
> help. While interesting, the radix tree seems to have a lot of overhead
> for large numbers of entries.
>
> The radix tree uses a bucket at each level to hold all children (and
> there could be quite a lot of children). Each child if present requires
> a pointer (an object in Java) to hold it. For the example given, this
> could be as much as one object per character in each string, plus the
> bucket to hold it and its siblings. If the number strings is very large,
> this could really result in an explosion of memory usage.
>
> <http://en.wikipedia.org/wiki/Radix_tree>
>
> So is there a better way for large data sets?
>
Different. Better?
<http://www.pathcom.com/~vadco/cwg.html>

[toc] | [prev] | [standalone]

Page 2 of 2 — ← Prev page 1 [2]

csiph-web

Patricia trie vs binary search.

Contents

#14920

#14922

#14927

#14929

#14888

#14892

#14893

#14894

#14896

#14897

#14899

#14900

#14882