Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #14922
| Path | csiph.com!newsfeed.hal-mli.net!feeder3.hal-mli.net!newsfeed.hal-mli.net!feeder1.hal-mli.net!npeer01.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!post01.iad.highwinds-media.com!newsfe10.iad.POSTED!83aa503d!not-for-mail |
|---|---|
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
| User-Agent | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:12.0) Gecko/20120428 Thunderbird/12.0.1 |
| MIME-Version | 1.0 |
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: Patricia trie vs binary search. |
| References | <jpmev4$63l$1@dont-email.me> <uAewr.45671$On2.38133@newsfe16.iad> <s0m5s7t5gt33plp5m6cn07g4ovehvedfb7@4ax.com> <ADDwr.2826$9Q6.1871@newsfe18.iad> <hr87s756sdaku47lh8sebnegsmskclu637@4ax.com> <jq1kpt$1ee$1@news.albasani.net> <7ss9s7tu1a8v9qbkc3qnhfg8rhck3ov4bi@4ax.com> <Gb7xr.31449$TC4.15127@newsfe14.iad> <pg4as7lvhhlsinnmhncvrne8vflek6agb7@4ax.com> <5f02c9ad-a414-41dd-8ac9-723d7a8651e5@googlegroups.com> <gpgas71pk57o0s68ie0qf1pgn9dg5hbkpb@4ax.com> <X%bxr.36808$6Y6.35155@newsfe19.iad> |
| In-Reply-To | <X%bxr.36808$6Y6.35155@newsfe19.iad> |
| Content-Type | text/plain; charset=ISO-8859-1; format=flowed |
| Content-Transfer-Encoding | 7bit |
| Lines | 194 |
| Message-ID | <pecxr.10088$br3.3802@newsfe10.iad> (permalink) |
| X-Complaints-To | abuse@newsrazor.net |
| NNTP-Posting-Date | Tue, 29 May 2012 22:39:17 UTC |
| Date | Tue, 29 May 2012 15:39:16 -0700 |
| X-Received-Bytes | 7276 |
| Xref | csiph.com comp.lang.java.programmer:14922 |
Show key headers only | View raw
On 5/29/12 3:23 PM, Daniel Pitts wrote:
> On 5/29/12 2:49 PM, Gene Wirchenko wrote:
>> On Tue, 29 May 2012 14:03:10 -0700 (PDT), Lew<lewbloch@gmail.com>
>> wrote:
>>
>>> Gene Wirchenko wrote:
>>>> Daniel Pitts wrote:
>>>>
>>>> [snip]
>>>>
>>>>> Are you arguing that a modern system can't handle that number of
>>>>> words?
>>>>
>>>> No. I simply stating that the real size of the problem is much
>>>> bigger.
>>>
>>> With no numbers that differ from Daniel's to back up your claim.
>>>
>>> You called my numbers "made up", but it turned out they were
>>> *larger* than the real numbers.
>>>
>>> You cite "a quarter of a million" words. Daniel counted roughly
>>> *150%* of that in his word base.
>>
>> Ah, selective reading.
>>
>> For root forms, it was 1/4 million. With affixes -- and remember
>> that my first question was about them -- the figure was 3/4 million.
>> This is double what Daniel counted, and 3/4 million does not include
>> technical words, etc. Take a look at the *full* paragraph that I
>> quoted, not just the lowest number.
>>
>>> My numbers were generous. Yours are not even significantly different,
>>> and in fact are smaller than his. The numbers do show there is not
>>> much problem, yet somehow you claim with no logic or reasoning or
>>> different data that they do show a problem.
>>
>> Take another look at that paragraph I quoted. Really.
>>
>>> Clearly you are mistaken.
>>>
>>> Daniel showed evidence from experimentation. His numbers jibe with
>>> yours. Without compression, his data occupy roughly 5 MiB of memory.
>>>
>>> Show the problem or withdraw the claim.
>>
>> Read my statement of the problem.
>>
>>>>> A modern desktop has more than enough memory to easily handle a
>>>>> quarter
>>>>> *billion* words, which is a 100 times greater than your guessed
>>>>> upper limit.
>>>>>
>>>>> And that's *without* compression.
>>>>
>>>> Sure. If that is all that it does. My main (and older) desktop
>>>> box has 1.5 GB. I have trouble with not enough memory at times.
>>>> Adding another app might break its back.
>>>
>>> Again, how much damage will< 5 MiB of data do to that system?
>>>
>>> How about 50 MiB? That's *ten times* the number of words you might
>>> need to handle.
>>> Without any compression.
>>
>> Fine. My system is currently using 1547 MB of memory. It only
>> has 1536 MB. With some intensive uses, I have seen the commit go to
>> as high as about 2200 MB. My system crawls then. Using 50 MB more in
>> those circumstances would not be good.
>>
>>> You haven't shown a problem. You just chant, "The sky is falling!"
>>
>> I believe that I have. Handwaving the possibility away does not
>> make it go away.
>>
>> Sincerely,
>>
>> Gene Wirchenko
>
>
> Test program:
>
> package net.virtualinfinity.moby;
>
> import java.io.BufferedReader;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Arrays;
>
> public class WordTree {
> WordNode start = new WordNode(-1);
>
> public void addWord(String word) {
> start.addWord(word);
> }
>
> public static void main(String[] args) throws IOException {
> WordTree tree = new WordTree();
> int wordsLoaded = 0;
> for (String filename: args) {
> System.out.println("Loading " + filename);
> BufferedReader reader = new BufferedReader(new FileReader(filename),
> 1024*256);
> String line;
> while ((line = reader.readLine()) != null) {
> ++wordsLoaded;
> if ((wordsLoaded & 127) == 0) {
> printStatus(wordsLoaded);
> }
> tree.addWord(line.trim());
> }
> System.gc();
> printStatus(wordsLoaded);
> System.out.println();
> }
> }
>
> private static void printStatus(int wordsLoaded) {
> System.out.print("\rWords loaded: " + wordsLoaded + ", Total Memory
> used: " + Runtime.getRuntime().totalMemory() + ". ");
> }
> }
>
>
> class WordNode {
> private boolean terminal;
> private final int value;
> private WordNode[] next;
>
> public WordNode(int ch) {
> value = ch;
> }
>
> public void addWord(String word) {
> if ("".equals(word)) {
> terminal = true;
> return;
> }
> final int ch = word.codePointAt(0);
> getNext(ch).addWord(word.substring(Character.charCount(ch)));
> }
>
> private WordNode getNext(int ch) {
> if (next == null) {
> next = new WordNode[1];
> next[0] = new WordNode(ch);
> }
> for (WordNode node: next) {
> if (node.value == ch) {
> return node;
> }
> }
> next = Arrays.copyOf(next, next.length +1);
> final WordNode newNode = new WordNode(ch);
> next[next.length-1] = newNode;
> return newNode;
> }
> }
>
> ------- Output -----
>
> Loading compound-words.txt
> Words loaded: 256772, Total Memory used: 120909824.
> Loading often-mispelled.txt
> Words loaded: 257138, Total Memory used: 121106432.
> Loading english-most-frequent.txt
> Words loaded: 258141, Total Memory used: 121106432.
> Loading male-names.txt
> Words loaded: 262038, Total Memory used: 121106432.
> Loading female-names.txt
> Words loaded: 266984, Total Memory used: 121106432.
> Loading common-names.txt
> Words loaded: 288970, Total Memory used: 121106432.
> Loading common-dictionary.txt
> Words loaded: 363520, Total Memory used: 127373312.
> Loading official-scrabble-1st-edition.txt
> Words loaded: 477329, Total Memory used: 129957888.
> Loading official-scrabble-2nd-edition-delta.txt
> Words loaded: 481489, Total Memory used: 129957888.
BTW, if I check the memory usage before loading words and after, the
difference is ~ 42MB
So, loading 481k words takes up about 42MB. This is in java, which has a
fairly high overhead per string. And the implementation of my data
structure is also fairly naive as well.
Extrapolating that data to an extreme 2 million words, that would be
less than 200MB in memory.
My gut feeling beats your gut feeling, and my science proves it true. If
you are going to reply with a counter argument, please provide a
reproducible experiment to prove your argument. Otherwise, this
conversation is over.
Back to comp.lang.java.programmer | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Patricia trie vs binary search. markspace <-@.> - 2012-05-24 16:07 -0700
Re: Patricia trie vs binary search. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2012-05-24 23:39 +0000
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-24 17:56 -0700
Password quality (Was: Patricia trie vs binary search.) Lew <lewbloch@gmail.com> - 2012-05-25 09:41 -0700
Re: Password quality (Was: Patricia trie vs binary search.) markspace <-@.> - 2012-05-25 12:17 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-26 17:30 -0700
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-26 18:17 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-27 18:44 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-27 22:00 -0700
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 08:20 -0700
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 14:38 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-28 09:20 -0700
Re: Patricia trie vs binary search. Lew <noone@lewscanon.com> - 2012-05-28 21:54 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 09:14 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:55 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 11:17 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:22 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:44 -0700
Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 14:03 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:49 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:23 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:39 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 16:08 -0700
Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 18:25 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:16 -0700
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:37 -0400
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 10:49 -0700
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:58 -0400
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:20 -0400
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:21 -0700
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:29 -0400
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 15:00 -0400
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 09:24 -0400
csiph-web