Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.java.programmer > #14888
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Newsgroups | comp.lang.java.programmer |
| Subject | Re: Patricia trie vs binary search. |
| References | (1 earlier) <uAewr.45671$On2.38133@newsfe16.iad> <s0m5s7t5gt33plp5m6cn07g4ovehvedfb7@4ax.com> <ADDwr.2826$9Q6.1871@newsfe18.iad> <hr87s756sdaku47lh8sebnegsmskclu637@4ax.com> <jq1kpt$1ee$1@news.albasani.net> |
| Message-ID | <WD6xr.7917$0x2.2975@newsfe13.iad> (permalink) |
| Date | 2012-05-29 09:16 -0700 |
On 5/28/12 9:54 PM, Lew wrote:
> On 05/28/2012 09:20 AM, Gene Wirchenko wrote:
>> On Sun, 27 May 2012 22:00:14 -0700, Daniel Pitts
>> <newsgroup.nospam@virtualinfinity.net> wrote:
>>
>>> On 5/27/12 6:44 PM, Gene Wirchenko wrote:
>>>> On Sat, 26 May 2012 17:30:17 -0700, Daniel Pitts
>>>> <newsgroup.nospam@virtualinfinity.net> wrote:
>>>>
>>>> [snip]
>>>>
>>>>> I tend to use a Deterministic Finite State Automata for this. You can
>>>>> load the entire English dictionary fairly easily with that scheme.
>>>>> Yes,
>>>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>>>> you use a bit of memory, but unless you're doing this on an embedded
>>>>> device, its probably not enough memory to be concerned about.
>>>>
>>>> Including all affixes?
>>> I suppose it depends on the particular dictionary, but we're only
>>> talking a few hundred thousand entries, at least with the Moby
>>> word-lists as a base:
>>
>> Considering how many affixes can be applied to some words, I find
>> that very questionable:
>> self:
>> *ish
>> *ishly
>> un*ish
>> un*ishly
>> *less
>> *lessly
>> un*less
>> un*lessly
>> position:
>> *s
>> *ed
>> *al
>> *ally
>> re*
>> re*s
>> re*ed
>> *less
>> mis*
>> *er
>> *ers
>> friend
>> *s
>> *ly
>> *liness
>> be*
>> be*ing
>> be*ed
>> be*er
>> be*ers
>> These are not particularly extreme examples.
>
> It's not a question of how extreme the examples are but how many there are.
>
> Not all words can be legitimately affixed. Many can be affixed by
> algorithm, or by bitmaps as to which affixes apply, so you only store
> the root, the bitmap and perhaps one more form.
>
> I don't know how much memory expansion you think your factors will
> cause, as you only hand wave and say there will be some and act like
> it's a problem, but let's say it doubles the size of the dictionary. By
> Daniel's experiment upthread, that would bring it to around 8 MiB, let's
> round and say 10MiB. Being text and all, that should compress to about 3
> MiB or less.
>
> Now I am interested to hear what sort of trouble you assert that 3 MiB
> or so of storage will cause.
>
Actually, the word lists I used to come up with my figures include those
inflections of the words that it is used for.
For example "ishly" and "ness" suffixes:
> grep "ishly\b" * | wc -l
323
Which includes:
wordishly
womanishly
wolfishly
wishly
winterishly
wildishly
whorishly
whoreishly
Most of these entries aren't even in Thunderbirds spell-check dictionary.
> grep "ness\b" * | wc -l
11762
Includes entries such as:
woodlessness
wonderfulness
unmysteriousness
unexperiencedness
undeceptiveness
nonvolatileness
Again, most of these aren't spell-check.
My numbers are accurate, but thanks for questioning them, so I could
further explore and see for myself.
Back to comp.lang.java.programmer | Previous | Next — Previous in thread | Next in thread | Find similar | Unroll thread
Patricia trie vs binary search. markspace <-@.> - 2012-05-24 16:07 -0700
Re: Patricia trie vs binary search. glen herrmannsfeldt <gah@ugcs.caltech.edu> - 2012-05-24 23:39 +0000
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-24 17:56 -0700
Password quality (Was: Patricia trie vs binary search.) Lew <lewbloch@gmail.com> - 2012-05-25 09:41 -0700
Re: Password quality (Was: Patricia trie vs binary search.) markspace <-@.> - 2012-05-25 12:17 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-26 17:30 -0700
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-26 18:17 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-27 18:44 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-27 22:00 -0700
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 08:20 -0700
Re: Patricia trie vs binary search. markspace <-@.> - 2012-05-28 14:38 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-28 09:20 -0700
Re: Patricia trie vs binary search. Lew <noone@lewscanon.com> - 2012-05-28 21:54 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 09:14 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:55 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 11:17 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:22 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:44 -0700
Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 14:03 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 14:49 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:23 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 15:39 -0700
Re: Patricia trie vs binary search. Gene Wirchenko <genew@ocis.net> - 2012-05-29 16:08 -0700
Re: Patricia trie vs binary search. Lew <lewbloch@gmail.com> - 2012-05-29 18:25 -0700
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 09:16 -0700
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:37 -0400
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 10:49 -0700
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 13:58 -0400
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:20 -0400
Re: Patricia trie vs binary search. Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-05-29 11:21 -0700
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 14:29 -0400
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 15:00 -0400
Re: Patricia trie vs binary search. Jeff Higgins <jeff@invalid.invalid> - 2012-05-29 09:24 -0400
csiph-web