Groups > comp.lang.python > #61553 > unrolled thread

Re: grab dict keys/values without iterating ?!

Started by	Tamer Higazi <tameritoke2@arcor.de>
First post	2013-12-11 12:07 +0200
Last post	2013-12-11 18:05 -0700
Articles	11 — 8 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: grab dict keys/values without iterating ?! Tamer Higazi <tameritoke2@arcor.de> - 2013-12-11 12:07 +0200
    Re: grab dict keys/values without iterating ?! rusi <rustompmody@gmail.com> - 2013-12-11 05:31 -0800
      Re: grab dict keys/values without iterating ?! Roy Smith <roy@panix.com> - 2013-12-11 09:46 -0500
        Re: grab dict keys/values without iterating ?! rusi <rustompmody@gmail.com> - 2013-12-11 07:08 -0800
        Re: grab dict keys/values without iterating ?! Tim Chase <python.list@tim.thechases.com> - 2013-12-11 09:42 -0600
      Re: grab dict keys/values without iterating ?! Travis Griggs <travisgriggs@gmail.com> - 2013-12-11 09:19 -0800
      Re: grab dict keys/values without iterating ?! Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-11 18:09 +0000
    Re: grab dict keys/values without iterating ?! Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-11 13:44 +0000
      Re: grab dict keys/values without iterating ?! Tim Chase <python.list@tim.thechases.com> - 2013-12-11 08:30 -0600
      Re: grab dict keys/values without iterating ?! Ian Kelly <ian.g.kelly@gmail.com> - 2013-12-11 18:02 -0700
      Re: grab dict keys/values without iterating ?! Ian Kelly <ian.g.kelly@gmail.com> - 2013-12-11 18:05 -0700

#61553 — Re: grab dict keys/values without iterating ?!

From	Tamer Higazi <tameritoke2@arcor.de>
Date	2013-12-11 12:07 +0200
Subject	Re: grab dict keys/values without iterating ?!
Message-ID	<mailman.3885.1386762407.18130.python-list@python.org>

Hi Dave!

You were absolutely right.
I don't want to iterate the entire dict to get me the key/values

Let us say this dict would have 20.000 entries, but I want only those 
with "Aa" to be grabed.
Those starting with these 2 letters would be only 5 or 6 then it would 
take a lot of time.

In which way would you prefer to store the data, and which functions or 
methods would you use effectively to accomplish this task ?

I deeply apologize of not defining the question more defined. English is 
not my mother tongue.
I'll do my best next time.

Thanks

Tamer

On 11.12.2013 06:47, Dave Angel wrote:
> On Wed, 11 Dec 2013 02:02:20 +0200, Tamer Higazi 
> <tameritoke2@arcor.de> wrote:
>> Is there a way to get dict by search terms without iterating the 
> entire
>> dictionary ?!
>
>> I want to grab the dict's key and values started with 'Ar'...
>
> Your wording is so ambiguous that each respondent has guessed 
> differently.
> I'm guessing that you want all key/value pairs for which the key 
> begins with the two letters 'Ar' I'm guessing further that your 
> objection to iterating the entire dictionary is not code size but 
> performance.
> If both assumptions are valid then I'll point out that a dict has no 
> ordering to it. If you want an approach that doesn't iterate over the 
> entire structure you'll need to store the data differently.  For 
> example if you stored all the keys in a sorted list you could use bisect.
>

[toc] | [next] | [standalone]

#61570

From	rusi <rustompmody@gmail.com>
Date	2013-12-11 05:31 -0800
Message-ID	<3efc283f-419d-41b6-ad20-c2901c3b9f78@googlegroups.com>
In reply to	#61553

Reordering to un-top-post.

> On 11.12.2013 06:47, Dave Angel wrote:
> > On Wed, 11 Dec 2013 02:02:20 +0200, Tamer Higazi  wrote:
> >> Is there a way to get dict by search terms without iterating the 
> > entire
> >> dictionary ?!
> >> I want to grab the dict's key and values started with 'Ar'...
> > Your wording is so ambiguous that each respondent has guessed 
> > differently.
> > I'm guessing that you want all key/value pairs for which the key 
> > begins with the two letters 'Ar' I'm guessing further that your 
> > objection to iterating the entire dictionary is not code size but 
> > performance.
> > If both assumptions are valid then I'll point out that a dict has no 
> > ordering to it. If you want an approach that doesn't iterate over the 
> > entire structure you'll need to store the data differently.  For 
> > example if you stored all the keys in a sorted list you could use bisect.



On Wednesday, December 11, 2013 3:37:08 PM UTC+5:30, Tamer Higazi wrote:
> Hi Dave!

> You were absolutely right.
> I don't want to iterate the entire dict to get me the key/values

> Let us say this dict would have 20.000 entries, but I want only those 
> with "Aa" to be grabed.
> Those starting with these 2 letters would be only 5 or 6 then it would 
> take a lot of time.

> In which way would you prefer to store the data, and which functions or 
> methods would you use effectively to accomplish this task ?


The classic data structure for this is the trie:
General idea: http://en.wikipedia.org/wiki/Trie
In python:
http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/


> I deeply apologize of not defining the question more defined. English is 
> not my mother tongue.
> I'll do my best next time.

English no issue.

But better not to top-post
http://en.wikipedia.org/wiki/Posting_style#Top-posting

[toc] | [prev] | [next] | [standalone]

#61579

From	Roy Smith <roy@panix.com>
Date	2013-12-11 09:46 -0500
Message-ID	<roy-DA38A7.09461211122013@news.panix.com>
In reply to	#61570

In article <3efc283f-419d-41b6-ad20-c2901c3b9f78@googlegroups.com>,
 rusi <rustompmody@gmail.com> wrote:

> The classic data structure for this is the trie:
> General idea: http://en.wikipedia.org/wiki/Trie
> In python:
> http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/

I agree that a trie fits the problem description well.  If I were coding 
this up in C or C++, that's probably the data structure I'd use, with 
each node containing an array of pointers to nodes, and the array 
indexed by the ascii value of the next character (this doesn't translate 
well to unicode).  Lookup speed would be awesomely fast.

The problem is, that doesn't make a whole lot of sense in Python.  The 
cited implementation uses dicts at each level.  By the time you've done 
that, you might as well just throw all the data into one big dict and 
use the full search string as the key.  It would be a lot less code, and 
probably run faster.

Still, as an exercise in learning about a useful (and under-appreciated) 
data structure, implementing this as a trie sounds like a wonderful idea.

[toc] | [prev] | [next] | [standalone]

#61582

From	rusi <rustompmody@gmail.com>
Date	2013-12-11 07:08 -0800
Message-ID	<a9a20a2f-a0e5-477c-b999-8fc1118d42d7@googlegroups.com>
In reply to	#61579

On Wednesday, December 11, 2013 8:16:12 PM UTC+5:30, Roy Smith wrote:
>  rusi  wrote:

> > The classic data structure for this is the trie:
> > General idea: http://en.wikipedia.org/wiki/Trie
> > In python:
> > http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/

> I agree that a trie fits the problem description well.  If I were coding 
> this up in C or C++, that's probably the data structure I'd use, with 
> each node containing an array of pointers to nodes, and the array 
> indexed by the ascii value of the next character (this doesn't translate 
> well to unicode).  Lookup speed would be awesomely fast.

> The problem is, that doesn't make a whole lot of sense in Python.  The 
> cited implementation uses dicts at each level.  By the time you've done 
> that, you might as well just throw all the data into one big dict and 
> use the full search string as the key.  It would be a lot less code, and 
> probably run faster.

> Still, as an exercise in learning about a useful (and under-appreciated) 
> data structure, implementing this as a trie sounds like a wonderful idea.

I was going to say that Steven's

data = {}
for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    data[c] = {} 

is really the first step of the trie approach

except that instead of 

key = "Aardvark"
data[key[0]][key] = "some value"

he would need 
data[key[0]][key[1:] = "some value"

The catch is keys of length 1 eg "A"
Which makes the fully general (recursively unfolded) version easier
to write, though with all those zillions of dicts probably not very efficient.

Finitizing the trie into fixed small prefixes with leaves containing standard 
dicts looks sensible (to me without and testing of either efficiency or 
readability!)

The short prefix problem however remains

Of course there are other tricks possible:
If we know that the data is case-insensitive alphabetic ASCII* we can just 
normalize the data with 

def norm(s): return [ord(c.upper()) -ord('A') for c in s]

then
>>> norm("aBcD")
[0, 1, 2, 3]

and instead of dicts we can use 26-length lists

* O me O my! Ive used the terrible A-word!
Now RUN before the trolls get us!

[toc] | [prev] | [next] | [standalone]

#61585

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-12-11 09:42 -0600
Message-ID	<mailman.3903.1386776501.18130.python-list@python.org>
In reply to	#61579

On 2013-12-11 09:46, Roy Smith wrote:
> The problem is, that doesn't make a whole lot of sense in Python.
> The cited implementation uses dicts at each level.  By the time
> you've done that, you might as well just throw all the data into
> one big dict and use the full search string as the key.  It would
> be a lot less code, and probably run faster.

You're right if the search term is a whole word, a single
dict-to-result works ideally. However, the OP asked about prefixes, so
the Python implementation I provided uses a dict-of-nested-dicts which
allows any arbitrary prefix, and then iterates over only the subset of
those that match.

If you need O(length-of-prefix) iteration of all results, I believe
that's the best way algorithm to use (implementation details could
differ for a more space-efficient structure perhaps; normalization
might help reduce dict-entries).

It's a specialized use-case, and doesn't have the O(1) lookup for
exact-matches (that's just a special case of prefix=word with no
sub-iteration, so would be O(length-of-search-word)).	

-tkc

[toc] | [prev] | [next] | [standalone]

#61601

From	Travis Griggs <travisgriggs@gmail.com>
Date	2013-12-11 09:19 -0800
Message-ID	<mailman.3916.1386782404.18130.python-list@python.org>
In reply to	#61570

On Dec 11, 2013, at 5:31 AM, rusi <rustompmody@gmail.com> wrote:

> 
> The classic data structure for this is the trie:
> General idea: http://en.wikipedia.org/wiki/Trie
> In python:
> http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/

My thoughts exactly!

If you wade through the comments there, someone has done a more-than-naive implementation here:

https://github.com/kmike/marisa-trie

The write up makes it look pretty favorable as well for performance (scroll 2/3s down to the Benchmarks section).

[toc] | [prev] | [next] | [standalone]

#61602

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2013-12-11 18:09 +0000
Message-ID	<mailman.3917.1386785364.18130.python-list@python.org>
In reply to	#61570

On 11/12/2013 17:19, Travis Griggs wrote:
>
> On Dec 11, 2013, at 5:31 AM, rusi <rustompmody@gmail.com> wrote:
>
>>
>> The classic data structure for this is the trie:
>> General idea: http://en.wikipedia.org/wiki/Trie
>> In python:
>> http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/
>
> My thoughts exactly!
>
> If you wade through the comments there, someone has done a more-than-naive implementation here:
>
> https://github.com/kmike/marisa-trie
>
> The write up makes it look pretty favorable as well for performance (scroll 2/3s down to the Benchmarks section).
>

http://kmike.ru/python-data-structures/ from the author of the above is 
well worth a read.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

[toc] | [prev] | [next] | [standalone]

#61572

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-12-11 13:44 +0000
Message-ID	<52a86c57$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to	#61553

On Wed, 11 Dec 2013 12:07:08 +0200, Tamer Higazi wrote:

> Hi Dave!
> 
> You were absolutely right.
> I don't want to iterate the entire dict to get me the key/values
> 
> Let us say this dict would have 20.000 entries, but I want only those
> with "Aa" to be grabed.
> Those starting with these 2 letters would be only 5 or 6 then it would
> take a lot of time.

What do you mean by "a lot of time"?

Here is a small test. I set up a dict with 456976 keys, and then iterate 
over them in just over a quarter of a second on my (old, slow) computer. 
Here is the code I use:

data = {}
letters = "abcdefghijklmnopqrstuvwxyz"
for a in letters.upper():
    for b in letters:
        for c in letters:
            for d in letters:
                key = a + b + c + d
                data[key] = None

print(len(data))

count = 0
with Timer():
    for key in data:
        if key.startswith("Aa"):
            count += 1

print("Found %d keys starting with 'Aa'")

The Timer() function is not standard to Python, but you can find it here:

http://code.activestate.com/recipes/577896

Are you sure that just using a normal dict will be too slow?

> In which way would you prefer to store the data, and which functions or
> methods would you use effectively to accomplish this task ?

I would use a dict, and iterate over the keys, until such time that I new 
that iterating was the bottle-neck causing my code to be too slow. Until 
I knew that absolutely for sure, I would not optimize.

If necessary, I would consider having 26 dicts, one for each initial 
letter:

data = {}
for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
    data[c] = {}

then store keys in the particular dict. That way, if I wanted keys 
starting with Aa, I would only search the A dict, not the B dict, C dict, 
etc.

key = "Aardvark"
data[key[0]][key] = "some value"

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#61577

From	Tim Chase <python.list@tim.thechases.com>
Date	2013-12-11 08:30 -0600
Message-ID	<mailman.3900.1386772178.18130.python-list@python.org>
In reply to	#61572

On 2013-12-11 13:44, Steven D'Aprano wrote:
> If necessary, I would consider having 26 dicts, one for each
> initial letter:
> 
> data = {}
> for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
>     data[c] = {}
> 
> then store keys in the particular dict. That way, if I wanted keys 
> starting with Aa, I would only search the A dict, not the B dict, C
> dict, etc.

That's what the convoluted code does that I put at the end of my
previous post in this thread, only to the Nth degree (the outermost
dict has the first letter which links to a dictionary of the 2nd
level/letter, to the 3rd level/letter, etc).

-tkc

[toc] | [prev] | [next] | [standalone]

#61635

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2013-12-11 18:02 -0700
Message-ID	<mailman.3944.1386810211.18130.python-list@python.org>
In reply to	#61572

On Wed, Dec 11, 2013 at 7:30 AM, Tim Chase
<python.list@tim.thechases.com> wrote:
> On 2013-12-11 13:44, Steven D'Aprano wrote:
>> If necessary, I would consider having 26 dicts, one for each
>> initial letter:
>>
>> data = {}
>> for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
>>     data[c] = {}
>>
>> then store keys in the particular dict. That way, if I wanted keys
>> starting with Aa, I would only search the A dict, not the B dict, C
>> dict, etc.
>
> That's what the convoluted code does that I put at the end of my
> previous post in this thread, only to the Nth degree (the outermost
> dict has the first letter which links to a dictionary of the 2nd
> level/letter, to the 3rd level/letter, etc).

This is what I did not so long ago when writing a utility for
typeahead lookup, except that to save some space and time I only
nested the dicts as deeply as there were still multiple entries.  As
an example of what the data structure looked like:

lookups = {
    'a': {
        'l': {
            'g': 'algebra',
            'p': 'alphanumeric',
        },
        's': 'asterisk',
    },
    'b': 'bobcat',
    ...
}

It does make the update process more complicated though, as adding new
words potentially requires existing words to be nested deeper than
they are currently.

[toc] | [prev] | [next] | [standalone]

#61637

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2013-12-11 18:05 -0700
Message-ID	<mailman.3946.1386810804.18130.python-list@python.org>
In reply to	#61572

On Wed, Dec 11, 2013 at 6:02 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> This is what I did not so long ago when writing a utility for
> typeahead lookup, except that to save some space and time I only
> nested the dicts as deeply as there were still multiple entries.  As
> an example of what the data structure looked like:
>
> lookups = {
>     'a': {
>         'l': {
>             'g': 'algebra',
>             'p': 'alphanumeric',
>         },
>         's': 'asterisk',
>     },
>     'b': 'bobcat',
>     ...
> }
>
> It does make the update process more complicated though, as adding new
> words potentially requires existing words to be nested deeper than
> they are currently.

And I'm simplifying that a bit, because I also included at each node
the preferred (in my case, the first alphabetically) completion for
that prefix, to avoid the need to iterate over the subtree.

[toc] | [prev] | [standalone]

csiph-web

Re: grab dict keys/values without iterating ?!

Contents

#61553 — Re: grab dict keys/values without iterating ?!

#61570

#61579

#61582

#61585

#61601

#61602

#61572

#61577

#61635

#61637