Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #61553 > unrolled thread
| Started by | Tamer Higazi <tameritoke2@arcor.de> |
|---|---|
| First post | 2013-12-11 12:07 +0200 |
| Last post | 2013-12-11 18:05 -0700 |
| Articles | 11 — 8 participants |
Back to article view | Back to comp.lang.python
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: grab dict keys/values without iterating ?! Tamer Higazi <tameritoke2@arcor.de> - 2013-12-11 12:07 +0200
Re: grab dict keys/values without iterating ?! rusi <rustompmody@gmail.com> - 2013-12-11 05:31 -0800
Re: grab dict keys/values without iterating ?! Roy Smith <roy@panix.com> - 2013-12-11 09:46 -0500
Re: grab dict keys/values without iterating ?! rusi <rustompmody@gmail.com> - 2013-12-11 07:08 -0800
Re: grab dict keys/values without iterating ?! Tim Chase <python.list@tim.thechases.com> - 2013-12-11 09:42 -0600
Re: grab dict keys/values without iterating ?! Travis Griggs <travisgriggs@gmail.com> - 2013-12-11 09:19 -0800
Re: grab dict keys/values without iterating ?! Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-12-11 18:09 +0000
Re: grab dict keys/values without iterating ?! Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-11 13:44 +0000
Re: grab dict keys/values without iterating ?! Tim Chase <python.list@tim.thechases.com> - 2013-12-11 08:30 -0600
Re: grab dict keys/values without iterating ?! Ian Kelly <ian.g.kelly@gmail.com> - 2013-12-11 18:02 -0700
Re: grab dict keys/values without iterating ?! Ian Kelly <ian.g.kelly@gmail.com> - 2013-12-11 18:05 -0700
| From | Tamer Higazi <tameritoke2@arcor.de> |
|---|---|
| Date | 2013-12-11 12:07 +0200 |
| Subject | Re: grab dict keys/values without iterating ?! |
| Message-ID | <mailman.3885.1386762407.18130.python-list@python.org> |
Hi Dave! You were absolutely right. I don't want to iterate the entire dict to get me the key/values Let us say this dict would have 20.000 entries, but I want only those with "Aa" to be grabed. Those starting with these 2 letters would be only 5 or 6 then it would take a lot of time. In which way would you prefer to store the data, and which functions or methods would you use effectively to accomplish this task ? I deeply apologize of not defining the question more defined. English is not my mother tongue. I'll do my best next time. Thanks Tamer On 11.12.2013 06:47, Dave Angel wrote: > On Wed, 11 Dec 2013 02:02:20 +0200, Tamer Higazi > <tameritoke2@arcor.de> wrote: >> Is there a way to get dict by search terms without iterating the > entire >> dictionary ?! > >> I want to grab the dict's key and values started with 'Ar'... > > Your wording is so ambiguous that each respondent has guessed > differently. > I'm guessing that you want all key/value pairs for which the key > begins with the two letters 'Ar' I'm guessing further that your > objection to iterating the entire dictionary is not code size but > performance. > If both assumptions are valid then I'll point out that a dict has no > ordering to it. If you want an approach that doesn't iterate over the > entire structure you'll need to store the data differently. For > example if you stored all the keys in a sorted list you could use bisect. >
[toc] | [next] | [standalone]
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2013-12-11 05:31 -0800 |
| Message-ID | <3efc283f-419d-41b6-ad20-c2901c3b9f78@googlegroups.com> |
| In reply to | #61553 |
Reordering to un-top-post. > On 11.12.2013 06:47, Dave Angel wrote: > > On Wed, 11 Dec 2013 02:02:20 +0200, Tamer Higazi wrote: > >> Is there a way to get dict by search terms without iterating the > > entire > >> dictionary ?! > >> I want to grab the dict's key and values started with 'Ar'... > > Your wording is so ambiguous that each respondent has guessed > > differently. > > I'm guessing that you want all key/value pairs for which the key > > begins with the two letters 'Ar' I'm guessing further that your > > objection to iterating the entire dictionary is not code size but > > performance. > > If both assumptions are valid then I'll point out that a dict has no > > ordering to it. If you want an approach that doesn't iterate over the > > entire structure you'll need to store the data differently. For > > example if you stored all the keys in a sorted list you could use bisect. On Wednesday, December 11, 2013 3:37:08 PM UTC+5:30, Tamer Higazi wrote: > Hi Dave! > You were absolutely right. > I don't want to iterate the entire dict to get me the key/values > Let us say this dict would have 20.000 entries, but I want only those > with "Aa" to be grabed. > Those starting with these 2 letters would be only 5 or 6 then it would > take a lot of time. > In which way would you prefer to store the data, and which functions or > methods would you use effectively to accomplish this task ? The classic data structure for this is the trie: General idea: http://en.wikipedia.org/wiki/Trie In python: http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/ > I deeply apologize of not defining the question more defined. English is > not my mother tongue. > I'll do my best next time. English no issue. But better not to top-post http://en.wikipedia.org/wiki/Posting_style#Top-posting
[toc] | [prev] | [next] | [standalone]
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2013-12-11 09:46 -0500 |
| Message-ID | <roy-DA38A7.09461211122013@news.panix.com> |
| In reply to | #61570 |
In article <3efc283f-419d-41b6-ad20-c2901c3b9f78@googlegroups.com>, rusi <rustompmody@gmail.com> wrote: > The classic data structure for this is the trie: > General idea: http://en.wikipedia.org/wiki/Trie > In python: > http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/ I agree that a trie fits the problem description well. If I were coding this up in C or C++, that's probably the data structure I'd use, with each node containing an array of pointers to nodes, and the array indexed by the ascii value of the next character (this doesn't translate well to unicode). Lookup speed would be awesomely fast. The problem is, that doesn't make a whole lot of sense in Python. The cited implementation uses dicts at each level. By the time you've done that, you might as well just throw all the data into one big dict and use the full search string as the key. It would be a lot less code, and probably run faster. Still, as an exercise in learning about a useful (and under-appreciated) data structure, implementing this as a trie sounds like a wonderful idea.
[toc] | [prev] | [next] | [standalone]
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2013-12-11 07:08 -0800 |
| Message-ID | <a9a20a2f-a0e5-477c-b999-8fc1118d42d7@googlegroups.com> |
| In reply to | #61579 |
On Wednesday, December 11, 2013 8:16:12 PM UTC+5:30, Roy Smith wrote:
> rusi wrote:
> > The classic data structure for this is the trie:
> > General idea: http://en.wikipedia.org/wiki/Trie
> > In python:
> > http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/
> I agree that a trie fits the problem description well. If I were coding
> this up in C or C++, that's probably the data structure I'd use, with
> each node containing an array of pointers to nodes, and the array
> indexed by the ascii value of the next character (this doesn't translate
> well to unicode). Lookup speed would be awesomely fast.
> The problem is, that doesn't make a whole lot of sense in Python. The
> cited implementation uses dicts at each level. By the time you've done
> that, you might as well just throw all the data into one big dict and
> use the full search string as the key. It would be a lot less code, and
> probably run faster.
> Still, as an exercise in learning about a useful (and under-appreciated)
> data structure, implementing this as a trie sounds like a wonderful idea.
I was going to say that Steven's
data = {}
for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
data[c] = {}
is really the first step of the trie approach
except that instead of
key = "Aardvark"
data[key[0]][key] = "some value"
he would need
data[key[0]][key[1:] = "some value"
The catch is keys of length 1 eg "A"
Which makes the fully general (recursively unfolded) version easier
to write, though with all those zillions of dicts probably not very efficient.
Finitizing the trie into fixed small prefixes with leaves containing standard
dicts looks sensible (to me without and testing of either efficiency or
readability!)
The short prefix problem however remains
Of course there are other tricks possible:
If we know that the data is case-insensitive alphabetic ASCII* we can just
normalize the data with
def norm(s): return [ord(c.upper()) -ord('A') for c in s]
then
>>> norm("aBcD")
[0, 1, 2, 3]
and instead of dicts we can use 26-length lists
* O me O my! Ive used the terrible A-word!
Now RUN before the trolls get us!
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-12-11 09:42 -0600 |
| Message-ID | <mailman.3903.1386776501.18130.python-list@python.org> |
| In reply to | #61579 |
On 2013-12-11 09:46, Roy Smith wrote: > The problem is, that doesn't make a whole lot of sense in Python. > The cited implementation uses dicts at each level. By the time > you've done that, you might as well just throw all the data into > one big dict and use the full search string as the key. It would > be a lot less code, and probably run faster. You're right if the search term is a whole word, a single dict-to-result works ideally. However, the OP asked about prefixes, so the Python implementation I provided uses a dict-of-nested-dicts which allows any arbitrary prefix, and then iterates over only the subset of those that match. If you need O(length-of-prefix) iteration of all results, I believe that's the best way algorithm to use (implementation details could differ for a more space-efficient structure perhaps; normalization might help reduce dict-entries). It's a specialized use-case, and doesn't have the O(1) lookup for exact-matches (that's just a special case of prefix=word with no sub-iteration, so would be O(length-of-search-word)). -tkc
[toc] | [prev] | [next] | [standalone]
| From | Travis Griggs <travisgriggs@gmail.com> |
|---|---|
| Date | 2013-12-11 09:19 -0800 |
| Message-ID | <mailman.3916.1386782404.18130.python-list@python.org> |
| In reply to | #61570 |
On Dec 11, 2013, at 5:31 AM, rusi <rustompmody@gmail.com> wrote: > > The classic data structure for this is the trie: > General idea: http://en.wikipedia.org/wiki/Trie > In python: > http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/ My thoughts exactly! If you wade through the comments there, someone has done a more-than-naive implementation here: https://github.com/kmike/marisa-trie The write up makes it look pretty favorable as well for performance (scroll 2/3s down to the Benchmarks section).
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2013-12-11 18:09 +0000 |
| Message-ID | <mailman.3917.1386785364.18130.python-list@python.org> |
| In reply to | #61570 |
On 11/12/2013 17:19, Travis Griggs wrote: > > On Dec 11, 2013, at 5:31 AM, rusi <rustompmody@gmail.com> wrote: > >> >> The classic data structure for this is the trie: >> General idea: http://en.wikipedia.org/wiki/Trie >> In python: >> http://stackoverflow.com/questions/11015320/how-to-create-a-trie-in-python/ > > My thoughts exactly! > > If you wade through the comments there, someone has done a more-than-naive implementation here: > > https://github.com/kmike/marisa-trie > > The write up makes it look pretty favorable as well for performance (scroll 2/3s down to the Benchmarks section). > http://kmike.ru/python-data-structures/ from the author of the above is well worth a read. -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-12-11 13:44 +0000 |
| Message-ID | <52a86c57$0$29992$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #61553 |
On Wed, 11 Dec 2013 12:07:08 +0200, Tamer Higazi wrote:
> Hi Dave!
>
> You were absolutely right.
> I don't want to iterate the entire dict to get me the key/values
>
> Let us say this dict would have 20.000 entries, but I want only those
> with "Aa" to be grabed.
> Those starting with these 2 letters would be only 5 or 6 then it would
> take a lot of time.
What do you mean by "a lot of time"?
Here is a small test. I set up a dict with 456976 keys, and then iterate
over them in just over a quarter of a second on my (old, slow) computer.
Here is the code I use:
data = {}
letters = "abcdefghijklmnopqrstuvwxyz"
for a in letters.upper():
for b in letters:
for c in letters:
for d in letters:
key = a + b + c + d
data[key] = None
print(len(data))
count = 0
with Timer():
for key in data:
if key.startswith("Aa"):
count += 1
print("Found %d keys starting with 'Aa'")
The Timer() function is not standard to Python, but you can find it here:
http://code.activestate.com/recipes/577896
Are you sure that just using a normal dict will be too slow?
> In which way would you prefer to store the data, and which functions or
> methods would you use effectively to accomplish this task ?
I would use a dict, and iterate over the keys, until such time that I new
that iterating was the bottle-neck causing my code to be too slow. Until
I knew that absolutely for sure, I would not optimize.
If necessary, I would consider having 26 dicts, one for each initial
letter:
data = {}
for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
data[c] = {}
then store keys in the particular dict. That way, if I wanted keys
starting with Aa, I would only search the A dict, not the B dict, C dict,
etc.
key = "Aardvark"
data[key[0]][key] = "some value"
--
Steven
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2013-12-11 08:30 -0600 |
| Message-ID | <mailman.3900.1386772178.18130.python-list@python.org> |
| In reply to | #61572 |
On 2013-12-11 13:44, Steven D'Aprano wrote:
> If necessary, I would consider having 26 dicts, one for each
> initial letter:
>
> data = {}
> for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
> data[c] = {}
>
> then store keys in the particular dict. That way, if I wanted keys
> starting with Aa, I would only search the A dict, not the B dict, C
> dict, etc.
That's what the convoluted code does that I put at the end of my
previous post in this thread, only to the Nth degree (the outermost
dict has the first letter which links to a dictionary of the 2nd
level/letter, to the 3rd level/letter, etc).
-tkc
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-12-11 18:02 -0700 |
| Message-ID | <mailman.3944.1386810211.18130.python-list@python.org> |
| In reply to | #61572 |
On Wed, Dec 11, 2013 at 7:30 AM, Tim Chase
<python.list@tim.thechases.com> wrote:
> On 2013-12-11 13:44, Steven D'Aprano wrote:
>> If necessary, I would consider having 26 dicts, one for each
>> initial letter:
>>
>> data = {}
>> for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ":
>> data[c] = {}
>>
>> then store keys in the particular dict. That way, if I wanted keys
>> starting with Aa, I would only search the A dict, not the B dict, C
>> dict, etc.
>
> That's what the convoluted code does that I put at the end of my
> previous post in this thread, only to the Nth degree (the outermost
> dict has the first letter which links to a dictionary of the 2nd
> level/letter, to the 3rd level/letter, etc).
This is what I did not so long ago when writing a utility for
typeahead lookup, except that to save some space and time I only
nested the dicts as deeply as there were still multiple entries. As
an example of what the data structure looked like:
lookups = {
'a': {
'l': {
'g': 'algebra',
'p': 'alphanumeric',
},
's': 'asterisk',
},
'b': 'bobcat',
...
}
It does make the update process more complicated though, as adding new
words potentially requires existing words to be nested deeper than
they are currently.
[toc] | [prev] | [next] | [standalone]
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2013-12-11 18:05 -0700 |
| Message-ID | <mailman.3946.1386810804.18130.python-list@python.org> |
| In reply to | #61572 |
On Wed, Dec 11, 2013 at 6:02 PM, Ian Kelly <ian.g.kelly@gmail.com> wrote:
> This is what I did not so long ago when writing a utility for
> typeahead lookup, except that to save some space and time I only
> nested the dicts as deeply as there were still multiple entries. As
> an example of what the data structure looked like:
>
> lookups = {
> 'a': {
> 'l': {
> 'g': 'algebra',
> 'p': 'alphanumeric',
> },
> 's': 'asterisk',
> },
> 'b': 'bobcat',
> ...
> }
>
> It does make the update process more complicated though, as adding new
> words potentially requires existing words to be nested deeper than
> they are currently.
And I'm simplifying that a bit, because I also included at each node
the preferred (in my case, the first alphabetically) completion for
that prefix, to avoid the need to iterate over the subtree.
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web