Groups > comp.lang.python > #98925 > unrolled thread

Which type should be used when testing static structure appartenance

Started by	Nicolas Évrard <nicoe@altern.org>
First post	2015-11-17 11:27 -0300
Last post	2015-11-19 00:42 +1100
Articles	6 — 4 participants

Back to article view | Back to comp.lang.python

  Which type should be used when testing static structure appartenance Nicolas Évrard <nicoe@altern.org> - 2015-11-17 11:27 -0300
    Re: Which type should be used when testing static structure appartenance Steven D'Aprano <steve@pearwood.info> - 2015-11-18 22:37 +1100
      Re: Which type should be used when testing static structure appartenance Chris Angelico <rosuav@gmail.com> - 2015-11-18 23:40 +1100
        Re: Which type should be used when testing static structure appartenance Steven D'Aprano <steve@pearwood.info> - 2015-11-19 00:30 +1100
          Re: Which type should be used when testing static structure appartenance Jon Ribbens <jon+usenet@unequivocal.co.uk> - 2015-11-18 13:32 +0000
          Re: Which type should be used when testing static structure appartenance Chris Angelico <rosuav@gmail.com> - 2015-11-19 00:42 +1100

#98925 — Which type should be used when testing static structure appartenance

From	Nicolas Évrard <nicoe@altern.org>
Date	2015-11-17 11:27 -0300
Subject	Which type should be used when testing static structure appartenance
Message-ID	<mailman.388.1447770859.16136.python-list@python.org>

Hello,

I saw the following retweet by Raymond Hettinger in this morning:

    https://twitter.com/sanityinc/status/666485814214287360

    Programming tip: many of those arrays and hashes in your code
    should actually be sets. Match data structures to data
    constraints!

I saw just in time because in a review I wrote something like this:

    if operator not in ('where', 'not where')

and my colleague proposed that I should use a list instead of a tuple.
But reading the mentioned tweet I tend to think that a set would be a
better choice.

What are your opinion on this issue (I realize it's not something of
the utmost importance but rather a "philosophical" question).

-- 
Nicolas Évrard - B2CK SPRL
E-mail/Jabber: nicolas.evrard@b2ck.com
Tel: +32 472 54 46 59
Website: http://www.b2ck.com/

[toc] | [next] | [standalone]

#98961

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-11-18 22:37 +1100
Message-ID	<564c62f3$0$1593$c3e8da3$5496439d@news.astraweb.com>
In reply to	#98925

On Wed, 18 Nov 2015 01:27 am, Nicolas Évrard wrote:

> Hello,
> 
> I saw the following retweet by Raymond Hettinger in this morning:
> 
>     https://twitter.com/sanityinc/status/666485814214287360
> 
>     Programming tip: many of those arrays and hashes in your code
>     should actually be sets. Match data structures to data
>     constraints!

This is why I hold Twitter in contempt. What Raymond says could be a good
tip, or a load of old hogwash, and there is no way of knowing because you
can't explain much in 130 characters. If it were a blog post, he could
explain what he means. But its a tweet, so he can't.

Of course you should match the data structure to your data constraints, but
what does that mean in practice? *Which* arrays and hashes should be sets?
How do you know which should be changed to sets?

> I saw just in time because in a review I wrote something like this:
> 
>     if operator not in ('where', 'not where')
> 
> and my colleague proposed that I should use a list instead of a tuple.
> But reading the mentioned tweet I tend to think that a set would be a
> better choice.

Exactly. Raymond's tweet only serves to muddy the water and add more
confusion. Before, you thought you knew the right answer: use a tuple.
Then, your colleague proposed a list, adding some uncertainty and
confusion: which is better, a list or a tuple? And now Raymond's tweet has
*increased* the uncertainty: should you use a set?

> What are your opinion on this issue (I realize it's not something of
> the utmost importance but rather a "philosophical" question).

I would say that there is no doubt that in Python 2, using a tuple is far
and away the best solution for the situation you show above.

There are three factors to weigh up: 

(1) how easy is it to create the data structure?

(2) how much space does the data structure use?

(3) how fast is searching the data structure?

How easy is it to create the data structure? In Python 2, lists and tuples
are effectively just as easy to type, but sets not so much:

('where', 'not where')
['where', 'not where']
set(['where', 'not where'])

Not only are sets longer to type, but they are created at runtime. Compare
the byte-code compiled for each expression:

py> from dis import dis
py> a = compile("('where', 'not where')", "", "single")
py> b = compile("['where', 'not where']", "", "single")
py> c = compile("set(['where', 'not where'])", "", "single")
py> dis(a)
  1           0 LOAD_CONST               3 (('where', 'not where'))
              3 PRINT_EXPR
              4 LOAD_CONST               2 (None)
              7 RETURN_VALUE
py> dis(b)
  1           0 LOAD_CONST               0 ('where')
              3 LOAD_CONST               1 ('not where')
              6 BUILD_LIST               2
              9 PRINT_EXPR
             10 LOAD_CONST               2 (None)
             13 RETURN_VALUE
py> dis(c)
  1           0 LOAD_NAME                0 (set)
              3 LOAD_CONST               0 ('where')
              6 LOAD_CONST               1 ('not where')
              9 BUILD_LIST               2
             12 CALL_FUNCTION            1
             15 PRINT_EXPR
             16 LOAD_CONST               2 (None)
             19 RETURN_VALUE

Python can optimise the creation of the tuple to compile-time, but it has to
create the list at run time. It also creates the set at run time, but it
also needs to do a name lookup. All this takes time.

How much space does each data structure take? Again, a tuple easily wins
over the others:

py> import sys
py> sys.getsizeof(('where', 'not where'))
32
py> sys.getsizeof(['where', 'not where'])
40
py> sys.getsizeof(set(['where', 'not where']))
112

The tuple is smaller than the list, and MUCH smaller than the set. (The
exact sizes you see will depend on the precise version and platform you are
using, but I am confident that the tuple will win.)

Last but not least, which is fastest? For that, we can use the timeit
module, running it as a script from the shell:

[steve@ando ~]$ python -m timeit -s "x = 'not where'" "x in ('where', 'not
where')"
10000000 loops, best of 3: 0.122 usec per loop
[steve@ando ~]$ python -m timeit -s "x = 'not where'" "x in ['where', 'not
where']"
10000000 loops, best of 3: 0.119 usec per loop
[steve@ando ~]$ python -m timeit -s "x = 'not where'" "x in
set(['where', 'not where'])"
1000000 loops, best of 3: 0.728 usec per loop

The difference between 0.122 microseconds and 0.119 microseconds is too
close to call. We can say that the tuple and the list are equally fast, and
the set much, much slower.

But... we can speed the set up, by only creating it once:

[steve@ando ~]$ python -m timeit -s "x = 'not where'" -s "S =
set(['where', 'not where'])" "x in S"
10000000 loops, best of 3: 0.101 usec per loop

That's better! But only *marginally* faster than the tuple, and to get the
speed increase you have to factor out the set into a constant.

So for Python 2, I would say that tuples are the clean winner, at least for
the case where you only have two items to check. If you have more than two,
sets start becoming more attractive.

How about Python 3? Python 3 has the advantage of using set literals. Let's
test the speed:

[steve@ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in
('where', 'not where')"
10000000 loops, best of 3: 0.11 usec per loop
[steve@ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in
['where', 'not where']"
10000000 loops, best of 3: 0.113 usec per loop
[steve@ando ~]$ python3.3 -m timeit -s "x = 'not where'" "x in
{'where', 'not where'}"
10000000 loops, best of 3: 0.0907 usec per loop

This makes sets much more attractive in Python 3.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#98963

From	Chris Angelico <rosuav@gmail.com>
Date	2015-11-18 23:40 +1100
Message-ID	<mailman.411.1447850412.16136.python-list@python.org>
In reply to	#98961

On Wed, Nov 18, 2015 at 10:37 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> How about Python 3? Python 3 has the advantage of using set literals.

2.7 has set literals too. All the questions of performance should be
secondary to code clarity, though; so I would say the choices are: Set
literal if available, else tuple. Forget the performance.

ChrisA

[toc] | [prev] | [next] | [standalone]

#98968

From	Steven D'Aprano <steve@pearwood.info>
Date	2015-11-19 00:30 +1100
Message-ID	<564c7d80$0$1606$c3e8da3$5496439d@news.astraweb.com>
In reply to	#98963

On Wed, 18 Nov 2015 11:40 pm, Chris Angelico wrote:

> On Wed, Nov 18, 2015 at 10:37 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> How about Python 3? Python 3 has the advantage of using set literals.
> 
> 2.7 has set literals too. 

Ah, so it does. I forgot about that. Most of my code has to run on multiple
versions (back to 2.4) and I remembered that set literals don't work:

[steve@ando ~]$ python2.6 -c "{1, 2, 3}"
  File "<string>", line 1
    {1, 2, 3}
      ^
SyntaxError: invalid syntax

> All the questions of performance should be 
> secondary to code clarity, though; 

"All"? Surely not.

> so I would say the choices are: Set 
> literal if available, else tuple. Forget the performance.

It seems rather strange to argue that we should ignore performance when the
whole reason for using sets in the first place is for performance.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#98969

From	Jon Ribbens <jon+usenet@unequivocal.co.uk>
Date	2015-11-18 13:32 +0000
Message-ID	<slrnn4ovjp.1t1.jon+usenet@frosty.unequivocal.co.uk>
In reply to	#98968

On 2015-11-18, Steven D'Aprano <steve@pearwood.info> wrote:
> On Wed, 18 Nov 2015 11:40 pm, Chris Angelico wrote:
>> so I would say the choices are: Set 
>> literal if available, else tuple. Forget the performance.
>
> It seems rather strange to argue that we should ignore performance when the
> whole reason for using sets in the first place is for performance.

Who says the whole reason for using sets is performance? Sets behave
differently to tuples or lists - the members are guaranteed to be
unique.

[toc] | [prev] | [next] | [standalone]

#98971

From	Chris Angelico <rosuav@gmail.com>
Date	2015-11-19 00:42 +1100
Message-ID	<mailman.415.1447854155.16136.python-list@python.org>
In reply to	#98968

On Thu, Nov 19, 2015 at 12:30 AM, Steven D'Aprano <steve@pearwood.info> wrote:
> On Wed, 18 Nov 2015 11:40 pm, Chris Angelico wrote:
>> All the questions of performance should be
>> secondary to code clarity, though;
>
> "All"? Surely not.

The OP's example was checking if a string was equal to either of two
strings. Even if that's in a tight loop, the performance difference
between the various options is negligible.

The "all" is a little misleading (of course there are times when you
warp your code for the sake of performance), but I was talking about
this example, where it's basically coming down to microbenchmarks.

>> so I would say the choices are: Set
>> literal if available, else tuple. Forget the performance.
>
> It seems rather strange to argue that we should ignore performance when the
> whole reason for using sets in the first place is for performance.

They do perform well, but that's not the main point - not when you're
working with just two strings. Of course, when you can get performance
AND readability, it's perfect. That doesn't happen with Py2 sets, but
it does with Python 3:

rosuav@sikorsky:~$ python -m timeit -s "x='asdf'" "x in {'asdf','qwer'}"
10000000 loops, best of 3: 0.12 usec per loop
rosuav@sikorsky:~$ python -m timeit -s "x='asdf'" "x in ('asdf','qwer')"
10000000 loops, best of 3: 0.0344 usec per loop
rosuav@sikorsky:~$ python -m timeit -s "x='asdf'" "x=='asdf' or x=='qwer'"
10000000 loops, best of 3: 0.0392 usec per loop
rosuav@sikorsky:~$ python3 -m timeit -s "x='asdf'" "x in {'asdf','qwer'}"
10000000 loops, best of 3: 0.0356 usec per loop
rosuav@sikorsky:~$ python3 -m timeit -s "x='asdf'" "x in ('asdf','qwer')"
10000000 loops, best of 3: 0.0342 usec per loop
rosuav@sikorsky:~$ python3 -m timeit -s "x='asdf'" "x=='asdf' or x=='qwer'"
10000000 loops, best of 3: 0.0418 usec per loop

No set construction in Py3 - the optimizer figures out that you don't
need mutability, and uses a constant frozenset. (Both versions do this
with list->tuple.) Despite the performance hit from using a set in
Py2, though, I would still advocate its use (assuming you don't need
to support 2.6 or older), because it accurately represents the
*concept* of "is this any one of these".

ChrisA

[toc] | [prev] | [standalone]

csiph-web

Which type should be used when testing static structure appartenance

Contents

#98925 — Which type should be used when testing static structure appartenance

#98961

#98963

#98968

#98969

#98971