Groups > comp.lang.python > #27194 > unrolled thread

set and dict iteration

Started by	Aaron Brady <castironpi@gmail.com>
First post	2012-08-16 11:00 -0700
Last post	2012-08-17 10:47 -0700
Articles	20 on this page of 45 — 13 participants

Back to article view | Back to comp.lang.python

  set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-16 11:00 -0700
    Re: set and dict iteration Dave Angel <d@davea.name> - 2012-08-16 15:49 -0400
      Re: set and dict iteration Paul Rubin <no.email@nospam.invalid> - 2012-08-16 14:26 -0700
        Re: set and dict iteration Dave Angel <davea@dejaviewphoto.com> - 2012-08-16 19:11 -0400
          Re: set and dict iteration Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-17 02:24 +0000
            Re: set and dict iteration Paul Rubin <no.email@nospam.invalid> - 2012-08-16 19:30 -0700
              Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-17 11:11 -0700
            Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-17 11:37 -0700
              Re: set and dict iteration Chris Angelico <rosuav@gmail.com> - 2012-08-18 07:57 +1000
                Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-18 13:29 -0700
                Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-18 13:29 -0700
                  Re: set and dict iteration MRAB <python@mrabarnett.plus.com> - 2012-08-18 23:14 +0100
                    Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-18 19:28 -0700
                      Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-23 09:49 -0700
                        Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 13:17 -0600
                          Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-02 10:43 -0700
                            Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 13:29 -0600
                              Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 13:04 -0700
                                Re: set and dict iteration Dave Angel <d@davea.name> - 2012-09-03 16:27 -0400
                                  Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 17:24 -0700
                                  Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 17:24 -0700
                              Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 13:04 -0700
                                Re: set and dict iteration Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 01:26 +0000
                                  Re: set and dict iteration Dave Angel <d@davea.name> - 2012-09-03 21:50 -0400
                                    Re: set and dict iteration Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 01:59 +0000
                                      Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-08 08:42 -0700
                          Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-02 10:43 -0700
                        Re: set and dict iteration Serhiy Storchaka <storchaka@gmail.com> - 2012-09-04 00:46 +0300
                      Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-23 09:49 -0700
                        Re: set and dict iteration Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-23 18:11 +0000
                          Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-27 11:10 -0700
                          Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-08 15:07 -0700
                    Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-18 19:28 -0700
                    Re: set and dict iteration Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2012-09-08 22:06 +0200
                      Re: set and dict iteration Hans Mulder <hansmu@xs4all.nl> - 2012-09-08 23:18 +0200
                      Re: set and dict iteration MRAB <python@mrabarnett.plus.com> - 2012-09-08 22:22 +0100
        Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-16 17:43 -0600
          Re: set and dict iteration Paul Rubin <no.email@nospam.invalid> - 2012-08-16 18:01 -0700
            Re: set and dict iteration Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2012-08-17 13:16 +0200
            Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-17 11:03 -0700
            Re: set and dict iteration 88888 Dihedral <dihedral88888@googlemail.com> - 2012-09-10 13:14 -0700
    Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-16 16:55 -0600
    Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-16 17:07 -0600
      Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-17 10:47 -0700
      Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-08-17 10:47 -0700

Page 1 of 3 [1] 2 3 Next page →

#27194 — set and dict iteration

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-16 11:00 -0700
Subject	set and dict iteration
Message-ID	<b8dd3aca-2a87-4124-ad6e-66a8720af99a@googlegroups.com>

Hello,

I observed an inconsistency in the behavior of 'set' and 'dict' iterators.  It is "by design" according to the docs.

'''
http://docs.python.org/dev/library/stdtypes.html#dict-views

iter(dictview).  Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
'''

The 'set' has the same behavior.  Iteration might also complete successfully.

The inconsistency is, if we remove an element from a set and add another during iteration, the new element might appear later in the iteration, and might not, depending on the hash code; therefore comparing the size of the set between iterations isn't adequate.  Example:

http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.py  

'''
# py: { 'ver': '3' }

set0= set( ( 1, 2 ) )

iter0= iter( set0 )
print( next( iter0 ) )

set0.add( 3 )
set0.remove( 2 )
print( next( iter0 ) )


print( )

set0= set( ( 6, 7 ) )

iter0= iter( set0 )
print( next( iter0 ) )

set0.add( 8 )
set0.remove( 7 )
print( next( iter0 ) )
'''

Output:

'''
1
3

6
Traceback (most recent call last):
  File [...] line 22, in <module>
    print( next( iter0 ) )
StopIteration
'''

Iteration should behave the same regardless of the contents of the set.  Continuing iteration over sets and dicts after a modification isn't defined; it should unconditionally raise an error.

What's going on, is '8' is added before the position of the iterator due to hashing in the second part, but the size doesn't change, so the iterator reaches the end of the set after '7' is removed.

The inconsistency isn't easily solved.  One possibility is to use a timestamp or other serial index in the object and iterators, and compare them on every iteration to determine if a modification has occurred.

Another possibility which the author prefers, is to maintain a secondary collection of the iterators of an object, and invalidate them upon modification.  The applicable collection structure is a doubly-linked linked list, informally depicted:

http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

Upon modification, the set traverses its iterators, setting an 'invalid' flag on each; and subsequent calls to any of them raise an 'IterationError'.  Adding and removing iterators to and from the secondary list is performed in O( 1 ) time with no penalty.

The above example depicted a 'Set'.  'Dicts' have the same anomaly, but the solution is ambiguous, since dict values can be changed meaningfully without altering the structure of the object.  In the author's opinion, the dict should not raise an 'IterationError' on value changes, only key changes like the set, but the argument isn't conclusive.

[toc] | [next] | [standalone]

#27195

From	Dave Angel <d@davea.name>
Date	2012-08-16 15:49 -0400
Message-ID	<mailman.3389.1345146609.4697.python-list@python.org>
In reply to	#27194

On 08/16/2012 02:00 PM, Aaron Brady wrote:
> Hello,
>
> I observed an inconsistency in the behavior of 'set' and 'dict' iterators.  It is "by design" according to the docs.
>
> '''
> http://docs.python.org/dev/library/stdtypes.html#dict-views
>
> iter(dictview).  Iterating views while adding or deleting entries in the dictionary may raise a RuntimeError or fail to iterate over all entries.
> '''
>
> The 'set' has the same behavior.  Iteration might also complete successfully.
>
> The inconsistency is, if we remove an element from a set and add another during iteration, the new element might appear later in the iteration, and might not, depending on the hash code; therefore comparing the size of the set between iterations isn't adequate.  Example:
> <SNIP>
>
>
> Iteration should behave the same regardless of the contents of the set.  Continuing iteration over sets and dicts after a modification isn't defined; it should unconditionally raise an error.

Why is it the iterator's job to protect against the user's  bug?  The
doc is clear enough.  If you don't change the collection, you won't have
a problem.

> <SNIP more details>.

Everything else is implementation defined.  Why should an implementation
be forced to have ANY extra data structure to detect a static bug in the
caller's code?



-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#27202

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-16 14:26 -0700
Message-ID	<7xy5le7cli.fsf@ruckus.brouhaha.com>
In reply to	#27195

Dave Angel <d@davea.name> writes:
> Everything else is implementation defined.  Why should an implementation
> be forced to have ANY extra data structure to detect a static bug in the
> caller's code?

For the same reason the interpreter checks for type errors at runtime
and raises TypeError, instead of letting the program go into the weeds.

[toc] | [prev] | [next] | [standalone]

#27213

From	Dave Angel <davea@dejaviewphoto.com>
Date	2012-08-16 19:11 -0400
Message-ID	<mailman.3404.1345158704.4697.python-list@python.org>
In reply to	#27202

On 08/16/2012 05:26 PM, Paul Rubin wrote:
> Dave Angel <d@davea.name> writes:
>> Everything else is implementation defined.  Why should an implementation
>> be forced to have ANY extra data structure to detect a static bug in the
>> caller's code?
> For the same reason the interpreter checks for type errors at runtime
> and raises TypeError, instead of letting the program go into the weeds.

There's an enormous difference between type errors, which affect the low
level dispatch, and checking for whether a dict has changed and may have
invalidated the iterator.  If we were really going to keep track of what
iterators are tracking a given dict or set, why stop there?  Why not
check if another process has changed a file we're iterating through?  Or ...

[toc] | [prev] | [next] | [standalone]

#27218

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-08-17 02:24 +0000
Message-ID	<502dab6c$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#27213

On Thu, 16 Aug 2012 19:11:19 -0400, Dave Angel wrote:

> On 08/16/2012 05:26 PM, Paul Rubin wrote:
>> Dave Angel <d@davea.name> writes:
>>> Everything else is implementation defined.  Why should an
>>> implementation be forced to have ANY extra data structure to detect a
>>> static bug in the caller's code?
>> For the same reason the interpreter checks for type errors at runtime
>> and raises TypeError, instead of letting the program go into the weeds.
> 
> There's an enormous difference between type errors, which affect the low
> level dispatch, and checking for whether a dict has changed and may have
> invalidated the iterator.  If we were really going to keep track of what
> iterators are tracking a given dict or set, why stop there?  Why not
> check if another process has changed a file we're iterating through?  Or
> ...

Which is why Python doesn't do it -- because it is (claimed to be) 
excessively expensive for the benefit that you would get.

Not because it is a matter of principle that data integrity is 
unimportant. Data integrity *is* important, but in the opinion of the 
people who wrote these particular data structures, the effort required to 
guarantee correct iteration in the face of mutation is too expensive for 
the benefit.

Are they right? I don't know. I know that the list sort method goes to a 
lot of trouble to prevent code from modifying lists while they are being 
sorted. During the sort, the list temporarily appears to be empty to 
anything which attempts to access it. So at least sometimes, the Python 
developers spend effort to ensure data integrity.

Luckily, Python is open source. If anyone thinks that sets and dicts 
should include more code protecting against mutation-during-iteration, 
they are more than welcome to come up with a patch. Don't forget unit and 
regression tests, and also a set of timing results which show that the 
slow-down isn't excessive.

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#27220

From	Paul Rubin <no.email@nospam.invalid>
Date	2012-08-16 19:30 -0700
Message-ID	<7xboiadzcd.fsf@ruckus.brouhaha.com>
In reply to	#27218

Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> Luckily, Python is open source. If anyone thinks that sets and dicts 
> should include more code protecting against mutation-during-iteration, 
> they are more than welcome to come up with a patch. Don't forget unit and 
> regression tests, and also a set of timing results which show that the 
> slow-down isn't excessive.

It could be a debugging option, in which case even a fairly significant
slowdown is acceptable.

[toc] | [prev] | [next] | [standalone]

#27253

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-17 11:11 -0700
Message-ID	<de5c72da-404f-4569-bcf1-78bdeebf813a@googlegroups.com>
In reply to	#27220

On Thursday, August 16, 2012 9:30:42 PM UTC-5, Paul Rubin wrote:
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> writes:
> 
> > Luckily, Python is open source. If anyone thinks that sets and dicts 
> 
> > should include more code protecting against mutation-during-iteration, 
> 
> > they are more than welcome to come up with a patch. Don't forget unit and 
> 
> > regression tests, and also a set of timing results which show that the 
> 
> > slow-down isn't excessive.
> 
> 
> 
> It could be a debugging option, in which case even a fairly significant
> 
> slowdown is acceptable.

Another possibility is to use the 'gc.get_referrers' mechanism to obtain the iterators.


import gc
a= set( ( 0, 1, 2 ) )
b= iter( a )
c= iter( a )
d= iter( a )
print( gc.get_referrers( a ) )

Output:

[<set_iterator object at 0x00C0B9E0>, <set_iterator object at 0x00C0BA08>, <set_iterator object at 0x00C0BA30>, [others] ]

This approach wouldn't be as time-efficient as a dedicated secondary structure, due to the other objects which refer to the set, including variable namespaces.

[toc] | [prev] | [next] | [standalone]

#27255

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-17 11:37 -0700
Message-ID	<fe95c29c-2289-4e9c-870e-e3c475f13459@googlegroups.com>
In reply to	#27218

On Thursday, August 16, 2012 9:24:44 PM UTC-5, Steven D'Aprano wrote:
> On Thu, 16 Aug 2012 19:11:19 -0400, Dave Angel wrote:
> 
> 
> 
> > On 08/16/2012 05:26 PM, Paul Rubin wrote:
> 
> >> Dave Angel <d@davea.name> writes:
> 
> >>> Everything else is implementation defined.  Why should an
> 
> >>> implementation be forced to have ANY extra data structure to detect a
> 
> >>> static bug in the caller's code?
> 
> >> For the same reason the interpreter checks for type errors at runtime
> 
> >> and raises TypeError, instead of letting the program go into the weeds.
> 
> > 
> 
> > There's an enormous difference between type errors, which affect the low
> 
> > level dispatch, and checking for whether a dict has changed and may have
> 
> > invalidated the iterator.  If we were really going to keep track of what
> 
> > iterators are tracking a given dict or set, why stop there?  Why not
> 
> > check if another process has changed a file we're iterating through?  Or
> 
> > ...
> 
> 
> 
> Which is why Python doesn't do it -- because it is (claimed to be) 
> 
> excessively expensive for the benefit that you would get.
> 
> 
> 
> Not because it is a matter of principle that data integrity is 
> 
> unimportant. Data integrity *is* important, but in the opinion of the 
> 
> people who wrote these particular data structures, the effort required to 
> 
> guarantee correct iteration in the face of mutation is too expensive for 
> 
> the benefit.
> 
> 
> 
> Are they right? I don't know. I know that the list sort method goes to a 
> 
> lot of trouble to prevent code from modifying lists while they are being 
> 
> sorted. During the sort, the list temporarily appears to be empty to 
> 
> anything which attempts to access it. So at least sometimes, the Python 
> 
> developers spend effort to ensure data integrity.
> 
> 
> 
> Luckily, Python is open source. If anyone thinks that sets and dicts 
> 
> should include more code protecting against mutation-during-iteration, 
> 
> they are more than welcome to come up with a patch. Don't forget unit and 
> 
> regression tests, and also a set of timing results which show that the 
> 
> slow-down isn't excessive.

I contribute a patch some time ago.  It wasn't accepted.  However this thread seems to show a moderately more favorable sentiment than that one.

Is there a problem with hacking on the Beta?  Or should I wait for the Release?  Does anyone want to help me with the changes?  Perhaps P. Rubin could contribute the variation he suggested as well.

[toc] | [prev] | [next] | [standalone]

#27272

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-18 07:57 +1000
Message-ID	<mailman.3435.1345240665.4697.python-list@python.org>
In reply to	#27255

On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady <castironpi@gmail.com> wrote:
> Is there a problem with hacking on the Beta?

Nope. Hack on the beta, then when the release arrives, rebase your
work onto it. I doubt that anything of this nature will be changed
between now and then.

ChrisA

[toc] | [prev] | [next] | [standalone]

#27332

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-18 13:29 -0700
Message-ID	<mailman.3473.1345321756.4697.python-list@python.org>
In reply to	#27272

On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > Is there a problem with hacking on the Beta?
> 
> 
> 
> Nope. Hack on the beta, then when the release arrives, rebase your
> 
> work onto it. I doubt that anything of this nature will be changed
> 
> between now and then.
> 
> 
> 
> ChrisA

Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
...The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples. 
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient.  Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.

Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker, help prove or disprove the above?

[toc] | [prev] | [next] | [standalone]

#27333

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-18 13:29 -0700
Message-ID	<d4708687-2925-421a-b755-969d6dac731a@googlegroups.com>
In reply to	#27272

On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > Is there a problem with hacking on the Beta?
> 
> 
> 
> Nope. Hack on the beta, then when the release arrives, rebase your
> 
> work onto it. I doubt that anything of this nature will be changed
> 
> between now and then.
> 
> 
> 
> ChrisA

Thanks Chris, your post was encouraging.

I have a question about involving the 'tp_clear' field of the types.

http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear

'''
...The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples. 
'''

I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient.  Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.

Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.

Can someone who is more familiar with the cycle detector and cycle breaker, help prove or disprove the above?

[toc] | [prev] | [next] | [standalone]

#27335

From	MRAB <python@mrabarnett.plus.com>
Date	2012-08-18 23:14 +0100
Message-ID	<mailman.3476.1345328046.4697.python-list@python.org>
In reply to	#27333

On 18/08/2012 21:29, Aaron Brady wrote:
> On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
>> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady <castironpi@gmail.com> wrote:
>>
>> > Is there a problem with hacking on the Beta?
>>
>>
>>
>> Nope. Hack on the beta, then when the release arrives, rebase your
>>
>> work onto it. I doubt that anything of this nature will be changed
>>
>> between now and then.
>>
>>
>>
>> ChrisA
>
> Thanks Chris, your post was encouraging.
>
> I have a question about involving the 'tp_clear' field of the types.
>
> http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear
>
> '''
> ...The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples.
> '''
>
> I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient.  Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.
>
> Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png
>
> In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.
>
> Can someone who is more familiar with the cycle detector and cycle breaker, help prove or disprove the above?
>
In simple terms, when you create an immutable object it can contain
only references to pre-existing objects, but in order to create a cycle
you need to make an object refer to another which is created later, so
it's not possible to create a cycle out of immutable objects.

However, using Python's C API it _is_ possible to create such a cycle,
by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
PyTuple_SET_ITEM).

[toc] | [prev] | [next] | [standalone]

#27339

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-18 19:28 -0700
Message-ID	<mailman.3480.1345343315.4697.python-list@python.org>
In reply to	#27335

On Saturday, August 18, 2012 5:14:05 PM UTC-5, MRAB wrote:
> On 18/08/2012 21:29, Aaron Brady wrote:
> 
> > On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> 
> >> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> >>
> 
> >> > Is there a problem with hacking on the Beta?
> 
> >>
> 
> >>
> 
> >>
> 
> >> Nope. Hack on the beta, then when the release arrives, rebase your
> 
> >>
> 
> >> work onto it. I doubt that anything of this nature will be changed
> 
> >>
> 
> >> between now and then.
> 
> >>
> 
> >>
> 
> >>
> 
> >> ChrisA
> 
> >
> 
> > Thanks Chris, your post was encouraging.
> 
> >
> 
> > I have a question about involving the 'tp_clear' field of the types.
> 
> >
> 
> > http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear
> 
> >
> 
> > '''
> 
> > ...The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples.
> 
> > '''
> 
> >
> 
> > I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient.  Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.
> 
> >
> 
> > Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png
> 
> >
> 
> > In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.
> 
> >
> 
> > Can someone who is more familiar with the cycle detector and cycle breaker, help prove or disprove the above?
> 
> >
> 
> In simple terms, when you create an immutable object it can contain
> 
> only references to pre-existing objects, but in order to create a cycle
> 
> you need to make an object refer to another which is created later, so
> 
> it's not possible to create a cycle out of immutable objects.
> 
> 
> 
> However, using Python's C API it _is_ possible to create such a cycle,
> 
> by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
> 
> PyTuple_SET_ITEM).

Are there any precedents for storing uncounted references to PyObject's?

One apparent problematic case is creating an iterator to a set, then adding it to the set.  However the operation is a modification, and causes the iterator to be removed from the secondary list before the set is examined for collection.

Otherwise, the iterator keeps a counted reference to the set, but the set does not keep a counted reference to the iterator, so the iterator will always be freed first.  Therefore, the set's secondary list will be empty when the set is freed.

Concurrent addition and deletion of iterators should be disabled, and the iterators should remove themselves from the set's secondary list before they decrement their references to the set.

Please refresh the earlier diagram; counted references are distinguished separately.  Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

[toc] | [prev] | [next] | [standalone]

#27744

From	Aaron Brady <castironpi@gmail.com>
Date	2012-08-23 09:49 -0700
Message-ID	<c7452db1-5b78-4d54-81a1-1c8683631d6e@googlegroups.com>
In reply to	#27339

On Saturday, August 18, 2012 9:28:32 PM UTC-5, Aaron Brady wrote:
> On Saturday, August 18, 2012 5:14:05 PM UTC-5, MRAB wrote:
> 
> > On 18/08/2012 21:29, Aaron Brady wrote:
> 
> > > On Friday, August 17, 2012 4:57:41 PM UTC-5, Chris Angelico wrote:
> 
> > >> On Sat, Aug 18, 2012 at 4:37 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > >> > Is there a problem with hacking on the Beta?
> 
> > >> Nope. Hack on the beta, then when the release arrives, rebase your
> 
> > >> work onto it. I doubt that anything of this nature will be changed
> 
> > >> between now and then.
> 
> > >> ChrisA
> 
> > > Thanks Chris, your post was encouraging.
> 
> > > I have a question about involving the 'tp_clear' field of the types.
> 
> > > http://docs.python.org/dev/c-api/typeobj.html#PyTypeObject.tp_clear
> 
> > > '''
> 
> > > ...The tuple type does not implement a tp_clear function, because it’s possible to prove that no reference cycle can be composed entirely of tuples.
> 
> > > '''
> 
> > > I didn't follow the reasoning in the proof; the premise is necessary but IMHO not obviously sufficient.  Nevertheless, the earlier diagram contains an overt homogeneous reference cycle.
> 
> > > Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png
> 
> > > In my estimate, the 'tp_traverse' and 'tp_clear' fields of the set doesn't need to visit the auxiliary collection; the same fields of the iterators don't need to visit the primary set or other iterators; and references in the linked list don't need to be included in the iterators' reference counts.
> 
> > > Can someone who is more familiar with the cycle detector and cycle breaker, help prove or disprove the above?
> 
> > In simple terms, when you create an immutable object it can contain
> 
> > only references to pre-existing objects, but in order to create a cycle
> 
> > you need to make an object refer to another which is created later, so
> 
> > it's not possible to create a cycle out of immutable objects.
> 
> > However, using Python's C API it _is_ possible to create such a cycle,
> 
> > by mutating an otherwise-immutable tuple (see PyTuple_SetItem and
> 
> > PyTuple_SET_ITEM).
> 
> Are there any precedents for storing uncounted references to PyObject's?
> 
> One apparent problematic case is creating an iterator to a set, then adding it to the set.  However the operation is a modification, and causes the iterator to be removed from the secondary list before the set is examined for collection.
> 
> Otherwise, the iterator keeps a counted reference to the set, but the set does not keep a counted reference to the iterator, so the iterator will always be freed first.  Therefore, the set's secondary list will be empty when the set is freed.
> 
> Concurrent addition and deletion of iterators should be disabled, and the iterators should remove themselves from the set's secondary list before they decrement their references to the set.
> 
> Please refresh the earlier diagram; counted references are distinguished separately.  Reposting: http://home.comcast.net/~castironpi-misc/clpy-0062%20set%20iterators.png

The patch for the above is only 40-60 lines.  However it introduces two new concepts.

The first is a "linked list", a classic dynamic data structure, first developed in 1955, cf. http://en.wikipedia.org/wiki/Linked_list .  Linked lists are absent in Python, including the standard library and CPython implementation, beyond the weak reference mechanism and garbage collector.  The "collections.deque" structure shares some of the linked list interface but uses arrays.

The second is "uncounted references".  The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program.  The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types.  Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.

Please share your opinion!  Do you agree that the internal list resolves the inconsistency?  Do you agree with the strategy?  Do you agree that uncounted references are justified to introduce, or are counted references preferable?

[toc] | [prev] | [next] | [standalone]

#27996

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-27 13:17 -0600
Message-ID	<mailman.3883.1346095064.4697.python-list@python.org>
In reply to	#27744

On Thu, Aug 23, 2012 at 10:49 AM, Aaron Brady <castironpi@gmail.com> wrote:
> The patch for the above is only 40-60 lines.  However it introduces two new concepts.

Is there a link to the patch?

> The first is a "linked list", a classic dynamic data structure, first developed in 1955, cf. http://en.wikipedia.org/wiki/Linked_list .  Linked lists are absent in Python, including the standard library and CPython implementation, beyond the weak reference mechanism and garbage collector.  The "collections.deque" structure shares some of the linked list interface but uses arrays.
>
> The second is "uncounted references".  The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program.  The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types.  Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.
>
> Please share your opinion!  Do you agree that the internal list resolves the inconsistency?  Do you agree with the strategy?  Do you agree that uncounted references are justified to introduce, or are counted references preferable?

This feature is a hard sell as it is; I think that adding uncounted
references into the mix is only going to make that worse.  May I
suggest an alternate approach?  Internally tag each set or dict with a
"version", which is just a C int.  Every time the hash table is
modified, increment the version.  When an iterator is created, store
the current version on the iterator.  When the iterator is advanced,
check that the iterator version matches the dict/set version.  If
they're not equal, raise an error.

This should add less overhead than the linked list without any
concerns about reference counting.  It does introduce a small bug in
that an error condition could be "missed", if the version is
incremented a multiple of 2**32 or 2**64 times between iterations --
but how often is that really likely to occur?  Bearing in mind that
this error is meant for debugging and not production error handling,
you could even make the version a single byte and I'd still be fine
with that.

Cheers,
Ian

[toc] | [prev] | [next] | [standalone]

#28280

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-02 10:43 -0700
Message-ID	<1567e8c7-a2bb-41f4-9be8-18e9f4d063cb@googlegroups.com>
In reply to	#27996

On Monday, August 27, 2012 2:17:45 PM UTC-5, Ian wrote:
> On Thu, Aug 23, 2012 at 10:49 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > The patch for the above is only 40-60 lines.  However it introduces two new concepts.
> 
> 
> 
> Is there a link to the patch?

Please see below.  It grew somewhat during development.

> > The first is a "linked list".
SNIP.
 
> > The second is "uncounted references".  The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program.  The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types.  Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.
> 
> >
> 
> > Please share your opinion!  Do you agree that the internal list resolves the inconsistency?  Do you agree with the strategy?  Do you agree that uncounted references are justified to introduce, or are counted references preferable?
> 
> 
> 
> This feature is a hard sell as it is; I think that adding uncounted
> 
> references into the mix is only going to make that worse.  May I
> 
> suggest an alternate approach?  Internally tag each set or dict with a
> 
> "version", which is just a C int.  Every time the hash table is
> 
> modified, increment the version.  When an iterator is created, store
> 
> the current version on the iterator.  When the iterator is advanced,
> 
> check that the iterator version matches the dict/set version.  If
> 
> they're not equal, raise an error.
> 
> 
> 
> This should add less overhead than the linked list without any
> 
> concerns about reference counting.  It does introduce a small bug in
> 
> that an error condition could be "missed", if the version is
> 
> incremented a multiple of 2**32 or 2**64 times between iterations --
> 
> but how often is that really likely to occur?  Bearing in mind that
> 
> this error is meant for debugging and not production error handling,
> 
> you could even make the version a single byte and I'd still be fine
> 
> with that.
> 
> 
> 
> Cheers,
> 
> Ian


Hi Ian,

We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.  Using the linked list collection, modification operations are expensive in rare cases.  Using the version index, iteration is expensive in rare cases.

I was more interested in the linked list for conceptual reasons, so I developed it further.  Changelog, diff file, test suite, and links are below.  The devs should be aware that a competing patch might be developed.  I would be pleased to hear what everybody thinks of it!

Linked list with uncounted references implementation for Win32.

Added:
- 'set_clear_ex' and 'set_clear_internal_ex' methods, differ in invalidation and conditional invalidation behavior and return type..  The 'set.clear()' method and 'tp_clear' type field both called the same method.
- 'set_invalidate_iter_linked' method.  Iterate over the iterators of a set, mark them invalid, and clear the list.
- 'setiter_unlink_internal' method.  Remove the iterator from the set's linked list of iterators.
- 'IterationError', global.
- New fields:
-- PySetObject: setiterobject *iter_linked.  Pointer to the first element of the linked list of the iterators of the set.
-- setiterobject: setiterobject *linked_pred, *linked_succ.  Predecessor and successor nodes in the linked list of iterators of the same set.
-- setiterobject: char ob_valid.  Validation status of the iterator.
- Result is compared with original in 'set_intersection_update' and '_multi' to determine whether to invalidate the list of iterators.  Asymptotic running time is unchanged.
- Pending: add 'tp_clear' field to 'PySetIter_Type'?
- Test script included, 'large numbers' test pending.

6 files changed: { setobject.h, setobject.c, exceptions.c, pyerrors.h, python3.def, python33stub.def }.  Test script 'set_iterator_test.py' new.  Linked list interface and pseudocode 'patch_pseudocode.txt'.
Zip file:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_patch.zip
Diff file of 3.3.0b2:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_diff.txt

[toc] | [prev] | [next] | [standalone]

#28366

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-09-03 13:29 -0600
Message-ID	<mailman.154.1346700607.27098.python-list@python.org>
In reply to	#28280

On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.

We could use a Python long; I just don't think the extra overhead is
justified in a data structure that is already highly optimized for
speed.  Incrementing and testing a C int is *much* faster than doing
the same with a Python long.

[toc] | [prev] | [next] | [standalone]

#28368

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-03 13:04 -0700
Message-ID	<ed8c3b7a-f9ff-45fe-9e2c-a225764da7ae@googlegroups.com>
In reply to	#28366

On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
> 
> 
> 
> We could use a Python long; I just don't think the extra overhead is
> 
> justified in a data structure that is already highly optimized for
> 
> speed.  Incrementing and testing a C int is *much* faster than doing
> 
> the same with a Python long.

I think the technique would require two python longs and a bool in the set, and a python long in the iterator.

One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.

[toc] | [prev] | [next] | [standalone]

#28370

From	Dave Angel <d@davea.name>
Date	2012-09-03 16:27 -0400
Message-ID	<mailman.156.1346704107.27098.python-list@python.org>
In reply to	#28368

On 09/03/2012 04:04 PM, Aaron Brady wrote:
> On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
>> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
>>
>>> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
>>
>>
>> We could use a Python long; I just don't think the extra overhead is
>>
>> justified in a data structure that is already highly optimized for
>>
>> speed.  Incrementing and testing a C int is *much* faster than doing
>>
>> the same with a Python long.
> I think the technique would require two python longs and a bool in the set, and a python long in the iterator.
>
> One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.

I think you're over-engineering the problem.  it's a bug if an iterator
is used after some change is made to the set it's iterating over.  We
don't need to catch every possible instance of the bug, that's what
testing is for.  The point is to "probably" detect it, and for that, all
we need is a counter in the set and a counter in the open iterator. 
Whenever changing the set, increment its count.  And whenever iterating,
check the two counters.  if they don't agree, throw an exception, and
destroy the iterator.  i suppose that could be done with a flag, but it
could just as easily be done by zeroing the pointer to the set.

I'd figure a byte or two would be good enough for the counts, but a C
uint would be somewhat faster, and wouldn't wrap as quickly.

-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#28375

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-03 17:24 -0700
Message-ID	<d18e81d9-eed7-4f3c-b62d-e3be13774ec1@googlegroups.com>
In reply to	#28370

On Monday, September 3, 2012 3:28:28 PM UTC-5, Dave Angel wrote:
> On 09/03/2012 04:04 PM, Aaron Brady wrote:
> 
> > On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
> 
> >> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> >>
> 
> >>> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
> 
> >>
> 
> >>
> 
> >> We could use a Python long; I just don't think the extra overhead is
> 
> >>
> 
> >> justified in a data structure that is already highly optimized for
> 
> >>
> 
> >> speed.  Incrementing and testing a C int is *much* faster than doing
> 
> >>
> 
> >> the same with a Python long.
> 
> > I think the technique would require two python longs and a bool in the set, and a python long in the iterator.
> 
> >
> 
> > One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.
> 
> 
> 
> I think you're over-engineering the problem.  it's a bug if an iterator
> 
> is used after some change is made to the set it's iterating over.  We
> 
> don't need to catch every possible instance of the bug, that's what
> 
> testing is for.  The point is to "probably" detect it, and for that, all
> 
> we need is a counter in the set and a counter in the open iterator. 
> 
> Whenever changing the set, increment its count.  And whenever iterating,
> 
> check the two counters.  if they don't agree, throw an exception, and
> 
> destroy the iterator.  i suppose that could be done with a flag, but it
> 
> could just as easily be done by zeroing the pointer to the set.
> 
> 
> 
> I'd figure a byte or two would be good enough for the counts, but a C
> 
> uint would be somewhat faster, and wouldn't wrap as quickly.
> 
> 
> 
> -- 
> 
> 
> 
> DaveA

Hi D. Angel,

The serial index constantly reminds me of upper limits.  I have the same problem with PHP arrays, though it's not a problem with the language itself.

The linked list doesn't have a counter, it invalidates iterators when a modification is made, therefore it's the "correct" structure in my interpretation.  But it does seem more precarious comparatively, IMHO.

Both strategies solve the problem I posed originally, they both involve trade-offs, and it's too late to include either in 3.3.0.

[toc] | [prev] | [next] | [standalone]

Page 1 of 3 [1] 2 3 Next page →

csiph-web

set and dict iteration

Contents

#27194 — set and dict iteration

#27195

#27202

#27213

#27218

#27220

#27253

#27255

#27272

#27332

#27333

#27335

#27339

#27744

#27996

#28280

#28366

#28368

#28370

#28375