Groups > comp.lang.python > #27996 > unrolled thread

Re: set and dict iteration

Started by	Ian Kelly <ian.g.kelly@gmail.com>
First post	2012-08-27 13:17 -0600
Last post	2012-09-02 10:43 -0700
Articles	13 — 4 participants

Back to article view | Back to comp.lang.python

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 13:17 -0600
    Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-02 10:43 -0700
      Re: set and dict iteration Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 13:29 -0600
        Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 13:04 -0700
          Re: set and dict iteration Dave Angel <d@davea.name> - 2012-09-03 16:27 -0400
            Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 17:24 -0700
            Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 17:24 -0700
        Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-03 13:04 -0700
          Re: set and dict iteration Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 01:26 +0000
            Re: set and dict iteration Dave Angel <d@davea.name> - 2012-09-03 21:50 -0400
              Re: set and dict iteration Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 01:59 +0000
                Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-08 08:42 -0700
    Re: set and dict iteration Aaron Brady <castironpi@gmail.com> - 2012-09-02 10:43 -0700

#27996 — Re: set and dict iteration

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-08-27 13:17 -0600
Subject	Re: set and dict iteration
Message-ID	<mailman.3883.1346095064.4697.python-list@python.org>

On Thu, Aug 23, 2012 at 10:49 AM, Aaron Brady <castironpi@gmail.com> wrote:
> The patch for the above is only 40-60 lines.  However it introduces two new concepts.

Is there a link to the patch?

> The first is a "linked list", a classic dynamic data structure, first developed in 1955, cf. http://en.wikipedia.org/wiki/Linked_list .  Linked lists are absent in Python, including the standard library and CPython implementation, beyond the weak reference mechanism and garbage collector.  The "collections.deque" structure shares some of the linked list interface but uses arrays.
>
> The second is "uncounted references".  The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program.  The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types.  Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.
>
> Please share your opinion!  Do you agree that the internal list resolves the inconsistency?  Do you agree with the strategy?  Do you agree that uncounted references are justified to introduce, or are counted references preferable?

This feature is a hard sell as it is; I think that adding uncounted
references into the mix is only going to make that worse.  May I
suggest an alternate approach?  Internally tag each set or dict with a
"version", which is just a C int.  Every time the hash table is
modified, increment the version.  When an iterator is created, store
the current version on the iterator.  When the iterator is advanced,
check that the iterator version matches the dict/set version.  If
they're not equal, raise an error.

This should add less overhead than the linked list without any
concerns about reference counting.  It does introduce a small bug in
that an error condition could be "missed", if the version is
incremented a multiple of 2**32 or 2**64 times between iterations --
but how often is that really likely to occur?  Bearing in mind that
this error is meant for debugging and not production error handling,
you could even make the version a single byte and I'd still be fine
with that.

Cheers,
Ian

[toc] | [next] | [standalone]

#28280

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-02 10:43 -0700
Message-ID	<1567e8c7-a2bb-41f4-9be8-18e9f4d063cb@googlegroups.com>
In reply to	#27996

On Monday, August 27, 2012 2:17:45 PM UTC-5, Ian wrote:
> On Thu, Aug 23, 2012 at 10:49 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > The patch for the above is only 40-60 lines.  However it introduces two new concepts.
> 
> 
> 
> Is there a link to the patch?

Please see below.  It grew somewhat during development.

> > The first is a "linked list".
SNIP.
 
> > The second is "uncounted references".  The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program.  The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types.  Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.
> 
> >
> 
> > Please share your opinion!  Do you agree that the internal list resolves the inconsistency?  Do you agree with the strategy?  Do you agree that uncounted references are justified to introduce, or are counted references preferable?
> 
> 
> 
> This feature is a hard sell as it is; I think that adding uncounted
> 
> references into the mix is only going to make that worse.  May I
> 
> suggest an alternate approach?  Internally tag each set or dict with a
> 
> "version", which is just a C int.  Every time the hash table is
> 
> modified, increment the version.  When an iterator is created, store
> 
> the current version on the iterator.  When the iterator is advanced,
> 
> check that the iterator version matches the dict/set version.  If
> 
> they're not equal, raise an error.
> 
> 
> 
> This should add less overhead than the linked list without any
> 
> concerns about reference counting.  It does introduce a small bug in
> 
> that an error condition could be "missed", if the version is
> 
> incremented a multiple of 2**32 or 2**64 times between iterations --
> 
> but how often is that really likely to occur?  Bearing in mind that
> 
> this error is meant for debugging and not production error handling,
> 
> you could even make the version a single byte and I'd still be fine
> 
> with that.
> 
> 
> 
> Cheers,
> 
> Ian


Hi Ian,

We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.  Using the linked list collection, modification operations are expensive in rare cases.  Using the version index, iteration is expensive in rare cases.

I was more interested in the linked list for conceptual reasons, so I developed it further.  Changelog, diff file, test suite, and links are below.  The devs should be aware that a competing patch might be developed.  I would be pleased to hear what everybody thinks of it!

Linked list with uncounted references implementation for Win32.

Added:
- 'set_clear_ex' and 'set_clear_internal_ex' methods, differ in invalidation and conditional invalidation behavior and return type..  The 'set.clear()' method and 'tp_clear' type field both called the same method.
- 'set_invalidate_iter_linked' method.  Iterate over the iterators of a set, mark them invalid, and clear the list.
- 'setiter_unlink_internal' method.  Remove the iterator from the set's linked list of iterators.
- 'IterationError', global.
- New fields:
-- PySetObject: setiterobject *iter_linked.  Pointer to the first element of the linked list of the iterators of the set.
-- setiterobject: setiterobject *linked_pred, *linked_succ.  Predecessor and successor nodes in the linked list of iterators of the same set.
-- setiterobject: char ob_valid.  Validation status of the iterator.
- Result is compared with original in 'set_intersection_update' and '_multi' to determine whether to invalidate the list of iterators.  Asymptotic running time is unchanged.
- Pending: add 'tp_clear' field to 'PySetIter_Type'?
- Test script included, 'large numbers' test pending.

6 files changed: { setobject.h, setobject.c, exceptions.c, pyerrors.h, python3.def, python33stub.def }.  Test script 'set_iterator_test.py' new.  Linked list interface and pseudocode 'patch_pseudocode.txt'.
Zip file:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_patch.zip
Diff file of 3.3.0b2:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_diff.txt

[toc] | [prev] | [next] | [standalone]

#28366

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2012-09-03 13:29 -0600
Message-ID	<mailman.154.1346700607.27098.python-list@python.org>
In reply to	#28280

On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.

We could use a Python long; I just don't think the extra overhead is
justified in a data structure that is already highly optimized for
speed.  Incrementing and testing a C int is *much* faster than doing
the same with a Python long.

[toc] | [prev] | [next] | [standalone]

#28368

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-03 13:04 -0700
Message-ID	<ed8c3b7a-f9ff-45fe-9e2c-a225764da7ae@googlegroups.com>
In reply to	#28366

On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
> 
> 
> 
> We could use a Python long; I just don't think the extra overhead is
> 
> justified in a data structure that is already highly optimized for
> 
> speed.  Incrementing and testing a C int is *much* faster than doing
> 
> the same with a Python long.

I think the technique would require two python longs and a bool in the set, and a python long in the iterator.

One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.

[toc] | [prev] | [next] | [standalone]

#28370

From	Dave Angel <d@davea.name>
Date	2012-09-03 16:27 -0400
Message-ID	<mailman.156.1346704107.27098.python-list@python.org>
In reply to	#28368

On 09/03/2012 04:04 PM, Aaron Brady wrote:
> On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
>> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
>>
>>> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
>>
>>
>> We could use a Python long; I just don't think the extra overhead is
>>
>> justified in a data structure that is already highly optimized for
>>
>> speed.  Incrementing and testing a C int is *much* faster than doing
>>
>> the same with a Python long.
> I think the technique would require two python longs and a bool in the set, and a python long in the iterator.
>
> One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.

I think you're over-engineering the problem.  it's a bug if an iterator
is used after some change is made to the set it's iterating over.  We
don't need to catch every possible instance of the bug, that's what
testing is for.  The point is to "probably" detect it, and for that, all
we need is a counter in the set and a counter in the open iterator. 
Whenever changing the set, increment its count.  And whenever iterating,
check the two counters.  if they don't agree, throw an exception, and
destroy the iterator.  i suppose that could be done with a flag, but it
could just as easily be done by zeroing the pointer to the set.

I'd figure a byte or two would be good enough for the counts, but a C
uint would be somewhat faster, and wouldn't wrap as quickly.

-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#28375

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-03 17:24 -0700
Message-ID	<d18e81d9-eed7-4f3c-b62d-e3be13774ec1@googlegroups.com>
In reply to	#28370

On Monday, September 3, 2012 3:28:28 PM UTC-5, Dave Angel wrote:
> On 09/03/2012 04:04 PM, Aaron Brady wrote:
> 
> > On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
> 
> >> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> >>
> 
> >>> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
> 
> >>
> 
> >>
> 
> >> We could use a Python long; I just don't think the extra overhead is
> 
> >>
> 
> >> justified in a data structure that is already highly optimized for
> 
> >>
> 
> >> speed.  Incrementing and testing a C int is *much* faster than doing
> 
> >>
> 
> >> the same with a Python long.
> 
> > I think the technique would require two python longs and a bool in the set, and a python long in the iterator.
> 
> >
> 
> > One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.
> 
> 
> 
> I think you're over-engineering the problem.  it's a bug if an iterator
> 
> is used after some change is made to the set it's iterating over.  We
> 
> don't need to catch every possible instance of the bug, that's what
> 
> testing is for.  The point is to "probably" detect it, and for that, all
> 
> we need is a counter in the set and a counter in the open iterator. 
> 
> Whenever changing the set, increment its count.  And whenever iterating,
> 
> check the two counters.  if they don't agree, throw an exception, and
> 
> destroy the iterator.  i suppose that could be done with a flag, but it
> 
> could just as easily be done by zeroing the pointer to the set.
> 
> 
> 
> I'd figure a byte or two would be good enough for the counts, but a C
> 
> uint would be somewhat faster, and wouldn't wrap as quickly.
> 
> 
> 
> -- 
> 
> 
> 
> DaveA

Hi D. Angel,

The serial index constantly reminds me of upper limits.  I have the same problem with PHP arrays, though it's not a problem with the language itself.

The linked list doesn't have a counter, it invalidates iterators when a modification is made, therefore it's the "correct" structure in my interpretation.  But it does seem more precarious comparatively, IMHO.

Both strategies solve the problem I posed originally, they both involve trade-offs, and it's too late to include either in 3.3.0.

[toc] | [prev] | [next] | [standalone]

#28376

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-03 17:24 -0700
Message-ID	<mailman.160.1346718245.27098.python-list@python.org>
In reply to	#28370

On Monday, September 3, 2012 3:28:28 PM UTC-5, Dave Angel wrote:
> On 09/03/2012 04:04 PM, Aaron Brady wrote:
> 
> > On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
> 
> >> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> >>
> 
> >>> We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
> 
> >>
> 
> >>
> 
> >> We could use a Python long; I just don't think the extra overhead is
> 
> >>
> 
> >> justified in a data structure that is already highly optimized for
> 
> >>
> 
> >> speed.  Incrementing and testing a C int is *much* faster than doing
> 
> >>
> 
> >> the same with a Python long.
> 
> > I think the technique would require two python longs and a bool in the set, and a python long in the iterator.
> 
> >
> 
> > One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.
> 
> 
> 
> I think you're over-engineering the problem.  it's a bug if an iterator
> 
> is used after some change is made to the set it's iterating over.  We
> 
> don't need to catch every possible instance of the bug, that's what
> 
> testing is for.  The point is to "probably" detect it, and for that, all
> 
> we need is a counter in the set and a counter in the open iterator. 
> 
> Whenever changing the set, increment its count.  And whenever iterating,
> 
> check the two counters.  if they don't agree, throw an exception, and
> 
> destroy the iterator.  i suppose that could be done with a flag, but it
> 
> could just as easily be done by zeroing the pointer to the set.
> 
> 
> 
> I'd figure a byte or two would be good enough for the counts, but a C
> 
> uint would be somewhat faster, and wouldn't wrap as quickly.
> 
> 
> 
> -- 
> 
> 
> 
> DaveA

Hi D. Angel,

The serial index constantly reminds me of upper limits.  I have the same problem with PHP arrays, though it's not a problem with the language itself.

The linked list doesn't have a counter, it invalidates iterators when a modification is made, therefore it's the "correct" structure in my interpretation.  But it does seem more precarious comparatively, IMHO.

Both strategies solve the problem I posed originally, they both involve trade-offs, and it's too late to include either in 3.3.0.

[toc] | [prev] | [next] | [standalone]

#28369

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-03 13:04 -0700
Message-ID	<mailman.155.1346702665.27098.python-list@python.org>
In reply to	#28366

On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.
> 
> 
> 
> We could use a Python long; I just don't think the extra overhead is
> 
> justified in a data structure that is already highly optimized for
> 
> speed.  Incrementing and testing a C int is *much* faster than doing
> 
> the same with a Python long.

I think the technique would require two python longs and a bool in the set, and a python long in the iterator.

One long counts the number of existing (open) iterators.  Another counts the version.  The bool keeps track of whether an iterator has been created since the last modification, in which case the next modification requires incrementing the version and resetting the flag.

[toc] | [prev] | [next] | [standalone]

#28379

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-04 01:26 +0000
Message-ID	<504558cb$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28369

On Mon, 03 Sep 2012 13:04:23 -0700, Aaron Brady wrote:

> On Monday, September 3, 2012 2:30:24 PM UTC-5, Ian wrote:
>> On Sun, Sep 2, 2012 at 11:43 AM, Aaron Brady <castironpi@gmail.com>
>> wrote:
>> 
>> > We could use a Python long object for the version index to prevent
>> > overflow.  Combined with P. Rubin's idea to count the number of open
>> > iterators, most use cases still wouldn't exceed a single word
>> > comparison; we could reset the counter when there weren't any.
>> 
>> We could use a Python long; I just don't think the extra overhead is
>> justified in a data structure that is already highly optimized for
>> speed.  Incrementing and testing a C int is *much* faster than doing
>> the same with a Python long.
> 
> I think the technique would require two python longs and a bool in the
> set, and a python long in the iterator.
> 
> One long counts the number of existing (open) iterators.  Another counts
> the version.  The bool keeps track of whether an iterator has been
> created since the last modification, in which case the next modification
> requires incrementing the version and resetting the flag.

I think that is over-engineered and could be the difference between 
having the patch accepted and having it rejected. After all, many people 
will argue that the existing solution to the problem is good enough.

Dicts are extremely common, and your patch increases both the memory 
usage of every dict, and the overhead of every write operation 
(__setitem__, __delitem__, update). Only a very few dicts will actually 
need this overhead, for the rest it is waste. It is important to keep 
that waste to a minimum or risk having the patch rejected.

An unsigned C int can count up to 4,294,967,295. I propose that you say 
that is enough iterators for anyone, and use a single, simple, version 
counter in the dict and the iterator. If somebody exceeds that many 
iterators to a single dict or set, and the version field overflows by 
exactly 2**32 versions, the results are no worse than they are now. You 
won't be introducing any new bugs.

Complicating the solution is, in my opinion, unnecessary. Why should 
every set and dict carry the cost of incrementing TWO Python longs and a 
flag when just a single C int is sufficient for all realistic use-cases?

The opportunity for failure is extremely narrow:

- I must have an iterator over a dict or set;
- and I must have paused the iteration in some way;
- and then I must create exactly 2**32 other iterators to the 
  same dict or set;
- and at some point modify the dict or set
- and then restart the first iterator

at which point some items returned by the iterator *may* be duplicated or 
skipped (depends on the nature of the modifications).

-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28381

From	Dave Angel <d@davea.name>
Date	2012-09-03 21:50 -0400
Message-ID	<mailman.162.1346723805.27098.python-list@python.org>
In reply to	#28379

On 09/03/2012 09:26 PM, Steven D'Aprano wrote:
> On Mon, 03 Sep 2012 13:04:23 -0700, Aaron Brady wrote:
>
>> <snip>
>>
>> I think the technique would require two python longs and a bool in the
>> set, and a python long in the iterator.
>>
>> One long counts the number of existing (open) iterators.  Another counts
>> the version.  The bool keeps track of whether an iterator has been
>> created since the last modification, in which case the next modification
>> requires incrementing the version and resetting the flag.
> I think that is over-engineered and could be the difference between 
> having the patch accepted and having it rejected. After all, many people 
> will argue that the existing solution to the problem is good enough.
>
> Dicts are extremely common, and your patch increases both the memory 
> usage of every dict, and the overhead of every write operation 
> (__setitem__, __delitem__, update). Only a very few dicts will actually 
> need this overhead, for the rest it is waste. It is important to keep 
> that waste to a minimum or risk having the patch rejected.
>
> An unsigned C int can count up to 4,294,967,295. I propose that you say 
> that is enough iterators for anyone, and use a single, simple, version 
> counter in the dict and the iterator. If somebody exceeds that many 
> iterators to a single dict or set, 

I think you have the count confused.  it has to be a count of how many
changes have been made to the dict or set, not how many iterators exist.

> and the version field overflows by 
> exactly 2**32 versions, the results are no worse than they are now. You 
> won't be introducing any new bugs.
>
> Complicating the solution is, in my opinion, unnecessary. Why should 
> every set and dict carry the cost of incrementing TWO Python longs and a 
> flag when just a single C int is sufficient for all realistic use-cases?
>
> The opportunity for failure is extremely narrow:
>
> - I must have an iterator over a dict or set;
> - and I must have paused the iteration in some way;
> - and then I must create exactly 2**32 other iterators to the 
>   same dict or set;
> - and at some point modify the dict or set
> - and then restart the first iterator
>
> at which point some items returned by the iterator *may* be duplicated or 
> skipped (depends on the nature of the modifications).
>
>

I agree with almost your entire point, exact that the place where it
would fail to detect a bug is when somebody has modified the dict
exactly 2**32 times while an iterator is paused.  See my own response,
of 4:27pm. (my time).

-- 

DaveA

[toc] | [prev] | [next] | [standalone]

#28382

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-09-04 01:59 +0000
Message-ID	<50456073$0$29978$c3e8da3$5496439d@news.astraweb.com>
In reply to	#28381

On Mon, 03 Sep 2012 21:50:57 -0400, Dave Angel wrote:

> On 09/03/2012 09:26 PM, Steven D'Aprano wrote:

>> An unsigned C int can count up to 4,294,967,295. I propose that you say
>> that is enough iterators for anyone, and use a single, simple, version
>> counter in the dict and the iterator. If somebody exceeds that many
>> iterators to a single dict or set,
> 
> I think you have the count confused.  it has to be a count of how many
> changes have been made to the dict or set, not how many iterators exist.

Oops, yes you are absolutely right. It's a version number, not a count of 
iterators.


-- 
Steven

[toc] | [prev] | [next] | [standalone]

#28726

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-08 08:42 -0700
Message-ID	<87f103a7-2bbd-405f-8a5e-3aaf02c80fff@googlegroups.com>
In reply to	#28382

On Monday, September 3, 2012 8:59:16 PM UTC-5, Steven D'Aprano wrote:
> On Mon, 03 Sep 2012 21:50:57 -0400, Dave Angel wrote:
> 
> 
> 
> > On 09/03/2012 09:26 PM, Steven D'Aprano wrote:
> 
> 
> 
> >> An unsigned C int can count up to 4,294,967,295. I propose that you say
> 
> >> that is enough iterators for anyone, and use a single, simple, version
> 
> >> counter in the dict and the iterator. If somebody exceeds that many
> 
> >> iterators to a single dict or set,
> 
> > 
> 
> > I think you have the count confused.  it has to be a count of how many
> 
> > changes have been made to the dict or set, not how many iterators exist.
> 
> 
> 
> Oops, yes you are absolutely right. It's a version number, not a count of 
> 
> iterators.
> 
> 
> 
> 
> 
> -- 
> 
> Steven

Hello.  We have a number of proposed solutions so far.

1) Collection of iterators
  a) Linked list
    i) Uncounted references
    ii) Counted references
    iii) Weak references
  b) Weak set
2) Serial index / timestamp
  a) No overflow - Python longs
  b) Overflow - C ints / shorts / chars
  c) Reset index if no iterators left
3) Iterator count
  - Raise exception on set modifications, not iteration

Note, "2b" still leaves the possibility of missing a case and letting an error pass silently, as the current behavior does.  The rest catch the error 100% of the time.

Anyway, I plan to develop the above patch for the 'dict' class.  Would anyone like to take over or help me do it?

[toc] | [prev] | [next] | [standalone]

#28281

From	Aaron Brady <castironpi@gmail.com>
Date	2012-09-02 10:43 -0700
Message-ID	<mailman.94.1346607808.27098.python-list@python.org>
In reply to	#27996

On Monday, August 27, 2012 2:17:45 PM UTC-5, Ian wrote:
> On Thu, Aug 23, 2012 at 10:49 AM, Aaron Brady <castironpi@gmail.com> wrote:
> 
> > The patch for the above is only 40-60 lines.  However it introduces two new concepts.
> 
> 
> 
> Is there a link to the patch?

Please see below.  It grew somewhat during development.

> > The first is a "linked list".
SNIP.
 
> > The second is "uncounted references".  The uncounted references are references to "set iterators" exclusively, exist only internally to "set" objects, and are invisible to the rest of the program.  The reason for the exception is that iterators are unique in the Python Data Model; iterators consist of a single immutable reference, unlike both immutable types such as strings and numbers, as well as container types.  Counted references could be used instead, but would be consistently wasted work for the garbage collector, though the benefit to programmers' peace of mind could be significant.
> 
> >
> 
> > Please share your opinion!  Do you agree that the internal list resolves the inconsistency?  Do you agree with the strategy?  Do you agree that uncounted references are justified to introduce, or are counted references preferable?
> 
> 
> 
> This feature is a hard sell as it is; I think that adding uncounted
> 
> references into the mix is only going to make that worse.  May I
> 
> suggest an alternate approach?  Internally tag each set or dict with a
> 
> "version", which is just a C int.  Every time the hash table is
> 
> modified, increment the version.  When an iterator is created, store
> 
> the current version on the iterator.  When the iterator is advanced,
> 
> check that the iterator version matches the dict/set version.  If
> 
> they're not equal, raise an error.
> 
> 
> 
> This should add less overhead than the linked list without any
> 
> concerns about reference counting.  It does introduce a small bug in
> 
> that an error condition could be "missed", if the version is
> 
> incremented a multiple of 2**32 or 2**64 times between iterations --
> 
> but how often is that really likely to occur?  Bearing in mind that
> 
> this error is meant for debugging and not production error handling,
> 
> you could even make the version a single byte and I'd still be fine
> 
> with that.
> 
> 
> 
> Cheers,
> 
> Ian


Hi Ian,

We could use a Python long object for the version index to prevent overflow.  Combined with P. Rubin's idea to count the number of open iterators, most use cases still wouldn't exceed a single word comparison; we could reset the counter when there weren't any.  Using the linked list collection, modification operations are expensive in rare cases.  Using the version index, iteration is expensive in rare cases.

I was more interested in the linked list for conceptual reasons, so I developed it further.  Changelog, diff file, test suite, and links are below.  The devs should be aware that a competing patch might be developed.  I would be pleased to hear what everybody thinks of it!

Linked list with uncounted references implementation for Win32.

Added:
- 'set_clear_ex' and 'set_clear_internal_ex' methods, differ in invalidation and conditional invalidation behavior and return type..  The 'set.clear()' method and 'tp_clear' type field both called the same method.
- 'set_invalidate_iter_linked' method.  Iterate over the iterators of a set, mark them invalid, and clear the list.
- 'setiter_unlink_internal' method.  Remove the iterator from the set's linked list of iterators.
- 'IterationError', global.
- New fields:
-- PySetObject: setiterobject *iter_linked.  Pointer to the first element of the linked list of the iterators of the set.
-- setiterobject: setiterobject *linked_pred, *linked_succ.  Predecessor and successor nodes in the linked list of iterators of the same set.
-- setiterobject: char ob_valid.  Validation status of the iterator.
- Result is compared with original in 'set_intersection_update' and '_multi' to determine whether to invalidate the list of iterators.  Asymptotic running time is unchanged.
- Pending: add 'tp_clear' field to 'PySetIter_Type'?
- Test script included, 'large numbers' test pending.

6 files changed: { setobject.h, setobject.c, exceptions.c, pyerrors.h, python3.def, python33stub.def }.  Test script 'set_iterator_test.py' new.  Linked list interface and pseudocode 'patch_pseudocode.txt'.
Zip file:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_patch.zip
Diff file of 3.3.0b2:
http://home.comcast.net/~castironpi-misc/clpy-0062-set_iterator_diff.txt

[toc] | [prev] | [standalone]

csiph-web

Re: set and dict iteration

Contents

#27996 — Re: set and dict iteration

#28280

#28366

#28368

#28370

#28375

#28376

#28369

#28379

#28381

#28382

#28726

#28281