Groups > comp.lang.python > #33729 > unrolled thread

Inconsistent behaviour os str.find/str.index when providing optional parameters

Started by	Giacomo Alzetta <giacomo.alzetta@gmail.com>
First post	2012-11-21 04:43 -0800
Last post	2012-11-21 23:01 -0800
Articles	12 — 6 participants

  Inconsistent behaviour os str.find/str.index when providing optional parameters Giacomo Alzetta <giacomo.alzetta@gmail.com> - 2012-11-21 04:43 -0800
    Re: Inconsistent behaviour os str.find/str.index when providing optional parameters MRAB <python@mrabarnett.plus.com> - 2012-11-21 13:32 +0000
    Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Alister <alister.ware@ntlworld.com> - 2012-11-21 16:59 +0000
      Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Hans Mulder <hansmu@xs4all.nl> - 2012-11-21 20:25 +0100
        Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Giacomo Alzetta <giacomo.alzetta@gmail.com> - 2012-11-21 12:21 -0800
        Re: Inconsistent behaviour os str.find/str.index when providing optional parameters MRAB <python@mrabarnett.plus.com> - 2012-11-21 20:58 +0000
    Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Terry Reedy <tjreedy@udel.edu> - 2012-11-21 22:41 -0500
    Re: Inconsistent behaviour os str.find/str.index when providing optional parameters MRAB <python@mrabarnett.plus.com> - 2012-11-22 04:00 +0000
      Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Giacomo Alzetta <giacomo.alzetta@gmail.com> - 2012-11-21 23:01 -0800
        Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-11-22 08:44 +0000
          Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Giacomo Alzetta <giacomo.alzetta@gmail.com> - 2012-11-22 10:22 -0800
      Re: Inconsistent behaviour os str.find/str.index when providing optional parameters Giacomo Alzetta <giacomo.alzetta@gmail.com> - 2012-11-21 23:01 -0800

#33729 — Inconsistent behaviour os str.find/str.index when providing optional parameters

From	Giacomo Alzetta <giacomo.alzetta@gmail.com>
Date	2012-11-21 04:43 -0800
Subject	Inconsistent behaviour os str.find/str.index when providing optional parameters
Message-ID	<9ecd357d-aaaa-4f4d-a987-a478e92b2052@googlegroups.com>

I just came across this:

>>> 'spam'.find('', 5)
-1


Now, reading find's documentation:

>>> print(str.find.__doc__)
S.find(sub [,start [,end]]) -> int

Return the lowest index in S where substring sub is found,
such that sub is contained within S[start:end].  Optional
arguments start and end are interpreted as in slice notation.

Return -1 on failure.

Now, the empty string is a substring of every string so how can find fail?
From the doc, find should generally be equivalent to S[start:end].find(substring) + start, except when the substring is not found. But since the empty string is a substring of the empty string, it should never fail.

Looking at the source code for find (in stringlib/find.h):

Py_LOCAL_INLINE(Py_ssize_t)
stringlib_find(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
               const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
               Py_ssize_t offset)
{
    Py_ssize_t pos;

    if (str_len < 0)
        return -1;

I believe it should be:

    if (str_len < 0)
        return (sub_len == 0 ? 0 : -1);

Is there any reason for this unexpected behaviour, or was it simply overlooked?
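
The slice-based reading of the docstring that this post relies on can be sketched in Python. The helper name slice_find is hypothetical, and the sketch assumes a non-negative start; it applies the formula S[start:end].find(sub) + start literally and prints it next to the built-in result.

```python
def slice_find(s, sub, start=0, end=None):
    """Hypothetical helper: apply S[start:end].find(sub) + start literally.

    Assumes start >= 0; negative indexes are not handled in this sketch.
    """
    pos = s[start:end].find(sub)  # end=None slices to the end of s
    return -1 if pos == -1 else pos + start


# Columns: start, built-in str.find, literal slice formula
for start in (0, 4, 5):
    print(start, "spam".find("", start), slice_find("spam", "", start))
```

The two columns agree for start 0 and 4 and differ at start 5, where the built-in returns -1 and the literal formula gives 5.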


#33735

From	MRAB <python@mrabarnett.plus.com>
Date	2012-11-21 13:32 +0000
Message-ID	<mailman.153.1353504767.29569.python-list@python.org>
In reply to	#33729

On 2012-11-21 12:43, Giacomo Alzetta wrote:
> I just came across this:
>
>>>> 'spam'.find('', 5)
> -1
>
>
> Now, reading find's documentation:
>
>>>> print(str.find.__doc__)
> S.find(sub [,start [,end]]) -> int
>
> Return the lowest index in S where substring sub is found,
> such that sub is contained within S[start:end].  Optional
> arguments start and end are interpreted as in slice notation.
>
> Return -1 on failure.
>
> Now, the empty string is a substring of every string so how can find fail?
> find, from the doc, should be generally be equivalent to S[start:end].find(substring) + start, except if the substring is not found but since the empty string is a substring of the empty string it should never fail.
>
[snip]
I think that returning -1 is correct (as far as returning -1 instead of
raising an exception like .index could be considered correct!) because
otherwise it would be returning a non-existent index. For the string
"spam", the range is 0..4.
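
The range mentioned above can be seen by searching for the empty string from each start position in turn. A minimal sketch (the variable names are only for illustration):

```python
s = "spam"
# Search for the empty string from every start position 0..6.
results = [s.find("", start) for start in range(7)]
print(results)  # [0, 1, 2, 3, 4, -1, -1]
```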


#33749

From	Alister <alister.ware@ntlworld.com>
Date	2012-11-21 16:59 +0000
Message-ID	<tL7rs.628211$Ol2.3970@fx25.am4>
In reply to	#33729

On Wed, 21 Nov 2012 04:43:57 -0800, Giacomo Alzetta wrote:

> I just came across this:
> 
>>>> 'spam'.find('', 5)
> -1
> 
> 
> Now, reading find's documentation:
> 
>>>> print(str.find.__doc__)
> S.find(sub [,start [,end]]) -> int
> 
> Return the lowest index in S where substring sub is found,
> such that sub is contained within S[start:end].  Optional arguments
> start and end are interpreted as in slice notation.
> 
> Return -1 on failure.
> 
> Now, the empty string is a substring of every string so how can find
> fail?
> find, from the doc, should be generally be equivalent to
> S[start:end].find(substring) + start, except if the substring is not
> found but since the empty string is a substring of the empty string it
> should never fail.
> 
> Looking at the source code for find(in stringlib/find.h):
> 
> Py_LOCAL_INLINE(Py_ssize_t)
> stringlib_find(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
>                const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
>                Py_ssize_t offset)
> {
>     Py_ssize_t pos;
> 
>     if (str_len < 0)
>         return -1;
> 
> I believe it should be:
> 
>     if (str_len < 0)
>         return (sub_len == 0 ? 0 : -1);
> 
> Is there any reason of having this unexpected behaviour or was this
> simply overlooked?

Why would you be searching for an empty string?
What result would you expect to get from such a search?



-- 
Turn your Pentium into an XT -- just add Windows!


#33759

From	Hans Mulder <hansmu@xs4all.nl>
Date	2012-11-21 20:25 +0100
Message-ID	<50ad2a95$0$6907$e4fe514c@news2.news.xs4all.nl>
In reply to	#33749

On 21/11/12 17:59:05, Alister wrote:
> On Wed, 21 Nov 2012 04:43:57 -0800, Giacomo Alzetta wrote:
> 
>> I just came across this:
>>
>>>>> 'spam'.find('', 5)
>> -1
>>
>>
>> Now, reading find's documentation:
>>
>>>>> print(str.find.__doc__)
>> S.find(sub [,start [,end]]) -> int
>>
>> Return the lowest index in S where substring sub is found,
>> such that sub is contained within S[start:end].  Optional arguments
>> start and end are interpreted as in slice notation.
>>
>> Return -1 on failure.
>>
>> Now, the empty string is a substring of every string so how can find
>> fail?
>> find, from the doc, should be generally be equivalent to
>> S[start:end].find(substring) + start, except if the substring is not
>> found but since the empty string is a substring of the empty string it
>> should never fail.
>>
>> Looking at the source code for find(in stringlib/find.h):
>>
>> Py_LOCAL_INLINE(Py_ssize_t)
>> stringlib_find(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
>>                const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
>>                Py_ssize_t offset)
>> {
>>     Py_ssize_t pos;
>>
>>     if (str_len < 0)
>>         return -1;
>>
>> I believe it should be:
>>
>>     if (str_len < 0)
>>         return (sub_len == 0 ? 0 : -1);
>>
>> Is there any reason of having this unexpected behaviour or was this
>> simply overlooked?
> 
> why would you be searching for an empty string?
> what result would you expect to get from such a search?


In general, if

    needle in haystack[ start: ]

returns True, then you'd expect

    haystack.find(needle, start)

to return the smallest i >= start such that

    haystack[i:i+len(needle)] == needle

also returns True.

>>> "" in "spam"[5:]
True
>>> "spam"[5:5+len("")] == ""
True
>>>

So, you'd expect that "spam".find("", 5) would return 5.

The only other consistent position would be that "spam"[5:]
should raise an IndexError, because 5 is an invalid index.

For that matter, I wouldn't mind if "spam".find(s, 5) were
to raise an IndexError.  But if slicing at position 5
produces an empty string, then .find should be able to
find that empty string.

-- HansM
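
The expectation described in this post can be checked mechanically. The sketch below uses a hypothetical helper, violations, that lists every (needle, start) pair for which `needle in haystack[start:]` is True while find reports failure; the needles are arbitrary examples.

```python
def violations(haystack, needles, max_start):
    """Hypothetical checker: collect (needle, start) pairs where
    `needle in haystack[start:]` is True but find() returns -1."""
    bad = []
    for needle in needles:
        for start in range(max_start + 1):
            in_slice = needle in haystack[start:]
            if in_slice and haystack.find(needle, start) == -1:
                bad.append((needle, start))
    return bad


print(violations("spam", ["", "s", "am", "x"], 6))  # [('', 5), ('', 6)]
```

Only the empty needle with a start past len("spam") breaks the expectation.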


#33767

From	Giacomo Alzetta <giacomo.alzetta@gmail.com>
Date	2012-11-21 12:21 -0800
Message-ID	<a58e52b1-52c1-4ef1-86e4-1cb1e261f6d4@googlegroups.com>
In reply to	#33759

On Wednesday, 21 November 2012 20:25:10 UTC+1, Hans Mulder wrote:
> On 21/11/12 17:59:05, Alister wrote:
> > On Wed, 21 Nov 2012 04:43:57 -0800, Giacomo Alzetta wrote:
> >
> >> I just came across this:
> >>
> >>>>> 'spam'.find('', 5)
> >> -1
> >>
> >>
> >> Now, reading find's documentation:
> >>
> >>>>> print(str.find.__doc__)
> >> S.find(sub [,start [,end]]) -> int
> >>
> >> Return the lowest index in S where substring sub is found,
> >> such that sub is contained within S[start:end].  Optional arguments
> >> start and end are interpreted as in slice notation.
> >>
> >> Return -1 on failure.
> >>
> >> Now, the empty string is a substring of every string so how can find
> >> fail?
> >> find, from the doc, should be generally be equivalent to
> >> S[start:end].find(substring) + start, except if the substring is not
> >> found but since the empty string is a substring of the empty string it
> >> should never fail.
> >>
> >> Looking at the source code for find(in stringlib/find.h):
> >>
> >> Py_LOCAL_INLINE(Py_ssize_t)
> >> stringlib_find(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
> >>                const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
> >>                Py_ssize_t offset)
> >> {
> >>     Py_ssize_t pos;
> >>
> >>     if (str_len < 0)
> >>         return -1;
> >>
> >> I believe it should be:
> >>
> >>     if (str_len < 0)
> >>         return (sub_len == 0 ? 0 : -1);
> >>
> >> Is there any reason of having this unexpected behaviour or was this
> >> simply overlooked?
> >
> > why would you be searching for an empty string?
> > what result would you expect to get from such a search?
>
>
> In general, if
>
>     needle in haystack[ start: ]
>
> return True, then you' expect
>
>     haystack.find(needle, start)
>
> to return the smallest i >= start such that
>
>     haystack[i:i+len(needle)] == needle
>
> also returns True.
>
> >>> "" in "spam"[5:]
> True
> >>> "spam"[5:5+len("")] == ""
> True
> >>>
>
> So, you'd expect that spam.find("", 5) would return 5.
>
> The only other consistent position would be that "spam"[5:]
> should raise an IndexError, because 5 is an invalid index.
>
> For that matter, I wouldn;t mind if "spam".find(s, 5) were
> to raise an IndexError.  But if slicing at position 5
> proudces an empry string, then .find should be able to
> find that empty string.
>
> -- HansM

Exactly! Either string[i:] with i >= len(string) should raise an IndexError or find(string, i) should return i.

Anyway, thinking about it, this inconsistency can be solved in a simpler way, without adding a comparison. You simply check the substring length first. If it is 0, you already know that it is a substring of the given string and you return the "offset", so the two ifs at the beginning of the function ought to be swapped.
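
The effect of swapping the two checks can be modelled in Python. This is only a sketch: find_model is a hypothetical stand-in, not CPython's code, and it assumes that the second check at the top of the C function is an early return of the offset when sub_len == 0 (the thread quotes only the first check). It also assumes non-negative start and end.

```python
def find_model(s, sub, start=0, end=None, empty_check_first=False):
    """Hypothetical model of the early checks; assumes start, end >= 0."""
    if end is None or end > len(s):
        end = len(s)           # end is clamped, start deliberately is not
    str_len = end - start      # negative when start lies past the end
    if empty_check_first and len(sub) == 0:
        return start           # swapped order: empty needle is always "found"
    if str_len < 0:
        return -1              # order quoted in the thread: give up first
    if len(sub) == 0:
        return start           # assumed second check (not quoted in the thread)
    pos = s[start:end].find(sub)
    return -1 if pos == -1 else pos + start


print(find_model("spam", "", 5))                          # -1
print(find_model("spam", "", 5, empty_check_first=True))  # 5
```

With the checks in the quoted order the model matches the built-in for start 5; with the order swapped it returns the offset instead.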


#33770

From	MRAB <python@mrabarnett.plus.com>
Date	2012-11-21 20:58 +0000
Message-ID	<mailman.175.1353531533.29569.python-list@python.org>
In reply to	#33759

On 2012-11-21 19:25, Hans Mulder wrote:
> On 21/11/12 17:59:05, Alister wrote:
>> On Wed, 21 Nov 2012 04:43:57 -0800, Giacomo Alzetta wrote:
>>
>>> I just came across this:
>>>
>>>>>> 'spam'.find('', 5)
>>> -1
>>>
>>>
>>> Now, reading find's documentation:
>>>
>>>>>> print(str.find.__doc__)
>>> S.find(sub [,start [,end]]) -> int
>>>
>>> Return the lowest index in S where substring sub is found,
>>> such that sub is contained within S[start:end].  Optional arguments
>>> start and end are interpreted as in slice notation.
>>>
>>> Return -1 on failure.
>>>
>>> Now, the empty string is a substring of every string so how can find
>>> fail?
>>> find, from the doc, should be generally be equivalent to
>>> S[start:end].find(substring) + start, except if the substring is not
>>> found but since the empty string is a substring of the empty string it
>>> should never fail.
>>>
>>> Looking at the source code for find(in stringlib/find.h):
>>>
>>> Py_LOCAL_INLINE(Py_ssize_t)
>>> stringlib_find(const STRINGLIB_CHAR* str, Py_ssize_t str_len,
>>>                const STRINGLIB_CHAR* sub, Py_ssize_t sub_len,
>>>                Py_ssize_t offset)
>>> {
>>>     Py_ssize_t pos;
>>>
>>>     if (str_len < 0)
>>>         return -1;
>>>
>>> I believe it should be:
>>>
>>>     if (str_len < 0)
>>>         return (sub_len == 0 ? 0 : -1);
>>>
>>> Is there any reason of having this unexpected behaviour or was this
>>> simply overlooked?
>>
>> why would you be searching for an empty string?
>> what result would you expect to get from such a search?
>
>
> In general, if
>
>      needle in haystack[ start: ]
>
> return True, then you' expect
>
>      haystack.find(needle, start)
>
> to return the smallest i >= start such that
>
>      haystack[i:i+len(needle)] == needle
>
> also returns True.
>
>>>> "" in "spam"[5:]
> True
>>>> "spam"[5:5+len("")] == ""
> True
>>>>
>
> So, you'd expect that spam.find("", 5) would return 5.
>
> The only other consistent position would be that "spam"[5:]
> should raise an IndexError, because 5 is an invalid index.
>
> For that matter, I wouldn;t mind if "spam".find(s, 5) were
> to raise an IndexError.  But if slicing at position 5
> proudces an empry string, then .find should be able to
> find that empty string.
>
You'd expect that given:

     found = string.find(something, start, end)

if 'something' is present, then the following are true:

     0 <= found <= len(string)

     start <= found <= end

(I'm assuming here that 'start' and 'end' have already been adjusted
for counting from the end, i.e. originally they might have been negative
values.)

The only time that you can have found == len(string) and found == end
is when something == "" and start == len(string).
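
The two inequalities can be checked by brute force for a small string. The sketch below is an illustration only: check_bounds is a hypothetical helper, the needles are arbitrary, and start and end are restricted to non-negative values, in line with the assumption stated above.

```python
from itertools import product


def check_bounds(string, needles):
    """Hypothetical brute-force check of the two inequalities for
    non-negative start and end values."""
    n = len(string)
    for something, start, end in product(needles, range(n + 3), range(n + 3)):
        found = string.find(something, start, end)
        if found == -1:
            continue
        assert 0 <= found <= n
        assert start <= found <= end
        if found == n:
            # Only the empty string can be "found" at len(string).
            assert something == "" and start == n
    return True


print(check_bounds("spam", ["", "s", "pa", "m", "spam"]))  # True
```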


#33782

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-11-21 22:41 -0500
Message-ID	<mailman.189.1353555686.29569.python-list@python.org>
In reply to	#33729

On 11/21/2012 8:32 AM, MRAB wrote:
> On 2012-11-21 12:43, Giacomo Alzetta wrote:
>> I just came across this:

 >>> 'spam'.find('')
0
 >>> 'spam'.find('', 1)
1
 >>> 'spam'.find('', 4)
4

>>>>> 'spam'.find('', 5)
>> -1
>>
>>
>> Now, reading find's documentation:
>>
>>>>> print(str.find.__doc__)
>> S.find(sub [,start [,end]]) -> int
>>
>> Return the lowest index in S where substring sub is found,
>> such that sub is contained within S[start:end].  Optional
>> arguments start and end are interpreted as in slice notation.

This seems not to be true, as 'spam'[4:] == 'spam'[5:] == ''

>> Return -1 on failure.
>>
>> Now, the empty string is a substring of every string so how can find
>> fail?
>> find, from the doc, should be generally be equivalent to
>> S[start:end].find(substring) + start, except if the substring is not
>> found but since the empty string is a substring of the empty string it
>> should never fail.
>>
> [snip]
> I think that returning -1 is correct (as far as returning -1 instead of
> raising an exception like .index could be considered correct!) because
> otherwise it whould be returning a non-existent index. For the string
> "spam", the range is 0..4.

I tend to agree, but perhaps the doc should be changed. In edge cases 
like this, there sometimes is no 'right' answer. I suspect that the 
current behavior is intentional. You might find a discussion on the tracker.

-- 
Terry Jan Reedy


#33783

From	MRAB <python@mrabarnett.plus.com>
Date	2012-11-22 04:00 +0000
Message-ID	<mailman.190.1353556838.29569.python-list@python.org>
In reply to	#33729

On 2012-11-22 03:41, Terry Reedy wrote:
> On 11/21/2012 8:32 AM, MRAB wrote:
>> On 2012-11-21 12:43, Giacomo Alzetta wrote:
>>> I just came across this:
>
>   >>> 'spam'.find('')
> 0
>   >>> 'spam'.find('', 1)
> 1
>   >>> 'spam'.find('', 4)
> 4
>
>>>>>> 'spam'.find('', 5)
>>> -1
>>>
>>>
>>> Now, reading find's documentation:
>>>
>>>>>> print(str.find.__doc__)
>>> S.find(sub [,start [,end]]) -> int
>>>
>>> Return the lowest index in S where substring sub is found,
>>> such that sub is contained within S[start:end].  Optional
>>> arguments start and end are interpreted as in slice notation.
>
> This seems not to be true, as 'spam'[4:] == 'spam'[5:] == ''
>
It can't return 5 because 5 isn't an index in 'spam'.

It can't return 4 because 4 is below the start index.

>>> Return -1 on failure.
>>>
>>> Now, the empty string is a substring of every string so how can find
>>> fail?
>>> find, from the doc, should be generally be equivalent to
>>> S[start:end].find(substring) + start, except if the substring is not
>>> found but since the empty string is a substring of the empty string it
>>> should never fail.
>>>
>> [snip]
>> I think that returning -1 is correct (as far as returning -1 instead of
>> raising an exception like .index could be considered correct!) because
>> otherwise it whould be returning a non-existent index. For the string
>> "spam", the range is 0..4.
>
> I tend to agree, but perhaps the doc should be changed. In edge cases
> like this, there sometimes is no 'right' answer. I suspect that the
> current behavior is intentional. You might find a discussion on the tracker.
>

It's a special case, but the Zen has something to say about that! :-)

(The empty string is also the only substring which can start at len(S).)


#33789

From	Giacomo Alzetta <giacomo.alzetta@gmail.com>
Date	2012-11-21 23:01 -0800
Message-ID	<04a8334d-dc5c-4745-814a-5e02e04b1950@googlegroups.com>
In reply to	#33783

On Thursday, 22 November 2012 05:00:39 UTC+1, MRAB wrote:
> On 2012-11-22 03:41, Terry Reedy wrote:
> It can't return 5 because 5 isn't an index in 'spam'.
> 
> 
> 
> It can't return 4 because 4 is below the start index.

Uhm. Maybe you are right, because returning a greater value would cause an IndexError, but then, *why* is 4 returned???

>>> 'spam'.find('', 4)
4
>>> 'spam'[4]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: string index out of range

4 is not a valid index either. I do not think the behaviour was completely intentional. If find should return indexes, then 'spam'.find('', 4) must be -1, because 4 is not a valid index. If find should behave as if it created the slice and checked whether the substring is in the slice, then 'spam'.find('', i) should return i for every integer >= 4.

The docstring does not describe this edge case, so I think it could be improved.
If the first sentence (the result being an index in S) is kept, then it shouldn't say that start and end are treated as in slice notation, because that's actually not true. It should add that if start is greater than or equal to len(S) then -1 is always returned (and in this case 'spam'.find('', 4) -> -1).
If find should not guarantee that the value returned is a valid index (when start isn't a valid index), then the first sentence should be rephrased to avoid giving this idea (and the comparisons in stringlib/find.h should be swapped to have the correct behaviour).
For example, maybe, it could be "Return the lowest index where substring sub is found (in S?), such that sub is contained in S[start:end]. ...
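
The first alternative, where find only ever returns a valid index, can be sketched as a thin wrapper around the built-in. The name find_valid_index is hypothetical and the wrapper is an illustration, not a proposed patch.

```python
def find_valid_index(s, sub, start=0):
    """Hypothetical variant: only return an index i for which s[i] exists."""
    pos = s.find(sub, start)
    return pos if pos < len(s) else -1


print("spam".find("", 4), find_valid_index("spam", "", 4))  # 4 -1
```

Under this reading the only results that change are those where the built-in returns len(s), which can only happen for the empty substring.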


#33792

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-11-22 08:44 +0000
Message-ID	<50ade5e5$0$11104$c3e8da3@news.astraweb.com>
In reply to	#33789

On Wed, 21 Nov 2012 23:01:47 -0800, Giacomo Alzetta wrote:

> On Thursday, 22 November 2012 05:00:39 UTC+1, MRAB wrote:
>> On 2012-11-22 03:41, Terry Reedy wrote:
>> It can't return 5 because 5 isn't an index in 'spam'.
>>
>> It can't return 4 because 4 is below the start index.
> 
> Uhm. Maybe you are right, because returning a greater value would cause
> an IndexError, but then, *why* is 4 returned???
> 
>>>> 'spam'.find('', 4)
> 4
>>>> 'spam'[4]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> IndexError: string index out of range
> 
> 4 is not a valid index either. I do not think the behaviour was
> completely intentional.

The behaviour is certainly an edge case, but I think it is correct.

(Correct or not, it has been the same going all the way back to Python 
1.5, before strings even had methods, so it almost certainly will not be 
changed. Changing the behaviour now will very likely break hundreds, 
maybe thousands, of Python programs that expect the current behaviour.)

Consider your string as a sequence of boxes, with index positions 
labelled above the string:

0-1-2-3-4
|s|p|a|m|

The indexing model is that positions represent where you would cut 
*between* characters, not the character itself. Slices are the substring 
between cuts:

"spam"[1:3] => "pa"

while single indexes return the character to the right of the cut:

"spam"[1] => "p"

If there is no character to the right of the cut, indexing raises an 
error.

Now, consider "spam".find(substring, start). This should return the 
number of the first cut immediately to the left of the substring, 
beginning the search at cut #start.

"spam".find("pa", 1) => 1

because cut #1 is immediately to the left of "pa" at index 1.

By this logic, "spam".find("", 4) should return 4, because cut #4 is 
immediately to the left of the empty string. So Python's current 
behaviour is justified.

What about "spam".find("", 5)? Well, if you look at the string with the 
cuts marked as before:

0-1-2-3-4
|s|p|a|m|

you will see that there is no cut #5. Since there is no cut #5, we can't 
sensibly say we found *anything* there, not even the empty string. If you 
have four boxes, you can't say that you found anything in the fifth box.
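
The cut model can be written down directly. The sketch below is a hypothetical pure-Python model (find_cut is not part of any library) that tries each existing cut from start onwards; it assumes a non-negative start.

```python
def find_cut(s, sub, start=0):
    """Hypothetical model: try each existing cut from `start` onwards."""
    for cut in range(start, len(s) + 1):   # cuts are numbered 0..len(s)
        if s[cut:cut + len(sub)] == sub:   # text right of the cut matches
            return cut
    return -1                              # no such cut, so nothing is found


print(find_cut("spam", "pa", 1), find_cut("spam", "", 4), find_cut("spam", "", 5))
# 1 4 -1
```

For start 5 the loop body never runs because there is no cut #5, which is why even the empty string is not found there.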

I realise that this behaviour clashes somewhat with the slicing rule that 
says that if the slice indexes go past the end of the string, you get an 
empty string. But that rule is more for convenience than a fundamental 
rule about strings.

I think there is legitimate room for disagreement about the "right" 
behaviour here, but backwards compatibility trumps logical correctness 
here, and it is very unlikely to be changed.

> The docstring does not describe this edge case, so I think it could be
> improved. If the first sentence(being an index in S) is kept, than it
> shouldn't say that start and end are treated as in slice notation,
> because that's actually not true. 

+1

I think that you are right that the documentation needs to be improved.

-- 
Steven


#33811

From	Giacomo Alzetta <giacomo.alzetta@gmail.com>
Date	2012-11-22 10:22 -0800
Message-ID	<091906a7-7f54-48ae-928f-3acd4f511c43@googlegroups.com>
In reply to	#33792

On Thursday, 22 November 2012 09:44:21 UTC+1, Steven D'Aprano wrote:
> On Wed, 21 Nov 2012 23:01:47 -0800, Giacomo Alzetta wrote:
>
> > On Thursday, 22 November 2012 05:00:39 UTC+1, MRAB wrote:
> >> On 2012-11-22 03:41, Terry Reedy wrote:
> >> It can't return 5 because 5 isn't an index in 'spam'.
> >>
> >> It can't return 4 because 4 is below the start index.
> >
> > Uhm. Maybe you are right, because returning a greater value would cause
> > an IndexError, but then, *why* is 4 returned???
> >
> >>>> 'spam'.find('', 4)
> > 4
> >>>> 'spam'[4]
> > Traceback (most recent call last):
> >   File "<stdin>", line 1, in <module>
> > IndexError: string index out of range
> >
> > 4 is not a valid index either. I do not think the behaviour was
> > completely intentional.
>
> The behaviour is certainly an edge case, but I think it is correct.
>
> (Correct or not, it has been the same going all the way back to Python
> 1.5, before strings even had methods, so it almost certainly will not be
> changed. Changing the behaviour now will very likely break hundreds,
> maybe thousands, of Python programs that expect the current behaviour.)
> 

My point was not to change the behaviour but only to point out this possible inconsistency between what str.find/str.index do and what their documentation claims they do.

Anyway, I'm not so sure that changing the behaviour would break many programs. The change would only affect code that searches for an empty string past the string's bounds. I don't often see the lo and hi parameters of find/index used, and I don't think I have ever seen them used with out-of-bounds values. Add the requirement of searching for the empty string, and I think the number of programs that would break is minimal. And even if they broke, they would be really easy to fix.

Anyway, I understand what you mean, and maybe it's better to keep this (at least to me) odd behaviour for backwards compatibility.

> 
> By this logic, "spam".find("", 4) should return 4, because cut #4 is
> immediately to the left of the empty string. So Python's current
> behaviour is justified.
>
> What about "spam".find("", 5)? Well, if you look at the string with the
> cuts marked as before:
>
> 0-1-2-3-4
> |s|p|a|m|
>
> you will see that there is no cut #5. Since there is no cut #5, we can't
> sensibly say we found *anything* there, not even the empty string. If you
> have four boxes, you can't say that you found anything in the fifth box.
>
> I realise that this behaviour clashes somewhat with the slicing rule that
> says that if the slice indexes go past the end of the string, you get an
> empty string. But that rule is more for convenience than a fundamental
> rule about strings.

Yeah, I understand what you say, but the logic you pointed out is never cited anywhere, while slices are cited in the docstring.

> 
> > The docstring does not describe this edge case, so I think it could be
> > improved. If the first sentence(being an index in S) is kept, than it
> > shouldn't say that start and end are treated as in slice notation,
> > because that's actually not true.
>
> +1
>
> I think that you are right that the documentation needs to be improved.

Definitely. The sentence "Optional arguments start and end are interpreted as in slice notation." should be changed to something like:
"Optional arguments start and end are interpreted as in slice notation, unless start is (strictly?) greater than the length of S or end is smaller than start, in which cases the search always fails."

This way 'spam'.find('', 4) *is* documented: start == len(S), so start and end are treated as in slice notation and 4 makes sense, while 'spam'.find('', 5) -> -1 because 5 > len('spam') and thus the search fails.
'spam'.find('', 3, 2) -> -1 also makes sense because 2 < 3 (this edge case makes more sense, even though 'spam'[3:2] is still the empty string...).
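
The proposed wording can be turned into a small reference function and compared with the built-in. This is a sketch under stated assumptions: documented_find is a hypothetical name, only non-negative start and end are considered, and the test needles are arbitrary.

```python
def documented_find(s, sub, start=0, end=None):
    """Hypothetical reading of the proposed wording; assumes start, end >= 0."""
    if end is None:
        end = len(s)
    if start > len(s) or end < start:
        return -1                     # the two documented failure cases
    pos = s[start:end].find(sub)      # otherwise: plain slice semantics
    return -1 if pos == -1 else pos + start


# Compare with the built-in over a small grid of arguments.
mismatches = [
    (sub, start, end)
    for sub in ["", "s", "pa", "m", "spam", "x"]
    for start in range(7)
    for end in range(7)
    if documented_find("spam", sub, start, end) != "spam".find(sub, start, end)
]
print(mismatches)  # []
```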


