Re: using split for a string : error

Started by	Chris Angelico <rosuav@gmail.com>
First post	2013-01-24 22:35 +1100
Last post	2013-01-25 12:07 +1100
Articles	3 — 2 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: using split for a string : error Chris Angelico <rosuav@gmail.com> - 2013-01-24 22:35 +1100
    Re: using split for a string : error Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-01-25 11:20 +1100
      Re: using split for a string : error Chris Angelico <rosuav@gmail.com> - 2013-01-25 12:07 +1100

#37571 — Re: using split for a string : error

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-24 22:35 +1100
Subject	Re: using split for a string : error
Message-ID	<mailman.969.1359027330.2939.python-list@python.org>

On Thu, Jan 24, 2013 at 10:16 PM, Tobias M. <tm@tobix.eu> wrote:
> Chris Angelico wrote:
>> The other thing you may want to consider, if the values are supposed
>> to be integers, is to convert them to Python integers before
>> comparing.
>
> I thought of this too and I wonder if there are any major differences
> regarding performance compared to using the strip() method when parsing
> large files.
>
> In addition, I guess one should catch the ValueError that might be raised by
> the cast if there is something other than a number in the file.

I'd not consider the performance, but the correctness. If you're
expecting them to be integers, just cast them, and specifically
_don't_ catch ValueError. Any non-integer value will then noisily
abort the script. (It may be worth checking for blank first, though,
depending on the data origin.)
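
A minimal sketch of that approach, assuming one value per line. The `parse_ints` name and the sample data are made up for illustration and do not come from the original poster's script:

```python
def parse_ints(lines):
    """Convert each non-blank line to an int; let ValueError propagate."""
    values = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # blank line: skip it instead of failing
        values.append(int(text))  # deliberately no try/except here
    return values


print(parse_ints(["10\n", "\n", " 42 \n"]))  # prints [10, 42]
```

A line such as "12a" raises ValueError and stops the script, which is the noisy failure described above.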

It's usually fine to have int() complain about any non-numerics in the
string, but I must confess, I do sometimes yearn for atoi() semantics:
atoi("123asd") == 123, and atoi("qqq") == 0. I've not seen a
convenient Python function for doing that. Usually it involves
manually getting the digits off the front. All I want is to suppress
the error on finding a non-digit. Oh well.

ChrisA

#37642

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-01-25 11:20 +1100
Message-ID	<5101cfdb$0$29980$c3e8da3$5496439d@news.astraweb.com>
In reply to	#37571

Chris Angelico wrote:

> It's usually fine to have int() complain about any non-numerics in the
> string, but I must confess, I do sometimes yearn for atoi() semantics:
> atoi("123asd") == 123, and atoi("qqq") == 0. I've not seen a 
> convenient Python function for doing that. Usually it involves
> manually getting the digits off the front. All I want is to suppress
> the error on finding a non-digit. Oh well.

It's easy enough to write your own. All you need do is decide what you 
mean by "suppress the error on finding a non-digit".

Should atoi("123xyz456") return 123 or 123456?

Should atoi("xyz123") return 0 or 123?

And here's a good one:

Should atoi("1OOl") return 1, 100, or 1001?

That last is a serious suggestion by the way. There are still many people
who do not distinguish between 1 and l or 0 and O.

Actually I lied. It's not that easy. Consider:

py> s = '໑໒໙'
py> int(s)
129

Actually I lied again. It's not that hard:

def atoi(s):
    from unicodedata import digit
    i = 0
    for c in s:
        i *= 10
        i += digit(c, 0)
    return i

Variations that stop on the first non-digit, instead of treating them as
zero, are not much more difficult.
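
One such variation might look like the sketch below. The name `atoi_stop` is hypothetical, and it assumes no leading whitespace or sign handling, which C's atoi() does provide:

```python
from unicodedata import digit


def atoi_stop(s):
    """Parse the leading digits of s (any script); stop at the first non-digit."""
    i = 0
    for c in s:
        d = digit(c, -1)  # -1 is our marker for "no digit value"
        if d < 0:
            break  # first non-digit ends the number
        i = i * 10 + d
    return i


print(atoi_stop("123xyz456"))  # prints 123
print(atoi_stop("xyz123"))     # prints 0
```

Like the version above, this relies on unicodedata.digit(), so it also accepts digits from scripts other than ASCII.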

-- 
Steven

#37644

From	Chris Angelico <rosuav@gmail.com>
Date	2013-01-25 12:07 +1100
Message-ID	<mailman.1025.1359076064.2939.python-list@python.org>
In reply to	#37642

On Fri, Jan 25, 2013 at 11:20 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Chris Angelico wrote:
>
>> It's usually fine to have int() complain about any non-numerics in the
>> string, but I must confess, I do sometimes yearn for atoi() semantics:
>> atoi("123asd") == 123, and atoi("qqq") == 0. I've not seen a
>> convenient Python function for doing that. Usually it involves
>> manually getting the digits off the front. All I want is to suppress
>> the error on finding a non-digit. Oh well.
>
> It's easy enough to write your own. All you need do is decide what you
> mean by "suppress the error on finding a non-digit".
>
> Should atoi("123xyz456") return 123 or 123456?
>
> Should atoi("xyz123") return 0 or 123?
>
> And here's a good one:
>
> Should atoi("1OOl") return 1, 100, or 1001?

123, 0, and 1. That's standard atoi semantics.

> That last is a serious suggestion by the way. There are still many people
> who do not distinguish between 1 and l or 0 and O.

Sure. But I'm not trying to cater to people who get it wrong; that's a
job for a DWIM.

> def atoi(s):
>     from unicodedata import digit
>     i = 0
>     for c in s:
>         i *= 10
>         i += digit(c, 0)
>     return i
>
> Variations that stop on the first non-digit, instead of treating them as
> zero, are not much more difficult.

And yes, I'm fully aware that I can roll my own. Here's a shorter
version (ASCII digits only, feel free to expand to Unicode), not
necessarily better:

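def atoi(s):
    # Count the leading ASCII digits; slicing with s[:-0] would yield "" for an all-digit string.
    n = len(s) - len(s.lstrip("0123456789"))
    return int("0" + s[:n])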

It just seems silly that this should have to be done separately, when
it's really just a tweak to the usual string-to-int conversion: when
you come to a non-digit, take one of three options (throw error, skip,
or terminate).
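
As a rough illustration of those three options, here is an ASCII-only sketch. The `to_int` name and its `on_nondigit` parameter are invented for this example and are not part of int() or any standard library function:

```python
def to_int(s, on_nondigit="error"):
    """Convert s to int; on_nondigit is 'error', 'skip', or 'terminate'."""
    if on_nondigit not in ("error", "skip", "terminate"):
        raise ValueError("unknown policy: %r" % (on_nondigit,))
    digits = []
    for c in s:
        if c in "0123456789":
            digits.append(c)
        elif on_nondigit == "error":
            raise ValueError("invalid digit: %r" % (c,))
        elif on_nondigit == "terminate":
            break  # atoi-style: stop at the first non-digit
        # 'skip': drop the character and keep scanning
    return int("0" + "".join(digits))


print(to_int("123xyz456", "terminate"))  # prints 123
print(to_int("123xyz456", "skip"))       # prints 123456
```

Unlike int(), this sketch returns 0 for an empty string even in 'error' mode, and it ignores signs and whitespace.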

Anyway, not a big deal.

ChrisA
