

Groups > comp.lang.python > #63757 > unrolled thread

'Straße' ('Strasse') and Python 2

Started by: wxjmfauth@gmail.com
First post: 2014-01-11 23:50 -0800
Last post: 2014-01-15 19:27 -0500
Articles 20 on this page of 37 — 16 participants



Contents

  'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-11 23:50 -0800
    Re: 'Straße' ('Strasse') and Python 2 Peter Otten <__peter__@web.de> - 2014-01-12 09:31 +0100
    Re: 'Straße' ('Strasse') and Python 2 Stefan Behnel <stefan_ml@behnel.de> - 2014-01-12 10:00 +0100
    Re: 'Straße' ('Strasse') and Python 2 Ned Batchelder <ned@nedbatchelder.com> - 2014-01-12 07:17 -0500
    Re: 'Straße' ('Strasse') and Python 2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-12 12:33 +0000
    Re: 'Straße' ('Strasse') and Python 2 MRAB <python@mrabarnett.plus.com> - 2014-01-12 18:33 +0000
    Re: 'Straße' ('Strasse') and Python 2 Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-01-13 09:27 +0100
      Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-13 01:54 -0800
        Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-13 21:26 +1100
        Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-13 10:38 +0000
          Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-13 21:57 +1100
            Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-13 08:24 -0800
              Re: 'Straße' ('Strasse') and Python 2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-13 17:02 +0000
        Re: 'Straße' ('Strasse') and Python 2 Michael Torrie <torriem@gmail.com> - 2014-01-13 08:58 -0700
        Re: 'Straße' ('Strasse') and Python 2 Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-01-13 19:37 +0100
        Mistake or Troll (was Re: 'Straße' ('Strasse') and Python 2) Terry Reedy <tjreedy@udel.edu> - 2014-01-13 18:05 -0500
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 12:00 +0000
      Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:43 +0000
        Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 12:26 +1100
    Re: 'Straße' ('Strasse') and Python 2 Ned Batchelder <ned@nedbatchelder.com> - 2014-01-15 07:13 -0500
      Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-15 06:55 -0800
        Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 02:14 +1100
          Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:32 +0000
            Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-16 10:51 +0000
              Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 14:07 +0000
                Re: 'Straße' ('Strasse') and Python 2 Tim Chase <python.list@tim.thechases.com> - 2014-01-16 09:24 -0600
            Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 21:58 +1100
            Re: 'Straße' ('Strasse') and Python 2 "Frank Millman" <frank@chagford.com> - 2014-01-16 14:06 +0200
            Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-16 13:03 +0000
            Re: 'Straße' ('Strasse') and Python 2 Travis Griggs <travisgriggs@gmail.com> - 2014-01-16 13:30 -0800
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 12:50 +0000
    Re: 'Straße' ('Strasse') and Python 2 Travis Griggs <travisgriggs@gmail.com> - 2014-01-15 08:28 -0800
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 16:55 +0000
    Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 04:14 +1100
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 17:28 +0000
    Re: 'Straße' ('Strasse') and Python 2 Ian Kelly <ian.g.kelly@gmail.com> - 2014-01-15 11:32 -0700
    Re: 'Straße' ('Strasse') and Python 2 Terry Reedy <tjreedy@udel.edu> - 2014-01-15 19:27 -0500

Page 1 of 2  [1] 2  Next page →


#63757 — 'Straße' ('Strasse') and Python 2

From: wxjmfauth@gmail.com
Date: 2014-01-11 23:50 -0800
Subject: 'Straße' ('Strasse') and Python 2
Message-ID: <30dfa6f1-61b2-49b8-bc65-5fd18d498c38@googlegroups.com>
>>> sys.version
2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>> s = 'Straße'
>>> assert len(s) == 6
>>> assert s[5] == 'e'
>>> 

jmf



#63758

From: Peter Otten <__peter__@web.de>
Date: 2014-01-12 09:31 +0100
Message-ID: <mailman.5360.1389515506.18130.python-list@python.org>
In reply to: #63757
wxjmfauth@gmail.com wrote:

>>>> sys.version
> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>> s = 'Straße'
>>>> assert len(s) == 6
>>>> assert s[5] == 'e'
>>>> 
> 
> jmf

Signifying nothing. (Macbeth)

Python 2.7.2+ (default, Jul 20 2012, 22:15:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "Straße"
>>> assert len(s) == 6
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
>>> assert s[5] == "e"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError



#63759

From: Stefan Behnel <stefan_ml@behnel.de>
Date: 2014-01-12 10:00 +0100
Message-ID: <mailman.5361.1389517279.18130.python-list@python.org>
In reply to: #63757
Peter Otten, 12.01.2014 09:31:
> wxjmfauth@gmail.com wrote:
> 
>> >>> sys.version
>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>> >>> s = 'Straße'
>> >>> assert len(s) == 6
>> >>> assert s[5] == 'e'
>> >>>
>>
>> jmf
> 
> Signifying nothing. (Macbeth)
> 
> Python 2.7.2+ (default, Jul 20 2012, 22:15:08) 
> [GCC 4.6.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> s = "Straße"
> >>> assert len(s) == 6
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError
> >>> assert s[5] == "e"
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> AssertionError

The point I think he was trying to make is that Linux is better than
Windows, because the latter fails to fail on these assertions for some reason.

Stefan :o)



#63763

From: Ned Batchelder <ned@nedbatchelder.com>
Date: 2014-01-12 07:17 -0500
Message-ID: <mailman.5362.1389529053.18130.python-list@python.org>
In reply to: #63757
On 1/12/14 2:50 AM, wxjmfauth@gmail.com wrote:
>>>> sys.version
> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>> s = 'Straße'
>>>> assert len(s) == 6
>>>> assert s[5] == 'e'
>>>>
>
> jmf
>

Dumping random snippets of Python sessions here is useless.  If you are 
trying to make a point, you have to put some English around it.  You 
know what is in your head, but we do not.

-- 
Ned Batchelder, http://nedbatchelder.com



#63764

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2014-01-12 12:33 +0000
Message-ID: <mailman.5363.1389529998.18130.python-list@python.org>
In reply to: #63757
On 12/01/2014 09:00, Stefan Behnel wrote:
> Peter Otten, 12.01.2014 09:31:
>> wxjmfauth@gmail.com wrote:
>>
>>>>>> sys.version
>>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>>>> s = 'Straße'
>>>>>> assert len(s) == 6
>>>>>> assert s[5] == 'e'
>>>>>>
>>>
>>> jmf
>>
>> Signifying nothing. (Macbeth)
>>
>> Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
>> [GCC 4.6.1] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>>>>> s = "Straße"
>>>>> assert len(s) == 6
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>> AssertionError
>>>>> assert s[5] == "e"
>> Traceback (most recent call last):
>>    File "<stdin>", line 1, in <module>
>> AssertionError
>
> The point I think he was trying to make is that Linux is better than
> Windows, because the latter fails to fail on these assertions for some reason.
>
> Stefan :o)
>
>

The point he's trying to make is that he also reads the python-dev 
mailing list, where Steven D'Aprano posted this very example, stating it 
is "Python 2 nonsense".  Fixed in Python 3.  Don't mention... :)

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



#63793

From: MRAB <python@mrabarnett.plus.com>
Date: 2014-01-12 18:33 +0000
Message-ID: <mailman.5380.1389551590.18130.python-list@python.org>
In reply to: #63757
On 2014-01-12 08:31, Peter Otten wrote:
> wxjmfauth@gmail.com wrote:
>
>>>>> sys.version
>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>>> s = 'Straße'
>>>>> assert len(s) == 6
>>>>> assert s[5] == 'e'
>>>>>
>>
>> jmf
>
> Signifying nothing. (Macbeth)
>
> Python 2.7.2+ (default, Jul 20 2012, 22:15:08)
> [GCC 4.6.1] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
>>>> s = "Straße"
>>>> assert len(s) == 6
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> AssertionError
>>>> assert s[5] == "e"
> Traceback (most recent call last):
>    File "<stdin>", line 1, in <module>
> AssertionError
>
>
The point is that in Python 2 'Straße' is a bytestring and its length
depends on the encoding of the source file. If the source file is UTF-8
then 'Straße' is a string literal with 7 bytes between the single
quotes.
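
A minimal Python 3 sketch of the point above, under the assumption that a Python 2 'Straße' literal behaves like the encoded bytes of that text. The helper name `encoded_length` is made up for illustration and does not come from the thread.

```python
# Python 3 sketch: the same six-character text has different byte lengths
# depending on which encoding produced the bytes.
text = "Stra\u00dfe"  # 'Straße', written with an escape so the file encoding is irrelevant


def encoded_length(s, encoding):
    """Return how many bytes s occupies in the given encoding (hypothetical helper)."""
    return len(s.encode(encoding))


utf8_len = encoded_length(text, "utf-8")      # ß becomes the two bytes C3 9F
latin1_len = encoded_length(text, "latin-1")  # ß becomes the single byte DF

print(len(text), utf8_len, latin1_len)
```

So `len(s) == 6` can only succeed for the byte string when the source encoding stores ß in one byte.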



#63815

From: Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date: 2014-01-13 09:27 +0100
Message-ID: <lb0826$k8h$1@r01.glglgl.de>
In reply to: #63757
On 12.01.2014 08:50, wxjmfauth@gmail.com wrote:
>>>> sys.version
> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>> s = 'Straße'
>>>> assert len(s) == 6
>>>> assert s[5] == 'e'
>>>>

Wow. You just found one of the major differences between Python 2 and 3.

Your assertions are just wrong, as s = 'Straße' leads - provided you use 
UTF8 - to a representation of 'Stra\xc3\x9fe', obviously leading to a 
length of 7.


Thomas



#63819

From: wxjmfauth@gmail.com
Date: 2014-01-13 01:54 -0800
Message-ID: <d9170600-01e2-4417-af93-87120bffa940@googlegroups.com>
In reply to: #63815
On Monday, 13 January 2014 09:27:46 UTC+1, Thomas Rachel wrote:
> On 12.01.2014 08:50, wxjmfauth@gmail.com wrote:
> >>>> sys.version
> > 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
> >>>> s = 'Straße'
> >>>> assert len(s) == 6
> >>>> assert s[5] == 'e'
> >>>>
>
> Wow. You just found one of the major differences between Python 2 and 3.
>
> Your assertions are just wrong, as s = 'Straße' leads - provided you use
> UTF8 - to a representation of 'Stra\xc3\x9fe', obviously leading to a
> length of 7.


Not at all. I'm afraid I'm understanding Python (on this
aspect very well).

Do you belong to this group of people who are naively
writing wrong Python code (usually not properly working)
during more than a decade?

'ß' is the fourth character in that text "Straße"
(base index 0).

These assertions are correct (byte string and unicode).

>>> sys.version
'2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>> assert 'Straße'[4] == 'ß'
>>> assert u'Straße'[4] == u'ß'
>>> 

jmf

PS Nothing to do with Py2/Py3.



#63822

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-13 21:26 +1100
Message-ID: <mailman.5401.1389608764.18130.python-list@python.org>
In reply to: #63819
On Mon, Jan 13, 2014 at 8:54 PM,  <wxjmfauth@gmail.com> wrote:
> These assertions are correct (byte string and unicode).
>
>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'
>>>>
>
> jmf
>
> PS Nothing to do with Py2/Py3.

This means that either your source encoding happens to include that
character, or you have assertions disabled. It does NOT mean that you
can rely on writing this string out to a file and having someone else
read it in and understand it the same way.

ChrisA
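
A small Python 3 sketch of the interchange problem described above: bytes written under one encoding and read back under another. The choice of UTF-8 for the writer and CP-1252 for the reader is an assumption made for illustration; any mismatched pair shows the same effect.

```python
# Python 3 sketch: a UTF-8 writer and a CP-1252 reader disagree about the text.
text = "Stra\u00dfe"  # 'Straße'

written = text.encode("utf-8")      # the bytes a UTF-8 writer puts in the file
misread = written.decode("cp1252")  # what a reader assuming CP-1252 gets
reread = written.decode("utf-8")    # what a reader assuming UTF-8 gets

print(repr(misread), len(misread))
print(reread == text)
```

Only the reader that uses the writer's encoding recovers the original six characters; the other one sees seven characters, two of them wrong.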



#63823

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-01-13 10:38 +0000
Message-ID: <52d3c20b$0$29970$c3e8da3$5496439d@news.astraweb.com>
In reply to: #63819
On Mon, 13 Jan 2014 01:54:21 -0800, wxjmfauth wrote:

>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'

I think you are using "from __future__ import unicode_literals". 
Otherwise, that cannot happen in Python 2.x. Using a narrow build:


# on my machine "ando"
py> sys.version
'2.7.2 (default, May 18 2012, 18:25:10) \n[GCC 4.1.2 20080704 (Red Hat 4.1.2-52)]'
py> sys.maxunicode
65535
py> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError
py> list('Straße')
['S', 't', 'r', 'a', '\xc3', '\x9f', 'e']


Using a wide build is the same:


# on my machine "orac"
>>> sys.maxunicode
1114111
>>> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError


But once you run the "from __future__" line, the behaviour changes to 
what you show:

py> from __future__ import unicode_literals
py> list('Straße')
[u'S', u't', u'r', u'a', u'\xdf', u'e']
py> assert 'Straße'[4] == 'ß'
py>


But I still don't understand the point you are trying to make.



-- 
Steven



#63824

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-13 21:57 +1100
Message-ID: <mailman.5402.1389610651.18130.python-list@python.org>
In reply to: #63823
On Mon, Jan 13, 2014 at 9:38 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> I think you are using "from __future__ import unicode_literals".
> Otherwise, that cannot happen in Python 2.x.
>

Alas, not true.

>>> sys.version
'2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)]'
>>> sys.maxunicode
65535
>>> assert 'Straße'[4] == 'ß'
>>> list('Straße')
['S', 't', 'r', 'a', '\xdf', 'e']

That's Windows XP. Presumably Latin-1 (or CP-1252, they both have that
char at 0xDF). He happens to be correct, *as long as the source code
encoding matches the output encoding and is one that uses 0xDF to mean
U+00DF*. Otherwise, he's not.

ChrisA
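
A Python 3 sketch of the condition stated above, using the standard codecs `latin-1`, `cp1252`, `cp437` and `utf-8`. The variable names are illustrative only, and the Python 2 session is approximated here with explicit `encode` calls.

```python
# Python 3 sketch: what the single byte 0xDF means depends on the code page.
raw = b"\xdf"

for encoding in ("latin-1", "cp1252", "cp437"):
    char = raw.decode(encoding)
    # True only for code pages that use 0xDF to mean U+00DF (ß)
    print(encoding, repr(char), char == "\u00df")

# The byte-level check from the thread, restated for Python 3: it holds only
# when the bytes came from an encoding that stores ß as the single byte 0xDF.
latin1_bytes = "Stra\u00dfe".encode("latin-1")
utf8_bytes = "Stra\u00dfe".encode("utf-8")
holds_latin1 = latin1_bytes[4:5] == b"\xdf"
holds_utf8 = utf8_bytes[4:5] == b"\xdf"
print(holds_latin1, holds_utf8)
```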



#63836

From: wxjmfauth@gmail.com
Date: 2014-01-13 08:24 -0800
Message-ID: <3cfbd99e-da03-4b49-bd44-83d098aefc2d@googlegroups.com>
In reply to: #63824
On Monday, 13 January 2014 11:57:28 UTC+1, Chris Angelico wrote:
> On Mon, Jan 13, 2014 at 9:38 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
> > I think you are using "from __future__ import unicode_literals".
> > Otherwise, that cannot happen in Python 2.x.
> >
>
> Alas, not true.
>
> >>> sys.version
> '2.7.4 (default, Apr  6 2013, 19:54:46) [MSC v.1500 32 bit (Intel)]'
> >>> sys.maxunicode
> 65535
> >>> assert 'Straße'[4] == 'ß'
> >>> list('Straße')
> ['S', 't', 'r', 'a', '\xdf', 'e']
>
> That's Windows XP. Presumably Latin-1 (or CP-1252, they both have that
> char at 0xDF). He happens to be correct, *as long as the source code
> encoding matches the output encoding and is one that uses 0xDF to mean
> U+00DF*. Otherwise, he's not.

You are right. It's on Windows. It is only showing how
Python can be a holy mess.

The funny aspect is when I'm reading " *YOUR* assertions
are false" when I'm presenting *PYTHON* assertions!

jmf



#63841

From: Mark Lawrence <breamoreboy@yahoo.co.uk>
Date: 2014-01-13 17:02 +0000
Message-ID: <mailman.5415.1389632562.18130.python-list@python.org>
In reply to: #63836
On 13/01/2014 16:24, wxjmfauth@gmail.com wrote:
>
> You are right. It's on Windows. It is only showing how
> Python can be a holy mess.
>

Regarding Unicode, Python 2 was a holy mess, fixed in Python 3.

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence



#63834

From: Michael Torrie <torriem@gmail.com>
Date: 2014-01-13 08:58 -0700
Message-ID: <mailman.5410.1389628765.18130.python-list@python.org>
In reply to: #63819
On 01/13/2014 02:54 AM, wxjmfauth@gmail.com wrote:
> Not at all. I'm afraid I'm understanding Python (on this
> aspect very well).

Are you sure about that?  Seems to me you're still confused as to the
difference between unicode and encodings.

> 
> Do you belong to this group of people who are naively
> writing wrong Python code (usually not properly working)
> during more than a decade?
> 
> 'ß' is the fourth character in that text "Straße"
> (base index 0).
> 
> These assertions are correct (byte string and unicode).

How can they be?  They only are true for the default encoding and
character set you are using, which happens to have 'ß' as a single byte.
 Hence your little python 2.7 snippet is not using unicode at all, in
any form.  It's using a non-unicode character set.  There are methods
which can decode your character set to unicode and encode from unicode.
 But let's be clear.  Your byte streams are not unicode!

If the default byte encoding is UTF-8, which is a variable number of
bytes per character, your assertions are completely wrong.  Maybe it's
time you stopped programming in Windows and use OS X or Linux which
throw out the random single-byte character sets and instead provide a
UTF-8 terminal environment to support non-latin characters.

> 
>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'
>>>>
> 
> jmf
> 
> PS Nothing to do with Py2/Py3.



#63851

From: Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de>
Date: 2014-01-13 19:37 +0100
Message-ID: <lb1bp8$evr$1@r01.glglgl.de>
In reply to: #63819
On 13.01.2014 10:54, wxjmfauth@gmail.com wrote:

> Not at all. I'm afraid I'm understanding Python (on this
> aspect very well).

IBTD.

> Do you belong to this group of people who are naively
> writing wrong Python code (usually not properly working)
> during more than a decade?

Why should I be?

> 'ß' is the fourth character in that text "Straße"
> (base index 0).

Character-wise, yes. But not byte-string-wise. In a byte string, this 
depends on the character set used.

On CP 437, 850, 12xx (whatever Windows uses) or latin1, you are right, 
but not on the widely used UTF8.

>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'
>>>> assert u'Straße'[4] == u'ß'

Linux box at home:

Python 2.7.3 (default, Apr 14 2012, 08:58:41) [GCC] on linux2
Type "help", "copyright", "credits" or "license" for more information.
 >>> assert 'Straße'[4] == 'ß'
Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
AssertionError
 >>> assert u'Straße'[4] == u'ß'

Python 3.3.0 (default, Oct 01 2012, 09:13:30) [GCC] on linux
Type "help", "copyright", "credits" or "license" for more information.
 >>> assert 'Straße'[4] == 'ß'
 >>> assert u'Straße'[4] == u'ß'

Windows box at work:

Python 2.7.5 (default, May 15 2013, 22:44:16) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
 >>> assert 'Straße'[4] == 'ß'
 >>> assert u'Straße'[4] == u'ß'

> PS Nothing to do with Py2/Py3.

As bytes and unicode and str stuff is heavily changed between them, of 
course it has to do.

And I think you know that and try to confuse and FUD us all - to no avail.


Thomas



#63866 — Mistake or Troll (was Re: 'Straße' ('Strasse') and Python 2)

From: Terry Reedy <tjreedy@udel.edu>
Date: 2014-01-13 18:05 -0500
Subject: Mistake or Troll (was Re: 'Straße' ('Strasse') and Python 2)
Message-ID: <mailman.5432.1389654325.18130.python-list@python.org>
In reply to: #63819
On 1/13/2014 4:54 AM, wxjmfauth@gmail.com wrote:

> I'm afraid I'm understanding Python (on this
> aspect very well).

Really?

> Do you belong to this group of people who are naively
> writing wrong Python code (usually not properly working)
> during more than a decade?

To me, the important question is whether this and previous similar posts 
are intentional trolls designed to stir up the flurry of responses they 
get or 'innocently' misleading or even erroneous. If your claim of 
understanding Python and Unicode is true, then this must be a troll 
post. Either way, please desist, or your access to python-list from 
google-groups may be removed.

> 'ß' is the fourth character in that text "Straße"
> (base index 0).

As others have said, in the *unicode* text "Straße", 'ß' is the fifth 
character, at character index 4, ...

> These assertions are correct (byte string and unicode).

whereas, when the text is encoded into bytes, the byte index depends on 
the encoding and the assertion that it is always 4 is incorrect. Did you 
know this or were you truly ignorant?

>>>> sys.version
> '2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]'
>>>> assert 'Straße'[4] == 'ß'

Sometimes true, sometimes not.

>>>> assert u'Straße'[4] == u'ß'

> PS Nothing to do with Py2/Py3.

This issue has everything to do with Py2, where 'Straße' is encoded 
bytes, versus Py3, where 'Straße' is unicode text where each character 
of that word takes one code unit, whether each is 2 bytes or 4 bytes.

If you replace 'ß' with any astral (non-BMP) character, this issue 
appears even for unicode text in 3.2-, where an astral character 
requires 2, not 1, code units on narrow builds, thereby screwing up 
indexing, just as can happen for encoded bytes. In 3.3+, all characters 
use 1 code unit and indexing (and slicing) always works properly. This 
is another unicode issue where you appear not to understand, but might 
just be trolling.

-- 
Terry Jan Reedy
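
A short sketch of the astral-character case, assuming Python 3.3 or later. The character U+1F600 is an arbitrary non-BMP example and the helper `utf16_code_units` is hypothetical; it counts 16-bit units as an approximation of how a narrow build would have stored the string.

```python
# Python 3.3+ sketch: one astral character is one code point in str,
# but two code units when stored as UTF-16.
astral = "\U0001F600"  # an arbitrary non-BMP character, chosen for illustration


def utf16_code_units(s):
    """Count 16-bit code units in s (hypothetical helper)."""
    return len(s.encode("utf-16-le")) // 2


word = "Stra" + astral + "e"  # 'Straße' with ß replaced by the astral character

print(len(word), utf16_code_units(word))
print(word[5] == "e")
```

With one code unit per character, index 5 lands on 'e'; with UTF-16 code units, index 5 would fall on the second half of the surrogate pair.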




#63971

From: Robin Becker <robin@reportlab.com>
Date: 2014-01-15 12:00 +0000
Message-ID: <mailman.5500.1389787267.18130.python-list@python.org>
In reply to: #63757
On 12/01/2014 07:50, wxjmfauth@gmail.com wrote:
>>>> sys.version
> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>> s = 'Straße'
>>>> assert len(s) == 6
>>>> assert s[5] == 'e'
>>>>
>
> jmf
>

On my utf8 based system


> robin@everest ~:
> $ cat ooo.py
> if __name__=='__main__':
>     import sys
>     s='A̅B'
>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
> robin@everest ~:
> $ python ooo.py
> version_info=sys.version_info(major=3, minor=3, micro=3, releaselevel='final', serial=0)
> len(A̅B)=3
> robin@everest ~:
> $


so two 'characters' are 3 (or 2 or more) codepoints. If I want to isolate so 
called graphemes I need an algorithm even for python's unicode ie when it really 
matters, python3 str is just another encoding.
-- 
Robin Becker
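
One possible shape for such an algorithm, as a rough sketch only: attach each combining mark to the code point before it, using `unicodedata.combining` from the standard library. The function name `split_clusters` is invented here, and this is a simplification; full grapheme segmentation follows more rules than this.

```python
import unicodedata


def split_clusters(s):
    """Group each base code point with the combining marks that follow it.

    A simplification for illustration, not a complete grapheme algorithm.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach the mark to the preceding base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


s = "A\u0305B"  # 'A' + COMBINING OVERLINE + 'B', as in ooo.py above
print(len(s), len(split_clusters(s)))
```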



#64029

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-01-16 00:43 +0000
Message-ID: <52d72b29$0$29970$c3e8da3$5496439d@news.astraweb.com>
In reply to: #63971
On Wed, 15 Jan 2014 12:00:51 +0000, Robin Becker wrote:

> so two 'characters' are 3 (or 2 or more) codepoints.

Yes.


> If I want to isolate so called graphemes I need an algorithm even 
> for python's unicode

Correct. Graphemes are language dependent, e.g. in Dutch "ij" is usually 
a single grapheme, in English it would be counted as two. Likewise, in 
Czech, "ch" is a single grapheme. The Latin form of Serbo-Croatian has 
two two-letter graphemes, Dž and Nj (it used to have three, but Dj is now 
written as Đ).

Worse, linguists sometimes disagree as to what counts as a grapheme. For 
instance, some authorities consider the English "sh" to be a separate 
grapheme. As a native English speaker, I'm not sure about that. Certainly 
it isn't a separate letter of the alphabet, but on the other hand I can't 
think of any words containing "sh" that should be considered as two 
graphemes "s" followed by "h". Wait, no, that's not true... compound 
words such as "glasshouse" or "disheartened" are counter examples.


> ie when it really matters, python3 str is just another encoding.

I'm not entirely sure how a programming language data type (str) can be 
considered a transformation.



-- 
Steven



#64035

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-16 12:26 +1100
Message-ID: <mailman.5557.1389835583.18130.python-list@python.org>
In reply to: #64029
On Thu, Jan 16, 2014 at 11:43 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> Worse, linguists sometimes disagree as to what counts as a grapheme. For
> instance, some authorities consider the English "sh" to be a separate
> grapheme. As a native English speaker, I'm not sure about that. Certainly
> it isn't a separate letter of the alphabet, but on the other hand I can't
> think of any words containing "sh" that should be considered as two
> graphemes "s" followed by "h". Wait, no, that's not true... compound
> words such as "glasshouse" or "disheartened" are counter examples.

Digression: When I was taught basic English during my school days, my
mum used Spalding's book and the 70 phonograms. 25 of them are single
letters (Q is not a phonogram - QU is), and the others are mostly
pairs (there are a handful of 3- and 4-letter phonograms). Not every
instance of "s" followed by "h" is the phonogram "sh" - only the times
when it makes the single sound "sh" (which it doesn't in "glasshouse"
or "disheartened").

Thing is, you can't define spelling and pronunciation in terms of each
other, because you'll always be bitten by corner cases. Everyone knows
how "Thames" is pronounced... right? Well, no. There are (at least)
two rivers of that name, the famous one in London [1] and another one
further north [2]. The obscure one is pronounced the way the word
looks, the famous one isn't. And don't even get started on English
family names... Majorinbanks, Meux and Cholmodeley, as lampshaded [3]
in this song [4]! Even without names, though, there are the tricky
cases and the ones where different localities pronounce the same word
very differently; Unicode shouldn't have to deal with that by changing
whether something's a single character or two. Considering that
phonograms aren't even ligatures (though there is overlap, eg "Th"),
it's much cleaner to leave them as multiple characters.

ChrisA

[1] https://en.wikipedia.org/wiki/River_Thames
[2] Though it's better known as the Isis. https://en.wikipedia.org/wiki/The_Isis
[3] http://tvtropes.org/pmwiki/pmwiki.php/Main/LampshadeHanging
[4] http://www.stagebeauty.net/plays/th-arca2.html - "Mosh-banks",
"Mow", and "Chumley" are the pronunciations used



#63974

From: Ned Batchelder <ned@nedbatchelder.com>
Date: 2014-01-15 07:13 -0500
Message-ID: <mailman.5503.1389788028.18130.python-list@python.org>
In reply to: #63757
On 1/15/14 7:00 AM, Robin Becker wrote:
> On 12/01/2014 07:50, wxjmfauth@gmail.com wrote:
>>>>> sys.version
>> 2.7.6 (default, Nov 10 2013, 19:24:18) [MSC v.1500 32 bit (Intel)]
>>>>> s = 'Straße'
>>>>> assert len(s) == 6
>>>>> assert s[5] == 'e'
>>>>>
>>
>> jmf
>>
>
> On my utf8 based system
>
>
>> robin@everest ~:
>> $ cat ooo.py
>> if __name__=='__main__':
>>     import sys
>>     s='A̅B'
>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>> robin@everest ~:
>> $ python ooo.py
>> version_info=sys.version_info(major=3, minor=3, micro=3,
>> releaselevel='final', serial=0)
>> len(A̅B)=3
>> robin@everest ~:
>> $
>
>
> so two 'characters' are 3 (or 2 or more) codepoints. If I want to
> isolate so called graphemes I need an algorithm even for python's
> unicode ie when it really matters, python3 str is just another encoding.

You are right that more than one codepoint makes up a grapheme, and that 
you'll need code to deal with the correspondence between them. But let's 
not muddy these already confusing waters by referring to that mapping as 
an encoding.

In Unicode terms, an encoding is a mapping between codepoints and bytes. 
  Python 3's str is a sequence of codepoints.

-- 
Ned Batchelder, http://nedbatchelder.com
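
To make the distinction concrete, here is a minimal Python 3 sketch: the str is inspected as a sequence of code points, and two standard encodings map those same code points to different bytes. The variable names are illustrative only.

```python
# Python 3 sketch: code points are one thing, their encoded bytes another.
s = "A\u0305B"  # 'A' + COMBINING OVERLINE + 'B'

code_points = [ord(ch) for ch in s]  # the str viewed as a sequence of code points
utf8_bytes = s.encode("utf-8")       # one mapping from code points to bytes
utf16_bytes = s.encode("utf-16-le")  # another mapping, same code points

print([hex(cp) for cp in code_points])
print(len(utf8_bytes), len(utf16_bytes))
print(utf8_bytes.decode("utf-8") == s)
```

The code point list is the same whichever encoding is chosen; only the bytes differ.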


