

Groups > comp.lang.python > #63757 > unrolled thread

'Straße' ('Strasse') and Python 2

Started by: wxjmfauth@gmail.com
First post: 2014-01-11 23:50 -0800
Last post: 2014-01-15 19:27 -0500
Articles: 17 on this page of 37 (16 participants)



Contents

  'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-11 23:50 -0800
    Re: 'Straße' ('Strasse') and Python 2 Peter Otten <__peter__@web.de> - 2014-01-12 09:31 +0100
    Re: 'Straße' ('Strasse') and Python 2 Stefan Behnel <stefan_ml@behnel.de> - 2014-01-12 10:00 +0100
    Re: 'Straße' ('Strasse') and Python 2 Ned Batchelder <ned@nedbatchelder.com> - 2014-01-12 07:17 -0500
    Re: 'Straße' ('Strasse') and Python 2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-12 12:33 +0000
    Re: 'Straße' ('Strasse') and Python 2 MRAB <python@mrabarnett.plus.com> - 2014-01-12 18:33 +0000
    Re: 'Straße' ('Strasse') and Python 2 Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-01-13 09:27 +0100
      Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-13 01:54 -0800
        Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-13 21:26 +1100
        Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-13 10:38 +0000
          Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-13 21:57 +1100
            Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-13 08:24 -0800
              Re: 'Straße' ('Strasse') and Python 2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-13 17:02 +0000
        Re: 'Straße' ('Strasse') and Python 2 Michael Torrie <torriem@gmail.com> - 2014-01-13 08:58 -0700
        Re: 'Straße' ('Strasse') and Python 2 Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-01-13 19:37 +0100
        Mistake or Troll (was Re: 'Straße' ('Strasse') and Python 2) Terry Reedy <tjreedy@udel.edu> - 2014-01-13 18:05 -0500
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 12:00 +0000
      Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:43 +0000
        Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 12:26 +1100
    Re: 'Straße' ('Strasse') and Python 2 Ned Batchelder <ned@nedbatchelder.com> - 2014-01-15 07:13 -0500
      Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-15 06:55 -0800
        Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 02:14 +1100
          Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:32 +0000
            Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-16 10:51 +0000
              Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 14:07 +0000
                Re: 'Straße' ('Strasse') and Python 2 Tim Chase <python.list@tim.thechases.com> - 2014-01-16 09:24 -0600
            Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 21:58 +1100
            Re: 'Straße' ('Strasse') and Python 2 "Frank Millman" <frank@chagford.com> - 2014-01-16 14:06 +0200
            Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-16 13:03 +0000
            Re: 'Straße' ('Strasse') and Python 2 Travis Griggs <travisgriggs@gmail.com> - 2014-01-16 13:30 -0800
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 12:50 +0000
    Re: 'Straße' ('Strasse') and Python 2 Travis Griggs <travisgriggs@gmail.com> - 2014-01-15 08:28 -0800
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 16:55 +0000
    Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 04:14 +1100
    Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 17:28 +0000
    Re: 'Straße' ('Strasse') and Python 2 Ian Kelly <ian.g.kelly@gmail.com> - 2014-01-15 11:32 -0700
    Re: 'Straße' ('Strasse') and Python 2 Terry Reedy <tjreedy@udel.edu> - 2014-01-15 19:27 -0500



#63988

From: wxjmfauth@gmail.com
Date: 2014-01-15 06:55 -0800
Message-ID: <5d820037-44ad-4d8d-bb1b-4fb812fc876b@googlegroups.com>
In reply to: #63974
On Wednesday 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:

> 
> ... more than one codepoint makes up a grapheme ...

No

> In Unicode terms, an encoding is a mapping between codepoints and bytes. 

No

jmf



#63990

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-16 02:14 +1100
Message-ID: <mailman.5516.1389798888.18130.python-list@python.org>
In reply to: #63988
On Thu, Jan 16, 2014 at 1:55 AM,  <wxjmfauth@gmail.com> wrote:
> On Wednesday 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>
>>
>> ... more than one codepoint makes up a grapheme ...
>
> No

Yes.
http://www.unicode.org/faq/char_combmark.html

>> In Unicode terms, an encoding is a mapping between codepoints and bytes.
>
> No

Yes.
http://www.unicode.org/reports/tr17/
Specifically:
"Character Encoding Form: a mapping from a set of nonnegative integers
that are elements of a CCS to a set of sequences of particular code
units of some specified width, such as 32-bit integers"

Or are you saying that www.unicode.org is wrong about the definitions
of Unicode terms?

ChrisA



#64027

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-01-16 00:32 +0000
Message-ID: <52d72898$0$29970$c3e8da3$5496439d@news.astraweb.com>
In reply to: #63990
On Thu, 16 Jan 2014 02:14:38 +1100, Chris Angelico wrote:

> On Thu, Jan 16, 2014 at 1:55 AM,  <wxjmfauth@gmail.com> wrote:
>> On Wednesday 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>>
>>
>>> ... more than one codepoint makes up a grapheme ...
>>
>> No
> 
> Yes.
> http://www.unicode.org/faq/char_combmark.html
> 
>>> In Unicode terms, an encoding is a mapping between codepoints and
>>> bytes.
>>
>> No
> 
> Yes.
> http://www.unicode.org/reports/tr17/
> Specifically:
> "Character Encoding Form: a mapping from a set of nonnegative integers
> that are elements of a CCS to a set of sequences of particular code
> units of some specified width, such as 32-bit integers"

Technically Unicode talks about mapping code points to code *units*, but 
since code units are defined in terms of bytes, I think it is fair to cut 
out one layer of indirection and talk about mapping code points to bytes. 
For instance, UTF-32 uses 4-byte code units, and every code point U+0000 
through U+10FFFF is mapped to a single code unit, which is always a four-
byte quantity. UTF-8, on the other hand, uses single-byte code units, and 
maps code points to a variable number of code units, so UTF-8 maps code 
points to either 1, 2, 3 or 4 bytes.
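
To make the code point / code unit / byte distinction concrete, here is a minimal sketch, assuming Python 3.3 or later. The sample characters and the helper name `encoded_sizes` are chosen for illustration only, and UTF-16 is included for comparison even though the paragraph above discusses only UTF-32 and UTF-8.

```python
# Illustrative only: count bytes and code units for a few sample code points.
SAMPLES = ['A', '\u00df', '\u20ac', '\U0001F600']  # U+0041, U+00DF, U+20AC, U+1F600


def encoded_sizes(ch):
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for one code point."""
    utf8 = ch.encode('utf-8')        # 1-byte code units
    utf16 = ch.encode('utf-16-le')   # 2-byte code units; explicit byte order, so no BOM
    utf32 = ch.encode('utf-32-le')   # 4-byte code units; explicit byte order, so no BOM
    return len(utf8), len(utf16) // 2, len(utf32) // 4


for ch in SAMPLES:
    print('U+%04X' % ord(ch), encoded_sizes(ch))
# U+0041 (1, 1, 1)
# U+00DF (2, 1, 1)
# U+20AC (3, 1, 1)
# U+1F600 (4, 2, 1)
```

UTF-32 always yields exactly one code unit per code point, while UTF-8 yields between one and four single-byte code units.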


> Or are you saying that www.unicode.org is wrong about the definitions of
> Unicode terms?

No, I think he is saying that he doesn't know Unicode anywhere near as 
well as he thinks he does. The question is, will he cherish his 
ignorance, or learn from this thread?




-- 
Steven



#64071

From: Robin Becker <robin@reportlab.com>
Date: 2014-01-16 10:51 +0000
Message-ID: <mailman.5580.1389869514.18130.python-list@python.org>
In reply to: #64027
On 16/01/2014 00:32, Steven D'Aprano wrote:
>> Or are you saying that www.unicode.org is wrong about the definitions of
>> Unicode terms?
> No, I think he is saying that he doesn't know Unicode anywhere near as
> well as he thinks he does. The question is, will he cherish his
> ignorance, or learn from this thread?

I assure you that I fully understand my ignorance of unicode. Until recently I 
didn't even know that the unicode in python 2.x is considered broken and that 
str in python 3.x is considered 'better'.

I can say that having made a lot of reportlab work in both 2.7 & 3.3 I don't 
understand why the latter seems slower especially since we try to convert early 
to unicode/str as a desirable internal form. Probably I have some horrible error 
going on (eg one of the C extensions is working in 2.7 and not in 3.3).
-stupidly yrs-
Robin Becker



#64080

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2014-01-16 14:07 +0000
Message-ID: <52d7e7be$0$29999$c3e8da3$5496439d@news.astraweb.com>
In reply to: #64071
On Thu, 16 Jan 2014 10:51:42 +0000, Robin Becker wrote:

> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> Or are you saying that www.unicode.org is wrong about the definitions
>>> of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
> 
> I assure you that I fully understand my ignorance of unicode.

Robin, while I'm very happy to see that you have a good grasp of what you 
don't know, I'm afraid that you're misrepresenting me. You deleted the 
part of my post that made it clear that I was referring to our resident 
Unicode crank, JMF <wxjmfauth@gmail.com>.


> Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.

No need for scare quotes.

The unicode type in Python 2.x is less-good because:

- it is not the default string type (you have to prefix the string 
  with a u to get Unicode);

- it is missing some functionality, e.g. casefold;

- there are two distinct implementations, narrow builds and wide builds;

- wide builds take up to four times more memory per string than needed;

- narrow builds take up to two times more memory per string than needed;

- worse, narrow builds have very naive (possibly even "broken") 
  handling of code points in the supplementary planes (those above 
  U+FFFF).

The unicode string type in Python 3 is better because:

- it is the default string type;

- it includes more functionality;

- starting in Python 3.3, it gets rid of the distinction between 
  narrow and wide builds;

- which reduces the memory overhead of strings by up to a factor 
  of four in many cases;

- and fixes the handling of supplementary-plane code points.
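
The last three points can be checked with a rough sketch like the one below. It assumes CPython 3.3 or later (`sys.getsizeof` is implementation-specific, and the exact numbers vary between versions and platforms, so none are shown); the variable names are illustrative.

```python
import sys

# Three strings of equal length whose widest code point differs.
ascii_text = 'a' * 1000             # fits in 1 byte per code point
bmp_text = '\u20ac' * 1000          # EURO SIGN, needs 2 bytes per code point
astral_text = '\U0001F600' * 1000   # above U+FFFF, needs 4 bytes per code point

# On CPython 3.3+ the storage width follows the widest code point in the string.
sizes = [sys.getsizeof(s) for s in (ascii_text, bmp_text, astral_text)]
print(sizes)

# A code point above U+FFFF is a single item; a narrow 2.x build would
# have reported a length of 2 here (a surrogate pair).
print(len('\U0001F600'))   # 1
```

The sizes should come out in increasing order, which is the "up to a factor of four" saving for strings that only need one byte per code point.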


> I can say that having made a lot of reportlab work in both 2.7 & 3.3 I
> don't understand why the latter seems slower especially since we try to
> convert early to unicode/str as a desirable internal form. 

*shrug*

Who knows? Is it slower or does it only *seem* slower? Is the performance 
regression platform specific? Have you traded correctness for speed, that 
is, does the 2.7 version break when given astral characters on a narrow build?

Earlier in January, you commented in another thread that 

"I'm not sure if we have any non-bmp characters in the tests."

If you don't, you should have some.

There's all sorts of reasons why your code might be slower under 3.3, 
including the possibility of a non-trivial performance regression. If you 
can demonstrate a test case with a significant slowdown for real-world 
code, I'm sure that a bug report will be treated seriously.


> Probably I
> have some horrible error going on(eg one of the C extensions is working
> in 2.7 and not in 3.3).

Well that might explain a slowdown.

But really, one should expect that moving from single byte strings to up 
to four-byte strings will have *some* cost. It's exchanging functionality 
for time. The same thing happened years ago: people used to be extremely 
opposed to using floating point doubles instead of singles because of 
performance. And, I suppose it is true that back when 64K was considered 
a lot of memory, using eight whole bytes per floating point number (let 
alone ten like the IEEE Extended format) might have seemed the height of 
extravagance. But today we use doubles by default, and if singles would 
be a tiny bit faster, who wants to go back to the bad old days of single 
precision?

I believe the same applies to Unicode versus single-byte strings.



-- 
Steven



#64085

From: Tim Chase <python.list@tim.thechases.com>
Date: 2014-01-16 09:24 -0600
Message-ID: <mailman.5589.1389885790.18130.python-list@python.org>
In reply to: #64080
On 2014-01-16 14:07, Steven D'Aprano wrote:
> The unicode type in Python 2.x is less-good because:
> 
> - it is missing some functionality, e.g. casefold;

Just for the record, str.casefold() wasn't added until 3.3, so
earlier 3.x versions (such as the 3.2.3 that is the default python3
on Debian Stable) don't have it either.
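
As a small illustration of what casefold adds (a sketch, assuming Python 3.3 or later; `same_ignoring_case` is a hypothetical helper name), here it is applied to the word from the subject line:

```python
word = 'Stra\u00dfe'   # 'Straße'

print(word.lower() == 'strasse')      # False: lower() keeps the ß
print(word.casefold() == 'strasse')   # True: casefold() maps ß to 'ss'


def same_ignoring_case(a, b):
    """Caseless comparison using str.casefold (available from Python 3.3)."""
    return a.casefold() == b.casefold()


print(same_ignoring_case(word, 'STRASSE'))   # True
```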

-tkc




#64072

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-16 21:58 +1100
Message-ID: <mailman.5581.1389869944.18130.python-list@python.org>
In reply to: #64027
On Thu, Jan 16, 2014 at 9:51 PM, Robin Becker <robin@reportlab.com> wrote:
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>>
>>> Or are you saying that www.unicode.org is wrong about the definitions of
>>> Unicode terms?
>>
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
>
> I assure you that I fully understand my ignorance of unicode. Until recently
> I didn't even know that the unicode in python 2.x is considered broken and
> that str in python 3.x is considered 'better'.

Your wisdom, if I may paraphrase Master Foo, is that you know you are a fool.

http://catb.org/esr/writings/unix-koans/zealot.html

ChrisA



#64076: Re: 'Straße' ('Strasse') and Python 2

From: "Frank Millman" <frank@chagford.com>
Date: 2014-01-16 14:06 +0200
Subject: Re: 'Straße' ('Strasse') and Python 2
Message-ID: <mailman.5585.1389874001.18130.python-list@python.org>
In reply to: #64027
"Robin Becker" <robin@reportlab.com> wrote in message 
news:52D7B9BE.9020001@chamonix.reportlab.co.uk...
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> Or are you saying that www.unicode.org is wrong about the definitions
>>> of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
> I assure you that I fully understand my ignorance of unicode. Until 
> recently I didn't even know that the unicode in python 2.x is considered 
> broken and that str in python 3.x is considered 'better'.
>

Hi Robin

I am pretty sure that Steven was referring to the original post from 
jmfauth, not to anything that you wrote.

May I say that I am delighted that you are putting in the effort to port 
ReportLab to python3, and I trust that you will get plenty of support from 
the gurus here in achieving this.

Frank Millman




#64079: Re: 'Straße' ('Strasse') and Python 2

From: Robin Becker <robin@reportlab.com>
Date: 2014-01-16 13:03 +0000
Subject: Re: 'Straße' ('Strasse') and Python 2
Message-ID: <mailman.5587.1389877414.18130.python-list@python.org>
In reply to: #64027
On 16/01/2014 12:06, Frank Millman wrote:
..........
>> I assure you that I fully understand my ignorance of unicode. Until
>> recently I didn't even know that the unicode in python 2.x is considered
>> broken and that str in python 3.x is considered 'better'.
>>
>
> Hi Robin
>
> I am pretty sure that Steven was referring to the original post from
> jmfauth, not to anything that you wrote.
>

unfortunately my ignorance remains even in the absence of criticism

> May I say that I am delighted that you are putting in the effort to port
> ReportLab to python3, and I trust that you will get plenty of support from
> the gurus here in achieving this.
........
I have had a lot of support from the gurus; thanks to all of them :)
-- 
Robin Becker



#64108

From: Travis Griggs <travisgriggs@gmail.com>
Date: 2014-01-16 13:30 -0800
Message-ID: <mailman.5606.1389907814.18130.python-list@python.org>
In reply to: #64027
On Jan 16, 2014, at 2:51 AM, Robin Becker <robin@reportlab.com> wrote:

> I assure you that I fully understand my ignorance of ...

Robin, don’t take this personally, I totally got what you meant.

At the same time, I got a real chuckle out of this line. That beats “army intelligence” any day.



#63977

From: Robin Becker <robin@reportlab.com>
Date: 2014-01-15 12:50 +0000
Message-ID: <mailman.5506.1389790224.18130.python-list@python.org>
In reply to: #63757
On 15/01/2014 12:13, Ned Batchelder wrote:
........
>> On my utf8 based system
>>
>>
>>> robin@everest ~:
>>> $ cat ooo.py
>>> if __name__=='__main__':
>>>     import sys
>>>     s='A̅B'
>>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>> robin@everest ~:
>>> $ python ooo.py
>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>> releaselevel='final', serial=0)
>>> len(A̅B)=3
>>> robin@everest ~:
>>> $
>>
>>
........
> You are right that more than one codepoint makes up a grapheme, and that you'll
> need code to deal with the correspondence between them. But let's not muddy
> these already confusing waters by referring to that mapping as an encoding.
>
> In Unicode terms, an encoding is a mapping between codepoints and bytes.  Python
> 3's str is a sequence of codepoints.
>
Semantics is everything. For me graphemes are the endpoint (or should be); to 
get a proper rendering of a sequence of graphemes I can use either a sequence of 
bytes or a sequence of codepoints. They are both encodings of the graphemes; 
what unicode says is an encoding doesn't define what encodings are ie mappings 
from some source alphabet to a target alphabet.
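
For readers who want to see the code point / grapheme gap from the ooo.py example in numbers, here is a rough sketch. It assumes the mark in that example is U+0305 COMBINING OVERLINE, and `rough_clusters` is a hypothetical helper that only attaches combining marks to the preceding base character; the real grapheme cluster rules (Unicode UAX #29) cover more cases.

```python
import unicodedata


def rough_clusters(text):
    """Group each base code point with the combining marks that follow it.

    Simplified approximation, not a full UAX #29 implementation.
    """
    clusters = []
    for ch in text:
        # combining() returns a non-zero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


s = 'A\u0305B'   # A + COMBINING OVERLINE + B
print(len(s))                   # 3 code points
print(len(rough_clusters(s)))   # 2 user-perceived characters
```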
-- 
Robin Becker



#63995

From: Travis Griggs <travisgriggs@gmail.com>
Date: 2014-01-15 08:28 -0800
Message-ID: <mailman.5520.1389803336.18130.python-list@python.org>
In reply to: #63757
On Jan 15, 2014, at 4:50 AM, Robin Becker <robin@reportlab.com> wrote:

> On 15/01/2014 12:13, Ned Batchelder wrote:
> ........
>>> On my utf8 based system
>>> 
>>> 
>>>> robin@everest ~:
>>>> $ cat ooo.py
>>>> if __name__=='__main__':
>>>>    import sys
>>>>    s='A̅B'
>>>>    print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>>> robin@everest ~:
>>>> $ python ooo.py
>>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>>> releaselevel='final', serial=0)
>>>> len(A̅B)=3
>>>> robin@everest ~:
>>>> $
>>> 
>>> 
> ........
>> You are right that more than one codepoint makes up a grapheme, and that you'll
>> need code to deal with the correspondence between them. But let's not muddy
>> these already confusing waters by referring to that mapping as an encoding.
>> 
>> In Unicode terms, an encoding is a mapping between codepoints and bytes.  Python
>> 3's str is a sequence of codepoints.
>> 
> Semantics is everything. For me graphemes are the endpoint (or should be); to get a proper rendering of a sequence of graphemes I can use either a sequence of bytes or a sequence of codepoints. They are both encodings of the graphemes; what unicode says is an encoding doesn't define what encodings are ie mappings from some source alphabet to a target alphabet.

But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings, makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?

For example, the NFC form of ‘ñ’ is '\u00F1'. The NFD form represents the exact same grapheme, but is '\u006e\u0303'. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.

'\u00F1'.encode('utf8')
'\u00F1'.encode('utf16')

'\u006e\u0303'.encode('utf8')
'\u006e\u0303'.encode('utf16')
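
The form / encoding distinction can be sketched with the standard `unicodedata` module (assuming Python 3; `utf-16-le` is used instead of `utf16` here only so that no byte order mark appears in the output):

```python
import unicodedata

# Normalization converts between forms; both results are sequences of code points.
nfc = unicodedata.normalize('NFC', 'n\u0303')   # composed: one code point
nfd = unicodedata.normalize('NFD', '\u00f1')    # decomposed: two code points

print(ascii(nfc), len(nfc))   # '\xf1' 1
print(ascii(nfd), len(nfd))   # 'n\u0303' 2

# Encoding is a separate step: either form can go through any encoding.
for form in (nfc, nfd):
    print(form.encode('utf-8'), form.encode('utf-16-le'))
# b'\xc3\xb1' b'\xf1\x00'
# b'n\xcc\x83' b'n\x00\x03\x03'
```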



#64001

From: Robin Becker <robin@reportlab.com>
Date: 2014-01-15 16:55 +0000
Message-ID: <mailman.5525.1389804942.18130.python-list@python.org>
In reply to: #63757
On 15/01/2014 16:28, Travis Griggs wrote:
........ of a sequence of graphemes I can use either a sequence of bytes or a 
sequence of codepoints. They are both encodings of the graphemes; what unicode 
says is an encoding doesn't define what encodings are ie mappings from some 
source alphabet to a target alphabet.
>
> But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings, makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?
>
> For example, the NFC form of ‘ñ’ is '\u00F1'. The NFD form represents the exact same grapheme, but is '\u006e\u0303'. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.
>
> '\u00F1'.encode('utf8')
> '\u00F1'.encode('utf16')
>
> '\u006e\u0303'.encode('utf8')
> '\u006e\u0303'.encode('utf16')
>

I think about these as encodings, because that's what they are mathematically, 
logically & practically. I can encode the target grapheme sequence as a sequence 
of bytes using a particular 'unicode encoding' eg utf8 or a sequence of code points.

The fact that unicoders want to take over the meaning of encoding is not relevant.

In my utf8 bash shell the python print() takes one encoding (python3 str) and 
translates that to the stdout encoding which happens to be utf8 and passes that 
to the shell which probably does a lot of work to render the result as graphical 
symbols (or graphemes).
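
The print() step described above can be imitated without a terminal. This is only a sketch, assuming Python 3: an in-memory `io.TextIOWrapper` stands in for `sys.stdout`, and UTF-8 is assumed as the output encoding.

```python
import io

s = 'A\u0305B'   # three code points: A, COMBINING OVERLINE, B

# A text stream wraps a byte stream and encodes each str written to it.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8', newline='\n')  # no newline translation

print(s, file=out)
out.flush()

raw = buf.getvalue()
print(raw)   # b'A\xcc\x85B\n': the bytes a UTF-8 terminal would receive
```

What the terminal then draws from those bytes (one overlined A followed by B) is outside Python's control.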

I'm not anti unicode, that's just an assignment of identity to some symbols. 
Coding the values of the ids is a separate issue. It's my belief that we don't 
need more than the byte level encoding to represent unicode. One of the claims 
made for python3 unicode is that it somehow eliminates the problems associated 
with other encodings eg utf8, but in fact they will remain until we force 
printers/designers to stop using complicated multi-codepoint graphemes. I 
suspect that won't happen.
-- 
Robin Becker



#64006

From: Chris Angelico <rosuav@gmail.com>
Date: 2014-01-16 04:14 +1100
Message-ID: <mailman.5531.1389806101.18130.python-list@python.org>
In reply to: #63757
On Thu, Jan 16, 2014 at 3:55 AM, Robin Becker <robin@reportlab.com> wrote:
> I think about these as encodings, because that's what they are
> mathematically, logically & practically. I can encode the target grapheme
> sequence as a sequence of bytes using a particular 'unicode encoding' eg
> utf8 or a sequence of code points.

By that definition, you can equally encode it as a bitmapped image, or
as a series of lines and arcs, and those are equally well "encodings"
of the character. This is not the normal use of that word.

http://en.wikipedia.org/wiki/Character_encoding

ChrisA



#64007

From: Robin Becker <robin@reportlab.com>
Date: 2014-01-15 17:28 +0000
Message-ID: <mailman.5533.1389806945.18130.python-list@python.org>
In reply to: #63757
On 15/01/2014 17:14, Chris Angelico wrote:
> On Thu, Jan 16, 2014 at 3:55 AM, Robin Becker <robin@reportlab.com> wrote:
>> I think about these as encodings, because that's what they are
>> mathematically, logically & practically. I can encode the target grapheme
>> sequence as a sequence of bytes using a particular 'unicode encoding' eg
>> utf8 or a sequence of code points.
>
> By that definition, you can equally encode it as a bitmapped image, or
> as a series of lines and arcs, and those are equally well "encodings"
> of the character. This is not the normal use of that word.
>
> http://en.wikipedia.org/wiki/Character_encoding
>
> ChrisA
>
Actually I didn't use the term 'character encoding', but that doesn't alter the 
argument. If I chose to embed the final graphemes as images encoded as bytes or 
lists of numbers that would still be an encoding; it just wouldn't be 
very easily usable (lots of typing).
-- 
Robin Becker



#64011

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2014-01-15 11:32 -0700
Message-ID: <mailman.5539.1389810783.18130.python-list@python.org>
In reply to: #63757
On Wed, Jan 15, 2014 at 9:55 AM, Robin Becker <robin@reportlab.com> wrote:
> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

A virus is a small infectious agent that replicates only inside the
living cells of other organisms.  In the context of computing however,
that definition is completely false, and if you insist upon it when
trying to talk about computers, you're only going to confuse people as
to what you mean.  Somehow, I haven't seen any biologists complaining
that computer users want to take over the meaning of virus.



#64026

From: Terry Reedy <tjreedy@udel.edu>
Date: 2014-01-15 19:27 -0500
Message-ID: <mailman.5551.1389832076.18130.python-list@python.org>
In reply to: #63757
On 1/15/2014 11:55 AM, Robin Becker wrote:

> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

I agree with you that 'encoding' should not be limited to 'byte encoding 
of a (subset of) unicode characters'. For instance, .jpg and .png are 
byte encodings of images. On the other hand, it is common in human 
discourse to omit qualifiers in particular contexts. 'Computer virus' 
gets condensed to 'virus' in computer contexts.

The problem with graphemes is that there is no fixed set of unicode 
graphemes. Which is to say, the effective set of graphemes is 
context-specific. Just limiting ourselves to English, 'fi' is usually 2 
graphemes when printing to screen, but often just one when printing to 
paper. This is why the Unicode consortium punted 'graphemes' to 
'application' code.

> I'm not anti unicode, that's just an assignment of identity to some
> symbols. Coding the values of the ids is a separate issue. It's my
> belief that we don't need more than the byte level encoding to represent
> unicode. One of the claims made for python3 unicode is that it somehow
> eliminates the problems associated with other encodings eg utf8,

The claim is true for the following problems of the way-too-numerous 
unicode byte encodings.

Subsetting: only a subset of characters can be encoded.

Shifting: the meaning of a byte depends on a preceding shift character, 
which might be back at the beginning of the sequence.

Varying size: the number of bytes to encode a character depends on the 
character.

Both of the last two problems can turn O(1) operations into O(n) 
operations. 3.3+ eliminates all these problems.
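
A sketch of the "varying size" problem, assuming Python 3.3 or later: to find the n-th character in UTF-8 bytes you have to scan from the start, whereas indexing a str goes straight to the item. `utf8_index` is a hypothetical helper written for this illustration, not a standard library function.

```python
def utf8_index(data, n):
    """Return the n-th code point (0-based) of UTF-8 bytes by scanning: O(n)."""
    if n < 0:
        raise IndexError(n)
    count = -1
    start = None
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:   # not a continuation byte: a code point starts here
            count += 1
            if count == n:
                start = pos
            elif count == n + 1:
                return data[start:pos].decode('utf-8')
    if start is None:
        raise IndexError(n)
    return data[start:].decode('utf-8')   # n was the last code point


text = 'Stra\u00dfe \U0001F600!'
data = text.encode('utf-8')

print(len(text), len(data))               # 9 code points, 13 bytes
print(ascii(text[4]))                     # '\xdf', found by direct indexing
print(ascii(utf8_index(data, 4)))         # '\xdf', found only after scanning
```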

-- 
Terry Jan Reedy
