| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2014-01-11 23:50 -0800 |
| Last post | 2014-01-15 19:27 -0500 |
| Articles | 17 on this page of 37 — 16 participants |
'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-11 23:50 -0800
Re: 'Straße' ('Strasse') and Python 2 Peter Otten <__peter__@web.de> - 2014-01-12 09:31 +0100
Re: 'Straße' ('Strasse') and Python 2 Stefan Behnel <stefan_ml@behnel.de> - 2014-01-12 10:00 +0100
Re: 'Straße' ('Strasse') and Python 2 Ned Batchelder <ned@nedbatchelder.com> - 2014-01-12 07:17 -0500
Re: 'Straße' ('Strasse') and Python 2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-12 12:33 +0000
Re: 'Straße' ('Strasse') and Python 2 MRAB <python@mrabarnett.plus.com> - 2014-01-12 18:33 +0000
Re: 'Straße' ('Strasse') and Python 2 Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-01-13 09:27 +0100
Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-13 01:54 -0800
Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-13 21:26 +1100
Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-13 10:38 +0000
Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-13 21:57 +1100
Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-13 08:24 -0800
Re: 'Straße' ('Strasse') and Python 2 Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-13 17:02 +0000
Re: 'Straße' ('Strasse') and Python 2 Michael Torrie <torriem@gmail.com> - 2014-01-13 08:58 -0700
Re: 'Straße' ('Strasse') and Python 2 Thomas Rachel <nutznetz-0c1b6768-bfa9-48d5-a470-7603bd3aa915@spamschutz.glglgl.de> - 2014-01-13 19:37 +0100
Mistake or Troll (was Re: 'Straße' ('Strasse') and Python 2) Terry Reedy <tjreedy@udel.edu> - 2014-01-13 18:05 -0500
Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 12:00 +0000
Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:43 +0000
Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 12:26 +1100
Re: 'Straße' ('Strasse') and Python 2 Ned Batchelder <ned@nedbatchelder.com> - 2014-01-15 07:13 -0500
Re: 'Straße' ('Strasse') and Python 2 wxjmfauth@gmail.com - 2014-01-15 06:55 -0800
Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 02:14 +1100
Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 00:32 +0000
Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-16 10:51 +0000
Re: 'Straße' ('Strasse') and Python 2 Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-01-16 14:07 +0000
Re: 'Straße' ('Strasse') and Python 2 Tim Chase <python.list@tim.thechases.com> - 2014-01-16 09:24 -0600
Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 21:58 +1100
Re: 'Straße' ('Strasse') and Python 2 "Frank Millman" <frank@chagford.com> - 2014-01-16 14:06 +0200
Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-16 13:03 +0000
Re: 'Straße' ('Strasse') and Python 2 Travis Griggs <travisgriggs@gmail.com> - 2014-01-16 13:30 -0800
Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 12:50 +0000
Re: 'Straße' ('Strasse') and Python 2 Travis Griggs <travisgriggs@gmail.com> - 2014-01-15 08:28 -0800
Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 16:55 +0000
Re: 'Straße' ('Strasse') and Python 2 Chris Angelico <rosuav@gmail.com> - 2014-01-16 04:14 +1100
Re: 'Straße' ('Strasse') and Python 2 Robin Becker <robin@reportlab.com> - 2014-01-15 17:28 +0000
Re: 'Straße' ('Strasse') and Python 2 Ian Kelly <ian.g.kelly@gmail.com> - 2014-01-15 11:32 -0700
Re: 'Straße' ('Strasse') and Python 2 Terry Reedy <tjreedy@udel.edu> - 2014-01-15 19:27 -0500
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2014-01-15 06:55 -0800 |
| Message-ID | <5d820037-44ad-4d8d-bb1b-4fb812fc876b@googlegroups.com> |
| In reply to | #63974 |
On Wednesday, 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>
> ... more than one codepoint makes up a grapheme ...

No

> In Unicode terms, an encoding is a mapping between codepoints and bytes.

No

jmf
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-16 02:14 +1100 |
| Message-ID | <mailman.5516.1389798888.18130.python-list@python.org> |
| In reply to | #63988 |
On Thu, Jan 16, 2014 at 1:55 AM, <wxjmfauth@gmail.com> wrote:
> On Wednesday, 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>>
>> ... more than one codepoint makes up a grapheme ...
>
> No

Yes.
http://www.unicode.org/faq/char_combmark.html

>> In Unicode terms, an encoding is a mapping between codepoints and bytes.
>
> No

Yes.
http://www.unicode.org/reports/tr17/
Specifically:
"Character Encoding Form: a mapping from a set of nonnegative integers
that are elements of a CCS to a set of sequences of particular code
units of some specified width, such as 32-bit integers"

Or are you saying that www.unicode.org is wrong about the definitions
of Unicode terms?

ChrisA
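The point about several code points forming one grapheme can be checked with a minimal sketch, assuming Python 3 and assuming the combining mark in the 'A̅B' example from this thread is U+0305 COMBINING OVERLINE. The helper `rough_grapheme_count` is a hypothetical name introduced here; it only skips combining marks and is not a full implementation of Unicode grapheme cluster segmentation.

```python
import unicodedata


def rough_grapheme_count(text):
    """Count code points that are not combining marks.

    A rough approximation for illustration only: real grapheme
    segmentation has more rules than this.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# Assumed spelling of the thread's example: A, COMBINING OVERLINE, B
s = 'A\u0305B'

code_points = len(s)                 # number of code points in the str
graphemes = rough_grapheme_count(s)  # approximate number of graphemes

print(code_points)   # 3
print(graphemes)     # 2
```

So `len()` on a Python 3 str counts code points, and any grapheme-level count needs extra code on top of that.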
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-16 00:32 +0000 |
| Message-ID | <52d72898$0$29970$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #63990 |
On Thu, 16 Jan 2014 02:14:38 +1100, Chris Angelico wrote:
> On Thu, Jan 16, 2014 at 1:55 AM, <wxjmfauth@gmail.com> wrote:
>> On Wednesday, 15 January 2014 13:13:36 UTC+1, Ned Batchelder wrote:
>>>
>>> ... more than one codepoint makes up a grapheme ...
>>
>> No
>
> Yes.
> http://www.unicode.org/faq/char_combmark.html
>
>>> In Unicode terms, an encoding is a mapping between codepoints and
>>> bytes.
>>
>> No
>
> Yes.
> http://www.unicode.org/reports/tr17/
> Specifically:
> "Character Encoding Form: a mapping from a set of nonnegative integers
> that are elements of a CCS to a set of sequences of particular code
> units of some specified width, such as 32-bit integers"

Technically Unicode talks about mapping code points and code *units*, but
since code units are defined in terms of bytes, I think it is fair to cut
out one layer of indirection and talk about mapping code points to bytes.
For instance, UTF-32 uses 4-byte code units, and every code point U+0000
through U+10FFFF is mapped to a single code unit, which is always a
four-byte quantity. UTF-8, on the other hand, uses single-byte code units,
and maps code points to a variable number of code units, so UTF-8 maps code
points to either 1, 2, 3 or 4 bytes.

> Or are you saying that www.unicode.org is wrong about the definitions of
> Unicode terms?

No, I think he is saying that he doesn't know Unicode anywhere near as well
as he thinks he does. The question is, will he cherish his ignorance, or
learn from this thread?

--
Steven
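To make the code-unit arithmetic above concrete, here is a minimal sketch, assuming Python 3. The four sample characters are arbitrary choices for illustration and do not come from the thread; 'utf-32-le' is used so that no byte order mark is added to the output.

```python
# Sample characters chosen for illustration (an assumption, not from the thread):
# U+0041, U+00F1, U+20AC, U+1F600
samples = ['A', '\u00f1', '\u20ac', '\U0001F600']

for ch in samples:
    utf8 = ch.encode('utf-8')       # single-byte code units, 1 to 4 per code point
    utf32 = ch.encode('utf-32-le')  # one four-byte code unit per code point, no BOM
    print('U+%04X  utf-8: %d byte(s)  utf-32: %d bytes'
          % (ord(ch), len(utf8), len(utf32)))

utf8_lengths = [len(ch.encode('utf-8')) for ch in samples]
utf32_lengths = [len(ch.encode('utf-32-le')) for ch in samples]
```

The UTF-8 lengths vary with the code point, while the UTF-32 length is constant.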
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-01-16 10:51 +0000 |
| Message-ID | <mailman.5580.1389869514.18130.python-list@python.org> |
| In reply to | #64027 |
On 16/01/2014 00:32, Steven D'Aprano wrote:
>> Or are you saying that www.unicode.org is wrong about the definitions of
>> Unicode terms?
> No, I think he is saying that he doesn't know Unicode anywhere near as
> well as he thinks he does. The question is, will he cherish his
> ignorance, or learn from this thread?

I assure you that I fully understand my ignorance of unicode. Until recently
I didn't even know that the unicode in python 2.x is considered broken and
that str in python 3.x is considered 'better'.

I can say that having made a lot of reportlab work in both 2.7 & 3.3 I don't
understand why the latter seems slower especially since we try to convert
early to unicode/str as a desirable internal form. Probably I have some
horrible error going on (eg one of the C extensions is working in 2.7 and
not in 3.3).
-stupidly yrs-
Robin Becker
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-01-16 14:07 +0000 |
| Message-ID | <52d7e7be$0$29999$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #64071 |
On Thu, 16 Jan 2014 10:51:42 +0000, Robin Becker wrote:
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> Or are you saying that www.unicode.org is wrong about the definitions
>>> of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
> I assure you that I fully understand my ignorance of unicode.

Robin, while I'm very happy to see that you have a good grasp of what you
don't know, I'm afraid that you're misrepresenting me. You deleted the part
of my post that made it clear that I was referring to our resident Unicode
crank, JMF <wxjmfauth@gmail.com>.

> Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.

No need for scare quotes.

The unicode type in Python 2.x is less-good because:

- it is not the default string type (you have to prefix the string with a u
  to get Unicode);
- it is missing some functionality, e.g. casefold;
- there are two distinct implementations, narrow builds and wide builds;
- wide builds take up to four times more memory per string than needed;
- narrow builds take up to two times more memory per string than needed;
- worse, narrow builds have very naive (possibly even "broken") handling of
  code points in the Supplementary Multilingual Planes.

The unicode string type in Python 3 is better because:

- it is the default string type;
- it includes more functionality;
- starting in Python 3.3, it gets rid of the distinction between narrow and
  wide builds;
- which reduces the memory overhead of strings by up to a factor of four in
  many cases;
- and fixes the issue of SMP code points.

> I can say that having made a lot of reportlab work in both 2.7 & 3.3 I
> don't understand why the latter seems slower especially since we try to
> convert early to unicode/str as a desirable internal form.

*shrug* Who knows? Is it slower or does it only *seem* slower? Is the
performance regression platform specific? Have you traded correctness for
speed, that is, does the 2.7 version break when given astral characters on a
narrow build? Earlier in January, you commented in another thread that "I'm
not sure if we have any non-bmp characters in the tests." If you don't, you
should have some.

There's all sorts of reasons why your code might be slower under 3.3,
including the possibility of a non-trivial performance regression. If you
can demonstrate a test case with a significant slowdown for real-world code,
I'm sure that a bug report will be treated seriously.

> Probably I
> have some horrible error going on(eg one of the C extensions is working
> in 2.7 and not in 3.3).

Well that might explain a slowdown.

But really, one should expect that moving from single byte strings to up to
four-byte strings will have *some* cost. It's exchanging functionality for
time. The same thing happened years ago, people used to be extremely opposed
to using floating point doubles instead of singles because of performance.
And, I suppose it is true that back when 64K was considered a lot of memory,
using eight whole bytes per floating point number (let alone ten like the
IEEE Extended format) might have seemed the height of extravagance. But
today we use doubles by default, and if singles would be a tiny bit faster,
who wants to go back to the bad old days of single precision? I believe the
same applies to Unicode versus single-byte strings.

--
Steven
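The memory and astral-character points in the lists above can be observed with a minimal sketch, assuming CPython 3.3 or later. `sys.getsizeof` reports implementation-specific numbers, so only the relative ordering is meaningful here; the sample characters and the repeat count are arbitrary choices for illustration.

```python
import sys

# Three strings of equal length whose widest code point differs.
n = 1000
ascii_s = 'a' * n             # every code point fits in 1 byte
bmp_s = '\u0100' * n          # needs 2 bytes per code point
astral_s = '\U0001F600' * n   # outside the BMP, needs 4 bytes per code point

sizes = [sys.getsizeof(s) for s in (ascii_s, bmp_s, astral_s)]
print(sizes)

# In 3.3+ an astral character is one item of the string, on every build.
astral_len = len('\U0001F600')
print(astral_len)   # 1
```

On a Python 2 narrow build the last expression would instead report 2, because the character is stored as a surrogate pair.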
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-01-16 09:24 -0600 |
| Message-ID | <mailman.5589.1389885790.18130.python-list@python.org> |
| In reply to | #64080 |
On 2014-01-16 14:07, Steven D'Aprano wrote:
> The unicode type in Python 2.x is less-good because:
>
> - it is missing some functionality, e.g. casefold;

Just for the record, str.casefold() wasn't added until 3.3, so earlier
3.x versions (such as the 3.2.3 that is the default python3 on Debian
Stable) don't have it either.

-tkc
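For readers who have not used it, a minimal sketch of what casefold does with the word in this thread's subject line, assuming Python 3.3 or later:

```python
street = 'Stra\u00dfe'   # 'Straße'

lowered = street.lower()      # the sharp s is left alone
folded = street.casefold()    # the sharp s folds to 'ss'

print(lowered)   # straße
print(folded)    # strasse

# Caseless comparison of the two spellings from the subject line:
matches = street.casefold() == 'STRASSE'.casefold()
print(matches)   # True
```

Comparing with `lower()` alone would treat the two spellings as different.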
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-16 21:58 +1100 |
| Message-ID | <mailman.5581.1389869944.18130.python-list@python.org> |
| In reply to | #64027 |
On Thu, Jan 16, 2014 at 9:51 PM, Robin Becker <robin@reportlab.com> wrote:
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>>
>>> Or are you saying that www.unicode.org is wrong about the definitions of
>>> Unicode terms?
>>
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
>
> I assure you that I fully understand my ignorance of unicode. Until recently
> I didn't even know that the unicode in python 2.x is considered broken and
> that str in python 3.x is considered 'better'.

Your wisdom, if I may paraphrase Master Foo, is that you know you are a fool.

http://catb.org/esr/writings/unix-koans/zealot.html

ChrisA
| From | "Frank Millman" <frank@chagford.com> |
|---|---|
| Date | 2014-01-16 14:06 +0200 |
| Subject | Re: 'Straße' ('Strasse') and Python 2 |
| Message-ID | <mailman.5585.1389874001.18130.python-list@python.org> |
| In reply to | #64027 |
"Robin Becker" <robin@reportlab.com> wrote in message
news:52D7B9BE.9020001@chamonix.reportlab.co.uk...
> On 16/01/2014 00:32, Steven D'Aprano wrote:
>>> Or are you saying that www.unicode.org is wrong about the definitions
>>> of Unicode terms?
>> No, I think he is saying that he doesn't know Unicode anywhere near as
>> well as he thinks he does. The question is, will he cherish his
>> ignorance, or learn from this thread?
>
> I assure you that I fully understand my ignorance of unicode. Until
> recently I didn't even know that the unicode in python 2.x is considered
> broken and that str in python 3.x is considered 'better'.
>

Hi Robin

I am pretty sure that Steven was referring to the original post from
jmfauth, not to anything that you wrote.

May I say that I am delighted that you are putting in the effort to port
ReportLab to python3, and I trust that you will get plenty of support from
the gurus here in achieving this.

Frank Millman
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-01-16 13:03 +0000 |
| Subject | Re: 'Straße' ('Strasse') and Python 2 |
| Message-ID | <mailman.5587.1389877414.18130.python-list@python.org> |
| In reply to | #64027 |
On 16/01/2014 12:06, Frank Millman wrote:
..........
>> I assure you that I fully understand my ignorance of unicode. Until
>> recently I didn't even know that the unicode in python 2.x is considered
>> broken and that str in python 3.x is considered 'better'.
>>
>
> Hi Robin
>
> I am pretty sure that Steven was referring to the original post from
> jmfauth, not to anything that you wrote.
>

unfortunately my ignorance remains even in the absence of criticism

> May I say that I am delighted that you are putting in the effort to port
> ReportLab to python3, and I trust that you will get plenty of support from
> the gurus here in achieving this.
........
I have had a lot of support from the gurus thanks to all of them :)
--
Robin Becker
| From | Travis Griggs <travisgriggs@gmail.com> |
|---|---|
| Date | 2014-01-16 13:30 -0800 |
| Message-ID | <mailman.5606.1389907814.18130.python-list@python.org> |
| In reply to | #64027 |
On Jan 16, 2014, at 2:51 AM, Robin Becker <robin@reportlab.com> wrote:
> I assure you that I fully understand my ignorance of ...

Robin, don’t take this personally, I totally got what you meant.

At the same time, I got a real chuckle out of this line. That beats “army
intelligence” any day.
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-01-15 12:50 +0000 |
| Message-ID | <mailman.5506.1389790224.18130.python-list@python.org> |
| In reply to | #63757 |
On 15/01/2014 12:13, Ned Batchelder wrote:
........
>> On my utf8 based system
>>
>>
>>> robin@everest ~:
>>> $ cat ooo.py
>>> if __name__=='__main__':
>>>     import sys
>>>     s='A̅B'
>>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>> robin@everest ~:
>>> $ python ooo.py
>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>> releaselevel='final', serial=0)
>>> len(A̅B)=3
>>> robin@everest ~:
>>> $
>>
>>
........
> You are right that more than one codepoint makes up a grapheme, and that you'll
> need code to deal with the correspondence between them. But let's not muddy
> these already confusing waters by referring to that mapping as an encoding.
>
> In Unicode terms, an encoding is a mapping between codepoints and bytes. Python
> 3's str is a sequence of codepoints.
>
Semantics is everything. For me graphemes are the endpoint (or should be); to
get a proper rendering of a sequence of graphemes I can use either a sequence of
bytes or a sequence of codepoints. They are both encodings of the graphemes;
what unicode says is an encoding doesn't define what encodings are ie mappings
from some source alphabet to a target alphabet.
--
Robin Becker
| From | Travis Griggs <travisgriggs@gmail.com> |
|---|---|
| Date | 2014-01-15 08:28 -0800 |
| Message-ID | <mailman.5520.1389803336.18130.python-list@python.org> |
| In reply to | #63757 |
On Jan 15, 2014, at 4:50 AM, Robin Becker <robin@reportlab.com> wrote:
> On 15/01/2014 12:13, Ned Batchelder wrote:
> ........
>>> On my utf8 based system
>>>
>>>
>>>> robin@everest ~:
>>>> $ cat ooo.py
>>>> if __name__=='__main__':
>>>>     import sys
>>>>     s='A̅B'
>>>>     print('version_info=%s\nlen(%s)=%d' % (sys.version_info,s,len(s)))
>>>> robin@everest ~:
>>>> $ python ooo.py
>>>> version_info=sys.version_info(major=3, minor=3, micro=3,
>>>> releaselevel='final', serial=0)
>>>> len(A̅B)=3
>>>> robin@everest ~:
>>>> $
>>>
>>>
> ........
>> You are right that more than one codepoint makes up a grapheme, and that you'll
>> need code to deal with the correspondence between them. But let's not muddy
>> these already confusing waters by referring to that mapping as an encoding.
>>
>> In Unicode terms, an encoding is a mapping between codepoints and bytes. Python
>> 3's str is a sequence of codepoints.
>>
> Semantics is everything. For me graphemes are the endpoint (or should be); to get a proper rendering of a sequence of graphemes I can use either a sequence of bytes or a sequence of codepoints. They are both encodings of the graphemes; what unicode says is an encoding doesn't define what encodings are ie mappings from some source alphabet to a target alphabet.
But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings, makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?
For example, the NFC form of ‘ñ’ is ‘\u00F1’. The NFD form represents the exact same grapheme, but is ‘\u006e\u0303’. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.
'\u00F1'.encode('utf8')
'\u00F1'.encode('utf16')
'\u006e\u0303'.encode('utf8')
'\u006e\u0303'.encode('utf16')
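The distinction between normalization forms and encodings can be checked with a minimal sketch, assuming Python 3 and the standard `unicodedata` module. 'utf-16-le' is used here instead of 'utf16' only so that the output carries no byte order mark and is the same on every platform; that substitution is a choice made for this illustration.

```python
import unicodedata

# Two forms of the same grapheme (a sequence of code points each):
nfc = unicodedata.normalize('NFC', '\u006e\u0303')  # composed
nfd = unicodedata.normalize('NFD', '\u00f1')        # decomposed

print([hex(ord(c)) for c in nfc])   # ['0xf1']
print([hex(ord(c)) for c in nfd])   # ['0x6e', '0x303']

# Either form can then be encoded (a sequence of bytes) with any encoding:
nfc_utf8 = nfc.encode('utf-8')
nfd_utf8 = nfd.encode('utf-8')
nfc_utf16 = nfc.encode('utf-16-le')
nfd_utf16 = nfd.encode('utf-16-le')

print(nfc_utf8, nfd_utf8)
print(nfc_utf16, nfd_utf16)
```

Choosing a form and choosing an encoding are independent steps, which is why the two are usually given different names.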
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-01-15 16:55 +0000 |
| Message-ID | <mailman.5525.1389804942.18130.python-list@python.org> |
| In reply to | #63757 |
On 15/01/2014 16:28, Travis Griggs wrote:
........ of a sequence of graphemes I can use either a sequence of bytes or a
sequence of codepoints. They are both encodings of the graphemes; what unicode
says is an encoding doesn't define what encodings are ie mappings from some
source alphabet to a target alphabet.
>
> But you’re talking about two levels of encoding. One runs on top of the other. So insisting that you be able to call them all encodings, makes the term pointless, because now it’s ambiguous as to what you’re referring to. Are you referring to encoding in the sense of representing code points with bytes? Or are you referring to what the unicode guys call “forms”?
>
> For example, the NFC form of ‘ñ’ is ‘\u00F1’. The NFD form represents the exact same grapheme, but is ‘\u006e\u0303’. You can call them encodings if you want, but I echo Ned’s sentiment that you keep that to yourself. Conventionally, they’re different forms, not different encodings. You can encode either form with an encoding, e.g.
>
> '\u00F1'.encode('utf8')
> '\u00F1'.encode('utf16')
>
> '\u006e\u0303'.encode('utf8')
> '\u006e\u0303'.encode('utf16')
>
I think about these as encodings, because that's what they are mathematically,
logically & practically. I can encode the target grapheme sequence as a sequence
of bytes using a particular 'unicode encoding' eg utf8 or a sequence of code points.
The fact that unicoders want to take over the meaning of encoding is not relevant.
In my utf8 bash shell the python print() takes one encoding (python3 str) and
translates that to the stdout encoding which happens to be utf8 and passes that
to the shell which probably does a lot of work to render the result as graphical
symbols (or graphemes).
I'm not anti unicode, that's just an assignment of identity to some symbols.
Coding the values of the ids is a separate issue. It's my belief that we don't
need more than the byte level encoding to represent unicode. One of the claims
made for python3 unicode is that it somehow eliminates the problems associated
with other encodings eg utf8, but in fact they will remain until we force
printers/designers to stop using complicated multi-codepoint graphemes. I
suspect that won't happen.
--
Robin Becker
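The str-to-bytes step that print() performs can be made visible with a minimal sketch, assuming Python 3. An in-memory `io.TextIOWrapper` stands in for a UTF-8 stdout so that the result does not depend on the terminal, and the string is assumed to be A, U+0305 COMBINING OVERLINE, B as in the earlier ooo.py example.

```python
import io

# A stand-in for sys.stdout on a UTF-8 system: a text layer over a byte buffer.
buf = io.BytesIO()
out = io.TextIOWrapper(buf, encoding='utf-8', newline='\n')

s = 'A\u0305B'        # three code points (assumed spelling of the example)
print(s, file=out)    # the text layer encodes the str on the way out
out.flush()

raw = buf.getvalue()  # the bytes a terminal would actually receive
print(raw)            # b'A\xcc\x85B\n'
```

Rendering those bytes as two visible symbols is then the job of the terminal and its fonts, not of Python.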
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2014-01-16 04:14 +1100 |
| Message-ID | <mailman.5531.1389806101.18130.python-list@python.org> |
| In reply to | #63757 |
On Thu, Jan 16, 2014 at 3:55 AM, Robin Becker <robin@reportlab.com> wrote:
> I think about these as encodings, because that's what they are
> mathematically, logically & practically. I can encode the target grapheme
> sequence as a sequence of bytes using a particular 'unicode encoding' eg
> utf8 or a sequence of code points.

By that definition, you can equally encode it as a bitmapped image, or
as a series of lines and arcs, and those are equally well "encodings"
of the character. This is not the normal use of that word.

http://en.wikipedia.org/wiki/Character_encoding

ChrisA
| From | Robin Becker <robin@reportlab.com> |
|---|---|
| Date | 2014-01-15 17:28 +0000 |
| Message-ID | <mailman.5533.1389806945.18130.python-list@python.org> |
| In reply to | #63757 |
On 15/01/2014 17:14, Chris Angelico wrote:
> On Thu, Jan 16, 2014 at 3:55 AM, Robin Becker <robin@reportlab.com> wrote:
>> I think about these as encodings, because that's what they are
>> mathematically, logically & practically. I can encode the target grapheme
>> sequence as a sequence of bytes using a particular 'unicode encoding' eg
>> utf8 or a sequence of code points.
>
> By that definition, you can equally encode it as a bitmapped image, or
> as a series of lines and arcs, and those are equally well "encodings"
> of the character. This is not the normal use of that word.
>
> http://en.wikipedia.org/wiki/Character_encoding
>
> ChrisA
>

Actually I didn't use the term 'character encoding', but that doesn't alter
the argument. If I chose to embed the final graphemes as images encoded as
bytes or lists of numbers that would still be an encoding; it just wouldn't
be very easily usable (lots of typing).
--
Robin Becker
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2014-01-15 11:32 -0700 |
| Message-ID | <mailman.5539.1389810783.18130.python-list@python.org> |
| In reply to | #63757 |
On Wed, Jan 15, 2014 at 9:55 AM, Robin Becker <robin@reportlab.com> wrote:
> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

A virus is a small infectious agent that replicates only inside the
living cells of other organisms.

In the context of computing however, that definition is completely
false, and if you insist upon it when trying to talk about computers,
you're only going to confuse people as to what you mean. Somehow, I
haven't seen any biologists complaining that computer users want to
take over the meaning of virus.
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2014-01-15 19:27 -0500 |
| Message-ID | <mailman.5551.1389832076.18130.python-list@python.org> |
| In reply to | #63757 |
On 1/15/2014 11:55 AM, Robin Becker wrote:
> The fact that unicoders want to take over the meaning of encoding is not
> relevant.

I agree with you that 'encoding' should not be limited to 'byte encoding of
a (subset of) unicode characters'. For instance, .jpg and .png are byte
encodings of images. On the other hand, it is common in human discourse to
omit qualifiers in particular contexts. 'Computer virus' gets condensed to
'virus' in computer contexts.

The problem with graphemes is that there is no fixed set of unicode
graphemes. Which is to say, the effective set of graphemes is
context-specific. Just limiting ourselves to English, 'fi' is usually 2
graphemes when printing to screen, but often just one when printing to
paper. This is why the Unicode consortium punted 'graphemes' to
'application' code.

> I'm not anti unicode, that's just an assignment of identity to some
> symbols. Coding the values of the ids is a separate issue. It's my
> belief that we don't need more than the byte level encoding to represent
> unicode. One of the claims made for python3 unicode is that it somehow
> eliminates the problems associated with other encodings eg utf8,

The claim is true for the following problems of the way-too-numerous unicode
byte encodings.

Subsetting: only a subset of characters can be encoded.

Shifting: the meaning of a byte depends on a preceding shift character,
which might be back at the beginning of the sequence.

Varying size: the number of bytes to encode a character depends on the
character.

Both of the last two problems can turn O(1) operations into O(n) operations.
3.3+ eliminates all these problems.

--
Terry Jan Reedy
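The "varying size" point can be illustrated with a minimal sketch, assuming Python 3 and well-formed UTF-8 input. The function `codepoint_at` is a hypothetical helper written for this note, not a library call: to find the n-th code point in UTF-8 bytes it has to walk from the start, whereas indexing a str is a direct lookup in CPython 3.3+.

```python
def codepoint_at(data, index):
    """Return the code point at position `index` in well-formed UTF-8 bytes.

    Hypothetical helper for illustration: it scans from the start, so the
    cost grows with `index` (O(n)), unlike indexing a str.
    """
    pos = 0    # byte offset into data
    seen = 0   # number of code points skipped so far
    while pos < len(data):
        first = data[pos]
        # The lead byte tells us how many bytes this code point occupies.
        if first < 0x80:
            width = 1
        elif first >> 5 == 0b110:
            width = 2
        elif first >> 4 == 0b1110:
            width = 3
        else:
            width = 4
        if seen == index:
            return data[pos:pos + width].decode('utf-8')
        pos += width
        seen += 1
    raise IndexError(index)


# Sample text chosen for illustration: 1-, 2-, 3- and 4-byte code points.
text = 'a\u00f1\u20ac\U0001F600b'
data = text.encode('utf-8')

scanned = [codepoint_at(data, i) for i in range(len(text))]
print(scanned == list(text))   # True
```

The scan gives the same answers as `text[i]`, but only by reading every lead byte before the requested position.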