Groups > comp.lang.python > #27730 > unrolled thread
| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2012-08-23 05:47 -0700 |
| Last post | 2012-08-25 07:23 -0400 |
| Articles | 20 on this page of 95 — 21 participants |
Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 05:47 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-23 23:57 +1000
Re: Flexible string representation, unicode, typography, ... MRAB <python@mrabarnett.plus.com> - 2012-08-23 16:11 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 09:19 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 11:33 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 13:22 -0600
Re: Flexible string representation, unicode, typography, ... rusi <rustompmody@gmail.com> - 2012-08-24 09:06 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-24 17:47 +0100
Re: Flexible string representation, unicode, typography, ... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-24 14:34 -0400
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 20:34 +0100
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 15:18 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-08-24 07:38 -0700
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-30 06:55 +0000 |
| Message-ID | <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com> |
| In reply to | #28067 |
On Wed, 29 Aug 2012 08:43:05 -0700, wxjmfauth wrote:

> I can hit the nail a little more.
> I have even a better idea and I'm serious.
>
> If "Python" has found a new way to cover the set of the Unicode
> characters, why not proposing it to the Unicode consortium?

Because the implementation of the str datatype in a programming language
has nothing to do with the Unicode consortium. You might as well propose
it to the International Union of Railway Engineers.

> Unicode has already three schemes covering practically all cases: memory
> consumption, maximum flexibility and an intermediate solution.

And Python's solution uses those: UCS-2, UCS-4, and UTF-8.

The only thing which is innovative here is that instead of the Python
compiler declaring that "all strings will be stored in UCS-2", the
compiler chooses an implementation for each string as needed. So some
strings will be stored internally as UCS-4, some as UCS-2, and some as
ASCII (which is a standard, but not the Unicode consortium's standard).

(And possibly some as UTF-8? I'm not entirely sure from reading the PEP.)

There's nothing radical here, honest.

--
Steven
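The per-string choice of width described above can be observed from Python code. The following is a minimal sketch, assuming CPython 3.3 or later; the helper name `bytes_per_char` is made up for illustration. It measures two strings of different lengths so that the fixed object header cancels out and only the character data remains.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate the storage used per character for a string made only of `ch`.

    Subtracting the sizes of two strings of different lengths cancels the
    fixed per-object header, leaving only the character data.
    """
    shorter = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // n

samples = {
    "ASCII 'a'": "a",
    "Latin-1 e-acute": "\u00e9",
    "BMP euro sign": "\u20ac",
    "astral U+1D11E": "\U0001D11E",
}

# Map each label to the measured width of its sample character.
widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
for label, width in widths.items():
    print(label, width)
```

On a CPython 3.3+ build the widths should come out as 1, 1, 2 and 4 bytes; other implementations are free to store strings differently.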
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-30 18:59 +1000 |
| Message-ID | <mailman.3961.1346317170.4697.python-list@python.org> |
| In reply to | #28092 |
On Thu, Aug 30, 2012 at 6:51 PM, <wxjmfauth@gmail.com> wrote:
> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Only if you talk about a huge document. Try, instead, every string
ever used in a Python script. Practically always.

But I'm wasting my time saying this again. It's been said by multiple
people multiple times.

ChrisA
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-08-30 07:02 -0400 |
| Message-ID | <roy-947BF0.07022430082012@news.panix.com> |
| In reply to | #28092 |
In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> The only thing which is innovative here is that instead of the Python
> compiler declaring that "all strings will be stored in UCS-2", the
> compiler chooses an implementation for each string as needed. So some
> strings will be stored internally as UCS-4, some as UCS-2, and some as
> ASCII (which is a standard, but not the Unicode consortium's standard).

Is the implementation smart enough to know that x == y is always False
if x and y are using different internal representations?
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-30 16:00 +0000 |
| Message-ID | <503f8e33$0$30001$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28100 |
On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:

> In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> The only thing which is innovative here is that instead of the Python
>> compiler declaring that "all strings will be stored in UCS-2", the
>> compiler chooses an implementation for each string as needed. So some
>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>> ASCII (which is a standard, but not the Unicode consortium's standard).
>
> Is the implementation smart enough to know that x == y is always False
> if x and y are using different internal representations?

But x and y are not necessarily always False just because they have
different representations. There may be circumstances where two strings
have different internal representations even though their content is the
same, so it's an unsafe optimization to automatically treat them as
unequal.

The closest existing equivalent here is the relationship between ints and
longs in Python 2. 42 == 42L even though they have different internal
representations and take up a different amount of space.

My expectation is that the initial implementation of PEP 393 will be
relatively unoptimized, and over the next few releases it will get more
efficient. That's usually the way these things go.

--
Steven
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-08-30 16:44 -0400 |
| Message-ID | <mailman.3985.1346359524.4697.python-list@python.org> |
| In reply to | #28133 |
On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>
>> In article <503f0e45$0$9416$c3e8da3$76491128@news.astraweb.com>,
>> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>>
>>> The only thing which is innovative here is that instead of the Python
>>> compiler declaring that "all strings will be stored in UCS-2", the
>>> compiler chooses an implementation for each string as needed. So some
>>> strings will be stored internally as UCS-4, some as UCS-2, and some as
>>> ASCII (which is a standard, but not the Unicode consortium's standard).
>>
>> Is the implementation smart enough to know that x == y is always False
>> if x and y are using different internal representations?
Yes, after checking lengths, and in same circumstances, x != y is True. From
http://hg.python.org/cpython/file/ab6ab44921b2/Objects/unicodeobject.c
    PyObject *
    PyUnicode_RichCompare(PyObject *left, PyObject *right, int op)
    {
        int result;

        if (PyUnicode_Check(left) && PyUnicode_Check(right)) {
            PyObject *v;
            if (PyUnicode_READY(left) == -1 ||
                PyUnicode_READY(right) == -1)
                return NULL;
            if (PyUnicode_GET_LENGTH(left) != PyUnicode_GET_LENGTH(right) ||
                PyUnicode_KIND(left) != PyUnicode_KIND(right)) {
                if (op == Py_EQ) {
                    Py_INCREF(Py_False);
                    return Py_False;
                }
                if (op == Py_NE) {
                    Py_INCREF(Py_True);
                    return Py_True;
                }
            }
    ...
KIND is 1,2,4 bytes/char
'a in s' is also False if a chars are wider than s chars.
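The two shortcuts just described can be modelled in pure Python. This is only an illustrative sketch of the reasoning, not the CPython code: the helper names `kind`, `quick_not_equal` and `quick_not_contained` are hypothetical, and `kind` recomputes from the content what the C implementation stores in the object.

```python
def kind(s):
    """Bytes per character that PEP 393 would pick for s: 1, 2 or 4."""
    widest = max(map(ord, s), default=0)  # the widest code point decides
    if widest < 0x100:
        return 1
    if widest < 0x10000:
        return 2
    return 4

def quick_not_equal(x, y):
    """True when x != y can be decided without comparing any characters."""
    return len(x) != len(y) or kind(x) != kind(y)

def quick_not_contained(a, s):
    """True when `a in s` must be False because a needs wider characters."""
    return kind(a) > kind(s)

print(quick_not_equal("abc", "ab\u20ac"))   # same length, different kind
print(quick_not_contained("\u20ac", "abc"))  # 2-byte needle, 1-byte haystack
```

The shortcut is safe only because the kind is a function of the content: if two strings held the same characters, their widest code point, and therefore their kind, would be the same.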
If s is all ascii, s.encode('ascii') or s.encode('utf-8') is a fast,
constant time operation, as I showed earlier in this discussion. This is
one thing that is much faster in 3.3.
Such things can be tested by timing with different lengths of strings,
where the initial string creation is done in setup code rather than in
the repeated operation code.
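A minimal sketch of that kind of measurement, using `timeit` with the string built in the setup code so that only the `encode` call is timed. The helper name `time_encode`, the chosen lengths and the repeat counts are arbitrary assumptions; whether the time grows with the length is what the run is meant to reveal.

```python
import timeit

def time_encode(length, number=1000):
    """Best-of-three time for s.encode('ascii') on an ASCII string of `length`."""
    # The string is created once in the setup, outside the timed statement.
    setup = "s = 'a' * {}".format(length)
    return min(timeit.repeat("s.encode('ascii')", setup=setup,
                             repeat=3, number=number))

# Compare several lengths; only the relative times are meaningful.
results = {n: time_encode(n) for n in (10, 1000, 100000)}
for n, seconds in sorted(results.items()):
    print(n, seconds)
```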
> But x and y are not necessarily always False just because they have
> different representations. There may be circumstances where two strings
> have different internal representations even though their content is the
> same, so it's an unsafe optimization to automatically treat them as
> unequal.
I am sure that str objects are always in canonical form once visible to
Python code. Note that unready (non-canonical) objects are rejected by
the rich comparison function.
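One way to probe the canonical-form claim from Python code is to build the same content by two different routes and compare the object sizes. This is a sketch assuming CPython 3.3 or later; it is an observation consistent with the claim, not a proof of it, and the function name `built_two_ways` is made up.

```python
import sys

def built_two_ways(n=10):
    """Return the same ASCII content built directly and via a wide string."""
    direct = "a" * n
    wide = "a" * n + "\U0001D11E"  # the last character needs 4 bytes
    sliced = wide[:n]              # the content is pure ASCII again
    return direct, sliced, wide

direct, sliced, wide = built_two_ways()
same_content = direct == sliced
same_size = sys.getsizeof(direct) == sys.getsizeof(sliced)
print(same_content, same_size, sys.getsizeof(wide))
```

If the slice kept the wide representation of its parent, `sliced` would be larger than `direct`; on CPython 3.3+ the two sizes match.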
> My expectation is that the initial implementation of PEP 393 will be
> relatively unoptimized,
The initial implementation was a year ago. At least three people have
expended considerable effort improving it since, so that the slowdown
mentioned in the PEP has mostly disappeared. The things that are still
slower are somewhat balanced by things that are faster.
--
Terry Jan Reedy
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-31 12:32 +0000 |
| Message-ID | <5040aed8$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28140 |
On Thu, 30 Aug 2012 16:44:32 -0400, Terry Reedy wrote:

> On 8/30/2012 12:00 PM, Steven D'Aprano wrote:
>> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
[...]
>>> Is the implementation smart enough to know that x == y is always False
>>> if x and y are using different internal representations?
>
> Yes, after checking lengths, and in same circumstances, x != y is True.
[snip C code]

Thanks Terry for looking that up.

> 'a in s' is also False if a chars are wider than s chars.

Now that's a nice optimization!

[...]
>> But x and y are not necessarily always False just because they have
>> different representations. There may be circumstances where two strings
>> have different internal representations even though their content is
>> the same, so it's an unsafe optimization to automatically treat them as
>> unequal.
>
> I am sure that str objects are always in canonical form once visible to
> Python code. Note that unready (non-canonical) objects are rejected by
> the rich comparison function.

That's one thing that I'm unclear about -- under what circumstances will
a string be in compact versus non-compact form? Reading between the
lines, I guess that a lot of the complexity of the implementation only
occurs while a string is being built. E.g. if you have Python code like
this:

    ''.join(str(x) for x in something)  # a generator expression

Python can't tell how much space to allocate for the string -- it doesn't
know either the overall length of the string or the width of the
characters. So I presume that there is string builder code for dealing
with that, and that it involves resizing blocks of memory.

But if you do this:

    ''.join([str(x) for x in something])  # a list comprehension

Python could scan the list first, find out the widest char, and allocate
exactly the amount of space needed for the string. Even in Python 2,
joining a list comp is much faster than joining a gen expression.

--
Steven
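The generator-versus-list comparison above is easy to measure. The sketch below is one way to do it; the function names, the input size and the repeat counts are arbitrary assumptions. As far as I know, CPython's `str.join` turns any iterable that is not already a list or tuple into a sequence before sizing the result, which would explain a gap between the two forms, but the timing is left for the reader to check rather than asserted here.

```python
import timeit

def join_generator(items):
    """Join via a generator expression."""
    return "".join(str(x) for x in items)

def join_list(items):
    """Join via a list comprehension."""
    return "".join([str(x) for x in items])

items = list(range(1000))

# Both forms must produce the same string; only the speed may differ.
same_result = join_generator(items) == join_list(items)

t_gen = min(timeit.repeat(lambda: join_generator(items), repeat=3, number=200))
t_list = min(timeit.repeat(lambda: join_list(items), repeat=3, number=200))
print(same_result, t_gen, t_list)
```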
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-31 09:13 -0600 |
| Message-ID | <mailman.3.1346426052.27098.python-list@python.org> |
| In reply to | #28172 |
On Fri, Aug 31, 2012 at 6:32 AM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> That's one thing that I'm unclear about -- under what circumstances will
> a string be in compact versus non-compact form?
I understand it to be entirely dependent on which API is used to
construct. The legacy API generates legacy strings, and the new API
generates compact strings. From the comments in unicodeobject.h:
/* ASCII-only strings created through PyUnicode_New use the PyASCIIObject
structure. state.ascii and state.compact are set, and the data
immediately follow the structure. utf8_length and wstr_length can be found
in the length field; the utf8 pointer is equal to the data pointer. */
...
Legacy strings are created by PyUnicode_FromUnicode() and
PyUnicode_FromStringAndSize(NULL, size) functions. They become ready
when PyUnicode_READY() is called.
...
/* Non-ASCII strings allocated through PyUnicode_New use the
PyCompactUnicodeObject structure. state.compact is set, and the data
immediately follow the structure. */
Since I'm not sure that this is clear, note that compact vs. legacy
does not describe which character width is used (except that
PyASCIIObject strings are always 1 byte wide). Legacy and compact
strings can each use the 1, 2, or 4 byte representations. "Compact"
merely denotes that the character data is stored inline with the
struct (as opposed to being stored somewhere else and pointed at by
the struct), not the relative size of the string data. Again from the
comments:
Compact strings use only one memory block (structure + characters),
whereas legacy strings use one block for the structure and one block
for characters.
Cheers,
Ian
| From | Roy Smith <roy@panix.com> |
|---|---|
| Date | 2012-08-31 08:43 -0400 |
| Message-ID | <roy-08D029.08435531082012@news.panix.com> |
| In reply to | #28133 |
In article <503f8e33$0$30001$c3e8da3$5496439d@news.astraweb.com>,
Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:

> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
> > Is the implementation smart enough to know that x == y is always False
> > if x and y are using different internal representations?
>
> [...] There may be circumstances where two strings have different
> internal representations even though their content is the same

If there is a deterministic algorithm which maps string content to
representation type, then I don't see how it's possible for two strings
with different representation types to have the same content. Could you
give me an example of when this might happen?
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-08-31 14:54 +0000 |
| Message-ID | <5040d032$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28173 |
On Fri, 31 Aug 2012 08:43:55 -0400, Roy Smith wrote:

> In article <503f8e33$0$30001$c3e8da3$5496439d@news.astraweb.com>,
> Steven D'Aprano <steve+comp.lang.python@pearwood.info> wrote:
>
>> On Thu, 30 Aug 2012 07:02:24 -0400, Roy Smith wrote:
>> > Is the implementation smart enough to know that x == y is always
>> > False if x and y are using different internal representations?
>>
>> [...] There may be circumstances where two strings have different
>> internal representations even though their content is the same
>
> If there is a deterministic algorithm which maps string content to
> representation type, then I don't see how it's possible for two strings
> with different representation types to have the same content. Could you
> give me an example of when this might happen?

There are deterministic algorithms which can result in the same result
with two different internal formats. Here's an example from Python 2:

py> sum([1, 2**30, -2**30, 2**30, -2**30])
1
py> sum([1, 2**30, 2**30, -2**30, -2**30])
1L

The internal representation (int versus long) differs even though the sum
is the same.

A second example: the order of keys in a dict is deterministic but
unpredictable, as it depends on the history of insertions and deletions
into the dict. So two dicts could be equal, and yet have radically
different internal layout.

One final example: list resizing. Here are two lists which are equal but
have different sizes:

py> a = [0]
py> b = range(10000)
py> del b[1:]
py> a == b
True
py> sys.getsizeof(a)
36
py> sys.getsizeof(b)
48

Is PEP 393 another example of this? I have no idea. Somebody who is more
familiar with the details of the implementation would be able to answer
whether or not that is the case. I'm just suggesting that it is possible.

--
Steven
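The list-resizing example above is a Python 2 session. A Python 3 version of the same experiment might look like the sketch below; the only change is that `range()` has to be turned into a list explicitly. The exact byte counts depend on the interpreter version and platform, so none are quoted here.

```python
import sys

a = [0]
b = list(range(10000))  # range() is lazy in Python 3, so build a list first
del b[1:]               # shrink in place; the allocation may stay larger

equal = a == b
size_a = sys.getsizeof(a)
size_b = sys.getsizeof(b)
print(equal, size_a, size_b)
```

Equal content with different `getsizeof` results shows that the history of an object, and not only its value, can leave a trace in its internal layout.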
| From | Antoine Pitrou <solipsis@pitrou.net> |
|---|---|
| Date | 2012-08-30 15:01 +0000 |
| Message-ID | <mailman.3974.1346338910.4697.python-list@python.org> |
| In reply to | #28092 |
<wxjmfauth <at> gmail.com> writes:
>
> Pick up a random text and see the probability this
> text match the most optimized case 1 char / 1 byte,
> practically never.

Funny that you posted a text which does just that:
http://mail.python.org/pipermail/python-list/2012-August/629554.html

> In a funny way, this is what Python was doing and it
> performs better!

I honestly suggest you shut up until you have a clue.

Regards

Antoine.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-09-02 00:36 -0700 |
| Message-ID | <2a12ba52-232a-41b7-a906-1ec379bbddd7@googlegroups.com> |
| In reply to | #28126 |
On Thursday, 30 August 2012 17:01:50 UTC+2, Antoine Pitrou wrote:
>
> I honestly suggest you shut up until you have a clue.
>

Sorry, Antoine,

I have not the knowledge to dive in the Python code,
but I know what is a character.

The coding of the characters is a domain per se,
independent from the os, from the computer languages.

Before spending time to implement a new algorithm,
maybe it is better to ask, if there is something
better than the actual schemes.

I still remember my thoughts when I read the PEP 393
discussion: "this is not logical", "they do not understand
typography", "atomic character ???", ...

Real world examples.

>>> import libfrancais
>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
... 'noétique', 'noèse', 'noirâtre']
>>> r = libfrancais.sortfr(li)
>>> r
['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
'noirâtre']

(cf "Le Petit Robert")

or

The *letters* satisfying the requirements of the
"Imprimerie nationale".

jmf
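The ordering shown above can be reproduced with the standard library alone. The following is a rough sketch, not the `libfrancais` implementation and not full French collation: the function `french_sort_key` is hypothetical, it ignores the accent tie-breaking rules of real dictionaries, and it handles only the two ligatures listed in the code.

```python
import unicodedata

def french_sort_key(word):
    """Sort key that ignores accents and expands the oe/ae ligatures."""
    # U+0153 is the oe ligature and U+00E6 the ae ligature; neither has a
    # Unicode decomposition, so they are expanded by hand.
    expanded = word.lower().replace("\u0153", "oe").replace("\u00e6", "ae")
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", expanded)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # The original word breaks ties so that the order is deterministic.
    return (base, word)

li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
r = sorted(li, key=french_sort_key)
print(r)
```

Note that the key works on code points only; nothing in it depends on how the interpreter stores the string internally.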
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-09-02 09:58 +0100 |
| Message-ID | <mailman.66.1346576268.27098.python-list@python.org> |
| In reply to | #28245 |
On 02/09/2012 08:36, wxjmfauth@gmail.com wrote:
> On Thursday, 30 August 2012 17:01:50 UTC+2, Antoine Pitrou wrote:
>>
>> I honestly suggest you shut up until you have a clue.
>>
> Sorry, Antoine,
>
> I have not the knowledge to dive in the Python code,
> but I know what is a character.

You're a character, and from my observations on this thread you're very
humorous. YMMV.

> The coding of the characters is a domain per se,
> independent from the os, from the computer languages.
>
> Before spending time to implement a new algorithm,
> maybe it is better to ask, if there is something
> better than the actual schemes.

Please write a new PEP indicating how you would correct your perceived
deficiencies with PEP 393 and its implementation.

> I still remember my thoughts when I read the PEP 393
> discussion: "this is not logical", "they do not understand
> typography", "atomic character ???", ...

When PEP 393 was first drafted how much input did you give during the
acceptance process, if any?

> Real world examples.
>
>>>> import libfrancais
>>>> li = ['noël', 'noir', 'nœud', 'noduleux', \

Why the unneeded continuation character, fancy wasting storage space?

> ... 'noétique', 'noèse', 'noirâtre']
>>>> r = libfrancais.sortfr(li)
>>>> r
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> 'noirâtre']

What has sorting foreign words got to do with the internal representation
of the individual characters?

> (cf "Le Petit Robert")
>
> or
>
> The *letters* satisfying the requirements of the
> "Imprimerie nationale".
>
> jmf

I've just rechecked my calendar and it's definitely not 1st April today.
Poor old me I'm baffled as always.

--
Cheers.

Mark Lawrence.
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-09-02 03:06 -0600 |
| Message-ID | <mailman.68.1346576808.27098.python-list@python.org> |
| In reply to | #28245 |
On Sun, Sep 2, 2012 at 1:36 AM, <wxjmfauth@gmail.com> wrote:
> I still remember my thoughts when I read the PEP 393
> discussion: "this is not logical", "they do not understand
> typography", "atomic character ???", ...
That would indicate one of two possibilities. Either:
1) Everybody in the PEP 393 discussion except for you is clueless
about how to implement a Unicode type; or
2) You are clueless about how to implement a Unicode type.
Taking into account Occam's razor, and also that you seem to be unable
or unwilling to offer a solid rationale for those thoughts, I have to
say that I'm currently leaning toward the second possibility.
> Real world examples.
>
>>>> import libfrancais
>>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
> ... 'noétique', 'noèse', 'noirâtre']
>>>> r = libfrancais.sortfr(li)
>>>> r
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> 'noirâtre']
libfrancais does not appear to be publicly available. It's not listed
in PyPI, and googling for "python libfrancais" turns up nothing
relevant.
Rewriting the example to use locale.strcoll instead:
>>> li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'French_France')
'French_France.1252'
>>> import functools
>>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']
# Python 3.2
>>> import timeit
>>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
[0.5544277025009592, 0.5370117249557325, 0.5551836677925053]
# Python 3.3
>>> import timeit
>>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
[0.1421166788364303, 0.12389078130001963, 0.13184190553613462]
As you can see, Python 3.3 is about 77% faster than Python 3.2 on this
example. If this was intended to show that the Python 3.3 Unicode
representation is a regression over the Python 3.2 implementation,
then it's a complete failure as an example.
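For readers who do not have a French locale installed, the ordering shown above can be approximated without `locale` at all. The following is only a sketch under stated assumptions: `french_sort_key` is a hypothetical helper, not the unpublished `libfrancais.sortfr`, and it implements a simplified rule (expand the ligature, ignore accents, break ties on the original spelling) rather than full French collation.

```python
import unicodedata

def french_sort_key(word):
    """Hypothetical, simplified key: compare base letters first, accents last."""
    # The ligature has no Unicode decomposition, so expand it by hand.
    expanded = word.replace("œ", "oe").replace("Œ", "OE")
    # NFD splits "ë" into "e" plus a combining diaeresis, which is then dropped.
    decomposed = unicodedata.normalize("NFD", expanded)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word is the tie-breaker for words with identical base letters.
    return (base.casefold(), word)

li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
result = sorted(li, key=french_sort_key)
print(result)
```

On this particular list the sketch reproduces the order given by `locale.strcoll` in the post; that agreement should not be assumed for arbitrary French text.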
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-09-02 11:58 -0700 |
| Message-ID | <f8dfb1ca-e48d-4a2f-baed-3c28a2f89777@googlegroups.com> |
| In reply to | #28251 |
On Sunday, 2 September 2012 11:07:35 UTC+2, Ian wrote:
> On Sun, Sep 2, 2012 at 1:36 AM, <wxjmfauth@gmail.com> wrote:
> > I still remember my thoughts when I read the PEP 393
> > discussion: "this is not logical", "they do no understand
> > typography", "atomic character ???", ...
>
> That would indicate one of two possibilities. Either:
>
> 1) Everybody in the PEP 393 discussion except for you is clueless
> about how to implement a Unicode type; or
>
> 2) You are clueless about how to implement a Unicode type.
>
> Taking into account Occam's razor, and also that you seem to be unable
> or unwilling to offer a solid rationale for those thoughts, I have to
> say that I'm currently leaning toward the second possibility.
>
> > Real world exemples.
> >
> >>>> import libfrancais
> >>>> li = ['noël', 'noir', 'nœud', 'noduleux', \
> > ... 'noétique', 'noèse', 'noirâtre']
> >>>> r = libfrancais.sortfr(li)
> >>>> r
> > ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir',
> > 'noirâtre']
>
> libfrancais does not appear to be publicly available. It's not listed
> in PyPI, and googling for "python libfrancais" turns up nothing
> relevant.
>
> Rewriting the example to use locale.strcoll instead:
>
> >>> li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']
> >>> import locale
> >>> locale.setlocale(locale.LC_ALL, 'French_France')
> 'French_France.1252'
> >>> import functools
> >>> sorted(li, key=functools.cmp_to_key(locale.strcoll))
> ['noduleux', 'noël', 'noèse', 'noétique', 'nœud', 'noir', 'noirâtre']
>
> # Python 3.2
> >>> import timeit
> >>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
> [0.5544277025009592, 0.5370117249557325, 0.5551836677925053]
>
> # Python 3.3
> >>> import timeit
> >>> timeit.repeat("sorted(li, key=functools.cmp_to_key(locale.strcoll))", "import functools; import locale; li = ['noël', 'noir', 'nœud', 'noduleux', 'noétique', 'noèse', 'noirâtre']", number=10000)
> [0.1421166788364303, 0.12389078130001963, 0.13184190553613462]
>
> As you can see, Python 3.3 is about 77% faster than Python 3.2 on this
> example. If this was intended to show that the Python 3.3 Unicode
> representation is a regression over the Python 3.2 implementation,
> then it's a complete failure as an example.
- Unfortunately, I got opposite and even much worse results on my win box,
considering
- libfrancais is one of my modules and it does a little bit more than
the std sorting tools.
My rationale: very simple.
1) I never heard about something better than sticking with one
of the Unicode coding schemes. (general theory)
2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
only one guy, who noticed problems. Arguing, "it is fast enough", is not
a correct answer.
jmf
[toc] | [prev] | [next] | [standalone]
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2012-09-02 13:45 -0600 |
| Message-ID | <mailman.106.1346615114.27098.python-list@python.org> |
| In reply to | #28292 |
On 09/02/2012 12:58 PM, wxjmfauth@gmail.com wrote:
> My rationale: very simple.
>
> 1) I never heard about something better than sticking with one
> of the Unicode coding schemes. (general theory)
> 2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
> only one guy, who noticed problems. Arguing, "it is fast enough", is not
> a correct answer.

If this is true, why were you holding up Google Go as an example of doing it right? Certainly Google Go doesn't line up with your rationale. Go has both Strings and Runes. But strings are UTF-8-encoded byte strings and Runes are 32-bit integers. They are not interchangeable without a costly encoding and decoding process. Even worse, indexing a Go string to get a "Rune" involves some very costly decoding that has to be done starting at the beginning of the string each time.

In the worst case, Python's strings are as slow as Go because Python does the exact same thing as Go, but chooses between three encodings instead of just one. Best case scenario, Python's strings could be much faster than Go's because indexing through 2 of the 3 encodings is O(1) because they are constant-width encodings. If as you say, the latin-1 subset of UTF-8 is used, then UTF-8 indexing is O(1) too, otherwise it's probably O(n).
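The cost described here, finding the n-th code point in UTF-8 data, can be illustrated with a short scan. This is a sketch in Python, not Go's actual implementation, and `code_point_at` is a hypothetical helper name. It assumes well-formed UTF-8 input and shows only why the lookup has to walk the bytes from the start.

```python
def code_point_at(data: bytes, index: int) -> str:
    """Return the index-th code point of well-formed UTF-8 data.

    The scan always starts at byte 0, so the cost grows with the index.
    """
    if index < 0:
        raise IndexError("negative indexes are not supported in this sketch")
    seen = -1      # number of code points started so far, minus one
    start = None   # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:  # not a continuation byte: a code point starts here
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")

encoded = "nœud".encode("utf-8")   # 5 bytes for 4 code points
print(code_point_at(encoded, 1))   # the two-byte ligature
```

A fixed-width representation avoids the loop entirely, because the byte offset is simply the index times the unit width.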
[toc] | [prev] | [next] | [standalone]
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-09-02 16:07 -0400 |
| Message-ID | <mailman.108.1346616485.27098.python-list@python.org> |
| In reply to | #28292 |
On 09/02/2012 03:45 PM, Michael Torrie wrote:
> <jmfauth snipped>:
> In the worst case, Python's strings are as slow as Go because Python
> does the exact same thing as Go, but chooses between three encodings
> instead of just one. Best case scenario, Python's strings could be
> much faster than Go's because indexing through 2 of the 3 encodings is
> O(1) because they are constant-width encodings. If as you say, the
> latin-1 subset of UTF-8 is used, then UTF-8 indexing is O(1) too,
> otherwise it's probably O(n).

I'm afraid you have it backwards. The UTF-8 version of the latin-1-compatible characters would be variable length. But my understanding of the PEP is that the internal one-byte format is simply the lowest-order byte of each code point, after assuring that all code points in the particular string are less than 256. That's going to coincidentally resemble latin-1's encoding, but since it's an internal form, the resemblance is irrelevant.

Anyway, those one-byte values are going to be O(1), naturally. No encoding involved, and no searching nor expanding.

--
DaveA
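The distinction drawn here can be checked from Python itself. The sketch below does not inspect CPython's internal buffer; it only compares the low byte of each code point with the latin-1 and UTF-8 encodings of the same string, which is the resemblance the post describes.

```python
s = "noël"

# Every code point is below 256, so one byte per character is enough.
assert max(map(ord, s)) < 256

low_bytes = bytes(ord(ch) for ch in s)  # lowest-order byte of each code point
latin1 = s.encode("latin-1")            # fixed width: one byte per character
utf8 = s.encode("utf-8")                # variable width: "ë" takes two bytes

print(low_bytes == latin1)    # the one-byte form coincides with latin-1
print(len(latin1), len(utf8)) # 4 bytes versus 5 bytes
```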
[toc] | [prev] | [next] | [standalone]
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-09-02 16:38 -0400 |
| Message-ID | <mailman.114.1346618335.27098.python-list@python.org> |
| In reply to | #28292 |
On 9/2/2012 3:45 PM, Michael Torrie wrote:
> In the worst case, Python's strings are as slow as Go because Python
> does the exact same thing as Go, but chooses between three encodings
> instead of just one. Best case scenario, Python's strings could be much
> faster than Go's because indexing through 2 of the 3 encodings is O(1)

In CPython 3.3, indexing of str text string objects is always O(1), and it always indexes and counts code points rather than code units. It was the latter for narrow builds in 3.2 and before. As a result, single character (code point) strings had a length of 2 rather than 1 for extended plane characters. 3.3 corrects this.

--
Terry Jan Reedy
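The difference between counting code points and counting code units is easy to see with one supplementary-plane character. This is a small sketch that assumes Python 3.3 or later; the UTF-16 encoding is used only to show the two code units a narrow build would have counted.

```python
ch = "\U0001F600"  # a single code point outside the Basic Multilingual Plane

code_points = len(ch)                           # str length counts code points
utf16_units = len(ch.encode("utf-16-le")) // 2  # a surrogate pair in UTF-16

print(code_points, utf16_units)
print(ch[0] == ch)  # indexing returns the whole character, not half a pair
```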
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-03 01:42 +0000 |
| Message-ID | <50440af0$0$29967$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28292 |
On Sun, 02 Sep 2012 11:58:08 -0700, wxjmfauth wrote:
> - Unfortunately, I got opposite and even much worse results on my win
> box, considering
> - libfrancais is one of my modules and it does a little bit more than the
> std sorting tools.

How do we know that the problem isn't in your module?

> My rationale: very simple.
>
> 1) I never heard about something better than sticking with one of the
> Unicode coding schemes. (general theory)

Your ignorance is not a good reason for abandoning a powerful software technique.

> 2) I am not at all convinced by the "new" Py 3.3 algorithm. I'm not the
> only one guy, who noticed problems.

That's nice. Nobody has yet displayed genuine performance problems, only artificial and platform-dependent slowdowns that are insignificant in practice. If you can demonstrate genuine problems, people will be interested in fixing them.

Let me be frank: nobody gives a damn if, for some rare circumstances, some_string.replace(another_string) takes 0.3μs instead of 0.1μs. Overall, considering multiple platforms and dozens of different string operations, PEP 393 is a big win:

- many operations are faster
- a few operations are a LOT faster
- but a very few operations are sometimes slower
- many strings will use less memory
- sometimes a LOT less memory
- no more distinction between wide and narrow builds
- characters in the supplementary planes are now, for the first time in Python, treated correctly by default

That's six wins versus one loss.

> Arguing, "it is fast enough", is not a correct answer.

It is *exactly* the correct answer. Nobody is going to revert this just because your script now runs in 5.7ms instead of 5.2ms. Who cares?

If you are *seriously* interested in debugging why string code is slower for you, you can start by running the full suite of Python string benchmarks: see the stringbench benchmark in the Tools directory of source installations, or see here:

http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

--
Steven
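The memory claim in the list above can be observed directly. The sketch below assumes CPython 3.3 or later, since `sys.getsizeof` reports implementation-specific sizes; the sample names are illustrative only, and the exact byte counts vary by version and platform, so only the ordering is meaningful.

```python
import sys

# Three strings of equal length whose widest code point differs.
samples = {
    "ascii": "a" * 1000,            # fits in 1 byte per character
    "bmp": "\u20ac" * 1000,         # euro sign: needs 2 bytes per character
    "astral": "\U0001F600" * 1000,  # supplementary plane: needs 4 bytes
}

sizes = {name: sys.getsizeof(text) for name, text in samples.items()}
for name, size in sizes.items():
    print(f"{name:7s} {size} bytes")
```

A wide 3.2 build would have spent roughly four bytes per character on all three strings.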
[toc] | [prev] | [next] | [standalone]
| From | Serhiy Storchaka <storchaka@gmail.com> |
|---|---|
| Date | 2012-09-03 18:26 +0300 |
| Message-ID | <mailman.147.1346686000.27098.python-list@python.org> |
| In reply to | #28332 |
On 03.09.12 04:42, Steven D'Aprano wrote:
> If you are *seriously* interested in debugging why string code is slower
> for you, you can start by running the full suite of Python string
> benchmarks: see the stringbench benchmark in the Tools directory of
> source installations, or see here:
>
> http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench

http://hg.python.org/cpython/file/default/Tools/stringbench

However, stringbench is not a good tool for measuring the effectiveness of the new string representation, because it focuses mainly on ASCII strings and on comparing strings with bytes.
[toc] | [prev] | [next] | [standalone]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-09-04 00:53 +0000 |
| Message-ID | <504550ff$0$29978$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #28359 |
On Mon, 03 Sep 2012 18:26:02 +0300, Serhiy Storchaka wrote:
> On 03.09.12 04:42, Steven D'Aprano wrote:
>> If you are *seriously* interested in debugging why string code is
>> slower for you, you can start by running the full suite of Python
>> string benchmarks: see the stringbench benchmark in the Tools directory
>> of source installations, or see here:
>>
>> http://hg.python.org/cpython/file/8ff2f4634ed8/Tools/stringbench
>
> http://hg.python.org/cpython/file/default/Tools/stringbench
>
> However, stringbench is not good tool to measure the effectiveness of
> new string representation, because it focuses mainly on ASCII strings
> and comparing strings with bytes.

But it is a good place to start, so you can develop unicode benchmarks.

--
Steven
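A Unicode-aware benchmark along these lines might begin as follows. This is only a minimal sketch, not part of stringbench: the sample strings, the `bench` helper, and the choice of `str.replace` as the operation under test are all assumptions made for illustration. Each sample differs only in its widest code point, so the same operation runs once per internal width.

```python
import timeit

# One sample per storage width: ASCII, Latin-1, BMP, and supplementary plane.
SAMPLES = {
    "ascii": "noel" * 100,
    "latin-1": "noël" * 100,
    "bmp": "noël\u20ac" * 100,
    "astral": "noël\U0001F600" * 100,
}

def bench(text, number=1000, repeat=3):
    """Return the best time in seconds for `number` calls of str.replace."""
    timer = timeit.Timer(lambda: text.replace("o", "0"))
    return min(timer.repeat(repeat=repeat, number=number))

results = {name: bench(text) for name, text in SAMPLES.items()}
for name, seconds in results.items():
    print(f"{name:8s} {seconds:.6f} s")
```

Running the same script under 3.2 and 3.3 on one machine gives a like-for-like comparison; timings from different machines are not comparable.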
[toc] | [prev] | [next] | [standalone]