| Started by | wxjmfauth@gmail.com |
|---|---|
| First post | 2012-08-23 05:47 -0700 |
| Last post | 2012-08-25 07:23 -0400 |
| Articles | 20 on this page of 95 — 21 participants |
Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 05:47 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-23 23:57 +1000
Re: Flexible string representation, unicode, typography, ... MRAB <python@mrabarnett.plus.com> - 2012-08-23 16:11 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 09:19 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-23 11:33 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-23 13:22 -0600
Re: Flexible string representation, unicode, typography, ... rusi <rustompmody@gmail.com> - 2012-08-24 09:06 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-24 17:47 +0100
Re: Flexible string representation, unicode, typography, ... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2012-08-24 14:34 -0400
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 20:34 +0100
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-23 15:18 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-08-24 07:38 -0700
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-25 00:24 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Ben Finney <ben+python@benfinney.id.au> - 2012-08-25 17:54 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 00:27 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Frank Millman <frank@chagford.com> - 2012-08-25 11:46 +0200
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 08:47 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-25 16:26 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:50 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-25 23:59 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 11:49 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 09:40 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 20:13 +0000
Re: Flexible string representation, unicode, typography, ... Dan Sommers <dan@tombstonezero.net> - 2012-08-26 13:45 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-27 14:14 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 13:37 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:38 -0700
Re: Flexible string representation, unicode, typography, ... Neil Hodgson <nhodgson@iinet.net.au> - 2012-08-28 09:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 13:59 +1000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-28 22:15 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-29 08:05 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-08-29 08:01 -0400
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 08:43 -0700
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 06:55 +0000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 18:59 +1000
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-30 07:02 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-30 16:00 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-30 16:44 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 12:32 +0000
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-31 09:13 -0600
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-08-31 08:43 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-31 14:54 +0000
Re: Flexible string representation, unicode, typography, ... Antoine Pitrou <solipsis@pitrou.net> - 2012-08-30 15:01 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 09:58 +0100
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-02 03:06 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Michael Torrie <torriem@gmail.com> - 2012-09-02 13:45 -0600
Re: Flexible string representation, unicode, typography, ... Dave Angel <d@davea.name> - 2012-09-02 16:07 -0400
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 16:38 -0400
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:42 +0000
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:26 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-04 00:53 +0000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 11:58 -0700
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-02 11:52 +0200
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 11:36 +0100
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 15:00 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-03 07:11 +0100
Re: Flexible string representation, unicode, typography, ... Peter Otten <__peter__@web.de> - 2012-09-03 08:15 +0200
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-03 04:38 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:56 +0300
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 22:39 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 13:23 +0100
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-02 08:35 -0400
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-09-02 15:46 +0100
Re: Flexible string representation, unicode, typography, ... Ramchandra Apte <maniandram01@gmail.com> - 2012-09-02 06:48 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-09-03 12:33 -0600
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-09-02 00:36 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-30 10:27 -0600
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-02 23:38 +0300
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-09-03 01:54 +0000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-09-02 22:33 -0400
Re: Flexible string representation, unicode, typography, ... Roy Smith <roy@panix.com> - 2012-09-03 11:24 -0400
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 18:41 +0300
Re: Flexible string representation, unicode, typography, ... Serhiy Storchaka <storchaka@gmail.com> - 2012-09-03 00:45 +0300
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-30 01:54 +1000
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-29 22:34 +1000
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-29 04:40 -0700
Re: Flexible string representation, unicode, typography, ... wxjmfauth@gmail.com - 2012-08-27 12:16 -0700
Re: Flexible string representation, unicode, typography, ... Ian Kelly <ian.g.kelly@gmail.com> - 2012-08-26 15:42 -0600
Re: Flexible string representation, unicode, typography, ... Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-08-26 23:31 +0000
Re: Flexible string representation, unicode, typography, ... Paul Rubin <no.email@nospam.invalid> - 2012-08-26 17:47 -0700
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:04 +1000
Re: Flexible string representation, unicode, typography, ... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-25 12:05 +0100
Re: Flexible string representation, unicode, typography, ... Chris Angelico <rosuav@gmail.com> - 2012-08-25 21:19 +1000
Re: Flexible string representation, unicode, typography, ... Terry Reedy <tjreedy@udel.edu> - 2012-08-25 07:23 -0400
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-23 05:47 -0700 |
| Subject | Flexible string representation, unicode, typography, ... |
| Message-ID | <a81cd504-d889-4aa1-9daa-6df3448b4da8@googlegroups.com> |
This is neither a complaint nor a question, just a comment.
In the previous discussion related to the flexible
string representation, Roy Smith added this comment:
http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
Not only do I agree with his sentence:
"Clearly, the world has moved to a 32-bit character set."
he also used a very interesting word in his comment: "punctuation".
There is a point which is, in my mind, not very well understood,
"digested", underestimated or neglected by many developers:
the relation between the coding of the characters and the typography.
Unicode (the consortium) does not only deal with the coding of
the characters; it also works on the *classification* of characters.
A deliberately simplistic picture: "letters" sit at the bottom
of the table, at lower code points/integers; "typographic characters"
like punctuation, common symbols, ... sit high in the table, at high code
points/integers.
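For reference, the code point and Unicode general category of the characters named in this thread can be looked up with the standard `unicodedata` module. This is only a sketch: the helper name `describe` and the choice of sample characters are assumptions made for illustration, not something from the original post.

```python
import unicodedata

def describe(ch):
    """Return (code point in hex, general category, official name) for one character."""
    return (hex(ord(ch)), unicodedata.category(ch), unicodedata.name(ch))

# Sample characters (an arbitrary selection): a letter, ASCII punctuation,
# an accented letter, EM DASH, BULLET and EURO SIGN.
for ch in "a!\u00e9\u2014\u2022\u20ac":
    print(ch, *describe(ch))
```

The categories start with `L` for letters, `P` for punctuation and `S` for symbols, so the output shows where each class of character sits in the table.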
The conclusion is inescapable: if one wishes to work in a "Unicode
mode", one is forced to use the whole palette of Unicode
code points. This is the *nature* of Unicode.
Technically, believing that it is possible to optimize only a subrange
of the Unicode code point range is simply an illusion. It is a lot of
work, probably quite complicated, which in the end solves nothing.
Python, in my mind, fell into this trap.
"Simple is better than complex."
-> hard to maintain
"Flat is better than nested."
-> code points range
"Special cases aren't special enough to break the rules."
-> special unicode code points?
"Although practicality beats purity."
-> or the opposite?
"In the face of ambiguity, refuse the temptation to guess."
-> guessing a user will only work with the "optimized" char subrange.
...
Small illustration. Take an A4 page containing 50 lines of 80 ASCII
characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
and you will see all the optimization efforts destroyed.
>>> sys.getsizeof('a' * 80 * 50)
4025
>>> sys.getsizeof('a' * 80 * 50 + '•')
8040
Just my 2 € (code point 0x20ac) cents.
jmf
| From | Neil Hodgson <nhodgson@iinet.net.au> |
|---|---|
| Date | 2012-08-23 23:57 +1000 |
| Message-ID | <D7udnfbyKvHEqqvNnZ2dnUVZ_sidnZ2d@westnet.com.au> |
| In reply to | #27730 |
wxjmfauth@gmail.com:
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
This example is still benefiting from shrinking the number of bytes
in half over using 32 bits per character as was the case with Python 3.2:
>>> sys.getsizeof('a' * 80 * 50)
16032
>>> sys.getsizeof('a' * 80 * 50 + '•')
16036
>>>
Neil
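The per-character cost behind these numbers can be estimated without knowing the object header size, by measuring how much a string grows when it gets longer. The sketch below assumes CPython (where `sys.getsizeof` works on `str`); the helper name `bytes_per_char` and the sample characters are hypothetical choices for illustration.

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character for strings made only of `ch`.

    The size difference between a string of 2*n and one of n characters
    cancels the fixed object header, leaving only the character payload.
    CPython-specific: other implementations may not support getsizeof.
    """
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n

for label, ch in [("ASCII 'a'", "a"), ("BULLET", "\u2022"), ("U+1F600", "\U0001F600")]:
    print(label, bytes_per_char(ch))
```

On CPython 3.3 or later this should report 1, 2 and 4 bytes respectively; on a wide build of 3.2 every row would be expected to show 4, which is the comparison made above.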
| From | MRAB <python@mrabarnett.plus.com> |
|---|---|
| Date | 2012-08-23 16:11 +0100 |
| Message-ID | <mailman.3717.1345734660.4697.python-list@python.org> |
| In reply to | #27733 |
On 23/08/2012 14:57, Neil Hodgson wrote:
> wxjmfauth@gmail.com:
>
>> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
>> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
>> and you will see all the optimization efforts destroyed.
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 4025
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 8040
>
> This example is still benefiting from shrinking the number of bytes
> in half over using 32 bits per character as was the case with Python 3.2:
>
> >>> sys.getsizeof('a' * 80 * 50)
> 16032
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 16036
> >>>
>
Perhaps the solution should've been to just switch between 2/4 bytes
instead of 1/2/4 bytes. :-)
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-23 09:19 -0600 |
| Message-ID | <mailman.3718.1345735195.4697.python-list@python.org> |
| In reply to | #27733 |
On Thu, Aug 23, 2012 at 9:11 AM, MRAB <python@mrabarnett.plus.com> wrote:
> Perhaps the solution should've been to just switch between 2/4 bytes instead
> of 1/2/4 bytes. :-)
Why? You don't lose any complexity by doing that. I can see arguments
for 1/2/4 or for just 4, but I can't see any advantage of 2/4 over
either of those.
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-23 11:33 -0700 |
| Message-ID | <7eaafbcd-597d-4f8c-98a8-ecb537e6e065@googlegroups.com> |
| In reply to | #27733 |
On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
> wxjmfauth@gmail.com:
>
> > Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> > characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> > and you will see all the optimization efforts destroyed.
> >
> > >>> sys.getsizeof('a' * 80 * 50)
> > 4025
> > >>> sys.getsizeof('a' * 80 * 50 + '•')
> > 8040
>
> This example is still benefiting from shrinking the number of bytes
> in half over using 32 bits per character as was the case with Python 3.2:
>
> >>> sys.getsizeof('a' * 80 * 50)
> 16032
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 16036
>
Correct, but how many times does it happen?
Practically never.
In this unicode stuff, I'm fascinated by the obsession
to solve a problem which is, due to the nature of
Unicode, unsolvable.
For every optimization algorithm, for every code
point range you can optimize, it is always possible
to find a case breaking that optimization.
This more or less follows mathematical logic. To prove a
law is valid, you have to prove that all the cases
are valid. To prove a law is invalid, you just have to find one
case showing it.
Sure, it is possible to optimize the unicode usage
by not using French characters, punctuation, mathematical
symbols, currency symbols, CJK characters...
(select undesired characters here: http://www.unicode.org/charts/).
In that case, why use Unicode?
(A problem not specific to Python.)
jmf
| From | Ian Kelly <ian.g.kelly@gmail.com> |
|---|---|
| Date | 2012-08-23 13:22 -0600 |
| Message-ID | <mailman.3730.1345749768.4697.python-list@python.org> |
| In reply to | #27757 |
On Thu, Aug 23, 2012 at 12:33 PM, <wxjmfauth@gmail.com> wrote:
>> > >>> sys.getsizeof('a' * 80 * 50)
>> > 4025
>> > >>> sys.getsizeof('a' * 80 * 50 + '•')
>> > 8040
>>
>> This example is still benefiting from shrinking the number of bytes
>> in half over using 32 bits per character as was the case with Python 3.2:
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how many times does it happen?
> Practically never.
What are you talking about? Surely it happens the same number of
times that your example happens, since it's the same example. By
dismissing this example as being too infrequent to be of any
importance, you dismiss the validity of your own example as well.
> In this unicode stuff, I'm fascinated by the obsession
> to solve a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case breaking that optimization.
So what? Similarly, for any generalized data compression algorithm,
it is possible to engineer inputs for which the "compressed" output is
as large as or larger than the original input (this is easy to prove).
Does this mean that compression algorithms are useless? I hardly
think so, as evidenced by the widespread popularity of tools like gzip
and WinZip.
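The compression analogy can be tried out with the standard `zlib` module. This is a minimal sketch under stated assumptions: the seeded pseudo-random bytes stand in for an "engineered" incompressible input, and the sample sentence and sizes are arbitrary choices, not data from this thread.

```python
import random
import zlib

rng = random.Random(0)  # fixed seed so the run is repeatable

# Pseudo-random bytes: effectively incompressible input.
noise = bytes(rng.getrandbits(8) for _ in range(4096))
# Highly repetitive input of comparable size.
text = b"Flexible string representation. " * 128

packed_noise = zlib.compress(noise)
packed_text = zlib.compress(text)

print("noise:", len(noise), "->", len(packed_noise))
print("text: ", len(text), "->", len(packed_text))
```

The noise is expected to come out no smaller than it went in (the format adds a little framing overhead), while the repetitive text shrinks to a small fraction of its size. The worst case exists, yet the tool remains useful for typical input.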
You seem to be saying that because we cannot pack all Unicode strings
into 1-byte or 2-byte per character representations, we should just
give up and force everybody to use maximum-width representations for
all strings. That is absurd.
> Sure, it is possible to optimize the unicode usage
> by not using French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select undesired characters here: http://www.unicode.org/charts/).
>
> In that case, why use Unicode?
> (A problem not specific to Python.)
Obviously, it is because I want to have the *ability* to represent all
those characters in my strings, even if I am not necessarily going to
take advantage of that ability in every single string that I produce.
Not all of the strings I use are going to fit into the 1-byte or
2-byte per character representation. Fine, whatever -- that's part of
the cost of internationalization. However, *most* of the strings that
I work with (this entire email message, for instance) -- and, I think,
most of the strings that any developer works with (identifiers in the
standard library, for instance) -- will fit into at least the 2-byte
per character representation. Why shackle every string everywhere to
4 bytes per character when for a majority of them we can do much
better than that?
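The argument above rests on the width being chosen per string, from the largest code point that string contains. A rough Python sketch of that selection rule follows; it is an illustration of the 1/2/4-byte idea discussed in this thread, not CPython's actual C implementation, and the function name `storage_width` and the sample strings are hypothetical.

```python
def storage_width(s):
    """Bytes per character a 1/2/4-byte scheme would need for the string `s`."""
    top = max(map(ord, s), default=0)  # largest code point; 0 for the empty string
    if top < 0x100:       # fits in one byte (ASCII and Latin-1 range)
        return 1
    if top < 0x10000:     # fits in two bytes (Basic Multilingual Plane)
        return 2
    return 4              # anything beyond the BMP

samples = [
    "identifier_name",
    "caf\u00e9",
    "2 \u20ac",
    "a" * 4000 + "\u2022",
    "\U0001F600",
]
for s in samples:
    print(repr(s[:12]), "width", storage_width(s), "instead of a fixed 4")
```

Only the last sample needs the full 4 bytes per character. The page of ASCII text with one BULLET moves from width 1 to width 2, which is the doubling shown in the original post, and is still half of a fixed 4-byte layout.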
| From | rusi <rustompmody@gmail.com> |
|---|---|
| Date | 2012-08-24 09:06 -0700 |
| Message-ID | <a657deea-b429-4662-898e-c500ef592556@f4g2000pbq.googlegroups.com> |
| In reply to | #27762 |
On Aug 24, 12:22 am, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Thu, Aug 23, 2012 at 12:33 PM, <wxjmfa...@gmail.com> wrote:
> >> > >>> sys.getsizeof('a' * 80 * 50)
> >> > 4025
> >> > >>> sys.getsizeof('a' * 80 * 50 + '•')
> >> > 8040
> >>
> >> This example is still benefiting from shrinking the number of bytes
> >> in half over using 32 bits per character as was the case with Python 3.2:
> >>
> >> >>> sys.getsizeof('a' * 80 * 50)
> >> 16032
> >> >>> sys.getsizeof('a' * 80 * 50 + '•')
> >> 16036
>
> > Correct, but how many times does it happen?
> > Practically never.
>
> What are you talking about? Surely it happens the same number of
> times that your example happens, since it's the same example. By
> dismissing this example as being too infrequent to be of any
> importance, you dismiss the validity of your own example as well.
>
> > In this unicode stuff, I'm fascinated by the obsession
> > to solve a problem which is, due to the nature of
> > Unicode, unsolvable.
>
> > For every optimization algorithm, for every code
> > point range you can optimize, it is always possible
> > to find a case breaking that optimization.
>
> So what? Similarly, for any generalized data compression algorithm,
> it is possible to engineer inputs for which the "compressed" output is
> as large as or larger than the original input (this is easy to prove).
> Does this mean that compression algorithms are useless? I hardly
> think so, as evidenced by the widespread popularity of tools like gzip
> and WinZip.
>
> You seem to be saying that because we cannot pack all Unicode strings
> into 1-byte or 2-byte per character representations, we should just
> give up and force everybody to use maximum-width representations for
> all strings. That is absurd.
>
> > Sure, it is possible to optimize the unicode usage
> > by not using French characters, punctuation, mathematical
> > symbols, currency symbols, CJK characters...
> > (select undesired characters here: http://www.unicode.org/charts/).
>
> > In that case, why use Unicode?
> > (A problem not specific to Python.)
>
> Obviously, it is because I want to have the *ability* to represent all
> those characters in my strings, even if I am not necessarily going to
> take advantage of that ability in every single string that I produce.
> Not all of the strings I use are going to fit into the 1-byte or
> 2-byte per character representation. Fine, whatever -- that's part of
> the cost of internationalization. However, *most* of the strings that
> I work with (this entire email message, for instance) -- and, I think,
> most of the strings that any developer works with (identifiers in the
> standard library, for instance) -- will fit into at least the 2-byte
> per character representation. Why shackle every string everywhere to
> 4 bytes per character when for a majority of them we can do much
> better than that?
Actually, what exactly are you (jmf) asking for?
It's not clear to anybody, as best as we can see...
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-24 17:47 +0100 |
| Message-ID | <mailman.3761.1345826801.4697.python-list@python.org> |
| In reply to | #27809 |
On 24/08/2012 17:06, rusi wrote:
>
> Actually what exactly are you (jmf) asking for?
> Its not clear to anybody as best as we can see...
>
A knee in the temple and a dagger up the <censored> ? :) From another
Monty Python sketch for those who don't know.
--
Cheers.
Mark Lawrence.
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2012-08-24 14:34 -0400 |
| Message-ID | <mailman.3765.1345833280.4697.python-list@python.org> |
| In reply to | #27809 |
On Fri, 24 Aug 2012 17:47:42 +0100, Mark Lawrence
<breamoreboy@yahoo.co.uk> declaimed the following in
gmane.comp.python.general:
>
> A knee in the temple and a dagger up the <censored> ? :) From another
> Monty Python sketch for those who don't know.
A poignard in the codpiece...
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-23 20:34 +0100 |
| Message-ID | <mailman.3731.1345750334.4697.python-list@python.org> |
| In reply to | #27757 |
On 23/08/2012 19:33, wxjmfauth@gmail.com wrote:
> On Thursday, 23 August 2012 15:57:50 UTC+2, Neil Hodgson wrote:
>> wxjmfauth@gmail.com:
>>
>>> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
>>> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
>>> and you will see all the optimization efforts destroyed.
>>>
>>> >>> sys.getsizeof('a' * 80 * 50)
>>> 4025
>>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>>> 8040
>>
>> This example is still benefiting from shrinking the number of bytes
>> in half over using 32 bits per character as was the case with Python 3.2:
>>
>> >>> sys.getsizeof('a' * 80 * 50)
>> 16032
>> >>> sys.getsizeof('a' * 80 * 50 + '•')
>> 16036
>>
> Correct, but how many times does it happen?
> Practically never.
>
> In this unicode stuff, I'm fascinated by the obsession
> to solve a problem which is, due to the nature of
> Unicode, unsolvable.
>
> For every optimization algorithm, for every code
> point range you can optimize, it is always possible
> to find a case breaking that optimization.
>
> This more or less follows mathematical logic. To prove a
> law is valid, you have to prove that all the cases
> are valid. To prove a law is invalid, you just have to find one
> case showing it.
>
> Sure, it is possible to optimize the unicode usage
> by not using French characters, punctuation, mathematical
> symbols, currency symbols, CJK characters...
> (select undesired characters here: http://www.unicode.org/charts/).
>
> In that case, why use Unicode?
> (A problem not specific to Python.)
>
> jmf
>
What do you propose should be used instead, as you appear to be the
resident expert in the field?
--
Cheers.
Mark Lawrence.
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-23 15:18 +0100 |
| Message-ID | <mailman.3715.1345731438.4697.python-list@python.org> |
| In reply to | #27730 |
On 23/08/2012 13:47, wxjmfauth@gmail.com wrote:
> This is neither a complaint nor a question, just a comment.
>
> In the previous discussion related to the flexible
> string representation, Roy Smith added this comment:
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
>
> Not only do I agree with his sentence:
> "Clearly, the world has moved to a 32-bit character set."
>
> he also used a very interesting word in his comment: "punctuation".
>
> There is a point which is, in my mind, not very well understood,
> "digested", underestimated or neglected by many developers:
> the relation between the coding of the characters and the typography.
>
> Unicode (the consortium) does not only deal with the coding of
> the characters; it also works on the *classification* of characters.
>
> A deliberately simplistic picture: "letters" sit at the bottom
> of the table, at lower code points/integers; "typographic characters"
> like punctuation, common symbols, ... sit high in the table, at high code
> points/integers.
>
> The conclusion is inescapable: if one wishes to work in a "Unicode
> mode", one is forced to use the whole palette of Unicode
> code points. This is the *nature* of Unicode.
>
> Technically, believing that it is possible to optimize only a subrange
> of the Unicode code point range is simply an illusion. It is a lot of
> work, probably quite complicated, which in the end solves nothing.
>
> Python, in my mind, fell into this trap.
>
> "Simple is better than complex."
> -> hard to maintain
> "Flat is better than nested."
> -> code points range
> "Special cases aren't special enough to break the rules."
> -> special unicode code points?
> "Although practicality beats purity."
> -> or the opposite?
> "In the face of ambiguity, refuse the temptation to guess."
> -> guessing a user will only work with the "optimized" char subrange.
> ...
>
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
>
> Just my 2 € (code point 0x20ac) cents.
>
> jmf
>
I'm looking forward to all the patches you are going to provide to
correct all these (presumably) CPython deficiencies. When do they start
arriving on the bug tracker?
--
Cheers.
Mark Lawrence.
| From | Ramchandra Apte <maniandram01@gmail.com> |
|---|---|
| Date | 2012-08-24 07:38 -0700 |
| Message-ID | <1874857c-68ef-4c1b-b15a-46ef47df9445@googlegroups.com> |
| In reply to | #27730 |
On Thursday, 23 August 2012 18:17:29 UTC+5:30, (unknown) wrote:
> This is neither a complaint nor a question, just a comment.
>
>
>
> In the previous discussion related to the flexible
>
> string representation, Roy Smith added this comment:
>
>
>
> http://groups.google.com/group/comp.lang.python/browse_thread/thread/2645504f459bab50/eda342573381ff42
>
>
>
> Not only I agree with his sentence:
>
> "Clearly, the world has moved to a 32-bit character set."
>
>
>
> he used in his comment a very intersting word: "punctuation".
>
>
>
> There is a point which is, in my mind, not very well understood,
>
> "digested", underestimated or neglected by many developers:
>
> the relation between the coding of the characters and the typography.
>
>
>
> Unicode (the consortium), does not only deal with the coding of
>
> the characters, it also worked on the characters *classification*.
>
>
>
> A deliberatly simplistic representation: "letters" in the bottom
>
> of the table, lower code points/integers; "typographic characters"
>
> like punctuation, common symbols, ... high in the table, high code
>
> points/integers.
>
>
>
> The conclusion is inescapable, if one wish to work in a "unicode
>
> mode", one is forced to use the whole palette of the unicode
>
> code points, this is the *nature* of Unicode.
>
>
>
> Technically, believing that it possible to optimize only a subrange
>
> of the unicode code points range is simply an illusion. A lot of
>
> work, probably quite complicate, which finally solves nothing.
>
>
>
> Python, in my mind, fell into this trap.
>
> "Simple is better than complex."
> -> hard to maintain
> "Flat is better than nested."
> -> code points range
> "Special cases aren't special enough to break the rules."
> -> special unicode code points?
> "Although practicality beats purity."
> -> or the opposite?
> "In the face of ambiguity, refuse the temptation to guess."
> -> guessing a user will only work with the "optimized" char subrange.
> ...
>
> Small illustration. Take an A4 page containing 50 lines of 80 ASCII
> characters, add a single 'EM DASH' or a 'BULLET' (code points > 0x2000),
> and you will see all the optimization efforts destroyed.
>
> >>> import sys
> >>> sys.getsizeof('a' * 80 * 50)
> 4025
> >>> sys.getsizeof('a' * 80 * 50 + '•')
> 8040
>
> Just my 2 € (code point 0x20ac) cents.
>
> jmf
The Zen of Python is simply a guideline.
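The getsizeof comparison quoted above can be rerun as a small script. This is only a sketch, assuming CPython 3.3 or later; the variable names are illustrative, and the exact byte counts are implementation details that will likely differ from the 4025 and 8040 quoted.

```python
import sys

# 50 lines of 80 ASCII characters, as in the quoted illustration
page = 'a' * 80 * 50
# The same page plus a single BULLET (U+2022)
with_bullet = page + '\u2022'

ascii_size = sys.getsizeof(page)
bullet_size = sys.getsizeof(with_bullet)

print(ascii_size, bullet_size)
# Rough bytes per character; the fixed object header is included,
# so these figures are slightly above the raw per-character width.
print(round(ascii_size / len(page), 2))
print(round(bullet_size / len(with_bullet), 2))
```

Printing the per-character ratio makes it easier to see which storage width the interpreter picked for each string.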
[toc] | [prev] | [next] | [standalone]
| From | Antoine Pitrou <solipsis@pitrou.net> |
|---|---|
| Date | 2012-08-25 00:24 +0000 |
| Message-ID | <mailman.3784.1345854291.4697.python-list@python.org> |
| In reply to | #27802 |
Ramchandra Apte <maniandram01 <at> gmail.com> writes:
>
> The zen of python is simply a guideline

What's more, the Zen guides the language's design, not its implementation.
People who think CPython is a complicated implementation can take a look at PyPy
:-)

Regards

Antoine.

--
Software development and contracting: http://pro.pitrou.net
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-25 00:27 -0700 |
| Message-ID | <mailman.3788.1345879639.4697.python-list@python.org> |
| In reply to | #27843 |
On Saturday, 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
> >
> > The zen of python is simply a guideline
>
> What's more, the Zen guides the language's design, not its implementation.
> People who think CPython is a complicated implementation can take a look at PyPy
> :-)

Unicode design: a flat table of code points, where all code points are "equals".
As soon as one attempts to escape from this rule, one has to "pay" for it.
The creator of this machinery (flexible string representation) cannot even benefit from it in his native language (I think I'm correctly informed).

Hint: Google -> "Das grosse Eszett"

jmf
[toc] | [prev] | [next] | [standalone]
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2012-08-25 17:54 +1000 |
| Message-ID | <87sjbbe78w.fsf@benfinney.id.au> |
| In reply to | #27853 |
wxjmfauth@gmail.com writes:

> Unicode design: a flat table of code points, where all code
> points are "equals".

Yes, Unicode's design entails a flat table of hundreds of thousands of code points, expansible in future. This is in direct conflict with the design of all significant computers we need to write software for: data stored and transported as 8-bit bytes, which can only ever hold 256 different values, no expansion.

> As soon as one attempts to escape from this rule, one has to
> "pay" for it.

Yes, in either direction; the conflict means that trade-offs need to be made. See this presentation by Ned Batchelder, “Pragmatic Unicode” <URL:http://nedbatchelder.com/text/unipain.html>, which lays out the fundamental conflict of representing human text in computer data; and several practical approaches to deal with it.

--
“I busted a mirror and got seven years bad luck, but my lawyer thinks he can get me five.” (Steven Wright)
Ben Finney
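The conflict between code points and 8-bit bytes can be seen by encoding a single character several ways. This is a minimal sketch using only the standard `str.encode` method; the choice of BULLET (U+2022) and of these three codecs is just for illustration.

```python
bullet = '\u2022'  # BULLET, one code point with value 0x2022

# One code point, three different byte sequences depending on the encoding
sizes = {}
for codec in ('utf-8', 'utf-16-le', 'utf-32-le'):
    data = bullet.encode(codec)
    sizes[codec] = len(data)
    print(codec, len(data), data.hex())
```

None of the three fits in a single byte, which is the trade-off every representation has to deal with in one way or another.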
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-25 09:58 +0100 |
| Message-ID | <mailman.3791.1345885204.4697.python-list@python.org> |
| In reply to | #27854 |
On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> On Saturday, 25 August 2012 02:24:35 UTC+2, Antoine Pitrou wrote:
>> Ramchandra Apte <maniandram01 <at> gmail.com> writes:
>>>
>>> The zen of python is simply a guideline
>>
>> What's more, the Zen guides the language's design, not its implementation.
>> People who think CPython is a complicated implementation can take a look at PyPy
>> :-)
>
> Unicode design: a flat table of code points, where all code
> points are "equals".
> As soon as one attempts to escape from this rule, one has to
> "pay" for it.
> The creator of this machinery (flexible string representation)
> can not even benefit from it in his native language (I think
> I'm correctly informed).
>
> Hint: Google -> "Das grosse Eszett"
>
> jmf
>

It's Saturday morning, I'm stone cold sober, had a good sleep and I'm still baffled as to the point, if any. Could someone please enlighten me?

--
Cheers.

Mark Lawrence.
[toc] | [prev] | [next] | [standalone]
| From | Frank Millman <frank@chagford.com> |
|---|---|
| Date | 2012-08-25 11:46 +0200 |
| Message-ID | <mailman.3793.1345888006.4697.python-list@python.org> |
| In reply to | #27854 |
On 25/08/2012 10:58, Mark Lawrence wrote:
> On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
>>
>> Unicode design: a flat table of code points, where all code
>> points are "equals".
>> As soon as one attempts to escape from this rule, one has to
>> "pay" for it.
>> The creator of this machinery (flexible string representation)
>> can not even benefit from it in his native language (I think
>> I'm correctly informed).
>>
>> Hint: Google -> "Das grosse Eszett"
>>
>> jmf
>>
>
> It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> still baffled as to the point if any. Could someone please enlighten me?
>

Here's what I think he is saying. I am posting this to test the water. I am also confused, and if I have got it wrong hopefully someone will correct me.

In python 3.3, unicode strings are now stored as follows -

if all characters can be represented by 1 byte, the entire string is composed of 1-byte characters
else if all characters can be represented by 1 or 2 bytes, the entire string is composed of 2-byte characters
else the entire string is composed of 4-byte characters

There is an overhead in making this choice, to detect the lowest number of bytes required.

jmfauth believes that this only benefits 'english-speaking' users, as the rest of the world will tend to have strings where at least one character requires 2 or 4 bytes. So they incur the overhead, without getting any benefit.

Therefore, I think he is saying that he would have preferred that python standardise on 4-byte characters, on the grounds that the saving in memory does not justify the performance overhead.

Frank Millman
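The storage rule described above can be sketched as a small function. This is an illustration only, not CPython's actual C code: `storage_width` is a hypothetical helper name, and the thresholds assume that "fits in 1 byte" means code points below 0x100 and "fits in 2 bytes" means code points below 0x10000.

```python
def storage_width(text):
    """Bytes per character under the three-way rule described above.

    Hypothetical helper for illustration; the real decision is made
    inside the interpreter when the string object is created.
    """
    # The widest character decides the width of the whole string.
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4


for sample in ('abc', 'abc\u2022', 'abc\U0001F600'):
    print(ascii(sample), storage_width(sample))
```

The scan for the highest code point is the "overhead in making this choice" mentioned above: it has to look at every character once.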
[toc] | [prev] | [next] | [standalone]
| From | wxjmfauth@gmail.com |
|---|---|
| Date | 2012-08-25 08:47 -0700 |
| Message-ID | <mailman.3805.1345909675.4697.python-list@python.org> |
| In reply to | #27860 |
On Saturday, 25 August 2012 11:46:34 UTC+2, Frank Millman wrote:
> On 25/08/2012 10:58, Mark Lawrence wrote:
> > On 25/08/2012 08:27, wxjmfauth@gmail.com wrote:
> >>
> >> Unicode design: a flat table of code points, where all code
> >> points are "equals".
> >> As soon as one attempts to escape from this rule, one has to
> >> "pay" for it.
> >> The creator of this machinery (flexible string representation)
> >> can not even benefit from it in his native language (I think
> >> I'm correctly informed).
> >>
> >> Hint: Google -> "Das grosse Eszett"
> >>
> >> jmf
> >>
> >
> > It's Saturday morning, I'm stone cold sober, had a good sleep and I'm
> > still baffled as to the point if any. Could someone please enlighten me?
> >
>
> Here's what I think he is saying. I am posting this to test the water. I
> am also confused, and if I have got it wrong hopefully someone will
> correct me.
>
> In python 3.3, unicode strings are now stored as follows -
>
> if all characters can be represented by 1 byte, the entire string is
> composed of 1-byte characters
> else if all characters can be represented by 1 or 2 bytes, the entire
> string is composed of 2-byte characters
> else the entire string is composed of 4-byte characters
>
> There is an overhead in making this choice, to detect the lowest number
> of bytes required.
>
> jmfauth believes that this only benefits 'english-speaking' users, as
> the rest of the world will tend to have strings where at least one
> character requires 2 or 4 bytes. So they incur the overhead, without
> getting any benefit.
>
> Therefore, I think he is saying that he would have preferred that python
> standardise on 4-byte characters, on the grounds that the saving in
> memory does not justify the performance overhead.
>
> Frank Millman

Very well explained. Thanks.

More precisely, those affected are not only the 'english-speaking' users, but all the users who are using non-latin-1 characters. (See the title of this topic, ... typography).

Being latin-1 compliant and unicode compliant at the same time is a plain absurdity in the mathematical sense.

---

For those who do not know, the Go language has introduced the rune type. As far as I know, nobody is complaining; I have not even seen a discussion related to this subject. 100% Unicode compliant from day 0. Congratulations.

jmf
[toc] | [prev] | [next] | [standalone]