Groups > comp.lang.python > #53771 > unrolled thread

Chardet, file, ... and the Flexible String Representation

Started by	wxjmfauth@gmail.com
First post	2013-09-06 02:11 -0700
Last post	2013-09-12 00:11 +0300
Articles	18 — 12 participants


  Chardet, file, ... and the Flexible String Representation wxjmfauth@gmail.com - 2013-09-06 02:11 -0700
    Re: Chardet, file, ... and the Flexible String Representation Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-09-06 10:57 +0000
    Re: Chardet, file, ... and the Flexible String Representation Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-09-06 13:10 +0200
    Re: Chardet, file, ... and the Flexible String Representation Ned Batchelder <ned@nedbatchelder.com> - 2013-09-06 07:02 -0400
    Re: Chardet, file, ... and the Flexible String Representation Piet van Oostrum <piet@vanoostrum.org> - 2013-09-06 11:46 -0400
      Re: Chardet, file, ... and the Flexible String Representation Chris Angelico <rosuav@gmail.com> - 2013-09-07 02:04 +1000
      Re: Chardet, file, ... and the Flexible String Representation random832@fastmail.us - 2013-09-06 12:59 -0400
      Re: Chardet, file, ... and the Flexible String Representation Chris Angelico <rosuav@gmail.com> - 2013-09-07 03:04 +1000
      Re: Chardet, file, ... and the Flexible String Representation wxjmfauth@gmail.com - 2013-09-09 07:28 -0700
        Re: Chardet, file, ... and the Flexible String Representation Ned Batchelder <ned@nedbatchelder.com> - 2013-09-09 12:38 -0400
        Re: Chardet, file, ... and the Flexible String Representation Michael Torrie <torriem@gmail.com> - 2013-09-09 11:05 -0600
          Re: Chardet, file, ... and the Flexible String Representation Steven D'Aprano <steve@pearwood.info> - 2013-09-10 04:58 +0000
        Re: Chardet, file, ... and the Flexible String Representation Terry Reedy <tjreedy@udel.edu> - 2013-09-09 16:47 -0400
        Re: Chardet, file, ... and the Flexible String Representation random832@fastmail.us - 2013-09-10 11:36 -0400
      Re: Chardet, file, ... and the Flexible String Representation random832@fastmail.us - 2013-09-09 14:34 -0400
      Re: Chardet, file, ... and the Flexible String Representation Ian Kelly <ian.g.kelly@gmail.com> - 2013-09-09 13:03 -0600
      Re: Chardet, file, ... and the Flexible String Representation random832@fastmail.us - 2013-09-09 15:27 -0400
      Re: Chardet, file, ... and the Flexible String Representation Serhiy Storchaka <storchaka@gmail.com> - 2013-09-12 00:11 +0300

#53771 — Chardet, file, ... and the Flexible String Representation

From	wxjmfauth@gmail.com
Date	2013-09-06 02:11 -0700
Subject	Chardet, file, ... and the Flexible String Representation
Message-ID	<4ce85ea8-4a4c-46cf-a546-ad999576a5f7@googlegroups.com>

Short comment about the "detection" tools from a previous
discussion.

The tools supposed to detect the coding scheme are all
working with a simple logical mathematical rule:

p  ==> q    <==>   non q  ==> non p .

In short, and as a consequence, they do not detect *the*
coding scheme; they only detect "a" possible coding scheme.


The Flexible String Representation has conceptually to
face the same problem. It splits "unicode" in chunks and
it has to solve two problems at the same time, the coding
and the handling of multiple "char sets". The problem?
It fails.
"This poor Flexible String Representation does not succeed
in solving the problem it creates itself."

Workaround: add more flags (see PEP 3xx.)

Still thinking "mathematics" (limit). For a given repertoire
of characters one can assume that every char has its own
flag (because of the usage of multiple coding schemes).
Conceptually, one will quickly realize that, in the end, there
will be as many flags as there are characters,
and the only valid solution is to work with a unique set of
encoded code points, where every element of this set *is*
its own flag.
Curiously, that's what the utf-* (and btw other coding schemes
in the byte string world) are doing (with plenty of other
advantages).

Already said. A healthy coding scheme can only work with
a unique set of encoded code points. That's why we have to
live today with all these coding schemes.

jmf


#53778

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2013-09-06 10:57 +0000
Message-ID	<5229b52c$0$29988$c3e8da3$5496439d@news.astraweb.com>
In reply to	#53771

On Fri, 06 Sep 2013 02:11:56 -0700, wxjmfauth wrote:

> Short comment about the "detection" tools from a previous discussion.
> 
> The tools supposed to detect the coding scheme are all working with a
> simple logical mathematical rule:
> 
> p  ==> q    <==>   non q  ==> non p .

Incorrect.

chardet does a statistical analysis of the bytes, and tries to guess what 
language they are likely to come from. The algorithm is described here:

https://github.com/erikrose/chardet/blob/master/docs/how-it-works.html

(although that's rather inconvenient to read), and here:

http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html


chardet is a Python port of the Mozilla charset guesser, so they use the 
same algorithm.


> Shortly  -- and consequence  --  they do not detect a coding scheme they
> only detect "a" possible coding schme.

That at least is correct.
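
A minimal sketch of why a detector can only ever report "a" possible coding scheme: the same bytes often decode without error under several encodings. This uses only the standard library, not chardet itself; the helper name `plausible_encodings` and the candidate list are illustrative assumptions.

```python
def plausible_encodings(data, candidates):
    """Return the candidate encodings under which `data` decodes without error."""
    accepted = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are impossible in this encoding
        accepted.append(name)
    return accepted

raw = "caf\xe9".encode("latin-1")  # the bytes b'caf\xe9'
result = plausible_encodings(raw, ["utf-8", "latin-1", "cp1252", "iso8859_15"])
print(result)
```

Decoding rules out UTF-8 here, but three candidates remain equally valid, which is why a tool like chardet has to fall back on statistics to pick one.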


> The Flexible String Representation has conceptually to face the same
> problem. 

No it doesn't.


-- 
Steven


#53779

From	Antoon Pardon <antoon.pardon@rece.vub.ac.be>
Date	2013-09-06 13:10 +0200
Message-ID	<mailman.121.1378465845.5461.python-list@python.org>
In reply to	#53771

On 06-09-13 11:11, wxjmfauth@gmail.com wrote:

> 
> The Flexible String Representation has conceptually to
> face the same problem. It splits "unicode" in chunks and
> it has to solve two problems at the same time, the coding
> and the handling of multiple "char sets". The problem?

Not true. The FSR always uses the same coding. An "A" is
always coded as 65.
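
A short check of that statement, as a sketch: whatever else the string contains, the code point Python reports for "A" stays 65. The sample strings are arbitrary choices for illustration.

```python
# "A" next to Latin-1, BMP and non-BMP characters
samples = ["A", "A\xe9", "A\u20ac", "A\U0001d11e"]
first_code_points = [ord(text[0]) for text in samples]
print(first_code_points)
```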

-- 
Antoon Pardon


#53780

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-09-06 07:02 -0400
Message-ID	<mailman.122.1378465881.5461.python-list@python.org>
In reply to	#53771

On 9/6/13 5:11 AM, wxjmfauth@gmail.com wrote:
> The Flexible String Representation has conceptually to
> face the same problem. It splits "unicode" in chunks and
> it has to solve two problems at the same time, the coding
> and the handling of multiple "char sets". The problem?
> It fails.

Just once, please say *how* it fails.  :(

--Ned.


#53791

From	Piet van Oostrum <piet@vanoostrum.org>
Date	2013-09-06 11:46 -0400
Message-ID	<m2a9jqq7g9.fsf@cochabamba.vanoostrum.org>
In reply to	#53771

wxjmfauth@gmail.com writes:

> The Flexible String Representation has conceptually to
> face the same problem. It splits "unicode" in chunks and
> it has to solve two problems at the same time, the coding
> and the handling of multiple "char sets". The problem?
> It fails.
> "This poor Flexible String Representation does not succeed
> to solve the problem it create itsself."

The FSR does not split unicode in chunks. It does not create problems and therefore it doesn't have to solve this. 

The FSR simply stores a Unicode string as an array[*] of ints (the Unicode code points of the characters of the string). That's it. Then it uses a memory-efficient way to store this array of ints. But that has nothing to do with character sets. The same principle could be used for any array of ints.

So you are seeking problems where there are none. And you would have a lot more peace of mind if you stopped doing this.

[*] array in the C sense.
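
The idea can be modelled in pure Python with the standard `array` module. This is only a sketch of the principle, not CPython's actual C implementation; `pack_code_points` is a hypothetical helper, and it assumes the `"I"` typecode is 4 bytes wide, as it is on common platforms.

```python
from array import array

def pack_code_points(text):
    """Store the code points of `text` in the narrowest array type that fits."""
    code_points = [ord(ch) for ch in text]
    highest = max(code_points, default=0)
    if highest < 0x100:
        typecode = "B"   # 1 byte per item
    elif highest < 0x10000:
        typecode = "H"   # 2 bytes per item
    else:
        typecode = "I"   # assumed 4 bytes per item
    return array(typecode, code_points)

for text in ("hundred", "hundre\u20ac", "hundre\U0001d11e"):
    packed = pack_code_points(text)
    # The stored integers are the same code points in every case;
    # only the item width changes.
    print(repr(text), packed.itemsize, list(packed) == [ord(ch) for ch in text])
```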
-- 
Piet van Oostrum <piet@vanoostrum.org>
WWW: http://pietvanoostrum.com/
PGP key: [8DAE142BE17999C4]


#53792

From	Chris Angelico <rosuav@gmail.com>
Date	2013-09-07 02:04 +1000
Message-ID	<mailman.127.1378483495.5461.python-list@python.org>
In reply to	#53791

On Sat, Sep 7, 2013 at 1:46 AM, Piet van Oostrum <piet@vanoostrum.org> wrote:
> The FSR simply stores a Unicode string as an array[*] of ints (the Unicode code points of the characters of the string. That's it. Then it uses a memory-efficient way to store this array of ints. But that has nothing to do with character sets. The same principle could be used for any array of ints.

Python does, in fact, store integers in different-sized blocks of
memory according to size - though not for anything smaller than
32-bit.

>>> import sys
>>> sys.getsizeof(100)
14
>>> sys.getsizeof(1000000000000000000000000000000000)
28

So why this is suddenly a bad thing for characters is a mystery none
but he can comprehend.
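
The same comparison as a script, for anyone who wants to rerun it. The exact byte counts depend on the CPython version and on whether the build is 32-bit or 64-bit, so this sketch only compares the two sizes rather than asserting particular numbers.

```python
import sys

small = sys.getsizeof(100)
large = sys.getsizeof(10**33)  # an integer far beyond 64 bits
print(small, large)
```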

ChrisA


#53796

From	random832@fastmail.us
Date	2013-09-06 12:59 -0400
Message-ID	<mailman.128.1378486750.5461.python-list@python.org>
In reply to	#53791

On Fri, Sep 6, 2013, at 11:46, Piet van Oostrum wrote:
> The FSR does not split unicode in chuncks. It does not create problems
> and therefore it doesn't have to solve this. 
> 
> The FSR simply stores a Unicode string as an array[*] of ints (the
> Unicode code points of the characters of the string. That's it. Then it
> uses a memory-efficient way to store this array of ints. But that has
> nothing to do with character sets. The same principle could be used for
> any array of ints.

I think the source of the confusion is that it is described in terms of
UCS-2 and Latin-1, which people often think of (especially latin-1) as
different encodings rather than merely storing code points in a narrower
type.
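
A small sketch of that point: for text that fits, the Latin-1 bytes and the UTF-16 code units of BMP-only text are numerically identical to the code points, so "Latin-1" and "UCS-2" here just name storage widths. The sample strings are arbitrary.

```python
import struct

latin = "h\xe9"
latin_bytes = latin.encode("latin-1")

bmp = "h\u20ac"
# Unpack the little-endian UTF-16 bytes as unsigned 16-bit integers.
bmp_units = struct.unpack("<%dH" % len(bmp), bmp.encode("utf-16-le"))

latin_matches = list(latin_bytes) == [ord(ch) for ch in latin]
bmp_matches = list(bmp_units) == [ord(ch) for ch in bmp]
print(latin_matches, bmp_matches)
```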

----

Incidentally, how does all this interact with ctypes unicode_buffers,
which slice as strings and must be UTF-16 on windows? This was fine
pre-FSR when unicode objects were UTF-16, but I'm not sure how it would
work now.


#53798

From	Chris Angelico <rosuav@gmail.com>
Date	2013-09-07 03:04 +1000
Message-ID	<mailman.129.1378487107.5461.python-list@python.org>
In reply to	#53791

On Sat, Sep 7, 2013 at 2:59 AM,  <random832@fastmail.us> wrote:
> Incidentally, how does all this interact with ctypes unicode_buffers,
> which slice as strings and must be UTF-16 on windows? This was fine
> pre-FSR when unicode objects were UTF-16, but I'm not sure how it would
> work now.

That would be pre-FSR *with a Narrow build*, which was the default on
Windows but not everywhere. But I don't know or use ctypes, so an
answer to your actual question will have to come from someone else.

ChrisA


#53873

From	wxjmfauth@gmail.com
Date	2013-09-09 07:28 -0700
Message-ID	<04abbe99-ca1e-40b5-86c7-64b0e5d9de9c@googlegroups.com>
In reply to	#53791

On Friday, 6 September 2013 17:46:14 UTC+2, Piet van Oostrum wrote:
> wxjmfauth@gmail.com writes:
>
> > The Flexible String Representation has conceptually to
> > face the same problem. It splits "unicode" in chunks and
> > it has to solve two problems at the same time, the coding
> > and the handling of multiple "char sets". The problem?
> > It fails.
> > "This poor Flexible String Representation does not succeed
> > to solve the problem it create itsself."
>
> The FSR does not split unicode in chuncks. It does not create problems and therefore it doesn't have to solve this. 
>
> The FSR simply stores a Unicode string as an array[*] of ints (the Unicode code points of the characters of the string. That's it. Then it uses a memory-efficient way to store this array of ints. But that has nothing to do with character sets. The same principle could be used for any array of ints.
>
> So you are seeking problems where there are none. And you would have a lot more peace of mind if you stopped doing this.
>
> [*] array in the C sense.
> -- 
> Piet van Oostrum <piet@vanoostrum.org>
> WWW: http://pietvanoostrum.com/
> PGP key: [8DAE142BE17999C4]
----------


Due to its nature, a character cannot be handled in the
same way as any other type. That's the purpose of the UTF.

-----

Chunk latin-1, performance

ref:
>>> import timeit
>>> timeit.timeit("a = 'hundred'; 'x' in a")
0.13144639994075646

>>> timeit.timeit("a = 'hundrez'; 'x' in a")
0.13780295544393084

Chunk ucs2, performance

>>> timeit.timeit("a = 'hundre€'; 'x' in a")
0.23505392241617074

Chunk ucs4, performance

>>> timeit.timeit("a = 'hundre\U0001d11e'; 'x' in a")
0.26266673650735584

Comment: Such differences never happen with utf.

-----

Chunk latin-1, memory

>>> sys.getsizeof('a')
26

Chunk ucs2, memory

>>> sys.getsizeof('€')
40

Comment: 14 bytes more than latin-1

Chunk ucs4, memory

>>> sys.getsizeof('\U0001d11e')
44

Comment: 18 bytes more than latin-1

Comment: With utf, a char (in string or not) never exceeds 4 bytes.

-----

'a' + '€' in utf, conceptually

Concatenate the *unicode transformation units*.
Some kind of a real direct 'a' + '€'.


'a' + '€' in FSR, conceptually

1) Check the "internal coding" of 'a'
2) Check the "internal coding" of '€'
3) Compare these codings

4a) If they match, concatenate the bytes

4b) If they do not match
	5) Re-encode the string which has to be re-encoded
	6) Concatenate
	7) Set the "internal coding" status for
	further processing
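
For reference, the steps above can be modelled in a few lines. This is a hypothetical pure-Python model, not CPython's code: `kind` and `concat_kind` are made-up names, and the widths 1, 2 and 4 stand for the Latin-1, UCS-2 and UCS-4 storage forms. In this model, steps 1 to 3 reduce to taking the larger of two small integers.

```python
def kind(text):
    """Bytes per character needed to store every code point of `text`."""
    highest = max(map(ord, text), default=0)
    if highest < 0x100:
        return 1
    if highest < 0x10000:
        return 2
    return 4

def concat_kind(left, right):
    """Width of the concatenation: the wider of the two operands."""
    return max(kind(left), kind(right))

print(concat_kind("a", "\u20ac"), kind("a" + "\u20ac"))
```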

-----

Complicated and full of side effects, e.g.:

>>> sys.getsizeof('a')
26
>>> sys.getsizeof('aé')
39

Is not a latin-1 "é" supposed to count as a latin-1 "a" ?

----

I picked random methods; there may be variations, but basically
this general behaviour is always to be expected.


jmf


#53877

From	Ned Batchelder <ned@nedbatchelder.com>
Date	2013-09-09 12:38 -0400
Message-ID	<mailman.184.1378744709.5461.python-list@python.org>
In reply to	#53873

On 9/9/13 10:28 AM, wxjmfauth@gmail.com wrote:
> Le vendredi 6 septembre 2013 17:46:14 UTC+2, Piet van Oostrum a écrit :
>> wxjmfauth@gmail.com writes:
>>
>>
>>
>>> The Flexible String Representation has conceptually to
>>> face the same problem. It splits "unicode" in chunks and
>>> it has to solve two problems at the same time, the coding
>>> and the handling of multiple "char sets". The problem?
>>> It fails.
>>> "This poor Flexible String Representation does not succeed
>>> to solve the problem it create itsself."
>>
>>
>> The FSR does not split unicode in chuncks. It does not create problems and therefore it doesn't have to solve this.
>>
>>
>>
>> The FSR simply stores a Unicode string as an array[*] of ints (the Unicode code points of the characters of the string. That's it. Then it uses a memory-efficient way to store this array of ints. But that has nothing to do with character sets. The same principle could be used for any array of ints.
>>
>>
>>
>> So you are seeking problems where there are none. And you would have a lot more peace of mind if you stopped doing this.
>>
>>
>>
>> [*] array in the C sense.
>>
>> -- 
>>
>> Piet van Oostrum <piet@vanoostrum.org>
>>
>> WWW: http://pietvanoostrum.com/
>>
>> PGP key: [8DAE142BE17999C4]
> ----------
>
>
> Due to its nature, a character cann't be handled in the
> same way a one another type. That's the purpose of the UTF.
>
> -----
>
> Chunk latin-1, perfomance
>
> ref:
>>>> timeit.timeit("a = 'hundred'; 'x' in a")
> 0.13144639994075646
>
>>>> timeit.timeit("a = 'hundrez'; 'x' in a")
> 0.13780295544393084
>
> Chunk ucs2, perfomance
>
>>>> timeit.timeit("a = 'hundre€'; 'x' in a")
> 0.23505392241617074
>
> Chunk ucs4, perfomance
>
>>>> timeit.timeit("a = 'hundre\U0001d11e'; 'x' in a")
> 0.26266673650735584
>
> Comment: Such differences never happen with utf.
>
> -----
>
> Chunk latin-1, memory
>
>>>> sys.getsizeof('a')
> 26
>
> Chunk ucs2, memory
>
>>>> sys.getsizeof('€')
> 40
>
> Comment: 14 bytes more than latin-1
>
> Chunk ucs4, memory
>
>>>> sys.getsizeof('\U0001d11e')
> 44
>
> Comment: 18 bytes more than latin-1
>
> Comment: With utf, a char (in string or not) never exceed 4
>
> bytes.
>
> -----
>
> 'a' + '€' in utf, conceptually
>
> Concatenate the *unicode tranformation units*.
> Some kind of a real direct 'a' + '€'.
>
>
> 'a' + '€' in FSR, conceptually
>
> 1) Check the "internal coding" of 'a'
> 2) Check the "internal coding" of '€'
> 3) Compare these codings
>
> 4a) If they match, concatenate the bytes
>
> 4b) If they do not match
> 	5) Reencode the string which has to
> 	6) Concatenate
> 	7) Set the "internal coding" status for
> 	further processing
>
> -----
>
> Complicate and full of side effects, eg :
>
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('aé')
> 39
>
> Is not a latin-1 "é" supposed to count as a latin-1 "a" ?
>
> ----
>
> I picked up random methods, there may be variations, basically
> this general behaviour is always expected.
>
>
> jmf
>

jmf, thanks for your reply.  You've calmed my fears that there is 
something wrong with the Flexible String Representation.  None of the 
examples you show demonstrate any behavior contrary to the Unicode spec.

--Ned.


#53880

From	Michael Torrie <torriem@gmail.com>
Date	2013-09-09 11:05 -0600
Message-ID	<mailman.187.1378746353.5461.python-list@python.org>
In reply to	#53873

On 09/09/2013 08:28 AM, wxjmfauth@gmail.com wrote:
> Comment: Such differences never happen with utf.

But with utf, slicing strings is O(n) (well that's a simplification as
someone showed an algorithm that is log n), whereas a fixed-width
encoding (Latin-1, UCS-2, UCS-4) is O(1).  Do you understand what this
means?
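
A sketch of what indexing costs in a variable-width encoding: to find character number `index` in UTF-8 bytes, the bytes before it have to be scanned, because there is no way to compute the byte offset directly. `utf8_char_at` is a hypothetical helper written for illustration, not anything from the standard library.

```python
def utf8_char_at(data, index):
    """Return the code point at `index` by scanning UTF-8 bytes from the start."""
    if index < 0:
        raise IndexError("negative indexes are not handled in this sketch")
    seen = -1      # index of the code point currently being scanned
    start = None   # byte offset where the wanted code point begins
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new code point starts
            seen += 1
            if seen == index:
                start = pos
            elif seen == index + 1:      # the wanted code point has just ended
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("string index out of range")
    return data[start:].decode("utf-8")

text = "a\xe9\u20ac\U0001d11ez"
encoded = text.encode("utf-8")
scanned = [utf8_char_at(encoded, i) for i in range(len(text))]
print(scanned == list(text))
```

With a fixed-width array, `text[i]` is a single address computation instead of this loop.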

> Complicate and full of side effects, eg :
> 
>>>> sys.getsizeof('a')
> 26
>>>> sys.getsizeof('aé')
> 39

Why on earth are you doing getsizeof?  What are you expecting to prove?
 Why are you even trying to concern yourself with implementation
details?  As a programmer you should deal with unicode.  Period.  All
you should care about is that you can properly index or slice a unicode
string and that unicode strings can be operated on at a reasonable speed.

I.e., string[4] should give you the character at position 4.  len(string)
should return the length of the string in *characters*.

The byte encoding used behind the scenes is of no consequence other than
speed (and you have not shown any problem with speed).

> 
> Is not a latin-1 "é" supposed to count as a latin-1 "a" ?

Of course it does.  'aé'[0] == 'a' and 'aé'[1] == 'é'.  len('aé') returns 2.

> I picked up random methods, there may be variations, basically
> this general behaviour is always expected.

Eh?  Can you point to something in the unicode spec that doesn't work?

I don't even know that much about unicode yet it's clear you're either
deliberately muddying the waters with your stupid and pointless
arguments against FSR or you don't really understand the difference
between unicode and byte encoding.  Which is it?


#53900

From	Steven D'Aprano <steve@pearwood.info>
Date	2013-09-10 04:58 +0000
Message-ID	<522ea700$0$29999$c3e8da3$5496439d@news.astraweb.com>
In reply to	#53880

On Mon, 09 Sep 2013 11:05:44 -0600, Michael Torrie wrote:

> On 09/09/2013 08:28 AM, wxjmfauth@gmail.com wrote:
>> Comment: Such differences never happen with utf.
> 
> But with utf, slicing strings is O(n) (well that's a simplification as
> someone showed an algorithm that is log n), whereas a fixed-width
> encoding (Latin-1, UCS-2, UCS-4) is O(1).  

UTF-32 is fixed-width. UTF-16 is not, but if you limit yourself to only 
characters in the Basic Multilingual Plane, it is functionally equivalent 
to UCS-2 and therefore fixed-width.
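
This can be checked by counting code units, as a quick sketch. `code_units` is a helper name made up for this example; the two sample strings differ only in whether the last character lies inside the Basic Multilingual Plane.

```python
def code_units(text, encoding, unit_size):
    """Number of fixed-size code units `text` occupies in `encoding`."""
    return len(text.encode(encoding)) // unit_size

bmp_text = "hundre\u20ac"          # 7 characters, all in the BMP
astral_text = "hundre\U0001d11e"   # 7 characters, the last one outside the BMP

utf32 = (code_units(bmp_text, "utf-32-le", 4), code_units(astral_text, "utf-32-le", 4))
utf16 = (code_units(bmp_text, "utf-16-le", 2), code_units(astral_text, "utf-16-le", 2))
print(utf32)  # (7, 7): one unit per character in both cases
print(utf16)  # (7, 8): the non-BMP character needs a surrogate pair
```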

> Do you understand what this means?

Talking about "utf" in general as JMF does is a good sign that he 
doesn't. Which UTF? I know of at least eight:

UTF-1
UTF-7
UTF-8
UTF-9  # this one is a joke, but it does work
UTF-16  # in two varieties, big-endian and little-endian
UTF-18  # another joke
UTF-32  # likewise two varieties
UTF-EBCDIC

although only 3 (perhaps 4, if you include UTF-7) are in common use.

[...]
> I don't even know that much about unicode yet it's clear you're either
> deliberately muddying the waters with your stupid and pointless
> arguments against FCS or you don't really understand the difference
> between unicode and byte encoding.  Which is it?

I have been watching JMF get a mad-on about the flexible string 
representation since he first noticed it, and in my opinion, his 
complaints are based entirely on resentment that ASCII users save more 
memory than non-ASCII users. Even if it means everyone is worse off, he 
is utterly opposed to giving ASCII users any benefit.

Of course, he neglects to consider that *every single Python user* is an 
ASCII user, since most strings in Python are pure ASCII. Names of 
builtins, standard library modules, variables, attributes, most of them 
are ASCII.

-- 
Steven


#53891

From	Terry Reedy <tjreedy@udel.edu>
Date	2013-09-09 16:47 -0400
Message-ID	<mailman.195.1378759675.5461.python-list@python.org>
In reply to	#53873

On 9/9/2013 12:38 PM, Ned Batchelder wrote:

> jmf, thanks for your reply.  You've calmed my fears that there is
> something wrong with the Flexible String Representation.  None of the
> examples you show demonstrate any behavior contrary to the Unicode spec.

The goals of the new unicode implementation:
1. one implementation on all platforms, working the same on all platforms.
2. works correctly
3. O(1) indexing
4. save as much space as sensibly possible
5. not too much time penalty for the space saving.

The new implementation succeeded on all points. It exceeded the goal for 
5. With much optimization work, there essentially is no overall time 
penalty left.

Jmf's size examples show success with respect to goal 4. He apparently 
disagrees with that goal and would replace it with something else. At 
least some of his time examples show that saving space can save time, as 
was predicted when the FSR was being developed.

-- 
Terry Jan Reedy


#53921

From	random832@fastmail.us
Date	2013-09-10 11:36 -0400
Message-ID	<mailman.220.1378827397.5461.python-list@python.org>
In reply to	#53873

On Mon, Sep 9, 2013, at 10:28, wxjmfauth@gmail.com wrote:
*time performance differences*
> 
> Comment: Such differences never happen with utf.

Why is this bad? Keeping in mind that otherwise they would all be almost
as slow as the UCS-4 case.

> >>> sys.getsizeof('a')
> 26
> >>> sys.getsizeof('€')
> 40
> >>> sys.getsizeof('\U0001d11e')
> 44
> 
> Comment: 18 bytes more than latin-1
> 
> Comment: With utf, a char (in string or not) never exceed 4 

A string is an object and needs to store the length, along with any
overhead relating to object headers. I believe there is also an appended
null character. Also, ASCII strings are stored differently from Latin-1
strings.

>>> sys.getsizeof('a'*999)
1048 = 49 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\xa4'*999)
1072 = 73 bytes overhead, 1 byte per character.
>>> sys.getsizeof('\u20ac'*999)
2072 = 74 bytes overhead, 2 bytes per character.
>>> sys.getsizeof('\U0001d11e'*999)
4072 = 76 bytes overhead, 4 bytes per character.

(I bet sys.getsizeof('\xa4') will return 38 on your system, so 44 is
only six bytes more, not 18)
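
A sketch for separating the per-character cost from the fixed overhead on whatever build is at hand. It assumes CPython 3.3 or later, since `sys.getsizeof` reports implementation details; the helper names are made up for this example, and the overhead figures it prints will differ between versions and between 32-bit and 64-bit builds.

```python
import sys

def bytes_per_char(ch):
    """Growth in object size when one more copy of `ch` is appended."""
    return sys.getsizeof(ch * 1000) - sys.getsizeof(ch * 999)

def fixed_overhead(ch):
    """Size left over once the character data itself is subtracted."""
    return sys.getsizeof(ch * 999) - 999 * bytes_per_char(ch)

widths = []
for ch in ("a", "\xa4", "\u20ac", "\U0001d11e"):
    widths.append(bytes_per_char(ch))
    print(repr(ch), bytes_per_char(ch), fixed_overhead(ch))
```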

If we did not have the FSR, everything would be 4 bytes per character.
We might have less overhead, but a string only has to be 25 characters
long before the savings from the shorter representation outweigh even
having _no_ overhead, and every four bytes of overhead reduces that
number by one. And you have a 32-bit python build, which has less
overhead than mine - in yours, strings only have to be seven characters
long for the FSR to be worth it. Assume the minimum possible overhead is
two words for the object header, a size, and a pointer - i.e. sixteen
bytes, compared to the 25 you've demonstrated for ASCII, and strings
only need to be _two_ characters long for the FSR to be a better deal
than always using UCS4 strings.

The need for four-byte-per-character strings would not go away by
eliminating the FSR, so you're basically saying that everything should
be constrained to the worst-case performance scenario.


#53883

From	random832@fastmail.us
Date	2013-09-09 14:34 -0400
Message-ID	<mailman.189.1378751678.5461.python-list@python.org>
In reply to	#53791

On Fri, Sep 6, 2013, at 13:04, Chris Angelico wrote:
> On Sat, Sep 7, 2013 at 2:59 AM,  <random832@fastmail.us> wrote:
> > Incidentally, how does all this interact with ctypes unicode_buffers,
> > which slice as strings and must be UTF-16 on windows? This was fine
> > pre-FSR when unicode objects were UTF-16, but I'm not sure how it would
> > work now.
> 
> That would be pre-FSR *with a Narrow build*, which was the default on
> Windows but not everywhere. But I don't know or use ctypes, so an
> answer to your actual question will have to come from someone else.

I did a couple tests - it works as well as can be expected for reading,
but completely breaks for writing (due to sequence size checks not
matching)


#53885

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2013-09-09 13:03 -0600
Message-ID	<mailman.190.1378753432.5461.python-list@python.org>
In reply to	#53791


On Sep 9, 2013 12:36 PM, <random832@fastmail.us> wrote:
>
> On Fri, Sep 6, 2013, at 13:04, Chris Angelico wrote:
> > On Sat, Sep 7, 2013 at 2:59 AM,  <random832@fastmail.us> wrote:
> > > Incidentally, how does all this interact with ctypes unicode_buffers,
> > > which slice as strings and must be UTF-16 on windows? This was fine
> > > pre-FSR when unicode objects were UTF-16, but I'm not sure how it would
> > > work now.
> >
> > That would be pre-FSR *with a Narrow build*, which was the default on
> > Windows but not everywhere. But I don't know or use ctypes, so an
> > answer to your actual question will have to come from someone else.
>
> I did a couple tests - it works as well as can be expected for reading,
> but completely breaks for writing (due to sequence size checks not
> matching)

Do you mean that it breaks when overwriting Python string object buffers,
or when overwriting arbitrary C strings either received from C code or
created with create_unicode_buffer?

If the former, I think that is to be expected since ctypes ultimately can't
know what is the actual type of the pointer it was handed -- much as in C,
that's up to the programmer to get right. I also think it's very bad
practice to be overwriting those anyway, since Python strings are supposed
to be immutable.

If the latter, that sounds like a bug in ctypes to me.


#53886

From	random832@fastmail.us
Date	2013-09-09 15:27 -0400
Message-ID	<mailman.191.1378754893.5461.python-list@python.org>
In reply to	#53791

On Mon, Sep 9, 2013, at 15:03, Ian Kelly wrote:
> Do you mean that it breaks when overwriting Python string object buffers,
> or when overwriting arbitrary C strings either received from C code or
> created with create_unicode_buffer?
> 
> If the former, I think that is to be expected since ctypes ultimately can't
> know what is the actual type of the pointer it was handed -- much as in C,
> that's up to the programmer to get right. I also think it's very bad
> practice to be overwriting those anyway, since Python strings are supposed
> to be immutable.
> 
> If the latter, that sounds like a bug in ctypes to me.

I was talking about writing to the buffer object from python, i.e. with
slice assignment.
>>> from ctypes import create_unicode_buffer
>>> s = 'Test \U00010000'
>>> len(s)
6
>>> buf = create_unicode_buffer(32)
>>> buf[:6] = s
TypeError: one character unicode string expected
>>> buf[:7] = s
ValueError: Can only assign sequence of same size
>>> buf[:7] = 'Test \ud800\udc00'
>>> buf[:7]
'Test \U00010000' # len = 6

Assigning with .value works, however, which may be a viable workaround
for most situations. The "one character unicode string expected" message
is a bit cryptic.
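
The `.value` workaround as a runnable sketch. The slice-assignment failures shown above are specific to platforms where `wchar_t` is 16 bits, such as Windows; assigning through `.value` is expected to round-trip the string regardless of the `wchar_t` width.

```python
from ctypes import create_unicode_buffer

s = "Test \U00010000"
buf = create_unicode_buffer(32)   # room for 32 wchar_t units
buf.value = s                     # ctypes converts to wchar_t units itself
round_tripped = buf.value
print(round_tripped == s)
```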


#53996

From	Serhiy Storchaka <storchaka@gmail.com>
Date	2013-09-12 00:11 +0300
Message-ID	<mailman.267.1378933905.5461.python-list@python.org>
In reply to	#53791

09.09.13 22:27, random832@fastmail.us wrote:
> On Mon, Sep 9, 2013, at 15:03, Ian Kelly wrote:
>> Do you mean that it breaks when overwriting Python string object buffers,
>> or when overwriting arbitrary C strings either received from C code or
>> created with create_unicode_buffer?
>>
>> If the former, I think that is to be expected since ctypes ultimately can't
>> know what is the actual type of the pointer it was handed -- much as in C,
>> that's up to the programmer to get right. I also think it's very bad
>> practice to be overwriting those anyway, since Python strings are supposed
>> to be immutable.
>>
>> If the latter, that sounds like a bug in ctypes to me.
>
> I was talking about writing to the buffer object from python, i.e. with
> slice assignment.
>>>> s = 'Test \U00010000'
>>>> len(s)
> 6
>>>> buf = create_unicode_buffer(32)
>>>> buf[:6] = s
> TypeError: one character unicode string expected
>>>> buf[:7] = s
> ValueError: Can only assign sequence of same size
>>>> buf[:7] = 'Test \ud800\udc00'
>>>> buf[:7]
> 'Test \U00010000' # len = 6
>
> Assigning with .value works, however, which may be a viable workaround
> for most situations. The "one character unicode string expected" message
> is a bit cryptic.

Please report a bug on http://bugs.python.org/.

