Groups > comp.lang.python > #47866 > unrolled thread
| Started by | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| First post | 2013-06-13 00:13 +0000 |
| Last post | 2013-06-20 19:08 +0200 |
| Articles | 20 on this page of 90 — 31 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 00:13 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:09 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 07:11 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 10:42 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 17:58 +1000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 11:08 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 18:20 +1000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 12:41 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 11:49 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 17:19 +0300
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 11:00 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 09:59 +0300
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:14 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 16:58 +0300
Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 11:21 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 18:26 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:03 +1000
Re: A few questiosn about encoding Walter Hurry <walterhurry@lavabit.com> - 2013-06-14 23:32 +0000
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:26 +1000
Re: A few questiosn about encoding Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:34 +0000
Re: A few questiosn about encoding Grant Edwards <invalid@invalid.invalid> - 2013-06-15 14:44 +0000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 17:49 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-15 15:30 +0000
Re: A few questiosn about encoding Roy Smith <roy@panix.com> - 2013-06-15 10:59 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 18:14 +0300
Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-15 11:35 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 22:26 +0300
Re: A few questiosn about encoding Benjamin Schollnick <benjamin@schollnick.net> - 2013-06-15 16:35 -0400
Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-16 15:45 +0200
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 09:36 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:49 +0300
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 10:22 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 11:37 +0300
Don't feed the troll... (was: Re: A few questiosn about encoding) Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 11:06 +0200
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 12:32 +0300
Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 13:09 +0200
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:36 +0300
Re: Don't feed the troll... Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 08:44 -0400
Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:25 +0200
Re: Don't feed the troll... Neil Cerutti <neilc@norwich.edu> - 2013-06-14 15:54 +0000
Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 12:15 +0200
Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-14 18:50 -0400
Re: Don't feed the troll... Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:31 +0000
Re: Don't feed the troll... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-15 13:04 -0400
Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-17 16:15 -0400
Re: Don't feed the troll... Chris Angelico <rosuav@gmail.com> - 2013-06-18 07:46 +1000
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:19 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:41 +0300
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 11:20 +0100
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) rusi <rustompmody@gmail.com> - 2013-06-14 04:51 -0700
Re: Don't feed the help-vampire rusi <rustompmody@gmail.com> - 2013-06-14 05:09 -0700
Re: Don't feed the help-vampire Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:31 +0200
Re: Don't feed the help-vampire Ian Kelly <ian.g.kelly@gmail.com> - 2013-06-14 10:51 -0600
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:50 +0300
Re: Don't feed the troll... Zero Piraeus <schesis@gmail.com> - 2013-06-14 09:33 -0400
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:45 +0300
Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:58 +0200
Re: Don't feed the troll... Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 14:25 +0100
Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 17:12 +0100
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 12:50 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:59 +0300
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:52 +0200
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:28 +1000
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-17 08:49 +0200
Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 12:57 +0100
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:13 -0400
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:31 +1000
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Grant Edwards <invalid@invalid.invalid> - 2013-06-14 19:40 +0000
Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:56 -0400
Re: Don't feed the troll Tim Chase <python.list@tim.thechases.com> - 2013-06-14 14:00 -0500
Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 15:17 -0400
Re: Don't feed the troll... Ben Finney <ben+python@benfinney.id.au> - 2013-06-15 10:42 +1000
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-19 18:46 -0700
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-20 06:26 +0000
Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 12:43 +0100
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-20 09:27 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 02:37 +1000
Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 18:17 +0100
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-23 08:51 -0700
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-23 16:30 +0000
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-25 13:16 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 03:21 +1000
Re: A few questiosn about encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-20 20:43 +0100
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 06:40 -0700
Re: A few questiosn about encoding Andrew Berg <robotsondrugs@gmail.com> - 2013-06-20 09:04 -0500
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 08:12 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:26 +1000
Re: A few questiosn about encoding Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-06-20 20:25 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:28 +1000
Re: A few questiosn about encoding Andreas Perstinger <andipersti@gmail.com> - 2013-06-20 19:08 +0200
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-13 00:13 +0000 |
| Subject | Re: A few questiosn about encoding |
| Message-ID | <51b90ead$0$29997$c3e8da3$5496439d@news.astraweb.com> |
On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
> So, how many bytes does UTF-8 store for codepoints > 127 ?
Two, three or four, depending on the codepoint.
> example for codepoint 256, 1345, 16474 ?
You can do this yourself. I have already given you enough information in
previous emails to answer this question on your own, but here it is again:
Open an interactive Python session, and run this code:
c = ord(16474)
len(c.encode('utf-8'))
That will tell you how many bytes are used for that example.
--
Steven
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 09:09 +0300 |
| Message-ID | <kpbnmg$qvk$2@news.ntua.gr> |
| In reply to | #47866 |
On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 14:23:49 +0300, Νικόλαος Κούρας wrote:
>
>> So, how many bytes does UTF-8 store for codepoints > 127 ?
>
> Two, three or four, depending on the codepoint.
The number of bytes UTF-8 needs to store a code point (character)
depends on the ordinal value of the code point in the Unicode charset,
correct?
If this is correct, then the higher the ordinal value (which is a decimal
integer) in the Unicode charset, the more bytes are needed for storage.
It's like how a bigger decimal integer produces a bigger binary
number.
Is this correct?
>> example for codepoint 256, 1345, 16474 ?
>
> You can do this yourself. I have already given you enough information in
> previous emails to answer this question on your own, but here it is again:
>
> Open an interactive Python session, and run this code:
>
> c = ord(16474)
> len(c.encode('utf-8'))
>
>
> That will tell you how many bytes are used for that example.
This is actually wrong.
ord()'s argument must be a character whose ordinal value we want.
>>> chr(16474)
'䁚'
Some Chinese symbol.
So code-point '䁚' has a Unicode ordinal value of 16474, correct?
where in after encoding this glyph's ordinal value to binary gives us
the following bytes:
>>> bin(16474).encode('utf-8')
b'0b100000001011010'
Now, we take two symbols out:
the 'b' prefix, which is there to tell us that we are looking at a bytes
object, as well as the
'0b' prefix, which is there to tell us that we are looking at a binary
representation of a bytes object.
Thus, we count 15 bits left.
So it says 15 bits, which is 1 bit less than 2 bytes.
Are the above statements correct, please?
but thinking this through more and more:
>>> chr(16474).encode('utf-8')
b'\xe4\x81\x9a'
>>> len(b'\xe4\x81\x9a')
3
it seems that the bytestring the encode process produces is of length 3.
So i take it is 3 bytes?
But there is a mismatch between what bin(16474).encode('utf-8') and
chr(16474).encode('utf-8') are telling us here.
Care to explain that too, please?
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-13 07:11 +0000 |
| Message-ID | <51b9708b$0$29872$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47902 |
On Thu, 13 Jun 2013 09:09:19 +0300, Νικόλαος Κούρας wrote:
> On 13/6/2013 3:13 πμ, Steven D'Aprano wrote:
>> Open an interactive Python session, and run this code:
>>
>> c = ord(16474)
>> len(c.encode('utf-8'))
>>
>>
>> That will tell you how many bytes are used for that example.
> This is actually wrong.
>
> ord()'s arguments must be a character for which we expect its ordinal
> value.
Gah!
That's twice I've screwed that up. Sorry about that!
> >>> chr(16474)
> '䁚'
>
> Some Chinese symbol.
> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
Correct.
> where in after encoding this glyph's ordinal value to binary gives us
> the following bytes:
>
> >>> bin(16474).encode('utf-8')
> b'0b100000001011010'
No! That creates a string from 16474 in base two:
'0b100000001011010'
The leading 0b is just syntax to tell you "this is base 2, not base 8
(0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
Then you encode the string '0b100000001011010' into UTF-8. There are 17
characters in this string, and they are all ASCII characters so they take
up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form). In
hex form, they are:
b'\x30\x62\x31\x30\x30\x30\x30\x30\x30\x30\x31\x30\x31\x31\x30\x31\x30'
which takes up a lot more room, which is why Python prefers to show ASCII
characters as characters rather than as hex.
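The point above can be checked directly. What follows is a minimal sketch, assuming Python 3; the variable names are illustrative and do not come from the thread.

```python
# bin() returns a str, not a number.
text = bin(16474)
print(type(text).__name__, text)      # str 0b100000001011010

# Encoding that str yields one byte per ASCII character, prefix included.
encoded = text.encode('utf-8')
print(len(text), len(encoded))        # 17 17

# Each byte is the ASCII code of a character, e.g. '0' is 0x30.
print([hex(b) for b in encoded[:3]])  # ['0x30', '0x62', '0x31']
```

The 17 bytes are the codes of the characters '0', 'b', '1', '0', and so on; none of them is a single bit of the number 16474.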
What you want is:
chr(16474).encode('utf-8')
[...]
> Thus, there we count 15 bits left.
> So it says 15 bits, which is 1-bit less that 2 bytes. Is the above
> statements correct please?
No. There are 17 BYTES there. The string "0" doesn't get turned into a
single bit. It still takes up a full byte, 0x30, which is 8 bits.
> but thinking this through more and more:
>
> >>> chr(16474).encode('utf-8')
> b'\xe4\x81\x9a'
> >>> len(b'\xe4\x81\x9a')
> 3
>
> it seems that the bytestring the encode process produces is of length 3.
Correct! Now you have got the right idea.
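For the three code points asked about at the start of the thread, the same approach gives the byte counts. This is a minimal sketch, assuming Python 3; the loop and the name utf8_lengths are illustrative only.

```python
# UTF-8 length depends on the range the code point falls in:
# up to 0x7F -> 1 byte, up to 0x7FF -> 2, up to 0xFFFF -> 3, above -> 4.
utf8_lengths = {}
for codepoint in (256, 1345, 16474):
    encoded = chr(codepoint).encode('utf-8')
    utf8_lengths[codepoint] = len(encoded)
    print(codepoint, hex(codepoint), len(encoded), encoded)
```

256 and 1345 both fall below 0x800 and so take two bytes, while 16474 (0x405A) takes three.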
--
Steven
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 10:42 +0300 |
| Message-ID | <kpbt5i$7bj$1@news.ntua.gr> |
| In reply to | #47912 |
On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>> >>> chr(16474)
>> '䁚'
>>
>> Some Chinese symbol.
>> So code-point '䁚' has a Unicode ordinal value of 16474, correct?
>
> Correct.
>
>
>> where in after encoding this glyph's ordinal value to binary gives us
>> the following bytes:
>>
>> >>> bin(16474).encode('utf-8')
>> b'0b100000001011010'
An observation here that I would like you to confirm as valid.
1. A code-point and the code-point's ordinal value are associated into a
Unicode charset. They have the so called 1:1 mapping.
So, i was under the impression that by encoding the code-point into
utf-8 was the same as encoding the code-point's ordinal value into utf-8.
That is why i tried to:
bin(16474).encode('utf-8') instead of chr(16474).encode('utf-8')
So, now i believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its
ordinal value.
> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
But byte objects are represented as '\x' instead of the aforementioned
'0x'. Why is that?
> No! That creates a string from 16474 in base two:
> '0b100000001011010'
I disagree here.
16474 is a number in base 10. Doing bin(16474) we get the binary
representation of number 16474 and not a string.
Why do you say we receive a string while Python presents a binary number?
> Then you encode the string '0b100000001011010' into UTF-8. There are 17
> characters in this string, and they are all ASCII characters so they take
> up 1 byte each, giving you bytes b'0b100000001011010' (in ASCII form).
To me, 0b100000001011010 stands for a number in base 2, not a string.
Have I understood something wrong?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-13 17:58 +1000 |
| Message-ID | <mailman.3172.1371110288.3114.python-list@python.org> |
| In reply to | #47917 |
On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>> No! That creates a string from 16474 in base two:
>> '0b100000001011010'
>
> I disagree here.
> 16474 is a number in base 10. Doing bin(16474) we get the binary
> representation of number 16474 and not a string.
> Why you say we receive a string while python presents a binary number?
You can disagree all you like. Steven cited a simple point of fact,
one which can be verified in any Python interpreter. Nikos, you are
flat wrong here; bin(16474) creates a string.
ChrisA
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 11:08 +0300 |
| Message-ID | <kpbul5$7bj$2@news.ntua.gr> |
| In reply to | #47919 |
On 13/6/2013 10:58 πμ, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
>> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>>> No! That creates a string from 16474 in base two:
>>> '0b100000001011010'
>>
>> I disagree here.
>> 16474 is a number in base 10. Doing bin(16474) we get the binary
>> representation of number 16474 and not a string.
>> Why you say we receive a string while python presents a binary number?
>
> You can disagree all you like. Steven cited a simple point of fact,
> one which can be verified in any Python interpreter. Nikos, you are
> flat wrong here; bin(16474) creates a string.
Indeed, Python wrapped it in single quotes as '0b100000001011010' and not as
0b100000001011010, which in fact makes it a string.
But since bin(16474) seems to create a string rather than the expected
number (at least in my mind), how do we get the binary representation of
the number 16474 as a number?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-13 18:20 +1000 |
| Message-ID | <mailman.3173.1371111646.3114.python-list@python.org> |
| In reply to | #47921 |
On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
> On 13/6/2013 10:58 πμ, Chris Angelico wrote:
>>
>> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <support@superhost.gr>
>> wrote:
>>
>>> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>>>>
>>>> No! That creates a string from 16474 in base two:
>>>> '0b100000001011010'
>>>
>>>
>>> I disagree here.
>>> 16474 is a number in base 10. Doing bin(16474) we get the binary
>>> representation of number 16474 and not a string.
>>> Why you say we receive a string while python presents a binary number?
>>
>>
>> You can disagree all you like. Steven cited a simple point of fact,
>> one which can be verified in any Python interpreter. Nikos, you are
>> flat wrong here; bin(16474) creates a string.
>
>
> Indeed python embraced it in single quoting '0b100000001011010' and not as
> 0b100000001011010 which in fact makes it a string.
>
> But since bin(16474) seems to create a string rather than an expected
> number(at leat into my mind) then how do we get the binary representation of
> the number 16474 as a number?
In Python 2:
>>> 16474
In Python 3, you have to fiddle around with ctypes, but broadly
speaking, the same thing.
ChrisA
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 12:41 +0300 |
| Message-ID | <kpc44m$1mk8$1@news.ntua.gr> |
| In reply to | #47923 |
On 13/6/2013 11:20 πμ, Chris Angelico wrote:
> On Thu, Jun 13, 2013 at 6:08 PM, Νικόλαος Κούρας <support@superhost.gr> wrote:
>> On 13/6/2013 10:58 πμ, Chris Angelico wrote:
>>>
>>> On Thu, Jun 13, 2013 at 5:42 PM, Νικόλαος Κούρας <support@superhost.gr>
>>> wrote:
>>>
>>>> On 13/6/2013 10:11 πμ, Steven D'Aprano wrote:
>>>>>
>>>>> No! That creates a string from 16474 in base two:
>>>>> '0b100000001011010'
>>>>
>>>>
>>>> I disagree here.
>>>> 16474 is a number in base 10. Doing bin(16474) we get the binary
>>>> representation of number 16474 and not a string.
>>>> Why you say we receive a string while python presents a binary number?
>>>
>>>
>>> You can disagree all you like. Steven cited a simple point of fact,
>>> one which can be verified in any Python interpreter. Nikos, you are
>>> flat wrong here; bin(16474) creates a string.
>>
>>
>> Indeed python embraced it in single quoting '0b100000001011010' and not as
>> 0b100000001011010 which in fact makes it a string.
>>
>> But since bin(16474) seems to create a string rather than an expected
>> number(at leat into my mind) then how do we get the binary representation of
>> the number 16474 as a number?
>
> In Python 2:
> >>> 16474
Typing 16474 in an interactive session, in both Python 2 and 3, gives back
the number 16474,
while we want the binary representation of the number 16474.
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2013-06-13 11:49 +0000 |
| Message-ID | <51b9b1d7$0$29997$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #47928 |
On Thu, 13 Jun 2013 12:41:41 +0300, Νικόλαος Κούρας wrote:
>> In Python 2:
>> >>> 16474
> typing 16474 in interactive session both in python 2 and 3 gives back
> the number 16474
>
> while we want the the binary representation of the number 16474
Python does not work that way. Ints *always* display in decimal.
Regardless of whether you enter the decimal in binary:
py> 0b100000001011010
16474
octal:
py> 0o40132
16474
or hexadecimal:
py> 0x405A
16474
ints always display in decimal. The only way to display in another base
is to build a string showing what the int would look like in a different
base:
py> hex(16474)
'0x405a'
Notice that the return value of bin, oct and hex are all strings. If they
were ints, then they would display in decimal, defeating the purpose!
--
Steven
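A short sketch, assuming Python 3, of the distinction drawn above: the base only matters when a number is written down or displayed, not for the int itself. The variable name n is illustrative.

```python
# The same int, written as literals in four bases.
n = 16474
print(n == 0b100000001011010 == 0o40132 == 0x405A)  # True

# bin(), oct(), hex() and format() build strings in another base.
print(bin(n), oct(n), hex(n))   # 0b100000001011010 0o40132 0x405a
print(format(n, 'b'))           # 100000001011010 (no prefix)

# int() parses such a string back into the number.
print(int(bin(n), 2))           # 16474
```

Using format(n, 'b') rather than bin(n) is simply a way to get the digits without the '0b' prefix; either way the result is a str.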
| From | Νικόλαος Κούρας <support@superhost.gr> |
|---|---|
| Date | 2013-06-13 17:19 +0300 |
| Message-ID | <kpcke3$28k2$1@news.ntua.gr> |
| In reply to | #47939 |
On 13/6/2013 2:49 μμ, Steven D'Aprano wrote:
Please confirm that these are true statements:
A code-point and the code-point's ordinal value are associated into a
Unicode charset. They have the so called 1:1 mapping.
So, i was under the impression that by encoding the code-point into
utf-8 was the same as encoding the code-point's ordinal value into utf-8.
So, now i believe they are two different things.
The code-point *is what actually* needs to be encoded and *not* its
ordinal value.
> The leading 0b is just syntax to tell you "this is base 2, not base 8
> (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
But byte objects are represented as '\x' instead of the aforementioned
'0x'. Why is that?
> ints always display in decimal. The only way to display in another base
> is to build a string showing what the int would look like in a different
> base:
>
> py> hex(16474)
> '0x405a'
>
> Notice that the return value of bin, oct and hex are all strings. If they
> were ints, then they would display in decimal, defeating the purpose!
Thank you, I didn't know that! Indeed it works like this.
To encode a number we have to turn it into a string first.
"16474".encode('utf-8')
b'16474'
That 'b' stands for bytes.
How can I view this bytes object's representation as hex() or as bin()?
============
Also:
>>> len('0b100000001011010')
17
You said this string consists of 17 chars.
Why does the leading '0b' syntax count as bits as well? Shouldn't it be 15
bits instead of 17?
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-14 11:00 +1000 |
| Message-ID | <mailman.3242.1371171652.3114.python-list@python.org> |
| In reply to | #47967 |
On 13Jun2013 17:19, Nikos as SuperHost Support <support@superhost.gr> wrote:
| A code-point and the code-point's ordinal value are associated into
| a Unicode charset. They have the so called 1:1 mapping.
|
| So, i was under the impression that by encoding the code-point into
| utf-8 was the same as encoding the code-point's ordinal value into
| utf-8.
|
| So, now i believe they are two different things.
| The code-point *is what actually* needs to be encoded and *not* its
| ordinal value.
Because there is a 1:1 mapping, these are the same thing: a code
point is directly _represented_ by the ordinal value, and the ordinal
value is encoded for storage as bytes.
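To make "the ordinal value is encoded for storage as bytes" concrete, here is a rough sketch of the UTF-8 bit layout applied by hand, assuming Python 3. The function name utf8_encode_codepoint is hypothetical, and the sketch skips validation that a real codec performs (for example, it does not reject surrogate code points).

```python
def utf8_encode_codepoint(cp):
    """Encode one code point (an int) to UTF-8 bytes by hand.

    Illustration only: no surrogate checks, unlike str.encode().
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode_codepoint(16474))                                # b'\xe4\x81\x9a'
print(utf8_encode_codepoint(16474) == chr(16474).encode('utf-8'))  # True
```

The input is the ordinal value and the output matches what encoding the one-character string produces, which is the sense in which the two are "the same thing".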
| > The leading 0b is just syntax to tell you "this is base 2, not base 8
| > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
|
| But byte objects are represented as '\x' instead of the
| aforementioned '0x'. Why is that?
You're confusing a "string representation of a single number in
some base (eg 2 or 16)" with the "string-ish representation of a
bytes object".
The former is just notation for writing a number in different bases, eg:
27 base 10
1b base 16
33 base 8
11011 base 2
A common convention, and the one used by hex(), oct() and bin() in
Python, is to prefix the non-base-10 representations with "0x" for
base 16, "0o" for base 8 ("o"ctal) and "0b" for base 2 ("b"inary):
27
0x1b
0o33
0b11011
This allows the human reader or a machine lexer to decide what base
the number is written in, and therefore to figure out what the
underlying numeric value is.
Conversely, consider the bytes object consisting of the values [97,
98, 99, 27, 10]. In ASCII (and UTF-8 and the iso-8859-x encodings)
these may all represent the characters ['a', 'b', 'c', ESC, NL].
So when "printing" a bytes object, which is a sequence of small integers representing
values stored in bytes, it is compact to print:
b'abc\x1b\n'
which is ['a', 'b', 'c', chr(27), newline].
The slosh (\) is the common convention in C-like languages and many
others for representing special characters not directly represented
by themselves. So "\\" for a slosh, "\n" for a newline and "\x1b"
for character 27 (ESC).
The bytes object is still just a sequence of integers, but because
it is very common to have those integers represent text, and very
common to have some text one wants represented as bytes in a direct
1:1 mapping, this compact text form is useful and readable. It is
also legal Python syntax for making a small bytes object.
To demonstrate that this is just a _representation_, run this:
>>> [ i for i in b'abc\x1b\n' ]
[97, 98, 99, 27, 10]
at an interactive Python 3 prompt. See? Just numbers.
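The same demonstration can be run in the other direction, building the bytes object from the integers. This is a small sketch, assuming Python 3, with illustrative variable names.

```python
# Build a bytes object from the integers mentioned above.
values = [97, 98, 99, 27, 10]
data = bytes(values)
print(data)                              # b'abc\x1b\n'

# Indexing or iterating gives the integers back.
print(list(data) == values)              # True
print(data[0], data[3])                  # 97 27

# Writing every byte as a \x escape produces the same object.
print(data == b'\x61\x62\x63\x1b\x0a')   # True
```

The 'a' and '\x61' spellings are two ways of writing the same byte; which one Python shows is purely a display choice.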
| To encode a number we have to turn it into a string first.
|
| "16474".encode('utf-8')
| b'16474'
|
| That 'b' stand for bytes.
Syntactic details. Read this:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
| How can i view this byte's object representation as hex() or as bin()?
See above. A bytes is a _sequence_ of values. hex() and bin() print
individual values in hexadecimal or binary respectively. You could
do this:
    for value in b'16474':
        print(value, hex(value), bin(value))
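If a single string for the whole bytes object is wanted instead of one line per byte, something along these lines may help. It is a sketch that assumes Python 3.5 or later (for bytes.hex()); the names hex_view and bin_view are illustrative.

```python
data = "16474".encode('utf-8')

# All bytes as one hexadecimal string.
hex_view = data.hex()
print(hex_view)            # 3136343734

# All bytes as zero-padded 8-bit binary groups.
bin_view = ' '.join(format(b, '08b') for b in data)
print(bin_view)
```

Each pair of hex digits is one byte: 0x31 is the character '1', 0x36 is '6', and so on.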
Cheers,
--
Cameron Simpson <cs@zip.com.au>
Uhlmann's Razor: When stupidity is a sufficient explanation, there is no need
to have recourse to any other.
- Michael M. Uhlmann, assistant attorney general
for legislation in the Ford Administration
| From | Nick the Gr33k <support@superhost.gr> |
|---|---|
| Date | 2013-06-14 09:59 +0300 |
| Message-ID | <kpef1e$p37$3@news.ntua.gr> |
| In reply to | #48043 |
On 14/6/2013 4:00 πμ, Cameron Simpson wrote:
> On 13Jun2013 17:19, Nikos as SuperHost Support <support@superhost.gr> wrote:
> | A code-point and the code-point's ordinal value are associated into
> | a Unicode charset. They have the so called 1:1 mapping.
> |
> | So, i was under the impression that by encoding the code-point into
> | utf-8 was the same as encoding the code-point's ordinal value into
> | utf-8.
> |
> | So, now i believe they are two different things.
> | The code-point *is what actually* needs to be encoded and *not* its
> | ordinal value.
>
> Because there is a 1:1 mapping, these are the same thing: a code
> point is directly _represented_ by the ordinal value, and the ordinal
> value is encoded for storage as bytes.
So, you are saying that:
chr(16474).encode('utf-8') #being the code-point encoded
ord(chr(16474)).encode('utf-8') #being the code-point's ordinal
encoded which gives an error.
That shows us that a character is what is being encoded to utf-8, but
the character's ordinal cannot be.
So, why do you say "....and the ordinal value is encoded for storage as
bytes." ?
> | > The leading 0b is just syntax to tell you "this is base 2, not base 8
> | > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
> |
> | But byte objects are represented as '\x' instead of the
> | aforementioned '0x'. Why is that?
>
> You're confusing a "string representation of a single number in
> some base (eg 2 or 16)" with the "string-ish representation of a
> bytes object".
>>> bin(16474)
'0b100000001011010'
that is a binary format string representation of number 16474, yes?
>>> hex(16474)
'0x405a'
that is a hexadecimal format string representation of number 16474, yes?
WHILE:
b'abc\x1b\n' = a string representation of a byte, which in turn is a
series of integers, so that makes this a string representation of
integers, is this correct?
\x1b = ESC character
\ = for separating bytes
x = to flag that the following bytes are going to be represented as hex
values? What exactly does 'x' mean here? Character, perhaps?
Still, it's not clear in my head what the difference between '0x1b' and
'\x1b' is.
I think:
0x1b = an integer represented in hex format
\x1b = a character represented in hex format
Is this true?
> | How can i view this byte's object representation as hex() or as bin()?
>
> See above. A bytes is a _sequence_ of values. hex() and bin() print
> individual values in hexadecimal or binary respectively.
>>> for value in b'\x97\x98\x99\x27\x10':
... print(value, hex(value), bin(value))
...
151 0x97 0b10010111
152 0x98 0b10011000
153 0x99 0b10011001
39 0x27 0b100111
16 0x10 0b10000
>>> for value in b'abc\x1b\n':
... print(value, hex(value), bin(value))
...
97 0x61 0b1100001
98 0x62 0b1100010
99 0x63 0b1100011
27 0x1b 0b11011
10 0xa 0b1010
Why do these two give different values when printed?
--
What is now proved was at first only imagined!
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-14 20:14 +1000 |
| Message-ID | <mailman.3292.1371206432.3114.python-list@python.org> |
| In reply to | #48070 |
On 14Jun2013 09:59, Nikos as SuperHost Support <support@superhost.gr> wrote:
| On 14/6/2013 4:00 πμ, Cameron Simpson wrote:
| >On 13Jun2013 17:19, Nikos as SuperHost Support <support@superhost.gr> wrote:
| >| A code-point and the code-point's ordinal value are associated into
| >| a Unicode charset. They have the so called 1:1 mapping.
| >|
| >| So, i was under the impression that by encoding the code-point into
| >| utf-8 was the same as encoding the code-point's ordinal value into
| >| utf-8.
| >|
| >| So, now i believe they are two different things.
| >| The code-point *is what actually* needs to be encoded and *not* its
| >| ordinal value.
| >
| >Because there is a 1:1 mapping, these are the same thing: a code
| >point is directly _represented_ by the ordinal value, and the ordinal
| >value is encoded for storage as bytes.
|
| So, you are saying that:
|
| chr(16474).encode('utf-8') #being the code-point encoded
|
| ord(chr(16474)).encode('utf-8') #being the code-point's ordinal
| encoded which gives an error.
|
| that shows us that a character is what is being encoded to utf-8
| but the character's ordinal cannot be.
|
| So, why do you say "....and the ordinal value is encoded for storage
| as bytes." ?
No, I mean conceptually, there is no difference between a codepoint
and its ordinal value. They are the same thing.
Inside Python itself, a character is a string of length 1 (there is
no separate character type), and a string is a distinct type from an
integer. Internally, the characters in a string are stored numerically,
as Unicode codepoints, i.e. as their ordinal values.
It is a meaningful idea to store a Python string encoded into bytes
using some text encoding scheme (utf-8, iso-8859-7, what have you).
It is not a meaningful thing to store a number "encoded" without
some more context. The .encode() method that accepts an encoding
name like "utf-8" is specifically an encoding procedure FOR TEXT.
So strings have such a method, and integers do not.
When you write:
chr(16474)
you receive a _string_, containing the single character whose ordinal
is 16474. It is meaningful to transcribe this string to bytes using
a text encoding procedure like 'utf-8'.
When you write:
ord(chr(16474))
you get an integer. Because ord() is the reverse of chr(), you get
the integer 16474.
Integers do not have .encode() methods that accept a _text_ encoding
name like 'utf-8' because integers are not text.
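A minimal sketch of that distinction, assuming Python 3. The variable
names are only illustrative, and int.to_bytes() is shown as one way to
store a number as bytes when you supply the missing context (a length and
a byte order); it is not a text encoding:

```python
# A str has .encode(); an int does not.
text = chr(16474)                # a one-character string
encoded = text.encode('utf-8')   # text encoding: str -> bytes

number = ord(text)               # back to the integer 16474
has_encode = hasattr(number, 'encode')   # integers are not text

# Storing an integer as bytes needs extra context instead of a text
# encoding name: how many bytes, and in which byte order.
raw = number.to_bytes(2, 'big')

print(encoded)      # b'\xe4\x81\x9a'
print(has_encode)   # False
print(raw)          # b'@Z'  (0x40 is '@', 0x5a is 'Z')
```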
| >| > The leading 0b is just syntax to tell you "this is base 2, not base 8
| >| > (0o) or base 10 or base 16 (0x)". Also, leading zero bits are dropped.
| >|
| >| But byte objects are represented as '\x' instead of the
| >| aforementioned '0x'. Why is that?
| >
| >You're confusing a "string representation of a single number in
| >some base (eg 2 or 16)" with the "string-ish representation of a
| >bytes object".
|
| >>> bin(16474)
| '0b100000001011010'
| that is a binary format string representation of number 16474, yes?
Yes.
| >>> hex(16474)
| '0x405a'
| that is a hexadecimal format string representation of number 16474, yes?
Yes.
| WHILE:
| b'abc\x1b\n' = a string representation of a byte, which in turn is a
| series of integers, so that makes this a string representation of
| integers, is this correct?
A "bytes" Python object. So not "a byte", 5 bytes.
It is a string representation of the series of byte values,
ON THE PREMISE that the bytes may well represent text.
On that basis, b'abc\x1b\n' is a reasonable way to display them.
In other contexts this might not be a sensible way to display these
bytes, and then another format would be chosen, possibly hand
constructed by the programmer, or equally reasonable, the hexlify()
function from the binascii module.
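For instance, a small sketch of those alternative displays, assuming
Python 3. The bytes.hex() method is an addition of mine and needs
Python 3.5 or later; hexlify() works on older versions too:

```python
import binascii

data = b'abc\x1b\n'

as_hexlify = binascii.hexlify(data)   # a bytes object of hex digits
as_hex_str = data.hex()               # the same digits as a str (3.5+)
as_values = list(data)                # the byte values as plain integers

print(as_hexlify)   # b'6162631b0a'
print(as_hex_str)   # 6162631b0a
print(as_values)    # [97, 98, 99, 27, 10]
```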
| \x1b = ESC character
Considering the bytes to be representing characters, then yes.
| \ = for separating bytes
No, \ to introduce a sequence of characters with special meaning.
Normally a character in a b'...' item represents the byte value
matching the character's Unicode ordinal value. But several characters
are hard or confusing to place literally in a b'...' string. For
example a newline character or an escape character.
'a' means 97.
'\n' means 10 (newline, hence the 'n').
'\x1b' means 27 (escape, value 0x1b in hexadecimal).
And, of course, '\\' means a literal slosh (backslash), value 92.
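These values are easy to check yourself. A quick sketch, assuming
Python 3, where indexing a bytes object yields an integer (the name
"literals" is just for illustration):

```python
literals = (b'a', b'\n', b'\x1b', b'\\')

for literal in literals:
    # Each literal holds exactly one byte, however many characters it
    # took to write it down in source code.
    print(literal, len(literal), literal[0], hex(literal[0]))

byte_values = [literal[0] for literal in literals]
print(byte_values)   # [97, 10, 27, 92]
```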
| x = to flag that the following bytes are going to be represented as
| hex values? What exactly does 'x' mean here? A character perhaps?
A slosh followed by an 'x' means there will be 2 hexadecimal digits
to follow, and those two digits represent the byte value.
So, yes.
| Still its not clear into my head what the difference of '0x1b' and
| '\x1b' is:
They're the same thing in two similar but slightly different formats.
0x1b is a legitimate "bare" integer value in Python.
\x1b is a sequence you find inside strings (and "byte" strings, the
b'...' format).
| i think:
| 0x1b = an integer represented in hex format
Yes.
| \x1b = a character represented in hex format
Yes.
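Side by side, as a rough sketch assuming Python 3 (n, s and b are
throwaway names):

```python
n = 0x1b       # an integer literal written in hexadecimal
s = '\x1b'     # a one-character string written with a hex escape
b = b'\x1b'    # a one-byte bytes object written with the same escape

print(n, type(n).__name__)        # 27 int
print(ord(s), type(s).__name__)   # 27 str
print(b[0], type(b).__name__)     # 27 bytes

# The same number underneath, reached from three different types.
same_char = chr(n) == s
same_byte = bytes([n]) == b
print(same_char, same_byte)       # True True
```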
| >| How can i view this byte's object representation as hex() or as bin()?
| >
| >See above. A bytes is a _sequence_ of values. hex() and bin() print
| >individual values in hexadecimal or binary respectively.
|
| >>> for value in b'\x97\x98\x99\x27\x10':
| ... print(value, hex(value), bin(value))
| ...
| 151 0x97 0b10010111
| 152 0x98 0b10011000
| 153 0x99 0b10011001
| 39 0x27 0b100111
| 16 0x10 0b10000
|
|
| >>> for value in b'abc\x1b\n':
| ... print(value, hex(value), bin(value))
| ...
| 97 0x61 0b1100001
| 98 0x62 0b1100010
| 99 0x63 0b1100011
| 27 0x1b 0b11011
| 10 0xa 0b1010
|
|
| Why these two give different values when printed?
97 is in base 10 (9*10+7=97), but the notation '\x97' is base 16, so 9*16+7=151.
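The same arithmetic as a short sketch, assuming Python 3; int() with an
explicit base is used here only to make the two readings of the digits
"97" visible:

```python
digits = '97'

as_decimal = int(digits, 10)   # 9*10 + 7
as_hex = int(digits, 16)       # 9*16 + 7

print(as_decimal, as_hex)      # 97 151
print(b'\x97'[0])              # 151: the escape is read as hexadecimal
print(b'a'[0])                 # 97: shown in decimal by print()
```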
Cheers,
--
Cameron Simpson <cs@zip.com.au>
I'm Bubba of Borg. Y'all fixin' to be assimilated.
| From | Nick the Gr33k <support@superhost.gr> |
|---|---|
| Date | 2013-06-14 16:58 +0300 |
| Message-ID | <kpf7hr$spl$16@news.ntua.gr> |
| In reply to | #48113 |
On 14/6/2013 1:14 PM, Cameron Simpson wrote:
> Normally a character in a b'...' item represents the byte value
> matching the character's Unicode ordinal value.
The only thing that i didn't understand is this line.
First please tell me what is a byte value
> \x1b is a sequence you find inside strings (and "byte" strings, the
> b'...' format).
\x1b is a character(ESC) represented in hex format
b'\x1b' is a byte object that represents what?
>>> chr(27).encode('utf-8')
b'\x1b'
>>> b'\x1b'.decode('utf-8')
'\x1b'
After decoding it gives the char ESC in hex format
Shouldn't it result in value 27 which is the ordinal of ESC ?
> No, I mean conceptually, there is no difference between a code-point
> and its ordinal value. They are the same thing.
Why Unicode charset doesn't just contain characters, but instead it
contains a mapping of (characters <--> ordinals) ?
I mean what we do is to encode a character like chr(65).encode('utf-8')
What's the reason of existence of its corresponding ordinal value since
it doesn't get involved into the encoding process?
Thank you very much for taking the time to explain.
--
What is now proved was at first only imagined!
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2013-06-14 11:21 -0400 |
| Message-ID | <mailman.3312.1371223308.3114.python-list@python.org> |
| In reply to | #48150 |
let's cut to the chase and start with telling us what you DO know Nick.
That would take less typing
On Fri, Jun 14, 2013 at 9:58 AM, Nick the Gr33k <support@superhost.gr> wrote:
> On 14/6/2013 1:14 PM, Cameron Simpson wrote:
>
>> Normally a character in a b'...' item represents the byte value
>> matching the character's Unicode ordinal value.
>>
>
> The only thing that i didn't understand is this line.
> First please tell me what is a byte value
>
>
> \x1b is a sequence you find inside strings (and "byte" strings, the
>> b'...' format).
>>
>
> \x1b is a character(ESC) represented in hex format
>
> b'\x1b' is a byte object that represents what?
>
>
> >>> chr(27).encode('utf-8')
> b'\x1b'
>
> >>> b'\x1b'.decode('utf-8')
> '\x1b'
>
> After decoding it gives the char ESC in hex format
> Shouldn't it result in value 27 which is the ordinal of ESC ?
>
> > No, I mean conceptually, there is no difference between a code-point
>
> > and its ordinal value. They are the same thing.
>
> Why Unicode charset doesn't just contain characters, but instead it
> contains a mapping of (characters <--> ordinals) ?
>
> I mean what we do is to encode a character like chr(65).encode('utf-8')
>
> What's the reason of existence of its corresponding ordinal value since it
> doesn't get involved into the encoding process?
>
> Thank you very much for taking the time to explain.
>
> --
> What is now proved was at first only imagined!
> --
> http://mail.python.org/mailman/listinfo/python-list
>
--
Joel Goldstick
http://joelgoldstick.com
| From | Nick the Gr33k <support@superhost.gr> |
|---|---|
| Date | 2013-06-14 18:26 +0300 |
| Message-ID | <kpfcmf$1sl2$1@news.ntua.gr> |
| In reply to | #48163 |
On 14/6/2013 6:21 PM, Joel Goldstick wrote:
> let's cut to the chase and start with telling us what you DO know Nick.
> That would take less typing
Well, my biggest successes up until now were to build 3 websites utilizing
database saves and retrievals
in PHP
in Perl
and later in Python
with absolute ignorance of
Apache Configuration:
CGI:
Linux:
with just basic knowledge of linux.
I'm very proud of it.
--
What is now proved was at first only imagined!
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-15 03:03 +1000 |
| Message-ID | <mailman.3319.1371229390.3114.python-list@python.org> |
| In reply to | #48164 |
On Sat, Jun 15, 2013 at 1:26 AM, Nick the Gr33k <support@superhost.gr> wrote:
> Well, my biggest successes up until now were to build 3 websites utilizing
> database saves and retrievals
>
> in PHP
> in Perl
> and later in Python
>
> with absolute ignorance of
>
> Apache Configuration:
> CGI:
> Linux:
>
> with just basic knowledge of linux.
> I'm very proud of it.
Translation: "I just built a car. I don't know anything about internal
combustion engines or road rules or metalwork, and I'm very proud of
the monstrosity that I'm now selling to my friends."
Would you buy a car built by someone who proudly announces that he has
no clue how to build one? Why do you sell web hosting services when you
have no clue how to provide them?
ChrisA
| From | Walter Hurry <walterhurry@lavabit.com> |
|---|---|
| Date | 2013-06-14 23:32 +0000 |
| Message-ID | <kpg97a$mq7$1@news.albasani.net> |
| In reply to | #48185 |
On Sat, 15 Jun 2013 03:03:02 +1000, Chris Angelico wrote:
> Why do you sell web hosting services when you
> have no clue how to provide them?
>
And why do you continue responding to this timewaster?
Please, please just killfile him and let's all move on.
| From | Cameron Simpson <cs@zip.com.au> |
|---|---|
| Date | 2013-06-15 10:26 +1000 |
| Message-ID | <mailman.3348.1371256019.3114.python-list@python.org> |
| In reply to | #48150 |
On 14Jun2013 16:58, Nikos as SuperHost Support <support@superhost.gr> wrote:
| On 14/6/2013 1:14 PM, Cameron Simpson wrote:
| >Normally a character in a b'...' item represents the byte value
| >matching the character's Unicode ordinal value.
|
| The only thing that i didn't understand is this line.
| First please tell me what is a byte value
The numeric value stored in a byte. Bytes are just small integers
in the range 0..255; the values available with 8 bits of storage.
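A small sketch of that range limit, assuming Python 3, where the bytes()
constructor refuses values that do not fit in 8 bits (the names are only
illustrative):

```python
smallest = bytes([0])      # the lowest value one byte can hold
largest = bytes([255])     # the highest: 2**8 - 1

try:
    bytes([256])           # one past the top of the range
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(smallest, largest)          # b'\x00' b'\xff'
print(out_of_range_rejected)      # True
```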
| >\x1b is a sequence you find inside strings (and "byte" strings, the
| >b'...' format).
|
| \x1b is a character(ESC) represented in hex format
Yes.
| b'\x1b' is a byte object that represents what?
An array of 1 byte, whose value is 0x1b or 27.
| >>> chr(27).encode('utf-8')
| b'\x1b'
Transcribing the ESC Unicode character to byte storage.
| >>> b'\x1b'.decode('utf-8')
| '\x1b'
Reading a single byte array containing a 27 and decoding it assuming 'utf-8'.
This obtains a single character string containing the ESC character.
| After decoding it gives the char ESC in hex format
| Shouldn't it result in value 27 which is the ordinal of ESC ?
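When the interactive prompt displays a string it shows the string's
repr(), in which non-printable characters are _represented_ in hex
format, so \x1b was printed. The string itself still contains a single
character, whose ordinal is 27.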
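To see the difference between the display and the content, a quick
sketch assuming Python 3 (s and shown are throwaway names):

```python
s = b'\x1b'.decode('utf-8')

shown = repr(s)    # what the interactive prompt displays
print(shown)       # '\x1b'  (six characters of display text)
print(len(s))      # 1: the string holds a single character
print(ord(s))      # 27: and this is that character's ordinal
```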
| > No, I mean conceptually, there is no difference between a code-point
| > and its ordinal value. They are the same thing.
|
| Why Unicode charset doesn't just contain characters, but instead it
| contains a mapping of (characters <--> ordinals) ?
Look, as far as a computer is concerned a character and an ordinal
are the same thing because you just store character ordinals in
memory when you store a string.
When characters are _displayed_, your Terminal (or web browser or
whatever) takes character ordinals and looks them up in a _font_,
which is a mapping of character ordinals to glyphs (character
images), and renders the character image onto your screen.
| I mean what we do is to encode a character like chr(65).encode('utf-8')
| What's the reason of existence of its corresponding ordinal value
| since it doesn't get involved into the encoding process?
Stop thinking of Unicode code points and ordinal values as separate
things. They are effectively two terms for the same thing. So there
is no "corresponding ordinal value". 65 _is_ the ordinal value.
When you run:
chr(65).encode('utf-8')
you're going:
chr(65) ==> 'A'
Producing a string with just one character in it.
Internally, Python stores an array of character ordinals, thus: [65]
'A'.encode('utf-8')
Walk along all the ordinals in the string and transcribe them as bytes.
For 65, the byte encoding in 'utf-8' is a single byte of value 65.
So you get an array of bytes (a "bytes object" in Python), thus: [65]
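The same walk as a rough sketch, assuming Python 3. The helper
utf8_values() is a hypothetical name of mine, and chr(16474) is included
only to show an ordinal that needs more than one byte in utf-8:

```python
def utf8_values(text):
    """Return the utf-8 encoding of text as a list of byte values."""
    return list(text.encode('utf-8'))

ordinals = [ord(c) for c in 'A' + chr(16474)]
print(ordinals)                    # [65, 16474]

print(utf8_values(chr(65)))        # [65]: one ordinal, one byte
print(utf8_values(chr(16474)))     # [228, 129, 154]: one ordinal, three bytes
```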
--
Cameron Simpson <cs@zip.com.au>
The double cam chain setup on the 1980's DOHC CB750 was another one of
Honda's pointless engineering breakthroughs. You know the cycle (if you'll
pardon the pun :-), Wonderful New Feature is introduced with much fanfare,
WNF is fawned over by the press, WNF is copied by the other three Japanese
makers (this step is sometimes optional), and finally, WNF is quietly dropped
by Honda.
- Blaine Gardner, <blgardne@sim.es.com>
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2013-06-15 06:34 +0000 |
| Message-ID | <kph1u6$su9$6@dont-email.me> |
| In reply to | #48150 |
On Fri, 14 Jun 2013 16:58:20 +0300, Nick the Gr33k wrote:
> On 14/6/2013 1:14 PM, Cameron Simpson wrote:
>> Normally a character in a b'...' item represents the byte value
>> matching the character's Unicode ordinal value.
> The only thing that i didn't understand is this line.
> First please tell me what is a byte value
Seriously? You don't understand the term byte? And you're the support
desk for a webhosting company?
--
Denis McMahon, denismfmcmahon@gmail.com