Groups > comp.lang.python > #47448 > unrolled thread
| Started by | Νικόλαος Κούρας <nikos.gr33k@gmail.com> |
|---|---|
| First post | 2013-06-09 03:44 -0700 |
| Last post | 2013-06-14 10:28 +0300 |
| Articles | 10 on this page of 110 — 36 participants |
A few questiosn about encoding Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:44 -0700
Re: A few questiosn about encoding Fábio Santos <fabiosantosart@gmail.com> - 2013-06-09 13:18 +0100
Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-09 18:01 +0100
Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-09 19:12 +0200
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 09:09 +0000
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-12 09:24 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 14:23 +0300
Re: A few questiosn about encoding Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-06-12 14:52 +0200
Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-12 21:30 +0100
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 01:40 +0000
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 12:01 +1000
Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-13 11:02 +0100
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:21 +0300
Re: A few questiosn about encoding jmfauth <wxjmfauth@gmail.com> - 2013-06-12 23:28 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 16:48 +1000
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 00:13 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:09 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 07:11 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 10:42 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 17:58 +1000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 11:08 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 18:20 +1000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 12:41 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 11:49 +0000
Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 17:19 +0300
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 11:00 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 09:59 +0300
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:14 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 16:58 +0300
Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 11:21 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 18:26 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:03 +1000
Re: A few questiosn about encoding Walter Hurry <walterhurry@lavabit.com> - 2013-06-14 23:32 +0000
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:26 +1000
Re: A few questiosn about encoding Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:34 +0000
Re: A few questiosn about encoding Grant Edwards <invalid@invalid.invalid> - 2013-06-15 14:44 +0000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 17:49 +0300
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-15 15:30 +0000
Re: A few questiosn about encoding Roy Smith <roy@panix.com> - 2013-06-15 10:59 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 18:14 +0300
Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-15 11:35 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 22:26 +0300
Re: A few questiosn about encoding Benjamin Schollnick <benjamin@schollnick.net> - 2013-06-15 16:35 -0400
Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-16 15:45 +0200
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 09:36 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:49 +0300
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 10:22 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 11:37 +0300
Don't feed the troll... (was: Re: A few questiosn about encoding) Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 11:06 +0200
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 12:32 +0300
Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 13:09 +0200
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:36 +0300
Re: Don't feed the troll... Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 08:44 -0400
Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:25 +0200
Re: Don't feed the troll... Neil Cerutti <neilc@norwich.edu> - 2013-06-14 15:54 +0000
Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 12:15 +0200
Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-14 18:50 -0400
Re: Don't feed the troll... Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:31 +0000
Re: Don't feed the troll... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-15 13:04 -0400
Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-17 16:15 -0400
Re: Don't feed the troll... Chris Angelico <rosuav@gmail.com> - 2013-06-18 07:46 +1000
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:19 +1000
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:41 +0300
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 11:20 +0100
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) rusi <rustompmody@gmail.com> - 2013-06-14 04:51 -0700
Re: Don't feed the help-vampire rusi <rustompmody@gmail.com> - 2013-06-14 05:09 -0700
Re: Don't feed the help-vampire Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:31 +0200
Re: Don't feed the help-vampire Ian Kelly <ian.g.kelly@gmail.com> - 2013-06-14 10:51 -0600
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:50 +0300
Re: Don't feed the troll... Zero Piraeus <schesis@gmail.com> - 2013-06-14 09:33 -0400
Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:45 +0300
Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:58 +0200
Re: Don't feed the troll... Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 14:25 +0100
Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 17:12 +0100
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 12:50 +0200
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:59 +0300
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:52 +0200
Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:28 +1000
Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-17 08:49 +0200
Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 12:57 +0100
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:13 -0400
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:31 +1000
Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Grant Edwards <invalid@invalid.invalid> - 2013-06-14 19:40 +0000
Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:56 -0400
Re: Don't feed the troll Tim Chase <python.list@tim.thechases.com> - 2013-06-14 14:00 -0500
Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 15:17 -0400
Re: Don't feed the troll... Ben Finney <ben+python@benfinney.id.au> - 2013-06-15 10:42 +1000
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-19 18:46 -0700
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-20 06:26 +0000
Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 12:43 +0100
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-20 09:27 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 02:37 +1000
Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 18:17 +0100
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-23 08:51 -0700
Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-23 16:30 +0000
Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-25 13:16 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 03:21 +1000
Re: A few questiosn about encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-20 20:43 +0100
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 06:40 -0700
Re: A few questiosn about encoding Andrew Berg <robotsondrugs@gmail.com> - 2013-06-20 09:04 -0500
Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 08:12 -0700
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:26 +1000
Re: A few questiosn about encoding Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-06-20 20:25 +0300
Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:28 +1000
Re: A few questiosn about encoding Andreas Perstinger <andipersti@gmail.com> - 2013-06-20 19:08 +0200
Re: A few questiosn about encoding Dave Angel <davea@davea.name> - 2013-06-12 08:43 -0400
Re: A few questiosn about encoding Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-13 18:46 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 08:34 +0300
Re: A few questiosn about encoding Zero Piraeus <schesis@gmail.com> - 2013-06-14 02:00 -0400
Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:28 +0300
| From | Rick Johnson <rantingrickjohnson@gmail.com> |
|---|---|
| Date | 2013-06-20 08:12 -0700 |
| Message-ID | <0f045970-7c77-4d66-81cf-214f111232c3@googlegroups.com> |
| In reply to | #48794 |
On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
> On 2013.06.20 08:40, Rick Johnson wrote:
> > then what is the purpose of a Unicode Braille character set?
> Two dimensional characters can be made into 3 dimensional shapes.

Yes in the real world. But what about on your computer screen? How do you plan on creating tactile representations of braille glyphs on my monitor? Hey, if you can already do this, please share, as it sure would make internet porn more interesting!

> Building numbers are a good example of this.

Either the matrix is reality or you must live inside your computer as a virtual being. Is your name Tron? Are you a pawn of Master Control? He's such a tyrant!
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-21 01:26 +1000 |
| Message-ID | <mailman.3625.1371741988.3114.python-list@python.org> |
| In reply to | #48799 |
On Fri, Jun 21, 2013 at 1:12 AM, Rick Johnson <rantingrickjohnson@gmail.com> wrote:
> On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
>> On 2013.06.20 08:40, Rick Johnson wrote:
>
>> > then what is the purpose of a Unicode Braille character set?
>> Two dimensional characters can be made into 3 dimensional shapes.
>
> Yes in the real world. But what about on your computer
> screen? How do you plan on creating tactile representations of
> braille glyphs on my monitor? Hey, if you can already do this,
> please share, as it sure would make internet porn more
> interesting!

I had a device for creating embossed text. It predated Unicode by a couple of years at least (not sure how many, because I was fairly young at the time). It was made by a company called Epson, it plugged into the computer via a 25-pin plug, and when it was properly functioning, it had a ribbon of ink that it would bash through to darken the underside of the embossed text. But sometimes that ribbon slipped out of position, and we had beautifully-hammered ASCII text, unsullied by ink.

And since the device did graphics too, it could be used for the entire Unicode character set if you wanted. Not sure that it would improve your porn any, but I've no doubt you could try if you wanted.

ChrisA
| From | Jussi Piitulainen <jpiitula@ling.helsinki.fi> |
|---|---|
| Date | 2013-06-20 20:25 +0300 |
| Message-ID | <qota9mkwv80.fsf@ruuvi.it.helsinki.fi> |
| In reply to | #48799 |
Rick Johnson writes:

> On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
> > On 2013.06.20 08:40, Rick Johnson wrote:
>
> > > then what is the purpose of a Unicode Braille character set?
> > Two dimensional characters can be made into 3 dimensional shapes.
>
> Yes in the real world. But what about on your computer screen? How
> do you plan on creating tactile representations of braille glyphs on
> my monitor? Hey, if you can already do this, please share, as it
> sure would make internet porn more interesting!

Search for braille display on the web. A Wikipedia article also led me to braille e-book.

(Or search for braille porn, since you are so inclined - the concept turns out to be already out there on the web.)
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2013-06-21 01:28 +1000 |
| Message-ID | <mailman.3626.1371742118.3114.python-list@python.org> |
| In reply to | #48791 |
On Thu, Jun 20, 2013 at 11:40 PM, Rick Johnson <rantingrickjohnson@gmail.com> wrote:
> Your generalization is analogous to explaining web browsers
> as: "software that allows a user to view web pages in the
> range www.*" Do you think someone could implement a web
> browser from such limited specification? (if that was all
> they knew?).

Wow. That spec isn't limited, it's downright faulty. Or do you really think that (a) there is such a thing as the "range www.*", and that (b) that "range" has anything to do with web browsers?

ChrisA
| From | Andreas Perstinger <andipersti@gmail.com> |
|---|---|
| Date | 2013-06-20 19:08 +0200 |
| Message-ID | <mailman.3631.1371748097.3114.python-list@python.org> |
| In reply to | #48791 |
Rick Johnson <rantingrickjohnson@gmail.com> wrote:
>============================================================
> Since we're on the subject of Unicode:
>============================================================
>One the most humorous aspects of Unicode is that it has
>encodings for Braille characters. Hmm, this presents a
>conundrum of sorts. RIDDLE ME THIS?!
>
> Since Braille is a type of "reading" for the blind by
> utilizing the sense of touch (therefore DEMANDING 3
> dimensions) and glyphs derived from Unicode are
> restrictively two dimensional, because let's face it people,
> Unicode exists in your computer, and computer screens are
> two dimensional... but you already knew that -- i think?,
> then what is the purpose of a Unicode Braille character set?
>
>That should haunt your nightmares for some time.

From http://www.unicode.org/versions/Unicode6.2.0/ch15.pdf

"The intent of encoding the 256 Braille patterns in the Unicode Standard is to allow input and output devices to be implemented that can interchange Braille data without having to go through a context-dependent conversion from semantic values to patterns, or vice versa. In this manner, final-form documents can be exchanged and faithfully rendered."

http://files.pef-format.org/specifications/pef-2008-1/pef-specification.html#Unicode

I wish you a pleasant sleep tonight.

Bye, Andreas
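As a rough illustration of the quoted passage, the 256 patterns can be generated directly from dot numbers. This is only a sketch: it assumes the Braille Patterns block starts at U+2800 and that dot n corresponds to bit n-1 of the offset, and `braille_cell` is a hypothetical helper name, not something from the thread or the standard library.

```python
import unicodedata

BRAILLE_BASE = 0x2800  # assumed start of the Braille Patterns block


def braille_cell(*dots):
    """Return the Braille pattern character with the given raised dots (1-8)."""
    offset = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError("dots are numbered 1 through 8")
        offset |= 1 << (dot - 1)  # dot n sets bit n-1
    return chr(BRAILLE_BASE + offset)


if __name__ == "__main__":
    cell = braille_cell(1, 2, 5)
    # Print the glyph, its code point, and the name Python's database gives it.
    print(cell, hex(ord(cell)), unicodedata.name(cell))
```

Because a pattern is just a code point, a device that raises pins and a screen that draws dots can exchange the same text without any conversion step in between.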
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-06-12 08:43 -0400 |
| Message-ID | <mailman.3105.1371041006.3114.python-list@python.org> |
| In reply to | #47767 |
On 06/12/2013 05:24 AM, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
>
>> Isn't 14 bits way to many to store a character ?
>
> No.
>
> There are 1114111 possible characters in Unicode. (And in Japan, they
> sometimes use TRON instead of Unicode, which has even more.)
>
> If you list out all the combinations of 14 bits:
>
> 0000 0000 0000 00
> 0000 0000 0000 01
> 0000 0000 0000 10
> 0000 0000 0000 11
> [...]
> 1111 1111 1111 10
> 1111 1111 1111 11
>
> you will see that there are only 32767 (2**15-1) such values. You can't
> fit 1114111 characters with just 32767 values.
>
>

Actually, it's worse. There are 16384 such values (2**14), assuming you include null, which you did in your list.

--
DaveA
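The arithmetic in this exchange is easy to check in the interpreter. A minimal sketch, assuming the Unicode code space runs from U+0000 through U+10FFFF (so 0x110000 code points in total); `values_in_bits` is a hypothetical helper name.

```python
def values_in_bits(n):
    """Number of distinct values an n-bit field can hold, zero included."""
    return 2 ** n


# Assumption: code points run from U+0000 through U+10FFFF inclusive.
UNICODE_CODE_POINTS = 0x110000

if __name__ == "__main__":
    print(values_in_bits(14))                       # values in 14 bits
    print(UNICODE_CODE_POINTS)                      # size of the code space
    print((UNICODE_CODE_POINTS - 1).bit_length())   # bits for the largest code point
```

The last line shows how many bits the largest code point needs, which is well beyond 14.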
| From | Dennis Lee Bieber <wlfraed@ix.netcom.com> |
|---|---|
| Date | 2013-06-13 18:46 -0400 |
| Message-ID | <mailman.3238.1371163584.3114.python-list@python.org> |
| In reply to | #47762 |
On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
<support@superhost.gr> declaimed the following:
>>> (*) infact UTF8 also indicates the end of each character
>
>> Up to a point. The initial byte encodes the length and the top few
>> bits, but the subsequent octets aren’t distinguishable as final in
>> isolation. 0x80-0xBF can all be either medial or final.
>
>
>So, the first high-bits are a directive that UTF-8 uses to know how many
>bytes each character is being represented as.
>
>0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
>storage and the rest 7 bits to actually store the character ?
>
Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation.
>while
>
>128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
>storage and the rest 14 bits to actually store the character ?
>
128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.
BUT, in UTF-8, a byte with a leading 1-bit signals that the byte
identifies a multi-byte sequence. CF:
https://en.wikipedia.org/wiki/UTF-8#Description
So anything that starts with bits 110 is a two byte sequence (and the
second byte must start with bits 10 to be valid)
1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)
>Isn't 14 bits way to many to store a character ?
Original UTF-8 allowed for 31-bits to specify a character in the Unicode
set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation.
--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com HTTP://wlfraed.home.netcom.com/
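
The "count the leading 1-bits" rule described in this post can be sketched in a few lines of Python. The function names are hypothetical, and the sketch follows the original 1-to-6-byte layout the post mentions; present-day UTF-8 stops at four bytes.

```python
def utf8_sequence_length(lead_byte):
    """Return the sequence length announced by a UTF-8 lead byte."""
    if not 0 <= lead_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    ones = 0
    mask = 0x80
    # Count 1-bits from the top until the first 0-bit.
    while mask and lead_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: sent as-is
    if ones == 1:
        raise ValueError("10xxxxxx is a continuation byte, not a lead byte")
    if ones > 6:
        raise ValueError("0xFE and 0xFF never start a sequence")
    return ones


def payload_bits(length):
    """Bits left for the code point in a sequence of the given length."""
    if length == 1:
        return 7
    # Lead byte loses length+1 flag bits; each continuation loses 2.
    return (7 - length) + 6 * (length - 1)


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, "byte(s):", payload_bits(n), "payload bits")
```

For a six-byte sequence the second function gives the same 31 bits as the subtraction in the post.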
| From | Nick the Gr33k <support@superhost.gr> |
|---|---|
| Date | 2013-06-14 08:34 +0300 |
| Message-ID | <kpea1c$p37$1@news.ntua.gr> |
| In reply to | #48038 |
On 14/6/2013 1:46 πμ, Dennis Lee Bieber wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
> <support@superhost.gr> declaimed the following:
>
>>>> (*) infact UTF8 also indicates the end of each character
>>
>>> Up to a point. The initial byte encodes the length and the top few
>>> bits, but the subsequent octets aren’t distinguishable as final in
>>> isolation. 0x80-0xBF can all be either medial or final.
>>
>>
>> So, the first high-bits are a directive that UTF-8 uses to know how many
>> bytes each character is being represented as.
>>
>> 0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
>> storage and the rest 7 bits to actually store the character ?
>>
> Not quite... The leading bit is a 0 -> which means 0..127 are sent
> as-is, no manipulation.

So, in utf-8, the leading bit, which is a zero, is actually a flag to tell that the code-point needs 1 byte to be stored, and the remaining 7 bits are for the actual value of the 0-127 code-points?

>> 128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
>> storage and the rest 14 bits to actually store the character ?
>>
> 128..255 -- in what encoding? These all have the leading bit with a
> value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
> inherent in the specified encoding and they are sent as-is.

So, in latin-iso or greek-iso, the leading 0 is not a flag like it is in the utf-8 encoding, because latin-iso and greek-iso and all *-iso use all 8 bits for storage?

But, in utf-8, the leading bit, which is 1, is to tell that the code-point needs 2 bytes to be stored, and the remaining 7 bits are for the actual value of the 128-255 code-points?

But why 2 bytes? The leading 1 is a flag and the remaining 7 bits can hold the encoded value.

But that is not the case, since we know that utf-8 needs 2 bytes to store code-points 128-255.

> 1110 starts a three byte sequence, 11110 starts a four byte sequence...
> Basically, count the number of leading 1-bits before a 0 bit, and that
> tells you how many bytes are in the multi-byte sequence -- and all bytes
> that start with 10 are supposed to be the continuations of a multibyte set
> (and not a signal that this is a 1-byte entry -- those only have a leading
> 0)

Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?

> Original UTF-8 allowed for 31-bits to specify a character in the Unicode
> set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
> the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
> continuation.

utf8 at 6 bytes = 48 bits - 7 bits (from the first byte) - 2 bits (for each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual code-point. But 2^31 is still a huge number to store any kind of character, isn't it?

--
What is now proved was at first only imagined!
| From | Zero Piraeus <schesis@gmail.com> |
|---|---|
| Date | 2013-06-14 02:00 -0400 |
| Message-ID | <mailman.3253.1371189687.3114.python-list@python.org> |
| In reply to | #48063 |
:

On 14 June 2013 01:34, Nick the Gr33k <support@superhost.gr> wrote:
> Why doesn't it work like this?
>
> leading 0 = 1 byte flag
> leading 1 = 2 bytes flag
> leading 00 = 3 bytes flag
> leading 01 = 4 bytes flag
> leading 10 = 5 bytes flag
> leading 11 = 6 bytes flag
>
> Wouldn't it be more logical?

Think about it. Let's say that, as per your scheme, a leading 0 indicates "1 byte" (as is indeed the case in UTF8). What things could follow that leading 0? How does that impact your choice of a leading 00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more than that. Here's a byte:

  01010101

Is that a single byte representing a code point in the 0-127 range, or the first of 4 bytes representing something else, in your proposed scheme? How can you tell?

Now look at the way UTF8 does it:

  <http://en.wikipedia.org/wiki/Utf-8#Description>

Really, follow the link and study the table carefully. Don't continue reading this until you believe you understand the choices that the designers of UTF8 made, and why they made them.

Pay particular attention to the possible values for byte 1. Do you notice the difference between that scheme, and yours:

  0xxxxxxx
  1xxxxxxx
  00xxxxxx
  01xxxxxx
  10xxxxxx
  11xxxxxx

If you don't see it, keep looking until you do ... this email gives you more than enough hints to work it out. Don't ask someone here to explain it to you. If you want to become competent, you must use your brain.

 -[]z.
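The hint in this post comes down to whether one flag can be mistaken for the beginning of another. A small sketch of that check, assuming the flags are written as bit strings; `ambiguous_pairs` is a hypothetical helper, and the UTF-8 list uses the original layout with lead bytes for up to six bytes plus the `10` continuation marker.

```python
def ambiguous_pairs(flags):
    """Return (a, b) pairs where flag a is a proper prefix of flag b."""
    return [
        (a, b)
        for a in flags
        for b in flags
        if a != b and b.startswith(a)
    ]


# The scheme proposed in the quoted message.
proposed = ["0", "1", "00", "01", "10", "11"]

# UTF-8 byte prefixes: single byte, continuation, then 2- to 6-byte leads.
utf8 = ["0", "10", "110", "1110", "11110", "111110", "1111110"]

if __name__ == "__main__":
    print(ambiguous_pairs(proposed))
    print(ambiguous_pairs(utf8))
```

A decoder reading bits from the left can only act on a flag if no other flag starts the same way, which is what the second call checks.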
| From | Nick the Gr33k <support@superhost.gr> |
|---|---|
| Date | 2013-06-14 10:28 +0300 |
| Message-ID | <kpegmv$p37$4@news.ntua.gr> |
| In reply to | #48066 |
On 14/6/2013 9:00 πμ, Zero Piraeus wrote:
> :
>
> On 14 June 2013 01:34, Nick the Gr33k <support@superhost.gr> wrote:
>> Why doesn't it work like this?
>>
>> leading 0 = 1 byte flag
>> leading 1 = 2 bytes flag
>> leading 00 = 3 bytes flag
>> leading 01 = 4 bytes flag
>> leading 10 = 5 bytes flag
>> leading 11 = 6 bytes flag
>>
>> Wouldn't it be more logical?
>
> Think about it. Let's say that, as per your scheme, a leading 0
> indicates "1 byte" (as is indeed the case in UTF8). What things could
> follow that leading 0? How does that impact your choice of a leading
> 00 or 01 for other numbers of bytes?
>
> ... okay, you're obviously going to need to be spoon-fed a little more
> than that. Here's a byte:
>
> 01010101
>
> Is that a single byte representing a code point in the 0-127 range, or
> the first of 4 bytes representing something else, in your proposed
> scheme? How can you tell?

Indeed. You cannot tell if it stands for a 1 byte or a 4 byte sequence:

0 + 1010101 = leading 0 stands for 1-byte representation of a code-point
01 + 010101 = leading 01 stands for 4-byte representation of a code-point

The problem here in my scheme of how utf8 encoding works is that you cannot tell whether the flag is '0' or '01'.

The same happens with leading '1' and '11'. You cannot tell what the flag is, so you cannot know if the Unicode code-point is being represented as a 2-byte sequence or a 6-byte sequence.

Understood.

> Now look at the way UTF8 does it:
> <http://en.wikipedia.org/wiki/Utf-8#Description>
>
> Really, follow the link and study the table carefully. Don't continue
> reading this until you believe you understand the choices that the
> designers of UTF8 made, and why they made them.
>
> Pay particular attention to the possible values for byte 1. Do you
> notice the difference between that scheme, and yours:
>
> 0xxxxxxx
> 1xxxxxxx
> 00xxxxxx
> 01xxxxxx
> 10xxxxxx
> 11xxxxxx
>
> If you don't see it, keep looking until you do ... this email gives
> you more than enough hints to work it out. Don't ask someone here to
> explain it to you. If you want to become competent, you must use your
> brain.

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

I did read the link but I still cannot see:

1. why '110' is the flag for a 2-byte code-point
2. why in the 2nd byte and every subsequent byte the leading flag has to be '10'

--
What is now proved was at first only imagined!
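
One way to look at the two closing questions is to build a two-byte sequence by hand and then try to find character boundaries from the middle of a byte stream. The sketch below is an illustration only; `encode_two_byte`, `is_continuation` and `start_of_character` are hypothetical helpers, not library functions, and Greek alpha (U+03B1) is an arbitrary example.

```python
def encode_two_byte(code_point):
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("needs a different sequence length")
    lead = 0b11000000 | (code_point >> 6)          # 110 + top 5 bits
    cont = 0b10000000 | (code_point & 0b00111111)  # 10 + low 6 bits
    return bytes([lead, cont])


def is_continuation(byte):
    """True for 10xxxxxx, which is never ASCII and never a lead byte."""
    return byte & 0b11000000 == 0b10000000


def start_of_character(data, index):
    """Step back from any offset to the first byte of its character."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


if __name__ == "__main__":
    alpha = encode_two_byte(0x03B1)
    print(alpha == "α".encode("utf-8"))
    stream = "aαb".encode("utf-8")
    # Offset 2 lands on the second byte of alpha; step back to its lead byte.
    print(start_of_character(stream, 2))
```

Since `10` is reserved for continuation bytes, a lead byte has to begin with something else that is not `0`, which leaves `11` followed by a terminating `0` bit. That is where `110` comes from, and it lets a reader dropped into the middle of a stream recognise which bytes are not the start of a character.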