

Groups > comp.lang.python > #47448 > unrolled thread

A few questiosn about encoding

Started by Νικόλαος Κούρας <nikos.gr33k@gmail.com>
First post 2013-06-09 03:44 -0700
Last post 2013-06-14 10:28 +0300
Articles 10 on this page of 110 — 36 participants



Contents

  A few questiosn about encoding Νικόλαος Κούρας <nikos.gr33k@gmail.com> - 2013-06-09 03:44 -0700
    Re: A few questiosn about encoding Fábio Santos <fabiosantosart@gmail.com> - 2013-06-09 13:18 +0100
    Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-09 18:01 +0100
    Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-09 19:12 +0200
      Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 09:09 +0000
        Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-12 09:24 +0000
          Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-12 14:23 +0300
            Re: A few questiosn about encoding Ulrich Eckhardt <ulrich.eckhardt@dominolaser.com> - 2013-06-12 14:52 +0200
            Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-12 21:30 +0100
              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 01:40 +0000
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 12:01 +1000
                  Re: A few questiosn about encoding Nobody <nobody@nowhere.com> - 2013-06-13 11:02 +0100
              Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:21 +0300
                Re: A few questiosn about encoding jmfauth <wxjmfauth@gmail.com> - 2013-06-12 23:28 -0700
                Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 16:48 +1000
            Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 00:13 +0000
              Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 09:09 +0300
                Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 07:11 +0000
                  Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 10:42 +0300
                    Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 17:58 +1000
                      Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 11:08 +0300
                        Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-13 18:20 +1000
                          Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 12:41 +0300
                            Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-13 11:49 +0000
                              Re: A few questiosn about encoding Νικόλαος Κούρας <support@superhost.gr> - 2013-06-13 17:19 +0300
                                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 11:00 +1000
                                  Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 09:59 +0300
                                    Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:14 +1000
                                      Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 16:58 +0300
                                        Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 11:21 -0400
                                          Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 18:26 +0300
                                            Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:03 +1000
                                              Re: A few questiosn about encoding Walter Hurry <walterhurry@lavabit.com> - 2013-06-14 23:32 +0000
                                        Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:26 +1000
                                        Re: A few questiosn about encoding Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:34 +0000
                                          Re: A few questiosn about encoding Grant Edwards <invalid@invalid.invalid> - 2013-06-15 14:44 +0000
                                            Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 17:49 +0300
                                              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-15 15:30 +0000
                                            Re: A few questiosn about encoding Roy Smith <roy@panix.com> - 2013-06-15 10:59 -0400
                                              Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 18:14 +0300
                                                Re: A few questiosn about encoding Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-15 11:35 -0400
                                        Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-15 22:26 +0300
                                          Re: A few questiosn about encoding Benjamin Schollnick <benjamin@schollnick.net> - 2013-06-15 16:35 -0400
                                          Re: A few questiosn about encoding Chris “Kwpolska” Warrick <kwpolska@gmail.com> - 2013-06-16 15:45 +0200
                        Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 09:36 +0200
                          Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:49 +0300
                            Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 10:22 +0200
                              Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 11:37 +0300
                                Don't feed the troll... (was: Re: A few questiosn about encoding) Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 11:06 +0200
                                  Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 12:32 +0300
                                    Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 13:09 +0200
                                      Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:36 +0300
                                        Re: Don't feed the troll... Joel Goldstick <joel.goldstick@gmail.com> - 2013-06-14 08:44 -0400
                                        Re: Don't feed the troll... Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:25 +0200
                                          Re: Don't feed the troll... Neil Cerutti <neilc@norwich.edu> - 2013-06-14 15:54 +0000
                                    Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 12:15 +0200
                                    Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-14 18:50 -0400
                                    Re: Don't feed the troll... Denis McMahon <denismfmcmahon@gmail.com> - 2013-06-15 06:31 +0000
                                      Re: Don't feed the troll... Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-15 13:04 -0400
                                    Re: Don't feed the troll... Guy Scree <nobody@nowhere.com> - 2013-06-17 16:15 -0400
                                      Re: Don't feed the troll... Chris Angelico <rosuav@gmail.com> - 2013-06-18 07:46 +1000
                                Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-14 20:19 +1000
                                  Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:41 +0300
                                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 11:20 +0100
                                  Re: Don't feed the troll... (was: Re: A few questiosn about encoding) rusi <rustompmody@gmail.com> - 2013-06-14 04:51 -0700
                                    Re: Don't feed the help-vampire rusi <rustompmody@gmail.com> - 2013-06-14 05:09 -0700
                                      Re: Don't feed the help-vampire Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:31 +0200
                                      Re: Don't feed the help-vampire Ian Kelly <ian.g.kelly@gmail.com> - 2013-06-14 10:51 -0600
                                    Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:50 +0300
                                      Re: Don't feed the troll... Zero Piraeus <schesis@gmail.com> - 2013-06-14 09:33 -0400
                                  Re: Don't feed the troll... Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:45 +0300
                                    Re: Don't feed the troll... Heiko Wundram <modelnine@modelnine.org> - 2013-06-14 14:58 +0200
                                    Re: Don't feed the troll... Fábio Santos <fabiosantosart@gmail.com> - 2013-06-14 14:25 +0100
                                    Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 17:12 +0100
                                Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 12:50 +0200
                                  Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 15:59 +0300
                                    Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-14 15:52 +0200
                                    Re: A few questiosn about encoding Cameron Simpson <cs@zip.com.au> - 2013-06-15 10:28 +1000
                                    Re: A few questiosn about encoding Antoon Pardon <antoon.pardon@rece.vub.ac.be> - 2013-06-17 08:49 +0200
                                Re: Don't feed the troll... Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-14 12:57 +0100
                                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:13 -0400
                                Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Chris Angelico <rosuav@gmail.com> - 2013-06-15 03:31 +1000
                                  Re: Don't feed the troll... (was: Re: A few questiosn about encoding) Grant Edwards <invalid@invalid.invalid> - 2013-06-14 19:40 +0000
                                Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 13:56 -0400
                                Re: Don't feed the troll Tim Chase <python.list@tim.thechases.com> - 2013-06-14 14:00 -0500
                                Re: Don't feed the troll "D'Arcy J.M. Cain" <darcy@druid.net> - 2013-06-14 15:17 -0400
                                Re: Don't feed the troll... Ben Finney <ben+python@benfinney.id.au> - 2013-06-15 10:42 +1000
                  Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-19 18:46 -0700
                    Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-20 06:26 +0000
                      Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 12:43 +0100
                        Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-20 09:27 -0700
                          Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 02:37 +1000
                          Re: A few questiosn about encoding MRAB <python@mrabarnett.plus.com> - 2013-06-20 18:17 +0100
                            Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-23 08:51 -0700
                              Re: A few questiosn about encoding Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-06-23 16:30 +0000
                                Re: A few questiosn about encoding wxjmfauth@gmail.com - 2013-06-25 13:16 -0700
                          Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 03:21 +1000
                          Re: A few questiosn about encoding Mark Lawrence <breamoreboy@yahoo.co.uk> - 2013-06-20 20:43 +0100
                      Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 06:40 -0700
                        Re: A few questiosn about encoding Andrew Berg <robotsondrugs@gmail.com> - 2013-06-20 09:04 -0500
                          Re: A few questiosn about encoding Rick Johnson <rantingrickjohnson@gmail.com> - 2013-06-20 08:12 -0700
                            Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:26 +1000
                            Re: A few questiosn about encoding Jussi Piitulainen <jpiitula@ling.helsinki.fi> - 2013-06-20 20:25 +0300
                        Re: A few questiosn about encoding Chris Angelico <rosuav@gmail.com> - 2013-06-21 01:28 +1000
                        Re: A few questiosn about encoding Andreas Perstinger <andipersti@gmail.com> - 2013-06-20 19:08 +0200
          Re: A few questiosn about encoding Dave Angel <davea@davea.name> - 2013-06-12 08:43 -0400
        Re: A few questiosn about encoding Dennis Lee Bieber <wlfraed@ix.netcom.com> - 2013-06-13 18:46 -0400
          Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 08:34 +0300
            Re: A few questiosn about encoding Zero Piraeus <schesis@gmail.com> - 2013-06-14 02:00 -0400
              Re: A few questiosn about encoding Nick the Gr33k <support@superhost.gr> - 2013-06-14 10:28 +0300



#48799

From Rick Johnson <rantingrickjohnson@gmail.com>
Date 2013-06-20 08:12 -0700
Message-ID <0f045970-7c77-4d66-81cf-214f111232c3@googlegroups.com>
In reply to #48794
On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
> On 2013.06.20 08:40, Rick Johnson wrote:

> >     then what is the purpose of a Unicode Braille character set?
> Two dimensional characters can be made into 3 dimensional shapes.

Yes in the real world. But what about on your computer
screen? How do you plan on creating tactile representations of
braille glyphs on my monitor? Hey, if you can already do this, 
please share, as it sure would make internet porn more 
interesting!

> Building numbers are a good example of this.

Either the matrix is reality or you must live inside your
computer as a virtual being. Is your name Tron? Are you a pawn
of Master Control? He's such a tyrant!



#48800

From Chris Angelico <rosuav@gmail.com>
Date 2013-06-21 01:26 +1000
Message-ID <mailman.3625.1371741988.3114.python-list@python.org>
In reply to #48799
On Fri, Jun 21, 2013 at 1:12 AM, Rick Johnson
<rantingrickjohnson@gmail.com> wrote:
> On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
>> On 2013.06.20 08:40, Rick Johnson wrote:
>
>> >     then what is the purpose of a Unicode Braille character set?
>> Two dimensional characters can be made into 3 dimensional shapes.
>
> Yes in the real world. But what about on your computer
> screen? How do you plan on creating tactile representations of
> braille glyphs on my monitor? Hey, if you can already do this,
> please share, as it sure would make internet porn more
> interesting!

I had a device for creating embossed text. It predated Unicode by a
couple of years at least (not sure how many, because I was fairly
young at the time). It was made by a company called Epson, it plugged
into the computer via a 25-pin plug, and when it was properly
functioning, it had a ribbon of ink that it would bash through to
darken the underside of the embossed text. But sometimes that ribbon
slipped out of position, and we had beautifully-hammered ASCII text,
unsullied by ink. And since the device did graphics too, it could be
used for the entire Unicode character set if you wanted.

Not sure that it would improve your porn any, but I've no doubt you
could try if you wanted.

ChrisA



#48814

From Jussi Piitulainen <jpiitula@ling.helsinki.fi>
Date 2013-06-20 20:25 +0300
Message-ID <qota9mkwv80.fsf@ruuvi.it.helsinki.fi>
In reply to #48799
Rick Johnson writes:
> On Thursday, June 20, 2013 9:04:50 AM UTC-5, Andrew Berg wrote:
> > On 2013.06.20 08:40, Rick Johnson wrote:
> 
> > >   then what is the purpose of a Unicode Braille character set?
> > Two dimensional characters can be made into 3 dimensional shapes.
> 
> Yes in the real world. But what about on your computer screen? How
> do you plan on creating tactile representations of braille glyphs on
> my monitor? Hey, if you can already do this, please share, as it
> sure would make internet porn more interesting!

Search for "braille display" on the web. A Wikipedia article also led me
to "braille e-book". (Or search for braille porn, since you are so
inclined - the concept turns out to be already out there on the web.)



#48801

From Chris Angelico <rosuav@gmail.com>
Date 2013-06-21 01:28 +1000
Message-ID <mailman.3626.1371742118.3114.python-list@python.org>
In reply to #48791
On Thu, Jun 20, 2013 at 11:40 PM, Rick Johnson
<rantingrickjohnson@gmail.com> wrote:
> Your generalization is analogous to explaining web browsers
> as: "software that allows a user to view web pages in the
> range www.*" Do you think someone could implement a web
> browser from such limited specification? (if that was all
> they knew?).

Wow. That spec isn't limited, it's downright faulty. Or do you really
think that (a) there is such a thing as the "range www.*", and that
(b) that "range" has anything to do with web browsers?

ChrisA



#48808

From Andreas Perstinger <andipersti@gmail.com>
Date 2013-06-20 19:08 +0200
Message-ID <mailman.3631.1371748097.3114.python-list@python.org>
In reply to #48791
Rick Johnson <rantingrickjohnson@gmail.com> wrote:
>============================================================
> Since we're on the subject of Unicode:
>============================================================
>One the most humorous aspects of Unicode is that it has
>encodings for Braille characters. Hmm, this presents a
>conundrum of sorts. RIDDLE ME THIS?!
>
>    Since Braille is a type of "reading" for the blind by
>    utilizing the sense of touch (therefore DEMANDING 3
>    dimensions) and glyphs derived from Unicode are
>    restrictively two dimensional, because let's face it people,
>    Unicode exists in your computer, and computer screens are
>    two dimensional... but you already knew that -- i think?,
>    then what is the purpose of a Unicode Braille character set?
>
>That should haunt your nightmares for some time.

From http://www.unicode.org/versions/Unicode6.2.0/ch15.pdf
"The intent of encoding the 256 Braille patterns in the Unicode
Standard is to allow input and output devices to be implemented that
can interchange Braille data without having to go through a
context-dependent conversion from semantic values to patterns, or vice
versa. In this manner, final-form documents can be exchanged and
faithfully rendered."

http://files.pef-format.org/specifications/pef-2008-1/pef-specification.html#Unicode
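
For anyone who wants to see those 256 patterns from Python, here is a
minimal sketch. It assumes the layout of the Unicode Braille Patterns
block (U+2800 to U+28FF), where dot n of the cell sets bit n-1 of the
offset from U+2800. The function name braille_pattern is made up for
illustration.

```python
import unicodedata


def braille_pattern(dots):
    """Return the Braille pattern character with the given dots (1..8) raised."""
    mask = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError("Braille dots are numbered 1 to 8")
        mask |= 1 << (dot - 1)  # dot n sets bit n-1
    return chr(0x2800 + mask)


ch = braille_pattern([1, 2, 5])
print(hex(ord(ch)), unicodedata.name(ch))
```

The character is just a dot pattern; what letter or contraction it
stands for depends on the Braille code in use, which is the
"context-dependent conversion" the quote refers to.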

I wish you a pleasant sleep tonight.

Bye, Andreas



#47797

From Dave Angel <davea@davea.name>
Date 2013-06-12 08:43 -0400
Message-ID <mailman.3105.1371041006.3114.python-list@python.org>
In reply to #47767
On 06/12/2013 05:24 AM, Steven D'Aprano wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000, Νικόλαος Κούρας wrote:
>
>> Isn't 14 bits way to many to store a character ?
>
> No.
>
> There are 1114111 possible characters in Unicode. (And in Japan, they
> sometimes use TRON instead of Unicode, which has even more.)
>
> If you list out all the combinations of 14 bits:
>
> 0000 0000 0000 00
> 0000 0000 0000 01
> 0000 0000 0000 10
> 0000 0000 0000 11
> [...]
> 1111 1111 1111 10
> 1111 1111 1111 11
>
> you will see that there are only 32767 (2**15-1) such values. You can't
> fit 1114111 characters with just 32767 values.
>
>

Actually, it's worse.  There are 16384 such values (2**14), assuming you 
include null, which you did in your list.
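
The arithmetic is easy to check at the interpreter. A minimal sketch
using only built-ins; the one assumption is that the Unicode code space
runs from U+0000 to U+10FFFF inclusive, which makes 1114112 code
points, one more than the figure quoted above.

```python
# Number of distinct values that fit in 14 bits.
values_in_14_bits = 2 ** 14

# Code points run from U+0000 to U+10FFFF inclusive.
unicode_code_points = 0x10FFFF + 1

# Smallest number of bits able to index the highest code point.
bits_needed = 0x10FFFF.bit_length()

print(values_in_14_bits)    # 16384
print(unicode_code_points)  # 1114112
print(bits_needed)          # 21
```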

-- 
DaveA



#48038

From Dennis Lee Bieber <wlfraed@ix.netcom.com>
Date 2013-06-13 18:46 -0400
Message-ID <mailman.3238.1371163584.3114.python-list@python.org>
In reply to #47762
On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
<support@superhost.gr> declaimed the following:

>>> (*) infact UTF8 also indicates the end of each character
>
>> Up to a point.  The initial byte encodes the length and the top few
>> bits, but the subsequent octets aren’t distinguishable as final in
>> isolation.  0x80-0xBF can all be either medial or final.
>
>
>So, the first high-bits are a directive that UTF-8 uses to know how many 
>bytes each character is being represented as.
>
>0-127 codepoints(characters) use 1 bit to signify they need 1 bit for 
>storage and the rest 7 bits to actually store the character ?
>
	Not quite... The leading bit is a 0 -> which means 0..127 are sent
as-is, no manipulation.

>while
>
>128-256 codepoints(characters) use 2 bit to signify they need 2 bits for 
>storage and the rest 14 bits to actually store the character ?
>
	128..255 -- in what encoding? These all have the leading bit with a
value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
inherent in the specified encoding and they are sent as-is.

	BUT, in UTF-8, a byte with a leading 1-bit signals that the byte
identifies a multi-byte sequence. CF:
https://en.wikipedia.org/wiki/UTF-8#Description

	So anything that starts with bits 110 is a two byte sequence (and the
second byte must start with bits 10 to be valid)

	1110 starts a three byte sequence, 11110 starts a four byte sequence...
Basically, count the number of leading 1-bits before a 0 bit, and that
tells you how many bytes are in the multi-byte sequence -- and all bytes
that start with 10 are supposed to be the continuations of a multibyte set
(and not a signal that this is a 1-byte entry -- those only have a leading
0)
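
	The counting rule can be sketched in a few lines of Python. This is
an illustration only: the function name is made up, it covers the
current 4-byte maximum, and it does not reject the handful of lead
bytes (such as 0xC0 or 0xF5) that a strict decoder refuses.

```python
def utf8_sequence_length(first_byte):
    """Return the length of a UTF-8 sequence, judged from its first byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    if (first_byte & 0b10000000) == 0b00000000:
        return 1  # 0xxxxxxx: a single byte, sent as-is
    if (first_byte & 0b11100000) == 0b11000000:
        return 2  # 110xxxxx
    if (first_byte & 0b11110000) == 0b11100000:
        return 3  # 1110xxxx
    if (first_byte & 0b11111000) == 0b11110000:
        return 4  # 11110xxx
    # 10xxxxxx is a continuation byte; 11111xxx is not used today.
    raise ValueError("byte cannot start a UTF-8 sequence")


for ch in ("A", "α", "€"):
    first = ch.encode("utf-8")[0]
    print(ch, format(first, "08b"), utf8_sequence_length(first))
```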

>Isn't 14 bits way to many to store a character ? 

Original UTF-8 allowed for 31-bits to specify a character in the Unicode
set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
continuation.
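
	The same bookkeeping works for every sequence length. A rough sketch
(payload_bits is a made-up helper name, and the 5- and 6-byte forms
exist only in the original definition, not in current UTF-8):

```python
def payload_bits(n_bytes):
    """Bits left for the code point in an n-byte sequence (original UTF-8)."""
    if n_bytes == 1:
        return 7  # 0xxxxxxx: one flag bit, seven value bits
    if not 2 <= n_bytes <= 6:
        raise ValueError("original UTF-8 sequences are 1 to 6 bytes long")
    total = 8 * n_bytes
    lead_flag = n_bytes + 1                 # n leading 1-bits plus a 0 bit
    continuation_flags = 2 * (n_bytes - 1)  # '10' on every following byte
    return total - lead_flag - continuation_flags


for n in range(1, 7):
    print(n, payload_bits(n))
```

For n = 6 this gives 48 - 7 - 10 = 31, matching the figure above.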

	
-- 
	Wulfraed                 Dennis Lee Bieber         AF6VN
    wlfraed@ix.netcom.com    HTTP://wlfraed.home.netcom.com/



#48063

From Nick the Gr33k <support@superhost.gr>
Date 2013-06-14 08:34 +0300
Message-ID <kpea1c$p37$1@news.ntua.gr>
In reply to #48038
On 14/6/2013 1:46 πμ, Dennis Lee Bieber wrote:
> On Wed, 12 Jun 2013 09:09:05 +0000 (UTC), Νικόλαος Κούρας
> <support@superhost.gr> declaimed the following:
>
>>>> (*) infact UTF8 also indicates the end of each character
>>
>>> Up to a point.  The initial byte encodes the length and the top few
>>> bits, but the subsequent octets aren’t distinguishable as final in
>>> isolation.  0x80-0xBF can all be either medial or final.
>>
>>
>> So, the first high-bits are a directive that UTF-8 uses to know how many
>> bytes each character is being represented as.
>>
>> 0-127 codepoints(characters) use 1 bit to signify they need 1 bit for
>> storage and the rest 7 bits to actually store the character ?
>>
> 	Not quite... The leading bit is a 0 -> which means 0..127 are sent
> as-is, no manipulation.

So, in UTF-8, the leading bit, which is a 0, is actually a flag saying 
that the code point needs 1 byte of storage, and the remaining 7 bits 
hold the actual value of code points 0-127?

>> 128-256 codepoints(characters) use 2 bit to signify they need 2 bits for
>> storage and the rest 14 bits to actually store the character ?
>>
> 	128..255 -- in what encoding? These all have the leading bit with a
> value of 1. In 8-bit encodings (ISO-Latin-1) the meaning of those values is
> inherent in the specified encoding and they are sent as-is.

So, in latin-iso or greek-iso, the leading 0 is not a flag like it is in 
UTF-8, because latin-iso, greek-iso and all the other *-iso encodings 
use all 8 bits for storage?

But in UTF-8, the leading bit, which is 1, says that the code point 
needs 2 bytes of storage, and the remaining 7 bits hold the actual 
value of code points 128-255?

But why 2 bytes? The leading 1 is a flag and the remaining 7 bits can 
hold the encoded value.

But that is not the case, since we know that UTF-8 needs 2 bytes to 
store code points 128-255.


> 	1110 starts a three byte sequence, 11110 starts a four byte sequence...
> Basically, count the number of leading 1-bits before a 0 bit, and that
> tells you how many bytes are in the multi-byte sequence -- and all bytes
> that start with 10 are supposed to be the continuations of a multibyte set
> (and not a signal that this is a 1-byte entry -- those only have a leading
> 0)

Why doesn't it work like this?

leading 0 = 1 byte flag
leading 1 = 2 bytes flag
leading 00 = 3 bytes flag
leading 01 = 4 bytes flag
leading 10 = 5 bytes flag
leading 11 = 6 bytes flag

Wouldn't it be more logical?


> Original UTF-8 allowed for 31-bits to specify a character in the Unicode
> set. It used 6 bytes -- 48 bits total, but 7 bits of the first byte were
> the flag (6 leading 1 bits and a 0 bit), and two bits (leading 10) of each
> continuation.

UTF-8 with 6 bytes = 48 bits - 7 bits (from the first byte) - 2 bits (for 
each continuation) * 5 = 48 - 7 - 10 = 31 bits indeed to store the actual 
code point. But 2^31 is still a huge number for storing any kind of 
character, isn't it?





-- 
What is now proved was at first only imagined!



#48066

From Zero Piraeus <schesis@gmail.com>
Date 2013-06-14 02:00 -0400
Message-ID <mailman.3253.1371189687.3114.python-list@python.org>
In reply to #48063
:

On 14 June 2013 01:34, Nick the Gr33k <support@superhost.gr> wrote:
> Why doesn't it work like this?
>
> leading 0 = 1 byte flag
> leading 1 = 2 bytes flag
> leading 00 = 3 bytes flag
> leading 01 = 4 bytes flag
> leading 10 = 5 bytes flag
> leading 11 = 6 bytes flag
>
> Wouldn't it be more logical?

Think about it. Let's say that, as per your scheme, a leading 0
indicates "1 byte" (as is indeed the case in UTF8). What things could
follow that leading 0? How does that impact your choice of a leading
00 or 01 for other numbers of bytes?

... okay, you're obviously going to need to be spoon-fed a little more
than that. Here's a byte:

  01010101

Is that a single byte representing a code point in the 0-127 range, or
the first of 4 bytes representing something else, in your proposed
scheme? How can you tell?

Now look at the way UTF8 does it:
<http://en.wikipedia.org/wiki/Utf-8#Description>

Really, follow the link and study the table carefully. Don't continue
reading this until you believe you understand the choices that the
designers of UTF8 made, and why they made them.

Pay particular attention to the possible values for byte 1. Do you
notice the difference between that scheme, and yours:

  0xxxxxxx
  1xxxxxxx
  00xxxxxx
  01xxxxxx
  10xxxxxx
  11xxxxxx

If you don't see it, keep looking until you do ... this email gives
you more than enough hints to work it out. Don't ask someone here to
explain it to you. If you want to become competent, you must use your
brain.

 -[]z.



#48073

From Nick the Gr33k <support@superhost.gr>
Date 2013-06-14 10:28 +0300
Message-ID <kpegmv$p37$4@news.ntua.gr>
In reply to #48066
On 14/6/2013 9:00 πμ, Zero Piraeus wrote:
> :
>
> On 14 June 2013 01:34, Nick the Gr33k <support@superhost.gr> wrote:
>> Why doesn't it work like this?
>>
>> leading 0 = 1 byte flag
>> leading 1 = 2 bytes flag
>> leading 00 = 3 bytes flag
>> leading 01 = 4 bytes flag
>> leading 10 = 5 bytes flag
>> leading 11 = 6 bytes flag
>>
>> Wouldn't it be more logical?
>
> Think about it. Let's say that, as per your scheme, a leading 0
> indicates "1 byte" (as is indeed the case in UTF8). What things could
> follow that leading 0? How does that impact your choice of a leading
> 00 or 01 for other numbers of bytes?
>
> ... okay, you're obviously going to need to be spoon-fed a little more
> than that. Here's a byte:
>
>    01010101
>
> Is that a single byte representing a code point in the 0-127 range, or
> the first of 4 bytes representing something else, in your proposed
> scheme? How can you tell?

Indeed.

You cannot tell whether it stands for a 1-byte or a 4-byte sequence:

0 + 1010101 = leading 0 stands for a 1-byte representation of a code point

01 + 010101 = leading 01 stands for a 4-byte representation of a code point

The problem with my scheme of how UTF-8 encoding works is that you 
cannot tell whether the flag is '0' or '01'.

The same happens with leading '1' and '11'. You cannot tell what the 
flag is, so you cannot know whether the Unicode code point is 
represented as a 2-byte sequence or a 6-byte sequence.

Understood.
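
The ambiguity can be checked mechanically. A minimal sketch, assuming
bytes are written as strings of '0' and '1'; the names below are made
up for illustration. It lists which flags a byte could start with under
each scheme.

```python
# Flags from the scheme proposed earlier in the thread.
PROPOSED_FLAGS = ["0", "1", "00", "01", "10", "11"]

# Prefixes UTF-8 uses: single byte, continuation, then 2-, 3-, 4-byte leads.
UTF8_PREFIXES = ["0", "10", "110", "1110", "11110"]


def matching_prefixes(byte_bits, prefixes):
    """Return every prefix that the 8-character bit string starts with."""
    return [p for p in prefixes if byte_bits.startswith(p)]


print(matching_prefixes("01010101", PROPOSED_FLAGS))  # ['0', '01']
print(matching_prefixes("01010101", UTF8_PREFIXES))   # ['0']
print(matching_prefixes("11010101", PROPOSED_FLAGS))  # ['1', '11']
print(matching_prefixes("11010101", UTF8_PREFIXES))   # ['110']
```

With the proposed flags two readings are always possible; with the
UTF-8 prefixes no byte matches more than one.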


> Now look at the way UTF8 does it:
> <http://en.wikipedia.org/wiki/Utf-8#Description>
>
> Really, follow the link and study the table carefully. Don't continue
> reading this until you believe you understand the choices that the
> designers of UTF8 made, and why they made them.
>
> Pay particular attention to the possible values for byte 1. Do you
> notice the difference between that scheme, and yours:
>
>    0xxxxxxx
>    1xxxxxxx
>    00xxxxxx
>    01xxxxxx
>    10xxxxxx
>    11xxxxxx
>
> If you don't see it, keep looking until you do ... this email gives
> you more than enough hints to work it out. Don't ask someone here to
> explain it to you. If you want to become competent, you must use your
> brain.

0xxxxxxx
110xxxxx	10xxxxxx
1110xxxx	10xxxxxx	10xxxxxx
11110xxx	10xxxxxx	10xxxxxx	10xxxxxx

I did read the link, but I still cannot see:

1. why '110' is the flag for a 2-byte code point
2. why the leading flag in the 2nd byte and in every subsequent byte 
has to be '10'
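
To look at the table with real characters, the bits Python produces can
be printed directly. A minimal sketch; utf8_bits is a made-up helper
name and the three sample characters are arbitrary.

```python
def utf8_bits(ch):
    """Return the UTF-8 encoding of one character as space-separated bits."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


print(utf8_bits("A"))  # 01000001
print(utf8_bits("α"))  # 11001110 10110001
print(utf8_bits("€"))  # 11100010 10000010 10101100
```

In this output a byte starting with 0 stands alone, a byte starting
with 11 opens a sequence, and a byte starting with 10 only ever appears
after one of those.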

-- 
What is now proved was at first only imagined!
