From: Chris “Kwpolska” Warrick
Date: Sun, 9 Jun 2013 19:12:22 +0200
Subject: Re: A few questiosn about encoding
To: Νικόλαος Κούρας
Cc: python-list@python.org
Newsgroups: comp.lang.python

On Sun, Jun 9, 2013 at 12:44 PM, Νικόλαος Κούρας wrote:
> A few questions about encoding please:
>
>>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>>> values up to 256?
>
>>Because then how do you tell when you need one byte, and when you need
>>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>>characters, with ordinal values 0x4C and 0xFA, or one character with
>>ordinal value 0x4CFA?
>
> I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant
> up to 256, not above 256.

The multi-byte form is required so the computer can know where
characters begin.  0x0080 (the first non-ASCII character) becomes
0xC280 in UTF-8.  Further details here:
http://en.wikipedia.org/wiki/UTF-8#Description

>>> UTF-8 and UTF-16 and UTF-32
>>> I thought the number beside of UTF- was to declare how many bits the
>>> character set was using to store a character into the hdd, no?
>
>>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>>values to make a surrogate pair.
>
> A surrogate pair is like hitting for example Ctrl-A, which means is a
> combination character that consists of 2 different characters?
> Is this what a surrogate is? a pair of 2 chars?

http://en.wikipedia.org/wiki/UTF-16#Code_points_U.2B10000_to_U.2B10FFFF

Long story short: codepoint - 0x10000 (up to 20 bits) → two 10-bit
numbers → 0xD800 + first_half, 0xDC00 + second_half.  Rephrasing:

We take MATHEMATICAL BOLD CAPITAL B (U+1D401).  If your client
displays UTF-8: 𝐁

It is over 0xFFFF, so we need to use surrogate pairs.  Subtracting
0x10000 leaves 0xD401, or 0b1101010000000001.  Neither form is usable
yet, as we have a 16-bit number, not a 20-bit one.  We throw in some
leading zeroes and end up with 0b00001101010000000001.  Split it in
half and we get 0b0000110101 and 0b0000000001, which we can now
shorten to 0b110101 and 0b1, or translate to hex as 0x0035 and 0x0001.
0xD800 + 0x0035 and 0xDC00 + 0x0001 → 0xD835 0xDC01.  Type it into
Python and:

>>> b'\xD8\x35\xDC\x01'.decode('utf-16be')
'𝐁'

And before you ask: that “BE” stands for Big-Endian.  Little-Endian
would mean reversing the bytes in each 16-bit code unit, which would
make it b'\x35\xD8\x01\xDC' (for the first 256 characters the low,
non-zero byte comes first, so 'a' is 0x6100 in a little-endian
encoding).

Another question you may ask is whether surrogates can clash with real
characters: 0xD800…0xDFFF are reserved in Unicode for the purposes of
UTF-16, so there are no conflicts.

>>UTF-8 uses 8-bit values, but sometimes
>>it combines two, three or four of them to represent a single code-point.
>
> 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
> 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is
> > 127 )

Yup.  α is at 0x03B1, or 945 decimal.

> 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ?
> (since ordinal > 65000 )

Not necessarily, as CJK characters start at U+2E80, which is in the
3-byte range (0x0800 through 0xFFFF); the table is here:
http://en.wikipedia.org/wiki/UTF-8#Description

--
Kwpolska | GPG KEY: 5EAAEA16
stop html mail | always bottom-post
http://asciiribbon.org | http://caliburn.nl/topposting.html
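
The surrogate arithmetic and the UTF-8 byte counts discussed in this
message can be checked with a few lines of Python 3.  This is only a
rough sketch: `surrogate_pair` is a made-up helper name for
illustration, not a standard-library function, and the sample
characters are simply the ones mentioned in the thread.

```python
def surrogate_pair(codepoint):
    """Split a code point above U+FFFF into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = codepoint - 0x10000       # fits in 20 bits
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1D401)    # MATHEMATICAL BOLD CAPITAL B
print(hex(high), hex(low))             # 0xd835 0xdc01

# Cross-check against Python's own codecs, in both byte orders.
char = chr(0x1D401)
print(char.encode('utf-16be').hex())   # d835dc01
print(char.encode('utf-16le').hex())   # 35d801dc

# UTF-8 length of each character mentioned in the thread.
for ch in ('a', '\u0080', '\u03b1', '\u2e80', char):
    encoded = ch.encode('utf-8')
    print('U+%04X' % ord(ch), len(encoded), encoded.hex())
```

The last loop should report 1, 2, 2, 3 and 4 bytes respectively, which
matches the ranges in the Wikipedia table linked above.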