From: Cameron Simpson
Date: Sun, 9 Jun 2013 19:12:36 +1000
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
Newsgroups: comp.lang.python

On 09Jun2013 02:00, Νίκος Γκρ33κ wrote:
| Steven wrote:
| >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
| >> values up to 256?
|
| >Because then how do you tell when you need one byte, and when you need
| >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
| >characters, with ordinal values 0x4C and 0xFA, or one character with
| >ordinal value 0x4CFA?
|
| I mean utf-8 could use 1 byte for storing the 1st 256 characters.
| I meant up to 256, not above 256.

Then it would not be UTF-8. UTF-8 will encode any Unicode code point.
Your suggestion will not.
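
To make that concrete, here is a minimal sketch. It assumes Latin-1 as a stand-in for the "one byte for the first 256 characters" idea, since Latin-1 maps code points 0 to 255 straight onto single bytes. The helper name `try_encode` is made up for this illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

# U+00E9 (e with acute accent) is below 256, so Latin-1 stores it in one byte.
latin1_ok = try_encode("\u00e9", "latin-1")
# U+03B1 (Greek small alpha) is above 255, so Latin-1 cannot represent it.
latin1_fail = try_encode("\u03b1", "latin-1")
# UTF-8 handles both, at the price of two bytes for each of them.
utf8_e = try_encode("\u00e9", "utf-8")
utf8_alpha = try_encode("\u03b1", "utf-8")

print(latin1_ok)    # b'\xe9'
print(latin1_fail)  # None
print(utf8_e)       # b'\xc3\xa9'
print(utf8_alpha)   # b'\xce\xb1'
```

The one-byte scheme is compact for the first 256 code points but has no way to spell anything beyond them, which is the trade UTF-8 declines to make.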
I'd point out that if you did this, you'd be back in the same situation you
just encountered with ASCII: the first above-255 value would raise a
UnicodeEncodeError (an error which does not even exist at present:-)

| >> UTF-8 and UTF-16 and UTF-32
| >> I thought the number beside of UTF- was to declare how many bits the
| >> character set was using to store a character into the hdd, no?
|
| >Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
| >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
| >values to make a surrogate pair.
|
| A surrogate pair is like hitting for example Ctrl-A, which means is a
| combination character that consists of 2 different characters?
| Is this what a surrogate is? a pair of 2 chars?

Essentially. The combination represents a code point.

| >UTF-8 uses 8-bit values, but sometimes
| >it combines two, three or four of them to represent a single code-point.
|
| 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
| 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
| 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
|
| The amount of bytes needed to store a character solely depends on the
| character's ordinal value in the Unicode table?

Essentially. You can read up on the exact process in Wikipedia or the
Unicode Standard.

Cheers,
-- 
Cameron Simpson

The most annoying thing about being without my files after our disc crash was
discovering once again how widespread BLINK was on the web.
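
For the byte counts asked about above, a short sketch that can be run to check them. The sample characters and the helper name `code_unit_counts` are chosen only for illustration. Two details worth noting: `ord('a')` is 97 (65 is `'A'`), and most Chinese ideographs sit below U+FFFF, so they take 3 bytes in UTF-8. Four bytes are only needed for code points above U+FFFF, which are also the ones that need a surrogate pair in UTF-16.

```python
samples = {
    "a": "a",                # U+0061
    "alpha": "\u03b1",       # U+03B1, Greek small alpha
    "zhong": "\u4e2d",       # U+4E2D, a common Chinese ideograph
    "g_clef": "\U0001d11e",  # U+1D11E, musical G clef, above U+FFFF
}

def code_unit_counts(ch):
    """Return (code point, UTF-8 bytes, UTF-16 16-bit units, UTF-32 bytes)."""
    return (
        ord(ch),
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")) // 2,  # the -le codec writes no BOM
        len(ch.encode("utf-32-le")),
    )

counts = {name: code_unit_counts(ch) for name, ch in samples.items()}
for name, (cp, u8, u16, u32) in counts.items():
    print(f"{name}: U+{cp:04X} utf-8={u8} utf-16 units={u16} utf-32={u32}")

# Split the UTF-16 form of U+1D11E into its two 16-bit units.
raw = samples["g_clef"].encode("utf-16-be")
surrogate_pair = (int.from_bytes(raw[:2], "big"), int.from_bytes(raw[2:], "big"))
print([hex(unit) for unit in surrogate_pair])  # ['0xd834', '0xdd1e']
```

The UTF-8 length follows the code point ranges: 1 byte up to U+007F, 2 up to U+07FF, 3 up to U+FFFF, and 4 beyond that.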