From: Benjamin Kaplan
To: "python-list@python.org"
Newsgroups: comp.lang.python
Date: Sun, 9 Jun 2013 13:01:15 -0700
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)

On Sun, Jun 9, 2013 at 2:20
AM, Νικόλαος Κούρας wrote:
> On Sunday, 9 June 2013 12:12:36 PM UTC+3, user Cameron Simpson wrote:
>> On 09Jun2013 02:00, Νίκος Γκρ33κ wrote:
>> | Steven wrote:
>> | >> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> | >> values up to 256?
>> | >>
>> | >Because then how do you tell when you need one byte, and when you need
>> | >two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>> | >characters, with ordinal values 0x4C and 0xFA, or one character with
>> | >ordinal value 0x4CFA?
>> |
>> | I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 256, not above 256.
>>
>> Then it would not be UTF-8. UTF-8 will encode any Unicode codepoint. Your
>> suggestion will not.
>
> I don't follow.
>

The point of the UTF formats is that they can encode any of the 1.1
million code points available in Unicode. Your suggestion can only
encode 256 code points. We have that encoding already: it's called
Latin-1, and it can't encode any of your Greek characters (which is why
ISO-8859-7 exists; it can encode the Greek characters but not the
accented Latin ones).

If you were to use the whole byte to store the first 256 characters,
you wouldn't be able to store anything beyond them, because the decoder
couldn't tell the difference between character 257 (0x01 0x01) and two
chr(1)s. UTF-8 gets around this by reserving the top bit as an "am I
part of a multibyte sequence" flag. That leaves only the first 128
characters (0 to 127) as single bytes; everything above that takes two
or more bytes, each with the top bit set.

>> | >> UTF-8 and UTF-16 and UTF-32
>> | >> I thought the number beside of UTF- was to declare how many bits the
>> | >> character set was using to store a character into the hdd, no?
>> | >>
>> | >Not exactly, but close.
>> | >UTF-32 is completely 32-bit (4 byte) values.
>> | >UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>> | >values to make a surrogate pair.
>> |
>> | A surrogate pair is like hitting for example Ctrl-A, which means is a combination character that consists of 2 different characters?
>> | Is this what a surrogate is? a pair of 2 chars?
>>
>> Essentially. The combination represents a code point.
>>
>> | >UTF-8 uses 8-bit values, but sometimes
>> | >it combines two, three or four of them to represent a single code-point.
>> |
>> | 'a' to be utf8 encoded needs 1 byte to be stored ? (since ordinal = 65)
>> | 'α΄' to be utf8 encoded needs 2 bytes to be stored ? (since ordinal is > 127 )
>> | 'a chinese ideogramm' to be utf8 encoded needs 4 byte to be stored ? (since ordinal > 65000 )
>> |
>> | The amount of bytes needed to store a character solely depends on the character's ordinal value in the Unicode table?
>>
>> Essentially. You can read up on the exact process in Wikipedia or the Unicode Standard.
>
> When you say essentially means you agree with my statements?
> --

In UTF-8 or UTF-16, the number of bytes required for a character
depends on its code point, yes. That isn't the case for UTF-32, where
every character uses exactly four bytes.
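
If it helps, here is a rough sketch you can paste into a Python 3
interpreter to see the byte counts for yourself. The helper name
encoded_lengths and the four sample characters are illustrative choices
made for this example, not anything from the thread. It uses the "-le"
codec variants only so that no byte-order mark gets counted.

```python
def encoded_lengths(ch):
    """Return how many bytes the single character ch needs in each UTF form."""
    # Hypothetical helper for illustration; "-le" avoids counting a BOM.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


# Sample characters (assumed for this sketch): ASCII, Greek, a CJK ideograph
# inside the Basic Multilingual Plane, and one code point above U+FFFF.
for ch in ("a", "α", "中", "\U0001D11E"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))

# Two 0x01 bytes have the top bit clear, so UTF-8 reads them as two
# separate one-byte characters. There is no ambiguity with code point 257.
print(repr(b"\x01\x01".decode("utf-8")))

# Code point 257 is encoded as two bytes, both with the top bit set.
print(chr(257).encode("utf-8"))

# Latin-1 cannot represent a Greek letter at all; ISO-8859-7 can, in one byte.
try:
    "α".encode("latin-1")
except UnicodeEncodeError as exc:
    print("latin-1 failed:", exc.reason)
print(len("α".encode("iso-8859-7")))
```

Note that the ideograph in this sample takes three bytes in UTF-8, not
four: four-byte sequences are only needed for code points above U+FFFF,
which are also the ones that need a surrogate pair in UTF-16.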