From: Andreas Perstinger
Date: Mon, 10 Jun 2013 10:15:38 +0200
To: python-list@python.org
Newsgroups: comp.lang.python
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
On 10.06.2013 09:10, nagia.retsina@gmail.com wrote:
> On Sunday, 9 June 2013 3:31:44 PM UTC+3, Steven D'Aprano wrote:
>
>> py> c = 'α'
>> py> ord(c)
>> 945
>
> The number 945 is the ordinal value of the character 'α' in the
> unicode charset, correct?

Yes. The Unicode character set is essentially a big numbered list of
characters, and the number assigned to a character is called its code
point. 'α' has code point 945 (U+03B1).

> The command in the python interactive session to show me how many bytes
> this character will take upon encoding to utf-8 is:
>
> >>> s = 'α'
> >>> s.encode('utf-8')
> b'\xce\xb1'
>
> I see that the encoding of this char takes 2 bytes. But why two exactly?

That's how the encoding is designed: UTF-8 uses 1 byte for code points
up to 127, 2 bytes up to 2047, 3 bytes up to 65535 and 4 bytes beyond
that, and 945 falls into the second range. Haven't you read the
Wikipedia article which was already mentioned several times?

> How do i calculate how many bits are needed to store this char into bytes?

You need to understand how UTF-8 works. Read the Wikipedia article.

> Trying to do the same here but it gave me no bytes back.
>
> >>> s = 'a'
> >>> s.encode('utf-8')
> b'a'

The encode method returns a bytes object. Its length will tell you how
many bytes there are:

>>> len(b'a')
1
>>> len(b'\xce\xb1')
2

When it displays a bytes object, the Python interpreter shows every byte
whose value corresponds to a printable ASCII character as that
character. All other values are shown as escape sequences such as
'\xce':

>>> ord(b'a')
97
>>> hex(97)
'0x61'
>>> b'\x61' == b'a'
True

The Python designers have decided to display b'a' instead of b'\x61'.

>> py> c.encode('utf-8')
>> b'\xce\xb1'
>
> 2 bytes here. why 2?

Same as your first question.

>> py> c.encode('utf-16be')
>> b'\x03\xb1'
>
> 2 bytes here also. but why 3 different bytes? the ordinal value of
> char 'a' is the same in unicode. the encoding system just takes the
> ordinal value and encodes it, but since it uses 2 bytes should these
> 2 bytes be the same?

'utf-16be' is a different encoding scheme, thus it uses other rules to
determine how each character is translated into a byte sequence. For
'α' it simply stores the code point 945 (0x03B1) as a 16-bit big-endian
number, whereas UTF-8 spreads the same bits over two bytes together with
some marker bits.
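To make the "other rules" a bit more concrete, here is a minimal sketch of
the UTF-8 bit layout next to the UTF-16BE result. The function name
utf8_encode_code_point is made up for this illustration and is not part of
Python; it also skips the validity checks a real encoder performs (for
example, it does not reject surrogate code points). In practice you would
just call str.encode().

```python
def utf8_encode_code_point(cp):
    """Hand-rolled UTF-8 encoder for one code point (illustration only)."""
    if cp < 0x80:        # up to 7 bits:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


cp = ord('α')                       # 945 == 0x3B1, which needs 10 bits
manual_utf8 = utf8_encode_code_point(cp)
manual_utf16be = cp.to_bytes(2, 'big')   # valid for code points below 0x10000 only

print(manual_utf8)      # b'\xce\xb1'
print(manual_utf16be)   # b'\x03\xb1'
print(manual_utf8 == 'α'.encode('utf-8'))
print(manual_utf16be == 'α'.encode('utf-16be'))
```

The 10 bits of 945 do not fit into the 7 payload bits of a single UTF-8
byte, so the two-byte form is used; that is where the "2" comes from.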
>> py> c.encode('iso-8859-7')
>> b'\xe1'
>
> And also does '\x' means that the value is being represented in hex way?
> and when i bin(6) i see '0b1000001'
>
> I should expect to see 8 bits of 1s and 0's. what the 'b' is trying to say?

'\x' is an escape sequence and means that the following two characters
should be interpreted as a number in hexadecimal notation (see also the
table of allowed escape sequences:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals
).

'0b' tells you that the number is printed in binary notation.
Leading zeros are usually discarded when a number is printed:

>>> bin(70)
'0b1000110'
>>> 0b100110 == 0b00100110
True
>>> 0b100110 == 0b0000000000100110
True

It's the same with decimal notation. You wouldn't say 00123 is different
from 123, would you?

Bye, Andreas
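P.S. If you do want to see all 8 bits of a byte, one possible way (a small
sketch, not the only option) is to ask format() for zero padding. The
variable names below are just for illustration.

```python
n = 70

plain = bin(n)                  # leading zeros are dropped
padded = format(n, '08b')       # pad with zeros to 8 digits, no '0b' prefix
prefixed = format(n, '#010b')   # width 10 includes the '0b' prefix

print(plain)      # 0b1000110
print(padded)     # 01000110
print(prefixed)   # 0b01000110

# The same formatting applied to each byte of an encoded character:
alpha_bits = [format(byte, '08b') for byte in 'α'.encode('utf-8')]
print(alpha_bits)   # ['11001110', '10110001']
```

The padded and unpadded strings still denote the same number:
int('01000110', 2) and int('1000110', 2) both give 70.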