Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed3.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Mon, 10 Jun 2013 12:42:25 +0200
From: Andreas Perstinger <andipersti@gmail.com>
User-Agent: Mozilla/5.0 (X11; Linux i686; rv:17.0) Gecko/20130510 Thunderbird/17.0.6
MIME-Version: 1.0
To: python-list@python.org
Subject: Re: Changing filenames from Greeklish => Greek (subprocess complain)
References: <7d8da6c9-fb92-4329-b207-4280f29ba664@googlegroups.com> <20130608024931.GA77888@cskk.homeip.net> <51B37173.9060601@gmail.com> <mailman.2894.1370719010.3114.python-list@python.org> <3fbb5d0e-51fb-4aed-b829-8388304a9885@googlegroups.com> <51b4249d$0$30001$c3e8da3$5496439d@news.astraweb.com> <5b0d3d7c-e3a4-436d-a55f-26bd40064fd5@googlegroups.com> <51b475b0$0$30001$c3e8da3$5496439d@news.astraweb.com> <a64ba08f-2616-4715-818c-073f3d1e2ffb@googlegroups.com> <mailman.2959.1370852149.3114.python-list@python.org> <c6c9a67a-8eab-41e3-b8bb-d013fd7805b5@googlegroups.com> <349f7474-fce3-4891-8eb2-92fc53606fb2@googlegroups.com>
In-Reply-To: <349f7474-fce3-4891-8eb2-92fc53606fb2@googlegroups.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.2962.1370860953.3114.python-list@python.org>
Lines: 57
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:47545

On 10.06.2013 11:59, Νικόλαος Κούρας wrote:
>> >>>> s = 'α'
>> >>>> s.encode('utf-8')
>> > b'\xce\xb1'
>
> 'b' stands for binary right?

No, here it stands for bytes:
http://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals

>   b'\xce\xb1' = we are looking at a byte in a hexadecimal format?

No, b'\xce\xb1' represents a bytes object containing 2 bytes.
Yes, each byte is shown as a hexadecimal escape (\xce and \xb1).

> if yes how could we see it in binary and decimal representation?

 >>> s = b'\xce\xb1'
 >>> s[0]
206
 >>> bin(s[0])
'0b11001110'
 >>> s[1]
177
 >>> bin(s[1])
'0b10110001'

A bytes object is a sequence of bytes (= integer values from 0 to 255)
and supports indexing.
http://docs.python.org/3/library/stdtypes.html#bytes
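
As a small sketch of the same idea (the helper name show_bytes is made
up for illustration), you can loop over a bytes object and format each
integer as decimal, hexadecimal and binary:

```python
def show_bytes(data):
    """Return one (decimal, hex, binary) tuple per byte in data."""
    # Iterating over a bytes object yields integers in range(256).
    return [(b, format(b, '02x'), format(b, '08b')) for b in data]

rows = show_bytes('α'.encode('utf-8'))
for dec, hx, bn in rows:
    print(dec, hx, bn)
```

This should print the same two values as the session above: 206 and 177,
each followed by its hex and binary form.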

> Since 2^8 = 256, utf-8 should store the first 256 chars of unicode
> charset using 1 byte.
>
> Also Since 2^16 = 65535, utf-8 should store the first 65535 chars of
> unicode charset using 2 bytes and so on.
>
> But i know that this is not the case. But i dont understand why.

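Because your method doesn't work.
If all 256 possible bit combinations of a byte stood for a valid
character on their own, how would you decide where one character stops
and the next one starts in a sequence of bytes? UTF-8 has to spend some
bits of every byte on marking that, which leaves fewer bits for the
character itself. (By the way, 2^16 is 65536, not 65535.)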
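
To make that concrete, here is a rough sketch (utf8_length_from_lead is
my own name, not a library function) of how a decoder can tell from the
first byte alone how long a character is. In UTF-8 a byte starting with
bit 0 is a complete ASCII character, 110/1110/11110 start a 2/3/4-byte
sequence, and 10 marks a continuation byte:

```python
def utf8_length_from_lead(byte):
    """Return the sequence length announced by a UTF-8 lead byte.

    Returns 0 for a continuation byte (10xxxxxx), which can never
    start a character. Illustration only: it does not reject the
    invalid lead bytes 0xC0, 0xC1 and 0xF5-0xFF.
    """
    if byte < 0x80:        # 0xxxxxxx: single byte (ASCII)
        return 1
    if byte < 0xC0:        # 10xxxxxx: continuation byte
        return 0
    if byte < 0xE0:        # 110xxxxx: 2-byte sequence
        return 2
    if byte < 0xF0:        # 1110xxxx: 3-byte sequence
        return 3
    return 4               # 11110xxx: 4-byte sequence

for ch in ('a', 'α', '€', '\U0001F600'):
    encoded = ch.encode('utf-8')
    # Print the code point, the real length and the length read
    # from the lead byte; the last two should agree.
    print('U+%04X' % ord(ch), len(encoded), utf8_length_from_lead(encoded[0]))
```

Because of those marker bits, only the 128 ASCII values fit in one byte,
and 'α' (U+03B1) already needs two.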

>> >>>> s = 'a'
>> >>>> s.encode('utf-8')
>> > b'a'
>> utf-8 takes ASCII as it is, as 1 byte. They are the same
>
> EBCDIC and ASCII and Unicode are character sets, correct?
>
> iso-8859-1, iso-8859-7, utf-8, utf-16, utf-32 and so on are encoding methods, right?
>

Look at http://www.unicode.org/glossary/ for an explanation of all the 
terms.
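
A quick, non-authoritative way to see the difference in practice: the
character below is always the same Unicode code point (U+03B1), but each
encoding turns it into different bytes, and some encodings can't
represent it at all. The codec names are ones that ship with Python's
standard library; the variable names are just for this sketch.

```python
ch = 'α'  # GREEK SMALL LETTER ALPHA, code point U+03B1

results = {}
for codec in ('iso-8859-1', 'iso-8859-7', 'utf-8', 'utf-16-le', 'utf-32-le'):
    try:
        results[codec] = ch.encode(codec)
    except UnicodeEncodeError:
        # This encoding has no byte sequence for the character.
        results[codec] = None

for codec, encoded in results.items():
    print(codec, encoded)
```

iso-8859-1 (Latin-1) has no Greek letters, so it ends up as None here,
while iso-8859-7 needs a single byte and the UTF encodings need 2, 2 and
4 bytes respectively.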

Bye, Andreas