In-Reply-To: <KDGyp.180$0t1.7@newsfe04.iad>
References: <OkDyp.2983$M61.450@newsfe07.iad> <mailman.1433.1305151801.9059.python-list@python.org> <vpEyp.981$dL5.736@newsfe08.iad> <mailman.1435.1305157329.9059.python-list@python.org> <KDGyp.180$0t1.7@newsfe04.iad>
Date: Thu, 12 May 2011 13:54:20 +1000
Subject: Re: unicode by default
From: "John Machin" <sjmachin@lexicon.net>
To: python-list@python.org
User-Agent: SquirrelMail/1.4.21
MIME-Version: 1.0
Content-Type: text/plain;charset=iso-8859-1
Content-Transfer-Encoding: 8bit
Importance: Normal
Precedence: list
Reply-To: sjmachin@lexicon.net
Newsgroups: comp.lang.python
Message-ID: <mailman.1441.1305172465.9059.python-list@python.org>

On Thu, May 12, 2011 11:22 am, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume /
>> be able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just
>> use an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John.  In reverse order, I understand point (2). I'm less clear
> on point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout), how does
> the default encoding affect that file, by default?   I'm not seeing that
> there is anything unusual going on...

About """characters that I presume to be ascii/utf-8 (no \u0404 type
characters)""": All Unicode characters (including U+0404) are encodable in
bytes using UTF-8.
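
A minimal sketch of that point, assuming Python 3 (the variable names are
purely illustrative): the escape sequence names a code point, and the
bytes only appear once you encode it.

```python
# U+0404 written as an escape sequence; no bytes are involved yet.
ch = "\u0404"

# Encoding to UTF-8 yields a two-byte sequence for this code point.
encoded = ch.encode("utf-8")

# Decoding those bytes with the same encoding gives the character back.
round_tripped = encoded.decode("utf-8")

print(repr(encoded), round_tripped == ch)
```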

The result of sys.stdout.write(unicode_characters) to a TERMINAL depends
mostly on sys.stdout.encoding. This is likely to be UTF-8 on a Linux or
OS X platform. On a typical American / Western European / [former]
colonies Windows box, this is likely to be cp850 (or cp437 on a US
installation) in a Command Prompt window, and cp1252 in IDLE.
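
To see what your own setup reports, something along these lines should
do (a sketch only; describe_stdout is a made-up helper name, and the
value you get depends entirely on your platform and terminal):

```python
import sys

def describe_stdout():
    """Return the encoding Python reports for sys.stdout, or None."""
    # Some replacement stdout objects have no .encoding attribute at all.
    return getattr(sys.stdout, "encoding", None)

print("stdout encoding:", describe_stdout())
print("attached to a terminal:", sys.stdout.isatty())
```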

UTF-8: All Unicode characters are encodable in UTF-8. The only problem
arises if the terminal can't render the character -- you'll get spaces or
blobs or boxes with hex digits in them or nothing.

Windows (Command Prompt window): only a small subset of characters can be
encoded in e.g. cp850; anything else causes an exception.
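
The exception in question is UnicodeEncodeError. A small sketch, assuming
Python 3 (can_encode is a hypothetical helper, not anything in the
standard library):

```python
def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# U+00E9 (e with acute accent) is in cp850; U+0404 is not.
print(can_encode("\u00e9", "cp850"))
print(can_encode("\u0404", "cp850"))

# If losing the character is acceptable, ask for a substitute instead
# of an exception.
lossy = "\u0404".encode("cp850", errors="replace")
print(lossy)
```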

Windows (IDLE): ignores sys.stdout.encoding and renders the characters
itself. Same outcome as *x/UTF-8 above.

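If you write directly (or sys.stdout is redirected) to a FILE, the default
encoding is obtained by sys.getdefaultencoding() and is AFAIK ascii unless
the machine's site.py has been fiddled with to make it UTF-8 or something
else. (That describes Python 2; Python 3 instead uses the locale's
preferred encoding, as reported by locale.getpreferredencoding().)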
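
Rather than relying on whatever the default turns out to be, you can name
the encoding when you open the file. A sketch, assuming Python 3 and
using a throwaway temporary file (io.open also exists in Python 2.6+):

```python
import io
import os
import tempfile

text = "\u0404 and plain ascii"

# Create a scratch file so the example cleans up after itself.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Name the encoding explicitly instead of trusting the default.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(text)
    # Read the file back as raw bytes to see what actually hit the disk.
    with io.open(path, "rb") as f:
        raw = f.read()
finally:
    os.remove(path)

print(repr(raw))
```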

>   If I open the file with vi?  If
> I open the file with gedit?  emacs?

Any editor will have a default encoding; if that doesn't match the file
encoding, you have a (hopefully obvious) problem if the editor doesn't
detect the mismatch. Consult your editor's docs or HTFF1K.
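
What such a mismatch looks like can be simulated without an editor. A
sketch, assuming Python 3: bytes written as UTF-8 but read as cp1252.

```python
# "cafe" with an acute accent on the e, encoded the way the file was written.
raw = "caf\u00e9".encode("utf-8")

# The editor guesses cp1252: each byte of the two-byte sequence becomes
# its own character, so the accented letter turns into two wrong ones.
wrong = raw.decode("cp1252")

# The editor guesses correctly.
right = raw.decode("utf-8")

print(wrong)
print(right)
```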

> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe?   ... guessing, are these unicode code
> points,

yes

> and if so what is the best way to 'guess' the encoding?

google("chardet") or rummage through the mail headers (but 4 hex digits in
a box are a symptom of inability to render, not necessarily caused by an
incorrect decoding)

> ... is
> it coded in the stream somewhere...protocol?

Should be.
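
For mail, the place to look is the charset parameter of the Content-Type
header. A sketch using the standard email package, assuming Python 3; the
message text below is invented for illustration:

```python
from email import message_from_string

# A made-up message; only the headers matter here.
raw_message = (
    "From: someone@example.invalid\n"
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "body text\n"
)

msg = message_from_string(raw_message)

# Returns the declared charset in lower case, or None if none was declared.
charset = msg.get_content_charset()
print(charset)
```

If that comes back as None, the sender didn't declare anything and you
are back to guessing.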