| Header | Value |
|---|---|
| From | MRAB <python@mrabarnett.plus.com> |
| Date | Thu, 12 May 2011 03:31:18 +0100 |
| Subject | Re: unicode by default |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.1439.1305167541.9059.python-list@python.org> |
| Content-Type | text/plain; charset=ISO-8859-1; format=flowed |
On 12/05/2011 02:22, harrismh777 wrote:
> John Machin wrote:
>> (1) You cannot work without using bytes sequences. Files are byte
>> sequences. Web communication is in bytes. You need to (know / assume / be
>> able to extract / guess) the input encoding. You need to encode your
>> output using an encoding that is expected by the consumer (or use an
>> output method that will do it for you).
>>
>> (2) You don't need to use bytes to specify a Unicode code point. Just use
>> an escape sequence e.g. "\u0404" is a Cyrillic character.
>>
>
> Thanks John. In reverse order, I understand point (2). I'm less clear on
> point (1).
>
> If I generate a string of characters that I presume to be ascii/utf-8
> (no \u0404 type characters) and write them to a file (stdout), how does
> the default encoding affect that file by default? I'm not seeing that
> there is anything unusual going on... If I open the file with vi? If I
> open the file with gedit? emacs?
>
> ....
>
> Another question... in mail I'm receiving many small blocks that look
> like sprites with four small hex codes, scattered about the mail...
> mostly punctuation, maybe? ... guessing, are these unicode code points,
> and if so what is the best way to 'guess' the encoding? ... is it coded
> in the stream somewhere...protocol?
>
You need to understand the difference between characters and bytes.
A string contains characters; a file contains bytes. The encoding
specifies how each character is represented as bytes.

For example:

In the Latin-1 encoding, the character "£" is represented by the
byte 0xA3.

In the UTF-8 encoding, the character "£" is represented by the byte
sequence 0xC2 0xA3.

In the ASCII encoding, the character "£" can't be represented at all.
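
The three cases above can be checked directly in Python 3. This is only a small sketch; the variable names are made up for illustration, and the codec names are the standard ones that ship with Python.

```python
# The character "£" is the Unicode code point U+00A3.
pound = "\u00a3"

# Encoding turns a string (characters) into bytes.
latin1_bytes = pound.encode("latin-1")
utf8_bytes = pound.encode("utf-8")

print(latin1_bytes)  # b'\xa3'
print(utf8_bytes)    # b'\xc2\xa3'

# ASCII has no representation for this character, so encoding fails.
try:
    pound.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ASCII can encode it:", ascii_ok)

# Decoding goes the other way, and only gives back the original
# character if you decode with the encoding that was used to encode.
assert utf8_bytes.decode("utf-8") == pound
assert latin1_bytes.decode("latin-1") == pound
```

Decoding the UTF-8 bytes as Latin-1 would not raise an error, but it would produce two wrong characters instead of one "£", which is one common way text ends up garbled.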

The advantage of UTF-8 is that it can represent _all_ Unicode
characters (code points, actually) as byte sequences, and every
character in the ASCII range is represented by the same single byte
that the original ASCII system used. Use the UTF-8 encoding unless you
have to use a different one.
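
To illustrate the point about the ASCII range, here is a short sketch (the sample characters are arbitrary picks, not anything from the thread apart from "£" and "\u0404"). It shows that pure-ASCII text encodes to identical bytes under ASCII and UTF-8, while other code points take more than one byte.

```python
# Text that uses only ASCII characters gives the same bytes
# whether it is encoded as ASCII or as UTF-8.
ascii_text = "plain ASCII text"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")
print("ASCII and UTF-8 bytes identical:", same_bytes)

# Code points outside the ASCII range need two to four bytes in UTF-8.
samples = ["A", "\u00a3", "\u0404", "\u20ac", "\U0001F600"]
lengths = []
for ch in samples:
    encoded = ch.encode("utf-8")
    lengths.append(len(encoded))
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```

This is why a file of plain ASCII characters looks the same in vi, gedit or emacs regardless of whether it is treated as ASCII or UTF-8: the bytes are the same in both.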

A file contains only bytes, and a socket handles only bytes. Which
encoding you should use for characters is down to the protocol. A
system such as email, which can handle different encodings, should have
a way of specifying the encoding, and perhaps also a default encoding.
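
As a rough sketch of both halves of that paragraph: the first part writes a string to a file with an explicit encoding and reads the raw bytes back; the second part parses a tiny, made-up email message and uses the charset named in its Content-Type header to decode the body. The file name, the message text and the fallback to "ascii" are assumptions for the example only.

```python
import os
import tempfile
from email import message_from_bytes

# Part 1: a file holds bytes, so name the encoding when writing text.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "pound.txt")  # hypothetical file name
    with open(path, "w", encoding="utf-8") as f:
        f.write("\u00a3")
    with open(path, "rb") as f:  # binary mode shows what is really stored
        raw_file_bytes = f.read()
print(raw_file_bytes)  # b'\xc2\xa3'

# Part 2: email states the encoding in a header (made-up message).
raw_message = (
    b"Content-Type: text/plain; charset=ISO-8859-1\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Price: \xa35\r\n"
)
msg = message_from_bytes(raw_message)

# Use the declared charset; falling back to ASCII is an assumption here.
charset = msg.get_content_charset() or "ascii"
payload_bytes = msg.get_payload(decode=True)  # the body as raw bytes
body = payload_bytes.decode(charset)
print(charset, repr(body))
```

If the declared charset is missing or wrong, the receiving program has to guess, and that is usually where odd characters in mail come from.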