Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder1.news.weretis.net!feeder.erje.net!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
In-Reply-To: <9th0u8Fuf2U1@mid.dfncis.de>
References: <9tg21lFmo3U1@mid.dfncis.de> <mailman.1065.1332925364.3037.python-list@python.org> <9tg4qoFbfpU1@mid.dfncis.de> <mailman.1069.1332931371.3037.python-list@python.org> <9th0u8Fuf2U1@mid.dfncis.de>
From: Ian Kelly <ian.g.kelly@gmail.com>
Date: Wed, 28 Mar 2012 12:20:30 -0600
Subject: Re: "convert" string to bytes without changing data (encoding)
To: Peter Daum <gator@cs.tu-berlin.de>
Content-Type: text/plain; charset=ISO-8859-1
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1086.1332958864.3037.python-list@python.org>
Lines: 33
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:22295

On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum <gator@cs.tu-berlin.de> wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simple
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII.  And those are all
implementation details that could easily change in future versions of
Python.

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes. In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can't generally just "deal with the ascii portions" without
knowing something about the encoding.  Say you encounter a byte
greater than 127.  Is it a single non-ASCII character, or is it the
leading byte of a multi-byte character?  If the next character is less
than 127, is it an ASCII character, or a continuation of the previous
character?  For UTF-8 you could safely assume ASCII, but without
knowing the encoding, there is no way to be sure.  If you just assume
it's ASCII and manipulate it as such, you could be messing up
non-ASCII characters.

Cheers,
Ian