Path: csiph.com!usenet.pasdenom.info!aioe.org!news.stack.nl!newsfeed.xs4all.nl!newsfeed6.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
Date: Wed, 28 Mar 2012 11:17:56 -0700
From: Ethan Furman <ethan@stoneleaf.us>
User-Agent: Thunderbird 1.5.0.10 (Windows/20070221)
MIME-Version: 1.0
To: Peter Daum <gator@cs.tu-berlin.de>
Subject: Re: "convert" string to bytes without changing data (encoding)
References: <9tg21lFmo3U1@mid.dfncis.de>	<mailman.1065.1332925364.3037.python-list@python.org>	<9tg4qoFbfpU1@mid.dfncis.de>	<mailman.1069.1332931371.3037.python-list@python.org> <9th0u8Fuf2U1@mid.dfncis.de>
In-Reply-To: <9th0u8Fuf2U1@mid.dfncis.de>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.1090.1332959710.3037.python-list@python.org>
Lines: 34
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:22301

Peter Daum wrote:
> On 2012-03-28 12:42, Heiko Wundram wrote:
>> Am 28.03.2012 11:43, schrieb Peter Daum:
>>> ... in my example, the variable s points to a "string", i.e. a series of
>>> bytes, (0x61,0x62 ...) interpreted as ascii/unicode characters.
>> No; a string contains a series of codepoints from the unicode plane,
>> representing natural language characters (at least in the simplistic
>> view, I'm not talking about surrogates). These can be encoded to
>> different binary storage representations, of which ascii is (a common) one.
>>
>>> What I am looking for is a general way to just copy the raw data
>>> from a "string" object to a "byte" object without any attempt to
>>> "decode" or "encode" anything ...
>> There is "logically" no raw data in the string, just a series of
>> codepoints, as stated above. You'll have to specify the encoding to use
>> to get at "raw" data, and from what I gather you're interested in the
>> latin-1 (or iso-8859-15) encoding, as you're specifically referencing
>> chars >= 0x80 (which hints at your mindset being in LATIN-land, so to
>> speak).
> 
> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisable
> to go for python 3.x. The biggest problem that I am facing is that I
> am often dealing with data that is basically text, but it can contain
> 8-bit bytes. In this case, I cannot safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

Where is the data coming from?  Files?  In that case, it sounds like you 
will want to decode/encode using 'latin-1': the bulk of your text is 
plain ASCII, and latin-1 maps each byte 0x80-0xFF to the code point of 
the same value, so the non-ASCII bytes survive a round trip unchanged.
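
A minimal sketch of that round trip, assuming the data arrives as bytes 
(the sample values and variable names below are made up for 
illustration).  latin-1 assigns every byte value 0x00-0xFF to the code 
point with the same number, so decoding and re-encoding should hand back 
exactly the bytes you started with:

```python
# Every possible byte value survives decode -> encode with latin-1.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode("latin-1").encode("latin-1")
assert round_tripped == all_bytes

# Hypothetical sample: ASCII text mixed with bytes of unknown encoding.
raw = b"abc \xe4\xf6\xfc \x80\xff"

# Decode to str so the normal string methods are available.
text = raw.decode("latin-1")

# Work on the ASCII portion only; the other characters are left alone.
text = text.replace("abc", "xyz")

# Encode back to bytes; the non-ASCII bytes come out unchanged.
result = text.encode("latin-1")
print(result)
```

The non-ASCII characters in `text` are only placeholders for the original 
bytes; they will display correctly only if the data really was latin-1.  
For files, passing encoding='latin-1' to open() for both reading and 
writing should give the same effect.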

~Ethan~