From: Ian Kelly
Date: Wed, 28 Mar 2012 12:20:30 -0600
Subject: Re: "convert" string to bytes without changing data (encoding)
To: Peter Daum
Cc: python-list@python.org
Newsgroups: comp.lang.python

On Wed, Mar 28, 2012 at 11:43 AM, Peter Daum wrote:
> ... I was under the illusion, that python (like e.g. perl) stored
> strings internally in utf-8. In this case the "conversion" would simple
> mean to re-label the data. Unfortunately, as I meanwhile found out, this
> is not the case (nor the "apple encoding" ;-), so it would indeed be
> pretty useless.

No, unicode strings can be stored internally as any of UCS-1, UCS-2,
UCS-4, C wchar strings, or even plain ASCII. And those are all
implementation details that could easily change in future versions of
Python.

> The longer story of my question is: I am new to python (obviously), and
> since I am not familiar with either one, I thought it would be advisory
> to go for python 3.x. The biggest problem that I am facing is, that I
> am often dealing with data, that is basically text, but it can contain
> 8-bit bytes.
> In this case, I can not safely assume any given encoding,
> but I actually also don't need to know - for my purposes, it would be
> perfectly good enough to deal with the ascii portions and keep anything
> else unchanged.

You can't generally just "deal with the ascii portions" without knowing
something about the encoding. Say you encounter a byte greater than
127. Is it a single non-ASCII character, or is it the leading byte of a
multi-byte character? If the next byte is less than 128, is it an
ASCII character, or a continuation of the previous character? For UTF-8
you could safely assume it is ASCII, because every byte of a multi-byte
UTF-8 sequence is 128 or above, but without knowing the encoding, there
is no way to be sure. If you just assume it's ASCII and manipulate it
as such, you could be messing up non-ASCII characters.

Cheers,
Ian
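
P.S. A minimal sketch of the ambiguity described above, for Python 3. The
sample bytes and the choice of Latin-1 and Shift-JIS as comparison
encodings are assumptions picked purely for illustration; they say
nothing about what your actual data contains.

```python
# Hypothetical sample data, chosen only for illustration.
raw = b"caf\xc3\xa9"

# The same bytes give different text depending on the assumed encoding.
as_utf8 = raw.decode("utf-8")      # 0xC3 0xA9 form one two-byte character
as_latin1 = raw.decode("latin-1")  # 0xC3 and 0xA9 are two separate characters
print(len(as_utf8), len(as_latin1))

# In Shift-JIS, a byte below 128 can be the second half of a character:
# U+8868 encodes as 0x95 0x5C, and 0x5C is the ASCII backslash.
sjis = "\u8868".encode("shift_jis")
print(sjis)

# "Dealing with the ASCII portion" on the raw bytes changes that character.
damaged = sjis.replace(b"\\", b"/")
print(damaged)

# UTF-8 never uses bytes below 128 inside a multi-byte character.
utf8 = "\u8868\u00e9".encode("utf-8")
print(all(b >= 0x80 for b in utf8))
```

So treating every byte below 128 as ASCII happens to be safe for UTF-8,
but not for an arbitrary unknown encoding.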