


Re: string to unicode

Date Mon, 15 Aug 2011 16:37:36 +0100
Subject Re: string to unicode
From Chris Angelico <rosuav@gmail.com>
To python-list@python.org
Newsgroups comp.lang.python



On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.ziff@gmail.com> wrote:
> If I am using the standard csv library to read the contents of a CSV file
> that contains Unicode strings (short example: '\xe8\x9f\x92\xe8\x9b\x87'),
> how do I use a Python method such as decode or encode to transform this
> string type into a Python unicode type? Must I know the encoding (byte
> groupings) of the Unicode? Can I get this from the file? Perhaps I need to
> open the file with particular attributes?
>

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

The CSV file, being stored on disk, cannot contain Unicode strings; it
can only contain bytes. If you know the encoding (e.g. UTF-8, UCS-2,
etc.), you can decode the bytes using it. If you don't, your best bet
is to ask whoever produced the file; failing that, check the first few
bytes: if they are "\xFF\xFE", "\xFE\xFF", or "\xEF\xBB\xBF", the file
is probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
encodings of the BOM, the byte order mark). There may be other clues,
too, but normally it's best to get the encoding separately from the
data rather than try to guess it from the data itself.
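
A minimal sketch of the BOM check described above, written for Python 3
(the original question concerns Python 2, where the same idea applies via
str.decode). The helper names sniff_bom and read_csv_bytes are
hypothetical, and the UTF-8 fallback is an assumption; a file without a
BOM could be in any encoding.

```python
import codecs
import csv
import io


def sniff_bom(raw):
    """Return a codec name if raw starts with a known BOM, else None."""
    if raw.startswith(codecs.BOM_UTF8):        # b"\xEF\xBB\xBF"
        return "utf-8-sig"                     # decodes and strips the BOM
    if raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"                        # picks byte order from the BOM
    return None


def read_csv_bytes(raw, fallback="utf-8"):
    """Decode raw CSV bytes and parse them into a list of rows.

    The fallback encoding is only a guess (an assumption of this sketch);
    prefer an encoding stated by whoever produced the file.
    """
    encoding = sniff_bom(raw) or fallback
    text = raw.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    # The bytes from the question, followed by a second column.
    sample = b"\xe8\x9f\x92\xe8\x9b\x87,1\r\n"
    print(read_csv_bytes(sample))
```

With the sample bytes from the question and the UTF-8 fallback, the first
cell decodes to the two characters U+87D2 U+86C7. When reading from disk,
opening the file with open(path, newline="", encoding=...) and passing it
to csv.reader achieves the same thing without loading the whole file.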

Chris Angelico


