
Re: string to unicode

Started by	Chris Angelico <rosuav@gmail.com>
First post	2011-08-15 16:37 +0100
Last post	2011-08-15 21:37 +0200
Articles	2 — 2 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: string to unicode Chris Angelico <rosuav@gmail.com> - 2011-08-15 16:37 +0100
    Re: string to unicode Thomas 'PointedEars' Lahn <PointedEars@web.de> - 2011-08-15 21:37 +0200

#11464 — Re: string to unicode

From	Chris Angelico <rosuav@gmail.com>
Date	2011-08-15 16:37 +0100
Subject	Re: string to unicode
Message-ID	<mailman.13.1313422660.27778.python-list@python.org>

On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.ziff@gmail.com> wrote:
> if I am using the standard csv library to read contents of a csv file which
> contains Unicode strings (short example: '\xe8\x9f\x92\xe8\x9b\x87'), how do
> I use a python Unicode method such as decode or encode to transform this
> string type into a python unicode type? Must I know the encoding (byte
> groupings) of the Unicode? Can I get this from the file? Perhaps I need to
> open the file with particular attributes?
>

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

The CSV file, being stored on disk, cannot contain Unicode strings; it
can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
etc), then you can decode it using that. If you don't, your best bet
is to ask the origin of the file; failing that, check the first few
bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
encodings of the BOM). There may be other clues, too, but normally
it's best to get the encoding separately from the data rather than try
to decode it from the data itself.
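
A minimal sketch of the BOM check described above, in case it helps. The names sniff_bom and decode_bytes and the UTF-8 fallback are illustrative assumptions, not anything from the standard library; only the codecs.BOM_* constants are. The UTF-32 BOMs are tested first because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE one.

```python
import codecs

# Longest BOMs first: the UTF-32LE BOM starts with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(raw):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0)."""
    for bom, name in BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_bytes(raw, fallback='utf-8'):
    """Decode raw bytes, preferring a BOM over the caller's fallback guess."""
    encoding, skip = sniff_bom(raw)
    return raw[skip:].decode(encoding or fallback)

sample = b'\xe8\x9f\x92\xe8\x9b\x87'  # the bytes from the question
text = decode_bytes(sample)           # no BOM here, so the fallback applies
```

A missing BOM proves nothing about the encoding, so the fallback is only a guess; that is why asking the origin of the file remains the better option.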

Chris Angelico


#11473

From	Thomas 'PointedEars' Lahn <PointedEars@web.de>
Date	2011-08-15 21:37 +0200
Message-ID	<21539106.hDsTRIgEHo@PointedEars.de>
In reply to	#11464

Chris Angelico wrote:

> On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.ziff@gmail.com> wrote:
>> if I am using the standard csv library to read contents of a csv file
>> which contains Unicode strings (short example:
>> '\xe8\x9f\x92\xe8\x9b\x87'), how do I use a python Unicode method such as
>> decode or encode to transform this string type into a python unicode
>> type? Must I know the encoding (byte groupings) of the Unicode? Can I get
>> this from the file? Perhaps I need to open the file with particular
>> attributes?
> 
> Start here:
> 
> http://www.joelonsoftware.com/articles/Unicode.html
> 
> The CSV file, being stored on disk, cannot contain Unicode strings; it
> can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
> etc), then you can decode it using that. If you don't, your best bet
> is to ask the origin of the file; failing that, check the first few
> bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
> probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
> encodings of the BOM). There may be other clues, too, but normally
> it's best to get the encoding separately from the data rather than try
> to decode it from the data itself.

As this problem really is not a new one, there are several more – if I may 
say so – pythonic approaches:

<http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file>

While improving Billy Mays' "matching brackets" checker, I found that
chardet worked for me (the test file was UTF-8-encoded).  Watch for word-wrap:

-----------------------------------------------------------------------
# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <PointedEars@web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6@myhashismyemail.com>
in <news:j01ph6$knt$1@speranza.aioe.org> 
'''
import sys, os, chardet

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
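            stack = [' ']       # sentinel, so stack[-1] is always valid
            first_bad = None    # first mismatched closer, if any

            file_path = os.path.join(dirpath, name)

            with open(file_path, 'rb') as f:
                # Keep the lines in a list: they are needed twice (for
                # detection and for scanning), and a bare iterator would
                # already be exhausted after the first pass.
                lines = list(enumerate(f, 1))

            raw = b''.join(line for line_no, line in lines)
            # chardet reports None when it cannot guess (e.g. empty file).
            encoding = chardet.detect(raw)['encoding'] or 'utf-8'

            chars = ((c, line_no, col)
                     for line_no, line in lines
                     for col, c in enumerate(line.decode(encoding), 1)
                     if c in valid)
            for c, line_no, col in chars:
                if c in pairs:
                    if stack[-1] == pairs[c]:
                        stack.pop()
                    elif first_bad is None:
                        first_bad = (c, line_no, col)
                else:
                    stack.append(c)

            if first_bad is not None:
                status = "bad '%s' at %s:%s" % first_bad
            elif len(stack) > 1:
                status = "bad: unclosed '%s'" % stack[-1]
            else:
                status = "good"
            print('%s: %s' % (name, status))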
-----------------------------------------------------------------------
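
To tie this back to the original CSV question: once an encoding is known (from the file's origin, a BOM, or a detector such as chardet), the rows can be read as text. The following is only a sketch and assumes Python 3, where the csv module works on text files directly; read_csv_rows is a made-up helper name.

```python
import csv

def read_csv_rows(path, encoding):
    """Yield each CSV row as a list of text (unicode) strings."""
    # newline='' lets the csv module handle line endings itself.
    with open(path, 'r', encoding=encoding, newline='') as f:
        for row in csv.reader(f):
            yield row
```

Passing 'utf-8-sig' as the encoding also strips a leading UTF-8 BOM if the file has one.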

HTH

-- 
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.

