Re: split lines from stdin into a list of unicode strings

From Dave Angel <davea@davea.name>
Subject Re: split lines from stdin into a list of unicode strings
Date 2013-08-28 11:13 +0000
References <521DB58E.5000102@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.302.1377688438.19984.python-list@python.org>


On 28/8/2013 04:32, Kurt Mueller wrote:

> This is a follow up to the Subject
> "right adjusted strings containing umlauts"

You started a new thread, with a new subject line.  So presumably we're
starting over with a clean slate.

>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.

Does that mean it'll vary from one run of the program to the next, or
that it'll vary from one line to the next?  Your code below assumes the
latter.  Guessing from a single line at a time gives chardet very little
data to work with, which can greatly increase the unreliability of an
already dubious algorithm.

> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet

Is this the one?
    https://pypi.python.org/pypi/chardet

>
> bool_cmnt = True  # shlex: skip comments
> bool_posx = True  # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode

But shlex does work on unicode, since you're using Python 2.7.3 (earlier
releases of the module did not support Unicode input).

>     except Exception, errr:                                         # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
> ##############################################################################################################
>
> $ cat <some-file> | template.py
>
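
On the shlex point: a minimal sketch of decoding the line first and
splitting afterwards, so the tokens come out as unicode directly and no
per-token decode is needed.  It is written for Python 3 (an assumption;
the quoted script targets 2.7.3), and the function name split_decoded is
hypothetical.

```python
import shlex


def split_decoded(raw_line, encoding, comments=True, posix=True):
    """Decode one raw input line, then split it shell-style.

    raw_line is a bytes object; encoding is whatever the caller has
    decided on (guessed or supplied by the user).
    """
    text = raw_line.decode(encoding)
    # shlex.split raises ValueError on malformed input such as
    # "No closing quotation"; callers may want to catch that.
    return shlex.split(text, comments=comments, posix=posix)


# Sample input encoded as Latin-1, purely for illustration.
raw = 'größe "zwei wörter" # trailing comment\n'.encode("latin-1")
tokens = split_decoded(raw, "latin-1")
```

Decoding before splitting also means the quoting rules are applied to
characters rather than to bytes, which matters for multi-byte encodings.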

Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s) assume utf-8?  That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.
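
A rough sketch of what the core of such a filter might look like, assuming
Python 3.  To keep it dependency-free it does not call chardet; instead it
tries a short list of candidate encodings in order, which is a much cruder
stand-in for real detection.  The names guess_encoding and recode_to_utf8
and the candidate list are hypothetical.

```python
def guess_encoding(data, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes data cleanly.

    This is a simple fallback chain, not statistical detection.
    latin-1 accepts any byte sequence, so it acts as a last resort.
    """
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding fits the input")


def recode_to_utf8(data, hint=None):
    """Re-encode the whole input as UTF-8.

    hint is an encoding name supplied by the user; when given, it
    overrides the guess entirely.
    """
    encoding = hint or guess_encoding(data)
    return data.decode(encoding).encode("utf-8")


# The guess is made once over the whole input, not line by line.
sample = "für\nGröße\n".encode("latin-1")
converted = recode_to_utf8(sample)
```

In a real filter the bytes would come from sys.stdin.buffer.read() and go
to sys.stdout.buffer.write(), with the hint taken from the command line.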

Alternatively, just add a commandline argument with the encoding, and
parse it into enco_type.
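
For the command-line variant, something along these lines could work,
assuming Python 3 and the standard argparse module.  The option name
--encoding and its utf-8 default are assumptions made for the sketch;
only enco_type comes from the quoted script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description="split lines from stdin into lists of unicode strings"
    )
    parser.add_argument(
        "-e", "--encoding",
        default="utf-8",  # assumed default, not taken from the original
        help="encoding of the input (default: %(default)s)",
    )
    return parser.parse_args(argv)


# An explicit argument list is passed here only to make the example
# deterministic; a real script would call parse_args() with no arguments.
args = parse_args(["--encoding", "latin-1"])
enco_type = args.encoding
```

The script would then be run as, for example,
template.py --encoding latin-1 < some-file, and enco_type would replace
the per-line chardet call.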


-- 
DaveA
