Re: split lines from stdin into a list of unicode strings

To	python-list@python.org
From	Dave Angel <davea@davea.name>
Subject	Re: split lines from stdin into a list of unicode strings
Date	Wed, 28 Aug 2013 11:13:36 +0000 (UTC)
Newsgroups	comp.lang.python


On 28/8/2013 04:32, Kurt Mueller wrote:

> This is a follow up to the Subject
> "right adjusted strings containing umlauts"

You started a new thread, with a new subject line.  So presumably we're
starting over with a clean slate.

>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.

Does that mean it'll vary from one run of the program to the next, or
that it'll vary from one line to the next?  Your code below assumes the
latter, since it calls chardet.detect() on each line separately.  With
so little data per guess, that can greatly increase the unreliability of
the already dubious chardet algorithm.

> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet

Is this the one?
    https://pypi.python.org/pypi/chardet

>
> bool_cmnt = True  # shlex: skip comments
> bool_posx = True  # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode

But shlex does work on unicode, since you're using Python 2.7.3.

>     except Exception, errr:                                         # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
> ##############################################################################################################
>
> $ cat <some-file> | template.py
>

Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s) assume utf-8?  That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.
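
A rough sketch of what such a filter might look like.  This is only an
illustration, not tested against your data: it is written for Python 3
rather than your 2.7.3, the name to_utf8 and the candidates parameter are
made up here, and it swaps chardet's guessing for a plain "try these
encodings in order" list that the user could supply as the clue.

```python
import sys


def to_utf8(raw, candidates=("utf-8", "latin-1")):
    """Return the bytes in raw re-encoded as UTF-8.

    Hypothetical helper: each name in candidates is tried in order and the
    first one that decodes without error wins.
    """
    for name in candidates:
        try:
            return raw.decode(name).encode("utf-8")
        except UnicodeDecodeError:
            continue  # this candidate does not fit, try the next one
    raise ValueError("none of the candidate encodings fit: %r" % (candidates,))


def run_filter(candidates=("utf-8", "latin-1")):
    """Read all of stdin as bytes and write the UTF-8 version to stdout."""
    sys.stdout.buffer.write(to_utf8(sys.stdin.buffer.read(), candidates))
```

Note that latin-1 accepts any byte sequence, so it only makes sense as
the last candidate.  The later stage can then decode everything as utf-8
and never needs to know where the bytes came from.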

Alternatively, just add a command-line argument with the encoding, and
parse it into enco_type.
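
Something along these lines, as a minimal sketch only.  It assumes
Python 3 (your template is 2.7.3), and the --encoding option name, its
utf-8 default, and the function names are my own choices for
illustration.  Each line is decoded first and then handed to
shlex.split() as text, so no per-token decode is needed afterwards.

```python
import argparse
import shlex
import sys


def parse_args(argv=None):
    """Parse the command line; --encoding is an assumed option name."""
    parser = argparse.ArgumentParser(
        description="split lines from stdin into lists of text strings")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the input (default: utf-8)")
    return parser.parse_args(argv)


def split_lines(raw_lines, enco_type):
    """Decode each raw (bytes) line with enco_type, then split it shlex-style."""
    result = []
    for raw in raw_lines:
        text = raw.decode(enco_type)
        try:
            # comments=True skips '#' comments, posix=True handles quoting
            result.append(shlex.split(text, comments=True, posix=True))
        except ValueError as errr:  # usually 'No closing quotation'
            print("error=%r on line=%r" % (str(errr), text.rstrip()),
                  file=sys.stderr)
    return result


def main(argv=None):
    """Wire the pieces together: read bytes from stdin, print token lists."""
    args = parse_args(argv)
    for tokens in split_lines(sys.stdin.buffer, args.encoding):
        print(tokens)
```

Calling main() at the bottom of the script would let you run it as
"cat <some-file> | template.py --encoding latin-1".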


-- 
DaveA


