


Re: split lines from stdin into a list of unicode strings

From Dave Angel <davea@davea.name>
Subject Re: split lines from stdin into a list of unicode strings
Date Wed, 28 Aug 2013 11:13:36 +0000 (UTC)
Newsgroups comp.lang.python
Message-ID <mailman.302.1377688438.19984.python-list@python.org>



On 28/8/2013 04:32, Kurt Mueller wrote:

> This is a follow up to the Subject
> "right adjusted strings containing umlauts"

You started a new thread, with a new subject line.  So presumably we're
starting over with a clean slate.

>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.

Does that mean it'll vary from one run of the program to the next, or
that it'll vary from one line to the next?  Your code below assumes the
latter.  Guessing line by line gives chardet very little data per guess,
which can greatly increase the unreliability of an already dubious
algorithm.

> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet

Is this the one?
    https://pypi.python.org/pypi/chardet

>
> bool_cmnt = True  # shlex: skip comments
> bool_posx = True  # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode

But shlex does work on unicode, since you're using Python 2.7.3.  So you
could decode the line first and split it afterwards.

>     except Exception, errr:                                         # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
> ##############################################################################################################
>
> $ cat <some-file> | template.py
>

Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s) assume utf-8?  That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.
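
A rough sketch of such a filter, under some loud assumptions: it is
written for Python 3 rather than your 2.7.3, the names guess_encoding and
to_utf8 are made up for illustration, and the guess is a plain
try-each-candidate loop standing in for chardet, not chardet itself.  It
guesses once for the whole input instead of once per line.

```python
import sys

# Hypothetical helper names; the candidate list is an assumption and
# should be tuned to the encodings you actually expect to see.
def guess_encoding(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Return the first candidate encoding that decodes data cleanly."""
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError('none of the candidate encodings fit the input')


def to_utf8(data, hint=None):
    """Re-encode raw bytes as utf-8, trusting a user hint over the guess."""
    encoding = hint or guess_encoding(data)
    return data.decode(encoding).encode('utf-8')


def main(argv=None):
    """Filter stdin to stdout; an optional first argument is the hint."""
    argv = sys.argv[1:] if argv is None else argv
    hint = argv[0] if argv else None
    sys.stdout.buffer.write(to_utf8(sys.stdin.buffer.read(), hint))
```

Calling main() at the bottom of a script would turn this into the first
stage of a pipeline, with the hint as the "clue" fed in by the user.
Swapping in a real detector means replacing guess_encoding only.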

Alternatively, just add a command-line argument with the encoding, and
parse it into enco_type.
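
Something along these lines, as a sketch only.  It assumes Python 3 (not
your 2.7.3), and the --encoding option name, the split_line helper and the
stream parameter are all invented for the example.  Each line is decoded
first and split afterwards, so shlex only ever sees text.

```python
import argparse
import shlex
import sys


def split_line(raw_line, encoding):
    """Decode one raw input line, then split it the way shlex.split() does."""
    text = raw_line.decode(encoding)
    # comments=True skips '#' comments, posix=True handles quoted strings.
    return shlex.split(text, comments=True, posix=True)


def main(argv=None, stream=None):
    """Read raw lines and print each as a list of (unicode) strings."""
    parser = argparse.ArgumentParser(
        description='split lines from stdin into a list of strings')
    # Hypothetical option name; the default is an assumption.
    parser.add_argument('--encoding', default='utf-8',
                        help='encoding of the input (default: utf-8)')
    args = parser.parse_args(argv)
    if stream is None:
        stream = sys.stdin.buffer  # raw bytes, not yet decoded
    for raw_line in stream:
        try:
            print(split_line(raw_line, args.encoding))
        except ValueError as err:  # usually 'No closing quotation'
            print('error=%r on line=%r' % (str(err), raw_line),
                  file=sys.stderr)
```

With main() called at the bottom of template.py, the invocation would be
something like "cat <some-file> | template.py --encoding latin-1".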


-- 
DaveA


