From: Dave Angel
Newsgroups: comp.lang.python
To: python-list@python.org
Subject: Re: split lines from stdin into a list of unicode strings
Date: Wed, 28 Aug 2013 11:13:36 +0000 (UTC)
References: <521DB58E.5000102@gmail.com>

On 28/8/2013 04:32, Kurt Mueller wrote:
> This is a follow up to the Subject
> "right adjusted strings containing umlauts"

You started a new thread, with a new subject line. So presumably we're
starting over with a clean slate.

>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.

Does that mean it'll vary from one run of the program to the next, or
it'll vary from one line to the next? Your code below assumes the
latter. That can greatly increase the unreliability of the already
dubious chardet algorithm.

> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet

Is this the one?
https://pypi.python.org/pypi/chardet

>
> bool_cmnt = True # shlex: skip comments
> bool_posx = True # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ] # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode

But shlex does work on unicode, since you're using Python 2.7.3.

>     except Exception, errr: # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) ) # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ] # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) ) # list of unicode strings
> ##############################################################################################################
>
> $ cat | template.py
>

Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s) assume utf-8? That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.

Alternatively, just add a commandline argument with the encoding, and
parse it into enco_type.

-- 
DaveA
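
A minimal sketch of the commandline-argument alternative, for illustration
only. It assumes Python 3 rather than the 2.7.3 of the quoted script, so each
line is decoded once with the user-supplied encoding and shlex.split() then
works on text directly. The `--encoding` option name and the `split_lines` and
`main` helpers are hypothetical, not part of the original template.py.

```python
import argparse
import io
import shlex
import sys


def split_lines(byte_lines, encoding):
    """Decode each raw line with one fixed encoding, then shlex-split it."""
    for raw in byte_lines:
        text = raw.decode(encoding)
        try:
            # comments=True skips '#' comments; posix=True handles quoted strings
            yield shlex.split(text, comments=True, posix=True)
        except ValueError as err:  # usually 'No closing quotation'
            print("error=%r on line=%r" % (str(err), text.rstrip()),
                  file=sys.stderr)


def main(argv, stream):
    """Parse the (hypothetical) --encoding option and split every line."""
    parser = argparse.ArgumentParser(
        description="split lines into lists of unicode strings")
    parser.add_argument("--encoding", default="utf-8",
                        help="encoding of the input, constant for one run")
    args = parser.parse_args(argv)
    return list(split_lines(stream, args.encoding))


# Demo on in-memory Latin-1 bytes. A real script would instead call
# main(sys.argv[1:], sys.stdin.buffer).
sample = io.BytesIO(
    'gr\u00fc\u00dfe "zwei w\u00f6rter" # a comment\n'.encode("latin-1"))
result = main(["--encoding", "latin-1"], sample)
print(result)
```

Because the encoding is fixed for the whole run, no per-line guessing is
needed; a chardet-based guess could still be used to pick the default when the
option is not given.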