


Re: split lines from stdin into a list of unicode strings

Date Thu, 29 Aug 2013 13:31:45 +0200
From Kurt Mueller <kurt.alfred.mueller@gmail.com>
Subject Re: split lines from stdin into a list of unicode strings
Newsgroups comp.lang.python
Message-ID <mailman.361.1377775933.19984.python-list@python.org> (permalink)



On 29.08.2013 11:12, Peter Otten wrote:
> kurt.alfred.mueller@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.

> You can compromise and read ahead a limited number of lines. Here's my demo 
> script (The interesting part is detect_encoding(), I got a bit distracted by 
> unrelated stuff...). The script does one extra decode/encode cycle -- it 
> should be easy to avoid that if you run into performance issues.

Thanks Peter!

I see the idea. It limits the buffer size and memory usage for the detection.
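
A minimal sketch of the read-ahead idea, for illustration only. It is not
Peter's script (which is not reproduced in this message). The names
guess_encoding() and split_lines(), the candidate list, and the
trial-decoding heuristic are all assumptions; a real detector such as
chardet would take the place of guess_encoding(). Written for Python 3.

```python
import io
import itertools
import shlex


def guess_encoding(sample, candidates=("utf-8", "iso-8859-1")):
    """Return the first candidate that decodes the sample without error.

    Hypothetical stand-in for a real detector: it only tries each
    candidate in order, so the order of the list matters.
    """
    for name in candidates:
        try:
            sample.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding fits the sample")


def split_lines(stream, max_lines=100):
    """Yield one shlex-split list of text strings per input line.

    Only the first max_lines lines are buffered for the detection;
    the rest of the binary stream is decoded line by line.
    """
    head = list(itertools.islice(stream, max_lines))
    encoding = guess_encoding(b"".join(head))
    for raw in itertools.chain(head, stream):
        yield shlex.split(raw.decode(encoding))


# Demo with an in-memory stream instead of stdin.
demo = io.BytesIO(u'M\u00fcller "two words" x\nplain line\n'.encode("utf-8"))
result = list(split_lines(demo))
print(result)
```

For real input the stream would be sys.stdin.buffer. Note the limitation
of any read-ahead scheme: if the buffered lines are pure ASCII and a later
line is not valid UTF-8, the guess made from the sample is wrong and the
later decode fails.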


I have to say that I am a bit disappointed by the chardet library.
The encoding for the single character 'ü'
is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
whereas "file" says:
$ echo "ü" | file -i -
/dev/stdin: text/plain; charset=utf-8
$

"ü" is a character I use very often, as it is in my name: "Müller":-)
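
A likely reason for the surprising guess, shown as a small Python 3 sketch:
the UTF-8 form of 'ü' is just two bytes, and those two bytes are also valid
in several other encodings, so a statistical detector has very little to go
on. The helper valid_encodings() and the candidate list are assumptions
chosen for illustration; this is not how chardet works internally.

```python
def valid_encodings(raw, candidates):
    """Return the candidates under which raw decodes without error."""
    ok = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


data = u"\u00fc".encode("utf-8")  # the two bytes 0xC3 0xBC
candidates = ["ascii", "utf-8", "euc_jp", "iso-8859-1", "cp1252"]
fits = valid_encodings(data, candidates)
print(fits)
```

Every candidate except ASCII accepts these bytes, so with a sample this
short the answer depends on each tool's heuristics and defaults rather
than on the data alone.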


I am now trying the "python-magic" library, which offers functionality
similar to chardet. It is based on the same library that the Unix "file"
command uses, and it can be extended with a magic file; see "man file".


My magic_test script:
-------------------------------------------------------------------
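#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
from __future__ import print_function
import magic  # libmagic bindings that expose the magic.open() interface

strg_chck = 'ü'  # under Python 2 this is a UTF-8 encoded byte string
# Ask libmagic for the MIME encoding only (for example "utf-8").
magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
magc_enco.load()  # load the default magic database
print( strg_chck + ' encoding=' + magc_enco.buffer( strg_chck ) )
magc_enco.close()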
-------------------------------------------------------------------
$ magic_test
ü encoding=utf-8

python-magic seems a bit more reliable to me.


Cheers
-- 
Kurt Mueller


