
Re: split lines from stdin into a list of unicode strings

Started by	Dave Angel <davea@davea.name>
First post	2013-08-28 11:13 +0000
Last post	2013-09-05 15:25 +0200
Articles	8 — 4 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: split lines from stdin into a list of unicode strings Dave Angel <davea@davea.name> - 2013-08-28 11:13 +0000
    Re: split lines from stdin into a list of unicode strings kurt.alfred.mueller@gmail.com - 2013-08-28 05:39 -0700
      Re: split lines from stdin into a list of unicode strings Peter Otten <__peter__@web.de> - 2013-08-29 11:12 +0200
      Re: split lines from stdin into a list of unicode strings Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-29 13:31 +0200
      Re: split lines from stdin into a list of unicode strings Peter Otten <__peter__@web.de> - 2013-08-29 15:15 +0200
      Re: split lines from stdin into a list of unicode strings Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-09-05 09:42 +0200
      Re: split lines from stdin into a list of unicode strings Peter Otten <__peter__@web.de> - 2013-09-05 10:33 +0200
      Re: split lines from stdin into a list of unicode strings Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-09-05 15:25 +0200

#53127 — Re: split lines from stdin into a list of unicode strings

From	Dave Angel <davea@davea.name>
Date	2013-08-28 11:13 +0000
Subject	Re: split lines from stdin into a list of unicode strings
Message-ID	<mailman.302.1377688438.19984.python-list@python.org>

On 28/8/2013 04:32, Kurt Mueller wrote:

> This is a follow up to the Subject
> "right adjusted strings containing umlauts"

You started a new thread, with a new subject line.  So presumably we're
starting over with a clean slate.

>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.

Does that mean it'll vary from one run of the program to the next, or
it'll vary from one line to the next?  Your code below assumes the
latter.  That can greatly increase the unreliability of the already
dubious chardet algorithm.

> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet

Is this the one ?
    https://pypi.python.org/pypi/chardet

>
> bool_cmnt = True  # shlex: skip comments
> bool_posx = True  # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode

But shlex does, since you're using Python 2.7.3

>     except Exception, errr:                                         # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
> ##############################################################################################################
>
> $ cat <some-file> | template.py
>

Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s)  assume utf-8 ?  That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.

Alternatively, just add a commandline argument with the encoding, and
parse it into enco_type.
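
A minimal sketch of the command-line variant, written for Python 3 (where shlex.split() takes str directly) rather than the 2.7.3 of the template. The option name "-e/--encoding", the utf-8 default and the helper names are assumptions made for illustration only.

```python
import argparse
import io
import shlex
import sys


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # "-e/--encoding" and the utf-8 default are assumptions of this sketch.
    parser.add_argument("-e", "--encoding", default="utf-8")
    return parser.parse_args(argv)


def split_lines(byte_stream, encoding):
    """Decode a binary stream and yield one list of str tokens per line."""
    text = io.TextIOWrapper(byte_stream, encoding=encoding, errors="replace")
    for line in text:
        try:
            yield shlex.split(line, comments=True, posix=True)
        except ValueError as exc:  # usually "No closing quotation"
            print("error=%r on line=%r" % (str(exc), line.rstrip()),
                  file=sys.stderr)


def main(argv=None):
    # Reads bytes from stdin so that the chosen encoding is the only decoder.
    args = parse_args(argv)
    for parts in split_lines(sys.stdin.buffer, args.encoding):
        print(parts)
```

Calling main() at the bottom of a script would give the "cat <some-file> | script -e latin9" style of use; lines that shlex rejects are reported on stderr and skipped.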


-- 
DaveA


#53147

From	kurt.alfred.mueller@gmail.com
Date	2013-08-28 05:39 -0700
Message-ID	<b67f4179-53c6-4bd4-b9ea-8852c9048be0@googlegroups.com>
In reply to	#53127

On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
> On 28/8/2013 04:32, Kurt Mueller wrote:
> > For some text manipulation tasks I need a template to split lines
> > from stdin into a list of strings the way shlex.split() does it.
> > The encoding of the input can vary.

> Does that mean it'll vary from one run of the program to the next, or
> it'll vary from one line to the next?  Your code below assumes the
> latter.  That can greatly increase the unreliability of the already
> dubious chardet algorithm.

The encoding only varies from one launch to the other.
The reason I process each line is memory usage.

One option to make chardet more reliable:
I could read all of the input, save the input lines in a list for
further processing, feed the lines into
chardet.universaldetector.UniversalDetector ( feed(), close(), then the result attribute )
and then decode and split/shlex the lines in the list.
That way the chardet oracle would be more reliable, but 
roughly twice as much memory will be used.
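
A rough sketch of the feed/close/result sequence, in Python 3. The detector is passed in as a parameter: any object with chardet-style feed(), close(), a "done" flag and a "result" dict will do, so chardet's UniversalDetector should fit, but the function name and the early exit on "done" are choices of this sketch. It does not keep the lines, so the caller still has to buffer them or be able to read the input a second time.

```python
def detect_incrementally(byte_lines, detector):
    """Feed byte lines to the detector until it is confident or input ends.

    `detector` is assumed to offer feed(bytes), close(), a boolean
    attribute `done` and a dict attribute `result` with an "encoding" key.
    """
    for line in byte_lines:
        detector.feed(line)
        if detector.done:      # detector has seen enough, stop early
            break
    detector.close()           # finalises the guess
    return detector.result["encoding"]
```

Stopping as soon as the detector reports "done" keeps the number of lines read small for inputs whose encoding is obvious early on.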

> > import chardet
> Is this the one ?
>     https://pypi.python.org/pypi/chardet

Yes.

> > $ cat <some-file> | template.py

> Why not have a separate filter that converts from a (guessed) encoding
> into utf-8, and have the later stage(s)  assume utf-8 ?  That way, the
> filter could be fed clues by the user, or replaced entirely, without
> affecting the main code you're working on.

Working on UNIX-like systems (I am happy to work in a MSFZ)
the processing pipe would be then:

cat <some-file> | recode2utf8 | splitlines.py
memory usage 2 * <some-file> ( plus chardet memory usage )
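
For illustration, the recoding stage of such a pipe might look like the Python 3 sketch below. "recode2utf8" is only a placeholder name in the pipeline above; the function name, the chunk size and the choice to take the source encoding as an argument (instead of guessing it) are assumptions here.

```python
import io


def recode_to_utf8(src, dst, encoding, errors="replace"):
    """Copy the binary stream src to dst, converting `encoding` to UTF-8."""
    reader = io.TextIOWrapper(src, encoding=encoding, errors=errors,
                              newline="")  # newline="": keep line ends as-is
    # Read decoded text in chunks so the whole input is never held in memory.
    for chunk in iter(lambda: reader.read(65536), ""):
        dst.write(chunk.encode("utf-8"))
    reader.detach()  # leave src open for the caller
```

When the encoding is known up front, the copy is streamed chunk by chunk; the "2 * <some-file>" figure applies only if the filter has to hold the input while guessing.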

> Alternatively, just add a commandline argument with the encoding, and
> parse it into enco_type.

cat <some-file> | splitlines.py -e latin9
memory usage 1 * <some-file>

or

cat <some-file> | splitlines.py -e $( codingdetect <some-file> )
memory usage 1 * <some-file>

So, because memory usage is not primary,
I think I will go with the option described above.

-- 
Kurt Müller


#53229

From	Peter Otten <__peter__@web.de>
Date	2013-08-29 11:12 +0200
Message-ID	<mailman.352.1377767550.19984.python-list@python.org>
In reply to	#53147

kurt.alfred.mueller@gmail.com wrote:

> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>> On 28/8/2013 04:32, Kurt Mueller wrote:
>> > For some text manipulation tasks I need a template to split lines
>> > from stdin into a list of strings the way shlex.split() does it.
>> > The encoding of the input can vary.
> 
>> Does that mean it'll vary from one run of the program to the next, or
>> it'll vary from one line to the next?  Your code below assumes the
>> latter.  That can greatly increase the unreliability of the already
>> dubious chardet algorithm.
> 
> The encoding only varies from one launch to the other.
> The reason I process each line is memory usage.
> 
> One option to make chardet more reliable:
> I could read all of the input, save the input lines in a list for
> further processing, feed the lines into
> chardet.universaldetector.UniversalDetector ( feed(), close(), then the result attribute )
> and then decode and split/shlex the lines in the list.
> That way the chardet oracle would be more reliable, but
> roughly twice as much memory will be used.

You can compromise and read ahead a limited number of lines. Here's my demo 
script (The interesting part is detect_encoding(), I got a bit distracted by 
unrelated stuff...). The script does one extra decode/encode cycle -- it 
should be easy to avoid that if you run into performance issues.

#!/usr/bin/env python

import sys
import shlex
import chardet
from itertools import islice, chain

def detect_encoding(instream, encoding, detect_lines):
    if encoding is None:
        encoding = instream.encoding
        if encoding is None:
            head = list(islice(instream, detect_lines))
            encoding =  chardet.detect("".join(head))["encoding"]
            instream = chain(head, instream)
    return encoding, instream

def split_line(line, comments=True, posix=True):
    parts = shlex.split(line.encode("utf-8"),
                        comments=comments, posix=posix)
    return [part.decode("utf-8") for part in parts]

def to_int(s):
    """
    >>> to_int(" 42")
    42
    >>> to_int("-1") is None
    True
    >>> to_int(" NONE ") is None
    True
    >>> to_int("none") is None
    True
    >>> to_int(" 0x400  ")
    1024
    """
    s = s.lower().strip()
    if s in {"none", "-1"}: return None
    return int(s, 16 if s.startswith("0x") else 10)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--encoding")
    parser.add_argument(
        "-d", "--detect-lines", type=to_int, default=100,
        help=("number of lines used to determine encoding; "
              "'none' or -1 for whole file. (default: 100)"))
    args = parser.parse_args()

    encoding, instream = detect_encoding(
        sys.stdin,
        encoding=args.encoding, detect_lines=args.detect_lines)
    lines = (line.decode(encoding) for line in instream)

    for line in lines:
        try:
            parts = split_line(line)
        except ValueError as exc:
            print >> sys.stderr, exc
        else:
            print parts

if __name__ == "__main__":
    main()
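
The read-ahead part of detect_encoding() can be sketched for Python 3 byte streams roughly as follows. The detector is passed in as a callable so the sketch does not depend on any particular library; the parameter name "detect" is an assumption, and something like lambda data: chardet.detect(data)["encoding"] could be plugged in.

```python
from itertools import chain, islice


def detect_encoding(byte_lines, detect, detect_lines=100):
    """Peek at up to detect_lines lines and return (encoding, lines).

    `detect` is any callable mapping bytes to an encoding name.
    The returned iterator replays the peeked lines before the rest,
    so no input is lost. detect_lines=None reads the whole input.
    """
    byte_lines = iter(byte_lines)
    head = list(islice(byte_lines, detect_lines))
    encoding = detect(b"".join(head))
    return encoding, chain(head, byte_lines)
```

Only the first detect_lines lines are held in memory; everything after them is still consumed lazily.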


#53239

From	Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date	2013-08-29 13:31 +0200
Message-ID	<mailman.361.1377775933.19984.python-list@python.org>
In reply to	#53147

Am 29.08.2013 11:12, schrieb Peter Otten:
> kurt.alfred.mueller@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.

> You can compromise and read ahead a limited number of lines. Here's my demo 
> script (The interesting part is detect_encoding(), I got a bit distracted by 
> unrelated stuff...). The script does one extra decode/encode cycle -- it 
> should be easy to avoid that if you run into performance issues.

Thanks Peter!

I see the idea. It limits the buffersize/memory usage for the detection.

I have to say that I am a bit disappointed by the chardet library.
The encoding for the single character 'ü'
is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
whereas "file" says:
$ echo "ü" | file -i -
/dev/stdin: text/plain; charset=utf-8
$

"ü" is a character I use very often, as it is in my name: "Müller":-)

I am trying the "python-magic" library instead. It offers functionality
similar to chardet, wraps the libmagic library used by the "file" Unix command,
and can be extended with a magic file, see "man file".

My magic_test script:
-------------------------------------------------------------------
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
from __future__ import print_function
import magic
strg_chck = 'ü'
magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
magc_enco.load()
print( strg_chck + ' encoding=' + magc_enco.buffer( strg_chck ) )
magc_enco.close()
-------------------------------------------------------------------
$ magic_test
ü encoding=utf-8

python-magic seems to me a bit more reliable.

Cheers
-- 
Kurt Mueller


#53246

From	Peter Otten <__peter__@web.de>
Date	2013-08-29 15:15 +0200
Message-ID	<mailman.365.1377782100.19984.python-list@python.org>
In reply to	#53147

Kurt Mueller wrote:

> I have to say that I am a bit disappointed by the chardet library.
> The encoding for the single character 'ü'
> is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
> whereas "file" says:
> $ echo "ü" | file -i -
> /dev/stdin: text/plain; charset=utf-8
> $
> 
> "ü" is a character I use very often, as it is in my name: "Müller":-)

You cannot determine an encoding by a single letter. 

Why should "ü" be more likely than "端"? The only thing you can blame chardet 
for is that its confidence rating is a flat out lie...
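
The ambiguity is easy to reproduce in Python 3: the two bytes that UTF-8 uses for "ü" are also a valid two-byte sequence in EUC-JP and a valid pair of Latin-1 characters. The list of candidate encodings below is just a selection for this demonstration.

```python
raw = "ü".encode("utf-8")  # the two bytes 0xC3 0xBC

# Decode the same bytes under several encodings; none of them fails.
candidates = ["utf-8", "euc_jp", "latin-1"]
readings = {name: raw.decode(name) for name in candidates}

for name, text in readings.items():
    print(name, repr(text))
```

With nothing but those two bytes to go on, a detector has no basis for preferring one reading over another.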

For "Müller", on the other hand, you could probably come up with a (simple) 
heuristic that "ü" is more likely to be surrounded by ascii-letters than 
"端".


#53679

From	Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date	2013-09-05 09:42 +0200
Message-ID	<mailman.80.1378366986.5461.python-list@python.org>
In reply to	#53147

Am 29.08.2013 11:12, schrieb Peter Otten:
> kurt.alfred.mueller@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.

> You can compromise and read ahead a limited number of lines. Here's my demo 
> script (The interesting part is detect_encoding(), I got a bit distracted by 
> unrelated stuff...). The script does one extra decode/encode cycle -- it 
> should be easy to avoid that if you run into performance issues.

I took your script as a template.
But I used the libmagic library (python-magic) instead of chardet.
See http://linux.die.net/man/3/libmagic
and https://github.com/ahupp/python-magic
( I made tests with files of different size, up to 1.2 [GB] )

I had the following issues:

- In a real file, the encoding was detected as 'ascii' for detect_lines=1000.
  In line 1002 there was an umlaut character, so line.decode(encoding) failed.
  I am thinking of adding the errors parameter: line.decode(encoding, errors='replace')

- If the buffer was bigger than a few megabytes, the encoding returned
  by libmagic was always None. The big files had very long lines ( more than 4k per line ),
  so with detect_lines=1000 this limit was exceeded.

- The magic.buffer() ( the equivalent of chardet.detect() ) takes about 2 seconds
  per megabyte buffer.
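
One way to keep the detection buffer small regardless of line length is to cap it by bytes as well as by lines. A Python 3 sketch follows; the function name and the 64 KiB default are arbitrary assumptions, not limits documented for libmagic or chardet.

```python
from itertools import chain


def read_head(byte_lines, max_lines=1000, max_bytes=64 * 1024):
    """Collect leading lines until either limit is reached.

    Returns (sample, lines). `sample` holds at most max_bytes for the
    detector (the cut may fall inside a multi-byte character); `lines`
    replays every line, including the ones already read.
    """
    byte_lines = iter(byte_lines)
    head, size = [], 0
    for line in byte_lines:
        head.append(line)
        size += len(line)
        if len(head) >= max_lines or size >= max_bytes:
            break
    sample = b"".join(head)[:max_bytes]
    return sample, chain(head, byte_lines)
```

With 4k lines and these defaults, the sample stops after about 16 lines instead of growing to several megabytes, which also bounds the time spent in the detector.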



-- 
Kurt Mueller


#53681

From	Peter Otten <__peter__@web.de>
Date	2013-09-05 10:33 +0200
Message-ID	<mailman.82.1378370024.5461.python-list@python.org>
In reply to	#53147

Kurt Mueller wrote:

> Am 29.08.2013 11:12, schrieb Peter Otten:
>> kurt.alfred.mueller@gmail.com wrote:
>>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>>> For some text manipulation tasks I need a template to split lines
>>>>> from stdin into a list of strings the way shlex.split() does it.
>>>>> The encoding of the input can vary.
> 
>> You can compromise and read ahead a limited number of lines. Here's my
>> demo script (The interesting part is detect_encoding(), I got a bit
>> distracted by unrelated stuff...). The script does one extra
>> decode/encode cycle -- it should be easy to avoid that if you run into
>> performance issues.
> 
> I took your script as a template.
> But I used the libmagic library (python-magic) instead of chardet.
> See http://linux.die.net/man/3/libmagic
> and https://github.com/ahupp/python-magic
> ( I made tests with files of different size, up to 1.2 [GB] )
> 
> I had the following issues:
> 
> - In a real file, the encoding was detected as 'ascii' for
> detect_lines=1000.
>   In line 1002 there was an umlaut character, so
>   line.decode(encoding) failed. I am thinking of adding the errors parameter:
>   line.decode(encoding, errors='replace')

Tough luck ;) You could try and tackle the problem by skipping leading 
ascii-only lines. Untested:

def detect_encoding(instream, encoding, detect_lines, skip_ascii=True):
    if encoding is None:
        encoding = instream.encoding
        if encoding is None:
            if skip_ascii:
                try:
                    for line in instream:
                        yield line.decode("ascii")
                except UnicodeDecodeError:
                    pass
                else:
                    return
            head = [line]
            head.extend(islice(instream, detect_lines-1))
            encoding =  chardet.detect("".join(head))["encoding"]
            instream = chain(head, instream)
    for line in instream:
        yield line.decode(encoding)

Or keep two lists, one with all, and one with only non-ascii lines, and read 
lines until there are enough lines in the list of non-ascii strings to make 
a good guess. Then take that list to determine the encoding.

You can even combine both approaches...
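
The two-list idea might be sketched like this in Python 3 (bytes.isascii() needs 3.7 or later). The function name and the "wanted" threshold are assumptions of the sketch; the detector itself is left to the caller.

```python
from itertools import chain


def sample_non_ascii(byte_lines, wanted=100):
    """Read until `wanted` non-ASCII lines were seen or the input ends.

    Returns (sample, lines). `sample` joins only the non-ASCII lines and
    is what a detector should look at; `lines` replays every line read
    so far followed by the unread remainder.
    """
    byte_lines = iter(byte_lines)
    seen, non_ascii = [], []
    for line in byte_lines:
        seen.append(line)              # list one: everything read so far
        if not line.isascii():
            non_ascii.append(line)     # list two: lines worth analysing
            if len(non_ascii) >= wanted:
                break
    return b"".join(non_ascii), chain(seen, byte_lines)
```

The trade-off is memory: if non-ASCII lines are rare, the list of all lines grows until enough of them turn up, which is where combining this with the generator above would help.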

> - If the buffer was bigger than a few megabytes, the encoding returned
>   by libmagic was always None. The big files had very long lines ( more
>   than 4k per line ), so with detect_lines=1000 this limit was exceeded.
> 
> - The magic.buffer() ( the equivalent of chardet.detect() ) takes about 2
> seconds
>   per megabyte buffer.


#53707

From	Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date	2013-09-05 15:25 +0200
Message-ID	<mailman.95.1378387574.5461.python-list@python.org>
In reply to	#53147

Am 05.09.2013 10:33, schrieb Peter Otten:
> Kurt Mueller wrote:
>> Am 29.08.2013 11:12, schrieb Peter Otten:
>>> kurt.alfred.mueller@gmail.com wrote:
>>>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>>>> For some text manipulation tasks I need a template to split lines
>>>>>> from stdin into a list of strings the way shlex.split() does it.
>>>>>> The encoding of the input can vary.
>> I took your script as a template.
>> But I used the libmagic library (python-magic) instead of chardet.
>> See http://linux.die.net/man/3/libmagic
>> and https://github.com/ahupp/python-magic
>> ( I made tests with files of different size, up to 1.2 [GB] )
>> I had the following issues:
>> - In a real file, the encoding was detected as 'ascii' for
>> detect_lines=1000.
>>   In line 1002 there was an umlaut character, so
>>   line.decode(encoding) failed. I am thinking of adding the errors parameter:
>>   line.decode(encoding, errors='replace')
> 
> Tough luck ;) You could try and tackle the problem by skipping leading 
> ascii-only lines. Untested:
> 
> def detect_encoding(instream, encoding, detect_lines, skip_ascii=True):
>     if encoding is None:
>         encoding = instream.encoding
>         if encoding is None:
>             if skip_ascii:
>                 try:
>                     for line in instream:
>                         yield line.decode("ascii")
>                 except UnicodeDecodeError:
>                     pass
>                 else:
>                     return
>             head = [line]
>             head.extend(islice(instream, detect_lines-1))
>             encoding =  chardet.detect("".join(head))["encoding"]
>             instream = chain(head, instream)
>     for line in instream:
>         yield line.decode(encoding)

I find this generator-based solution very nice.
With just a few small modifications it runs fine for now.
( line is undefined if skip_ascii is False. )

For ASCII-only files, chardet or libmagic is never invoked,
and detect_lines does not come into play until the first
non-ASCII character appears.

------------------------------------------------------------------------------
def decode_stream_lines( inpt_strm, enco_type, numb_inpt, skip_asci=True, ):
    if enco_type is None:
        enco_type = inpt_strm.encoding
        if enco_type is None:
            line_head = []
            if skip_asci:
                try:
                    for line in inpt_strm:
                        yield line.decode( 'ascii' )
                except UnicodeDecodeError:
                    line_head = [ line ] # last line was not ascii
                else:
                    return # all lines were ascii
            line_head.extend( islice( inpt_strm, numb_inpt - 1 ) )
            magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
            magc_enco.load()
            enco_type = magc_enco.buffer( "".join( line_head ) )
            magc_enco.close()
            print( I_AM + '-ERROR: enco_type=' + repr( enco_type ), file=sys.stderr, ) # I_AM is defined elsewhere in the script
            if  enco_type is None or enco_type.rfind( 'binary' ) >= 0: # None (detection failed), binary, application/mswordbinary, application/vnd.ms-excelbinary and the like
                return
            inpt_strm = chain( line_head, inpt_strm )
    for line in inpt_strm:
        yield line.decode( enco_type, errors='replace' )
------------------------------------------------------------------------------


Thank you very much!
-- 
Kurt Mueller

