| Started by | Dave Angel <davea@davea.name> |
|---|---|
| First post | 2013-08-28 11:13 +0000 |
| Last post | 2013-09-05 15:25 +0200 |
| Articles | 8 — 4 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: split lines from stdin into a list of unicode strings Dave Angel <davea@davea.name> - 2013-08-28 11:13 +0000
Re: split lines from stdin into a list of unicode strings kurt.alfred.mueller@gmail.com - 2013-08-28 05:39 -0700
Re: split lines from stdin into a list of unicode strings Peter Otten <__peter__@web.de> - 2013-08-29 11:12 +0200
Re: split lines from stdin into a list of unicode strings Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-29 13:31 +0200
Re: split lines from stdin into a list of unicode strings Peter Otten <__peter__@web.de> - 2013-08-29 15:15 +0200
Re: split lines from stdin into a list of unicode strings Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-09-05 09:42 +0200
Re: split lines from stdin into a list of unicode strings Peter Otten <__peter__@web.de> - 2013-09-05 10:33 +0200
Re: split lines from stdin into a list of unicode strings Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-09-05 15:25 +0200
| From | Dave Angel <davea@davea.name> |
|---|---|
| Date | 2013-08-28 11:13 +0000 |
| Subject | Re: split lines from stdin into a list of unicode strings |
| Message-ID | <mailman.302.1377688438.19984.python-list@python.org> |
On 28/8/2013 04:32, Kurt Mueller wrote:
> This is a follow up to the Subject
> "right adjusted strings containing umlauts"
You started a new thread, with a new subject line. So presumably we're
starting over with a clean slate.
>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.
Does that mean it'll vary from one run of the program to the next, or
it'll vary from one line to the next? Your code below assumes the
latter. That can greatly increase the unreliability of the already
dubious chardet algorithm.
> For further processing in Python I need the list of strings to be in unicode.
>
> Here is template.py:
>
> ##############################################################################################################
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
> # split lines from stdin into a list of unicode strings
> # Muk 2013-08-23
> # Python 2.7.3
>
> from __future__ import print_function
> import sys
> import shlex
> import chardet
Is this the one ?
https://pypi.python.org/pypi/chardet
>
> bool_cmnt = True # shlex: skip comments
> bool_posx = True # shlex: posix mode (strings in quotes)
>
> for inpt_line in sys.stdin:
>     print( 'inpt_line=' + repr( inpt_line ) )
>     enco_type = chardet.detect( inpt_line )[ 'encoding' ] # {'encoding': 'EUC-JP', 'confidence': 0.99}
>     print( 'enco_type=' + repr( enco_type ) )
>     try:
>         strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode
But shlex does, since you're using Python 2.7.3
>     except Exception, errr: # usually 'No closing quotation'
>         print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
>         continue
>     print( 'strg_inpt=' + repr( strg_inpt ) ) # list of strings
>     strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ] # decode the strings into unicode
>     print( 'strg_unic=' + repr( strg_unic ) ) # list of unicode strings
> ##############################################################################################################
>
> $ cat <some-file> | template.py
>
Why not have a separate filter that converts from a (guessed) encoding
into utf-8, and have the later stage(s) assume utf-8 ? That way, the
filter could be fed clues by the user, or replaced entirely, without
affecting the main code you're working on.
Alternatively, just add a commandline argument with the encoding, and
parse it into enco_type.
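
A minimal sketch of the command-line variant, written for Python 3 rather than the 2.7.3 used in this thread. It decodes each line first and only then hands text to shlex.split(). The helper names split_lines and run, the option -e/--encoding and its utf-8 default are assumptions for illustration, not part of the original template.

```python
import argparse
import shlex
import sys


def split_lines(byte_lines, encoding, comments=True, posix=True):
    """Decode each raw line with a caller-supplied encoding, then shlex-split it."""
    for raw in byte_lines:
        # Strict decoding: a wrong encoding raises UnicodeDecodeError here.
        line = raw.decode(encoding)
        try:
            yield shlex.split(line, comments=comments, posix=posix)
        except ValueError as exc:  # usually "No closing quotation"
            print("error=%r on line=%r" % (str(exc), line.rstrip()),
                  file=sys.stderr)


def run(argv, byte_lines):
    """Parse the (hypothetical) -e option and print the split lines."""
    parser = argparse.ArgumentParser()
    # The default is an assumption; the thread does not name one.
    parser.add_argument("-e", "--encoding", default="utf-8")
    args = parser.parse_args(argv)
    for parts in split_lines(byte_lines, args.encoding):
        print(parts)
```

In a script this would be called as run(sys.argv[1:], sys.stdin.buffer), since sys.stdin.buffer yields undecoded bytes. Because nothing is guessed, no read-ahead buffer is needed.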
--
DaveA
| From | kurt.alfred.mueller@gmail.com |
|---|---|
| Date | 2013-08-28 05:39 -0700 |
| Message-ID | <b67f4179-53c6-4bd4-b9ea-8852c9048be0@googlegroups.com> |
| In reply to | #53127 |
On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
> On 28/8/2013 04:32, Kurt Mueller wrote:
> > For some text manipulation tasks I need a template to split lines
> > from stdin into a list of strings the way shlex.split() does it.
> > The encoding of the input can vary.
> Does that mean it'll vary from one run of the program to the next, or
> it'll vary from one line to the next? Your code below assumes the
> latter. That can greatly increase the unreliability of the already
> dubious chardet algorithm.
The encoding only varies from one launch to the other.
The reason I process each line is memory usage.
Option to have a better reliability of chardet:
I could read all of the input, save the input lines for further
processing in a list, feed the lines into
chardet.universaldetector.UniversalDetector.feed()/close()/result()
and then decode and split/shlex the lines in the list.
That way the chardet oracle would be more reliable, but
roughly twice as much memory will be used.
> > import chardet
> Is this the one ?
> https://pypi.python.org/pypi/chardet
Yes.
> > $ cat <some-file> | template.py
> Why not have a separate filter that converts from a (guessed) encoding
> into utf-8, and have the later stage(s) assume utf-8 ? That way, the
> filter could be fed clues by the user, or replaced entirely, without
> affecting the main code you're working on.
Working on UNIX-like systems (I am happy to work in a MSFZ)
the processing pipe would be then:
cat <some-file> | recode2utf8 | splitlines.py
memory usage 2 * <some-file> ( plus chardet memory usage )
> Alternatively, just add a commandline argument with the encoding, and
> parse it into enco_type.
cat <some-file> | splitlines.py -e latin9
memory usage 1 * <some-file>
or
cat <some-file> | splitlines.py -e $( codingdetect <some-file> )
memory usage 1 * <some-file>
So, because memory usage is not primary,
I think I will go with the option described above.
--
Kurt Müller
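
The whole-input option described above could look roughly like the following Python 3 sketch. The function name detect_incrementally is hypothetical. The detector argument is any object offering feed(), close(), a done flag and a result dict, which is the interface of chardet's UniversalDetector (there, result is an attribute rather than a method). Passing the detector in keeps the sketch free of a hard dependency on chardet.

```python
def detect_incrementally(byte_lines, detector):
    """Feed raw lines to a detector while saving them for later decoding.

    Returns (encoding, saved_lines). Every line is kept in memory, which is
    the extra cost mentioned in the post above.
    """
    saved = []
    for raw in byte_lines:
        saved.append(raw)
        # Stop feeding once the detector reports it has made up its mind,
        # but keep reading so that no input line is lost.
        if not detector.done:
            detector.feed(raw)
    detector.close()
    return detector.result["encoding"], saved


# Possible use with chardet (assumed to be installed):
#   from chardet.universaldetector import UniversalDetector
#   encoding, lines = detect_incrementally(sys.stdin.buffer, UniversalDetector())
```

The saved lines can then be decoded with the returned encoding and split as before. The encoding may be None if the detector could not decide, so a caller would need a fallback.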
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-08-29 11:12 +0200 |
| Message-ID | <mailman.352.1377767550.19984.python-list@python.org> |
| In reply to | #53147 |
kurt.alfred.mueller@gmail.com wrote:
> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>> On 28/8/2013 04:32, Kurt Mueller wrote:
>> > For some text manipulation tasks I need a template to split lines
>> > from stdin into a list of strings the way shlex.split() does it.
>> > The encoding of the input can vary.
>
>> Does that mean it'll vary from one run of the program to the next, or
>> it'll vary from one line to the next? Your code below assumes the
>> latter. That can greatly increase the unreliability of the already
>> dubious chardet algorithm.
>
> The encoding only varies from one launch to the other.
> The reason I process each line is memory usage.
>
> Option to have a better reliability of chardet:
> I could read all of the input, save the input lines for further
> processing in a list, feed the lines into
> chardet.universaldetector.UniversalDetector.feed()/close()/result()
> and then decode and split/shlex the lines in the list.
> That way the chardet oracle would be more reliable, but
> roughly twice as much memory will be used.
You can compromise and read ahead a limited number of lines. Here's my demo
script (The interesting part is detect_encoding(), I got a bit distracted by
unrelated stuff...). The script does one extra decode/encode cycle -- it
should be easy to avoid that if you run into performance issues.
#!/usr/bin/env python
import sys
import shlex
import chardet
from itertools import islice, chain

def detect_encoding(instream, encoding, detect_lines):
    if encoding is None:
        encoding = instream.encoding
        if encoding is None:
            head = list(islice(instream, detect_lines))
            encoding = chardet.detect("".join(head))["encoding"]
            instream = chain(head, instream)
    return encoding, instream

def split_line(line, comments=True, posix=True):
    parts = shlex.split(line.encode("utf-8"),
                        comments=comments, posix=posix)
    return [part.decode("utf-8") for part in parts]

def to_int(s):
    """
    >>> to_int(" 42")
    42
    >>> to_int("-1") is None
    True
    >>> to_int(" NONE ") is None
    True
    >>> to_int("none") is None
    True
    >>> to_int(" 0x400 ")
    1024
    """
    s = s.lower().strip()
    if s in {"none", "-1"}: return None
    return int(s, 16 if s.startswith("0x") else 10)

def main():
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument("-e", "--encoding")
    parser.add_argument(
        "-d", "--detect-lines", type=to_int, default=100,
        help=("number of lines used to determine encoding; "
              "'none' or -1 for whole file. (default: 100)"))
    args = parser.parse_args()
    encoding, instream = detect_encoding(
        sys.stdin,
        encoding=args.encoding, detect_lines=args.detect_lines)
    lines = (line.decode(encoding) for line in instream)
    for line in lines:
        try:
            parts = split_line(line)
        except ValueError as exc:
            print >> sys.stderr, exc
        else:
            print parts

if __name__ == "__main__":
    main()
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2013-08-29 13:31 +0200 |
| Message-ID | <mailman.361.1377775933.19984.python-list@python.org> |
| In reply to | #53147 |
On 29.08.2013 11:12, Peter Otten wrote:
> kurt.alfred.mueller@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.
> You can compromise and read ahead a limited number of lines. Here's my demo
> script (The interesting part is detect_encoding(), I got a bit distracted by
> unrelated stuff...). The script does one extra decode/encode cycle -- it
> should be easy to avoid that if you run into performance issues.
Thanks Peter!
I see the idea. It limits the buffersize/memory usage for the detection.
I have to say that I am a bit disappointed by the chardet library.
The encoding for the single character 'ü'
is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
whereas "file" says:
$ echo "ü" | file -i -
/dev/stdin: text/plain; charset=utf-8
$
"ü" is a character I use very often, as it is in my name: "Müller":-)
I try to use the "python-magic" library which has a similar functionality
as chardet and is used by the "file" unix-command and it is expandable
with a magicfile, see "man file".
My magic_test script:
-------------------------------------------------------------------
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
from __future__ import print_function
import magic
strg_chck = 'ü'
magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
magc_enco.load()
print( strg_chck + ' encoding=' + magc_enco.buffer( strg_chck ) )
magc_enco.close()
-------------------------------------------------------------------
$ magic_test
ü encoding=utf-8
python-magic seems to me a bit more reliable.
Cheers
--
Kurt Mueller
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-08-29 15:15 +0200 |
| Message-ID | <mailman.365.1377782100.19984.python-list@python.org> |
| In reply to | #53147 |
Kurt Mueller wrote:
> I have to say that I am a bit disappointed by the chardet library.
> The encoding for the single character 'ü'
> is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
> whereas "file" says:
> $ echo "ü" | file -i -
> /dev/stdin: text/plain; charset=utf-8
> $
>
> "ü" is a character I use very often, as it is in my name: "Müller":-)
You cannot determine an encoding by a single letter.
Why should "ü" be more likely than "端"? The only thing you can blame chardet
for is that its confidence rating is a flat out lie...
For "Müller" on the other hand you could probably come up with a (simple)
heuristic that "ü" is more likely to be surrounded by ascii-letters than
"端".
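
The neighbour heuristic Peter Otten hints at can be sketched in Python 3 as follows. This is one possible reading of his remark, not his code: the names latin_neighbour_score and pick_encoding, the +1/-1 scoring and the candidate list are all assumptions for illustration. Each candidate decoding is scored by whether the non-ASCII characters touching an ASCII letter are themselves Latin letters.

```python
import unicodedata


def is_ascii_letter(ch):
    return ord(ch) < 128 and ch.isalpha()


def latin_neighbour_score(text):
    """+1 for a non-ASCII Latin letter next to an ASCII letter, -1 for any
    other non-ASCII character next to an ASCII letter."""
    score = 0
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            continue
        neighbours = text[max(i - 1, 0):i] + text[i + 1:i + 2]
        if any(is_ascii_letter(n) for n in neighbours):
            is_latin = unicodedata.name(ch, "").startswith("LATIN")
            score += 1 if is_latin else -1
    return score


def pick_encoding(raw, candidates):
    """Return the candidate whose decoding scores highest (first one wins ties)."""
    best, best_score = None, None
    for encoding in candidates:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding at all
        score = latin_neighbour_score(text)
        if best_score is None or score > best_score:
            best, best_score = encoding, score
    return best


raw = "Müller".encode("utf-8")
print(pick_encoding(raw, ["euc_jp", "latin-1", "utf-8"]))
```

For a lone "ü" there are no neighbours, every candidate scores zero and the first one in the list wins. The UTF-8 bytes of "ü", 0xC3 0xBC, are also a valid EUC-JP character, so a single letter gives the heuristic nothing to work with.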
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2013-09-05 09:42 +0200 |
| Message-ID | <mailman.80.1378366986.5461.python-list@python.org> |
| In reply to | #53147 |
On 29.08.2013 11:12, Peter Otten wrote:
> kurt.alfred.mueller@gmail.com wrote:
>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>> For some text manipulation tasks I need a template to split lines
>>>> from stdin into a list of strings the way shlex.split() does it.
>>>> The encoding of the input can vary.
> You can compromise and read ahead a limited number of lines. Here's my demo
> script (The interesting part is detect_encoding(), I got a bit distracted by
> unrelated stuff...). The script does one extra decode/encode cycle -- it
> should be easy to avoid that if you run into performance issues.
I took your script as a template.
But I used the libmagic library (python-magic) instead of chardet.
See http://linux.die.net/man/3/libmagic
and https://github.com/ahupp/python-magic
( I made tests with files of different size, up to 1.2 [GB] )
I had the following issues:
- In a real file, the encoding was detected as 'ascii' for detect_lines=1000.
  In line 1002 there was an umlaut character, so line.decode(encoding) failed.
  I am thinking of adding the errors parameter:
  line.decode(encoding, errors='replace')
- If the buffer was bigger than a few megabytes, the encoding returned by
  libmagic was always None. The big files had very long lines ( more than 4k
  per line ), so with detect_lines=1000 this limit was exceeded.
- magic.buffer() ( the equivalent of chardet.detect() ) takes about 2 seconds
  per megabyte of buffer.
--
Kurt Mueller
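
The effect of the errors parameter in the first issue can be shown with a short Python 3 sketch (bytes.decode() takes the same argument as the Python 2 str.decode() used in this thread). The helper name decode_lines is an assumption for illustration.

```python
def decode_lines(byte_lines, encoding, errors="replace"):
    """Decode raw lines; with errors="replace", bytes that are invalid in
    `encoding` become U+FFFD instead of raising UnicodeDecodeError."""
    for raw in byte_lines:
        yield raw.decode(encoding, errors=errors)


# The sample was pure ASCII, but a later line contains a latin-1 umlaut.
lines = [b"plain ascii\n", "Müller\n".encode("latin-1")]
for text in decode_lines(lines, "ascii"):
    print(repr(text))
```

This keeps the script running, but the umlaut itself is lost: the output only records that some byte could not be decoded. If the original characters matter, the sample used for detection has to include non-ASCII lines.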
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-09-05 10:33 +0200 |
| Message-ID | <mailman.82.1378370024.5461.python-list@python.org> |
| In reply to | #53147 |
Kurt Mueller wrote:
> On 29.08.2013 11:12, Peter Otten wrote:
>> kurt.alfred.mueller@gmail.com wrote:
>>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>>> For some text manipulation tasks I need a template to split lines
>>>>> from stdin into a list of strings the way shlex.split() does it.
>>>>> The encoding of the input can vary.
>
>> You can compromise and read ahead a limited number of lines. Here's my
>> demo script (The interesting part is detect_encoding(), I got a bit
>> distracted by unrelated stuff...). The script does one extra
>> decode/encode cycle -- it should be easy to avoid that if you run into
>> performance issues.
>
> I took your script as a template.
> But I used the libmagic library (python-magic) instead of chardet.
> See http://linux.die.net/man/3/libmagic
> and https://github.com/ahupp/python-magic
> ( I made tests with files of different size, up to 1.2 [GB] )
>
> I had the following issues:
>
> - In a real file, the encoding was detected as 'ascii' for
> detect_lines=1000.
> In line 1002 there was an umlaut character. So then the
> line.decode(encoding) failed. I think to add the errors parameter,
> line.decode(encoding, errors='replace')
Tough luck ;) You could try and tackle the problem by skipping leading
ascii-only lines. Untested:
def detect_encoding(instream, encoding, detect_lines, skip_ascii=True):
    if encoding is None:
        encoding = instream.encoding
        if encoding is None:
            if skip_ascii:
                try:
                    for line in instream:
                        yield line.decode("ascii")
                except UnicodeDecodeError:
                    pass
                else:
                    return
            head = [line]
            head.extend(islice(instream, detect_lines-1))
            encoding = chardet.detect("".join(head))["encoding"]
            instream = chain(head, instream)
    for line in instream:
        yield line.decode(encoding)
Or keep two lists, one with all, and one with only non-ascii lines, and read
lines until there are enough lines in the list of non-ascii strings to make
a good guess. Then take that list to determine the encoding.
You can even combine both approaches...
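
The two-list idea can be sketched in Python 3 like this. The function name sample_until_non_ascii and the parameter wanted are hypothetical. The function must be given an iterator (not a list), so that the unread remainder of the input stays available to the caller.

```python
def sample_until_non_ascii(line_iter, wanted):
    """Read raw lines until `wanted` non-ASCII lines have been collected.

    Returns (seen, non_ascii): every line read so far, and the subset that
    contains at least one byte above 127.
    """
    seen, non_ascii = [], []
    for raw in line_iter:
        seen.append(raw)
        if any(byte > 127 for byte in raw):
            non_ascii.append(raw)
            if len(non_ascii) >= wanted:
                break
    return seen, non_ascii
```

The non_ascii list would be joined and handed to the detector, and processing would then continue over itertools.chain(seen, line_iter). Note that seen grows with every ASCII-only line read, so for mostly-ASCII input the memory use approaches that of reading the whole file. Yielding the ASCII-only lines immediately, as in the generator above, avoids that.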
> - If the buffer was bigger than about some Megabytes, the returned encoding
>   from libmagic was always None. The big files had very long lines ( more
>   than 4k per line ). So with detect_lines=1000 this limit was exceeded.
>
> - The magic.buffer() ( the equivalent of chardet.detect() ) takes about 2
>   seconds per megabyte buffer.
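
One way to keep the detection buffer small even with very long lines is to limit the sample by bytes instead of by line count. The following Python 3 sketch assumes that approach. The names read_head and sample_and_rest are hypothetical, and a suitable max_bytes value is not given in this thread, so it would have to be found by experiment.

```python
from itertools import chain


def read_head(line_iter, max_bytes):
    """Collect leading raw lines until their total size reaches max_bytes.

    Whole lines are kept, so the result can exceed max_bytes by at most
    one line.
    """
    head, size = [], 0
    for raw in line_iter:
        head.append(raw)
        size += len(raw)
        if size >= max_bytes:
            break
    return head


def sample_and_rest(stream, max_bytes):
    """Return (sample_bytes, all_lines): a bounded sample for the detector
    and an iterator that still yields every input line."""
    line_iter = iter(stream)
    head = read_head(line_iter, max_bytes)
    return b"".join(head), chain(head, line_iter)
```

Since the reported detection time grows with buffer size (about 2 seconds per megabyte), a byte limit also bounds the time spent in the detector.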
| From | Kurt Mueller <kurt.alfred.mueller@gmail.com> |
|---|---|
| Date | 2013-09-05 15:25 +0200 |
| Message-ID | <mailman.95.1378387574.5461.python-list@python.org> |
| In reply to | #53147 |
On 05.09.2013 10:33, Peter Otten wrote:
> Kurt Mueller wrote:
>> On 29.08.2013 11:12, Peter Otten wrote:
>>> kurt.alfred.mueller@gmail.com wrote:
>>>> On Wednesday, August 28, 2013 1:13:36 PM UTC+2, Dave Angel wrote:
>>>>> On 28/8/2013 04:32, Kurt Mueller wrote:
>>>>>> For some text manipulation tasks I need a template to split lines
>>>>>> from stdin into a list of strings the way shlex.split() does it.
>>>>>> The encoding of the input can vary.
>> I took your script as a template.
>> But I used the libmagic library (python-magic) instead of chardet.
>> See http://linux.die.net/man/3/libmagic
>> and https://github.com/ahupp/python-magic
>> ( I made tests with files of different size, up to 1.2 [GB] )
>> I had the following issues:
>> - In a real file, the encoding was detected as 'ascii' for
>> detect_lines=1000.
>> In line 1002 there was an umlaut character. So then the
>> line.decode(encoding) failed. I think to add the errors parameter,
>> line.decode(encoding, errors='replace')
>
> Tough luck ;) You could try and tackle the problem by skipping leading
> ascii-only lines. Untested:
>
> def detect_encoding(instream, encoding, detect_lines, skip_ascii=True):
>     if encoding is None:
>         encoding = instream.encoding
>         if encoding is None:
>             if skip_ascii:
>                 try:
>                     for line in instream:
>                         yield line.decode("ascii")
>                 except UnicodeDecodeError:
>                     pass
>                 else:
>                     return
>             head = [line]
>             head.extend(islice(instream, detect_lines-1))
>             encoding = chardet.detect("".join(head))["encoding"]
>             instream = chain(head, instream)
>     for line in instream:
>         yield line.decode(encoding)
I find this solution as a generator very nice.
With just some small modifications it runs fine for now.
( line is undefined if skip_ascii is False. )
For ASCII-only files, chardet or libmagic is never invoked.
And detect_lines does not come into play until the first
non-ASCII characters appear.
------------------------------------------------------------------------------
def decode_stream_lines( inpt_strm, enco_type, numb_inpt, skip_asci=True, ):
    if enco_type is None:
        enco_type = inpt_strm.encoding
        if enco_type is None:
            line_head = []
            if skip_asci:
                try:
                    for line in inpt_strm:
                        yield line.decode( 'ascii' )
                except UnicodeDecodeError:
                    line_head = [ line ] # last line was not ascii
                else:
                    return # all lines were ascii
            line_head.extend( islice( inpt_strm, numb_inpt - 1 ) )
            magc_enco = magic.open( magic.MAGIC_MIME_ENCODING )
            magc_enco.load()
            enco_type = magc_enco.buffer( "".join( line_head ) )
            magc_enco.close()
            print( I_AM + '-ERROR: enco_type=' + repr( enco_type ), file=sys.stderr, )
            if enco_type.rfind( 'binary' ) >= 0: # binary, application/mswordbinary, application/vnd.ms-excelbinary and the like
                return
            inpt_strm = chain( line_head, inpt_strm )
    for line in inpt_strm:
        yield line.decode( enco_type, errors='replace' )
------------------------------------------------------------------------------
Thank you very much!
--
Kurt Mueller