Newsgroup: comp.lang.python

right adjusted strings containing umlauts

Started by: Kurt Mueller <kurt.alfred.mueller@gmail.com>
First post: 2013-08-08 16:23 +0200
Last post: 2013-08-28 04:17 -0700
Articles: 19 (11 participants)


Contents

  right adjusted strings containing umlauts Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-08 16:23 +0200
    Re: right adjusted strings containing umlauts Neil Cerutti <neilc@norwich.edu> - 2013-08-08 14:40 +0000
      Re: right adjusted strings containing umlauts MRAB <python@mrabarnett.plus.com> - 2013-08-08 16:19 +0100
    Re: right adjusted strings containing umlauts jfharden@gmail.com - 2013-08-08 07:43 -0700
      Re: right adjusted strings containing umlauts Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-08 17:24 +0200
        Re: right adjusted strings containing umlauts Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-08-10 01:29 +0000
      Re: right adjusted strings containing umlauts Peter Otten <__peter__@web.de> - 2013-08-08 17:44 +0200
      Re: right adjusted strings containing umlauts Dave Angel <davea@davea.name> - 2013-08-08 15:50 +0000
      Re: right adjusted strings containing umlauts Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-08 18:16 +0200
      Re: right adjusted strings containing umlauts Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-08 18:27 +0200
        Re: right adjusted strings containing umlauts wxjmfauth@gmail.com - 2013-08-09 01:30 -0700
      Re: right adjusted strings containing umlauts Peter Otten <__peter__@web.de> - 2013-08-08 18:34 +0200
      Re: right adjusted strings containing umlauts Chris Angelico <rosuav@gmail.com> - 2013-08-08 17:37 +0100
      Re: right adjusted strings containing umlauts Dave Angel <davea@davea.name> - 2013-08-08 17:47 +0000
      Re: right adjusted strings containing umlauts Terry Reedy <tjreedy@udel.edu> - 2013-08-08 16:51 -0400
      Re: right adjusted strings containing umlauts Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-23 17:47 +0200
      Re: right adjusted strings containing umlauts Kurt Mueller <kurt.alfred.mueller@gmail.com> - 2013-08-28 10:01 +0200
      Re: right adjusted strings containing umlauts Dave Angel <davea@davea.name> - 2013-08-28 10:23 +0000
        Re: right adjusted strings containing umlauts kurt.alfred.mueller@gmail.com - 2013-08-28 04:17 -0700

#52198 — right adjusted strings containing umlauts

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2013-08-08 16:23 +0200
Subject: right adjusted strings containing umlauts
Message-ID: <mailman.352.1375972418.1251.python-list@python.org>

I'd like to print strings right adjusted.
( Python 2.7.3, Linux 3.4.47-2.38-desktop )

from __future__ import print_function
print( '>{0:>3}<'.format( 'a' ) )
>  a<

But if the string contains an Umlaut:
print( '>{0:>3}<'.format( 'ä' ) )
> ä<

Same with % notation:
print( '>%3s<' % ( 'a' ) )
>  a<
print( '>%3s<' % ( 'ä' ) )
> ä<

For a string with no Umlaut it uses 3 characters, but for an Umlaut
it uses only 2 characters.

I guess it has to do with unicode.
How do I get it right?


TIA
-- 
Kurt Mueller


#52199

From: Neil Cerutti <neilc@norwich.edu>
Date: 2013-08-08 14:40 +0000
Message-ID: <b6hourF5uu0U1@mid.individual.net>
In reply to: #52198

On 2013-08-08, Kurt Mueller <kurt.alfred.mueller@gmail.com> wrote:
> I'd like to print strings right adjusted.
> ( Python 2.7.3, Linux 3.4.47-2.38-desktop )
>
> from __future__ import print_function
> print( '>{0:>3}<'.format( 'a' ) )
>>  a<
>
> But if the string contains an Umlaut:
> print( '>{0:>3}<'.format( 'ä' ) )
>> ä<
>
> Same with % notation:
> print( '>%3s<' % ( 'a' ) )
>>  a<
> print( '>%3s<' % ( 'ä' ) )
>> ä<
>
> For a string with no Umlaut it uses 3 characters, but for an
> Umlaut it uses only 2 characters.
>
> I guess it has to do with unicode.
> How do I get it right?

You guessed it!

Use unicode strings instead of byte strings, e.g., u"...".
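
To see why the byte string comes out one column short, here is a minimal sketch. It is written so that it also runs on Python 3, where `bytes` plays the role of the Python 2 byte string; the variable names are only illustrative. The UTF-8 encoding of 'ä' is two bytes, so a byte-oriented width of 3 leaves room for a single space of padding, while the unicode string is one character and gets two.

```python
# -*- coding: utf-8 -*-
text = u'\xe4'                  # LATIN SMALL LETTER A WITH DIAERESIS
raw = text.encode('utf-8')      # what a Python 2 byte string holds on a UTF-8 system

print(len(raw))                 # 2: the width is counted in bytes
print(len(text))                # 1: the width is counted in code points

# Formatting the unicode string pads to three characters as intended.
padded = u'>{0:>3}<'.format(text)
print(padded)
```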

-- 
Neil Cerutti


#52202

From: MRAB <python@mrabarnett.plus.com>
Date: 2013-08-08 16:19 +0100
Message-ID: <mailman.354.1375975172.1251.python-list@python.org>
In reply to: #52199

On 08/08/2013 15:40, Neil Cerutti wrote:
> On 2013-08-08, Kurt Mueller <kurt.alfred.mueller@gmail.com> wrote:
>> I'd like to print strings right adjusted.
>> ( Python 2.7.3, Linux 3.4.47-2.38-desktop )
>>
>> from __future__ import print_function
>> print( '>{0:>3}<'.format( 'a' ) )
>>>  a<
>>
>> But if the string contains an Umlaut:
>> print( '>{0:>3}<'.format( 'ä' ) )
>>> ä<
>>
>> Same with % notation:
>> print( '>%3s<' % ( 'a' ) )
>>>  a<
>> print( '>%3s<' % ( 'ä' ) )
>>> ä<
>>
>> For a string with no Umlaut it uses 3 characters, but for an
>> Umlaut it uses only 2 characters.
>>
>> I guess it has to do with unicode.
>> How do I get it right?
>
> You guessed it!
>
> Use unicode strings instead of byte strings, e.g., u"...".
>
It also matters which actual codepoints you're using in the Unicode
string.

You could have u'ä', which is one codepoint (u'\xE4' or u'\N{LATIN
SMALL LETTER A WITH DIAERESIS}'), or u'ä', which is two codepoints
(u'a\u0308' or u'\N{LATIN SMALL LETTER A}\N{COMBINING DIAERESIS}').
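
If both forms can turn up in the input, one option is to normalize to NFC before padding, so that a base letter plus combining mark collapses into the single precomposed code point where one exists. A small sketch using the standard `unicodedata` module (the variable names are illustrative only):

```python
import unicodedata

composed = u'\xe4'          # one code point
decomposed = u'a\u0308'     # 'a' followed by COMBINING DIAERESIS

# NFC composes the pair into the single precomposed code point.
nfc = unicodedata.normalize('NFC', decomposed)

print(len(decomposed))      # 2
print(len(nfc))             # 1
print(u'>{0:>3}<'.format(nfc))
```

This only helps for characters that have a precomposed form; other combining sequences stay longer than one code point after normalization.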


#52200

From: jfharden@gmail.com
Date: 2013-08-08 07:43 -0700
Message-ID: <9781df99-f9c8-4217-aa67-7a714b7f2ebe@googlegroups.com>
In reply to: #52198

On Thursday, 8 August 2013 15:23:46 UTC+1, Kurt Mueller  wrote:
> I'd like to print strings right adjusted.
> 
> print( '>{0:>3}<'.format( 'ä' ) )
> 

Make both strings unicode

print( u'>{0:>3}<'.format( u'ä' ) )

Why not use rjust for it though?

u'ä'.rjust(3)

-- 
Jonathan


#52203

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2013-08-08 17:24 +0200
Message-ID: <mailman.355.1375975522.1251.python-list@python.org>
In reply to: #52200

On 08.08.2013 16:43, jfharden@gmail.com wrote:
> On Thursday, 8 August 2013 15:23:46 UTC+1, Kurt Mueller  wrote:
>> I'd like to print strings right adjusted.
>> print( '>{0:>3}<'.format( 'ä' ) )
> 
> Make both strings unicode
> print( u'>{0:>3}<'.format( u'ä' ) )
> Why not use rjust for it though?
> u'ä'.rjust(3)

In real life there is a list of strings in output_list from a command like:
output_list = shlex.split( input_string, bool_cmnt, bool_posi, )
input_string is from a file, bool_* are either True or False
repr( output_list )
['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
which should be printed right aligned.
using:
print( u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'.format( *output_list ) )
( In real life, the alignment and the width are variable )

How do I prepare output_list the pythonic way to be unicode strings?
What do I do, when input_strings/output_list has other codings like iso-8859-1?

TIA
-- 
Kurt Mueller


#52290

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-08-10 01:29 +0000
Message-ID: <52059790$0$30000$c3e8da3$5496439d@news.astraweb.com>
In reply to: #52203

On Thu, 08 Aug 2013 17:24:49 +0200, Kurt Mueller wrote:

> What do I do, when input_strings/output_list has other codings like
> iso-8859-1?

When reading from a text file, honour some sort of encoding cookie at the 
top (or bottom) of the file, like Emacs and Vim use, or a BOM. If there 
is no encoding cookie, assume UTF-8.

When reading from stdin, assume UTF-8.

Otherwise, make it the caller's responsibility to specify the encoding if 
they wish to use something else.

Pseudo-code:

encoding = None

if command line arguments include '--encoding':
    encoding = --encoding argument

if encoding is None:
    if input file is stdin:
        encoding = 'utf-8'
    else:
        open file as binary
        if first 2-4 bytes look like a BOM:
            encoding = one of UTF-8 or UTF-16 or UTF-32
        else:
            read first two lines 
            if either looks like an encoding cookie:
                encoding = cookie
            # optionally check the end of the file as well
        close file

if encoding is None:
    encoding = 'utf-8'

read from file using encoding
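
The encoding-selection part of the pseudo-code above could be turned into something like the following. This is only a sketch under a few assumptions: the function name `sniff_encoding` and its parameters are made up for illustration, the cookie pattern is the loose `coding[:=]` form that Emacs and Vim style comments share, and only the first two lines of the data are inspected (not the end of the file).

```python
import codecs
import re

# Check the longer BOMs first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

# Loose cookie pattern, e.g. "coding: latin-1" or "fileencoding=utf-8".
_COOKIE = re.compile(br'coding[:=]\s*([-\w.]+)')


def sniff_encoding(head, explicit=None, default='utf-8'):
    """Pick an encoding for raw bytes `head` (the start of the input)."""
    if explicit:                        # the caller's choice always wins
        return explicit
    for bom, name in _BOMS:             # then a byte order mark, if any
        if head.startswith(bom):
            return name
    for line in head.splitlines()[:2]:  # then a cookie in the first two lines
        match = _COOKIE.search(line)
        if match:
            return match.group(1).decode('ascii')
    return default                      # otherwise assume UTF-8


head = b'# vim: set fileencoding=latin-1 :\nsome text\n'
print(sniff_encoding(head))
```

The result is only a name; whether the data really decodes with it still has to be checked when reading.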




-- 
Steven


#52204

From: Peter Otten <__peter__@web.de>
Date: 2013-08-08 17:44 +0200
Message-ID: <mailman.356.1375976674.1251.python-list@python.org>
In reply to: #52200

Kurt Mueller wrote:

> On 08.08.2013 16:43, jfharden@gmail.com wrote:
>> On Thursday, 8 August 2013 15:23:46 UTC+1, Kurt Mueller  wrote:
>>> I'd like to print strings right adjusted.
>>> print( '>{0:>3}<'.format( 'ä' ) )
>> 
>> Make both strings unicode
>> print( u'>{0:>3}<'.format( u'ä' ) )
>> Why not use rjust for it though?
>> u'ä'.rjust(3)
> 
> In real life there is a list of strings in output_list from a command
> like: output_list = shlex.split( input_string, bool_cmnt, bool_posi, )
> input_string is from a file, bool_* are either True or False
> repr( output_list )
> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
> which should be printed right aligned.
> using:
> print( u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'.format( *output_list ) )
> ( In real life, the alignment and the width are variable )
> 
> How do I prepare output_list the pythonic way to be unicode strings?
> What do I do, when input_strings/output_list has other codings like
> iso-8859-1?

You have to know the actual encoding. With that information it's easy:

>>> output_list
['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>> encoding = "utf-8"
>>> output_list = [s.decode(encoding) for s in output_list]
>>> print output_list
[u'\xf6', u'\xfc', u'i', u's', u'f']

Don't worry that there are still escape codes -- when you print the 
individual list items the characters will show up as expected:

>>> print ", ".join(output_list)
ö, ü, i, s, f


#52205

From: Dave Angel <davea@davea.name>
Date: 2013-08-08 15:50 +0000
Message-ID: <mailman.357.1375977051.1251.python-list@python.org>
In reply to: #52200

Kurt Mueller wrote:

> On 08.08.2013 16:43, jfharden@gmail.com wrote:
>> On Thursday, 8 August 2013 15:23:46 UTC+1, Kurt Mueller  wrote:
>>> I'd like to print strings right adjusted.
>>> print( '>{0:>3}<'.format( 'ä' ) )
>> 
>> Make both strings unicode
>> print( u'>{0:>3}<'.format( u'ä' ) )
>> Why not use rjust for it though?
>> u'ä'.rjust(3)
>
> In real life there is a list of strings in output_list from a command like:
> output_list = shlex.split( input_string, bool_cmnt, bool_posi, )
> input_string is from a file, bool_* are either True or False
> repr( output_list )
> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
> which should be printed right aligned.
> using:
> print( u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'.format( *output_list ) )
> ( In real life, the alignment and the width are variable )
>
> How do I prepare output_list the pythonic way to be unicode strings?
> What do I do, when input_strings/output_list has other codings like iso-8859-1?
>

In general, when reading from an outside device like a file, convert to
unicode immediately, while you still know the encoding used in that
particular file.  Then after all processing, worry about alignment only
when you're about to output the string.  And at that point, you're
subject to the quirks of the font as well as the quirks of the
encoding of the terminal.

As MRAB has pointed out, sometimes two code points are used to represent
a single character which will end up taking a single column.  Likewise
sometimes a single code point will take more than one "column" to
display.  Ideograms are one example, but a font which is not fixed pitch
is another.

If you're going to a standard terminal, all you can do is get close. 
This is why there are special functions for gui's to help with
alignment.
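
For a fixed-pitch terminal, "getting close" might look like the sketch below. It is an approximation built on the standard `unicodedata` module: combining marks are counted as zero columns and East Asian wide or fullwidth characters as two. The helper names `display_width` and `rjust_display` are made up for this example, and the actual terminal and font may still disagree with the estimate.

```python
import unicodedata


def display_width(text):
    """Estimate how many terminal columns `text` occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue                    # combining marks take no column
        if unicodedata.east_asian_width(ch) in ('W', 'F'):
            width += 2                  # wide and fullwidth characters
        else:
            width += 1
    return width


def rjust_display(text, columns):
    """Right-adjust by estimated display width instead of len()."""
    return u' ' * max(0, columns - display_width(text)) + text


print(display_width(u'a\u0308'))        # 1: 'a' plus combining diaeresis
print(display_width(u'\u6f22'))         # 2: a CJK ideograph
```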



-- 
DaveA


#52207

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2013-08-08 18:16 +0200
Message-ID: <mailman.358.1375978647.1251.python-list@python.org>
In reply to: #52200

On 08.08.2013 17:44, Peter Otten wrote:
> Kurt Mueller wrote:
>> What do I do, when input_strings/output_list has other codings like
>> iso-8859-1?
> 
> You have to know the actual encoding. With that information it's easy:
>>>> output_list
> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>>> encoding = "utf-8"
>>>> output_list = [s.decode(encoding) for s in output_list]
>>>> print output_list
> [u'\xf6', u'\xfc', u'i', u's', u'f']

How do I find out the actual encoding?
I read from stdin. There can be different encodings.
Usually utf8, but iso-8859-1/latin9 are also to be expected.
But sys.stdin.encoding always says 'None'.


TIA
-- 
Kurt Mueller


#52210

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2013-08-08 18:27 +0200
Message-ID: <mailman.359.1375979258.1251.python-list@python.org>
In reply to: #52200

Now I have this small example:
----------------------------------------------------------
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :

from __future__ import print_function
import sys, shlex

print( repr( sys.stdin.encoding ) )

strg_form = u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'
for inpt_line in sys.stdin:
    proc_line = shlex.split( inpt_line, False, True, )
    encoding = "utf-8"
    proc_line = [ strg.decode( encoding ) for strg in proc_line ]
    print( strg_form.format( *proc_line ) )
----------------------------------------------------------

$ echo -e "a b c d e\na ö u 1 2" | file -
/dev/stdin: UTF-8 Unicode text
$ echo -e "a b c d e\na ö u 1 2" | ./align_compact.py
None
  a   b   c   d   e
  a   ö   u   1   2
$ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | file -
/dev/stdin: ISO-8859 text
$ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | ./align_compact.py
None
  a   b   c   d   e
Traceback (most recent call last):
  File "./align_compact.py", line 13, in <module>
    proc_line = [ strg.decode( encoding ) for strg in proc_line ]
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 0: invalid start byte
muk@mcp20:/sw/prog/scripts/text_manip>

How do I handle these two inputs?


TIA
-- 
Kurt Mueller


#52254

From: wxjmfauth@gmail.com
Date: 2013-08-09 01:30 -0700
Message-ID: <9018bc25-e25e-47fb-b7ca-05c33a28b76c@googlegroups.com>
In reply to: #52210

On Thursday, 8 August 2013 18:27:06 UTC+2, Kurt Mueller wrote:
> Now I have this small example:
> ----------------------------------------------------------
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
>
> from __future__ import print_function
> import sys, shlex
>
> print( repr( sys.stdin.encoding ) )
>
> strg_form = u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'
> for inpt_line in sys.stdin:
>     proc_line = shlex.split( inpt_line, False, True, )
>     encoding = "utf-8"
>     proc_line = [ strg.decode( encoding ) for strg in proc_line ]
>     print( strg_form.format( *proc_line ) )
> ----------------------------------------------------------
>
> $ echo -e "a b c d e\na ö u 1 2" | file -
> /dev/stdin: UTF-8 Unicode text
> $ echo -e "a b c d e\na ö u 1 2" | ./align_compact.py
> None
>   a   b   c   d   e
>   a   ö   u   1   2
> $ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | file -
> /dev/stdin: ISO-8859 text
> $ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | ./align_compact.py
> None
>   a   b   c   d   e
> Traceback (most recent call last):
>   File "./align_compact.py", line 13, in <module>
>     proc_line = [ strg.decode( encoding ) for strg in proc_line ]
>   File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 0: invalid start byte
> muk@mcp20:/sw/prog/scripts/text_manip>
>
> How do I handle these two inputs?
>
> TIA
> -- 
> Kurt Mueller

--------

It's very easy.

The error message says that your series of bytes cannot be decoded
with the utf-8 codec, simply because your string is encoded
in iso-8859-* (you did that explicitly!).


Your problem is not Python, your problem is the encoding
of the characters.

You should know the encoding of the strings you are
handling (or creating) and, where necessary, decode and/or encode
them to suit your purpose, e.g. into an encoding suitable
for the display. That is the level at which Python (or any
language) matters.

The sys.std*.encoding is a different problem.

iso-8859-* ?

iso-8859-1  == latin-1  and  latin9 == iso-8859-15.

If one excepts "das grosse Eszett", both codings are
able to handle German (it seems to be your case) and
there are no problems when working directly with these
codings.


jmf



#52211

From: Peter Otten <__peter__@web.de>
Date: 2013-08-08 18:34 +0200
Message-ID: <mailman.360.1375979702.1251.python-list@python.org>
In reply to: #52200

Kurt Mueller wrote:

> On 08.08.2013 17:44, Peter Otten wrote:
>> Kurt Mueller wrote:
>>> What do I do, when input_strings/output_list has other codings like
>>> iso-8859-1?
>> 
>> You have to know the actual encoding. With that information it's easy:
>>>>> output_list
>> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>>>> encoding = "utf-8"
>>>>> output_list = [s.decode(encoding) for s in output_list]
>>>>> print output_list
>> [u'\xf6', u'\xfc', u'i', u's', u'f']
> 
> How do I find out the actual encoding?
> I read from stdin. There can be different encodings.
> Usually utf8, but iso-8859-1/latin9 are also to be expected.
> But sys.stdin.encoding always says 'None'.

Even with

$ cat funny_pic.jpg | ./mypythonscript.py

you could "successfully" (i. e. no errors) decode stdin using iso-8859-1.
So unfortunately you have to guess. 

A simple strategy is to try utf-8 and fall back to iso-8859-1 if that fails 
with a UnicodeDecodeError. There's also

https://pypi.python.org/pypi/chardet
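
The simple strategy could be sketched like this. The function name `decode_with_fallback` and its parameter names are made up for the example; the point is that the fallback codec is a single-byte encoding that accepts every byte value, so the second attempt cannot fail (it can only be wrong).

```python
def decode_with_fallback(raw, primary='utf-8', fallback='iso-8859-1'):
    """Decode bytes as `primary`; on failure, decode them as `fallback`."""
    try:
        return raw.decode(primary)
    except UnicodeDecodeError:
        # iso-8859-1 maps every byte to a character, so this always succeeds.
        return raw.decode(fallback)


print(repr(decode_with_fallback(b'\xc3\xb6')))   # valid UTF-8 for o-umlaut
print(repr(decode_with_fallback(b'\xf6')))       # not valid UTF-8, falls back
```

Both calls yield the same one-character string. For latin9 input, passing `fallback='iso-8859-15'` would be the matching choice.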


#52212

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-08-08 17:37 +0100
Message-ID: <mailman.361.1375980242.1251.python-list@python.org>
In reply to: #52200

On Thu, Aug 8, 2013 at 5:16 PM, Kurt Mueller
<kurt.alfred.mueller@gmail.com> wrote:
> On 08.08.2013 17:44, Peter Otten wrote:
>> Kurt Mueller wrote:
>>> What do I do, when input_strings/output_list has other codings like
>>> iso-8859-1?
>>
>> You have to know the actual encoding. With that information it's easy:
>>>>> output_list
>> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>>>> encoding = "utf-8"
>>>>> output_list = [s.decode(encoding) for s in output_list]
>>>>> print output_list
>> [u'\xf6', u'\xfc', u'i', u's', u'f']
>
> How do I find out the actual encoding?
> I read from stdin. There can be different encodings.
> Usually utf8, but iso-8859-1/latin9 are also to be expected.
> But sys.stdin.encoding always says 'None'.

If you can switch to Python 3, life becomes a LOT easier. The Python 3
input() function (which does the same job as raw_input() from Python
2) returns a Unicode string, meaning that it takes care of encodings
for you.
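
Note that Python 3 does not guess the encoding of piped data either; it decodes stdin with the locale's encoding unless told otherwise. If the encoding is known, the binary stream can be wrapped explicitly. A sketch, with an in-memory stream standing in for `sys.stdin.buffer` and a made-up helper name `text_lines`:

```python
import io


def text_lines(binary_stream, encoding='utf-8'):
    """Wrap a binary stream so that iterating over it yields decoded lines."""
    return io.TextIOWrapper(binary_stream, encoding=encoding)


# In a Python 3 script this would be: text_lines(sys.stdin.buffer, 'iso-8859-15')
demo = io.BytesIO(b'a \xf6 u 1 2\n')
lines = list(text_lines(demo, 'iso-8859-15'))
print(repr(lines))
```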

ChrisA


#52215

From: Dave Angel <davea@davea.name>
Date: 2013-08-08 17:47 +0000
Message-ID: <mailman.364.1375984053.1251.python-list@python.org>
In reply to: #52200

Kurt Mueller wrote:

> Now I have this small example:
> ----------------------------------------------------------
> #!/usr/bin/env python
> # vim: set fileencoding=utf-8 :
>
> from __future__ import print_function
> import sys, shlex
>
> print( repr( sys.stdin.encoding ) )
>
> strg_form = u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'
> for inpt_line in sys.stdin:
>     proc_line = shlex.split( inpt_line, False, True, )
>     encoding = "utf-8"
>     proc_line = [ strg.decode( encoding ) for strg in proc_line ]
>     print( strg_form.format( *proc_line ) )
> ----------------------------------------------------------
>
> $ echo -e "a b c d e\na ö u 1 2" | file -
> /dev/stdin: UTF-8 Unicode text
> $ echo -e "a b c d e\na ö u 1 2" | ./align_compact.py
> None
>   a   b   c   d   e
>   a   ö   u   1   2
> $ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | file -
> /dev/stdin: ISO-8859 text
> $ echo -e "a b c d e\na ö u 1 2" | recode utf8..latin9 | ./align_compact.py
> None
>   a   b   c   d   e
> Traceback (most recent call last):
>   File "./align_compact.py", line 13, in <module>
>     proc_line = [ strg.decode( encoding ) for strg in proc_line ]
>   File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xf6 in position 0: invalid start byte
> muk@mcp20:/sw/prog/scripts/text_manip>
>
> How do I handle these two inputs?
>

Once you're using pipes, you've given up any hope that the terminal will
report a useful encoding, so I'm not surprised you're getting None for
sys.stdin.encoding (an attribute, not a method).

So you can either do as others have suggested, and guess, or you can get
the information explicitly, say from argv.  In any case you'll need a
different way to assign encoding.
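
Getting it from argv could be as small as the following sketch. The `--encoding` option name and the `parse_args` helper are assumptions chosen for illustration; the default of utf-8 matches what the script currently hard-codes.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means use sys.argv[1:]."""
    parser = argparse.ArgumentParser(
        description='Right-align whitespace-separated fields read from stdin.')
    parser.add_argument('--encoding', default='utf-8',
                        help='encoding of the data on stdin (default: utf-8)')
    return parser.parse_args(argv)


args = parse_args(['--encoding', 'iso-8859-15'])
print(args.encoding)
```

A call such as `recode utf8..latin9 | ./align_compact.py --encoding iso-8859-15` would then replace the hard-coded assignment with `encoding = args.encoding`.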


-- 
DaveA


#52228

From: Terry Reedy <tjreedy@udel.edu>
Date: 2013-08-08 16:51 -0400
Message-ID: <mailman.373.1375995084.1251.python-list@python.org>
In reply to: #52200

On 8/8/2013 11:24 AM, Kurt Mueller wrote:

> print( u'{0:>3} {1:>3} {2:>3} {3:>3} {4:>3}'.format( *output_list ) )

Using the auto-numbering feature, this is the same as any of the following:

print( u'{:>3} {:>3} {:>3} {:>3} {:>3}'.format( *output_list ) )
print( (u' '.join([u'{:>3}']*5)).format(*output_list) )
print( (u' '.join([u'{:>3}']*len(output_list))).format(*output_list) )
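
Since the width and alignment are variable in the real program, nested replacement fields inside the format spec may be handy. A sketch, where `format_row` and its parameters are hypothetical names for this example:

```python
def format_row(fields, width=3, align=u'>'):
    """Join fields with single spaces, each padded to `width` columns."""
    # {1} and {2} are substituted into the format spec of field {0}.
    return u' '.join(u'{0:{1}{2}}'.format(field, align, width)
                     for field in fields)


row = format_row([u'\xf6', u'\xfc', u'i', u's', u'f'])
print(row)
print(format_row([u'a', u'b'], width=4, align=u'<') + u'|')
```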
-- 
Terry Jan Reedy


#52898

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2013-08-23 17:47 +0200
Message-ID: <mailman.168.1377273311.19984.python-list@python.org>
In reply to: #52200

On 08.08.2013 18:37, Chris Angelico wrote:
> On Thu, Aug 8, 2013 at 5:16 PM, Kurt Mueller
> <kurt.alfred.mueller@gmail.com> wrote:
>> On 08.08.2013 17:44, Peter Otten wrote:
>>> Kurt Mueller wrote:
>>>> What do I do, when input_strings/output_list has other codings like
>>>> iso-8859-1?
>>> You have to know the actual encoding. With that information it's easy:
>>>>>> output_list
>>> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>>>>> encoding = "utf-8"
>>>>>> output_list = [s.decode(encoding) for s in output_list]
>>>>>> print output_list
>>> [u'\xf6', u'\xfc', u'i', u's', u'f']
>> How do I find out the actual encoding?
>> I read from stdin. There can be different encodings.
>> Usually utf8, but iso-8859-1/latin9 are also to be expected.
>> But sys.stdin.encoding always says 'None'.
> 
> If you can switch to Python 3, life becomes a LOT easier. The Python 3
> input() function (which does the same job as raw_input() from Python
> 2) returns a Unicode string, meaning that it takes care of encodings
> for you.

Because I cannot switch to Python 3 for now my life is not so easy:-)

For some text manipulation tasks I need a template to split lines
from stdin into a list of strings the way shlex.split() does it.
The encoding of the input can vary.
For further processing in Python I need the list of strings to be in unicode.

Here is template.py:

##############################################################################################################
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# Muk 2013-08-23
# Python 2.7.3

from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True  # shlex: skip comments
bool_posx = True  # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    try:
        strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode
    except Exception, errr:                                         # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
    strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
    print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
##############################################################################################################

$ cat <some-file> | template.py


Comments are welcome.


TIA
-- 
Kurt Mueller


#53114

From: Kurt Mueller <kurt.alfred.mueller@gmail.com>
Date: 2013-08-28 10:01 +0200
Message-ID: <mailman.291.1377676940.19984.python-list@python.org>
In reply to: #52200

On 08.08.2013 18:37, Chris Angelico wrote:
> On Thu, Aug 8, 2013 at 5:16 PM, Kurt Mueller
> <kurt.alfred.mueller@gmail.com> wrote:
>> On 08.08.2013 17:44, Peter Otten wrote:
>>> Kurt Mueller wrote:
>>>> What do I do, when input_strings/output_list has other codings like
>>>> iso-8859-1?
>>> You have to know the actual encoding. With that information it's easy:
>>>>>> output_list
>>> ['\xc3\xb6', '\xc3\xbc', 'i', 's', 'f']
>>>>>> encoding = "utf-8"
>>>>>> output_list = [s.decode(encoding) for s in output_list]
>>>>>> print output_list
>>> [u'\xf6', u'\xfc', u'i', u's', u'f']
>> How do I find out the actual encoding?
>> I read from stdin. There can be different encodings.
>> Usually utf8, but iso-8859-1/latin9 are also to be expected.
>> But sys.stdin.encoding always says 'None'.
> 
> If you can switch to Python 3, life becomes a LOT easier. The Python 3
> input() function (which does the same job as raw_input() from Python
> 2) returns a Unicode string, meaning that it takes care of encodings
> for you.

Because I cannot switch to Python 3 for now my life is not so easy:-)

For some text manipulation tasks I need a template to split lines
from stdin into a list of strings the way shlex.split() does it.
The encoding of the input can vary.
For further processing in Python I need the list of strings to be in unicode.

Here is template.py:

##############################################################################################################
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# Muk 2013-08-23
# Python 2.7.3

from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True  # shlex: skip comments
bool_posx = True  # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    try:
        strg_inpt = shlex.split( inpt_line, bool_cmnt, bool_posx, ) # shlex does not work on unicode
    except Exception, errr:                                         # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings
    strg_unic = [ strg.decode( enco_type ) for strg in strg_inpt ]  # decode the strings into unicode
    print( 'strg_unic=' + repr( strg_unic ) )                       # list of unicode strings
##############################################################################################################

$ cat <some-file> | template.py


Comments are welcome.


TIA
-- 
Kurt Mueller



#53125

From: Dave Angel <davea@davea.name>
Date: 2013-08-28 10:23 +0000
Message-ID: <mailman.301.1377686682.19984.python-list@python.org>
In reply to: #52200

On 28/8/2013 04:01, Kurt Mueller wrote:


> Because I cannot switch to Python 3 for now my life is not so easy:-)
>
> For some text manipulation tasks I need a template to split lines
> from stdin into a list of strings the way shlex.split() does it.
> The encoding of the input can vary.
> For further processing in Python I need the list of strings to be in unicode.
>

According to:
   http://docs.python.org/2/library/shlex.html

"""Prior to Python 2.7.3, this module did not support Unicode
input"""

I take that to mean that if you upgrade to Python 2.7.3, 2.7.4, or
2.7.5, you'll have Unicode support.

Presumably that would mean you could decode the string before calling
shlex.split().

-- 
DaveA


#53129

From: kurt.alfred.mueller@gmail.com
Date: 2013-08-28 04:17 -0700
Message-ID: <473e3274-9581-4699-a836-798ab2f54758@googlegroups.com>
In reply to: #53125

On Wednesday, August 28, 2013 12:23:12 PM UTC+2, Dave Angel wrote:
> On 28/8/2013 04:01, Kurt Mueller wrote:
> > Because I cannot switch to Python 3 for now my life is not so easy:-)
> > For some text manipulation tasks I need a template to split lines
> > from stdin into a list of strings the way shlex.split() does it.
> > The encoding of the input can vary.
> > For further processing in Python I need the list of strings to be in unicode.
> According to:
>    http://docs.python.org/2/library/shlex.html
> """Prior to Python 2.7.3, this module did not support Unicode
> input"""
> I take that to mean that if you upgrade to Python 2.7.3, 2.7.4, or
> 2.7.5, you'll have Unicode support.

I have Python 2.7.3

> Presumably that would mean you could decode the string before calling
> shlex.split().

Yes, see new template.py:
###############################################################
#!/usr/bin/env python
# vim: set fileencoding=utf-8 :
# split lines from stdin into a list of unicode strings
# decode before shlex
# Muk 2013-08-28
# Python 2.7.3

from __future__ import print_function
import sys
import shlex
import chardet

bool_cmnt = True  # shlex: skip comments
bool_posx = True  # shlex: posix mode (strings in quotes)

for inpt_line in sys.stdin:
    print( 'inpt_line=' + repr( inpt_line ) )
    enco_type = chardet.detect( inpt_line )[ 'encoding' ]           # {'encoding': 'EUC-JP', 'confidence': 0.99}
    print( 'enco_type=' + repr( enco_type ) )
    strg_unic = inpt_line.decode( enco_type )                       # decode the input line into unicode
    print( 'strg_unic=' + repr( strg_unic ) )                       # unicode input line
    try:
        strg_inpt = shlex.split( strg_unic, bool_cmnt, bool_posx, ) # check if shlex works on unicode
    except Exception, errr:                                         # usually 'No closing quotation'
        print( "error='%s' on inpt_line='%s'" % ( errr, inpt_line.rstrip(), ), file=sys.stderr, )
        continue
    print( 'strg_inpt=' + repr( strg_inpt ) )                       # list of strings

###############################################################

$ python -V
Python 2.7.3
$ echo -e "a b c d e\na Ö u 1 2" | template.py
inpt_line='a b c d e\n'
enco_type='ascii'
strg_unic=u'a b c d e\n'
strg_inpt=['a', 'b', 'c', 'd', 'e']
inpt_line='a \xc3\x96 u 1 2\n'
enco_type='utf-8'
strg_unic=u'a \xd6 u 1 2\n'
error=''ascii' codec can't encode character u'\xd6' in position 2: ordinal not in range(128)' on inpt_line='a Ö u 1 2'
$ echo -e "a b c d e\na Ö u 1 2" | recode utf8..latin9 | ./split_shlex_unicode.py 
inpt_line='a b c d e\n'
enco_type='ascii'
strg_unic=u'a b c d e\n'
strg_inpt=['a', 'b', 'c', 'd', 'e']
inpt_line='a \xd6 u 1 2\n'
enco_type='windows-1252'
strg_unic=u'a \xd6 u 1 2\n'
error=''ascii' codec can't encode character u'\xd6' in position 2: ordinal not in range(128)' on inpt_line='a � u 1 2'
$

As can be seen, shlex works only on unicode strings that contain nothing but ASCII characters. (Python 2.7.3)
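
One possible workaround, sketched under assumptions: the helper name `split_unicode` is made up, and the Python 2 branch simply goes through UTF-8 bytes, relying on the earlier observation in this thread that shlex copes with UTF-8 byte strings there. On Python 3 shlex accepts text directly, so that branch is a plain call.

```python
import shlex
import sys


def split_unicode(line, comments=True, posix=True):
    """shlex.split() for an already decoded (unicode) line."""
    if sys.version_info[0] >= 3:
        # Python 3: shlex works on str, which is unicode text.
        return shlex.split(line, comments, posix)
    # Python 2: split the UTF-8 bytes, then decode each token again.
    tokens = shlex.split(line.encode('utf-8'), comments, posix)
    return [token.decode('utf-8') for token in tokens]


tokens = split_unicode(u'a \xd6 u "1 2"')
print(repr(tokens))
```

The input line is decoded once with the detected encoding before the call, so the result is a list of unicode strings regardless of whether the input was utf-8 or latin9.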

-- 
Kurt Müller
