Re: Unicode in cgi-script with apache2

Newsgroups comp.lang.python
Date 2014-08-17 03:05 -0700
References <mailman.13038.1408130249.18130.python-list@python.org> <satHv.195207$ze2.61877@fx28.am4> <mailman.13054.1408229123.18130.python-list@python.org> <lsp5ab$sjv$1@dont-email.me> <53f05ed9$0$30003$c3e8da3$5496439d@news.astraweb.com>
Message-ID <406363a3-5616-477c-86c0-71e101bca5bb@googlegroups.com> (permalink)
Subject Re: Unicode in cgi-script with apache2
From wxjmfauth@gmail.com


On Sunday, 17 August 2014 09:50:48 UTC+2, Steven D'Aprano wrote:
> py> b = "Hello ë ü world".encode('utf-8')
> py> print(b.decode('ascii', errors='replace'))
> Hello �� �� world

=========

No. You are approaching the problem the wrong way. This is
a typical situation where the code produced will work
correctly, but only as "works for me" code.

The mistake is that, this way, you are producing code
that is not suited to the "system" that will host your
string.

In the present case, you are already assuming, prior to
any string manipulation, that the output should be utf-8.

D:\>c:\python32\python
Python 3.2.5 (default, May 15 2013, 23:06:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> b = "Hello ë ü world".encode('utf-8')
>>> b
b'Hello \xc3\xab \xc3\xbc world'
>>> b.decode('ascii', 'replace')
'Hello \ufffd\ufffd \ufffd\ufffd world'
>>> print(b.decode('ascii', 'replace'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\python32\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 6-7: character maps to <undefined>
>>>
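
The failure above can be reproduced without a cp850 console. The
following is a minimal sketch, assuming an in-memory text stream is an
acceptable stand-in for the console; the helper name can_print is
hypothetical and not part of the original session.

```python
import io

def can_print(text, encoding):
    """Return True if `text` can be written to a stream using `encoding`."""
    # Hypothetical stand-in for a console: a strict text stream over a byte buffer.
    stream = io.TextIOWrapper(io.BytesIO(), encoding=encoding, errors="strict")
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError:
        return False
    return True

b = "Hello ë ü world".encode("utf-8")
# Decoding as ASCII with 'replace' yields U+FFFD for each non-ASCII byte.
decoded = b.decode("ascii", "replace")

print(can_print(decoded, "cp850"))   # U+FFFD has no cp850 mapping
print(can_print(decoded, "utf-8"))   # utf-8 can encode any code point
```

The first call reports failure because U+FFFD is not in cp850, which is
the same condition that raised UnicodeEncodeError in the session above.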


The proper way is to "prepare" your string prior to any
further manipulation (see my previous comment about
processes).

I'm explicitly using the code page cp850 and the
euro sign.

>>> u = "Hello ë ü world \u20ac\u20ac\u20ac"
>>> newu = u.encode('cp850', 'replace').decode('cp850')
>>> print(newu)
Hello ë ü world ???
>>> type(newu)
<class 'str'>
>>>

The replacement character now belongs to the set of
characters that can be displayed.
It will never fail.
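
The round trip shown above can be wrapped in a small helper. This is
only a sketch: the name prepare_for is hypothetical, and the target
encoding is passed in explicitly instead of being detected.

```python
def prepare_for(text, encoding):
    """Replace every character that `encoding` cannot represent with '?'.

    The result is still a str, but it only contains characters that
    can be encoded with `encoding`.
    """
    return text.encode(encoding, "replace").decode(encoding)

u = "Hello ë ü world \u20ac\u20ac\u20ac"

# cp850 has ë and ü but no euro sign, so only the euro signs are replaced.
for_cp850 = prepare_for(u, "cp850")

# ASCII has none of them, so all five characters are replaced.
for_ascii = prepare_for(u, "ascii")
print(for_ascii)
```

Encoding the prepared string with the same encoding and the default
strict error handler cannot raise, since every remaining character has a
mapping in that encoding.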


You can mimic the same behaviour with a web browser.

Create an html file in utf-8 containing characters
not belonging to iso-8859-1.
Display that file and change the encoding of the browser
to iso-8859-1.
You will see that the browser "re-encodes" the source with
a replacement char and only then re-displays it. This is the
same process I gave above.

The key point is the detection, if doable, of the encoding
that should be used.

>>> import sys
>>> sys.stdout.encoding
'cp850'
>>>
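
Detection and preparation can be combined. The sketch below assumes
that the stream's encoding attribute is a good enough guess for the
target encoding; the names stream_encoding and safe_print and the
"ascii" fallback are hypothetical choices, not something from the
session above.

```python
import io
import sys

def stream_encoding(stream=None, default="ascii"):
    """Best-effort guess of the encoding a text stream will use."""
    stream = sys.stdout if stream is None else stream
    # Some streams have no encoding attribute, or have it set to None.
    return getattr(stream, "encoding", None) or default

def safe_print(text, stream=None):
    """Write `text` after replacing characters the stream cannot encode."""
    stream = sys.stdout if stream is None else stream
    enc = stream_encoding(stream)
    stream.write(text.encode(enc, "replace").decode(enc) + "\n")

u = "Hello ë ü world \u20ac\u20ac\u20ac"

# On the real console, the detected encoding decides what gets replaced.
safe_print(u)

# io.StringIO reports no encoding, so the 'ascii' fallback applies here.
buf = io.StringIO()
safe_print(u, buf)
```

Falling back to ASCII is the conservative option: it loses the most
characters but is the least likely to fail on an unknown stream.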

My example is not Windows specific. On a gb**** Chinese
BSD or a koi8 Russian Linux, the problem is identical.

jmf
