
Unicode

Started by	Anatoli Hristov <tolidtm@gmail.com>
First post	2012-12-16 22:10 +0100
Last post	2012-12-17 23:31 +0100
Articles	20 on this page of 21 — 7 participants


  Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-16 22:10 +0100
    Re: Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-17 06:06 +0000
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 09:59 +0100
      Re: Unicode Benjamin Kaplan <benjamin.kaplan@case.edu> - 2012-12-17 01:28 -0800
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 10:45 +0100
      Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 11:02 +0100
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 11:17 +0100
      Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 11:55 +0100
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 12:14 +0100
      Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 12:56 +0100
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 18:43 +0100
      Re: Unicode Dave Angel <d@davea.name> - 2012-12-17 13:07 -0500
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 19:36 +0100
        Re: Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-18 00:07 +0000
      Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 20:55 +0100
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 21:00 +0100
      Re: Unicode Dave Angel <d@davea.name> - 2012-12-17 16:09 -0500
        Re: Unicode Hans Mulder <hansmu@xs4all.nl> - 2012-12-17 23:02 +0100
          Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 23:33 +0100
      Re: Unicode Terry Reedy <tjreedy@udel.edu> - 2012-12-17 17:03 -0500
      Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 23:31 +0100


#34949 — Unicode

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-16 22:10 +0100
Subject	Unicode
Message-ID	<mailman.941.1355692240.29569.python-list@python.org>

Hello guys,

I'm using CentOS Linux and Python 2.4 with MySQL 5.xx, and I get a
Unicode error. I tried many things that I found on the net, but none of
them work.

If I don't use UTF-8, it inserts the data into the DB, but some French
characters are not decoded correctly. Could you please help me?

Thanks

def PrepareSpecs(product_id, icecat_prod_id, icecat_image_url, name):
"""Gets the specifications of a product from Icecat.biz and insert
them into the DB
"""
    specs = {3:GetSpecsNL(icecat_prod_id),2:GetSpecsFR(icecat_prod_id).decode('utf-8'),1:GetSpecsEN(icecat_prod_id)}
    SpecsToSQL(product_id,specs,name)
    CategorySQL(product_id)
    StoreSQL(product_id)
    GetIMG(icecat_image_url,icecat_prod_id)
    return

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
% icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
##    if specsfr:
##        try:
##            specsfr = str(specsfr)
##        except UnicodeEncodeError:
##            specsfr = str(specsfr.encode('utf-16'))
    return specsfr

def RemoveHTML(specs):
    specs = specs.replace("<html>","")
    specs = specs.replace("<HTML>","")
    specs = specs.replace("</html>","")
    specs = specs.replace("</HTML>","")
    specs = specs.replace("<head>","")
    specs = specs.replace("<HEAD>","")
    specs = specs.replace("</head>","")
    specs = specs.replace("</HEAD>","")
    specs = specs.replace("<body>","")
    specs = specs.replace("</body>","")
    specs = specs.replace("<BODY>","")
    specs = specs.replace("</BODY>","")
    specs = specs.replace("<TITLE>","")
    specs = specs.replace("</TITLE>","")
    specs = specs.replace("<title>","")
    specs = specs.replace("</title>","")
    specs = specs.replace("<p>","")
    specs = specs.replace("</p>","")
    return specs

def SpecsToSQL(product_id, specs, name):
    for lang, spec in specs.iteritems():
        InsertSpecsDB(product_id, spec, lang, name)
    return

def InsertSpecsDB(product_id, spec, lang, name):
    # Parameter order matches the call in SpecsToSQL: (product_id, spec, lang, name).
    db = MySQLdb.connect("localhost","getit","opencart")
    cursor = db.cursor()
    sql = ("INSERT INTO product_description (product_id, language_id, "
           "name, description) VALUES (%s,%s,%s,%s)")
    params = (product_id, lang, name, spec)
    cursor.execute(sql, params)
    id = cursor.lastrowid
    print "Updated ID %s description %s" % (int(id), lang)
    return


#34955

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-12-17 06:06 +0000
Message-ID	<50ceb674$0$29868$c3e8da3$5496439d@news.astraweb.com>
In reply to	#34949

On Sun, 16 Dec 2012 22:10:37 +0100, Anatoli Hristov wrote:

> If I dont use UTF-8 it inserts the data into the DB  but some French
> char. are not correctly decoded. Could you please help me ?

What happens when you do use UTF-8?

What do you mean, "use UTF-8"?

To learn about Unicode, start here:

http://www.joelonsoftware.com/articles/Unicode.html

If that helps you solve the problem, excellent. If not, please come back 
with your questions, but first read this:

http://www.sscce.org/

As given, we cannot answer your question easily, or at all, because we 
cannot run your code. It gives indentation errors, you don't tell us what 
modules you're using, and you haven't reduced the example down to the 
critical parts that demonstrate the failure.

-- 
Steven


#34963

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 09:59 +0100
Message-ID	<mailman.949.1355734762.29569.python-list@python.org>
In reply to	#34955

> What happens when you do use UTF-8?
This is the result when I encode the string:
" Ã©troits, en utilisant un portable extrÃªmement puissantâ€”le plus
petit et le plus lÃ©ger des HP EliteBook pleine puissanceâ€”avec un
Ã©cran de diagonale 31,75 cm (12,5 pouces), idÃ©al pour le
professionnel ultra-mobile.
"
No accents
>
> What do you mean, "use UTF-8"?

Trying to encode the string
>
>
> To learn about Unicode, start here:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> If that helps you solve the problem, excellent. If not, please come back
> with your questions, but first read this:
I will try to understand the logic :)

>
> http://www.sscce.org/
>
> As given, we cannot answer your question easily, or at all, because we
> cannot run your code. It gives indentation errors, you don't tell us what
> modules you're using, and you haven't reduced the example down to the
> critical parts that demonstrate the failure.
I didn't want to include all my code as it is 15K, and I also know
my code is crappy and you will start blaming me and saying that my code
is crap, and I know it!

Thanks


#34966

From	Benjamin Kaplan <benjamin.kaplan@case.edu>
Date	2012-12-17 01:28 -0800
Message-ID	<mailman.951.1355736494.29569.python-list@python.org>
In reply to	#34955

On Mon, Dec 17, 2012 at 12:59 AM, Anatoli Hristov <tolidtm@gmail.com> wrote:
>> What happens when you do use UTF-8?
> This is the result when I encode the string:
> " Ã©troits, en utilisant un portable extrÃªmement puissantâ€”le plus
> petit et le plus lÃ©ger des HP EliteBook pleine puissanceâ€”avec un
> Ã©cran de diagonale 31,75 cm (12,5 pouces), idÃ©al pour le
> professionnel ultra-mobile.
> "
> No accents
>>
>> What do you mean, "use UTF-8"?
>
> Trying to encode the string
>>

What's your terminal's encoding? That looks like you have a CP-1252
terminal trying to output UTF-8 text.


#34967

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 10:45 +0100
Message-ID	<mailman.952.1355737520.29569.python-list@python.org>
In reply to	#34955

> What's your terminal's encoding? That looks like you have a CP-1252
> terminal trying to output UTF-8 text.

Thanks for your answer, I tried <locale> in my terminal and it gives
this as an output:
LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=


#34968

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2012-12-17 11:02 +0100
Message-ID	<mailman.953.1355738534.29569.python-list@python.org>
In reply to	#34955

2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> What happens when you do use UTF-8?
> This is the result when I encode the string:
> " Ã©troits, en utilisant un portable extrÃªmement puissantâ€”le plus
> petit et le plus lÃ©ger des HP EliteBook pleine puissanceâ€”avec un
> Ã©cran de diagonale 31,75 cm (12,5 pouces), idÃ©al pour le
> professionnel ultra-mobile.
> "
> No accents
>

Hi,
if you only see encoding problems when printing results to your
terminal, its settings or Unicode capability might be the cause.
However, if you also get badly encoded items in the database, you are
likely using an inappropriate encoding in some step.

you seem to be doing something like the following (explicitly or
partly implicitly, based on your system defaults):

>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
Ã©troits, en utilisant un portable extrÃªmement puissant
>>>

i.e. encoding a text using utf-8 and handling it like windows-1252
afterwards (or taking an already encoded text and decoding it with the
inappropriate ANSI encoding).
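
The mix-up described above can be reproduced and, as long as no bytes were lost, reversed. A minimal sketch in Python 3 syntax (the sessions in this thread are Python 2); the variable names are illustrative only:

```python
# Text that is encoded as UTF-8 but then decoded with the wrong codec.
original = "étroits, en utilisant un portable extrêmement puissant"

utf8_bytes = original.encode("utf-8")          # the correct bytes
mojibake = utf8_bytes.decode("windows-1252")   # wrong codec on the way back
# mojibake now starts with "Ã©troits", as in the database output quoted above.

# If every byte survived, the mistake can be undone by running it backwards.
repaired = mojibake.encode("windows-1252").decode("utf-8")
print(repaired == original)
```

The repair only works while the data has been mis-decoded exactly once; text that went through several lossy steps may not be recoverable this way.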

hth,
   vbr


#34969

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 11:17 +0100
Message-ID	<mailman.954.1355739439.29569.python-list@python.org>
In reply to	#34955

> if you only see encoding problems on printing results to your
> terminal, its settings or unicode capability might be the cause,
> however, if you also get badly encoding items in the database, you are
> likely using an inappropriate encoding in some step.

I get badly encoded data in my DB

> you seem to be doing something like the following (explicitly or
> partly implicitly, based on your system defaults):
>
>>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
> Ã©troits, en utilisant un portable extrÃªmement puissant
>>>>
>
> i.e. encode a text using utf-8 and handling it like windows-1252
> afterwards (or take an already encoded text and decode it with the
> inappropriate ANSI encoding.

Thank you Vlastimil,

I tried to print it as you showed me, but I receive an error:
>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192'
in position 1: ordinal not in range(256)
>>>


#34970

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2012-12-17 11:55 +0100
Message-ID	<mailman.955.1355741714.29569.python-list@python.org>
In reply to	#34955

2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> if you only see encoding problems on printing results to your
>> terminal, its settings or unicode capability might be the cause,
>> however, if you also get badly encoding items in the database, you are
>> likely using an inappropriate encoding in some step.
>
> I get badly encoding into my DB
>
>> you seem to be doing something like the following (explicitly or
>> partly implicitly, based on your system defaults):
>>
>>>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
>> Ã©troits, en utilisant un portable extrÃªmement puissant
>>>>>
>>
>> i.e. encode a text using utf-8 and handling it like windows-1252
>> afterwards (or take an already encoded text and decode it with the
>> inappropriate ANSI encoding.
>
> Thank you Vlastimil,
>
> I tried to print it as you sholed mr, but I receive an erro:
>>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192'
> in position 1: ordinal not in range(256)
>>>>

Hi,
this seems to be an encoding error of your terminal on printing.
You may need to describe (or better, post the respective parts of the
source) where the text is coming from (external text file, database
entry, hardcoded in the Python source ...), how it is stored, retrieved
and possibly manipulated before you insert it into the database.

You may try to print a repr(...) of the string to be inserted into the
database to see whether it is already mangled in some previous
part of the processing.
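
The repr(...) check suggested above could look roughly like this. It is a sketch in Python 3 syntax (in Python 2, repr() on a str or unicode object plays the same role), and the helper name describe is hypothetical:

```python
def describe(value):
    """Return an ASCII-only description that no terminal setting can distort."""
    if isinstance(value, bytes):
        return "bytes " + repr(value)
    return "text " + ascii(value)


clean = "é".encode("utf-8")
# UTF-8 bytes misread as latin-1 and then encoded again: double encoding.
mangled = clean.decode("latin-1").encode("utf-8")

print(describe(clean))
print(describe(mangled))
print(describe("é"))
```

Because the output contains only ASCII escapes, it shows what is really in the string regardless of how the terminal is configured.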

hth,

    vbr


#34971

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 12:14 +0100
Message-ID	<mailman.956.1355742852.29569.python-list@python.org>
In reply to	#34955

> this seems to be an encoding error of your terminal on printing.
> You may need to describe (or better post the respective parts of the
> source) where the text is coming from (external text file, database
> entry, harcoded in the python source ...), how it is stored, retrieved
> and possibly manipulated before you insert it to the database.
>
Here is how I get the data using the urllib opener:

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
% icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
##    if specsfr:
##        try:
##            specsfr = str(specsfr)
##        except UnicodeEncodeError:
##            specsfr = str(specsfr.encode('utf-16'))
    return specsfr


#34972

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2012-12-17 12:56 +0100
Message-ID	<mailman.957.1355745371.29569.python-list@python.org>
In reply to	#34955

2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> this seems to be an encoding error of your terminal on printing.
>> You may need to describe (or better post the respective parts of the
>> source) where the text is coming from (external text file, database
>> entry, harcoded in the python source ...), how it is stored, retrieved
>> and possibly manipulated before you insert it to the database.
>>
> Here is how I get the data using the urllib opener:
>
> def GetSpecsFR(icecat_prod_id):
>     opener = urllib.FancyURLopener({})
>     ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
> % icecat_prod_id)
>     specsfr = ffr.read()
>     #specsfr = specsfr.decode('utf-8')
>     specsfr = RemoveHTML(specsfr)
>     ##specsfr = "%r" % specsfr
> ##    if specsfr:
> ##        try:
> ##            specsfr = str(specsfr)
> ##        except UnicodeEncodeError:
> ##            specsfr = str(specsfr.encode('utf-16'))
>     return specsfr

Hi,
I don't know what the product ID would look like for this page, but
assuming the catalog pages are utf-8 encoded like the
error page I get, it should work ok; cf.:

>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (1234,))
>>> src = ffr.read()
>>> print src.decode("utf-8")


<!-- This Icecat template is used as head of all pages in Product finder -->


<HTML>
<HEAD>

[... - shortened]

<div align="center">"Désolé, pour ce produit, nous n'avons pas trouvé
d'autres informations produit.<br>Si vous n'êtes pas redirigés
automatiquement, veuillez cliquer" <a href="#" style="font-size:80%"
onclick="history.back()">ici</a>
</div>
<!--
            <td bgcolor="" width="230" align="center"><img
src="/imgs/logo.gif" width="180" height="58"></td>
-->



>>>

Printing in a Unicode-capable shell works ok (wx PyShell in my case);
inserting into the database should be straightforward too (although I
don't have experience with the specific db you are using).

Are you getting any other Unicode errors in other parts of the process,
or do the above steps work differently on your computer?

hth,
  vbr


#34997

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 18:43 +0100
Message-ID	<mailman.974.1355766240.29569.python-list@python.org>
In reply to	#34955

> Hi,
> I don't know, what the product ID would look like, for this page, but
> assuming, the catalog pages are also utf-8 encoded as well as the
> error page I get, it should work ok; cf.:
You are right, I got it to work on Windows too, but not on Linux. I
changed the codec on Linux, but I still can't get it to work.

Here is what I get from Linux:

>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (14688538))
>>> src = ffr.read()
>>> print src.decode("utf-8")
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122'
in position 17167: ordinal not in range(256)


#35001

From	Dave Angel <d@davea.name>
Date	2012-12-17 13:07 -0500
Message-ID	<mailman.979.1355767688.29569.python-list@python.org>
In reply to	#34955

On 12/17/2012 12:43 PM, Anatoli Hristov wrote:
>> Hi,
>> I don't know, what the product ID would look like, for this page, but
>> assuming, the catalog pages are also utf-8 encoded as well as the
>> error page I get, it should work ok; cf.:
> You are right, I get it work on Windows too, but not in Linux. I
> changed the codec of linux, but still I don't get it
>
> Here is what I get from Linux:
>
>>>> import urllib
>>>> opener = urllib.FancyURLopener({})
>>>> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (14688538))
>>>> src = ffr.read()
>>>> print src.decode("utf-8")
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122'
> in position 17167: ordinal not in range(256)

I can tell you what's happening, but maybe not how to fix it.

src.decode() is creating a unicode string.  The error is not happening
there.  But when print is used with a unicode string, it has to encode
the data.  And for whatever reason, yours is using latin-1, and you have
a character in there which is not in the latin-1 encoding.

My python 2.7 uses utf-8 everywhere (on Linux Ubuntu 11.04).

-- 

DaveA


#35004

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 19:36 +0100
Message-ID	<mailman.982.1355769364.29569.python-list@python.org>
In reply to	#34955

> src.decode() is creating a unicode string.  The error is not happening
> there.  But when print is used with a unicode string, it has to encode
> the data.  And for whatever reason, yours is using latin-1, and you have
> a character in there which is not in the latin-1 encoding.
I fixed the print: I changed the settings of the terminal and also of
the sshconfig, so now I'm able to print without
problems. But when I tried to run the script I've made, it gives me
the same error again:
""Unexpected error: exceptions.UnicodeEncodeError
"""
Maybe I will try to update to 2.7


#35028

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2012-12-18 00:07 +0000
Message-ID	<50cfb3d9$0$29991$c3e8da3$5496439d@news.astraweb.com>
In reply to	#35004

On Mon, 17 Dec 2012 19:36:01 +0100, Anatoli Hristov wrote:

>> src.decode() is creating a unicode string.  The error is not happening
>> there.  But when print is used with a unicode string, it has to encode
>> the data.  And for whatever reason, yours is using latin-1, and you
>> have a character in there which is not in the latin-1 encoding.
> I fixed the print, I changed the setting of the terminal and also on the
> sshconfig, so now when I print I'm able to print out without problems,
> but when I tried to run the script I've made it gives me again the same
> error :
> ""Unexpected error: exceptions.UnicodeEncodeError """

That is not a full Python traceback. Python gives you lots of debugging 
information, in the form of a complete traceback. Use those tracebacks, 
don't ignore them.

Trying to debug code without the full traceback is like trying to read a 
book by reading only every third page.


-- 
Steven


#35009

From	Vlastimil Brom <vlastimil.brom@gmail.com>
Date	2012-12-17 20:55 +0100
Message-ID	<mailman.987.1355774107.29569.python-list@python.org>
In reply to	#34955

2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> src.decode() is creating a unicode string.  The error is not happening
>> there.  But when print is used with a unicode string, it has to encode
>> the data.  And for whatever reason, yours is using latin-1, and you have
>> a character in there which is not in the latin-1 encoding.
> I fixed the print, I changed the setting of the terminal and also on
> the sshconfig, so now when I print I'm able to print out without
> problems, but when I tried to run the script I've made it gives me
> again the same error :
> ""Unexpected error: exceptions.UnicodeEncodeError
> """
> Maybe I will try to update to 2.7
> --
> http://mail.python.org/mailman/listinfo/python-list

Well, we don't see the context or traceback of that error, but it
looks like a MySQL error on inserting data.
Could it be that your database is not Unicode-enabled, i.e. that it uses,
say, latin-1 rather than utf-8?
I don't have experience with this database, but I guess there
must be some configuration options for this.
Would setting the encoding in db.connect(...) maybe work?
cf.:
http://stackoverflow.com/questions/8365660/python-mysql-unicode-and-encoding
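
One way to set the encoding at connection time might look like the sketch below. It assumes the MySQLdb driver used in the original post, whose connect() accepts charset and use_unicode keyword arguments; the helper names and the credentials in the usage note are placeholders, and whether "utf8" is the right charset depends on how the server and tables are configured:

```python
def utf8_connect_kwargs(host, user, passwd):
    """Build keyword arguments that ask the driver for a UTF-8 connection."""
    return {
        "host": host,
        "user": user,
        "passwd": passwd,
        "charset": "utf8",      # connection character set sent to the server
        "use_unicode": True,    # return text columns as unicode objects
    }


def connect_utf8(host, user, passwd):
    # MySQLdb is a third-party module; it is imported lazily so that the
    # helper above can be exercised without the driver or a running server.
    import MySQLdb
    return MySQLdb.connect(**utf8_connect_kwargs(host, user, passwd))


print(sorted(utf8_connect_kwargs("localhost", "someuser", "secret")))
```

With these settings, unicode objects passed as query parameters should be encoded by the driver rather than by an implicit default codec in the script.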

Hopefully, others might give more reliable suggestions..

hth,
  vbr


#35010

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 21:00 +0100
Message-ID	<mailman.988.1355774436.29569.python-list@python.org>
In reply to	#34955

> I fixed the print, I changed the setting of the terminal and also on
> the sshconfig, so now when I print I'm able to print out without
> problems, but when I tried to run the script I've made it gives me
> again the same error :
> ""Unexpected error: exceptions.UnicodeEncodeError
> """
> Maybe I will try to update to 2.7

Upgraded to Python 2.7 and it still gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...


#35014

From	Dave Angel <d@davea.name>
Date	2012-12-17 16:09 -0500
Message-ID	<mailman.991.1355778571.29569.python-list@python.org>
In reply to	#34955

On 12/17/2012 03:00 PM, Anatoli Hristov wrote:
>> I fixed the print, I changed the setting of the terminal and also on
>> the sshconfig, so now when I print I'm able to print out without
>> problems, but when I tried to run the script I've made it gives me
>> again the same error :
>> ""Unexpected error: exceptions.UnicodeEncodeError
>> """
That's not the whole error message. What encoding does it report in the
error?

>> Maybe I will try to update to 2.7

> Upgraded to python 27 and still it gives Unexpected error:
> exceptions.UnicodeEncodeError. Damn encoders I don'y know what to
> do...

I doubted that 2.7 would make any difference.

1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8

2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG (there are two other environment variables I've seen
reference to, so this aspect is nuts)

Mine says
en_US.UTF-8

3. What does Python think it was told?
import sys
print sys.stdout.encoding

Mine says
UTF-8
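
The three checks above can be gathered in one place. A small sketch in Python 3 syntax; the function name encoding_report is hypothetical, and LC_ALL and LC_CTYPE are assumed to be the other two environment variables alluded to in step 2:

```python
import locale
import os
import sys


def encoding_report(environ=None, stream=None):
    """Collect the settings that decide how printed text gets encoded."""
    environ = os.environ if environ is None else environ
    stream = sys.stdout if stream is None else stream
    return {
        "LANG": environ.get("LANG"),
        "LC_ALL": environ.get("LC_ALL"),
        "LC_CTYPE": environ.get("LC_CTYPE"),
        "stdout_encoding": getattr(stream, "encoding", None),
        "preferred_encoding": locale.getpreferredencoding(False),
    }


for key, value in sorted(encoding_report().items()):
    print(key, "=", value)
```

Running it both in the interactive terminal and in the environment where the script actually runs (cron, ssh, a web server) can reveal that the two do not share the same settings.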

I can force a similar error as follows:

import urllib
opener = urllib.FancyURLopener({})
ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (14688538))
src = ffr.read()

out = src.decode("utf-8").encode("latin-1")

Traceback (most recent call last):
  File "anatoli3.py", line 9, in <module>
    src.decode("utf-8").encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 17167: ordinal not in range(256)

And from that it's quite clear that for that particular data, I cannot
use a latin-1 encoder.

So I did a bit of hunting, and I found that the offending character is the one
after the word "Core" in the following quote:

processeurs Intel® Core™ de 3ème génération

The symbol is a trademark symbol and is not part of latin-1. If you're
really stuck with a latin-1 terminal, then you could do something like:

print src.decode("utf-8").encode("latin-1", "ignore")

That says to decode it using utf-8 (because the html declared a utf-8
encoding), and encode it back to latin-1 (because your terminal is stuck
there), then print.

Just realize that once you start using 'ignore' you're going to also
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.

For others that just want to play with a minimal subset:

test = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'
print test
print test.encode("latin-1", "ignore")
print test.encode("latin-1")

produces :

processeurs Intel® Core™ de 3ème génération av
processeurs Intel� Core de 3�me g�n�ration av
Traceback (most recent call last):
  File "anatoli3.py", line 22, in <module>
    print test.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 23: ordinal not in range(256)

-- 

DaveA


#35020

From	Hans Mulder <hansmu@xs4all.nl>
Date	2012-12-17 23:02 +0100
Message-ID	<50cf966b$0$6969$e4fe514c@news2.news.xs4all.nl>
In reply to	#35014

On 17/12/12 22:09:04, Dave Angel wrote:
> print src.decode("utf-8").encode("latin-1", "ignore")
> 
> That says to decode it using utf-8 (because the html declared a utf-8
> encoding), and encode it back to latin-1 (because your terminal is stuck
> there), then print.
> 
> 
> Just realize that once you start using 'ignore' you're going to also
> ignore discrepancies that are real. For example, maybe your terminal is
> actual something other than either latin-1 or utf-8.

If you need to see such discrepancies, you can do

print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")


That would produce something like:

processeurs Intel® Core&#8482; de 3ème génération av

that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.

Notice that the number you get is in decimal, where the \u....
notation uses hex:

>>> ord(u"\u2122")
8482
>>>
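
To compare the error handlers mentioned in this thread side by side, something like the following could be used. It is a sketch in Python 3 syntax, with a shortened sample string based on the quote from the product page:

```python
sample = "Intel\xae Core\u2122"   # registered sign fits latin-1, trademark sign does not

handlers = ("ignore", "replace", "xmlcharrefreplace", "backslashreplace")
results = {name: sample.encode("latin-1", name) for name in handlers}
for name in handlers:
    print(name, results[name])

# The default handler ("strict") raises instead of silently changing the text.
try:
    sample.encode("latin-1")
except UnicodeEncodeError as exc:
    strict_error = (exc.start, exc.reason)
    print("strict: position", exc.start, exc.reason)

# The decimal number in &#8482; and the hex digits in \u2122 are the same code point.
print(hex(8482))
```

Only 'ignore' drops the character without leaving a trace; the other handlers keep some visible marker of where the problem was.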


Hope this helps,

-- HansM


#35023

From	Anatoli Hristov <tolidtm@gmail.com>
Date	2012-12-17 23:33 +0100
Message-ID	<mailman.999.1355783638.29569.python-list@python.org>
In reply to	#35020

>> Just realize that once you start using 'ignore' you're going to also
>> ignore discrepancies that are real. For example, maybe your terminal is
>> actual something other than either latin-1 or utf-8.
>
> If you need to see such discrepancies, you can do
>
> print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")
>
>
> That would produce something like:
>
> processeurs Intel® Core&#8482; de 3ème génération av
>
> that is, the problem characters are displayed in &#...; notation.
> That is ugly, but sometimes it's the only way to see what character
> you really have.
>
> Notice that the number you get is in decimal, where the \u....
> notation uses hex:

Thanks guys, my issue is now solved. The problem came from my PuTTY
client: it was set to latin1 by default, and after changing it to utf-8 it
now works...


#35021

From	Terry Reedy <tjreedy@udel.edu>
Date	2012-12-17 17:03 -0500
Message-ID	<mailman.997.1355781824.29569.python-list@python.org>
In reply to	#34955

On 12/17/2012 3:00 PM, Anatoli Hristov wrote:
>> I fixed the print, I changed the setting of the terminal and also on
>> the sshconfig, so now when I print I'm able to print out without
>> problems, but when I tried to run the script I've made it gives me
>> again the same error :
>> ""Unexpected error: exceptions.UnicodeEncodeError
>> """
>> Maybe I will try to update to 2.7
>
> Upgraded to python 27 and still it gives Unexpected error:
> exceptions.UnicodeEncodeError. Damn encoders I don'y know what to
> do...

If you are working with unicode, and you can upgrade to 3.3, you will 
probably be happier if you do. This does not solve all problems, but the 
python side is definitely better. (IE, there are unicode bugs in 2.7 
whose fix *is* to upgrade to 3.3.)

That said, retrieving

http://prf.icecat.biz/index.cgi?product_id=14688538;mi=start;smi=product;shopname=openICEcat-url;lang=fr

with Firefox on Win 7 returns a page containing

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

so I presume the http encoding is also utf-8
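
Rather than presuming the encoding, a script could read it from the response itself. A rough sketch in Python 3 syntax, assuming the usual rule that a charset in the HTTP Content-Type header takes precedence over the meta tag; the function name sniff_charset and the regular expression are illustrative, not a complete HTML parser:

```python
import re
from email.message import Message

# Matches charset=... inside a <meta ...> tag, with or without quotes.
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def sniff_charset(content_type_header, body, default="utf-8"):
    """Guess the charset of an HTML response from its header or meta tag."""
    if content_type_header:
        msg = Message()
        msg["Content-Type"] = content_type_header
        charset = msg.get_content_charset()
        if charset:
            return charset
    match = META_CHARSET.search(body)
    if match:
        return match.group(1).decode("ascii").lower()
    return default


page = b'<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />'
print(sniff_charset(None, page))
```

The returned name can then be passed to decode() on the raw bytes instead of hard-coding "utf-8".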

Also: printing to the screen in IDLE may work better than with the 
standard interactive console (especially the awful Windows version). I 
have the font set to Lucida Sans Unicode (this may be Windows-specific), 
which seems to work for all BMP (Basic Multilingual Plane) chars.

-- 
Terry Jan Reedy

