| Started by | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| First post | 2012-12-16 22:10 +0100 |
| Last post | 2012-12-17 23:31 +0100 |
| Articles | 20 on this page of 21 — 7 participants |
Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-16 22:10 +0100
Re: Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-17 06:06 +0000
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 09:59 +0100
Re: Unicode Benjamin Kaplan <benjamin.kaplan@case.edu> - 2012-12-17 01:28 -0800
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 10:45 +0100
Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 11:02 +0100
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 11:17 +0100
Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 11:55 +0100
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 12:14 +0100
Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 12:56 +0100
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 18:43 +0100
Re: Unicode Dave Angel <d@davea.name> - 2012-12-17 13:07 -0500
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 19:36 +0100
Re: Unicode Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2012-12-18 00:07 +0000
Re: Unicode Vlastimil Brom <vlastimil.brom@gmail.com> - 2012-12-17 20:55 +0100
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 21:00 +0100
Re: Unicode Dave Angel <d@davea.name> - 2012-12-17 16:09 -0500
Re: Unicode Hans Mulder <hansmu@xs4all.nl> - 2012-12-17 23:02 +0100
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 23:33 +0100
Re: Unicode Terry Reedy <tjreedy@udel.edu> - 2012-12-17 17:03 -0500
Re: Unicode Anatoli Hristov <tolidtm@gmail.com> - 2012-12-17 23:31 +0100
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-16 22:10 +0100 |
| Subject | Unicode |
| Message-ID | <mailman.941.1355692240.29569.python-list@python.org> |
Hello guys,
I'm using Linux CentOS and Python 2.4 with MySQL 5.xx. I get an error
with Unicode; I tried many things that I found on the net, but none of
them work.
If I don't use UTF-8 it inserts the data into the DB, but some French
characters are not correctly decoded. Could you please help me?
Thanks
def PrepareSpecs(product_id, icecat_prod_id, icecat_image_url, name):
    """Gets the specifications of a product from Icecat.biz and insert
    them into the DB
    """
    specs = {3:GetSpecsNL(icecat_prod_id),2:GetSpecsFR(icecat_prod_id).decode('utf-8'),1:GetSpecsEN(icecat_prod_id)}
    SpecsToSQL(product_id,specs,name)
    CategorySQL(product_id)
    StoreSQL(product_id)
    GetIMG(icecat_image_url,icecat_prod_id)
    return

def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr

def RemoveHTML(specs):
    specs = specs.replace("<html>","")
    specs = specs.replace("<HTML>","")
    specs = specs.replace("</html>","")
    specs = specs.replace("</HTML>","")
    specs = specs.replace("<head>","")
    specs = specs.replace("<HEAD>","")
    specs = specs.replace("</head>","")
    specs = specs.replace("</HEAD>","")
    specs = specs.replace("<body>","")
    specs = specs.replace("</body>","")
    specs = specs.replace("<BODY>","")
    specs = specs.replace("</BODY>","")
    specs = specs.replace("<TITLE>","")
    specs = specs.replace("</TITLE>","")
    specs = specs.replace("<title>","")
    specs = specs.replace("</title>","")
    specs = specs.replace("<p>","")
    specs = specs.replace("</p>","")
    return specs

def SpecsToSQL(product_id, specs, name):
    for lang, spec in specs.iteritems():
        InsertSpecsDB(product_id, spec, lang, name)
    return

def InsertSpecsDB(product_id, spec, lang, name):
    db = MySQLdb.connect("localhost","getit","opencart")
    cursor = db.cursor()
    sql = "INSERT INTO product_description (product_id, language_id, name, description) VALUES (%s,%s,%s,%s)"
    params = (product_id, lang, name, spec)
    cursor.execute(sql, params)
    id = cursor.lastrowid
    print "Updated ID %s description %s" % (int(id), lang)
    return
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-12-17 06:06 +0000 |
| Message-ID | <50ceb674$0$29868$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #34949 |
On Sun, 16 Dec 2012 22:10:37 +0100, Anatoli Hristov wrote:
> If I dont use UTF-8 it inserts the data into the DB but some French
> char. are not correctly decoded. Could you please help me ?

What happens when you do use UTF-8?

What do you mean, "use UTF-8"?

To learn about Unicode, start here:
http://www.joelonsoftware.com/articles/Unicode.html

If that helps you solve the problem, excellent. If not, please come back
with your questions, but first read this:
http://www.sscce.org/

As given, we cannot answer your question easily, or at all, because we
cannot run your code. It gives indentation errors, you don't tell us what
modules you're using, and you haven't reduced the example down to the
critical parts that demonstrate the failure.
--
Steven
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 09:59 +0100 |
| Message-ID | <mailman.949.1355734762.29569.python-list@python.org> |
| In reply to | #34955 |
> What happens when you do use UTF-8?

This is the result when I encode the string:
" étroits, en utilisant un portable extrêmement puissant—le plus
petit et le plus léger des HP EliteBook pleine puissance—avec un
écran de diagonale 31,75 cm (12,5 pouces), idéal pour le
professionnel ultra-mobile.
"
No accents

> What do you mean, "use UTF-8"?

Trying to encode the string

> To learn about Unicode, start here:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> If that helps you solve the problem, excellent. If not, please come back
> with your questions, but first read this:

I will try to understand the logic :)

> http://www.sscce.org/
>
> As given, we cannot answer your question easily, or at all, because we
> cannot run your code. It gives indentation errors, you don't tell us what
> modules you're using, and you haven't reduced the example down to the
> critical parts that demonstrate the failure.

I didn't want to include all my code as it is 15K, and I also know my
code is crappy and you would start blaming me and saying that my code
is crap, and I know it!
Thanks
| From | Benjamin Kaplan <benjamin.kaplan@case.edu> |
|---|---|
| Date | 2012-12-17 01:28 -0800 |
| Message-ID | <mailman.951.1355736494.29569.python-list@python.org> |
| In reply to | #34955 |
On Mon, Dec 17, 2012 at 12:59 AM, Anatoli Hristov <tolidtm@gmail.com> wrote:
>> What happens when you do use UTF-8?
> This is the result when I encode the string:
> " étroits, en utilisant un portable extrêmement puissant—le plus
> petit et le plus léger des HP EliteBook pleine puissance—avec un
> écran de diagonale 31,75 cm (12,5 pouces), idéal pour le
> professionnel ultra-mobile.
> "
> No accents
>>
>> What do you mean, "use UTF-8"?
>
> Trying to encode the string

What's your terminal's encoding? That looks like you have a CP-1252
terminal trying to output UTF-8 text.
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 10:45 +0100 |
| Message-ID | <mailman.952.1355737520.29569.python-list@python.org> |
| In reply to | #34955 |
> What's your terminal's encoding? That looks like you have a CP-1252
> terminal trying to output UTF-8 text.

Thanks for your answer, I tried <locale> in my terminal and it gives
this as an output:

LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE="en_US"
LC_MONETARY="en_US"
LC_MESSAGES="en_US"
LC_PAPER="en_US"
LC_NAME="en_US"
LC_ADDRESS="en_US"
LC_TELEPHONE="en_US"
LC_MEASUREMENT="en_US"
LC_IDENTIFICATION="en_US"
LC_ALL=
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2012-12-17 11:02 +0100 |
| Message-ID | <mailman.953.1355738534.29569.python-list@python.org> |
| In reply to | #34955 |
2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> What happens when you do use UTF-8?
> This is the result when I encode the string:
> " étroits, en utilisant un portable extrêmement puissant—le plus
> petit et le plus léger des HP EliteBook pleine puissance—avec un
> écran de diagonale 31,75 cm (12,5 pouces), idéal pour le
> professionnel ultra-mobile.
> "
> No accents
>
Hi,
if you only see encoding problems when printing results to your
terminal, its settings or Unicode capability might be the cause;
however, if you also get badly encoded items in the database, you are
likely using an inappropriate encoding in some step.
You seem to be doing something like the following (explicitly or
partly implicitly, based on your system defaults):
>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
étroits, en utilisant un portable extrêmement puissant
>>>
i.e. encoding a text using utf-8 and handling it like windows-1252
afterwards (or taking an already encoded text and decoding it with the
inappropriate ANSI encoding).
hth,
vbr
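
The round trip described in this post can be reproduced without a terminal in the way. The following is a minimal sketch in Python 3 syntax (the thread itself uses Python 2); the variable names and the shortened sample text are illustrative only.

```python
# Text as it should look.
original = "étroits, extrêmement puissant"

# Encode as UTF-8, then wrongly decode the same bytes as Windows-1252.
mangled = original.encode("utf-8").decode("windows-1252")
print(ascii(mangled))  # ascii() avoids depending on the terminal's encoding

# As long as no byte was lost on the way, reversing the two steps
# restores the original text.
repaired = mangled.encode("windows-1252").decode("utf-8")
print(repaired == original)  # prints True
```

Each accented letter turns into two characters because its UTF-8 form is two bytes, and Windows-1252 maps every byte to one character.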
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 11:17 +0100 |
| Message-ID | <mailman.954.1355739439.29569.python-list@python.org> |
| In reply to | #34955 |
> if you only see encoding problems on printing results to your
> terminal, its settings or unicode capability might be the cause,
> however, if you also get badly encoding items in the database, you are
> likely using an inappropriate encoding in some step.
I get badly encoded text in my DB
> you seem to be doing something like the following (explicitly or
> partly implicitly, based on your system defaults):
>
>>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
> étroits, en utilisant un portable extrêmement puissant
>>>>
>
> i.e. encode a text using utf-8 and handling it like windows-1252
> afterwards (or take an already encoded text and decode it with the
> inappropriate ANSI encoding.
Thank you Vlastimil,
I tried to print it as you showed me, but I receive an error:
>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192'
in position 1: ordinal not in range(256)
>>>
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2012-12-17 11:55 +0100 |
| Message-ID | <mailman.955.1355741714.29569.python-list@python.org> |
| In reply to | #34955 |
2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> if you only see encoding problems on printing results to your
>> terminal, its settings or unicode capability might be the cause,
>> however, if you also get badly encoding items in the database, you are
>> likely using an inappropriate encoding in some step.
>
> I get badly encoding into my DB
>
>> you seem to be doing something like the following (explicitly or
>> partly implicitly, based on your system defaults):
>>
>>>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
>> étroits, en utilisant un portable extrêmement puissant
>>>>>
>>
>> i.e. encode a text using utf-8 and handling it like windows-1252
>> afterwards (or take an already encoded text and decode it with the
>> inappropriate ANSI encoding.
>
> Thank you Vlastimil,
>
> I tried to print it as you sholed mr, but I receive an erro:
>>>> print u"étroits, en utilisant un portable extrêmement puissant".encode("utf-8").decode("windows-1252")
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u0192'
> in position 1: ordinal not in range(256)
>>>>
Hi,
this seems to be an encoding error of your terminal on printing.
You may need to describe (or better post the respective parts of the
source) where the text is coming from (external text file, database
entry, hardcoded in the python source ...), how it is stored, retrieved
and possibly manipulated before you insert it to the database.
You may try to print a repr(...) of the string to be inserted to the
database to see, whether it isn't already mangled in some previous
part of the processing.
hth,
vbr
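
As an illustration of the repr() suggestion, here is a small sketch in Python 3 syntax, where ascii() plays the role that repr() has on a Python 2 unicode string. The sample bytes are an assumption chosen for the example and do not come from the Icecat page.

```python
raw = b"l\xc3\xa9ger"                 # UTF-8 bytes for "léger"

good = raw.decode("utf-8")            # decoded with the right codec
bad = raw.decode("windows-1252")      # decoded with the wrong codec

# The escaped form shows exactly which code points the string holds,
# no matter what the terminal can display.
print(ascii(good))  # 'l\xe9ger'
print(ascii(bad))   # 'l\xc3\xa9ger'
```

One escape (`\xe9`) where one accented letter is expected means the text is intact; a pair such as `\xc3\xa9` means it was already mangled before reaching this point.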
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 12:14 +0100 |
| Message-ID | <mailman.956.1355742852.29569.python-list@python.org> |
| In reply to | #34955 |
> this seems to be an encoding error of your terminal on printing.
> You may need to describe (or better post the respective parts of the
> source) where the text is coming from (external text file, database
> entry, harcoded in the python source ...), how it is stored, retrieved
> and possibly manipulated before you insert it to the database.
>
Here is how I get the data using the urllib opener:
def GetSpecsFR(icecat_prod_id):
    opener = urllib.FancyURLopener({})
    ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
                      % icecat_prod_id)
    specsfr = ffr.read()
    #specsfr = specsfr.decode('utf-8')
    specsfr = RemoveHTML(specsfr)
    ##specsfr = "%r" % specsfr
    ## if specsfr:
    ##     try:
    ##         specsfr = str(specsfr)
    ##     except UnicodeEncodeError:
    ##         specsfr = str(specsfr.encode('utf-16'))
    return specsfr
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2012-12-17 12:56 +0100 |
| Message-ID | <mailman.957.1355745371.29569.python-list@python.org> |
| In reply to | #34955 |
2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> this seems to be an encoding error of your terminal on printing.
>> You may need to describe (or better post the respective parts of the
>> source) where the text is coming from (external text file, database
>> entry, harcoded in the python source ...), how it is stored, retrieved
>> and possibly manipulated before you insert it to the database.
>>
> Here is how I get the data using the urllib opener:
>
> def GetSpecsFR(icecat_prod_id):
> opener = urllib.FancyURLopener({})
> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr"
> % icecat_prod_id)
> specsfr = ffr.read()
> #specsfr = specsfr.decode('utf-8')
> specsfr = RemoveHTML(specsfr)
> ##specsfr = "%r" % specsfr
> ## if specsfr:
> ## try:
> ## specsfr = str(specsfr)
> ## except UnicodeEncodeError:
> ## specsfr = str(specsfr.encode('utf-16'))
> return specsfr
Hi,
I don't know, what the product ID would look like, for this page, but
assuming, the catalog pages are also utf-8 encoded as well as the
error page I get, it should work ok; cf.:
>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (1234,))
>>> src = ffr.read()
>>> print src.decode("utf-8")
<!-- This Icecat template is used as head of all pages in Product finder -->
<HTML>
<HEAD>
[... - shortened]
<div align="center">"Désolé, pour ce produit, nous n'avons pas trouvé
d'autres informations produit.<br>Si vous n'êtes pas redirigés
automatiquement, veuillez cliquer" <a href="#" style="font-size:80%"
onclick="history.back()">ici</a>
</div>
<!--
<td bgcolor="" width="230" align="center"><img
src="/imgs/logo.gif" width="180" height="58"></td>
-->
>>>
Printing on a Unicode-capable shell works ok (wx PyShell in my case);
inserting into the database should be straightforward too (although I
don't have experience with the specific db you are using).
Are you getting other Unicode errors in other parts of the process,
or do the above steps work differently on your computer?
hth,
vbr
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 18:43 +0100 |
| Message-ID | <mailman.974.1355766240.29569.python-list@python.org> |
| In reply to | #34955 |
> Hi,
> I don't know, what the product ID would look like, for this page, but
> assuming, the catalog pages are also utf-8 encoded as well as the
> error page I get, it should work ok; cf.:
You are right, I got it to work on Windows too, but not on Linux. I
changed the codec on Linux, but I still can't get it to work.
Here is what I get on Linux:
>>> import urllib
>>> opener = urllib.FancyURLopener({})
>>> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (14688538))
>>> src = ffr.read()
>>> print src.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122'
in position 17167: ordinal not in range(256)
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-12-17 13:07 -0500 |
| Message-ID | <mailman.979.1355767688.29569.python-list@python.org> |
| In reply to | #34955 |
On 12/17/2012 12:43 PM, Anatoli Hristov wrote:
>> Hi,
>> I don't know, what the product ID would look like, for this page, but
>> assuming, the catalog pages are also utf-8 encoded as well as the
>> error page I get, it should work ok; cf.:
> You are right, I get it work on Windows too, but not in Linux. I
> changed the codec of linux, but still I don't get it
>
> Here is what I get from Linux:
>
>>>> import urllib
>>>> opener = urllib.FancyURLopener({})
>>>> ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (14688538))
>>>> src = ffr.read()
>>>> print src.decode("utf-8")
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122'
> in position 17167: ordinal not in range(256)
I can tell you what's happening, but maybe not how to fix it.
src.decode() is creating a unicode string. The error is not happening
there. But when print is used with a unicode string, it has to encode
the data. And for whatever reason, yours is using latin-1, and you have
a character in there which is not in the latin-1 encoding.
My python 2.7 uses utf-8 everywhere (on Linux Ubuntu 11.04).
--
DaveA
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 19:36 +0100 |
| Message-ID | <mailman.982.1355769364.29569.python-list@python.org> |
| In reply to | #34955 |
> src.decode() is creating a unicode string. The error is not happening
> there. But when print is used with a unicode string, it has to encode
> the data. And for whatever reason, yours is using latin-1, and you have
> a character in there which is not in the latin-1 encoding.

I fixed the print, I changed the setting of the terminal and also on
the sshconfig, so now when I print I'm able to print out without
problems, but when I tried to run the script I've made it gives me
again the same error :
""Unexpected error: exceptions.UnicodeEncodeError
"""
Maybe I will try to update to 2.7
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2012-12-18 00:07 +0000 |
| Message-ID | <50cfb3d9$0$29991$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #35004 |
On Mon, 17 Dec 2012 19:36:01 +0100, Anatoli Hristov wrote:
>> src.decode() is creating a unicode string. The error is not happening
>> there. But when print is used with a unicode string, it has to encode
>> the data. And for whatever reason, yours is using latin-1, and you
>> have a character in there which is not in the latin-1 encoding.
> I fixed the print, I changed the setting of the terminal and also on the
> sshconfig, so now when I print I'm able to print out without problems,
> but when I tried to run the script I've made it gives me again the same
> error :
> ""Unexpected error: exceptions.UnicodeEncodeError """

That is not a full Python traceback. Python gives you lots of debugging
information, in the form of a complete traceback. Use those tracebacks,
don't ignore them. Trying to debug code without the full traceback is
like trying to read a book by reading only every third page.
--
Steven
| From | Vlastimil Brom <vlastimil.brom@gmail.com> |
|---|---|
| Date | 2012-12-17 20:55 +0100 |
| Message-ID | <mailman.987.1355774107.29569.python-list@python.org> |
| In reply to | #34955 |
2012/12/17 Anatoli Hristov <tolidtm@gmail.com>:
>> src.decode() is creating a unicode string. The error is not happening
>> there. But when print is used with a unicode string, it has to encode
>> the data. And for whatever reason, yours is using latin-1, and you have
>> a character in there which is not in the latin-1 encoding.
> I fixed the print, I changed the setting of the terminal and also on
> the sshconfig, so now when I print I'm able to print out without
> problems, but when I tried to run the script I've made it gives me
> again the same error :
> ""Unexpected error: exceptions.UnicodeEncodeError
> """
> Maybe I will try to update to 2.7
> --
> http://mail.python.org/mailman/listinfo/python-list

Well, we don't see the context or traceback of that error, but it
looks like a MySQL error on inserting data.
Could it be that your database is not Unicode-enabled, e.g. utf-8, but,
say, latin-1?
I don't have experience with this database, but I guess there
must be some configuration options for this.
Would maybe setting the encoding in db.connect(...) work?
cf.:
http://stackoverflow.com/questions/8365660/python-mysql-unicode-and-encoding
Hopefully, others might give more reliable suggestions.
hth,
vbr
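
One way the db.connect(...) suggestion might look, assuming the MySQLdb driver from the original post and its `charset` and `use_unicode` connection arguments. The helper name `utf8_connect_kwargs` is hypothetical, and the host and user are simply the first two positional values from the posted code; nothing here is confirmed as the fix by the thread.

```python
def utf8_connect_kwargs(password):
    """Keyword arguments asking MySQLdb to talk UTF-8 to the server.

    Hypothetical helper: host and user are taken from the original post,
    the password is passed in rather than hard-coded.
    """
    return dict(
        host="localhost",
        user="getit",
        passwd=password,
        charset="utf8",      # encoding used on the client/server connection
        use_unicode=True,    # text columns come back as unicode objects
    )

# Usage, assuming MySQLdb is installed and the server is reachable:
#   import MySQLdb
#   db = MySQLdb.connect(**utf8_connect_kwargs("..."))
print(utf8_connect_kwargs("...")["charset"])
```

The connection setting only covers the link between client and server; the table and its columns would also need a character set that can hold the data.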
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 21:00 +0100 |
| Message-ID | <mailman.988.1355774436.29569.python-list@python.org> |
| In reply to | #34955 |
> I fixed the print, I changed the setting of the terminal and also on
> the sshconfig, so now when I print I'm able to print out without
> problems, but when I tried to run the script I've made it gives me
> again the same error :
> ""Unexpected error: exceptions.UnicodeEncodeError
> """
> Maybe I will try to update to 2.7

Upgraded to Python 2.7 and still it gives Unexpected error:
exceptions.UnicodeEncodeError. Damn encoders, I don't know what to
do...
| From | Dave Angel <d@davea.name> |
|---|---|
| Date | 2012-12-17 16:09 -0500 |
| Message-ID | <mailman.991.1355778571.29569.python-list@python.org> |
| In reply to | #34955 |
On 12/17/2012 03:00 PM, Anatoli Hristov wrote:
>> I fixed the print, I changed the setting of the terminal and also on
>> the sshconfig, so now when I print I'm able to print out without
>> problems, but when I tried to run the script I've made it gives me
>> again the same error :
>> ""Unexpected error: exceptions.UnicodeEncodeError
>> """
That's not the whole error message. What encoding does it report in the
error?
>> Maybe I will try to update to 2.7
> Upgraded to python 27 and still it gives Unexpected error:
> exceptions.UnicodeEncodeError. Damn encoders I don'y know what to
> do...
I doubted that 2.7 would make any difference.
1. What does your "terminal" expect? (For all I know you're using
TeraTermPro as a terminal, which doesn't support utf-8.)
Have you looked at the terminal encoding to see what your copy of
Terminal is expecting? On my Ubuntu Linux, I open the terminal with
Ctrl-Alt-t, then in the menu bar, I select
Terminal->SetCharacterEncoding->utf-8
2. What does your environment tell Linux to support? At a bash prompt, try
echo $LANG (there are two other environment variables I've seen
reference to, so this aspect is nuts)
Mine says
en_US.UTF-8
3. What does Python think it was told?
import sys
print sys.stdout.encoding
Mine says
UTF-8
I can force a similar error as follows:
import urllib
opener = urllib.FancyURLopener({})
ffr = opener.open("http://prf.icecat.biz/index.cgi?product_id=%s;mi=start;smi=product;shopname=openICEcat-url;lang=fr" % (14688538))
src = ffr.read()
out = src.decode("utf-8").encode("latin-1")
Traceback (most recent call last):
File "anatoli3.py", line 9, in <module>
src.decode("utf-8").encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 17167: ordinal not in range(256)
And from that it's quite clear that for that particular data, I cannot
use a latin-1 encoder.
So I did a bit of hunting, and I find the offending character is the one
after the word "Core" in the following quote:
processeurs Intel® Core™ de 3ème génération
The symbol is a trademark symbol and is not part of latin-1. If you're
really stuck with a latin-1 terminal, then you could do something like:
print src.decode("utf-8").encode("latin-1", "ignore")
That says to decode it using utf-8 (because the html declared a utf-8
encoding), and encode it back to latin-1 (because your terminal is stuck
there), then print.
Just realize that once you start using 'ignore' you're going to also
ignore discrepancies that are real. For example, maybe your terminal is
actually something other than either latin-1 or utf-8.
For others that just want to play with a minimal subset:
test = u'processeurs Intel\xae Core\u2122 de 3\xe8me g\xe9n\xe9ration av'
print test
print test.encode("latin-1", "ignore")
print test.encode("latin-1")
produces :
processeurs Intel® Core™ de 3ème génération av
processeurs Intel� Core de 3�me g�n�ration av
Traceback (most recent call last):
File "anatoli3.py", line 22, in <module>
print test.encode("latin-1")
UnicodeEncodeError: 'latin-1' codec can't encode character u'\u2122' in
position 23: ordinal not in range(256)
--
DaveA
| From | Hans Mulder <hansmu@xs4all.nl> |
|---|---|
| Date | 2012-12-17 23:02 +0100 |
| Message-ID | <50cf966b$0$6969$e4fe514c@news2.news.xs4all.nl> |
| In reply to | #35014 |
On 17/12/12 22:09:04, Dave Angel wrote:
> print src.decode("utf-8").encode("latin-1", "ignore")
>
> That says to decode it using utf-8 (because the html declared a utf-8
> encoding), and encode it back to latin-1 (because your terminal is stuck
> there), then print.
>
>
> Just realize that once you start using 'ignore' you're going to also
> ignore discrepancies that are real. For example, maybe your terminal is
> actual something other than either latin-1 or utf-8.
If you need to see such discrepancies, you can do
print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")
That would produce something like:
processeurs Intel® Core&#8482; de 3ème génération av
that is, the problem characters are displayed in &#...; notation.
That is ugly, but sometimes it's the only way to see what character
you really have.
Notice that the number you get is in decimal, where the \u....
notation uses hex:
>>> ord(u"\u2122")
8482
>>>
Hope this helps,
-- HansM
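
For comparison, the error handlers mentioned in the last two posts can be tried side by side. This is a sketch in Python 3 syntax on a shortened sample string; "replace" and "backslashreplace" are two further standard handlers that the thread does not mention.

```python
# ® and è exist in latin-1, the trademark sign ™ (U+2122) does not.
text = "Intel\u00ae Core\u2122 de 3\u00e8me"

results = {}
for handler in ("ignore", "replace", "xmlcharrefreplace", "backslashreplace"):
    # Each handler decides what to do with characters latin-1 cannot hold.
    results[handler] = text.encode("latin-1", handler)
    print(handler, results[handler])

# The character reference uses decimal, the \u escape uses hexadecimal.
codepoint = ord("\u2122")
print(codepoint, hex(codepoint))  # prints 8482 0x2122
```

Only the trademark sign is affected in each case; the characters that latin-1 can represent are encoded normally.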
| From | Anatoli Hristov <tolidtm@gmail.com> |
|---|---|
| Date | 2012-12-17 23:33 +0100 |
| Message-ID | <mailman.999.1355783638.29569.python-list@python.org> |
| In reply to | #35020 |
>> Just realize that once you start using 'ignore' you're going to also
>> ignore discrepancies that are real. For example, maybe your terminal is
>> actual something other than either latin-1 or utf-8.
>
> If you need to see such discrepancies, you can do
>
> print src.decode("utf-8").encode("latin-1", "xmlcharrefreplace")
>
>
> That would produce something like:
>
> processeurs Intel® Core&#8482; de 3ème génération av
>
> that is, the problem characters are displayed in &#...; notation.
> That is ugly, but sometimes it's the only way to see what character
> you really have.
>
> Notice that the number you get is in decimal, where the \u....
> notation uses hex:
Thanks guys, my issue is now solved. The problem came from my PuTTY
client: it was on latin1 by default, and after changing it to utf-8 it
now works...
| From | Terry Reedy <tjreedy@udel.edu> |
|---|---|
| Date | 2012-12-17 17:03 -0500 |
| Message-ID | <mailman.997.1355781824.29569.python-list@python.org> |
| In reply to | #34955 |
On 12/17/2012 3:00 PM, Anatoli Hristov wrote:
>> I fixed the print, I changed the setting of the terminal and also on
>> the sshconfig, so now when I print I'm able to print out without
>> problems, but when I tried to run the script I've made it gives me
>> again the same error :
>> ""Unexpected error: exceptions.UnicodeEncodeError
>> """
>> Maybe I will try to update to 2.7
>
> Upgraded to python 27 and still it gives Unexpected error:
> exceptions.UnicodeEncodeError. Damn encoders I don'y know what to
> do...

If you are working with unicode, and you can upgrade to 3.3, you will
probably be happier if you do. This does not solve all problems, but the
python side is definitely better. (IE, there are unicode bugs in 2.7
whose fix *is* to upgrade to 3.3.)

That said, retrieving
http://prf.icecat.biz/index.cgi?product_id=14688538;mi=start;smi=product;shopname=openICEcat-url;lang=fr
with Firefox on Win 7 returns a page containing
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
so I presume the http encoding is also utf-8.

Also: printing to the screen in IDLE may work better than with the
standard interactive console (especially the awful Windows version). I
have the font set to Lucida Sans Unicode (this may be windows specific)
which seems to work for all BMP (Basic Multilingual Plane) chars.
--
Terry Jan Reedy
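
Rather than presuming the HTTP-level encoding, it can be read from the Content-Type header. The sketch below, in Python 3 syntax, uses a hand-built header object as a stand-in for a real response so that it runs without network access; the header value is an assumption modelled on the meta tag quoted above.

```python
from email.message import Message

# Stand-in for the headers of an HTTP response. With urllib.request in
# Python 3, response.headers is a subclass of this Message class.
headers = Message()
headers["Content-Type"] = "text/html; charset=utf-8"

# get_content_charset() returns None when no charset is declared;
# falling back to UTF-8 here is an assumption, not something the server says.
charset = headers.get_content_charset() or "utf-8"
print(charset)  # prints utf-8
```

The bytes read from the response would then be decoded with that value, e.g. `body.decode(charset)`.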