
requests.Session() how do you set 'replace' on the encoding?

Started by	Veek M <vek.m1234@gmail.com>
First post	2015-07-02 21:52 +0530
Last post	2015-07-09 15:55 +0530
Articles	5 — 2 participants


  requests.Session() how do you set 'replace' on the encoding? Veek M <vek.m1234@gmail.com> - 2015-07-02 21:52 +0530
    Re: requests.Session() how do you set 'replace' on the encoding? dieter <dieter@handshake.de> - 2015-07-03 07:59 +0200
      Re: requests.Session() how do you set 'replace' on the encoding? Veek M <vek.m1234@gmail.com> - 2015-07-06 15:06 +0530
        Re: requests.Session() how do you set 'replace' on the encoding? dieter <dieter@handshake.de> - 2015-07-07 07:38 +0200
          Re: requests.Session() how do you set 'replace' on the encoding? Veek M <vek.m1234@gmail.com> - 2015-07-09 15:55 +0530

#93442 — requests.Session() how do you set 'replace' on the encoding?

From	Veek M <vek.m1234@gmail.com>
Date	2015-07-02 21:52 +0530
Subject	requests.Session() how do you set 'replace' on the encoding?
Message-ID	<mn3oec$7ep$1@dont-email.me>

I'm getting a Unicode error:

Traceback (most recent call last):
  File "fooxxx.py", line 56, in <module>
    parent = anchor.getparent()
UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 8: illegal multibyte sequence

I'm doing:
s = requests.Session()
to pull data in, so how do I 'replace' chars that don't fit gbk?


#93461

From	dieter <dieter@handshake.de>
Date	2015-07-03 07:59 +0200
Message-ID	<mailman.266.1435903179.3674.python-list@python.org>
In reply to	#93442

Veek M <vek.m1234@gmail.com> writes:

> I'm getting a Unicode error:
>
> Traceback (most recent call last):
>   File "fooxxx.py", line 56, in <module>
>     parent = anchor.getparent()
> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 
> 8: illegal multibyte sequence

You give us very little context.

Using "getparent" seems to indicate that you are doing something with
hierarchies, likely some XML processing. In this case,
the XML document likely specified "gbk" as document encoding
(otherwise, you would get the default "utf-8") -- and it got it wrong
(which should not happen).

In general: when you need control over encoding handling because
deep in a framework an encoding causes problems (as apparently in
your case), you can usually first take the plain text,
fix any encoding problems and only then pass the fixed text to
your framework.

> I'm doing:
> s = requests.Session()
> to suck data in, so.. how do i 'replace' chars that fit gbk

It does not seem that the problem occurs inside the "requests" module.
Thus, you have a chance to "intercept" the downloaded text
and fix encoding problems.
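
A minimal sketch of such an interception, assuming Python 3 and assuming the raw bytes of the page are available (with requests, something like the response's "content" attribute). The helper name decode_page and the sample bytes are made up for illustration; they are not from the original code.

```python
def decode_page(raw_bytes, encoding="gbk"):
    """Decode downloaded bytes, substituting U+FFFD for undecodable input."""
    return raw_bytes.decode(encoding, errors="replace")


# Made-up sample: valid gbk text with one stray byte in the middle.
sample = u"\u4e2d\u6587".encode("gbk") + b"\xff" + b" ok"

try:
    sample.decode("gbk")        # strict decoding rejects the stray byte
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_page(sample)      # lenient decoding keeps the rest of the page
print(strict_failed, repr(text))
```

The cleaned text can then be handed to the XML/HTML framework instead of whatever the download layer decoded on its own.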


#93524

From	Veek M <vek.m1234@gmail.com>
Date	2015-07-06 15:06 +0530
Message-ID	<mndi4e$kd6$1@dont-email.me>
In reply to	#93461

dieter wrote:

> Veek M <vek.m1234@gmail.com> writes:
>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>> position 8: illegal multibyte sequence
> 
> You give us very little context.

It's a longish chunk of code: basically, I'm trying to download using
'requests.Session', which should give me Unicode once it's told which
encoding is being used ('gbk').

def get_page(s, url):
    print(url)
    r = s.get(url, headers = {
          'User-Agent' : user_agent,
          'Keep-Alive' : '3600',
          'Connection' : 'keep-alive',
          })
    s.encoding='gbk'
    text = r.text
    return text

# Open output file
fh=codecs.open('/tmp/out', 'wb')
fh.write(header)

# Download
s = requests.Session()
------------

If 'text' is NOT proper unicode because the server introduced some junk, 
then when i do anchor.getparent() on my 'text' it'll traceback..
ergo the question, how do i set a replacement char within 'requests'

> In general: when you need control over encoding handling because
> deep in a framework an econding causes problems (as apparently in
> your case), you can usually first take the plain text,
> fix any encoding problems and only then pass the fixed text to
> your framework.
> 
>> I'm doing:
>> s = requests.Session()
>> to suck data in, so.. how do i 'replace' chars that fit gbk
> 
> It does not seem that the problem occurs inside the "requests" module.
> Thus, you have a chance to "intercept" the downloaded text
> and fix encoding problems.

Okay, so I should use the 'raw' method in requests, clean up the raw
text, and then convert that to Unicode, rather than trying to do it using
'requests'? The thing is, 'codecs' has xmlcharrefreplace_errors(...) etc.,
so I figured if output has cleanup, input ought to have it too :p
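
For reference, a small sketch of how the standard error handlers behave on the output side, assuming Python 3. The sample string and the temporary file are invented for the example. Note that "xmlcharrefreplace" only applies when encoding; for decoding, the usual choices are "replace" and "ignore".

```python
import io
import os
import tempfile

# U+00A0 (no-break space) is the character named in the traceback above.
text = u"price:\xa0100"

strict_error = None
try:
    text.encode("gbk")                      # default errors="strict"
except UnicodeEncodeError as exc:
    strict_error = exc

replaced = text.encode("gbk", errors="replace")
as_charref = text.encode("gbk", errors="xmlcharrefreplace")

# The same handler can be set once on an output file.
path = os.path.join(tempfile.mkdtemp(), "out")
with io.open(path, "w", encoding="gbk", errors="xmlcharrefreplace") as fh:
    fh.write(text)
with io.open(path, "rb") as fh:
    written = fh.read()

print(type(strict_error).__name__, replaced, as_charref, written)
```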


#93548

From	dieter <dieter@handshake.de>
Date	2015-07-07 07:38 +0200
Message-ID	<mailman.337.1436247510.3674.python-list@python.org>
In reply to	#93524

Veek M <vek.m1234@gmail.com> writes:

> dieter wrote:
>
>> Veek M <vek.m1234@gmail.com> writes:
>>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>>> position 8: illegal multibyte sequence
>> 
>> You give us very little context.
>
> It's a longish chunk of code: basically, i'm trying to download using the 
> 'requests.Session' module and that should give me Unicode once it's told 
> what encoding is being used 'gbk'.
>
> def get_page(s, url):
>     print(url)
>     r = s.get(url, headers = {
>           'User-Agent' : user_agent,
>           'Keep-Alive' : '3600',
>           'Connection' : 'keep-alive',
>           })
>     s.encoding='gbk'

It looks strange that you can set "s.encoding" after you have
called "s.get" - but, as you apparently get an error related to
the "gbk" encoding, it seems to work.

>     text = r.text
>     return text
>
> # Open output file
> fh=codecs.open('/tmp/out', 'wb')
> fh.write(header)
>
> # Download
> s = requests.Session()
> ------------
>
> If 'text' is NOT proper unicode because the server introduced some junk, 
> then when i do anchor.getparent() on my 'text' it'll traceback..
> ergo the question, how do i set a replacement char within 'requests'

I see the following options for you:

  *  you look at the code (of "requests.Session"),
     determine where the "s.encoding" is taken care of and
     check whether it also supports a replacement strategy there.
     Then, you use this knowledge to set up your replacement.

  *  you avoid the "unicode" translating functionality of
     "requests.Session". If it does not directly support this,
     you can trick it using the "iso-8859-1" encoding (this maps
     bytes to the first 256 unicode codepoints in a one-to-one way)
     and then do the unicode handling in your own code -- with
     facilities you already know of (including replacement)

  *  you contact the website administrator and ask him why
     the delivered pages do not contain valid "gbk" content.
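
The second option could look roughly like the sketch below, assuming Python 3. The helper name redecode and the sample bytes are hypothetical. In requests this would presumably correspond to setting the encoding on the response object to "iso-8859-1" before reading its text; here a plain bytes object stands in for the download so the example runs without a network.

```python
def redecode(latin1_text, encoding="gbk", errors="replace"):
    """Undo a lossless iso-8859-1 decode, then decode with the real codec."""
    raw = latin1_text.encode("iso-8859-1")   # recovers the original bytes
    return raw.decode(encoding, errors)


# Made-up page body: valid gbk text plus one junk byte from the server.
raw_bytes = u"\u4e2d\u6587".encode("gbk") + b"\xff" + b" ok"

# Stand-in for what the download layer returns when told "iso-8859-1":
# every byte becomes the code point with the same number, so nothing is lost.
as_latin1 = raw_bytes.decode("iso-8859-1")
round_trip_ok = as_latin1.encode("iso-8859-1") == raw_bytes

fixed = redecode(as_latin1)                  # junk byte becomes U+FFFD
print(round_trip_ok, repr(fixed))
```

If the raw bytes are directly available, decoding them once with errors="replace" gives the same result without the round trip.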


#93590

From	Veek M <vek.m1234@gmail.com>
Date	2015-07-09 15:55 +0530
Message-ID	<mnli4k$o08$1@dont-email.me>
In reply to	#93548

dieter wrote:
> 
> It looks strange that you can set "s.encoding" after you have
> called "s.get" - but, as you apparently get an error related to
> the "gbk" encoding, it seems to work.

Ooo! Sorry, typo - that was outside the function but before the call. 
Unfortunately whilst improving my function for Usenet, i screwed it up even 
further. (just ignore that, as you have)

to the rest of your answer - many thanks, will do.

