Re: requests.Session() how do you set 'replace' on the encoding?

From Veek M <vek.m1234@gmail.com>
Newsgroups comp.lang.python
Subject Re: requests.Session() how do you set 'replace' on the encoding?
Date 2015-07-06 15:06 +0530
Organization Home
Message-ID <mndi4e$kd6$1@dont-email.me> (permalink)
References <mn3oec$7ep$1@dont-email.me> <mailman.266.1435903179.3674.python-list@python.org>


dieter wrote:

> Veek M <vek.m1234@gmail.com> writes:
>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>> position 8: illegal multibyte sequence
> 
> You give us very little context.

It's a longish chunk of code. Basically, I'm trying to download a page with 
'requests.Session', which should give me Unicode once it's told that the 
encoding in use is 'gbk'.

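import codecs
import requests

def get_page(s, url):
    # user_agent is defined elsewhere in the full script
    print(url)
    r = s.get(url, headers = {
          'User-Agent' : user_agent,
          'Keep-Alive' : '3600',
          'Connection' : 'keep-alive',
          })
    # encoding is an attribute of the response, not the session;
    # set it before reading r.text
    r.encoding='gbk'
    text = r.text
    return text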

# Open output file
fh=codecs.open('/tmp/out', 'wb')
fh.write(header)

# Download
s = requests.Session()
------------

If 'text' is NOT proper Unicode because the server introduced some junk, 
then calling anchor.getparent() on the parsed 'text' will raise a traceback.
Hence the question: how do I set a replacement character within 'requests'?
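
For what it's worth, here is a minimal sketch of decoding downloaded bytes 
with a replacement character, assuming Python 3. The helper name 
decode_with_replacement and the sample bytes are made up for illustration; 
with requests, the undecoded body would come from r.content. As far as I can 
tell, r.text already decodes with errors='replace', so junk bytes should come 
out as U+FFFD rather than raising.

```python
def decode_with_replacement(raw, encoding='gbk'):
    """Decode bytes, turning undecodable bytes into U+FFFD."""
    return raw.decode(encoding, errors='replace')

# Stand-in for r.content: valid GBK text, one junk byte, then ASCII.
sample = '中文'.encode('gbk') + b'\xff' + b'ok'
text = decode_with_replacement(sample)
print(repr(text))
```

The other standard handlers for decoding are 'strict' (the default, which 
raises) and 'ignore' (which silently drops the bad bytes).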

> In general: when you need control over encoding handling because
> deep in a framework an encoding causes problems (as apparently in
> your case), you can usually first take the plain text,
> fix any encoding problems and only then pass the fixed text to
> your framework.
> 
>> I'm doing:
>> s = requests.Session()
>> to suck data in, so.. how do I 'replace' chars that don't fit gbk
> 
> It does not seem that the problem occurs inside the "requests" module.
> Thus, you have a chance to "intercept" the downloaded text
> and fix encoding problems.

Okay, so I should use the 'raw' method in requests, clean up the raw text, 
and then convert that to Unicode, rather than trying to do it within 
'requests'? The thing is, 'codecs' has xmlcharrefreplace_errors(...) etc., so 
I figured that if output has cleanup, input ought to have it too :p
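
The output-side cleanup mentioned above can be sketched like this, assuming 
Python 3. The helper name encode_for_output and the sample string are 
hypothetical; 'xmlcharrefreplace' is the standard error-handler name that 
corresponds to codecs.xmlcharrefreplace_errors, and it only applies when 
encoding, not when decoding.

```python
def encode_for_output(text, encoding='gbk'):
    """Encode text, writing unencodable characters as &#NNN; references."""
    return text.encode(encoding, errors='xmlcharrefreplace')

# U+00A0 (no-break space) is the character from the traceback above.
out = encode_for_output('price:\xa0100')
print(out)
```

That is probably only appropriate when the output is HTML or XML, since the 
numeric references would otherwise show up literally in the file.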
