


Re: requests.Session() how do you set 'replace' on the encoding?

From dieter <dieter@handshake.de>
Subject Re: requests.Session() how do you set 'replace' on the encoding?
Date 2015-07-07 07:38 +0200
References <mn3oec$7ep$1@dont-email.me> <mailman.266.1435903179.3674.python-list@python.org> <mndi4e$kd6$1@dont-email.me>
Newsgroups comp.lang.python
Message-ID <mailman.337.1436247510.3674.python-list@python.org> (permalink)



Veek M <vek.m1234@gmail.com> writes:

> dieter wrote:
>
>> Veek M <vek.m1234@gmail.com> writes:
>>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>>> position 8: illegal multibyte sequence
>> 
>> You give us very little context.
>
> It's a longish chunk of code: basically, i'm trying to download using the 
> 'requests.Session' module and that should give me Unicode once it's told 
> what encoding is being used 'gbk'.
>
> def get_page(s, url):
>     print(url)
>     r = s.get(url, headers = {
>           'User-Agent' : user_agent,
>           'Keep-Alive' : '3600',
>           'Connection' : 'keep-alive',
>           })
>     s.encoding='gbk'

It looks strange that you can set "s.encoding" after you have
called "s.get" - but, as you apparently get an error related to
the "gbk" encoding, it seems to work.

>     text = r.text
>     return text
>
> # Open output file
> fh=codecs.open('/tmp/out', 'wb')
> fh.write(header)
>
> # Download
> s = requests.Session()
> ------------
>
> If 'text' is NOT proper unicode because the server introduced some junk, 
> then when i do anchor.getparent() on my 'text' it'll traceback..
> ergo the question, how do i set a replacement char within 'requests'

I see the following options for you:

  *  you look at the code (of "requests.Session"),
     determine where "s.encoding" is handled and
     check whether it also supports a replacement strategy there.
     Then, you use this knowledge to set up your replacement.

  *  you avoid the unicode-decoding functionality of
     "requests.Session". If it does not directly support this,
     you can trick it by using the "iso-8859-1" encoding (this maps
     bytes to the first 256 unicode codepoints in a one-to-one way)
     and then do the unicode handling in your own code, with
     facilities you already know of (including replacement).

  *  you contact the website administrator and ask him why
     the delivered pages do not contain valid "gbk" content.
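
A minimal sketch of the second option, under stated assumptions: "raw"
is a hypothetical stand-in for the undecoded body bytes (with "requests"
that would be "r.content"), and "decode_with_replacement" is a helper
name invented for illustration. It shows plain decoding with the
"replace" error handler, and that the "iso-8859-1" detour loses no
bytes.

```python
# Hypothetical stand-in for the raw response body: two valid GBK
# characters, one stray byte (0xff is not a valid GBK lead byte),
# then some ASCII text.
raw = u"\u4e2d\u6587".encode("gbk") + b"\xff" + b"abc"


def decode_with_replacement(data, encoding="gbk"):
    """Decode bytes, substituting U+FFFD for undecodable sequences."""
    return data.decode(encoding, errors="replace")


# Direct route: decode the bytes yourself with a replacement strategy.
text = decode_with_replacement(raw)

# The "iso-8859-1" detour: each byte becomes exactly one code point,
# so encoding back gives the original bytes, which can then be decoded
# with whatever error handling you like.
as_latin1 = raw.decode("iso-8859-1")
recovered = as_latin1.encode("iso-8859-1")
text_via_latin1 = decode_with_replacement(recovered)
```

If the raw bytes are directly available, the detour through
"iso-8859-1" is not needed; it only matters when the library insists
on handing you decoded text.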


