Groups > comp.lang.python > #93442 > unrolled thread
| Started by | Veek M <vek.m1234@gmail.com> |
|---|---|
| First post | 2015-07-02 21:52 +0530 |
| Last post | 2015-07-09 15:55 +0530 |
| Articles | 5 — 2 participants |
requests.Session() how do you set 'replace' on the encoding? Veek M <vek.m1234@gmail.com> - 2015-07-02 21:52 +0530
Re: requests.Session() how do you set 'replace' on the encoding? dieter <dieter@handshake.de> - 2015-07-03 07:59 +0200
Re: requests.Session() how do you set 'replace' on the encoding? Veek M <vek.m1234@gmail.com> - 2015-07-06 15:06 +0530
Re: requests.Session() how do you set 'replace' on the encoding? dieter <dieter@handshake.de> - 2015-07-07 07:38 +0200
Re: requests.Session() how do you set 'replace' on the encoding? Veek M <vek.m1234@gmail.com> - 2015-07-09 15:55 +0530
| From | Veek M <vek.m1234@gmail.com> |
|---|---|
| Date | 2015-07-02 21:52 +0530 |
| Subject | requests.Session() how do you set 'replace' on the encoding? |
| Message-ID | <mn3oec$7ep$1@dont-email.me> |
I'm getting a Unicode error:
Traceback (most recent call last):
  File "fooxxx.py", line 56, in <module>
    parent = anchor.getparent()
UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position 8: illegal multibyte sequence
I'm doing:
s = requests.Session()
to suck data in, so.. how do i 'replace' chars that fit gbk
| From | dieter <dieter@handshake.de> |
|---|---|
| Date | 2015-07-03 07:59 +0200 |
| Message-ID | <mailman.266.1435903179.3674.python-list@python.org> |
| In reply to | #93442 |
Veek M <vek.m1234@gmail.com> writes:
> I'm getting a Unicode error:
>
> Traceback (most recent call last):
> File "fooxxx.py", line 56, in <module>
> parent = anchor.getparent()
> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in position
> 8: illegal multibyte sequence
You give us very little context.
Using "getparent" seems to indicate that you are doing something
with hierarchies, likely some XML processing. In this case, the XML
document likely specified "gbk" as document encoding (otherwise, you
would get the default "utf-8") -- and it got it wrong (which should
not happen).
In general: when you need control over encoding handling because
deep in a framework an encoding causes problems (as apparently in
your case), you can usually first take the plain text,
fix any encoding problems and only then pass the fixed text to
your framework.
> I'm doing:
> s = requests.Session()
> to suck data in, so.. how do i 'replace' chars that fit gbk
It does not seem that the problem occurs inside the "requests" module.
Thus, you have a chance to "intercept" the downloaded text
and fix encoding problems.
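One way to read the "intercept the downloaded text" suggestion is to take the undecoded bytes and decode them yourself with an explicit error handler. A minimal sketch for Python 3, assuming the page really is meant to be GBK; `decode_body` is a hypothetical helper name, and with requests the bytes would come from `r.content` rather than `r.text`:

```python
def decode_body(raw, encoding="gbk", errors="replace"):
    """Decode downloaded bytes, substituting U+FFFD for undecodable input.

    Hypothetical helper, not part of requests. With requests, `raw` would
    be the undecoded body, e.g. `decode_body(s.get(url).content)`.
    """
    return raw.decode(encoding, errors)


# Valid GBK survives unchanged ("\u4f60\u597d" is a two-character greeting).
good = "\u4f60\u597d".encode("gbk")
print(ascii(decode_body(good)))

# Stray 0xFF bytes are not valid GBK; they become U+FFFD instead of raising.
print(ascii(decode_body(b"ab\xff\xffcd")))
```

Passing `errors="strict"` instead would raise UnicodeDecodeError on the junk bytes, which can be useful for finding out whether the server's data is actually the problem.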
| From | Veek M <vek.m1234@gmail.com> |
|---|---|
| Date | 2015-07-06 15:06 +0530 |
| Message-ID | <mndi4e$kd6$1@dont-email.me> |
| In reply to | #93461 |
dieter wrote:
> Veek M <vek.m1234@gmail.com> writes:
>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>> position 8: illegal multibyte sequence
>
> You give us very little context.
It's a longish chunk of code: basically, i'm trying to download using the
'requests.Session' module and that should give me Unicode once it's told
what encoding is being used 'gbk'.
def get_page(s, url):
    print(url)
    r = s.get(url, headers = {
        'User-Agent' : user_agent,
        'Keep-Alive' : '3600',
        'Connection' : 'keep-alive',
    })
    s.encoding='gbk'
    text = r.text
    return text
# Open output file
fh=codecs.open('/tmp/out', 'wb')
fh.write(header)
# Download
s = requests.Session()
------------
If 'text' is NOT proper unicode because the server introduced some junk,
then when i do anchor.getparent() on my 'text' it'll traceback..
ergo the question, how do i set a replacement char within 'requests'
> In general: when you need control over encoding handling because
> deep in a framework an encoding causes problems (as apparently in
> your case), you can usually first take the plain text,
> fix any encoding problems and only then pass the fixed text to
> your framework.
>
>> I'm doing:
>> s = requests.Session()
>> to suck data in, so.. how do i 'replace' chars that fit gbk
>
> It does not seem that the problem occurs inside the "requests" module.
> Thus, you have a chance to "intercept" the downloaded text
> and fix encoding problems.
Okay, so i should use the 'raw' method in requests and then clean up the
raw-text and then convert that to unicode.. vs trying to do it using
'requests'? The thing is 'codecs' has a xmlcharrefreplace_errors(...) etc so
i figured if output has clean up, input ought to have it :p
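The traceback in this thread is a UnicodeEncodeError, which Python raises when turning text into bytes, so the output-side handlers mentioned here may be the ones that matter. A small Python 3 sketch of how the standard error handlers treat U+00A0 (the character from the traceback) when encoding to GBK; `encode_for_gbk` is a hypothetical name used only for illustration:

```python
def encode_for_gbk(text, errors="strict"):
    """Encode text as GBK using one of the standard codec error handlers."""
    return text.encode("gbk", errors)


sample = "a\xa0b"  # U+00A0 (no-break space), the character in the traceback

try:
    encode_for_gbk(sample)  # "strict" is the default and raises
except UnicodeEncodeError as exc:
    print("strict:", exc.reason)

print(encode_for_gbk(sample, "replace"))            # b'a?b'
print(encode_for_gbk(sample, "xmlcharrefreplace"))  # b'a&#160;b'
print(encode_for_gbk(sample, "ignore"))             # b'ab'
```

The same handler names can be given when opening the output file, for example `open(path, "w", encoding="gbk", errors="xmlcharrefreplace")`, so that the replacement happens at write time.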
| From | dieter <dieter@handshake.de> |
|---|---|
| Date | 2015-07-07 07:38 +0200 |
| Message-ID | <mailman.337.1436247510.3674.python-list@python.org> |
| In reply to | #93524 |
Veek M <vek.m1234@gmail.com> writes:
> dieter wrote:
>
>> Veek M <vek.m1234@gmail.com> writes:
>>> UnicodeEncodeError: 'gbk' codec can't encode character u'\xa0' in
>>> position 8: illegal multibyte sequence
>>
>> You give us very little context.
>
> It's a longish chunk of code: basically, i'm trying to download using the
> 'requests.Session' module and that should give me Unicode once it's told
> what encoding is being used 'gbk'.
>
> def get_page(s, url):
>     print(url)
>     r = s.get(url, headers = {
>         'User-Agent' : user_agent,
>         'Keep-Alive' : '3600',
>         'Connection' : 'keep-alive',
>     })
>     s.encoding='gbk'
It looks strange that you can set "s.encoding" after you have
called "s.get" - but, as you apparently get an error related to
the "gbk" encoding, it seems to work.
>     text = r.text
>     return text
>
> # Open output file
> fh=codecs.open('/tmp/out', 'wb')
> fh.write(header)
>
> # Download
> s = requests.Session()
> ------------
>
> If 'text' is NOT proper unicode because the server introduced some junk,
> then when i do anchor.getparent() on my 'text' it'll traceback..
> ergo the question, how do i set a replacement char within 'requests'
I see the following options for you:
* you look at the code (of "requests.Session"),
  determine where "s.encoding" is taken care of and
  check whether it also supports a replacement strategy there.
  Then, you use this knowledge to set up your replacement.
* you avoid the "unicode" translating functionality of
  "requests.Session". If it does not directly support this,
  you can trick it using the "iso-8859-1" encoding (this maps
  bytes to the first 256 unicode codepoints in a one-to-one way)
  and then do the unicode handling in your own code -- with
  facilities you already know of (including replacement)
* you contact the website administrator and ask him why
  the delivered pages do not contain valid "gbk" content.
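The second option relies on ISO-8859-1 being a lossless byte-to-codepoint mapping. A minimal Python 3 sketch of that round trip, with the final decode done in your own code; `recover_bytes` and `decode_gbk_lenient` are hypothetical helper names, and the requests usage shown in the comment is an assumption that the response's `encoding` attribute is set before `text` is read:

```python
def recover_bytes(latin1_text):
    """Turn text that was decoded as ISO-8859-1 back into the original bytes."""
    return latin1_text.encode("iso-8859-1")


def decode_gbk_lenient(latin1_text, errors="replace"):
    """Re-decode ISO-8859-1 'text' as GBK with an explicit error handler."""
    return recover_bytes(latin1_text).decode("gbk", errors)


# ISO-8859-1 maps every byte value to the code point of the same number,
# so decoding and re-encoding is lossless for arbitrary bytes.
all_bytes = bytes(range(256))
assert recover_bytes(all_bytes.decode("iso-8859-1")) == all_bytes

# With requests this might look like (assumption, not exercised here):
#     r = s.get(url)
#     r.encoding = "iso-8859-1"
#     text = decode_gbk_lenient(r.text)
page = "\u4f60\u597d".encode("gbk") + b"\xff\xff"  # valid GBK plus junk bytes
print(ascii(decode_gbk_lenient(page.decode("iso-8859-1"))))
```

If the raw bytes are available directly, the ISO-8859-1 detour is unnecessary; it is only a workaround for a layer that insists on handing back decoded text.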
| From | Veek M <vek.m1234@gmail.com> |
|---|---|
| Date | 2015-07-09 15:55 +0530 |
| Message-ID | <mnli4k$o08$1@dont-email.me> |
| In reply to | #93548 |
dieter wrote:
>
> It looks strange that you can set "s.encoding" after you have
> called "s.get" - but, as you apparently get an error related to
> the "gbk" encoding, it seems to work.
Ooo! Sorry, typo - that was outside the function but before the call.
Unfortunately whilst improving my function for Usenet, i screwed it up even
further. (just ignore that, as you have)
to the rest of your answer - many thanks, will do.