HTTPConncetion - HEAD request

Started by	gervaz <gervaz@gmail.com>
First post	2011-06-16 15:43 -0700
Last post	2011-06-19 16:21 +0300
Articles	7 — 5 participants

  HTTPConncetion - HEAD request gervaz <gervaz@gmail.com> - 2011-06-16 15:43 -0700
    Re: HTTPConncetion - HEAD request Ian Kelly <ian.g.kelly@gmail.com> - 2011-06-16 17:00 -0600
      Re: HTTPConncetion - HEAD request gervaz <gervaz@gmail.com> - 2011-06-17 01:19 -0700
        Re: HTTPConncetion - HEAD request Chris Angelico <rosuav@gmail.com> - 2011-06-17 18:44 +1000
    Re: HTTPConncetion - HEAD request Adam Tauno Williams <awilliam@whitemice.org> - 2011-06-17 06:14 -0400
      Re: HTTPConncetion - HEAD request gervaz <gervaz@gmail.com> - 2011-06-17 10:53 -0700
        Re: HTTPConncetion - HEAD request "Elias Fotinis" <efotinis@yahoo.com> - 2011-06-19 16:21 +0300

#7773 — HTTPConncetion - HEAD request

From	gervaz <gervaz@gmail.com>
Date	2011-06-16 15:43 -0700
Subject	HTTPConncetion - HEAD request
Message-ID	<b06bf827-671c-4555-b3ec-108bc5c3a0b8@m10g2000yqd.googlegroups.com>

Hi all, can someone tell me why the read() function in the following
py3 code returns b''?

>>> h = http.client.HTTPConnection("www.twitter.com")
>>> h.connect()
>>> h.request("HEAD", "/", "HTTP 1.0")
>>> r = h.getresponse()
>>> r.read()
b''

Thanks,

Mattia

#7774

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2011-06-16 17:00 -0600
Message-ID	<mailman.44.1308265289.1164.python-list@python.org>
In reply to	#7773

On Thu, Jun 16, 2011 at 4:43 PM, gervaz <gervaz@gmail.com> wrote:
> Hi all, can someone tell me why the read() function in the following
> py3 code returns b''?
>
> >>> h = http.client.HTTPConnection("www.twitter.com")
> >>> h.connect()
> >>> h.request("HEAD", "/", "HTTP 1.0")
> >>> r = h.getresponse()
> >>> r.read()
> b''

You mean why does it return an empty byte sequence?  Because the HEAD
method only requests the response headers, not the body, so the body
is empty.  If you want to see the response body, use GET.
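
A minimal sketch of that difference, run against a throwaway local
server so it needs no network access. The names DemoHandler, PAGE and
fetch() are made up for the illustration and do not come from the thread.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>hello</body></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny stand-in server (hypothetical, for illustration only)."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_HEAD(self):
        self._send_headers()        # headers only, no body

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)      # headers followed by the body

    def log_message(self, *args):   # keep the demo quiet
        pass


def fetch(method, port):
    """Return (status, Content-Length header, body bytes) for one request."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=10)
    try:
        conn.request(method, "/")
        resp = conn.getresponse()
        return resp.status, resp.getheader("Content-Length"), resp.read()
    finally:
        conn.close()


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]

head_result = fetch("HEAD", port)
get_result = fetch("GET", port)
server.shutdown()
server.server_close()

print(head_result)  # body part is b'' even though Content-Length is non-zero
print(get_result)   # body part is the page itself
```

With HEAD, the Content-Length header still reports the size the page
would have, but read() has nothing to return.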

Cheers,
Ian

#7812

From	gervaz <gervaz@gmail.com>
Date	2011-06-17 01:19 -0700
Message-ID	<173e15bf-5fbd-484c-8be1-d4f0a6d155fc@u26g2000vby.googlegroups.com>
In reply to	#7774

On 17 Jun, 01:00, Ian Kelly <ian.g.ke...@gmail.com> wrote:
> On Thu, Jun 16, 2011 at 4:43 PM, gervaz <ger...@gmail.com> wrote:
> > Hi all, can someone tell me why the read() function in the following
> > py3 code returns b''?
>
> > >>> h = http.client.HTTPConnection("www.twitter.com")
> > >>> h.connect()
> > >>> h.request("HEAD", "/", "HTTP 1.0")
> > >>> r = h.getresponse()
> > >>> r.read()
> > b''
>
> You mean why does it return an empty byte sequence?  Because the HEAD
> method only requests the response headers, not the body, so the body
> is empty.  If you want to see the response body, use GET.
>
> Cheers,
> Ian

The thing is, I have a list of URLs and I want to retrieve the
minimum information necessary to tell whether each link is a
valid HTML page, a picture, or something else. As far as I
understand from http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
the HEAD method is the one that lets you do this. But it doesn't
seem to work.

Any help?

Mattia

#7813

From	Chris Angelico <rosuav@gmail.com>
Date	2011-06-17 18:44 +1000
Message-ID	<mailman.68.1308300280.1164.python-list@python.org>
In reply to	#7812

On Fri, Jun 17, 2011 at 6:19 PM, gervaz <gervaz@gmail.com> wrote:
> The thing is, I have a list of URLs and I want to retrieve the
> minimum information necessary to tell whether each link is a
> valid HTML page, a picture, or something else. As far as I
> understand from http://www.w3.org/Protocols/rfc2616/rfc2616-sec9.html
> the HEAD method is the one that lets you do this. But it doesn't
> seem to work.

It's not working because of a few issues.

Twitter doesn't accept requests that come without a Host: header, so
you'll need to provide that. Also, your "HTTP 1.0" is going as the
body of the request, which is quite unnecessary. What you were getting
was a 301 redirect, as you can confirm thus:

>>> r.getcode()
301
>>> r.getheaders()
[('Date', 'Fri, 17 Jun 2011 08:31:31 GMT'), ('Server', 'Apache'),
('Location', 'http://twitter.com/'), ('Cache-Control', 'max-age=300'),
('Expires', 'Fri, 17 Jun 2011 08:36:31 GMT'), ('Vary',
'Accept-Encoding'), ('Connection', 'close'), ('Content-Type',
'text/html; charset=iso-8859-1')]

(Note the Location header - the server's asking you to go to
twitter.com by name.)

>>> h.request("HEAD","/",None,{"Host":"twitter.com"})
>>> r = h.getresponse()

Now we have a request that the server's prepared to answer:

>>> r.getcode()
200

The headers are numerous, so I won't quote them here, but you get a
Content-Length which tells you the size of the page that you would
get, plus a few others that may be of interest. But note that there's
still no body on a HEAD request:

>>> r.read()
b''

If you want to check validity, the most important part is the code:

>>> h.request("HEAD","/aasdfadefa",None,{"Host":"twitter.com"})
>>> r=h.getresponse()
>>> r.getcode()
404

Twitter might be a bad example for this, though, as the above call
will succeed if there is a user of that name (for instance, replacing
"/aasdfadefa" with "/rosuav" changes the response to a 200). You also
have to contend with the possibility that the server won't allow HEAD
requests at all, in which case just fall back on GET.

But all this isn't certain, even so. There are some misconfigured
servers that actually send a 200 response when a page doesn't exist.
But you can probably ignore those sorts of hassles, and just code to
the standard.
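
A rough sketch of that approach: try HEAD first, and retry with GET
only when the server refuses HEAD. The probe() helper and the
GetOnlyHandler demo server are hypothetical names, and treating 405
and 501 as "HEAD not supported" is an assumption about how such
servers usually answer, not something the thread establishes.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlsplit


def probe(url, timeout=10):
    """Return (status, content_type) for url, preferring HEAD over GET."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    for method in ("HEAD", "GET"):
        # netloc may carry "host:port"; URLs with user:password@ are not handled
        conn = conn_cls(parts.netloc, timeout=timeout)
        try:
            conn.request(method, path)
            resp = conn.getresponse()
            status, ctype = resp.status, resp.getheader("Content-Type")
        finally:
            conn.close()
        # 405 Method Not Allowed / 501 Not Implemented: retry once with GET
        if method == "HEAD" and status in (405, 501):
            continue
        return status, ctype


class GetOnlyHandler(BaseHTTPRequestHandler):
    """Demo server with no do_HEAD, so http.server answers HEAD with 501."""

    def do_GET(self):
        body = b"\x89PNG fake image bytes"
        self.send_response(200)
        self.send_header("Content-Type", "image/png")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep the demo quiet
        pass


server = HTTPServer(("127.0.0.1", 0), GetOnlyHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
result = probe("http://127.0.0.1:%d/logo.png" % server.server_address[1])
server.shutdown()
server.server_close()
print(result)
```

The Content-Type value is what distinguishes an HTML page from a
picture; the status code says whether the link resolved at all.
Redirects (3xx) are returned as-is here rather than followed.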

Hope that helps!

Chris Angelico

#7815

From	Adam Tauno Williams <awilliam@whitemice.org>
Date	2011-06-17 06:14 -0400
Message-ID	<mailman.69.1308304747.1164.python-list@python.org>
In reply to	#7773

On Thu, 2011-06-16 at 15:43 -0700, gervaz wrote:
> Hi all, can someone tell me why the read() function in the following
> py3 code returns b''
> >>> h = http.client.HTTPConnection("www.twitter.com")
> >>> h.connect()
> >>> h.request("HEAD", "/", "HTTP 1.0")
> >>> r = h.getresponse()
> >>> r.read()
> b''

Because there is no body in the response to a HEAD request.  What is
useful are the Content-Type, Content-Length, and ETag headers.

Is r.getcode() == 200?  That indicates a successful response; you
must *always* check the response code before interpreting the response.

Also I'm pretty sure that "HTTP 1.0" is wrong.
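
One way to see why: the third positional argument of
HTTPConnection.request() is the request body, not the protocol
version. A quick check with the standard inspect module:

```python
import http.client
import inspect

# Parameter names of request(); index 0 is self.
params = list(inspect.signature(http.client.HTTPConnection.request).parameters)
print(params[:5])

# So h.request("HEAD", "/", "HTTP 1.0") sends "HTTP 1.0" as the body.
third_positional = params[3]
print(third_positional)
```

The HTTP version is not something request() takes; http.client
writes it into the request line itself.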

#7841

From	gervaz <gervaz@gmail.com>
Date	2011-06-17 10:53 -0700
Message-ID	<38df1fae-a7bd-43c3-8b3d-b8d685af4b9f@fp11g2000vbb.googlegroups.com>
In reply to	#7815

On 17 Jun, 12:14, Adam Tauno Williams <awill...@whitemice.org> wrote:
> On Thu, 2011-06-16 at 15:43 -0700, gervaz wrote:
> > Hi all, can someone tell me why the read() function in the following
> > py3 code returns b''
> > >>> h = http.client.HTTPConnection("www.twitter.com")
> > >>> h.connect()
> > >>> h.request("HEAD", "/", "HTTP 1.0")
> > >>> r = h.getresponse()
> > >>> r.read()
> > b''
>
> Because there is no body in the response to a HEAD request.  What is
> useful are the Content-Type, Content-Length, and ETag headers.
>
> Is r.getcode() == 200?  That indicates a successful response; you
> must *always* check the response code before interpreting the response.
>
> Also I'm pretty sure that "HTTP 1.0" is wrong.

Ok, thanks for the replies, just another question in order to have a
similar behaviour using a different approach...
I decided to implement this solution:

class HeadRequest(urllib.request.Request):
    def get_method(self):
        return "HEAD"

Now I download the url using:

r = HeadRequest(url, None, self.headers)
c = urllib.request.urlopen(r)

but I don't know how to retrieve the request status (e.g. 200) as in
the previous examples with a different implementation...

Any suggestion?

Thanks,

Mattia

#7952

From	"Elias Fotinis" <efotinis@yahoo.com>
Date	2011-06-19 16:21 +0300
Message-ID	<mailman.145.1308489706.1164.python-list@python.org>
In reply to	#7841

On Fri, 17 Jun 2011 20:53:39 +0300, gervaz <gervaz@gmail.com> wrote:

> I decided to implement this solution:
>
> class HeadRequest(urllib.request.Request):
>     def get_method(self):
>         return "HEAD"
>
> Now I download the url using:
>
> r = HeadRequest(url, None, self.headers)
> c = urllib.request.urlopen(r)
>
> but I don't know how to retrieve the request status (e.g. 200) as in
> the previous examples with a different implementation...

Use c.getcode() to get the response code. When you're testing
interactively, you might find printing the headers with
"print(c.headers)" quite handy.

Don't forget to close the response (c.close()) when your script exits its experimental state.
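
A sketch of the same idea on newer Pythons, where Request accepts
method="HEAD" directly (Python 3.3 and later), so no subclass is
needed. Note that urllib raises HTTPError for 4xx/5xx responses, so
the status has to be read from the exception in that case. The
head_status() helper and DemoHandler server are hypothetical, and the
empty ProxyHandler is there only so the local demo ignores any proxy
configured in the environment.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in server: 200 for "/", 404 for everything else."""

    def do_HEAD(self):
        self.send_response(200 if self.path == "/" else 404)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, *args):   # keep the demo quiet
        pass


# Opener with proxies disabled (demo only); urlopen() works the same way.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def head_status(url, timeout=10):
    """Return (status, content_type) using a HEAD request."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        # The with block closes the response, replacing an explicit c.close().
        with opener.open(req, timeout=timeout) as c:
            return c.status, c.headers.get("Content-Type")
    except urllib.error.HTTPError as e:
        # 4xx/5xx arrive as exceptions; the code is still available.
        return e.code, e.headers.get("Content-Type")


server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_address[1]

ok = head_status(base + "/")
missing = head_status(base + "/no-such-page")
server.shutdown()
server.server_close()
print(ok, missing)
```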
