| Header | Value |
|---|---|
| Newsgroups | comp.lang.python |
| Subject | Re: how to detect the character encoding in a web page ? |
| From | iMath <redstone-cold@163.com> |
| Date | Wed, 5 Jun 2013 07:46:04 -0700 (PDT) |
| Message-ID | <c8f8e97b-d866-4342-8dab-01815f20aa75@googlegroups.com> |
| In-Reply-To | <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> |
| References | <c15bad9a-a7f7-456e-8dc5-b1af67fbdd44@googlegroups.com> |
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> how to detect the character encoding in a web page ?
>
> such as this page
>
> http://python.org/
I found that PyQt's QTextStream can detect the character encoding of a web page very accurately.
It works even for this bad page:
chardet and Beautiful Soup failed on it, but QTextStream got the right result.
Here is my code:
import sys

from PyQt4.QtCore import QCoreApplication, QTextStream, QUrl
from PyQt4.QtNetwork import QNetworkAccessManager, QNetworkReply, QNetworkRequest


def slotSourceDownloaded(reply):
    # Print the redirect target if there is one, otherwise the requested URL.
    redirectLocation = reply.header(QNetworkRequest.LocationHeader)
    redirectLocationUrl = reply.url() if not redirectLocation else redirectLocation
    print(redirectLocationUrl)
    if reply.error() != QNetworkReply.NoError:
        print('error:', reply.errorString())
    else:
        # QTextStream decodes the reply body to text.
        content = QTextStream(reply).readAll()
        if content == '':
            print('cannot find any resource!')
        else:
            print(content)
    # Clean up and leave the event loop on every path, not only on success.
    reply.deleteLater()
    QCoreApplication.quit()


if __name__ == '__main__':
    app = QCoreApplication(sys.argv)
    manager = QNetworkAccessManager()
    url = input('input url :')
    request = QNetworkRequest(QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
    # Raw header names and values are passed as byte strings.
    request.setRawHeader(b'User-Agent', b'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
    # Connect the handler before starting the request.
    manager.finished.connect(slotSourceDownloaded)
    manager.get(request)
    sys.exit(app.exec_())
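
For comparison, the encoding a page declares about itself can also be read without Qt. What follows is a minimal sketch using only the standard library, not the method used in the code above. The function name `sniff_encoding` and its parameters are hypothetical, and the regular expression is a simplification that assumes the charset declaration appears in the first 4096 bytes. It checks for a byte order mark, then the HTTP Content-Type header value if one is supplied, then a `charset=` declaration in the markup, and returns None when nothing is declared.

```python
import codecs
import re

# Byte order marks, UTF-32 first because its little-endian BOM starts
# with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

# Simplified pattern for "charset=..." in a header value or a <meta> tag.
_CHARSET_RE = re.compile(rb'charset\s*=\s*["\']?\s*([A-Za-z0-9._:-]+)', re.I)


def sniff_encoding(raw, content_type=None):
    """Return the declared encoding name in lower case, or None.

    raw          -- the undecoded page body (bytes)
    content_type -- the Content-Type header value, if known (str)
    """
    # 1. A byte order mark is the strongest signal.
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # 2. The charset parameter of the HTTP Content-Type header.
    if content_type:
        match = _CHARSET_RE.search(content_type.encode('ascii', 'ignore'))
        if match:
            return match.group(1).decode('ascii').lower()
    # 3. A charset declaration near the top of the markup.
    match = _CHARSET_RE.search(raw[:4096])
    if match:
        return match.group(1).decode('ascii').lower()
    # Nothing declared; a statistical detector would be the next step.
    return None
```

A call such as `sniff_encoding(body, 'text/html; charset=GB2312')` returns `'gb2312'`. A page that declares the wrong encoding, or none at all, still needs a detector that inspects the bytes themselves.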