


Re: how to detect the character encoding in a web page ?

Started by: iMath <redstone-cold@163.com>
First post: 2013-06-09 04:47 -0700
Last post: 2013-06-10 00:35 +0300
Articles 2 — 2 participants


This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-06-09 04:47 -0700
    RE: how to detect the character encoding  in a web page ? Carlos Nepomuceno <carlosnepomuceno@outlook.com> - 2013-06-10 00:35 +0300

#47449 — Re: how to detect the character encoding in a web page ?

From: iMath <redstone-cold@163.com>
Date: 2013-06-09 04:47 -0700
Subject: Re: how to detect the character encoding in a web page ?
Message-ID: <9ad413cd-bdfe-4b24-b4aa-468292a75963@googlegroups.com>
On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> How can I detect the character encoding of a web page?
>
> For example, this page:
>
> http://python.org/

I finally found that by combining PyQt's QTextStream and QTextCodec with chardet, we can decode a web page's source more reliably,
even for this bad page:
http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html

this script:
http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==

and this page, which has no charset declaration in its source code:
http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx


from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtNetwork  import *
import sys
import chardet

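def slotSourceDownloaded(reply):
    # Redirect target, if any (only used by the debug print below)
    redirectLocation=reply.header(QNetworkRequest.LocationHeader)
    redirectLocationUrl=reply.url() if not redirectLocation else redirectLocation
    #print(redirectLocationUrl,reply.header(QNetworkRequest.ContentTypeHeader))

    if (reply.error()!= QNetworkReply.NoError):
        print('11111111', reply.errorString())
        reply.deleteLater()
        QCoreApplication.quit()   # quit here too, or the event loop never ends
        return

    pageCode=reply.readAll()
    # chardet's guess is only a fallback; 'encoding' can be None when it cannot decide
    charCodecInfo=chardet.detect(pageCode.data())
    fallbackName=charCodecInfo['encoding'] or 'utf-8'

    # codecForHtml checks the BOM and the <meta> charset first, then uses the fallback codec
    textStream=QTextStream(pageCode)
    codec=QTextCodec.codecForHtml(pageCode,QTextCodec.codecForName(fallbackName))
    textStream.setCodec(codec)
    content=textStream.readAll()
    print(content)

    if content=='':
        print('---------', 'cannot find any resource !')

    reply.deleteLater()
    QCoreApplication.quit()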


if __name__ == '__main__':
    app =QCoreApplication(sys.argv)
    manager=QNetworkAccessManager ()
    # connect the handler before issuing the request
    manager.finished.connect(slotSourceDownloaded)
    url =input('input url :')
    request=QNetworkRequest (QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
    request.setRawHeader("User-Agent" ,'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
    manager.get(request)
    sys.exit(app.exec_())
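
The decoding order used in the script above (a BOM or <meta> declaration first, a guessed codec only as the fallback) can also be sketched without Qt. This is a minimal illustration in plain Python 3, not a reimplementation of QTextCodec.codecForHtml: the names pick_encoding and decode_html, the 2048-byte search window, and the regular expression are all assumptions made for the example. The fallback argument is where a chardet guess could be passed in.

```python
import codecs
import re

# Hypothetical helpers for illustration; not part of PyQt or chardet.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
# Matches both <meta charset="..."> and the http-equiv Content-Type form.
_META_RE = re.compile(rb'<meta[^>]*charset=["\']?([\w.:-]+)', re.IGNORECASE)


def pick_encoding(raw, fallback="utf-8"):
    """Return a codec name: BOM first, then <meta> charset, then fallback."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    # Assumption: the declaration sits near the top of the document.
    match = _META_RE.search(raw[:2048])
    if match:
        name = match.group(1).decode("ascii")
        try:
            codecs.lookup(name)  # skip names Python has no codec for
            return name
        except LookupError:
            pass
    return fallback


def decode_html(raw, fallback="utf-8"):
    """Decode page bytes, replacing undecodable bytes instead of raising."""
    return raw.decode(pick_encoding(raw, fallback), errors="replace")
```

With chardet installed, a call such as decode_html(raw, chardet.detect(raw)['encoding'] or 'utf-8') would mirror the fallback used in the PyQt script.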



#47500

From: Carlos Nepomuceno <carlosnepomuceno@outlook.com>
Date: 2013-06-10 00:35 +0300
Message-ID: <mailman.2941.1370813757.3114.python-list@python.org>
In reply to: #47449


Try this:

### get_charset.py ###
import re
import urllib2

def get_charset(url):
    resp = urllib2.urlopen(url)
    # retrieve charset from the HTTP response headers
    # (match only the charset token, so the trailing '\r' is not captured)
    headers = ''.join(resp.headers.headers)
    charset_from_header_list = re.findall(r'charset=["\']?([\w.:-]+)', headers, re.I)
    charset_from_header = charset_from_header_list[-1] if charset_from_header_list else ''

    # retrieve charset from a <meta> tag in the html
    # (covers both the http-equiv Content-Type form and <meta charset=...>)
    html = resp.read()
    charset_from_html_list = re.findall(r'<meta[^>]*charset=["\']?([\w.:-]+)', html, re.I)
    charset_from_html = charset_from_html_list[-1] if charset_from_html_list else ''

    # prefer the charset declared in the html, else use the header value
    return charset_from_html if charset_from_html else charset_from_header
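
For Python 3, where urllib2 no longer exists, the same idea might look like the sketch below. It is an assumption-laden port, not the poster's code: the helper names charset_from_headers and charset_from_html are made up for the example, and the header lookup uses get_content_charset() from the standard email.message API (which the response headers object returned by urllib.request.urlopen inherits) instead of a regular expression.

```python
import re
from email.message import Message
from urllib.request import urlopen

# Matches both <meta charset="..."> and the http-equiv Content-Type form.
META_RE = re.compile(rb'<meta[^>]*charset=["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_headers(headers):
    """Charset parameter of the Content-Type header, or '' if absent."""
    return headers.get_content_charset() or ''


def charset_from_html(html):
    """Last charset declared in a <meta> tag of the raw page bytes, or ''."""
    matches = META_RE.findall(html)
    return matches[-1].decode('ascii').lower() if matches else ''


def get_charset(url):
    """Prefer the charset declared in the HTML, else the header value."""
    with urlopen(url) as resp:
        header_charset = charset_from_headers(resp.headers)
        html_charset = charset_from_html(resp.read())
    return html_charset or header_charset
```

The two parsing helpers take plain headers and bytes, so they can be tried without network access, for example by building an email.message.Message and setting its Content-Type by hand.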




