


RE: how to detect the character encoding in a web page ?

From Carlos Nepomuceno <carlosnepomuceno@outlook.com>
To "python-list@python.org" <python-list@python.org>
Subject RE: how to detect the character encoding in a web page ?
Date Mon, 10 Jun 2013 00:35:55 +0300
Newsgroups comp.lang.python
Message-ID <mailman.2941.1370813757.3114.python-list@python.org> (permalink)


Try this:

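### get_charset.py ###
import re
import urllib2

def get_charset(url):
    resp = urllib2.urlopen(url)

    # Retrieve the charset from the HTTP headers.
    # The pattern stops at the end of the charset name, so the trailing
    # '\r' of the header line is not captured.
    headers = ''.join(resp.headers.headers)
    charset_from_header_list = re.findall(r'charset=["\']?([\w.:-]+)', headers, re.I)
    charset_from_header = charset_from_header_list[-1] if charset_from_header_list else ''

    # Retrieve the charset from a <meta> declaration in the HTML.
    # The non-greedy name pattern avoids swallowing the rest of the line.
    html = resp.read()
    charset_from_html_list = re.findall(r'<meta[^>]*charset=["\']?([\w.:-]+)', html, re.I)
    charset_from_html = charset_from_html_list[-1] if charset_from_html_list else ''

    # Prefer the declaration in the HTML; fall back to the HTTP header.
    return charset_from_html if charset_from_html else charset_from_header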
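
For anyone on Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. It is only an illustration under assumptions: the helper names charset_from_header and charset_from_html are made up here, the 4096-byte scan window is an arbitrary choice, and the regular expressions are a rough approximation rather than a real HTML parser.

```python
import re
import urllib.request

# Matches a charset declaration inside a <meta> tag of raw (undecoded) HTML.
_META_RE = re.compile(rb'<meta[^>]*?charset\s*=\s*["\']?([\w.:-]+)', re.IGNORECASE)


def charset_from_header(content_type):
    """Return the charset parameter of a Content-Type header value, or ''."""
    match = re.search(r'charset\s*=\s*["\']?([\w.:-]+)', content_type or '', re.IGNORECASE)
    return match.group(1).lower() if match else ''


def charset_from_html(body):
    """Return the charset declared in a <meta> tag of raw HTML bytes, or ''."""
    # Only scan the start of the document, where the declaration normally lives.
    match = _META_RE.search(body[:4096])
    return match.group(1).decode('ascii').lower() if match else ''


def get_charset(url):
    """Fetch url; prefer the HTML declaration, else fall back to the HTTP header."""
    with urllib.request.urlopen(url) as resp:
        header = resp.headers.get('Content-Type', '')
        body = resp.read()
    return charset_from_html(body) or charset_from_header(header)
```

Splitting the parsing from the download keeps the two lookups usable on data you already have, for example charset_from_header('text/html; charset=GB2312') or charset_from_html(page_bytes).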




> Date: Sun, 9 Jun 2013 04:47:02 -0700
> Subject: Re: how to detect the character encoding  in a web page ?
> From: redstone-cold@163.com
> To: python-list@python.org
> 
> On Monday, December 24, 2012 at 8:34:47 AM UTC+8, iMath wrote:
> > How to detect the character encoding in a web page?
> > 
> > such as this page 
> > 
> > 
> > 
> > http://python.org/
> 
> Finally, I found that by using PyQt's QTextStream, QTextCodec and chardet, we can decode a web page's source more reliably,
> even for this bad page
> http://www.qnwz.cn/html/yinlegushihui/magazine/2013/0524/425731.html 
> 
> this script 
> http://www.flvxz.com/getFlv.php?url=aHR0cDojI3d3dy41Ni5jb20vdTk1L3ZfT1RFM05UYzBNakEuaHRtbA==
> 
> and this page, which has no charset declared in its source code
> http://msdn.microsoft.com/en-us/library/bb802962(v=office.12).aspx
> 
> 
> from PyQt4.QtCore import *
> from PyQt4.QtGui import *
> from PyQt4.QtNetwork  import *
> import sys
> import chardet
> 
> def slotSourceDownloaded(reply):
>     redirctLocation=reply.header(QNetworkRequest.LocationHeader)
>     redirctLocationUrl=reply.url() if not redirctLocation else redirctLocation
>     #print(redirctLocationUrl,reply.header(QNetworkRequest.ContentTypeHeader))
> 
>     if (reply.error()!= QNetworkReply.NoError):
>         print('11111111', reply.errorString())
>         return
> 
>     pageCode=reply.readAll()
>     charCodecInfo=chardet.detect(pageCode.data())
> 
>     textStream=QTextStream(pageCode)
>     codec=QTextCodec.codecForHtml(pageCode,QTextCodec.codecForName(charCodecInfo['encoding'] ))
>     textStream.setCodec(codec)
>     content=textStream.readAll()
>     print(content)
> 
>     if content=='':
>         print('---------', 'cannot find any resource !')
>         return
> 
>     reply.deleteLater()
>     qApp.quit()
> 
> 
> if __name__ == '__main__':
>     app =QCoreApplication(sys.argv)
>     manager=QNetworkAccessManager ()
>     url =input('input url :')
>     request=QNetworkRequest (QUrl.fromEncoded(QUrl.fromUserInput(url).toEncoded()))
>     request.setRawHeader("User-Agent" ,'Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17 SE 2.X MetaSr 1.0')
>     # Connect the handler before sending the request, then start the event loop.
>     manager.finished.connect(slotSourceDownloaded)
>     manager.get(request)
>     sys.exit(app.exec_())
> -- 
> http://mail.python.org/mailman/listinfo/python-list
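
If installing PyQt and chardet is not an option, a much cruder fallback can be built from the standard library alone. The sketch below is not what chardet does (chardet guesses statistically); it simply checks for a byte-order mark, then tries the declared charset, then a short list of candidates. The function name decode_page and the default candidate list are assumptions chosen for illustration and would need adjusting for the pages you actually fetch.

```python
import codecs

# A BOM, when present, is the strongest signal, so it is checked first.
_BOMS = (
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
)


def decode_page(body, declared='', candidates=('utf-8', 'gb18030')):
    """Decode raw page bytes and return (text, encoding_used)."""
    for bom, name in _BOMS:
        if body.startswith(bom):
            return body.decode(name), name
    # Try the declared charset first, then the (hypothetical) candidate list.
    for name in [declared, *candidates]:
        if not name:
            continue
        try:
            return body.decode(name), name
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes that do not fit: try the next one.
            continue
    # latin-1 maps every byte to a character, so this last resort cannot fail.
    return body.decode('latin-1'), 'latin-1'
```

The declared argument is meant to take whatever get_charset returned; a wrong or misspelled declaration is skipped rather than raising.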
 		 	   		  



Thread

Re: how to detect the character encoding  in a web page ? iMath <redstone-cold@163.com> - 2013-06-09 04:47 -0700
  RE: how to detect the character encoding  in a web page ? Carlos Nepomuceno <carlosnepomuceno@outlook.com> - 2013-06-10 00:35 +0300
