| Date | 2015-05-24 19:48 +0200 |
|---|---|
| From | Friedrich Rentsch <anthra.norell@bluewin.ch> |
| Subject | Re: Extract email address from Java script in html source using python |
| References | <CAAXuHoeJ-YMQDB85qLDJ_o+9CrsfwLvm9wuOaRtbSj-i9kBaFA@mail.gmail.com> <CAPTjJmqBDu0nB2u_mf7KMpFzxxJFUqB7o-7dJiGgu-xyOL2uzg@mail.gmail.com> <CAAXuHod4NH74+VK1EY78-vCOyiOA3879qp1uPqKOFit7qVE5sQ@mail.gmail.com> |
| Newsgroups | comp.lang.python |
| Message-ID | <mailman.26.1432489776.5151.python-list@python.org> |
On 05/23/2015 04:15 PM, savitha devi wrote:
> What I exactly want is the java script is in the html code. I am trying for
> a regular expression to find the email address embedded with in the java
> script.
>
> On Sat, May 23, 2015 at 2:31 PM, Chris Angelico <rosuav@gmail.com> wrote:
>
>> On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
>>> I am developing a web scraper code using HTMLParser. I need to extract
>>> text/email address from java script with in the HTMLCode.I am beginner level
>>> in python coding and totally lost here. Need some help on this. The java
>>> script code is as below:
>>>
>>> <script type='text/javascript'>
>>> //<!--
>>> document.getElementById('cloak48218').innerHTML = '';
>>> var prefix = 'ma' + 'il' + 'to';
>>> var path = 'hr' + 'ef' + '=';
>>> var addy48218 = 'info' + '@';
>>> addy48218 = addy48218 + 'tsv-neuried' + '.' +
>>> 'de';
>>> document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>> //-->
>> This is deliberately being done to prevent scripted usage. What
>> exactly are you needing to do this for?
>>
>> You're basically going to have to execute the entire block of
>> JavaScript code, and then decode the entities to get to what you want.
>> Doing it manually is pretty easy; doing it automatically will
>> virtually require a language interpreter.
>>
>> ChrisA
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>
This is just about nuts and bolts, not about the ethics of presumed
intentions.
Hope it helps one way or another.
Frederic
-------------------------------------------------------------------------------
sample = '''//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + ''>' + addy48218+'<\/a>';
//-->'''
>>> import SE # Download from PyPI at https://pypi.python.org/pypi/SE
>>> def make_se_translator ():
        # Make SE substitutions
        subs_list = []
        # Make &# code substitutions
        for i in range (256):
            subs_list.append ('&#%d;=%c' % (i, chr(i)))
        # Delete JavaScript stuff
        subs_list.append (' "document.getElementById(\'cloak48218\').=" ')
        subs_list.append (' "var =" "\n=" //<!--= //-->= ')
        # JavaScript syntax? Tweaks needed to get the sample working
        subs_list.append (' "+ \'\'\'=" \'\'>\'=\'>\' <\/=</ ')
        # Add more as needed, trial-and-error style
        # subs_list.append ( . . . format: ' old=new "delete this=" '
        # Make text
        subs = '\n'.join (subs_list)
        # Make SE translator
        translator = SE.SE (subs)
        # return translator, subs # print subs, if you want to see what they look like
        return translator
>>> translator = make_se_translator ()
>>> translation = translator (sample)
>>> print translation # See
innerHTML = ''; prefix = 'ma' + 'il' + 'to'; path = 'hr' + 'ef' + '=';
addy48218 = 'info' + '@'; addy48218 = addy48218 + 'tsv-neuried' + '.'
+'de'; innerHTML += '<a ' + path +prefix + ':' + addy48218 + '>' +
addy48218+'</a>';
>>> exec (translation.lstrip ())
>>> print innerHTML
<a href=mailto:info@tsv-neuried.de>info@tsv-neuried.de</a>
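For anyone without SE installed, here is a rough sketch of the same idea using only the standard library (Python 3). It is not the SE approach shown above: instead of rewriting the script into Python and calling exec, it walks the simple "name = 'literal' + name + ..." assignments with regular expressions and joins the pieces. The names ASSIGNMENT, TOKEN, EMAIL, evaluate and extract_addresses are made up for this illustration, and it assumes the cloaking script only uses single-quoted literals, "+" concatenation and plain "=" assignments to simple variables, as the sample does. Anything fancier would still need a real JavaScript interpreter.

```python
import html
import re

# The JavaScript from the original question, kept verbatim in a raw string.
SAMPLE = r"""//<!--
document.getElementById('cloak48218').innerHTML = '';
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
//-->"""

# "[var] name = expression;" for plain variables only. Property assignments
# such as ".innerHTML = ..." and compound ones such as "+=" are skipped.
ASSIGNMENT = re.compile(
    r"(?<![\w.$])(?:var\s+)?([A-Za-z_$][\w$]*)\s*=(?!=)\s*"
    r"((?:'(?:\\.|[^'\\])*'|[^;'])*);"
)
# A single-quoted string literal, or an identifier.
TOKEN = re.compile(r"'((?:\\.|[^'\\])*)'|([A-Za-z_$][\w$]*)")
# Deliberately loose check for "looks like an email address".
EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[A-Za-z]{2,}")


def evaluate(expr, env):
    """Join the literals and known variables of a '+' concatenation."""
    parts = []
    for literal, name in TOKEN.findall(expr):
        if name:
            # Unknown variables contribute nothing (an assumption of this sketch).
            parts.append(env.get(name, ''))
        else:
            # Simplified unescaping: drop the backslash, keep the next character.
            parts.append(re.sub(r"\\(.)", r"\1", literal))
    # Decode HTML entities such as &#64; after joining.
    return html.unescape(''.join(parts))


def extract_addresses(js_source):
    """Return the variable values that look like email addresses."""
    env = {}
    for name, expr in ASSIGNMENT.findall(js_source):
        env[name] = evaluate(expr, env)
    return sorted({value for value in env.values() if EMAIL.fullmatch(value)})


if __name__ == "__main__":
    print(extract_addresses(SAMPLE))  # ['info@tsv-neuried.de']
```

Because nothing is passed to exec, a hostile page cannot run code on the scraping machine this way. The price is that it only understands the narrow pattern described above.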