


Re: Extract email address from Java script in html source using python

Date 2015-05-24 19:48 +0200
From Friedrich Rentsch <anthra.norell@bluewin.ch>
Subject Re: Extract email address from Java script in html source using python
References <CAAXuHoeJ-YMQDB85qLDJ_o+9CrsfwLvm9wuOaRtbSj-i9kBaFA@mail.gmail.com> <CAPTjJmqBDu0nB2u_mf7KMpFzxxJFUqB7o-7dJiGgu-xyOL2uzg@mail.gmail.com> <CAAXuHod4NH74+VK1EY78-vCOyiOA3879qp1uPqKOFit7qVE5sQ@mail.gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.26.1432489776.5151.python-list@python.org> (permalink)




On 05/23/2015 04:15 PM, savitha devi wrote:
> What I exactly want is the java script is in the html code. I am trying for
> a regular expression to find the email address embedded with in the java
> script.
>
> On Sat, May 23, 2015 at 2:31 PM, Chris Angelico <rosuav@gmail.com> wrote:
>
>> On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
>>> I am developing a web scraper code using HTMLParser. I need to extract
>>> text/email address from java script with in the HTMLCode.I am beginner
>>> level in python coding and totally lost here. Need some help on this. The java
>>> script code is as below:
>>>
>>> <script type='text/javascript'>
>>>   //<!--
>>>   document.getElementById('cloak48218').innerHTML = '';
>>>   var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
>>>   var path = 'hr' + 'ef' + '=';
>>>   var addy48218 = '&#105;nf&#111;' + '&#64;';
>>>   addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
>>> 'd&#101;';
>>>   document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>>   //-->
>> This is deliberately being done to prevent scripted usage. What
>> exactly are you needing to do this for?
>>
>> You're basically going to have to execute the entire block of
>> JavaScript code, and then decode the entities to get to what you want.
>> Doing it manually is pretty easy; doing it automatically will
>> virtually require a language interpreter.
>>
>> ChrisA
>> --
>> https://mail.python.org/mailman/listinfo/python-list
>>

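This is just about the nuts and bolts, not about the ethics of presumed
intentions. The session below uses the SE module to strip the JavaScript
scaffolding and decode the numeric entities, leaving plain assignments
that Python can exec.

Hope it helps one way or the other.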

Frederic


------------------------------------------------------------------------------- 


sample = '''//<!--
  document.getElementById('cloak48218').innerHTML = '';
  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
  var path = 'hr' + 'ef' + '=';
  var addy48218 = '&#105;nf&#111;' + '&#64;';
  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
'd&#101;';
  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
  //-->'''

 >>> import SE  # Download from PyPI at https://pypi.python.org/pypi/SE

 >>> def make_se_translator ():

     # Make SE substitutions
     subs_list = []

     # Make &# code substitutions
     for i in range (256):
         subs_list.append ('&#%d;=%c' % (i, chr(i)))

     # Delete JavaScript scaffolding
     subs_list.append (' "document.getElementById(\'cloak48218\').=" ')
     subs_list.append (' "var =" "\n=" //<!--= //-->= ')

     # JavaScript syntax? Tweaks needed to get the sample working
     subs_list.append (' "+ \'\'\'=" \'\'>\'=\'>\' <\/=</ ')

     # Add more as needed trial and error style
     # subs_list.append ( . . . format: ' old=new "delete this=" '

     # Make text
     subs = '\n'.join (subs_list)

     # Make SE translator
     translator = SE.SE (subs)

     # return translator, subs   # print subs if you want to see what they look like
     return translator


 >>> translator = make_se_translator ()

 >>> translation = translator (sample)

 >>> print translation   # See
  innerHTML = ''; prefix = 'ma' + 'il' + 'to'; path = 'hr' + 'ef' + '='; 
addy48218 = 'info' + '@'; addy48218 = addy48218 + 'tsv-neuried' + '.' 
+'de'; innerHTML += '<a ' + path  +prefix + ':' + addy48218 + '>' + 
addy48218+'</a>';

 >>> exec (translation.lstrip ())

 >>> print innerHTML
<a href=mailto:info@tsv-neuried.de>info@tsv-neuried.de</a>
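
For anyone without SE installed, the same idea can be sketched with the standard library alone. This is only a rough illustration in Python 3 (the session above is Python 2), not a general JavaScript evaluator. The helper name extract_cloaked_address is made up for this example. It assumes the cloaking script keeps the address in a variable whose name starts with "addy", that every statement ends with a semicolon at the end of a line, and that later assignments only append string literals to the variable.

```python
import html
import re

# The cloaking script from the post, kept raw so the backslashes survive.
sample = r"""//<!--
  document.getElementById('cloak48218').innerHTML = '';
  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
  var path = 'hr' + 'ef' + '=';
  var addy48218 = '&#105;nf&#111;' + '&#64;';
  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
'd&#101;';
  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
  //-->"""


def extract_cloaked_address(script):
    """Hypothetical helper: rebuild the address held in the 'addy...' variable.

    Returns None when no such variable is declared in the script.
    """
    # Find the variable name, e.g. addy48218 (naming is an assumption).
    name_match = re.search(r"var\s+(addy\w*)\s*=", script)
    if name_match is None:
        return None
    name = name_match.group(1)

    # Match each "name = ... ;" statement. The entities contain semicolons
    # too, so a statement only ends at a semicolon that closes a line.
    assignment = re.compile(
        r"\b%s\s*=(?!=)(.*?);\s*$" % re.escape(name),
        re.DOTALL | re.MULTILINE,
    )

    pieces = []
    for right_hand_side in assignment.findall(script):
        # Keep only the quoted literals; the variable referring to itself
        # on the right-hand side adds nothing new.
        pieces.extend(re.findall(r"'([^']*)'", right_hand_side))

    # Decode the numeric character references (&#105; and so on).
    return html.unescape("".join(pieces))


print(extract_cloaked_address(sample))
```

Unlike the exec approach, this never runs the translated text as code, but it breaks as soon as the script builds the address differently, so the patterns need the same trial-and-error tuning as the SE substitutions.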
