Re: Extract email address from Java script in html source using python

Started by	Chris Angelico <rosuav@gmail.com>
First post	2015-05-23 19:01 +1000
Last post	2015-05-24 01:17 -0500
Articles	3 — 3 participants

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.

  Re: Extract email address from Java script in html source using python Chris Angelico <rosuav@gmail.com> - 2015-05-23 19:01 +1000
    Re: Extract email address from Java script in html source using python Steve Hayes <hayesstw@telkomsa.net> - 2015-05-24 03:04 +0200
      Re: Extract email address from Java script in html source using python VanguardLH <V@nguard.LH> - 2015-05-24 01:17 -0500

#91107 — Re: Extract email address from Java script in html source using python

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-23 19:01 +1000
Subject	Re: Extract email address from Java script in html source using python
Message-ID	<mailman.267.1432371718.17265.python-list@python.org>

On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
> I am developing a web scraper code using HTMLParser. I need to extract
> text/email address from java script with in the HTMLCode.I am beginner level
> in python coding and totally lost here. Need some help on this. The java
> script code is as below:
>
> <script type='text/javascript'>
>  //<!--
>  document.getElementById('cloak48218').innerHTML = '';
>  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
>  var path = 'hr' + 'ef' + '=';
>  var addy48218 = '&#105;nf&#111;' + '&#64;';
>  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
> 'd&#101;';
>  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>  //-->

This is deliberately being done to prevent scripted usage. What
exactly are you needing to do this for?

You're basically going to have to execute the entire block of
JavaScript code, and then decode the entities to get to what you want.
Doing it manually is pretty easy; doing it automatically will
virtually require a language interpreter.

ChrisA
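
A minimal sketch of the "evaluate, then decode the entities" idea for this
one snippet, assuming the script only ever joins single-quoted string
literals and earlier variables with '+', and that each statement ends with
';' at the end of a line. It is not a JavaScript interpreter and will not
cope with anything more elaborate. The names SCRIPT, STATEMENT, TOKEN and
evaluate_assignments are made up for illustration; html.unescape and re are
from the Python 3 standard library.

```python
import html
import re

# The cloaking statements from the quoted post, pasted in as a raw string.
SCRIPT = r"""
 document.getElementById('cloak48218').innerHTML = '';
 var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
 var path = 'hr' + 'ef' + '=';
 var addy48218 = '&#105;nf&#111;' + '&#64;';
 addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
 'd&#101;';
 document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
 prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
"""

# "name = expr;" at the start of a line, with an optional leading "var".
# The terminating ';' must sit at the end of a line, so the ';' that
# closes an entity such as '&#64;' does not cut the statement short.
STATEMENT = re.compile(r"^\s*(?:var\s+)?(\w+)\s*=\s*(.*?);\s*$", re.M | re.S)

# One operand of a '+' chain: a single-quoted literal or a variable name.
TOKEN = re.compile(r"'([^']*)'|(\w+)")


def evaluate_assignments(script):
    """Replay simple string-concatenation assignments, in source order."""
    variables = {}
    for name, expr in STATEMENT.findall(script):
        value = ""
        for literal, ident in TOKEN.findall(expr):
            # An identifier refers to a variable assigned earlier.
            value += variables.get(ident, "") if ident else literal
        variables[name] = value
    return variables


variables = evaluate_assignments(SCRIPT)
# The values still hold numeric character references; decode them last.
prefix = html.unescape(variables["prefix"])
address = html.unescape(variables["addy48218"])
print(prefix, address)
```

The two innerHTML lines are skipped on purpose: they do not start with a
plain variable name, so STATEMENT never matches them. Anything beyond this
narrow pattern would need a real JavaScript engine, as noted above.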

#91154

From	Steve Hayes <hayesstw@telkomsa.net>
Date	2015-05-24 03:04 +0200
Message-ID	<cq82ma9d4pqufo7u11m52eh4ca2hgmi25a@4ax.com>
In reply to	#91107

On Sat, 23 May 2015 19:01:55 +1000, Chris Angelico <rosuav@gmail.com>
wrote:

>On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
>> I am developing a web scraper code using HTMLParser. I need to extract
>> text/email address from java script with in the HTMLCode.I am beginner level
>> in python coding and totally lost here. Need some help on this. The java
>> script code is as below:
>>
>> <script type='text/javascript'>
>>  //<!--
>>  document.getElementById('cloak48218').innerHTML = '';
>>  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
>>  var path = 'hr' + 'ef' + '=';
>>  var addy48218 = '&#105;nf&#111;' + '&#64;';
>>  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
>> 'd&#101;';
>>  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>  //-->
>
>This is deliberately being done to prevent scripted usage. What
>exactly are you needing to do this for?

To sell addresses to spammers, of course. 


-- 
Terms and conditions apply. 

Steve Hayes
hayesmstw@hotmail.com

#91159

From	VanguardLH <V@nguard.LH>
Date	2015-05-24 01:17 -0500
Message-ID	<csd8neFidb7U1@mid.individual.net>
In reply to	#91154

Steve Hayes wrote:

> On Sat, 23 May 2015 19:01:55 +1000, Chris Angelico <rosuav@gmail.com>
> wrote:
> 
>>On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
>>> I am developing a web scraper code using HTMLParser. I need to extract
>>> text/email address from java script with in the HTMLCode.I am beginner level
>>> in python coding and totally lost here. Need some help on this. The java
>>> script code is as below:
>>>
>>> <script type='text/javascript'>
>>>  //<!--
>>>  document.getElementById('cloak48218').innerHTML = '';
>>>  var prefix = '&#109;a' + 'i&#108;' + '&#116;o';
>>>  var path = 'hr' + 'ef' + '=';
>>>  var addy48218 = '&#105;nf&#111;' + '&#64;';
>>>  addy48218 = addy48218 + 'tsv-n&#101;&#117;r&#105;&#101;d' + '&#46;' +
>>> 'd&#101;';
>>>  document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>>  //-->
>>
>>This is deliberately being done to prevent scripted usage. What
>>exactly are you needing to do this for?
> 
> To sell addresses to spammers, of course.

The boob that uses this javascripted obfuscation (by slicing up the URL
across variables and using concatenation within a variable) hasn't a
clue that the javascript or user clicking on a URL will still have to
eventually go to the destination so it will still get blocked.  Duh!
Nothing is actually cloaked by the javascript (it's just another means
of building up the <A> tag) and the URL string (even if it used a
decimal value instead of IP-dotted) still has to connect to somewhere
and that gets detected and blocked.  Slicing up a URL across variables
and concatenation within a variable is a child's ploy to obfuscate.
Apparently savitha can't even distinguish between an e-mail address and
a URL string.
