| Started by | Chris Angelico <rosuav@gmail.com> |
|---|---|
| First post | 2015-05-23 19:01 +1000 |
| Last post | 2015-05-24 01:17 -0500 |
| Articles | 3 — 3 participants |
This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by
below is the oldest one visible, not the original post.
Re: Extract email address from Java script in html source using python Chris Angelico <rosuav@gmail.com> - 2015-05-23 19:01 +1000
Re: Extract email address from Java script in html source using python Steve Hayes <hayesstw@telkomsa.net> - 2015-05-24 03:04 +0200
Re: Extract email address from Java script in html source using python VanguardLH <V@nguard.LH> - 2015-05-24 01:17 -0500
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2015-05-23 19:01 +1000 |
| Subject | Re: Extract email address from Java script in html source using python |
| Message-ID | <mailman.267.1432371718.17265.python-list@python.org> |
On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
> I am developing a web scraper code using HTMLParser. I need to extract
> text/email address from java script with in the HTMLCode.I am beginner level
> in python coding and totally lost here. Need some help on this. The java
> script code is as below:
>
> <script type='text/javascript'>
> //<!--
> document.getElementById('cloak48218').innerHTML = '';
> var prefix = 'ma' + 'il' + 'to';
> var path = 'hr' + 'ef' + '=';
> var addy48218 = 'info' + '@';
> addy48218 = addy48218 + 'tsv-neuried' + '.' +
> 'de';
> document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
> //-->
This is deliberately being done to prevent scripted usage. What
exactly are you needing to do this for?
You're basically going to have to execute the entire block of
JavaScript code, and then decode the entities to get to what you want.
Doing it manually is pretty easy; doing it automatically will
virtually require a language interpreter.
ChrisA
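As an illustration of the "doing it manually" route described above (this is not part of the original message): the quoted script only concatenates string literals and previously assigned variables, so a small hand-written evaluator is enough for this one pattern. The following is a minimal sketch under that assumption. The names `SCRIPT`, `ASSIGN`, `TOKEN`, and `evaluate` are hypothetical, and anything beyond plain `+` concatenation of single-quoted literals (function calls, character codes, semicolons inside strings) would need a real JavaScript interpreter, as the post says.

```python
import html
import re

# The assignment statements from the quoted script that build the address.
SCRIPT = """
var prefix = 'ma' + 'il' + 'to';
var path = 'hr' + 'ef' + '=';
var addy48218 = 'info' + '@';
addy48218 = addy48218 + 'tsv-neuried' + '.' +
'de';
"""

# name = expression;   (optionally preceded by "var"; expression may span lines)
ASSIGN = re.compile(r"(?:var\s+)?(\w+)\s*=\s*([^;]+);")
# Either a single-quoted string literal or a bare identifier.
TOKEN = re.compile(r"'((?:[^'\\]|\\.)*)'|(\w+)")


def evaluate(script):
    """Evaluate simple 'a' + b + 'c' assignments and return the variables."""
    env = {}
    for name, expr in ASSIGN.findall(script):
        parts = []
        for literal, ident in TOKEN.findall(expr):
            # Identifiers are looked up in earlier assignments;
            # literals are used as written.
            parts.append(env.get(ident, "") if ident else literal)
        env[name] = "".join(parts)
    return env


# html.unescape covers the "decode the entities" step; it changes nothing
# for this particular script, which contains no HTML entities.
address = html.unescape(evaluate(SCRIPT)["addy48218"])
print(address)  # info@tsv-neuried.de
```

The evaluator deliberately ignores the `innerHTML` lines, since they only wrap the already-built address in an `<a>` tag.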
| From | Steve Hayes <hayesstw@telkomsa.net> |
|---|---|
| Date | 2015-05-24 03:04 +0200 |
| Message-ID | <cq82ma9d4pqufo7u11m52eh4ca2hgmi25a@4ax.com> |
| In reply to | #91107 |
On Sat, 23 May 2015 19:01:55 +1000, Chris Angelico <rosuav@gmail.com>
wrote:
>On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
>> I am developing a web scraper code using HTMLParser. I need to extract
>> text/email address from java script with in the HTMLCode.I am beginner level
>> in python coding and totally lost here. Need some help on this. The java
>> script code is as below:
>>
>> <script type='text/javascript'>
>> //<!--
>> document.getElementById('cloak48218').innerHTML = '';
>> var prefix = 'ma' + 'il' + 'to';
>> var path = 'hr' + 'ef' + '=';
>> var addy48218 = 'info' + '@';
>> addy48218 = addy48218 + 'tsv-neuried' + '.' +
>> 'de';
>> document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>> //-->
>
>This is deliberately being done to prevent scripted usage. What
>exactly are you needing to do this for?
To sell addresses to spammers, of course.
--
Terms and conditions apply.
Steve Hayes
hayesmstw@hotmail.com
| From | VanguardLH <V@nguard.LH> |
|---|---|
| Date | 2015-05-24 01:17 -0500 |
| Message-ID | <csd8neFidb7U1@mid.individual.net> |
| In reply to | #91154 |
Steve Hayes wrote:
> On Sat, 23 May 2015 19:01:55 +1000, Chris Angelico <rosuav@gmail.com>
> wrote:
>
>>On Sat, May 23, 2015 at 4:46 PM, savitha devi <savithad8@gmail.com> wrote:
>>> I am developing a web scraper code using HTMLParser. I need to extract
>>> text/email address from java script with in the HTMLCode.I am beginner level
>>> in python coding and totally lost here. Need some help on this. The java
>>> script code is as below:
>>>
>>> <script type='text/javascript'>
>>> //<!--
>>> document.getElementById('cloak48218').innerHTML = '';
>>> var prefix = 'ma' + 'il' + 'to';
>>> var path = 'hr' + 'ef' + '=';
>>> var addy48218 = 'info' + '@';
>>> addy48218 = addy48218 + 'tsv-neuried' + '.' +
>>> 'de';
>>> document.getElementById('cloak48218').innerHTML += '<a ' + path + '\'' +
>>> prefix + ':' + addy48218 + '\'>' + addy48218+'<\/a>';
>>> //-->
>>
>>This is deliberately being done to prevent scripted usage. What
>>exactly are you needing to do this for?
>
> To sell addresses to spammers, of course.
The boob that uses this javascripted obfuscation (by slicing up the URL
across variables and using concatenation within a variable) hasn't a
clue that the javascript or user clicking on a URL will still have to
eventually go to the destination so it will still get blocked. Duh!
Nothing is actually cloaked by the javascript (it's just another means
of building up the <A> tag) and the URL string (even if it used a
decimal value instead of IP-dotted) still has to connect to somewhere
and that gets detected and blocked. Slicing up a URL across variables
and concatenation within a variable is a child's ploy to obfuscate.
Apparently savitha can't even distinguish between an e-mail address and
a URL string.