What do I do to read html files on my pc?

Started by	mikcec82 <michele.cecere@gmail.com>
First post	2012-08-27 03:59 -0700
Last post	2012-08-29 05:00 -0700
Articles	12 — 8 participants

  What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-27 03:59 -0700
    Re: What do I do to read html files on my pc? Chris Angelico <rosuav@gmail.com> - 2012-08-27 21:58 +1000
    Re: What do I do to read html files on my pc? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-27 13:05 +0100
    Re: What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-27 06:51 -0700
      Re: What do I do to read html files on my pc? Joel Goldstick <joel.goldstick@gmail.com> - 2012-08-27 10:21 -0400
      Re: What do I do to read html files on my pc? Chris Angelico <rosuav@gmail.com> - 2012-08-28 00:41 +1000
      Re: What do I do to read html files on my pc? Jean-Michel Pichavant <jeanmichel@sequans.com> - 2012-08-27 18:57 +0200
    Re: What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-28 03:09 -0700
      Re: What do I do to read html files on my pc? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-28 13:31 +0100
    Re: What do I do to read html files on my pc? Peter Otten <__peter__@web.de> - 2012-08-28 17:38 +0200
    Re: What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-29 03:22 -0700
    Re: What do I do to read html files on my pc? Umesh Sharma <usharma01@gmail.com> - 2012-08-29 05:00 -0700

#27974 — What do I do to read html files on my pc?

From	mikcec82 <michele.cecere@gmail.com>
Date	2012-08-27 03:59 -0700
Subject	What do I do to read html files on my pc?
Message-ID	<1c7cd833-b6ad-4a17-8ffe-a0ce20c8f400@googlegroups.com>

Hallo,

I have an HTML file on my PC and I want to read it to extract some text.
Can you tell me which libraries I should use and how to do it?

Thank you so much.

Michele

#27980

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-27 21:58 +1000
Message-ID	<mailman.3872.1346068691.4697.python-list@python.org>
In reply to	#27974

On Mon, Aug 27, 2012 at 8:59 PM, mikcec82 <michele.cecere@gmail.com> wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.

Try BeautifulSoup. You can find it at the opposite end of a web search.

Not trying to be unhelpful, but without more description of the
problem, there's not a lot more to say :)

ChrisA

#27981

From	Mark Lawrence <breamoreboy@yahoo.co.uk>
Date	2012-08-27 13:05 +0100
Message-ID	<mailman.3873.1346069036.4697.python-list@python.org>
In reply to	#27974

On 27/08/2012 11:59, mikcec82 wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.
>
> Michele
>

Type something like "python html parsing" into the box of your favourite 
search engine, hit return and follow the links it comes back with. 
Write some code.  If you have problems give us the smallest code snippet 
that reproduces the issue together with the complete traceback and we'll 
help.

-- 
Cheers.

Mark Lawrence.

#27984

From	mikcec82 <michele.cecere@gmail.com>
Date	2012-08-27 06:51 -0700
Message-ID	<858c2da2-6936-4bd7-8944-f45446fbd3be@googlegroups.com>
In reply to	#27974

On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.
>
> Michele

Hi ChrisA, Hi Mark.
Thanks a lot.

I have this HTML data and I want to check whether the string "XXXX" and/or the string "NOT PASSED" is present in it:

</th>
<td>
<samp>
&nbsp;
&nbsp;
&nbsp;
&nbsp;
&nbsp;
</samp>
XXXX
</td>
</tr>
<tr>
.
.
.
<th/>
<th/>
</tr>
<tr align="left" style="color: red">
<th/>
<th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
<tr>
<th/>

Depending on this check, I have to fill a cell in an Excel file with the answer: NOK (if NOT PASSED or XXXX is present) or OK (if neither NOT PASSED nor XXXX is present).

Thanks again for your help (and sorry for my English)
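
A minimal Python 3 sketch of the rule above, assuming a plain substring test is enough. Writing a real Excel workbook needs a third-party library, so this sketch writes CSV (which Excel can open) with the standard `csv` module instead. The names `verdict` and `write_results` and the report file names are hypothetical.

```python
import csv
import io

def verdict(text):
    """Return 'NOK' if either marker occurs in the text, otherwise 'OK'."""
    return "NOK" if ("XXXX" in text or "NOT PASSED" in text) else "OK"

def write_results(rows, stream):
    """Write (file name, verdict) pairs as CSV rows to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(["file", "result"])
    writer.writerows(rows)

# Demonstration with an in-memory stream; a real script would use
# open("results.csv", "w", newline="") instead (file name is an assumption).
buffer = io.StringIO()
write_results(
    [
        ("report_a.html", verdict("CODE CHECK : NOT PASSED")),
        ("report_b.html", verdict("CODE CHECK : PASSED")),
    ],
    buffer,
)
print(buffer.getvalue())
```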

#27985

From	Joel Goldstick <joel.goldstick@gmail.com>
Date	2012-08-27 10:21 -0400
Message-ID	<mailman.3876.1346077298.4697.python-list@python.org>
In reply to	#27984

On Mon, Aug 27, 2012 at 9:51 AM, mikcec82 <michele.cecere@gmail.com> wrote:
> On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
>> Hallo,
>>
>> I have an html file on my pc and I want to read it to extract some text.
>> Can you help on which libs I have to use and how can I do it?
>>
>> thank you so much.
>>
>> Michele
>
> Hi ChrisA, Hi Mark.
> Thanks a lot.
>
> I have this HTML data and I want to check whether the string "XXXX" and/or the string "NOT PASSED" is present in it:
>
> </th>
> <td>
> <samp>
> &nbsp;
> &nbsp;
> &nbsp;
> &nbsp;
> &nbsp;
> </samp>
> XXXX
> </td>
> </tr>
> <tr>
> .
> .
> .
> <th/>
> <th/>
> </tr>
> <tr align="left" style="color: red">
> <th/>
> <th>
> CODE CHECK
> </th>
> <th>
> : NOT PASSED
> </th>
> </tr>
> <tr>
> <th/>
>
> Depending on this check, I have to fill a cell in an Excel file with the answer: NOK (if NOT PASSED or XXXX is present) or OK (if neither NOT PASSED nor XXXX is present).
>
> Thanks again for your help (and sorry for my English)

From your example, there doesn't seem to be enough information to know
where in the HTML your strings will be.

If you just read the whole file into a string you can do this:

>>> s = "this is a string"
>>> if 'this' in s:
...   print 'yes'
...
yes
>>>

Of course you will be testing for 'XXXX' or 'NOT PASSED'
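
Applied to a file, the same idea might look like the Python 3 sketch below. It assumes the report is text that can be decoded as UTF-8 (undecodable bytes are replaced rather than raising an error); `find_markers` is a made-up helper name, and the temporary file only stands in for the real report.

```python
import os
import tempfile

MARKERS = ("XXXX", "NOT PASSED")

def find_markers(path, markers=MARKERS):
    """Return the markers that occur anywhere in the file at `path`."""
    # The encoding is an assumption; errors="replace" keeps odd bytes
    # from aborting the read.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        text = handle.read()
    return [marker for marker in markers if marker in text]

# Self-contained demonstration using a temporary file.
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<tr><th>CODE CHECK</th><th>: NOT PASSED</th></tr>")

hits = find_markers(tmp.name)
os.remove(tmp.name)
print("NOK" if hits else "OK")
```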


-- 
Joel Goldstick

#27986

From	Chris Angelico <rosuav@gmail.com>
Date	2012-08-28 00:41 +1000
Message-ID	<mailman.3877.1346078509.4697.python-list@python.org>
In reply to	#27984

On Mon, Aug 27, 2012 at 11:51 PM, mikcec82 <michele.cecere@gmail.com> wrote:
> I have this HTML data and I want to check whether the string "XXXX" and/or the string "NOT PASSED" is present in it:

Start by scribbling down some notes in your native language (that is,
don't bother trying to write code yet), defining exactly what you're
looking for. What constitutes a hit? What would be a false positive
that you need to avoid? For instance:

* The string XXXX must occur outside of any HTML tag.
or:
* The string XXXX must occur inside a <td> but not inside <samp>.
or:
* The string XXXX must be in the first <td> inside of a <tr> in the
<table> that immediately follows the text "abcdefg".

Make sure it's clear enough that anybody could follow it, even without
knowing everything you know about your files. Once you have that
algorithmic description, it's simply a matter of translating it into a
language the computer can handle, and that's fairly straightforward.
An hour or two with language/library documentation and you'll quite
possibly have working code, or if you don't, you'll at least have
something that you can show to the list and ask for help with.
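
As an illustration of how such a rule translates, here is a rough Python 3 sketch of the second example rule (text inside a <td> but not inside <samp>), using the standard library's `html.parser`. It assumes simple, non-nested cells; `CellTextCollector` and `td_texts` are made-up names.

```python
from html.parser import HTMLParser

class CellTextCollector(HTMLParser):
    """Collect text that sits inside <td> but outside <samp>."""

    def __init__(self):
        super().__init__()
        self.td_depth = 0
        self.samp_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.td_depth += 1
        elif tag == "samp":
            self.samp_depth += 1

    def handle_endtag(self, tag):
        if tag == "td" and self.td_depth:
            self.td_depth -= 1
        elif tag == "samp" and self.samp_depth:
            self.samp_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found in a <td> and not in a <samp>.
        text = data.strip()
        if text and self.td_depth and not self.samp_depth:
            self.chunks.append(text)

def td_texts(html):
    """Return the qualifying text chunks found in an HTML string."""
    parser = CellTextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks

sample = "<tr><th>DTC CODE Read:</th><td><samp>&nbsp;&nbsp;</samp>\nXXXX\n</td></tr>"
print(td_texts(sample))
```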

But until you have that, advice from this list is going to be fairly
vague, and may turn out to be quite misleading. We can't solve your
problem until we know what it is, and you can't tell us what the
problem is until you know yourself.

ChrisA

#27991

From	Jean-Michel Pichavant <jeanmichel@sequans.com>
Date	2012-08-27 18:57 +0200
Message-ID	<mailman.3880.1346086640.4697.python-list@python.org>
In reply to	#27984

mikcec82 wrote:
> [snip]
> <th/>
> <th/>
> </tr>
> <tr align="left" style="color: red">
> <th/>
> <th>
> CODE CHECK
> </th>
> <th>
> : NOT PASSED
> </th>
> </tr>
> <tr>
> <th/>
>
> Depending on this check, I have to fill a cell in an Excel file with the answer: NOK (if NOT PASSED or XXXX is present) or OK (if neither NOT PASSED nor XXXX is present).
>
> Thanks again for your help (and sorry for my English)
>   
HTML is not a format you want to extract data from, mainly because it is
the endpoint of content AND display, meaning that what parses properly
today may not parse tomorrow because someone changed the background color.
You should change your server so that it can feed a client with data (XML,
for instance, is quite close to HTML syntax: it's based on tags and is
suitable for data).
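
For illustration only, this is roughly what consuming such a data feed could look like in Python 3 with the standard `xml.etree.ElementTree` module. The XML layout below (element and attribute names, and the sample values) is entirely hypothetical; nothing in the original report is known to look like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical data format: one <check> element per test.
REPORT = """<report>
  <check name="CODE" target="0201" read="XXXX" passed="false"/>
  <check name="TEXT" target="13" read="13" passed="true"/>
</report>"""

root = ET.fromstring(REPORT)

# Names of all checks that did not pass.
failed = [
    check.get("name")
    for check in root.iter("check")
    if check.get("passed") != "true"
]

verdict = "NOK" if failed else "OK"
print(verdict, failed)
```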

JM

#28016

From	mikcec82 <michele.cecere@gmail.com>
Date	2012-08-28 03:09 -0700
Message-ID	<dd711c0c-19ec-4d3d-bfc4-93ff209ec7e4@googlegroups.com>
In reply to	#27974

On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.
>
> Michele

Thank you all.

Hi Chris, thank you for your hint. I'll try to do as you said and be clear:

I have to work on an HTML file. This file is not a website file, nor does it come from the Internet.
It is a file created by local software (where "local" means "on my PC").

On this file, I need to do these operations:

	1) Open the file
	2) Check the occurrences of the strings:
		2a) XXXX, in this case I have this code:
					
					<tr style="font-size: 10" align="left">
					<th>
					</th><th>
					DTC CODE Read:
					</th>
					<td>
					<samp>
					&nbsp;
					&nbsp;
					&nbsp;
					&nbsp;
					&nbsp;
					</samp>
					XXXX
					</td>
					</tr>

		2b)	NOT PASSED, in this case I have this code:
		
					<tr style="color: red" align="left">
					<th>
					</th><th>
					CODE CHECK
					</th>
					<th>
					: NOT PASSED
					</th>
					</tr>
			Note: color in "<tr style="color: red" align="left">" can be "red" or "orange"
			
		2c) OK or PASSED
	   
	3) Then, I need to fill an Excel file following these rules:
		3a) If 2a or 2b occurs in the HTML file, I'll write NOK in the Excel file
		3b) If 2c occurs in the HTML file, I'll write OK in the Excel file

Note:
1) In this example, in case 2b, I have "CODE CHECK" in the code, but I could also have "TEXT CHECK" or "CHAR CHECK".
2) The search for occurrences can be done either by tag ("<tr style="color: red" align="left">") or by text (NOT PASSED, PASSED). I would prefer to use the first method.
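
The tag-based method from note 2 might look like this in Python 3. This is only a sketch: it assumes the failing rows are always <tr> elements whose inline `style` sets `color` to red or orange, as in the samples above. `style_color`, `FlaggedRowFinder`, and `flagged_rows` are made-up names.

```python
from html.parser import HTMLParser

def style_color(style):
    """Return the value of the CSS `color` property in an inline style, or None."""
    for declaration in (style or "").split(";"):
        name, sep, value = declaration.partition(":")
        if sep and name.strip().lower() == "color":
            return value.strip().lower()
    return None

class FlaggedRowFinder(HTMLParser):
    """Collect the text of every <tr> whose inline style colour is flagged."""

    def __init__(self, colors=("red", "orange")):
        super().__init__()
        self.colors = colors
        self.in_flagged_row = False
        self.current = []
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            color = style_color(dict(attrs).get("style"))
            self.in_flagged_row = color in self.colors
            self.current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self.in_flagged_row:
            self.rows.append(" ".join(self.current))
            self.in_flagged_row = False

    def handle_data(self, data):
        # Normalise whitespace and skip chunks that are only line breaks.
        words = data.split()
        if self.in_flagged_row and words:
            self.current.append(" ".join(words))

def flagged_rows(html):
    """Return the text of each flagged row in an HTML string."""
    finder = FlaggedRowFinder()
    finder.feed(html)
    finder.close()
    return finder.rows

sample = """<tr style="font-size: 10" align="left">
<th></th><th>DTC CODE Read:</th><td>XXXX</td>
</tr>
<tr style="color: red" align="left">
<th>
</th><th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>"""
print(flagged_rows(sample))
```

Matching on the `color` property itself, rather than on the raw attribute text, avoids treating something like `background-color: red` as a failure.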
==================================================

In my script I have used the second way of searching, i.e.:

**
fileorig = r"C:\Users\Mike\Desktop\2012_05_16_1___p0201_13.html"  # raw string: backslashes are kept literally

f = open(fileorig, 'r')
nomefile = f.read()

for x in nomefile:
    if 'XXXX' in nomefile:
        print 'NOK'
    else :
        print 'OK'
**
But this one works on characters and not on strings (i.e., this way I have searched not string by string, but character by character).
		
===============================================

I hope I was clear.

Thanks for your help
Michele

#28017

From	Oscar Benjamin <oscar.j.benjamin@gmail.com>
Date	2012-08-28 13:31 +0100
Message-ID	<mailman.3897.1346157132.4697.python-list@python.org>
In reply to	#28016

On Tue, 28 Aug 2012 03:09:11 -0700 (PDT), mikcec82 
<michele.cecere@gmail.com> wrote:
> f = open(fileorig, 'r')
> nomefile = f.read()


> for x in nomefile:
>     if 'XXXX' in nomefile:
>         print 'NOK'
>     else :
>         print 'OK'

You don't need the for loop. Just do:

nomefile = f.read()
if 'XXXX' in nomefile:
    print('NOK')
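
To see why the loop misbehaved, here is a small Python 3 illustration on a made-up string (the variable names are only for this example). Iterating over a string visits one character at a time, and the original loop body repeated the same whole-string test on every pass.

```python
text = "abXXXXcd"

# Iterating over a string yields its characters one at a time.
characters = list(text)

# The original loop never used x; it repeated the same whole-string test
# once per character, so it printed one line per character in the file.
lines_printed = sum(1 for x in text)

# A single membership test already answers the question.
found = "XXXX" in text
print(characters[:3], lines_printed, found)
```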

> **
> But this one works on characters and not on strings (i.e., this way I
> have searched not string by string, but character by character).

Oscar

#28018

From	Peter Otten <__peter__@web.de>
Date	2012-08-28 17:38 +0200
Message-ID	<mailman.3900.1346168282.4697.python-list@python.org>
In reply to	#27974

mikcec82 wrote:

> On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
>> Hallo,
>>
>> I have an html file on my pc and I want to read it to extract some text.
>> Can you help on which libs I have to use and how can I do it?
>>
>> thank you so much.
>>
>> Michele
> 
> Hi Oscar,
> I tried what you said and developed the code you will see below.
> But when I have a situation like this in an HTML file, in which there is a
> repetition of a string (XX in this case):
> CODE Target: 	        0201
> CODE Read: 	        XXXX
> CODE CHECK 	: NOT PASSED
> TEXT Target:              13
> TEXT Read: 	          XX
> TEXT CHECK 	: NOT PASSED
> CHAR Target: 	          AA
> CHAR Read: 	          XX
> CHAR CHECK 	: NOT PASSED
> 
> With this code (created starting from yours)
> 
> index = nomefile.find('XXXX')
> print 'XXXX_ found at location', index
> 
> index2 = nomefile.find('XX')
> print 'XX_ found at location', index2
> 
> found = nomefile.find('XX')
> while found > -1:
>     print "XX found at location", found
>     found = nomefile.find('XX', found+1)
> 
> I have an answer like this:
> 
> XXXX_ found at location 51315
> XX_ found at location 51315
> XX found at location 51315
> XX found at location 51316
> XX found at location 51317
> XX found at location 52321
> XX found at location 53328
> 
> I did this to find all occurrences of the 'XXXX' and 'XX' strings. But, as
> you can see, the script also finds occurrences of XX at locations
> 51315, 51316, and 51317, which correspond to the string XXXX.
> 
> Is there a way to search for all occurrences of XX while avoiding the XXXX location?

Remove the false positives afterwards:

start = nomefile.find("XX")
while start != -1:
    if nomefile[start:start+4] == "XXXX":
        start += 4
    else:
        print "XX found at location", start
        start += 3
    start = nomefile.find("XX", start)

By the way, what do you want to do if there are runs of "X" with repeats 
other than 2 or 4?
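
One possible way to make that question explicit is to find every maximal run of "X" and decide by its length. This is a Python 3 sketch using the standard `re` module, not the method used above; `x_runs` and the sample text are made up for illustration.

```python
import re
from collections import Counter

def x_runs(text):
    """Return (start, length) for every maximal run of two or more 'X'."""
    return [(m.start(), len(m.group())) for m in re.finditer(r"X{2,}", text)]

sample = "CODE Read: XXXX\nTEXT Read: XX\nCHAR Read: XX\n"

runs = x_runs(sample)

# Tally the runs by length; any length other than 2 or 4 would show up
# here as its own key instead of being silently miscounted.
counts = Counter(length for _, length in runs)
print(runs, dict(counts))
```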

#28052

From	mikcec82 <michele.cecere@gmail.com>
Date	2012-08-29 03:22 -0700
Message-ID	<c92e1fdc-ff14-4484-9e03-e322de9ba82b@googlegroups.com>
In reply to	#27974

On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.
>
> Michele

Hi Peter, and thanks for your valuable help.
Fortunately, there aren't runs of "X" with repeats other than 2 or 4.
Starting from your code, I wrote this code (I'm posting it in case it is helpful for other people):
f = open(fileorig, 'r')
nomefile = f.read()
f.close()

start = nomefile.find("XX")
start2 = nomefile.find("NOT PASSED")
c0 = 0
c1 = 0
c2 = 0

# A position of -1 means that string has no further matches.
while start != -1 or start2 != -1:

    if start != -1:
        if nomefile[start:start+4] == "XXXX":
            print "XXXX       found at location", start
            start += 4
            c0 += 1
        else:
            print "XX         found at location", start
            start += 2
            c1 += 1
        start = nomefile.find("XX", start)

    if start2 != -1:
        print "NOT PASSED found at location", start2
        c2 += 1
        start2 = nomefile.find("NOT PASSED", start2 + 10)

print "XXXX       %s found" % c0, "\nXX         %s found" % c1, "\nNOT PASSED %s found" % c2

Now I'm able to find all occurrences of the strings "XXXX", "XX" and "NOT PASSED".


Thank you so much.

#28058

From	Umesh Sharma <usharma01@gmail.com>
Date	2012-08-29 05:00 -0700
Message-ID	<1770147e-cb1f-4100-9220-f3f3f4e23f04@googlegroups.com>
In reply to	#27974

You can use the httplib library to download the HTML, and then, to extract the text from it, you can use either a library (google for it) or a regular expression.
