Groups > comp.lang.python > #27974 > unrolled thread
| Started by | mikcec82 <michele.cecere@gmail.com> |
|---|---|
| First post | 2012-08-27 03:59 -0700 |
| Last post | 2012-08-29 05:00 -0700 |
| Articles | 12 — 8 participants |
What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-27 03:59 -0700
Re: What do I do to read html files on my pc? Chris Angelico <rosuav@gmail.com> - 2012-08-27 21:58 +1000
Re: What do I do to read html files on my pc? Mark Lawrence <breamoreboy@yahoo.co.uk> - 2012-08-27 13:05 +0100
Re: What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-27 06:51 -0700
Re: What do I do to read html files on my pc? Joel Goldstick <joel.goldstick@gmail.com> - 2012-08-27 10:21 -0400
Re: What do I do to read html files on my pc? Chris Angelico <rosuav@gmail.com> - 2012-08-28 00:41 +1000
Re: What do I do to read html files on my pc? Jean-Michel Pichavant <jeanmichel@sequans.com> - 2012-08-27 18:57 +0200
Re: What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-28 03:09 -0700
Re: What do I do to read html files on my pc? Oscar Benjamin <oscar.j.benjamin@gmail.com> - 2012-08-28 13:31 +0100
Re: What do I do to read html files on my pc? Peter Otten <__peter__@web.de> - 2012-08-28 17:38 +0200
Re: What do I do to read html files on my pc? mikcec82 <michele.cecere@gmail.com> - 2012-08-29 03:22 -0700
Re: What do I do to read html files on my pc? Umesh Sharma <usharma01@gmail.com> - 2012-08-29 05:00 -0700
| From | mikcec82 <michele.cecere@gmail.com> |
|---|---|
| Date | 2012-08-27 03:59 -0700 |
| Subject | What do I do to read html files on my pc? |
| Message-ID | <1c7cd833-b6ad-4a17-8ffe-a0ce20c8f400@googlegroups.com> |
Hallo,

I have an html file on my pc and I want to read it to extract some text.
Can you help on which libs I have to use and how can I do it?

thank you so much.

Michele
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-27 21:58 +1000 |
| Message-ID | <mailman.3872.1346068691.4697.python-list@python.org> |
| In reply to | #27974 |
On Mon, Aug 27, 2012 at 8:59 PM, mikcec82 <michele.cecere@gmail.com> wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.

Try BeautifulSoup. You can find it at the opposite end of a web search.

Not trying to be unhelpful, but without more description of the problem, there's not a lot more to say :)

ChrisA
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2012-08-27 13:05 +0100 |
| Message-ID | <mailman.3873.1346069036.4697.python-list@python.org> |
| In reply to | #27974 |
On 27/08/2012 11:59, mikcec82 wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.
>
> Michele
>

Type something like "python html parsing" into the box of your favourite search engine, hit return and follow the links it comes back with. Write some code. If you have problems give us the smallest code snippet that reproduces the issue together with the complete traceback and we'll help.

--
Cheers.
Mark Lawrence.
| From | mikcec82 <michele.cecere@gmail.com> |
|---|---|
| Date | 2012-08-27 06:51 -0700 |
| Message-ID | <858c2da2-6936-4bd7-8944-f45446fbd3be@googlegroups.com> |
| In reply to | #27974 |
On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
> Hallo,
>
> I have an html file on my pc and I want to read it to extract some text.
> Can you help on which libs I have to use and how can I do it?
>
> thank you so much.
>
> Michele

Hi ChrisA, Hi Mark.
Thanks a lot.

I have this HTML data and I want to check whether the string "XXXX" and/or the string "NOT PASSED" is present:

</th>
<td>
<samp>
</samp>
XXXX
</td>
</tr>
<tr>
.
.
.
<th/>
<th/>
</tr>
<tr align="left" style="color: red">
<th/>
<th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
<tr>
<th/>

Depending on this check I have to fill a cell in an Excel file with the answer: NOK (if NOT PASSED or XXXX is present), or OK (if neither NOT PASSED nor XXXX is present).

Thanks again for your help (and sorry for my English)
| From | Joel Goldstick <joel.goldstick@gmail.com> |
|---|---|
| Date | 2012-08-27 10:21 -0400 |
| Message-ID | <mailman.3876.1346077298.4697.python-list@python.org> |
| In reply to | #27984 |
On Mon, Aug 27, 2012 at 9:51 AM, mikcec82 <michele.cecere@gmail.com> wrote:
> On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
>> Hallo,
>>
>> I have an html file on my pc and I want to read it to extract some text.
>> Can you help on which libs I have to use and how can I do it?
>>
>> thank you so much.
>>
>> Michele
>
> Hi ChrisA, Hi Mark.
> Thanks a lot.
>
> I have this html data and I want to check if it is present a string "XXXX" or/and a string "NOT PASSED":
>
> </th>
> <td>
> <samp>
> </samp>
> XXXX
> </td>
> </tr>
> <tr>
> .
> .
> .
> <th/>
> <th/>
> </tr>
> <tr align="left" style="color: red">
> <th/>
> <th>
> CODE CHECK
> </th>
> <th>
> : NOT PASSED
> </th>
> </tr>
> <tr>
> <th/>
>
> Depending on this check I have to fill a cell in an excel file with answer: NOK (if Not passed or XXXX is present), or OK (if Not passed and XXXX are not present).
>
> Thanks again for your help (and sorry for my english)

From your example, it doesn't seem there is enough information to know where in the HTML your strings will be. If you just read the whole file into a string, you can do this:

>>> s = "this is a string"
>>> if 'this' in s:
...     print 'yes'
...
yes
>>>

Of course, you will be testing for 'XXXX' or 'NOT PASSED'.

--
Joel Goldstick
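A minimal sketch of the "read the whole file into a string" idea, written for Python 3. The function name `classify`, the file name `report.html`, and the UTF-8 encoding are assumptions made for illustration; the thread does not say how the real report is encoded.

```python
import tempfile
from pathlib import Path


def classify(path):
    """Return "NOK" if either marker string occurs anywhere in the file."""
    # Assumption: the report is UTF-8 text; adjust the encoding if it is not.
    text = Path(path).read_text(encoding="utf-8")
    return "NOK" if "XXXX" in text or "NOT PASSED" in text else "OK"


# Hypothetical report written to a temporary file for the demonstration.
with tempfile.TemporaryDirectory() as folder:
    report = Path(folder) / "report.html"
    report.write_text("<th>CODE CHECK</th><th>: NOT PASSED</th>", encoding="utf-8")
    result = classify(report)

print(result)
```

This treats the file as plain text and ignores the HTML structure entirely, which is enough when the marker strings cannot appear anywhere else in the report.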
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2012-08-28 00:41 +1000 |
| Message-ID | <mailman.3877.1346078509.4697.python-list@python.org> |
| In reply to | #27984 |
On Mon, Aug 27, 2012 at 11:51 PM, mikcec82 <michele.cecere@gmail.com> wrote:
> I have this html data and I want to check if it is present a string "XXXX" or/and a string "NOT PASSED":

Start by scribbling down some notes in your native language (that is, don't bother trying to write code yet), defining exactly what you're looking for. What constitutes a hit? What would be a false positive that you need to avoid? For instance:

* The string XXXX must occur outside of any HTML tag.

or:

* The string XXXX must occur inside a <td> but not inside <samp>.

or:

* The string XXXX must be in the first <td> inside of a <tr> in the <table> that immediately follows the text "abcdefg".

Make sure it's clear enough that anybody could follow it, even without knowing everything you know about your files.

Once you have that algorithmic description, it's simply a matter of translating it into a language the computer can handle; and that's fairly straight-forward. An hour or two with language/library documentation and you'll quite possibly have working code, or if you don't, you'll at least have something that you can show to the list and ask for help with.

But until you have that, advice from this list is going to be fairly vague, and may turn out to be quite misleading. We can't solve your problem until we know what it is, and you can't tell us what the problem is until you know yourself.

ChrisA
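As an illustration of how one such rule translates into code, here is a sketch of the first rule ("XXXX must occur outside of any HTML tag") using the standard library's `html.parser` in Python 3. The class name `TextCollector`, the helper `text_outside_tags`, and the sample markup are hypothetical, not taken from the original files.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text that appears outside of any HTML tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tag names and attributes never arrive here.
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def text_outside_tags(html):
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.chunks


# Hypothetical sample: XXXX appears once as an attribute value and once as text.
sample = '<tr><td class="XXXX"><samp></samp>0201</td></tr><tr><td>XXXX</td></tr>'
chunks = text_outside_tags(sample)
hit = any("XXXX" in chunk for chunk in chunks)
print(chunks, hit)
```

Only the second occurrence counts as a hit here, because the attribute value is never passed to `handle_data`. The other two rules would need extra state (which tags are currently open), which is exactly why writing the rule down first matters.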
| From | Jean-Michel Pichavant <jeanmichel@sequans.com> |
|---|---|
| Date | 2012-08-27 18:57 +0200 |
| Message-ID | <mailman.3880.1346086640.4697.python-list@python.org> |
| In reply to | #27984 |
mikcec82 wrote:
> [snip]
> <th/>
> <th/>
> </tr>
> <tr align="left" style="color: red">
> <th/>
> <th>
> CODE CHECK
> </th>
> <th>
> : NOT PASSED
> </th>
> </tr>
> <tr>
> <th/>
>
> Depending on this check I have to fill a cell in an excel file with answer: NOK (if Not passed or XXXX is present), or OK (if Not passed and XXXX are not present).
>
> Thanks again for your help (and sorry for my english)
>

HTML is not a format you want to extract data from, mainly because it is the endpoint of content AND display: what parses properly today may not parse tomorrow because someone changed the background color.

You should change your server so that it can feed a client with data (XML, for instance, is quite close to HTML syntax: it is based on tags and is suitable for data).

JM
| From | mikcec82 <michele.cecere@gmail.com> |
|---|---|
| Date | 2012-08-28 03:09 -0700 |
| Message-ID | <dd711c0c-19ec-4d3d-bfc4-93ff209ec7e4@googlegroups.com> |
| In reply to | #27974 |
On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
> Hallo,
>
>
>
> I have an html file on my pc and I want to read it to extract some text.
>
> Can you help on which libs I have to use and how can I do it?
>
>
>
> thank you so much.
>
>
>
> Michele
Thank you to all.
Hi Chris, thank you for your hint. I'll try to do as you said and to be clear:
I have to work on an HTML file. This file is not a website file, nor does it come from the internet.
It is a file created by local software (where "local" means "on my PC").
On this file, I need to perform these operations:
1) Open the file
2) Check for occurrences of the strings:
2a) XXXX, in this case I have this code:
<tr style="font-size: 10" align="left">
<th>
</th><th>
DTC CODE Read:
</th>
<td>
<samp>
</samp>
XXXX
</td>
</tr>
2b) NOT PASSED, in this case I have this code:
<tr style="color: red" align="left">
<th>
</th><th>
CODE CHECK
</th>
<th>
: NOT PASSED
</th>
</tr>
Note: the color in "<tr style="color: red" align="left">" can be "red" or "orange"
2c) OK or PASSED
3) Then, I need to fill an Excel file following these rules:
3a) If 2a or 2b occurs in the HTML file, I'll write NOK in the Excel file
3b) If 2c occurs in the HTML file, I'll write OK in the Excel file
Note:
1) In this example, in case 2b, I have "CODE CHECK" in the code, but I could also have "TEXT CHECK" or "CHAR CHECK".
2) The search for occurrences can be done either by tag ("<tr style="color: red" align="left">") or by text (NOT PASSED, PASSED). I would prefer to use the first method.
==================================================
In my script I have used the second way of searching, i.e.:
**
fileorig = "C:\Users\Mike\Desktop\\2012_05_16_1___p0201_13.html"
f = open(fileorig, 'r')
nomefile = f.read()
for x in nomefile:
    if 'XXXX' in nomefile:
        print 'NOK'
    else:
        print 'OK'
**
But this works on characters and not on strings (i.e. this way I have searched not string by string, but character by character).
===============================================
I hope I was clear.
Thanks for your help
Michele
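The tag-based check described in the post above (a row styled red or orange, or a cell whose text is XXXX) could be sketched with the standard library's `html.parser` in Python 3. The names `ReportChecker` and `verdict` are hypothetical, and the exact attribute formatting of real reports is an assumption based only on the fragments posted here.

```python
from html.parser import HTMLParser


class ReportChecker(HTMLParser):
    """Flag <tr> rows styled red/orange and text nodes equal to XXXX."""

    def __init__(self):
        super().__init__()
        self.failed_rows = 0   # <tr> elements whose style is color: red/orange
        self.unread_cells = 0  # text nodes that consist only of "XXXX"

    def handle_starttag(self, tag, attrs):
        if tag != "tr":
            return
        # Normalise 'color: red' and 'color:red' to the same form.
        style = (dict(attrs).get("style") or "").replace(" ", "").lower()
        if style in ("color:red", "color:orange"):
            self.failed_rows += 1

    def handle_data(self, data):
        if data.strip() == "XXXX":
            self.unread_cells += 1


def verdict(html):
    """Return "NOK" if case 2a or 2b occurs, otherwise "OK"."""
    checker = ReportChecker()
    checker.feed(html)
    checker.close()
    return "NOK" if checker.failed_rows or checker.unread_cells else "OK"


# Fragments modelled on the post; the unstyled "passed" row is an assumption.
failed = '<tr style="color: red" align="left"><th></th><th>CODE CHECK</th><th>: NOT PASSED</th></tr>'
passed = '<tr align="left"><th></th><th>CODE CHECK</th><th>: PASSED</th></tr>'
print(verdict(failed), verdict(passed))
```

Matching on the style attribute is fragile if the generating software ever changes its colors, so combining it with the text check may be safer in practice.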
| From | Oscar Benjamin <oscar.j.benjamin@gmail.com> |
|---|---|
| Date | 2012-08-28 13:31 +0100 |
| Message-ID | <mailman.3897.1346157132.4697.python-list@python.org> |
| In reply to | #28016 |
On Tue, 28 Aug 2012 03:09:11 -0700 (PDT), mikcec82
<michele.cecere@gmail.com> wrote:
> f = open(fileorig, 'r')
> nomefile = f.read()
> for x in nomefile:
>     if 'XXXX' in nomefile:
>         print 'NOK'
>     else :
>         print 'OK'

You don't need the for loop. Just do:

nomefile = f.read()
if 'XXXX' in nomefile:
    print('NOK')

> **
> But this one works on charachters and not on strings (i.e.: in this way I
> have searched NOT string by string, but charachters-by-charachters).
Oscar
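A short sketch (Python 3) of why the loop is unnecessary: iterating over a string visits one character at a time, while the `in` test inside the loop already searches the whole string. The sample text is hypothetical.

```python
text = "DTC CODE Read: XXXX"

# Iterating over a string yields one character at a time, so a loop body
# that tests the whole string repeats the same answer len(text) times.
repeats = sum(1 for _ in text if "XXXX" in text)

# A single membership test searches for the whole substring in one step.
found = "XXXX" in text

print(repeats, found)
```

The original script therefore printed the same verdict once per character in the file; it did not actually compare character by character.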
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2012-08-28 17:38 +0200 |
| Message-ID | <mailman.3900.1346168282.4697.python-list@python.org> |
| In reply to | #27974 |
mikcec82 wrote:
> On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
>> Hallo,
>>
>>
>>
>> I have an html file on my pc and I want to read it to extract some text.
>>
>> Can you help on which libs I have to use and how can I do it?
>>
>>
>>
>> thank you so much.
>>
>>
>>
>> Michele
>
> Hi Oscar,
> I tried as you said and I've developed the code as you will see.
> But, when I have a situation like this in an html file, in which there is a
> repetition of a string (XX in this case):
> CODE Target: 0201
> CODE Read: XXXX
> CODE CHECK : NOT PASSED
> TEXT Target: 13
> TEXT Read: XX
> TEXT CHECK : NOT PASSED
> CHAR Target: AA
> CHAR Read: XX
> CHAR CHECK : NOT PASSED
>
> With this code (created starting from yours)
>
> index = nomefile.find('XXXX')
> print 'XXXX_ found at location', index
>
> index2 = nomefile.find('XX')
> print 'XX_ found at location', index2
>
> found = nomefile.find('XX')
> while found > -1:
>     print "XX found at location", found
>     found = nomefile.find('XX', found+1)
>
> I have an answer like this:
>
> XXXX_ found at location 51315
> XX_ found at location 51315
> XX found at location 51315
> XX found at location 51316
> XX found at location 51317
> XX found at location 52321
> XX found at location 53328
>
> I have done it to find all occurrences of 'XXXX' and 'XX' strings. But, as
> you can see, the script finds the occurrences of XX also at locations
> 51315, 51316, 51317 corresponding to string XXXX.
>
> Is there a way to search all occurrences of XX avoiding XXXX location?

Remove the false positives afterwards:

start = nomefile.find("XX")
while start != -1:
    if nomefile[start:start+4] == "XXXX":
        start += 4
    else:
        print "XX found at location", start
        start += 3
    start = nomefile.find("XX", start)

By the way, what do you want to do if there are runs of "X" with repeats
other than 2 or 4?
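An alternative sketch, not the method used in this post: a regular expression with lookarounds matches "XX" only when it is not part of a longer run, which also gives a defined answer for runs of other lengths (they are simply skipped). Written for Python 3; the name `find_exact_pairs` and the sample string are hypothetical.

```python
import re

# "XX" that is neither preceded nor followed by another "X".
EXACTLY_TWO = re.compile(r"(?<!X)XX(?!X)")


def find_exact_pairs(text):
    """Return the start index of every run of exactly two X characters."""
    return [match.start() for match in EXACTLY_TWO.finditer(text)]


sample = "XXXX XX XXX"
positions = find_exact_pairs(sample)
print(positions)
```

With this pattern, the runs of four and three X in the sample are ignored and only the isolated pair is reported.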
| From | mikcec82 <michele.cecere@gmail.com> |
|---|---|
| Date | 2012-08-29 03:22 -0700 |
| Message-ID | <c92e1fdc-ff14-4484-9e03-e322de9ba82b@googlegroups.com> |
| In reply to | #27974 |
On Monday, 27 August 2012 12:59:02 UTC+2, mikcec82 wrote:
> Hallo,
>
>
>
> I have an html file on my pc and I want to read it to extract some text.
>
> Can you help on which libs I have to use and how can I do it?
>
>
>
> thank you so much.
>
>
>
> Michele
Hi Peter, and thanks for your valuable help.
Fortunately, there aren't runs of "X" with repeats other than 2 or 4.
Starting from your code, I wrote this code (I'm posting it so that it may be helpful to other people):
f = open(fileorig, 'r')
nomefile = f.read()
start = nomefile.find("XX")
start2 = nomefile.find("NOT PASSED")
c0 = 0
c1 = 0
c2 = 0
while (start != -1) or (start2 != -1):
    if nomefile[start:start+4] == "XXXX":
        print "XXXX found at location", start
        start += 4
        c0 += 1
    elif nomefile[start:start+2] == "XX":
        print "XX found at location", start
        start += 2
        c1 += 1
    if nomefile[start2:start2+10] == "NOT PASSED":
        print "NOT PASSED found at location", start2
        start2 += 10
        c2 += 1
    start = nomefile.find("XX", start)
    start2 = nomefile.find("NOT PASSED", start2)
print "XXXX %s found" % c0, "\nXX %s found" % c1, "\nNOT PASSED %s found" % c2
Now, I'm able to find all occurrences of the strings "XXXX", "XX" and "NOT PASSED".
Thank you so much.
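A more compact Python 3 sketch of the same counting, followed by the OK/NOK decision from the earlier rules. It assumes, as stated above, that runs of X are only ever 2 or 4 long (other lengths are ignored). The thread names no Excel library, so the result is written as CSV with the standard `csv` module as a stand-in that Excel can open; `count_markers`, `report_row`, and `report.html` are hypothetical names.

```python
import csv
import io
import re


def count_markers(text):
    """Count runs of exactly four X, exactly two X, and "NOT PASSED"."""
    counts = {"XXXX": 0, "XX": 0, "NOT PASSED": text.count("NOT PASSED")}
    for run in re.findall(r"X+", text):
        if len(run) == 4:
            counts["XXXX"] += 1
        elif len(run) == 2:
            counts["XX"] += 1
    return counts


def report_row(name, text):
    counts = count_markers(text)
    # Rule from the thread: NOK if XXXX or NOT PASSED occurs, otherwise OK.
    status = "NOK" if counts["XXXX"] or counts["NOT PASSED"] else "OK"
    return [name, counts["XXXX"], counts["XX"], counts["NOT PASSED"], status]


sample = (
    "CODE Read: XXXX\nCODE CHECK : NOT PASSED\n"
    "TEXT Read: XX\nTEXT CHECK : NOT PASSED\n"
)

# Written to an in-memory buffer here; a real script would open a .csv file.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["file", "XXXX", "XX", "NOT PASSED", "status"])
writer.writerow(report_row("report.html", sample))
print(buffer.getvalue())
```

Scanning for runs of X once avoids the bookkeeping of two search positions, and single X characters inside words such as TEXT are ignored because their run length is 1.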
| From | Umesh Sharma <usharma01@gmail.com> |
|---|---|
| Date | 2012-08-29 05:00 -0700 |
| Message-ID | <1770147e-cb1f-4100-9220-f3f3f4e23f04@googlegroups.com> |
| In reply to | #27974 |
You can use the httplib library to download the HTML, and then, to extract the text from it, you can either use a library (search for one) or use regular expressions.