| Started by | "Golam Md. Shibly" <shiblydu60@yahoo.com> |
|---|---|
| First post | 2014-03-01 10:10 -0800 |
| Last post | 2014-06-27 10:36 -0700 |
| Articles | 2 — 2 participants |
How to extract contents of inner text of html tag? "Golam Md. Shibly" <shiblydu60@yahoo.com> - 2014-03-01 10:10 -0800
Re: How to extract contents of inner text of html tag? Jesse Adam <jaahush@gmail.com> - 2014-06-27 10:36 -0700
| From | "Golam Md. Shibly" <shiblydu60@yahoo.com> |
|---|---|
| Date | 2014-03-01 10:10 -0800 |
| Subject | How to extract contents of inner text of html tag? |
| Message-ID | <mailman.7534.1393704366.18130.python-list@python.org> |
Hi,
###in.txt
<kbd class="command">
cp -v --remove-destination /usr/share/zoneinfo/
<em class="replaceable"><code><xxx></code></em>
\
/etc/localtime
</kbd>
import sys
import unicodedata
from bs4 import BeautifulSoup

file_name="in.txt"
html_doc=open(file_name,'r')
soup=BeautifulSoup(html_doc)
#print soup.prettify().encode('utf-8')
#file_to_write.writelines( soup.prettify().encode() )
all_kbd=soup.find_all('kbd')
for line in all_kbd:
    if line.string == None:
        extract_code=line.code.extract().string
        #store_code=line.code.decompose()
        for inside_line in line:
            if "<<" not in inside_line and "EOF" not in inside_line:
                if len(inside_line)>0:
                    print inside_line
                    print extract_code
expected output:
cp -v --remove-destination /usr/share/zoneinfo/<xxx>\
/etc/localtime
Got output:
cp -v --remove-destination /usr/share/zoneinfo/
None
\
/etc/localtime
None
shibly
| From | Jesse Adam <jaahush@gmail.com> |
|---|---|
| Date | 2014-06-27 10:36 -0700 |
| Message-ID | <34303ae2-c719-407d-bf83-744ccf05c21c@googlegroups.com> |
| In reply to | #67346 |
I don't have BeautifulSoup installed, so I am unable to tell:

a) for line in all_kbd: Does this process one line of the input at a time, or do you get the clean text as single lines in a list, as shown in the example in the doc? http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree

b) for inside_line in line: Does this process one token at a time?

In any case, it looks like you got "None" in the output because you assume that every single line contains <code> and </code> tags. This may not be the case all the time, so before printing extract_code you could check whether it is None.
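For anyone else reading without BeautifulSoup installed, here is a minimal sketch of the same extraction using only the standard library's html.parser module (Python 3). It is not the code from either post and swaps in a different technique. The names KbdTextExtractor, extract_kbd_text and KNOWN_TAGS are hypothetical. It assumes that any tag inside <kbd> other than kbd, em and code (for example a literal <xxx>) is placeholder text to keep, and that stripping the whitespace around each text fragment is acceptable.

```python
from html.parser import HTMLParser

# Assumption: only these tags are real markup inside <kbd>. Any other tag,
# such as a literal <xxx> placeholder, is kept as text.
KNOWN_TAGS = {"kbd", "em", "code"}


class KbdTextExtractor(HTMLParser):
    """Collect the inner text of every <kbd> element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # nesting level of open <kbd> tags
        self.pieces = []    # text fragments of the <kbd> being read
        self.commands = []  # one joined string per finished <kbd>

    def handle_starttag(self, tag, attrs):
        if tag == "kbd":
            self.depth += 1
        elif self.depth and tag not in KNOWN_TAGS:
            # Unknown tag inside <kbd>: re-emit it as literal text.
            self.pieces.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag == "kbd" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.commands.append("".join(self.pieces))
                self.pieces = []

    def handle_data(self, data):
        if self.depth:
            # Drop the line breaks around each fragment so that the
            # placeholder joins the text before and after it.
            self.pieces.append(data.strip())


def extract_kbd_text(html):
    """Return a list with the inner text of each <kbd> element in html."""
    parser = KbdTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.commands


sample = """<kbd class="command">
cp -v --remove-destination /usr/share/zoneinfo/
<em class="replaceable"><code><xxx></code></em>
\\
/etc/localtime
</kbd>"""

for command in extract_kbd_text(sample):
    print(command)
```

On the in.txt sample from the question this prints the two lines given as the expected output. The same result comes out if the placeholder is written as &lt;xxx&gt; in the file, because the parser then delivers it as ordinary text.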