How to extract contents of inner text of html tag?

Started by	"Golam Md. Shibly" <shiblydu60@yahoo.com>
First post	2014-03-01 10:10 -0800
Last post	2014-06-27 10:36 -0700
Articles	2 — 2 participants

  How to extract contents of inner text of html tag? "Golam Md. Shibly" <shiblydu60@yahoo.com> - 2014-03-01 10:10 -0800
    Re: How to extract contents of inner text of html tag? Jesse Adam <jaahush@gmail.com> - 2014-06-27 10:36 -0700

#67346 — How to extract contents of inner text of html tag?

From	"Golam Md. Shibly" <shiblydu60@yahoo.com>
Date	2014-03-01 10:10 -0800
Subject	How to extract contents of inner text of html tag?
Message-ID	<mailman.7534.1393704366.18130.python-list@python.org>

Hi,

###in.txt
<kbd class="command">
    cp -v --remove-destination /usr/share/zoneinfo/
    <em class="replaceable"><code><xxx></code></em>
       \
    /etc/localtime
</kbd>

import sys
import unicodedata
from bs4 import BeautifulSoup

file_name="in.txt"
html_doc=open(file_name,'r')
soup=BeautifulSoup(html_doc)
#print soup.prettify().encode('utf-8')
#file_to_write.writelines( soup.prettify().encode() )

all_kbd=soup.find_all('kbd')

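for line in all_kbd:
	# .string is None when the tag holds more than a single text child
	if line.string is None:
		# remove the <code> tag from the tree and keep its .string
		extract_code=line.code.extract().string
		#store_code=line.code.decompose()
		# iterating over a tag walks its direct children
		for inside_line in line:
			if "<<" not in inside_line and "EOF" not in inside_line:
				if len(inside_line)>0:
					print inside_line
					print extract_code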

expected output:
    cp -v --remove-destination /usr/share/zoneinfo/<xxx>\      
    /etc/localtime


Got output:
    cp -v --remove-destination /usr/share/zoneinfo/
    
None

       \
    /etc/localtime

None 

shibly

#73666

From	Jesse Adam <jaahush@gmail.com>
Date	2014-06-27 10:36 -0700
Message-ID	<34303ae2-c719-407d-bf83-744ccf05c21c@googlegroups.com>
In reply to	#67346

I don't have BeautifulSoup installed, so I can't tell:

a) for line in all_kbd:
  Does this process the input one line at a time, or do you get the clean
  text as single lines in a list, as shown in the example in the docs?
  http://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree


b) for inside_line in line:
  Does this process one token at a time?

In any case, it looks like the reason you got "None" in the output is
that you assume every single line contains <code> and </code> tags.
That may not always be the case, so you could check whether extract_code
is None before printing it.
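
For comparison, here is a rough sketch that uses only the standard library's html.parser instead of BeautifulSoup. It rests on one assumption: the unescaped <xxx> in in.txt was meant as literal placeholder text, not as a real tag, so any tag other than kbd, em and code is copied back into the output. The names KNOWN_TAGS, KbdTextCollector and extract_kbd_text are made up for this illustration, and this is not necessarily how BeautifulSoup treats the same input.

```python
from html.parser import HTMLParser

# Tags treated as real markup. Anything else (such as the "<xxx>"
# placeholder) is assumed to be text that was not HTML-escaped.
KNOWN_TAGS = {"kbd", "em", "code"}


class KbdTextCollector(HTMLParser):
    """Collect the text found inside each <kbd> element."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # greater than 0 while inside a <kbd> element
        self.pieces = []     # text fragments of the current <kbd>
        self.commands = []   # one finished string per <kbd>

    def handle_starttag(self, tag, attrs):
        if tag == "kbd":
            self.depth += 1
        elif self.depth and tag not in KNOWN_TAGS:
            # Keep the unescaped placeholder as literal text.
            self.pieces.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "kbd" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.commands.append("".join(self.pieces))
                self.pieces = []

    def handle_data(self, data):
        if self.depth:
            # Drop the indentation and newlines around each fragment.
            self.pieces.append(data.strip())


def extract_kbd_text(html):
    """Return a list with the collected text of every <kbd> element."""
    parser = KbdTextCollector()
    parser.feed(html)
    parser.close()
    return parser.commands


# Same content as in.txt from the original post.
SAMPLE = """<kbd class="command">
    cp -v --remove-destination /usr/share/zoneinfo/
    <em class="replaceable"><code><xxx></code></em>
       \\
    /etc/localtime
</kbd>
"""

for command in extract_kbd_text(SAMPLE):
    print(command)
```

Each fragment is stripped before joining, so the placeholder ends up directly after zoneinfo/ and directly before the backslash, while the line break after the backslash is kept. If the real input escapes the placeholder as &lt;xxx&gt;, the KNOWN_TAGS branch is simply never taken.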
