| Started by | oyster <lepto.python@gmail.com> |
|---|---|
| First post | 2014-04-26 23:53 +0800 |
| Last post | 2014-04-27 02:49 +0000 |
| Articles | 2 — 2 participants |
Re: how to split this kind of text into sections oyster <lepto.python@gmail.com> - 2014-04-26 23:53 +0800
Re: how to split this kind of text into sections Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-04-27 02:49 +0000
| From | oyster <lepto.python@gmail.com> |
|---|---|
| Date | 2014-04-26 23:53 +0800 |
| Subject | Re: how to split this kind of text into sections |
| Message-ID | <mailman.9520.1398527597.18130.python-list@python.org> |
First of all, thank you all for your answers. I received python mail-list in a daily digest, so it is not easy for me to quote your mail separately.

I will try to explain my situation to my best, but English is not my native language, I don't know whether I can make it clear at last.

Every SECTION starts with 2 special lines; these 2 lines is special because they have some same characters (the length is not const for different section) at the beginning; these same characters is called the KEY for this section. For every 2 neighbor sections, they have different KEYs.

After these 2 special lines, some paragraph is followed. Paragraph does not have any KEYs.

So, a section = 2 special lines with KEYs at the beginning + some paragraph without KEYs

However there maybe some paragraph before the first section, which I do not need and want to drop it

I need a method to split the whole text into SECTIONs and to know all the KEYs

I have tried to solve this problem via re module, but failed. Maybe I can make you understand me clearly by showing the regular expression object

    reobj = re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=bookname)[^\r\n]*?\r\n.*?", re.DOTALL)

which can get the first 2 lines of a section, but fail to get the rest of a section which does not have any KEYs at the begin. The hard part for me is to express "paragraph does not have KEYs".

Even I can get the first 2 line, I think regular expression is expensive for my text.

That is all. I hope get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...

let's continue to
let's continue, yeah
.....(and here goes many other text)...

I am using python
I am using perl
.....(and here goes many other text)...

Programming is hard
Programming is easy
How do you thing?
I do’t know
[demo text ends]

the above text should be splited to a LIST with 4 items, and I also need to know the KEY for LIST is ['I am section ', 'let's continue', 'I am using ', ' Programming is ']:

lst=[
'''a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...
''',
'''let's continue to
let's continue, yeah
.....(and here goes many other text)...
''',
'''I am using python
I am using perl
.....(and here goes many other text)...
''',
'''Programming is hard
Programming is easy
How do you thing?
I do’t know'''
]
| From | Steven D'Aprano <steve+comp.lang.python@pearwood.info> |
|---|---|
| Date | 2014-04-27 02:49 +0000 |
| Message-ID | <535c703a$0$29965$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #70634 |
On Sat, 26 Apr 2014 23:53:14 +0800, oyster wrote:
> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called the
> KEY for this section. For every 2 neighbor sections, they have different
> KEYs.
>
> After these 2 special lines, some paragraph is followed. Paragraph does
> not have any KEYs.
>
> So, a section = 2 special lines with KEYs at the beginning + some
> paragraph without KEYs
>
> However there maybe some paragraph before the first section, which I do
> not need and want to drop it
>
> I need a method to split the whole text into SECTIONs and to know all
> the KEYs
Let me try to describe how I would solve this, in English.
I would look at each pair of lines (1st + 2nd, 2nd + 3rd, 3rd + 4th,
etc.) looking for a pair of lines with matching prefixes. E.g.:
    "This line matches the next"
    "This line matches the previous"
do match, because they both start with "This line matches the ".
Question: how many characters in common counts as a match?
    "This line matches the next"
    "That previous line matches this line"
have a common prefix of "Th", two characters. Is that a match?
So let me start with a function to extract the matching prefix, if there
is one. It returns '' if there is no match, and the prefix (the KEY) if
there is one:
def extract_key(line1, line2):
    """Return the key from two matching lines, or '' if not matching."""
    # Assume they need five characters in common.
    if line1[:5] == line2[:5]:
        return line1[:5]
    return ''
I'm pretty much guessing that this is how you decide there's a match. I
don't know if five characters is too many or too few, or if you need a
more complicated test. It seems that you want to match as many characters
as possible. I'll leave you to adjust this function to work exactly as
needed.
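One possible adjustment, as a sketch only: if the KEY is meant to be the longest run of characters the two lines share, the standard library's os.path.commonprefix (which compares strings character by character) can compute it. The min_length parameter and its default of 5 are assumptions carried over from the guess above, not something the original question specifies.

```python
import os.path


def extract_key(line1, line2, min_length=5):
    """Return the longest common prefix of two lines, or '' if it is too short.

    min_length is an assumed threshold; adjust it to suit the real data.
    """
    # commonprefix works character by character on any list of strings.
    prefix = os.path.commonprefix([line1, line2])
    if len(prefix) >= min_length:
        return prefix
    return ''
```

On the demo text this gives 'I am section ' and "let's continue", but for the pair 'I am using python' / 'I am using perl' it returns 'I am using p', because both words happen to start with 'p'. If the key should stop at a word boundary, the prefix would need trimming afterwards.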
Now we iterate over the text in pairs of lines. We need somewhere to hold
the lines in each section, so I'm going to use a dict of lists of
lines. As a bonus, I'm going to collect the ignored lines using a key of
None. However, I do assume that all keys are unique. It should be easy
enough to adjust the following to handle non-unique keys. (Use a list of
lists, rather than a dict, and save the keys in a separate list.)
Lastly, the way it handles lines at the beginning of a section is not
exactly the way you want it. This puts the *first* line of the section as
the *last* line of the previous section. I will leave you to sort out
that problem.
from collections import OrderedDict

# `text` is assumed to hold the whole input as a single string.
section = []
sections = OrderedDict()
sections[None] = section
lines = iter(text.split('\n'))
prev_line = ''
for line in lines:
    key = extract_key(prev_line, line)
    if key == '':
        # No match, so we're still in the same section as before.
        section.append(line)
    else:
        # Match, so we start a new section.
        section = [line]
        sections[key] = section
    prev_line = line
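A sketch of one way to sort out both loose ends mentioned above (the first line of a section landing in the previous section, and non-unique keys). It assumes a section begins exactly where two consecutive lines match, and that anything before the first match can be dropped. The name split_sections is hypothetical; extract_key is the five-character version from earlier. Instead of comparing each line with the previous one, it looks ahead at the next line, and it keeps keys and sections in two parallel lists rather than a dict.

```python
def extract_key(line1, line2):
    """Return the key from two matching lines, or '' if not matching."""
    # Same five-character guess as before.
    if line1[:5] == line2[:5]:
        return line1[:5]
    return ''


def split_sections(text):
    """Return (keys, sections); sections is a list of lists of lines."""
    lines = text.split('\n')
    keys = []
    sections = []
    current = None  # stays None until the first section starts
    i = 0
    while i < len(lines):
        # Look ahead: does this line match the one after it?
        if i + 1 < len(lines):
            key = extract_key(lines[i], lines[i + 1])
        else:
            key = ''
        if key:
            # Both special lines open the new section.
            current = [lines[i], lines[i + 1]]
            keys.append(key)
            sections.append(current)
            i += 2
        else:
            # Ordinary line; lines before the first section are dropped.
            if current is not None:
                current.append(lines[i])
            i += 1
    return keys, sections
```

Calling keys, sections = split_sections(text) on the demo text should give four sections. With the five-character test, the key 'I am ' turns up twice, which is the situation where a dict would silently overwrite an earlier section and the parallel lists do not.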
--
Steven D'Aprano
http://import-that.dreamwidth.org/