
Re: how to split this kind of text into sections

Started by	oyster <lepto.python@gmail.com>
First post	2014-04-26 23:53 +0800
Last post	2014-04-27 02:49 +0000
Articles	2 — 2 participants


  Re: how to split this kind of text into sections oyster <lepto.python@gmail.com> - 2014-04-26 23:53 +0800
    Re: how to split this kind of text into sections Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2014-04-27 02:49 +0000

#70634 — Re: how to split this kind of text into sections

From	oyster <lepto.python@gmail.com>
Date	2014-04-26 23:53 +0800
Subject	Re: how to split this kind of text into sections
Message-ID	<mailman.9520.1398527597.18130.python-list@python.org>

First of all, thank you all for your answers. I receive the Python
mailing list as a daily digest, so it is not easy for me to quote your
mails separately.

I will try to explain my situation as best I can, but English is not my
native language, so I don't know whether I can make it clear.

Every SECTION starts with 2 special lines. These 2 lines are special
because they begin with the same characters (the length is not constant
across sections); these shared characters are called the KEY for the
section. Any 2 neighboring sections have different KEYs.

After these 2 special lines come some paragraphs. The paragraphs
do not contain any KEYs.

So, a section = 2 special lines with the KEY at the beginning + some
paragraphs without KEYs

However, there may be some paragraphs before the first section, which I
do not need and want to drop.

I need a method to split the whole text into SECTIONs and to know all the KEYs

I have tried to solve this problem with the re module, but failed. Maybe I
can make myself clearer by showing the regular expression object
reobj = re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=bookname)[^\r\n]*?\r\n.*?",
re.DOTALL)
which can get the first 2 lines of a section, but fails to get the rest
of the section, which does not have any KEYs at the beginning. The hard
part for me is to express "a paragraph does not have KEYs".

Even if I can get the first 2 lines, I think a regular expression is
expensive for my text.

That is all. I hope to get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...

let's continue to
let's continue, yeah
.....(and here goes many other text)...

I am using python
I am using perl
.....(and here goes many other text)...

Programming is hard
Programming is easy
How do you thing?
I do’t know
[demo text ends]

The above text should be split into a LIST with 4 items, and I also
need to know the KEYs for the LIST, which are ['I am section ',
"let's continue", 'I am using ', 'Programming is ']:
lst=[
'''a line we do not need
I am section axax
I am section bbb
(and here goes many other text)... ''',

'''let's continue to
let's continue, yeah
.....(and here goes many other text)... ''',

'''I am using python
I am using perl
.....(and here goes many other text)... ''',

'''Programming is hard
Programming is easy
How do you thing?
I do’t know'''
]


#70644

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2014-04-27 02:49 +0000
Message-ID	<535c703a$0$29965$c3e8da3$5496439d@news.astraweb.com>
In reply to	#70634

On Sat, 26 Apr 2014 23:53:14 +0800, oyster wrote:

> Every SECTION starts with 2 special lines. These 2 lines are special
> because they begin with the same characters (the length is not constant
> across sections); these shared characters are called the KEY for the
> section. Any 2 neighboring sections have different KEYs.
> 
> After these 2 special lines come some paragraphs. The paragraphs do
> not contain any KEYs.
> 
> So, a section = 2 special lines with the KEY at the beginning + some
> paragraphs without KEYs
> 
> However, there may be some paragraphs before the first section, which I
> do not need and want to drop.
> 
> I need a method to split the whole text into SECTIONs and to know all
> the KEYs

Let me try to describe how I would solve this, in English.

I would look at each pair of lines (1st + 2nd, 2nd + 3rd, 3rd + 4th, 
etc.) looking for a pair of lines with matching prefixes. E.g.:

"This line matches the next"
"This line matches the previous"

do match, because they both start with "This line matches the ".
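
The pairwise walk can be sketched roughly like this, assuming the text has already been split into a list of lines (line_pairs is just a name made up for the illustration):

```python
def line_pairs(lines):
    """Return an iterator over (line, next_line) for adjacent lines."""
    # zip stops at the shorter argument, so the last line is only
    # ever seen as the second element of a pair.
    return zip(lines, lines[1:])

demo = ["This line matches the next",
        "This line matches the previous",
        "Something else entirely"]
pairs = list(line_pairs(demo))
```

Each element of pairs is a tuple of two neighbouring lines, which is the shape the comparison below works on.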

Question: how many characters in common counts as a match?

"This line matches the next"
"That previous line matches this line"

have a common prefix of "Th", two characters. Is that a match?

So let me start with a function to extract the matching prefix, if there 
is one. It returns '' if there is no match, and the prefix (the KEY) if 
there is one:

def extract_key(line1, line2):
    """Return the key from two matching lines, or '' if not matching."""
    # Assume they need five characters in common.
    if line1[:5] == line2[:5]:
        return line1[:5]
    return ''

I'm pretty much guessing that this is how you decide there's a match. I 
don't know if five characters is too many or too few, or if you need a 
more complicated test. It seems that you want to match as many characters 
as possible. I'll leave you to adjust this function to work exactly as 
needed.
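
If "as many characters as possible" really is the rule, one possible adjustment is to take the longest common prefix and reject it when it is too short. This is only a sketch: the minimum length of five is still a guess, and MIN_KEY_LENGTH is a name invented here. os.path.commonprefix compares character by character, so it works on ordinary strings as well as paths.

```python
import os.path

MIN_KEY_LENGTH = 5  # assumed threshold; adjust to suit the real data

def extract_key(line1, line2, min_length=MIN_KEY_LENGTH):
    """Return the longest common prefix of two lines, or '' if too short."""
    prefix = os.path.commonprefix([line1, line2])
    if len(prefix) >= min_length:
        return prefix
    return ''
```

Note that the longest prefix can overshoot the key you have in mind: "I am using python" and "I am using perl" share "I am using p", not "I am using ". Trimming back to the last space might be closer to what you want, but that depends on your data.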

Now we iterate over the text in pairs of lines. We need somewhere to hold 
the lines in each section, so I'm going to use a dict of lists of 
lines. As a bonus, I'm going to collect the ignored lines using a key of 
None. However, I do assume that all keys are unique. It should be easy 
enough to adjust the following to handle non-unique keys. (Use a list of 
lists, rather than a dict, and save the keys in a separate list.)

Lastly, the way it handles lines at the beginning of a section is not 
exactly the way you want it. This puts the *first* line of the section as 
the *last* line of the previous section. I will leave you to sort out 
that problem.

from collections import OrderedDict
section = []
sections = OrderedDict()
sections[None] = section
lines = iter(text.split('\n'))
prev_line = ''
for line in lines:
    key = extract_key(prev_line, line)
    if key == '':
        # No match, so we're still in the same section as before.
        section.append(line)
    else:
        # Match, so we start a new section.
        section = [line]
        sections[key] = section
    prev_line = line

-- 
Steven D'Aprano
http://import-that.dreamwidth.org/
