Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed3a.news.xs4all.nl!xs4all!post.news.xs4all.nl!not-for-mail
MIME-Version: 1.0
Date: Sat, 26 Apr 2014 23:53:14 +0800
Subject: Re: how to split this kind of text into sections
From: oyster <lepto.python@gmail.com>
To: python-list@python.org
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.9520.1398527597.18130.python-list@python.org>
Lines: 82
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:70634

First of all, thank you all for your answers. I received python
mail-list in a daily digest, so it is not easy for me to quote your
mail separately.

I will try to explain my situation to my best, but English is not my
native language, I don't know whether I can make it clear at last.

Every SECTION starts with 2 special lines; these 2 lines is special
because they have some same characters (the length is not const for
different section) at the beginning; these same characters is called
the KEY for this section. For every 2 neighbor sections, they have
different KEYs.

After these 2 special lines, some paragraph is followed. Paragraph
does not have any KEYs.

So, a section =3D 2 special lines with KEYs at the beginning + some
paragraph without KEYs

However there maybe some paragraph before the first section, which I
do not need and want to drop it

I need a method to split the whole text into SECTIONs and to know all the K=
EYs

I have tried to solve this problem via re module, but failed. Maybe I
can make you understand me clearly by showing the regular expression
object
reobj =3D re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=3Dbookname)[=
^\r\n]*?\r\n.*?",
re.DOTALL)
which can get the first 2 lines of a section, but fail to get the rest
of a section which does not have any KEYs at the begin. The hard part
for me is to express "paragraph does not have KEYs".

Even I can get the first 2 line, I think regular expression is
expensive for my text.

That is all. I hope get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...

let's continue to
let's continue, yeah
.....(and here goes many other text)...

I am using python
I am using perl
.....(and here goes many other text)...

Programming is hard
Programming is easy
How do you thing?
I do=E2=80=99t know
[demo text ends]

the above text should be splited to a LIST with 4 items, and I also
need to know the KEY for LIST is ['I am section ', 'let's continue',
'I am using ', ' Programming is ']:
lst=3D[
'''a line we do not need
I am section axax
I am section bbb
(and here goes many other text)... ''',

'''let's continue to
let's continue, yeah
.....(and here goes many other text)... ''',

'''I am using python
I am using perl
.....(and here goes many other text)... ''',

'''Programming is hard
Programming is easy
How do you thing?
I do=E2=80=99t know'''
]