Re: how to split this kind of text into sections

Date 2014-04-26 23:53 +0800
Subject Re: how to split this kind of text into sections
From oyster <lepto.python@gmail.com>
Newsgroups comp.lang.python
Message-ID <mailman.9520.1398527597.18130.python-list@python.org>


First of all, thank you all for your answers. I receive the Python
mailing list as a daily digest, so it is not easy for me to quote your
mails separately.

I will try my best to explain my situation, but English is not my
native language, so I don't know whether I can make it clear.

Every SECTION starts with 2 special lines. These 2 lines are special
because they begin with the same characters (the length is not constant
across sections); these shared characters are called the KEY for this
section. Any 2 neighboring sections have different KEYs.

After these 2 special lines come some paragraphs. The paragraphs
do not contain any KEYs.

So, a section = 2 special lines with the KEY at the beginning + some
paragraphs without KEYs

However, there may be some paragraphs before the first section, which I
do not need and want to drop.

I need a method to split the whole text into SECTIONs and to find all the KEYs.

I have tried to solve this problem with the re module, but failed. Maybe I
can make myself clearer by showing the regular expression
object
reobj = re.compile(r"(?P<bookname>[^\r\n]*?)[^\r\n]*?\r\n(?P=bookname)[^\r\n]*?\r\n.*?",
re.DOTALL)
which can get the first 2 lines of a section, but fails to get the rest
of the section, whose lines do not have any KEYs at the beginning. The
hard part for me is to express "the paragraphs do not have KEYs".
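One possible way around the hard part, sketched here under stated assumptions: the pattern only has to recognise the 2 special lines, and the text between one match and the next is taken as the rest of the section, so "paragraph without KEYs" never has to be expressed in the regular expression. The names HEAD and split_by_heads are hypothetical, and the minimum KEY length of 4 characters is an assumption that is not part of the description above.

```python
import re

# Two consecutive lines that begin with the same characters.
# ASSUMPTION: a KEY is at least 4 characters long (hypothetical threshold).
# The greedy group backtracks from the whole line, so it ends up holding
# the longest prefix that the next line also starts with.
HEAD = re.compile(r"^(?P<key>.{4,}).*\r?\n(?P=key).*$", re.MULTILINE)


def split_by_heads(text):
    """Return (keys, sections); text before the first section is dropped."""
    heads = list(HEAD.finditer(text))
    keys = [m.group("key") for m in heads]
    # A section runs from the start of its 2 special lines to the start
    # of the next pair of special lines (or to the end of the text).
    starts = [m.start() for m in heads] + [len(text)]
    sections = [text[a:b].rstrip() for a, b in zip(starts, starts[1:])]
    return keys, sections
```

Note that this sketch reports the longest shared prefix as the KEY, so 2 lines such as "I am using python" and "I am using perl" give 'I am using p' rather than 'I am using '. It also does not check that neighboring sections have different KEYs.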

Even if I can get the first 2 lines, I think a regular expression is
expensive for my text.

That is all. I hope to get some more suggestions. Thanks.

[demo text starts]
a line we do not need
I am section axax
I am section bbb
(and here goes many other text)...

let's continue to
let's continue, yeah
.....(and here goes many other text)...

I am using python
I am using perl
.....(and here goes many other text)...

Programming is hard
Programming is easy
How do you thing?
I do’t know
[demo text ends]

The above text should be split into a LIST with 4 items, and I also
need to know that the KEYs for the LIST are ['I am section ',
"let's continue", 'I am using ', 'Programming is ']:
lst=[
'''I am section axax
I am section bbb
(and here goes many other text)... ''',

'''let's continue to
let's continue, yeah
.....(and here goes many other text)... ''',

'''I am using python
I am using perl
.....(and here goes many other text)... ''',

'''Programming is hard
Programming is easy
How do you thing?
I do’t know'''
]
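A minimal sketch of a line-by-line alternative that avoids regular expressions, using os.path.commonprefix from the standard library to compare neighboring lines. The function names shared_key and split_sections, the min_len threshold, and the rule that trims a KEY back to a word boundary are all assumptions made for illustration; they are not stated in the description above. Following the description, text before the first section is dropped.

```python
import os.path


def shared_key(a, b, min_len=4):
    """Common prefix of lines a and b, or '' if it is too short.

    ASSUMPTION: a prefix that stops in the middle of a word is trimmed
    back to the previous non-alphanumeric character, and a KEY needs at
    least min_len non-blank characters (hypothetical threshold).
    """
    prefix = os.path.commonprefix([a, b])

    def mid_word(line):
        return len(line) > len(prefix) and line[len(prefix)].isalnum()

    if prefix and prefix[-1].isalnum() and (mid_word(a) or mid_word(b)):
        while prefix and prefix[-1].isalnum():
            prefix = prefix[:-1]
    return prefix if len(prefix.strip()) >= min_len else ""


def split_sections(text, min_len=4):
    """Return (keys, sections) in a single pass over the lines."""
    lines = text.splitlines()
    keys, sections = [], []
    i = 0
    while i < len(lines):
        key = shared_key(lines[i], lines[i + 1], min_len) if i + 1 < len(lines) else ""
        if key and (not keys or key != keys[-1]):
            # Two special lines with a new KEY: start a new section.
            keys.append(key)
            sections.append([lines[i], lines[i + 1]])
            i += 2
        else:
            # Ordinary paragraph line; dropped if no section has started yet.
            if sections:
                sections[-1].append(lines[i])
            i += 1
    return keys, ["\n".join(s).rstrip() for s in sections]
```

With the word-boundary assumption, the demo text gives the 4 KEYs 'I am section ', "let's continue", 'I am using ' and 'Programming is '. Two neighboring paragraph lines that happen to begin with the same characters would still be mistaken for a section start, so min_len may need tuning for real text.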


