Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!rt.uk.eu.org!newsfeed.xs4all.nl!newsfeed1.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail
Date: Sat, 26 Apr 2014 11:59:56 -0500
From: Tim Chase <python.list@tim.thechases.com>
To: oyster <lepto.python@gmail.com>
Subject: Re: how to split this kind of text into sections
In-Reply-To: <CACW-qXVFHDGxRAemjtV9mm-V13U_iH5gWPCAZwoYdNT+ycUSUg@mail.gmail.com>
References: <CACW-qXVFHDGxRAemjtV9mm-V13U_iH5gWPCAZwoYdNT+ycUSUg@mail.gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Cc: python-list@python.org
Precedence: list
Newsgroups: comp.lang.python
Message-ID: <mailman.9523.1398531613.18130.python-list@python.org>
Lines: 71
NNTP-Posting-Host: 2001:888:2000:d::a6
Xref: csiph.com comp.lang.python:70637

On 2014-04-26 23:53, oyster wrote:
> I will try to explain my situation to my best, but English is not my
> native language, I don't know whether I can make it clear at last.

Your follow-up reply made much more sense and your written English is
far better than many native speakers'. :-)

> Every SECTION starts with 2 special lines; these 2 lines is special
> because they have some same characters (the length is not const for
> different section) at the beginning; these same characters is called
> the KEY for this section. For every 2 neighbor sections, they have
> different KEYs.

I suspect you have a minimum number of characters (or words) to
consider; otherwise a single character duplicated at the beginning of
two consecutive lines would delimit a section, such as

 abcd
 afgh

because they both start with an "a".  The code I provided earlier
should give you what you describe.  I've tweaked and tested it, and
included it below.  Note that I require a minimum overlap of 6
characters (MIN_LEN).  It also gathers the initial stuff (that you
want to discard) under the empty key, so you can either delete that
entry or ignore it.
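
To make the MIN_LEN point concrete, here is a minimal sketch.  It uses
os.path.commonprefix from the standard library as a stand-in for a
hand-written overlap function; the helper name starts_section and the
"Chapter 1" strings are made up for illustration and are not from your
data.

```python
import os.path

MIN_LEN = 6  # minimum shared prefix for two lines to count as a section start

def starts_section(prev, cur, min_len=MIN_LEN):
    """Hypothetical helper: return the shared prefix of two lines if it
    is at least min_len characters long, otherwise the empty string."""
    common = os.path.commonprefix([prev, cur])
    return common if len(common) >= min_len else ""

# Only a single "a" is shared, so this pair does not start a section.
print(repr(starts_section("abcd", "afgh")))
# Eleven characters are shared, so this pair does.
print(repr(starts_section("Chapter 1: intro", "Chapter 1: scope")))
```

Lowering min_len to 1 would make the abcd/afgh pair count as a section
start, which is why some threshold is needed.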

> I need a method to split the whole text into SECTIONs and to know
> all the KEYs
> 
> I have tried to solve this problem via re module

I don't think the re module will be much help here.

-tkc


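from collections import defaultdict
import itertools as it

MIN_LEN = 6  # minimum shared prefix for two lines to start a section

def overlap(s1, s2):
    "Given 2 strings, return the initial overlap between them"
    return ''.join(
        c1
        for c1, c2
        in it.takewhile(
            lambda pair: pair[0] == pair[1],
            zip(s1, s2)  # zip works on both Python 2 and 3
            )
        )

prevline = ""
key = ""  # the initial key under which preamble gets stored
output = defaultdict(list)
with open("data.txt") as f:
    for line in f:
        if len(line) >= MIN_LEN and prevline[:MIN_LEN] == line[:MIN_LEN]:
            # prevline was the first of the 2 special lines, but it was
            # filed under the old key; move it to the new section
            output[key].pop()
            key = overlap(prevline, line)
            output[key].append(prevline)
        output[key].append(line)
        prevline = line

# print each section under a centered header showing its key
for k, v in output.items():
    print(str(k).center(60, '='))
    print(''.join(v))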
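
If the order of the sections matters, or the same key might turn up
again later in the file, a list of (key, lines) pairs may suit better
than a dict.  The following is only a sketch of that variation, not a
drop-in replacement: split_sections and the sample lines are invented
for illustration, and os.path.commonprefix stands in for overlap().

```python
import os.path

def split_sections(lines, min_len=6):
    """Return [(key, [line, ...]), ...] in input order.

    Anything before the first section is stored under the key ''.
    """
    sections = [("", [])]
    prev = ""
    for line in lines:
        common = os.path.commonprefix([prev, line])
        if len(common) >= min_len and common != sections[-1][0]:
            # prev was really the first line of the new section,
            # so move it out of the section it was filed under
            moved = sections[-1][1].pop()
            sections.append((common, [moved]))
        sections[-1][1].append(line)
        prev = line
    return sections

# Made-up sample data: a preamble followed by two sections.
sample = [
    "some preamble",
    "apple: first header",
    "apple: second header",
    "body text one",
    "banana: first header",
    "banana: second header",
    "body text two",
]
for key, body in split_sections(sample):
    print(repr(key), body)
```

The keys are then just [key for key, body in split_sections(sample)],
and the first entry (the '' key) can be dropped if the preamble is not
wanted.  Like the dict version, this assumes body lines do not share
min_len leading characters with the line before them.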