Date: Wed, 25 Sep 2013 22:53:52 +0100
From: MRAB <python@mrabarnett.plus.com>
To: python-list@python.org
Subject: Re: Extracting lines from text files - script with a couple of 'side effects'
References: <cca8071a-c33e-47d6-b4dc-83dba2c28b74@googlegroups.com>
In-Reply-To: <cca8071a-c33e-47d6-b4dc-83dba2c28b74@googlegroups.com>
Newsgroups: comp.lang.python
Message-ID: <mailman.324.1380146037.18130.python-list@python.org>

On 25/09/2013 21:06, mstagliamonte wrote:
> Dear All,
>
> Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
> Fasta file:
>>header1|info1:info2_info3
> general text
>>header2|info1:info2_info3
> general text
>
> headers file:
> header1|info1:info2_info3
> header2|info1:info2_info3
>
> I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
>
> I have done a script which seems to work, but with a couple of 'side effects'
> Here is my script:
> -------------------------------------------------------------------
> import re
> class Extractor():
>
>      def __init__(self,headers_file, fasta_file,output_file):
>          with open(headers_file,'r') as inp0:
>              counter0=0
>              container=''
>              inp0_bis=inp0.read().split('\n')
>              for x in inp0_bis:
>                  container+=x.replace(':','_').replace('|','_')
>              with open(fasta_file,'r') as inp1:
>                  inp1_bis=inp1.read().split('>')
>                  for i in inp1_bis:
>                      i_bis= i.split('\n')
>                      match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
>                      if match:
>                          counter0+=1
>                          with open(output_file,'at') as out0:
>                              out0.write('>'+i)
>              print '{} sequences were found'.format(counter0)
>
> -------------------------------------------------------------------
> Side effects:
> 1) The very first header is written as >>header1 rather than >header1
> 2) the number of sequences found is 1 more than the ones actually found!
>
> Have you got any thoughts about causes/solutions?
>
> Thanks for your time!
>
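Both side effects look like they share one cause. Splitting the fasta text on '>' gives an empty string as the first element when the file starts with '>', and an empty pattern matches any string in re.search. A small sketch with made-up sample data (the strings below are stand-ins, not your real files):

```python
import re

# Hypothetical sample data standing in for the real files.
fasta_text = (">header1|info1:info2_info3\ngeneral text\n"
              ">header2|info1:info2_info3\ngeneral text\n")
container = "header1_info1_info2_info3"

records = fasta_text.split('>')
first = records[0]  # whatever precedes the first '>', here an empty string

# An empty pattern matches at position 0 of any string.
empty_matches = re.search(first, container) is not None

print(repr(first))
print(empty_matches)
print(len(records))
```

The empty element counts as a match, so the counter ends up one too high, and '>' + '' writes a lone '>' just before the first real header. Note also that re.search treats the header text as a regular expression and accepts partial matches, which may give false positives.
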
Here's my version:

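class Extractor():
     def __init__(self, headers_file, fasta_file, output_file):
         # Build the set of wanted header lines, prefixed with '>' so they
         # can be compared directly with the header lines of the fasta file.
         with open(headers_file) as inp:
             headers = set('>' + line for line in inp)

         counter = 0
         accept = False

         with open(fasta_file) as inp, open(output_file, 'w') as out:
             for line in inp:
                 if line.startswith('>'):
                     # A new record starts here; keep it only if listed.
                     accept = line in headers
                     if accept:
                         counter += 1

                 # Copy the header and the lines that follow it.
                 if accept:
                     out.write(line)

         print('{} sequences were found'.format(counter))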
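
The same idea can be sketched as a plain function, offered as an illustration rather than a drop-in replacement. It assumes header lines should be compared after stripping surrounding whitespace, so that a headers file without a final newline (or with different line endings) still matches. The name extract_records and the sample file names are made up for this example, and it is written to run on Python 3:

```python
import os
import tempfile


def extract_records(headers_path, fasta_path, output_path):
    """Copy the records whose header line is listed in headers_path.

    Returns the number of records copied.
    """
    with open(headers_path) as inp:
        # Assumption: compare stripped lines and skip blank ones.
        wanted = set('>' + line.strip() for line in inp if line.strip())

    counter = 0
    accept = False
    with open(fasta_path) as inp, open(output_path, 'w') as out:
        for line in inp:
            if line.startswith('>'):
                accept = line.strip() in wanted
                if accept:
                    counter += 1
            if accept:
                out.write(line)
    return counter


# Hypothetical sample files, created in a temporary directory.
tmp = tempfile.mkdtemp()
headers_path = os.path.join(tmp, 'headers.txt')
fasta_path = os.path.join(tmp, 'input.fasta')
output_path = os.path.join(tmp, 'output.fasta')

with open(headers_path, 'w') as f:
    f.write('header2|info1:info2_info3')  # deliberately no trailing newline
with open(fasta_path, 'w') as f:
    f.write('>header1|info1:info2_info3\ngeneral text 1\n'
            '>header2|info1:info2_info3\ngeneral text 2\n')

found = extract_records(headers_path, fasta_path, output_path)
with open(output_path) as f:
    result = f.read()

print('{} sequences were found'.format(found))
```

Returning the count instead of printing it inside the function leaves the caller free to decide how to report it.
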