Extracting lines from text files - script with a couple of 'side effects'

Started by	mstagliamonte <madmaxthc@yahoo.it>
First post	2013-09-25 13:06 -0700
Last post	2013-09-26 05:44 -0700
Articles	4 — 3 participants

  Extracting lines from text files - script with a couple of 'side effects' mstagliamonte <madmaxthc@yahoo.it> - 2013-09-25 13:06 -0700
    Re: Extracting lines from text files - script with a couple of 'side effects' Dave Angel <davea@davea.name> - 2013-09-25 21:02 +0000
    Re: Extracting lines from text files - script with a couple of 'side effects' MRAB <python@mrabarnett.plus.com> - 2013-09-25 22:53 +0100
    Re: Extracting lines from text files - script with a couple of 'side effects' mstagliamonte <madmaxthc@yahoo.it> - 2013-09-26 05:44 -0700

#54756 — Extracting lines from text files - script with a couple of 'side effects'

From	mstagliamonte <madmaxthc@yahoo.it>
Date	2013-09-25 13:06 -0700
Subject	Extracting lines from text files - script with a couple of 'side effects'
Message-ID	<cca8071a-c33e-47d6-b4dc-83dba2c28b74@googlegroups.com>

Dear All,

Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
Fasta file:
>header1|info1:info2_info3
general text
>header2|info1:info2_info3
general text

headers file:
header1|info1:info2_info3
header2|info1:info2_info3

I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.

I have done a script which seems to work, but with a couple of 'side effects'
Here is my script:
-------------------------------------------------------------------
import re
class Extractor():
        
    def __init__(self,headers_file, fasta_file,output_file):
        with open(headers_file,'r') as inp0:
            counter0=0
            container=''
            inp0_bis=inp0.read().split('\n')
            for x in inp0_bis:
                container+=x.replace(':','_').replace('|','_')
            with open(fasta_file,'r') as inp1:
                inp1_bis=inp1.read().split('>')
                for i in inp1_bis:
                    i_bis= i.split('\n')                                       
                    match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
                    if match:
                        counter0+=1
                        with open(output_file,'at') as out0:
                            out0.write('>'+i)
            print '{} sequences were found'.format(counter0)

-------------------------------------------------------------------
Side effects:
1) The very first header is written as >>header1 rather than >header1
2) the number of sequences found is 1 more than the ones actually found!

Have you got any thoughts about causes/solutions?

Thanks for your time!
P.S.: I think I have removed the double posting... not sure...
Max

#54757

From	Dave Angel <davea@davea.name>
Date	2013-09-25 21:02 +0000
Message-ID	<mailman.323.1380142949.18130.python-list@python.org>
In reply to	#54756

On 25/9/2013 16:06, mstagliamonte wrote:

> Dear All,
>
> Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
> Fasta file:
>>header1|info1:info2_info3
> general text
>>header2|info1:info2_info3
> general text
>
> headers file:
> header1|info1:info2_info3
> header2|info1:info2_info3
>
> I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
>
> I have done a script which seems to work, but with a couple of 'side effects'
> Here is my script:
> -------------------------------------------------------------------
> import re
> class Extractor():
>         
>     def __init__(self,headers_file, fasta_file,output_file):
>         with open(headers_file,'r') as inp0:
>             counter0=0
>             container=''
>             inp0_bis=inp0.read().split('\n')
>             for x in inp0_bis:
>                 container+=x.replace(':','_').replace('|','_')
>             with open(fasta_file,'r') as inp1:
>                 inp1_bis=inp1.read().split('>')
>                 for i in inp1_bis:
>                     i_bis= i.split('\n')                                       
>                     match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
>                     if match:
>                         counter0+=1
>                         with open(output_file,'at') as out0:
>                             out0.write('>'+i)
>              print '{} sequences were found'.format(counter0)
>
> -------------------------------------------------------------------
> Side effects:
> 1) The very first header is written as >>header1 rather than >header1
> 2) the number of sequences found is 1 more than the ones actually found!
>
> Have you got any thoughts about causes/solutions?
>

The cause is the same.  The first line in the file starts with a ">",
and you're splitting on the same.  So the first item of inp1_bis is the
empty string.  An empty pattern matches any string, so re.search finds
it in container: the counter goes up by one and a lone ">" is written
just before the first real record, which gives you ">>header1".

You can "fix" this by adding a line after the "for i in inp1_bis" line
    if not i: continue
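
To illustrate, here is a minimal sketch of the cause and of the fix. It uses a made-up two-record string instead of a real file; the sample data and the names fasta_text, records and kept are only illustrative, and it is written for Python 3.

```python
import re

# Illustrative FASTA text; note that the very first character is '>'.
fasta_text = (">header1|info1:info2_info3\ngeneral text\n"
              ">header2|info1:info2_info3\ngeneral text\n")
container = "header1_info1_info2_info3header2_info1_info2_info3"

records = fasta_text.split(">")
# Splitting on '>' leaves an empty string before the first '>'.
print(repr(records[0]))   # ''

# An empty pattern matches any string, so the empty record counts as a hit.
print(bool(re.search(records[0], container)))   # True

# Skipping empty records removes both the extra '>' and the extra count.
kept = [r for r in records if r]
print(len(records), len(kept))   # 3 2
```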

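But it also seems to me you're doing the search backwards.  You take each
header from the Fasta file and look for it inside the concatenated
headers, so if the Fasta file has a line like: >der

it would be considered a match, since "der" occurs inside "header1".
Seems to me you'd want to only match lines which contain an entire header.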
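
A small sketch of that difference, assuming the same made-up header names as in the question (the variable names wanted, container and wanted_set are hypothetical, and the code targets Python 3). It compares searching inside the joined headers with testing whole headers against a set.

```python
import re

# Illustrative data: these header names are taken from the example files.
wanted = ["header1|info1:info2_info3", "header2|info1:info2_info3"]
container = "".join(h.replace(":", "_").replace("|", "_") for h in wanted)

# Searching for a Fasta header inside the joined text accepts fragments.
print(bool(re.search("der", container)))   # True

# Comparing whole headers against a set only accepts exact matches.
wanted_set = set(wanted)
print("der" in wanted_set)                          # False
print("header1|info1:info2_info3" in wanted_set)    # True
```

A set lookup also avoids treating characters such as '|' as regular expression syntax, so the replace() calls are no longer needed.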


-- 
DaveA

#54759

From	MRAB <python@mrabarnett.plus.com>
Date	2013-09-25 22:53 +0100
Message-ID	<mailman.324.1380146037.18130.python-list@python.org>
In reply to	#54756

On 25/09/2013 21:06, mstagliamonte wrote:
> Dear All,
>
> Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
> Fasta file:
>>header1|info1:info2_info3
> general text
>>header2|info1:info2_info3
> general text
>
> headers file:
> header1|info1:info2_info3
> header2|info1:info2_info3
>
> I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
>
> I have done a script which seems to work, but with a couple of 'side effects'
> Here is my script:
> -------------------------------------------------------------------
> import re
> class Extractor():
>
>      def __init__(self,headers_file, fasta_file,output_file):
>          with open(headers_file,'r') as inp0:
>              counter0=0
>              container=''
>              inp0_bis=inp0.read().split('\n')
>              for x in inp0_bis:
>                  container+=x.replace(':','_').replace('|','_')
>              with open(fasta_file,'r') as inp1:
>                  inp1_bis=inp1.read().split('>')
>                  for i in inp1_bis:
>                      i_bis= i.split('\n')
>                      match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
>                      if match:
>                          counter0+=1
>                          with open(output_file,'at') as out0:
>                              out0.write('>'+i)
>               print '{} sequences were found'.format(counter0)
>
> -------------------------------------------------------------------
> Side effects:
> 1) The very first header is written as >>header1 rather than >header1
> 2) the number of sequences found is 1 more than the ones actually found!
>
> Have you got any thoughts about causes/solutions?
>
> Thanks for your time!
>
Here's my version:

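class Extractor():
    def __init__(self, headers_file, fasta_file, output_file):
        # Build the set of wanted headers, written as they appear in the
        # fasta file (with a leading '>'). Line endings are stripped so
        # that a last line without a newline still matches.
        with open(headers_file) as inp:
            headers = set('>' + line.rstrip('\n') for line in inp)

        counter = 0
        accept = False

        with open(fasta_file) as inp, open(output_file, 'w') as out:
            for line in inp:
                if line.startswith('>'):
                    # A header line decides whether this record is kept.
                    accept = line.rstrip('\n') in headers
                    if accept:
                        counter += 1

                if accept:
                    out.write(line)

        print('{} sequences were found'.format(counter))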
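
The same approach could also be written as a plain function that returns the count, which makes it easy to try on throwaway files. This is only a sketch for Python 3, not part of the version above: the function name extract_records and the file names are hypothetical. It strips line endings before comparing, so a headers file without a final newline still matches.

```python
import os
import tempfile


def extract_records(headers_file, fasta_file, output_file):
    """Copy FASTA records whose header is listed in headers_file; return the count."""
    with open(headers_file) as inp:
        # Skip blank lines and strip line endings before adding the '>' prefix.
        headers = {">" + line.strip() for line in inp if line.strip()}

    counter = 0
    accept = False
    with open(fasta_file) as inp, open(output_file, "w") as out:
        for line in inp:
            if line.startswith(">"):
                # A header line decides whether the following lines are kept.
                accept = line.strip() in headers
                if accept:
                    counter += 1
            if accept:
                out.write(line)
    return counter


# Try it on throwaway files in a temporary directory.
tmp = tempfile.mkdtemp()
headers_path = os.path.join(tmp, "headers.txt")
fasta_path = os.path.join(tmp, "input.fasta")
output_path = os.path.join(tmp, "output.fasta")

with open(headers_path, "w") as f:
    f.write("header2|info1:info2_info3")  # no trailing newline, on purpose
with open(fasta_path, "w") as f:
    f.write(">header1|info1:info2_info3\ngeneral text\n"
            ">header2|info1:info2_info3\ngeneral text\n")

found = extract_records(headers_path, fasta_path, output_path)
with open(output_path) as f:
    written = f.read()
print("{} sequences were found".format(found))
```

Returning the count instead of printing it inside the function keeps the reporting separate from the file handling.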

#54818

From	mstagliamonte <madmaxthc@yahoo.it>
Date	2013-09-26 05:44 -0700
Message-ID	<9b2df692-a229-401c-b129-3b8905d46f6b@googlegroups.com>
In reply to	#54756

Hi,

Thanks for the answers! And Dave, thanks for explaining the cause of the problem; I will keep that in mind for the future. You're right, I am doing the search backwards. It just seemed easier for me to do it that way. Looks like I need to keep practising...

Both your suggestions work, I will try and learn from them.

Have a nice day
Max
