Re: Extracting lines from text files - script with a couple of 'side effects'

From	Dave Angel <davea@davea.name>
Subject	Re: Extracting lines from text files - script with a couple of 'side effects'
Date	Wed, 25 Sep 2013 21:02:07 +0000 (UTC)


On 25/9/2013 16:06, mstagliamonte wrote:

> Dear All,
>
> Here I am, with another newbie question. I am trying to extract some lines from a fasta (text) file which match the headers in another file. i.e:
> Fasta file:
>>header1|info1:info2_info3
> general text
>>header2|info1:info2_info3
> general text
>
> headers file:
> header1|info1:info2_info3
> header2|info1:info2_info3
>
> I want to create a third file, similar to the first one, but only containing headers and text of what is listed in the second file. Also, I want to print out how many headers were actually found from the second file to match the first.
>
> I have done a script which seems to work, but with a couple of 'side effects'
> Here is my script:
> -------------------------------------------------------------------
> import re
> class Extractor():
>         
>     def __init__(self,headers_file, fasta_file,output_file):
>         with open(headers_file,'r') as inp0:
>             counter0=0
>             container=''
>             inp0_bis=inp0.read().split('\n')
>             for x in inp0_bis:
>                 container+=x.replace(':','_').replace('|','_')
>             with open(fasta_file,'r') as inp1:
>                 inp1_bis=inp1.read().split('>')
>                 for i in inp1_bis:
>                     i_bis= i.split('\n')                                       
>                     match = re.search(i_bis[0].replace(':','_').replace('|','_'),container)
>                     if match:
>                         counter0+=1
>                         with open(output_file,'at') as out0:
>                             out0.write('>'+i)
>             print '{} sequences were found'.format(counter0)
>
> -------------------------------------------------------------------
> Side effects:
> 1) The very first header is written as >>header1 rather than >header1
> 2) the number of sequences found is 1 more than the ones actually found!
>
> Have you got any thoughts about causes/solutions?
>

The cause is the same.  The first line in the file starts with a ">",
and you're splitting on the same.  So the first item of inp1_bis is the
empty string.  An empty pattern matches any string, container included,
so it counts as a match and writes a lone ">" to the output, right in
front of the first real record.

You can "fix" this by adding a line after the "for i in inp1_bis" line
    if not i: continue
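
To see the empty first item for yourself, here is a minimal sketch. The sample strings are just the two records from your post, not your real data, and the names are made up for illustration. It is written so it runs the same on Python 2 and 3.

```python
import re

# Sample data copied from the post; a stand-in for the real files.
fasta_text = (">header1|info1:info2_info3\ngeneral text\n"
              ">header2|info1:info2_info3\ngeneral text\n")
container = "header1_info1_info2_info3header2_info1_info2_info3"

records = fasta_text.split(">")

# The text begins with the delimiter, so the first item is the empty string.
first_is_empty = records[0] == ""

# An empty pattern matches at position 0 of any string.
empty_matches = re.search("", container) is not None

# Skipping empty items leaves only the real records.
real_records = [r for r in records if r]
```

With the empty item skipped, nothing extra gets counted and no stray ">" is written.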

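But it also seems to me you're doing the search backwards.  You use each
Fasta header as the pattern and search for it inside the concatenated
list of wanted headers, so if the Fasta file has a line like: >der

it would be considered a match, since "der" occurs inside "header1".
Seems to me you'd want to only match lines which contain an entire header.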
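
One way to match only entire headers is to drop the regex and compare whole header lines against a set. This is only a sketch under my own assumptions: the function name extract_records is hypothetical, it takes the file contents as strings rather than filenames, and it assumes each wanted header sits on its own line exactly as it appears in the Fasta file.

```python
def extract_records(headers_text, fasta_text):
    """Return (selected Fasta text, number of records whose header is wanted)."""
    # One wanted header per line; blank lines are ignored.
    wanted = set(line.strip() for line in headers_text.splitlines()
                 if line.strip())
    selected = []
    for record in fasta_text.split(">"):
        if not record:
            # Empty item produced by the leading ">" in the file.
            continue
        # The header is everything up to the first newline of the record.
        header = record.split("\n", 1)[0].strip()
        if header in wanted:
            selected.append(">" + record)
    return "".join(selected), len(selected)


headers_text = "header1|info1:info2_info3\nheader2|info1:info2_info3\n"
fasta_text = (">header1|info1:info2_info3\ngeneral text\n"
              ">der\nother text\n"
              ">header3|info1:info2_info3\nmore text\n")
output_text, found = extract_records(headers_text, fasta_text)
```

Here the ">der" record is left out, because "der" is not one of the wanted headers even though it is a substring of one. Comparing whole strings also sidesteps the need to replace ':' and '|' before matching.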


-- 
DaveA
