| Started by | claire morandin <claire.morandin@gmail.com> |
|---|---|
| First post | 2013-06-04 19:41 -0700 |
| Last post | 2013-06-04 21:05 -0700 |
| Articles | 5 — 3 participants |
Issue values dictionary claire morandin <claire.morandin@gmail.com> - 2013-06-04 19:41 -0700
Re: Issue values dictionary alex23 <wuwei23@gmail.com> - 2013-06-04 20:17 -0700
Re: Issue values dictionary Peter Otten <__peter__@web.de> - 2013-06-05 09:43 +0200
Re: Issue values dictionary alex23 <wuwei23@gmail.com> - 2013-06-05 02:46 -0700
Re: Issue values dictionary claire morandin <claire.morandin@gmail.com> - 2013-06-04 21:05 -0700
| From | claire morandin <claire.morandin@gmail.com> |
|---|---|
| Date | 2013-06-04 19:41 -0700 |
| Subject | Issue values dictionary |
| Message-ID | <d6c181ee-d2bf-4278-a26c-22bc26402271@googlegroups.com> |
I have two text files with a bunch of transcript names and their corresponding lengths; they look like this:
ERCC.txt
ERCC-00002 1061
ERCC-00003 1023
ERCC-00004 523
ERCC-00009 984
ERCC-00012 994
ERCC-00013 808
ERCC-00014 1957
ERCC-00016 844
ERCC-00017 1136
ERCC-00019 644
blast.txt
ERCC-00002 1058
ERCC-00003 1017
ERCC-00004 519
ERCC-00009 977
ERCC-00019 638
ERCC-00022 746
ERCC-00024 134
ERCC-00024 126
ERCC-00024 98
ERCC-00025 445
I want to compare the lengths of the transcripts and see if the length in blast.txt is at least 90% of the length in ERCC.txt for the corresponding transcript name (I hope I am clear!).
So I wrote the following script:
ercctranscript_size = {}
for line in open('ERCC.txt'):
    columns = line.strip().split()
    transcript = columns[0]
    size = columns[1]
    ercctranscript_size[transcript] = int(size)
unknown_transcript = open('Not_sequenced_ERCC_transcript.txt', 'w')
blast_file = open('blast.txt')
out_file = open ('out.txt', 'w')
blast_transcript = {}
blast_file.readline()
for line in blast_file:
    blasttranscript = columns[0].strip()
    blastsize = columns[1].strip()
    blast_transcript[blasttranscript] = int(blastsize)
blastsize = blast_transcript[blasttranscript]
size = ercctranscript_size[transcript]
print size
if transcript not in blast_transcript:
    unknown_transcript.write('{0}\n'.format(transcript))
else:
    size = ercctranscript_size[transcript]
    if blastsize >= 0.9*size:
        print >> out_file, transcript, True
    else:
        print >> out_file, transcript, False
But I have a problem storing all the lengths in the variable size, as it always comes back with the last entry.
Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary? I am really new to Python and this is my first script.
Thanks for your help everybody!
| From | alex23 <wuwei23@gmail.com> |
|---|---|
| Date | 2013-06-04 20:17 -0700 |
| Message-ID | <d503401e-d8cc-4059-80bf-c41af334222c@ve4g2000pbb.googlegroups.com> |
| In reply to | #46994 |
On Jun 5, 12:41 pm, claire morandin <claire.moran...@gmail.com> wrote:
> But I have a problem storing all the lengths in the variable size, as it always comes back with the last entry.
> Could anyone explain to me what I am doing wrong and how I should set the values for each dictionary?
Your code has two for loops, one that reads ERCC.txt into a dict, and
one that reads blast.txt into a dict. The first assigns to
`transcript`, the second to `blasttranscript`. When the loops are
finished, you're using the _last_ value set for both `transcript` and
`blasttranscript`. So, really, you want _three_ loops: two to load the
files into dicts, then another to compare the two of them. If the
transcripts in blast.txt are guaranteed to be a subset of ERCC.txt,
then you could get away with two loops:
# convenience function for splitting lines into values
def get_transcript_and_size(line):
    columns = line.strip().split()
    return columns[0].strip(), int(columns[1].strip())

# read in blast_file
blast_transcripts = {}
with open('transcript_blast.txt') as blast_file:
    # this is a context manager, it'll close the file when it's finished
    for line in blast_file:
        blasttranscript, blastsize = get_transcript_and_size(line)
        blast_transcripts[blasttranscript] = blastsize

# read in ERCC and compare to blast
with open('transcript_ERCC.txt') as ercc_file, \
     open('Not_sequenced_ERCC_transcript.txt', 'w') as unknown_transcript, \
     open('transcript_out.txt', 'w') as out_file:
    # this is called a _nested_ context manager, and requires 2.7+ or 3.1+
    for line in ercc_file:
        ercctranscript, erccsize = get_transcript_and_size(line)
        if ercctranscript not in blast_transcripts:
            print >> unknown_transcript, ercctranscript
        else:
            is_ninety_percent = blast_transcripts[ercctranscript] >= 0.9*erccsize
            print >> out_file, ercctranscript, is_ninety_percent
I've cleaned up your code a bit, using more similar naming schemes and
the same open/write procedures for all file access. Generally, any
time you're repeating code, you should stick it into a function and
use that instead, like the `get_transcript_and_size` func. If the
columns in your two files are separated by tabs, or always by the same
number of spaces, you can simplify this even further by using the csv
module: http://docs.python.org/2/library/csv.html
Hope this helps.
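For reference, the three-loop version described above (two loops to load the files, a third to compare them) might look roughly like the following. This is only a sketch, written in Python 3 syntax rather than the Python 2 used in the thread. The helper names `load_sizes` and `compare_sizes` are made up for illustration, the input is assumed to be whitespace-separated name/length pairs with no header line, and a transcript that appears more than once (as ERCC-00024 does in the sample blast.txt) keeps only its last length.

```python
def load_sizes(lines):
    """Build a {transcript: length} dict from 'name length' lines."""
    sizes = {}
    for line in lines:
        columns = line.split()
        if len(columns) < 2:
            continue  # skip blank or malformed lines
        # a repeated transcript name overwrites the earlier entry
        sizes[columns[0]] = int(columns[1])
    return sizes


def compare_sizes(ercc_sizes, blast_sizes, threshold=0.9):
    """Compare the two dicts once both are fully loaded.

    Returns (results, missing): results maps each transcript found in
    both dicts to True/False, missing lists ERCC transcripts that are
    absent from the blast dict.
    """
    results = {}
    missing = []
    for transcript, ercc_size in ercc_sizes.items():
        if transcript not in blast_sizes:
            missing.append(transcript)
        else:
            results[transcript] = blast_sizes[transcript] >= threshold * ercc_size
    return results, missing


# a few rows copied from the sample data in the original post
ercc_lines = ["ERCC-00002 1061", "ERCC-00003 1023", "ERCC-00012 994"]
blast_lines = ["ERCC-00002 1058", "ERCC-00003 1017", "ERCC-00024 134"]

results, missing = compare_sizes(load_sizes(ercc_lines), load_sizes(blast_lines))
print(results)
print(missing)
```

Because `load_sizes` only iterates over lines, an open file object can be passed in place of the lists, for example `with open('ERCC.txt') as f: ercc_sizes = load_sizes(f)`.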
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2013-06-05 09:43 +0200 |
| Message-ID | <mailman.2705.1370418184.3114.python-list@python.org> |
| In reply to | #46996 |
alex23 wrote:
> def get_transcript_and_size(line):
>     columns = line.strip().split()
>     return columns[0].strip(), int(columns[1].strip())
You can remove all strip() methods here as split() already strips off any
whitespace from the columns.
Not really important, but the nitpicker in me keeps nagging ;)
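A quick way to check this behaviour is shown below. It is a small illustration in Python 3 syntax with a made-up sample line, not code from the thread. It also shows that the stripping applies only when split() is called without an explicit separator.

```python
# made-up line with leading, trailing and mixed inner whitespace
line = "  ERCC-00002 \t 1061 \n"

# no arguments: split on runs of whitespace, ignoring the ends
columns = line.split()
print(columns)                      # ['ERCC-00002', '1061']

# with an explicit separator, surrounding whitespace is kept
pieces = "a , b".split(",")
print(pieces)                       # ['a ', ' b']
```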
| From | alex23 <wuwei23@gmail.com> |
|---|---|
| Date | 2013-06-05 02:46 -0700 |
| Message-ID | <09f7cf28-70cc-425b-8e93-872ac49a55f3@ks18g2000pbb.googlegroups.com> |
| In reply to | #47031 |
On Jun 5, 5:43 pm, Peter Otten <__pete...@web.de> wrote:
> You can remove all strip() methods here as split() already strips off any
> whitespace from the columns.
>
> Not really important, but the nitpicker in me keeps nagging ;)
Thanks, I really should have checked but just pushed the OP's code into a
function, I didn't want to startle them with completely different code :)
As I mentioned, I would've used the csv module for this anyway, which is
why I never remember the split/strip behaviour.
Nitpickery can be a virtue in this field :)
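A csv-based reader along those lines might look like the sketch below (Python 3 syntax). It assumes the columns are separated by a single tab, which the thread does not confirm, and it uses an in-memory `io.StringIO` buffer as a stand-in for the real file so the example runs on its own.

```python
import csv
import io

# stand-in for an open file; assumes tab-separated columns
sample = io.StringIO("ERCC-00002\t1061\nERCC-00003\t1023\n")

sizes = {}
for row in csv.reader(sample, delimiter="\t"):
    if not row:
        continue  # csv.reader yields an empty list for blank lines
    name, length = row[0], row[1]
    sizes[name] = int(length)

print(sizes)
```

With a real file, the csv documentation recommends opening it with `newline=''`, for example `open('ERCC.txt', newline='')`.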
| From | claire morandin <claire.morandin@gmail.com> |
|---|---|
| Date | 2013-06-04 21:05 -0700 |
| Message-ID | <aee1a578-aafa-4163-84de-c331cf2dd415@googlegroups.com> |
| In reply to | #46994 |
@alex23 I can't thank you enough, this really helped me so much, not only in fixing my issue but also in understanding where my original error was.
Thanks a lot