File position and buffers

Started by	Cee Joe <cyril_jose@ymail.com>
First post	2011-04-27 15:02 -0500
Last post	2011-04-29 16:10 -0500
Articles	20 on this page of 22 — 5 participants

  File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-27 15:02 -0500
    Re: File position and buffers Jesús Gabriel y Galán <jgabrielygalan@gmail.com> - 2011-04-27 16:47 -0500
    Re: File position and buffers jake kaiden <jakekaiden@yahoo.com> - 2011-04-27 17:33 -0500
    Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-27 19:08 -0500
    Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-27 19:50 -0500
    Re: File position and buffers Robert Klemme <shortcutter@googlemail.com> - 2011-04-28 02:54 -0500
      Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-28 13:06 -0500
        Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-28 13:25 -0500
          Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-28 13:29 -0500
    Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-28 09:06 -0500
    Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-28 12:47 -0500
      Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-28 13:27 -0500
        Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-28 18:31 -0500
          Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-28 20:05 -0500
    Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-28 21:58 -0500
      Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-29 10:20 -0500
    Re: File position and buffers jake kaiden <jakekaiden@yahoo.com> - 2011-04-28 22:36 -0500
    Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-29 12:50 -0500
      Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-29 13:32 -0500
        Re: File position and buffers 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-29 17:45 -0500
    Re: File position and buffers jake kaiden <jakekaiden@yahoo.com> - 2011-04-29 15:38 -0500
    Re: File position and buffers Cee Joe <cyril_jose@ymail.com> - 2011-04-29 16:10 -0500

#3574 — File position and buffers

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-27 15:02 -0500
Subject	File position and buffers
Message-ID	<10d8ae57765e21626a7c64873dcba807@ruby-forum.com>

Hi all,

In a bit of a rut. I have a file with a lot of text, and I want to separate
the text in this file into entries. Each entry would be separated
using IO.pos: when the cursor reaches a certain
character in the file, it would ideally place all the content before that
character into a buffer. Then the cursor would continue reading until it
hits that same character again and put that content into a buffer, and so
on. (The character I'll be looking for is a greater-than symbol.)

 Would I use a do iterator or use a while loop with a gets method? Or
readlines perhaps?

File:
>entry 1
rubyrubyrubyrubyrubyrubyrubyruby
(newline here which I don't want)
>entry 2
rubyrubyrubyrubyrubyrubyrubyruby

Entry1 and entry2 will be in separate buffers which I would be able to
access again.

buffer1 = >entry 1
rubyrubyrubyrubyrubyrubyrubyruby

buffer2 = >entry 2
rubyrubyrubyrubyrubyrubyrubyruby


PS. The file is huge, so I don't want to read it into memory. What is
the best way to approach this? Any suggestions or comments would be
helpful. Thanks!

-- 
Posted via http://www.ruby-forum.com/.

#3584

From	Jesús Gabriel y Galán <jgabrielygalan@gmail.com>
Date	2011-04-27 16:47 -0500
Message-ID	<BANLkTimwmQpU3E73MNnGRzf+BZsBFtDqyw@mail.gmail.com>
In reply to	#3574

You could use foreach checking if each line starts with '>'. If it doesn't
you accumulate in a buffer; if it does you do something with the current
buffer and start a new one.
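
A minimal sketch of that idea, assuming the entries look like the sample in the first post. The helper name each_entry and the sample string are made up for illustration, and a StringIO stands in for the real file:

```ruby
require 'stringio'

# Hypothetical helper: yields one entry at a time from any IO-like object.
def each_entry(io)
  buffer = nil
  io.each_line do |line|
    if line.start_with?('>')
      yield buffer if buffer   # hand over the previous entry
      buffer = line.dup        # start a new buffer with the header line
    elsif buffer
      buffer << line.chomp     # body lines: append without the newline
    end
  end
  yield buffer if buffer       # the last entry has no '>' after it
end

sample = StringIO.new(">entry 1\nrubyruby\nruby\n\n>entry 2\nrubyrubyruby\n")

# Collected into an array only to show the result; with a huge file you
# would process each entry inside the block instead of keeping it.
entries = []
each_entry(sample) { |entry| entries << entry }
p entries
```

With a real file the same helper could be called as File.open(path) { |f| each_entry(f) { |entry| ... } }, so that only one entry is held at a time.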

Jesus
On 27/04/2011 22:04, "Cee Joe" <cyril_jose@ymail.com> wrote:

#3588

From	jake kaiden <jakekaiden@yahoo.com>
Date	2011-04-27 17:33 -0500
Message-ID	<e7fe81ff36d39f75e03940dd92433130@ruby-forum.com>
In reply to	#3574

hi Cee -

  this may well be WAY too simple for your needs, but it seems to me you 
could do something like this:

(0text.txt is a file with 7 lines that say rubyrubyrubyetc.)

  f = "0text.txt"
  file  = File.open(f)
  buffer = []
  bufferindex = 0

  file.each_line{|line|
       buffer[bufferindex] = line.chomp
       bufferindex += 1
  }

p buffer[0]
p buffer[1]
p buffer[2]
#etc...

  of course you could also set a maximum number of lines per buffer:

  f = "0text.txt"
  file  = File.open(f)
  buffer = Hash.new{|hash, key| hash[key] = []}
  bufferkey = 0
  maxbuflength = 3

  file.each_line{|line|
    if buffer[bufferkey].length == maxbuflength
      bufferkey +=1
      buffer[bufferkey] << line.chomp
    else
      buffer[bufferkey] << line.chomp
    end
  }

p buffer[0]
p buffer[1]
p buffer[2]

  if the file's extremely long i guess you'd want to write a method to 
dump the buffers at some point too.

  maybe this is dumb, i hope not!
  cheers,

  -j

-- 
Posted via http://www.ruby-forum.com/.

#3591

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-27 19:08 -0500
Message-ID	<5fcf83686c90d0c89ba3cdbb67b7255c@ruby-forum.com>
In reply to	#3574

Cee Joe wrote in post #995381:
> Hi all,
>
> In a bit of a rut. Have a file with a lot of text. I want to seperate
> the text in this file as entries. Each entry that I would be seperating,
> would be done so using IO.pos and when that cursor reaches a certain
> character in the file, it will ideally place all the content before that
> character into a buffer. Then the cursor will continue reading until it
> hits that same character again and put that content into a buffer, so on
> and so forth. (Character I'll be reading would be a greater than symbol)
>

There is absolutely no reason to use pos() to read that file.


>  Would I use a do iterator or use a while loop with a gets method? Or
> readlines perhaps?
>
> File:
>>entry 1
> rubyrubyrubyrubyrubyrubyrubyruby
> (newline here which I don't want)
>

chomp() removes one newline, if present, at the end of a string.
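
A quick sketch of that behaviour, using a made-up string rather than the real file:

```ruby
line = "rubyrubyruby\n\n"

once  = line.chomp          # removes only the final "\n"
twice = line.chomp.chomp    # a second call removes the next one
crlf  = "ruby\r\n".chomp    # a trailing "\r\n" counts as one line ending

p once, twice, crlf
p line                      # chomp returns a new string; line is unchanged
```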

>
> PS. The file is huge, so I don't want to read it into memory. What is
> the best way to approach this? Any suggestions or comments would be
> helpful. Thanks!

Well, then you have to tell us what you want to do with the segments of 
the file.  If you store each chunk in a variable, then you will have 
read the whole file into memory.

You say your file looks like this:

>entry 1 <---WHAT'S AT THE END OF THIS LINE??
rubyrubyrubyrubyruby <---WHAT'S AT THE END OF THIS LINE??
(newline here which I don't want)

Those look like newlines.  Are you saying that your data is organized 
into paragraphs, i.e. separated by two newlines?  Like this:

>entry1\n
rubyrubyruby\n
\n
>entry2\n
rubyrubyruby\n
\n
>entry3

Paragraphs are separated by two consecutive newlines.  Note 
that in ruby the default line separator is one newline.  But you can 
change that to two newlines--or any other string:

require 'stringio'

str =<<ENDOFSTRING
>entry1
11111111111

>entry2
22222222222

>entry3
33333333333
ENDOFSTRING

input = StringIO.new(str)
$/ = "\n\n"

input.each do |para|
  p para.sub(/\n+ \z/xms, "")
end

--output:--
">entry1\n11111111111"
">entry2\n22222222222"
">entry3\n33333333333"

-- 
Posted via http://www.ruby-forum.com/.

#3593

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-27 19:50 -0500
Message-ID	<bd42da2f59207f9958b6377fbdea7517@ruby-forum.com>
In reply to	#3574

This shows the output better:

e = input.enum_for(:each) #You can do this for a File too.

e.each_slice(2) do |buffer1, buffer2|
  puts "buffer1: #{buffer1.inspect}"
  puts "buffer2: #{buffer2.inspect}"
  puts "-" * 10
end

--output:--
buffer1: ">entry1\n11111111111\n\n"
buffer2: ">entry2\n22222222222\n\n"
----------
buffer1: ">entry3\n33333333333\n"
buffer2: nil
----------

Before doing the sub() on buffer2, you will have to check if it's nil:

  if buffer2.nil?
    #don't do a sub()
  else
    #do the sub()
  end

-- 
Posted via http://www.ruby-forum.com/.

#3612

From	Robert Klemme <shortcutter@googlemail.com>
Date	2011-04-28 02:54 -0500
Message-ID	<BANLkTi=SoLUqREXGZKu1yjzfik=OtUb+iQ@mail.gmail.com>
In reply to	#3574

On Wed, Apr 27, 2011 at 10:02 PM, Cee Joe <cyril_jose@ymail.com> wrote:
> Hi all,
>
> In a bit of a rut. Have a file with a lot of text. I want to seperate
> the text in this file as entries. Each entry that I would be seperating,
> would be done so using IO.pos and when that cursor reaches a certain
> character in the file, it will ideally place all the content before that
> character into a buffer. Then the cursor will continue reading until it
> hits that same character again and put that content into a buffer, so on
> and so forth. (Character I'll be reading would be a greater than symbol)
>
>  Would I use a do iterator or use a while loop with a gets method? Or
> readlines perhaps?
>
> File:
>>entry 1
> rubyrubyrubyrubyrubyrubyrubyruby
> (newline here which I don't want)
>>entry 2
> rubyrubyrubyrubyrubyrubyrubyruby
>
> Entry1 and entry2 will be in seperate buffers which I would be able to
> access again.
>
> buffer1 = >entry 1
> rubyrubyrubyrubyrubyrubyrubyruby
>
> buffer2 = >entry 2
> rubyrubyrubyrubyrubyrubyrubyruby
>
>
> PS. The file is huge, so I don't want to read it into memory. What is
> the best way to approach this? Any suggestions or comments would be
> helpful. Thanks!

One of the simplest approaches is to use Ruby's ability to use
arbitrary record delimiters:

File.foreach file_name, ">" do |chunk|
  chunk.chomp! ">"
  chunk.gsub! /\r\n?|\n/, '' # remove line terminators
  # if you need the leading ">":
  # chunk[0,0] = ">"
  p chunk
end

Kind regards

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

#3644

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-28 13:06 -0500
Message-ID	<6f3710886bbfa66d9bb6cb71ea04b6ee@ruby-forum.com>
In reply to	#3612

Robert K. wrote in post #995478:
> On Wed, Apr 27, 2011 at 10:02 PM, Cee Joe <cyril_jose@ymail.com> wrote:
>> Would I use a do iterator or use a while loop with a gets method? Or
>> access again.
>> helpful. Thanks!
> One of the simplest approaches is to use Ruby's ability to use
> arbitrary record delimiters:
>
> File.foreach file_name, ">" do |chunk|
>   chunk.chomp! ">"
>   chunk.gsub! /\r\n?|\n/, '' # remove line terminators
>

Cee Joe, are you reading the file in binary mode or text mode?

-- 
Posted via http://www.ruby-forum.com/.

#3646

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-28 13:25 -0500
Message-ID	<95426fd6f1235f8f1a26cd8383acff10@ruby-forum.com>
In reply to	#3644

7stud -- wrote in post #995589:
>
> Cee Joe, are you reading the file in binary mode or
> text mode?

If you don't know, then show us the line in your code where you open the 
file.

-- 
Posted via http://www.ruby-forum.com/.

#3649

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-28 13:29 -0500
Message-ID	<591a0a6013482db87dd9625b6033fe62@ruby-forum.com>
In reply to	#3646

7stud -- wrote in post #995596:
> 7stud -- wrote in post #995589:
>>
>> Cee Joe, are you reading the file in binary mode or
>> text mode?
>
> If you don't know, then show us the line in your code where you open the
> file.

f = File.open("test.fasta", "r")

Where test.fasta contains the entries i posted earlier..

-- 
Posted via http://www.ruby-forum.com/.

#3624

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-28 09:06 -0500
Message-ID	<0af64604aa1f5420f16889a3f19dbd0b@ruby-forum.com>
In reply to	#3574

Thanks guys for your helpful comments. I will be more descriptive. I am 
an intern and my mentor wants me to use IO.pos to read the 
characters of the file until the cursor reaches the ">" symbol. So 
when the cursor reaches the ">" symbol (which is the start of a new 
entry), he wants me to place the previous entry in a buffer. Here is 
the actual test file I am working with:

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA\n
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n
\n
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\n
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n
\n
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA\n
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\n
TTAGTCGCTGACGCATGCACG\n
\n

7stud, you are right there are two consecutive newlines which I failed 
to mention. This should be the output of a buffer for one entry:

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no "\n"
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no "\n"


Notice how the newlines are gone. So with the exception of the header in 
each entry, the newlines should be gone and be placed in a buffer. I am 
lost on how to use the IO.pos and a file iterator to make sure each 
respective entry goes into a buffer without the file being indexed into 
memory.

Thanks in advance, I'm new to the language and trying to wrap my head 
around it.

-- 
Posted via http://www.ruby-forum.com/.

#3642

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-28 12:47 -0500
Message-ID	<97869b15e6f881b951b4b230011182e0@ruby-forum.com>
In reply to	#3574

You still have not told us what you are supposed to do with the stuff 
you read in??  You can read a file line by line and print out each line 
as you go and the maximum amount of memory used will be one line's 
worth.   However, if you are supposed to store all the lines in an 
array, then you will read the whole file into memory.
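
A small sketch of that difference, assuming a throwaway three-line temp file in place of the real data:

```ruby
require 'tempfile'

file = Tempfile.new('lines')
file.write("one\ntwo\nthree\n")
file.close

# Streaming: only the current line is referenced at any moment.
count = 0
File.foreach(file.path) { |line| count += 1 }

# Slurping: every line of the file sits in one array at the same time.
all_lines = File.readlines(file.path)

puts count
puts all_lines.length
file.unlink
```

Both calls see the same three lines; the difference is only in how much of the file is held at once.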

> Thanks guys for your helpful comments. I will be more
> descriptive. I am an intern and my mentor wants me to
> use the IO.pos to read the characters of the file
> until the character reaches the ">" symbol.

What problems is that giving you?  You can create a loop, read the 
character at pos(i), then increment i, and do what Jesús Gabriel y Galán 
suggested.

-- 
Posted via http://www.ruby-forum.com/.

#3647

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-28 13:27 -0500
Message-ID	<a8bfb33ef4ce932d8afc29b01bb29252@ruby-forum.com>
In reply to	#3642

7stud -- wrote in post #995581:
> You still have not told us what you are supposed to do with the stuff
> you read in??  You can read a file line by line and print out each line
> as you go and the maximum amount of memory used will be one line's
> worth.   However, if you are supposed to store all the lines in an
> array, then you will read the whole file into memory.
>
>> Thanks guys for your helpful comments. I will be more
>> descriptive. I am an intern and my mentor wants me to
>> use the IO.pos to read the characters of the file
>> until the character reaches the ">" symbol.
>


I am extracting text from each entry I read in, something I have figured 
out already. I want to read the file line by line and just store each 
entry into a buffer when it reaches the ">" symbol, then extract 
specific info from it later. The entry lengths all vary, as there are long 
and short ones. File is in text mode.

> What problems is that giving you?  You can create a loop, read the
> character at pos(i), then increment i, and do what Jesús Gabriel y Galán
> suggested.

Could you show me a simple example or refer me to a link?

-- 
Posted via http://www.ruby-forum.com/.

#3664

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-28 18:31 -0500
Message-ID	<c73bca63732c4f7e0943455cdc55a935@ruby-forum.com>
In reply to	#3647

Cee Joe wrote in post #995597:
>
> my mentor wants me to use the IO.pos to read the
> characters of the file until the character reaches the ">" symbol.
>

IO.pos() does not read in data, so you are going to have to ask your 
mentor what he means.   You should also ask your mentor if this is a 
lesson in how not to do things.  If he doesn't reply in the affirmative, 
then you should find a new mentor.


> I am extracting text from each entry I read in, something I have figured
> out already. I want to read the file line by line and just store each
> entry into a buffer when it reaches the ">" symbol. THen extract
> specific info from it later.
>

You told us you were not supposed to read the whole file into memory. 
If you store every line in an array, then you will have read the whole 
file into memory.  Once again, you are not being clear on what you want 
to do with the data.  You need to tell us which of the following you 
want to do:

1) Store every entry in an array, and "extract specific info from it 
later".

2) Read one entry, do something to the entry, then discard it and read 
in the next entry.


> The entry lengths all vary as there long
> and short lengths. File is in text mode.
>

Ok.

>> What problems is that giving you?  You can create a loop, read the
>> character at pos(i), then increment i, and do what Jesús Gabriel y Galán
>> suggested.
>

You could use each_byte to read the file char by char (that assumes your 
file contains only ascii characters), then when you find a '>', seek() 
back to old_pos (the start of the file the first time), and use IO.sysread() to read:

 old_pos = 0
 pos() - old_pos

number of characters.  Then do something like:

old_pos = pos()

and keep doing that. But, you will be reading every entry twice, which 
is stupid.
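
A rough sketch of a pos()-based, two-pass version, with a few swaps from the description above that are my own assumptions: it uses gets rather than each_byte to find the header lines, read rather than sysread (mixing sysread with buffered reads on the same File can raise an IOError), and a StringIO with made-up data in place of the file:

```ruby
require 'stringio'

io = StringIO.new(">entry1\nAAAA\nCCCC\n\n>entry2\nGGGG\n")

# Pass 1: remember the byte offset at which each '>' header line starts.
offsets = []
loop do
  start = io.pos               # position before reading the next line
  line = io.gets
  break if line.nil?
  offsets << start if line.start_with?('>')
end
offsets << io.pos              # end of data closes the last entry

# Pass 2: seek back to each offset and read exactly one entry's bytes.
entries = offsets.each_cons(2).map do |from, to|
  io.seek(from)
  io.read(to - from)
end

p offsets
p entries
```

It works, but every byte is read twice, which is the drawback already pointed out.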

-- 
Posted via http://www.ruby-forum.com/.

#3667

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-28 20:05 -0500
Message-ID	<859f1a6ec8eae502d91f42f274c1d8aa@ruby-forum.com>
In reply to	#3664

> 2) Read one entry, do something to the entry, then discard it and read
> in the next entry.

This is what I want to do. Read one entry, extract information from it, 
then read next entry. He says using an array will take up a lot of 
memory so he said use a buffer.


> But, you will end up reading every entry twice, which
> is stupid.  The easiest way to read in the file and prepare each entry
> is to set the input separator to "\n\n", then use each() to read in a
> paragraph, then use split("\n") to split each entry into lines, then add
> back a \n to the first line.
>
> Also, are you aware that this:
>
>>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA\n
> GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG <-- no "\n"
> CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG <-- no "\n"
>
> is equivalent to:
>
>>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
> 
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

Yes I am aware of that - I just put "no \n" for emphasis. Regarding the 
pos(), I think he said to use it as a guide to help with the detection 
of each ">" . Thanks for being patient and helping out.

-- 
Posted via http://www.ruby-forum.com/.

#3670

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-28 21:58 -0500
Message-ID	<1d5c91d93983445a08235be2797f7f0b@ruby-forum.com>
In reply to	#3574

If you don't have to use pos(), then see my first post.

-- 
Posted via http://www.ruby-forum.com/.

#3701

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-29 10:20 -0500
Message-ID	<8279ca181d004aab4664d4fe4e1fc33c@ruby-forum.com>
In reply to	#3670

7stud -- wrote in post #995683:
> If you don't have to use pos(), then see my first post.  At some point,
> you might ask him why he thinks that pos() would be of any help at all!

Thanks jake and 7stud for replying. I tried this in irb for your first 
post:

>> e = File.open("test/test.fasta").enum_for(:each)
=> #<Enumerable::Enumerator:0x1005777a8>
>> $/ = "\n\n"
=> "\n\n"

>Before doing the sub() on buffer2, you will have to check if it's nil:

 >if buffer2.nil?
 >   #don't do a sub()
 > else
 >   #do the sub()
 >end

>> e.each_slice(2) do |buf1, buf2|
?> p buf1, buf2
>> if buf2.nil?
>> puts "Done"
>> else
?> buf2.sub(/\n+ \z/xms, "")
>> end
>> end

Output:
">gi|329299107|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), 
mRNA\nAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\n\n"
">gi|329299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), 
mRNA\nGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\nCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\n\n"
">gi|329299107|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), 
mRNA\nCGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\nTTAGTCGCTGACGCATGCACG\n"
nil
Done
=> nil

It still returns nil, am I doing what you suggested wrong?
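
One possible reading of that session, offered as a sketch and not a definitive diagnosis: sub() returns a new string and leaves buf2 alone, so its result has to be kept, and the value irb prints after the loop is only the return value of each_slice (nil in the Ruby of that era, different in newer versions). The sample data below is made up:

```ruby
require 'stringio'

input = StringIO.new(">entry1\n1111\n\n>entry2\n2222\n\n>entry3\n3333\n")

original = "abc\n\n"
original.sub(/\n+\z/, '')      # return value thrown away; original is unchanged

cleaned = []
input.each_line("\n\n").each_slice(2) do |buf1, buf2|
  cleaned << buf1.sub(/\n+\z/, '')                    # keep what sub returns
  cleaned << buf2.sub(/\n+\z/, '') unless buf2.nil?   # odd entry count: buf2 is nil
end

p original
p cleaned
```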

-- 
Posted via http://www.ruby-forum.com/.

#3674

From	jake kaiden <jakekaiden@yahoo.com>
Date	2011-04-28 22:36 -0500
Message-ID	<79209ef84d9aba12fbb512ee6f18e427@ruby-forum.com>
In reply to	#3574

hi Cee -

  copying the text you posted above into the file "0text.txt" and 
running this:

  f = "0text.txt"
  file  = File.open(f)
  buffer = []
  bufferindex = 0

  file.each_line(sep=">"){|line|
       buffer[bufferindex] = line.chomp
       bufferindex += 1
  }

p buffer[0]
p buffer[1]
p buffer[2]
p buffer[3]

  i get this as output:

#=> ">"
#=> "gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), 
mRNA\\n\nAGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG\\n\n\\n\n>"
#=> "gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), 
mRNA\\n\nGTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG\\n\nCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG\\n\n\\n\n>"
#=> "gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), 
mRNA\\n\nCGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG\\n\nTTAGTCGCTGACGCATGCACG\\n\n\\n"

  does this work for you?  you could easily write ways to deal with, 
dump, and reset the buffers when they fill up.  you can of course also 
clean up all the "\n"'s...

  i agree with 7stud that using #.pos and #.gets seems like a long walk 
off a short pier.  i'm pretty green myself, and there are probably 
better ways to iterate through the file, but #.each_line(sep=">") works 
just fine, and doesn't eat up memory.

  - j

-- 
Posted via http://www.ruby-forum.com/.

#3703

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-29 12:50 -0500
Message-ID	<12ea664fa8ebe548db95a756061e6489@ruby-forum.com>
In reply to	#3574

The first thing everyone in this thread needs to realize is that '>' is 
not the separator you want to look for.  That's because you don't care 
what character marks the beginning of every entry, rather you care what 
character marks the end of every entry.  The end of every entry is 
marked by the string "\n\n", so you should use that as your input line 
terminator.  Remember, ruby uses "\n" for the input line separator by 
default, which means that when you read a file using IO#each, ruby reads 
lines--where the end of a line is marked by a newline.  However, you can 
change the input line separator to the string "\n\n" (or any other 
string):

$/ = "\n\n"


Once you have an entry, then you just need to do a little housekeeping 
and remove some "\n" characters.



require 'stringio'

str =<<ENDOFSTRING
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

ENDOFSTRING


input = StringIO.new(str)  #Now input is just like a File

input.each(sep = "\n\n") do |para|
  buffer = ''

  lines = para.split("\n")
  buffer << lines.shift << "\n"
  lines.each do |line|
    buffer << line
  end

  puts buffer
  puts "-" * 20
end

p $/

--output:--
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
--------------------
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
--------------------
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG
--------------------
"\n"


Note that passing the separator as an argument to each() affects only 
that call and leaves the global input line separator $/ untouched, as 
the final p $/ shows--which is a good thing.
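
A small check of how the separator argument and $/ interact, assuming a made-up sample string and a default, unmodified $/ at the start:

```ruby
require 'stringio'

before = $/                     # the global input line separator, "\n" by default

input = StringIO.new("a\nb\n\nc\n\n")
paras = []
input.each_line("\n\n") { |para| paras << para }   # separator given per call

after = $/                      # still whatever it was before the loop

p paras
p before == after
```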

-- 
Posted via http://www.ruby-forum.com/.

#3707

From	Cee Joe <cyril_jose@ymail.com>
Date	2011-04-29 13:32 -0500
Message-ID	<ae08ac328a97a170cbe366bc7fd10a2c@ruby-forum.com>
In reply to	#3703

7stud -- wrote in post #995821:
> I suggest that people never use irb because it has too many quirks.
>
> The first thing you need to realize is that '>' is
> not the separator you want to look for.  That is the second bit of
> erroneous advice your mentor gave you.  That's because you don't care
> what character marks the beginning of every entry, rather you care what
> character marks the end of every entry.  The end of every entry in your
> file is marked by the string "\n\n", so you should use that as your
> input line terminator.  Remember, ruby uses "\n" for the input line
> separator by default, which means that when you read a file using
> IO#each, ruby reads lines--where the end of a line is marked by a
> newline.

I understand the logic, it makes sense. What if the file looked like 
this, where there is one newline separating the entries?

>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

Would an if-else (checking for "\n" versus "\n\n") do the trick? I wanted to 
write my code so that it would handle both scenarios. Or maybe:

case
  when "\n\n"
    <code>
  when "\n"
    <code>
end

something to that effect? Suggestions?

-- 
Posted via http://www.ruby-forum.com/.

#3726

From	7stud -- <bbxx789_05ss@yahoo.com>
Date	2011-04-29 17:45 -0500
Message-ID	<43b7442097d0943daf8ee458d7329f8c@ruby-forum.com>
In reply to	#3707

Cee Joe wrote in post #995830:
> 7stud -- wrote in post #995821:
>> I suggest that people never use irb because it has too many quirks.
>>
>> The first thing you need to realize is that '>' is
>> not the separator you want to look for.  That is the second bit of
>> erroneous advice your mentor gave you.  That's because you don't care
>> what character marks the beginning of every entry, rather you care what
>> character marks the end of every entry.  The end of every entry in your
>> file is marked by the string "\n\n", so you should use that as your
>> input line terminator.  Remember, ruby uses "\n" for the input line
>> separator by default, which means that when you read a file using
>> IO#each, ruby reads lines--where the end of a line is marked by a
>> newline.
>
> I understand the logic, it makes sense. What if the file looked like
> this, where there is one newline seperating the entries? :

What if you had presented that possibility from the very beginning?


require 'stringio'

str =<<ENDOFSTRING
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG

>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATG
CGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG

>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTT
AATAGCGCGCCATCTGAGCAG
TTAGTCGCTGACGCATGCACG

ENDOFSTRING

input = StringIO.new(str)
buffer = ''

input.each do |line|
  if line[0, 1] == '>'
    if buffer != ''
      puts buffer  #or do something else to buffer
      puts '-' * 20
    end

    buffer = ''
    buffer << line
  else
    buffer << line.sub(/ \n+ \z /xms, '')
  end

end

puts buffer   #or do something else to buffer

--output:--
>gi|329295464|ref|NM_2005745.3Acc1| Def1 zgc:65895 (zgc:65895), mRNA
AGCTCGGGGGCTCTAGCGATTTAAGGAGCGATGCGATCGAGCTGACCGTCGCG
--------------------
>gi|456299107|ref|NM_2342343.3Acc2| Def2 zgc:65895 (zgc:65895), mRNA
GTCGCTGGGTCGAAAAGTGGTGCTATATCGCGGCTCGCGTCGATGTCGCGATGCGTGCGCGCGAGAGCGCGCTATGATGAAAGGATGAGAGAG
--------------------
>gi|3542945647|ref|NM_7453343.5Acc3| Def3 zgc:65895 (zgc:65895), mRNA
CGTGCGGGGABCCGTACGTGCCGTGGGGGTTTAATAGCGCGCCATCTGAGCAGTTAGTCGCTGACGCATGCACG

-- 
Posted via http://www.ruby-forum.com/.
