Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!news.stack.nl!talisker.lacave.net!lacave.net!not-for-mail
From: Martin Hansen <mail@maasha.dk>
Newsgroups: comp.lang.ruby
Subject: Re: Need for speed -> a C extension?
Date: Mon, 18 Apr 2011 13:30:13 -0500
Organization: Service de news de lacave.net
Lines: 55
Message-ID: <4638db940d54573d9643ab0a369c8c7e@ruby-forum.com>
References: <c094098c0ea21c2b9618d1b8d7a4b176@ruby-forum.com> <iohsms02e2q@enews2.newsguy.com>
NNTP-Posting-Host: bristol.highgroove.com
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
X-Trace: talisker.lacave.net 1303151443 94971 65.111.164.187 (18 Apr 2011 18:30:43 GMT)
X-Complaints-To: abuse@lacave.net
NNTP-Posting-Date: Mon, 18 Apr 2011 18:30:43 +0000 (UTC)
In-Reply-To: <iohsms02e2q@enews2.newsguy.com>
X-Received-From: This message has been automatically forwarded from the ruby-talk mailing list by a gateway at comp.lang.ruby. If it is SPAM, it did not originate at comp.lang.ruby. Please report the original sender, and not us. Thanks! For more details about this gateway, please visit: http://blog.grayproductions.net/categories/the_gateway
X-Mail-Count: 381792
X-Ml-Name: ruby-talk
X-Rubymirror: Yes
X-Ruby-Talk: <4638db940d54573d9643ab0a369c8c7e@ruby-forum.com>
Xref: x330-a1.tempe.blueboxinc.net comp.lang.ruby:3114

WJ wrote in post #993576:
> Martin Hansen wrote:
>
>> The below code is too slow for practical use. I need it to run at least
>> 20 times faster. Perhaps that is possible with some C code? I have no
>> experience with writing Ruby extensions. What are the pitfalls? Which
>> part of the code should be ported? Any pointers to get me started?
>
> Please give a clear description of the algorithm, and then
> give some sample input and output.


Here is a working version of the code that can be profiled (though it 
will take forever with 20M iterations):

http://pastie.org/1808127

The slow part according to profiler is:

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 29.39     2.66      2.66     1521     1.75    11.78  Range#each
 15.80     4.09      1.43    33000     0.04     0.06  Seq#match?
 10.72     5.06      0.97    78940     0.01     0.03  Kernel.dup
  9.28     5.90      0.84    78940     0.01     0.01 
Kernel.initialize_dup
  6.63     6.50      0.60   142380     0.00     0.00 
Seq::Score#edit_distance
  5.30     6.98      0.48    22220     0.02     0.03  Seq#deletion?
  3.54     7.30      0.32    66016     0.00     0.00  String#ord
  3.43     7.61      0.31    14680     0.02     0.04  Seq#mismatch?
  3.31     7.91      0.30     8300     0.04     0.05  Seq#insertion?


The input is DNA sequences. Basically strings of ATCG and Ns of length 
50-100. These comes in files with 20M-30M sequences per file. I've got 
~50 of these files and more incoming. The output will be truncated 
sequences based on the match position located with this bit of code.

The algorithm is of the dynamic programming flavor and was inspired by 
the paper by Bruno Woltzenlogel Paleo (page 197):

http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf

Locating variable length matches is tricky!



Cheers,

Martin

-- 
Posted via http://www.ruby-forum.com/.