Path: csiph.com!x330-a1.tempe.blueboxinc.net!usenet.pasdenom.info!aioe.org!news.stack.nl!talisker.lacave.net!lacave.net!not-for-mail From: Martin Hansen Newsgroups: comp.lang.ruby Subject: Re: Need for speed -> a C extension? Date: Mon, 18 Apr 2011 13:30:13 -0500 Organization: Service de news de lacave.net Lines: 55 Message-ID: <4638db940d54573d9643ab0a369c8c7e@ruby-forum.com> References: NNTP-Posting-Host: bristol.highgroove.com Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Trace: talisker.lacave.net 1303151443 94971 65.111.164.187 (18 Apr 2011 18:30:43 GMT) X-Complaints-To: abuse@lacave.net NNTP-Posting-Date: Mon, 18 Apr 2011 18:30:43 +0000 (UTC) In-Reply-To: X-Received-From: This message has been automatically forwarded from the ruby-talk mailing list by a gateway at comp.lang.ruby. If it is SPAM, it did not originate at comp.lang.ruby. Please report the original sender, and not us. Thanks! For more details about this gateway, please visit: http://blog.grayproductions.net/categories/the_gateway X-Mail-Count: 381792 X-Ml-Name: ruby-talk X-Rubymirror: Yes X-Ruby-Talk: <4638db940d54573d9643ab0a369c8c7e@ruby-forum.com> Xref: x330-a1.tempe.blueboxinc.net comp.lang.ruby:3114 WJ wrote in post #993576: > Martin Hansen wrote: > >> The below code is too slow for practical use. I need it to run at least >> 20 times faster. Perhaps that is possible with some C code? I have no >> experience with writing Ruby extensions. What are the pitfalls? Which >> part of the code should be ported? Any pointers to get me started? > > Please give a clear description of the algorithm, and then > give some sample input and output. Here is a working version of the code that can be profiled (though it will take forever with 20M iterations): http://pastie.org/1808127 The slow part according to profiler is: % cumulative self self total time seconds seconds calls ms/call ms/call name 29.39 2.66 2.66 1521 1.75 11.78 Range#each 15.80 4.09 1.43 33000 0.04 0.06 Seq#match? 10.72 5.06 0.97 78940 0.01 0.03 Kernel.dup 9.28 5.90 0.84 78940 0.01 0.01 Kernel.initialize_dup 6.63 6.50 0.60 142380 0.00 0.00 Seq::Score#edit_distance 5.30 6.98 0.48 22220 0.02 0.03 Seq#deletion? 3.54 7.30 0.32 66016 0.00 0.00 String#ord 3.43 7.61 0.31 14680 0.02 0.04 Seq#mismatch? 3.31 7.91 0.30 8300 0.04 0.05 Seq#insertion? The input is DNA sequences. Basically strings of ATCG and Ns of length 50-100. These comes in files with 20M-30M sequences per file. I've got ~50 of these files and more incoming. The output will be truncated sequences based on the match position located with this bit of code. The algorithm is of the dynamic programming flavor and was inspired by the paper by Bruno Woltzenlogel Paleo (page 197): http://www.logic.at/people/bruno/Papers/2007-GATE-ESSLLI.pdf Locating variable length matches is tricky! Cheers, Martin -- Posted via http://www.ruby-forum.com/.