From: Markus Fischer
Newsgroups: comp.lang.ruby
Subject: Match a pattern multiple times, returning matches, captures and offset?
Date: Tue, 5 Apr 2011 12:22:20 -0500
Message-ID: <4D9B4FBD.9020602@fischer.name>

Hi,

I'm used to being able to use the following in PHP. What it basically
does is return all matches, including the captures, ordered by matching
set, together with their offsets:

    $ php -r 'preg_match_all("/_(\w+)_/", "_foo_ _bar_", $matches, PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
    array(2) {
      [0]=>
      array(2) {
        [0]=>
        array(2) {
          [0]=>
          string(5) "_foo_"
          [1]=>
          int(0)
        }
        [1]=>
        array(2) {
          [0]=>
          string(3) "foo"
          [1]=>
          int(1)
        }
      }
      [1]=>
      array(2) {
        [0]=>
        array(2) {
          [0]=>
          string(5) "_bar_"
          [1]=>
          int(6)
        }
        [1]=>
        array(2) {
          [0]=>
          string(3) "bar"
          [1]=>
          int(7)
        }
      }
    }

I've found two ways in Ruby that go in this direction, String#match and
String#scan, but each gives me only part of the information. I guess I
could combine the two, but before attempting that I wanted to check that
I haven't overlooked something.

Here are my Ruby attempts:

    ruby-1.9.2-p180 :001 > m = "_foo_ _bar_".match(/_(\w+)_/)
     => #<MatchData "_foo_" 1:"foo">
    ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
     => ["_foo_", "foo"]
    ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
     => [0, 1]

But here I'm missing the further matches, "_bar_" and "bar".

Or the #scan approach:

    ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
     => [["foo"], ["bar"]]

But in this case I have even less information: the full matches (_foo_
and _bar_) are not present, and I can't get the offsets either.

I re-read the documentation for Regexp#match and found out that you can
pass an offset into the string as the second parameter, so I guess I can
iterate over the string in a loop until I find no further matches ...?

Considering this, I came up with:

    $ cat test_match_all.rb
    require 'pp'

    class String
      def match_all(pattern)
        matches = []
        offset = 0
        while m = match(pattern, offset) do
          matches << m
          # resume searching right after the end of this match
          offset = m.begin(0) + m[0].length
        end
        matches
      end
    end

    pp "_foo_ _bar_ _baz_".match_all(/_(\w+)_/)

    $ ruby test_match_all.rb
    [#<MatchData "_foo_" 1:"foo">,
     #<MatchData "_bar_" 1:"bar">,
     #<MatchData "_baz_" 1:"baz">]

I have lots of data to parse, so I can foresee this approach becoming a
bottleneck. Is there a more direct solution?

thanks,
- Markus
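
For comparison with the offset loop above, here is a minimal sketch of one
possible alternative. It relies on String#scan setting the last-match
variable ($~, also available as Regexp.last_match) for each match when it is
called with a block. The helper names match_all (written here as a plain
method rather than a String extension) and with_offsets are made up for
illustration, and whether this is actually faster on large inputs is
untested.

```ruby
# Collect one MatchData per match by letting String#scan drive the iteration.
# Inside the scan block, Regexp.last_match ($~) refers to the current match.
def match_all(string, pattern)
  matches = []
  string.scan(pattern) { matches << Regexp.last_match }
  matches
end

# Hypothetical helper: turn MatchData objects into PHP-style [text, offset]
# pairs, one inner array per match (whole match first, then each capture).
def with_offsets(matches)
  matches.map do |m|
    (0...m.size).map { |i| [m[i], m.begin(i)] }
  end
end

result = with_offsets(match_all("_foo_ _bar_", /_(\w+)_/))
p result
# prints [[["_foo_", 0], ["foo", 1]], [["_bar_", 6], ["bar", 7]]]
```

The nested array mirrors the PREG_SET_ORDER|PREG_OFFSET_CAPTURE layout from
the PHP example: one entry per match, each holding the full match and its
captures paired with their start offsets.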