| Started by | Markus Fischer <markus@fischer.name> |
|---|---|
| First post | 2011-04-05 12:22 -0500 |
| Last post | 2011-04-08 14:53 -0500 |
| Articles | 10 — 4 participants |
Match a pattern multiple times, returning matches, captures and offset? Markus Fischer <markus@fischer.name> - 2011-04-05 12:22 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? Brian Candler <b.candler@pobox.com> - 2011-04-05 13:07 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-05 20:37 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? Robert Klemme <shortcutter@googlemail.com> - 2011-04-06 04:42 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-06 18:58 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? Robert Klemme <shortcutter@googlemail.com> - 2011-04-07 02:13 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? Brian Candler <b.candler@pobox.com> - 2011-04-07 03:39 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-07 14:04 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? Brian Candler <b.candler@pobox.com> - 2011-04-08 02:19 -0500
Re: Match a pattern multiple times, returning matches, captures and offset? 7stud -- <bbxx789_05ss@yahoo.com> - 2011-04-08 14:53 -0500
| From | Markus Fischer <markus@fischer.name> |
|---|---|
| Date | 2011-04-05 12:22 -0500 |
| Subject | Match a pattern multiple times, returning matches, captures and offset? |
| Message-ID | <4D9B4FBD.9020602@fischer.name> |
Hi,
I'm used to being able to use the following in PHP. What it basically does
is: return all matches, including the captures, ordered by matching set,
and provide the offsets.
$ php -r 'preg_match_all("/_(\w+)_/", "_foo_ _bar_", $matches,
PREG_SET_ORDER|PREG_OFFSET_CAPTURE); var_dump($matches);'
array(2) {
  [0]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "_foo_"
      [1]=>
      int(0)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "foo"
      [1]=>
      int(1)
    }
  }
  [1]=>
  array(2) {
    [0]=>
    array(2) {
      [0]=>
      string(5) "_bar_"
      [1]=>
      int(6)
    }
    [1]=>
    array(2) {
      [0]=>
      string(3) "bar"
      [1]=>
      int(7)
    }
  }
}
I've found two ways in Ruby that go in this direction, String#match and
String#scan, but each provides only partial information. I guess I can
combine the two, but before attempting this I wanted to check that I
haven't overlooked something. Here are my Ruby attempts:
ruby-1.9.2-p180 :001 > m = "_foo_ _bar_".match(/_(\w+)_/)
=> #<MatchData "_foo_" 1:"foo">
ruby-1.9.2-p180 :002 > [ m[0], m[1] ]
=> ["_foo_", "foo"]
ruby-1.9.2-p180 :003 > [ m.begin(0), m.begin(1) ]
=> [0, 1]
But here I'm missing the further possible matches, "_bar_" and "bar". Or
the #scan approach:
ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
=> [["foo"], ["bar"]]
But in this case I've even less information, the match including _foo_
or _bar_ is not present and I can't get the offsets too.
I re-read the documentation for Regexp#match and found out that you can
pass an offset into the string as second parameter, so I guess I can
iterate over the string in a loop until I find no further matches ...?
Considering this I came up with:
$ cat test_match_all.rb
require 'pp'
class String
  def match_all(pattern)
    matches = []
    offset = 0
    while m = match(pattern, offset) do
      matches << m
      offset = m.begin(0) + m[0].length
    end
    matches
  end
end
pp "_foo_ _bar_ _baz_".match_all(/_(\w+)_/)
$ ruby test_match_all.rb
[#<MatchData "_foo_" 1:"foo">,
#<MatchData "_bar_" 1:"bar">,
#<MatchData "_baz_" 1:"baz">]
I've lots of data to parse so I could foresee that this approach can
become a bottleneck. Is there a more direct solution to it?
thanks,
- Markus
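Editorial sketch of the loop approach described above, with one assumption added: a pattern that can match the empty string (for example /x*/) would leave the offset unchanged in the original loop, so this variant steps one character forward after an empty match. The standalone method name match_all(str, pattern) is illustrative only; it avoids reopening String but is otherwise the same idea.

```ruby
# Collect every MatchData for pattern in str by repeatedly calling
# String#match with a start offset (the approach from the post above).
def match_all(str, pattern)
  matches = []
  offset = 0
  while offset <= str.length && (m = str.match(pattern, offset))
    matches << m
    # m.end(0) equals m.begin(0) + m[0].length. After an empty match,
    # advance by one character so the loop cannot stall.
    offset = m.end(0) > m.begin(0) ? m.end(0) : m.end(0) + 1
  end
  matches
end

found = match_all("_foo_ _bar_ _baz_", /_(\w+)_/)
found.each { |m| p [m[0], m.begin(0), m[1], m.begin(1)] }
```

Each element is a full MatchData, so captures and their offsets remain available through m[i], m.begin(i) and m.offset(i).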
| From | Brian Candler <b.candler@pobox.com> |
|---|---|
| Date | 2011-04-05 13:07 -0500 |
| Message-ID | <61d24bb96a28e517e89adef15f444b29@ruby-forum.com> |
| In reply to | #2355 |
String#scan with a block may do what you want:
>> "_foo_ _bar_".scan(/_(\w+)_/) { |x| puts "Offset #{$`.size}, captures #{x.inspect}" }
Offset 0, captures ["foo"]
Offset 6, captures ["bar"]
=> "_foo_ _bar_"
But it doesn't give you offsets to the individual captures, just to the
start of the whole match. (You also get the full match in $& and the
rest of the string after the match in $')
--
Posted via http://www.ruby-forum.com/.
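Editorial sketch: the three special variables mentioned in the post above, gathered per match into a hash. The hash keys and the results array are illustrative names, not part of any API.

```ruby
# Inside a scan block, $` is the text before the current match,
# $& is the full match and $' is the text after it.
results = []
"_foo_ _bar_".scan(/_(\w+)_/) do |captures|
  results << {
    offset:   $`.size,   # start of the whole match, in characters
    match:    $&,        # full matched text
    rest:     $',        # remainder of the string after the match
    captures: captures   # capture groups, as yielded by scan
  }
end

results.each { |r| p r }
```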
| From | 7stud -- <bbxx789_05ss@yahoo.com> |
|---|---|
| Date | 2011-04-05 20:37 -0500 |
| Message-ID | <3436432a3d0d05b87c5d5e94decd007d@ruby-forum.com> |
| In reply to | #2355 |
Markus Fischer wrote in post #991092:
>
> But here I'm missing the further possible matches, "_bar_" and "bar". Or
> the #scan approach:
>
> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
> => [["foo"], ["bar"]]
>
> But in this case I've even less information, the match including _foo_
> or _bar_ is not present and I can't get the offsets too.
>
> I re-read the documentation for Regexp#match
If you look at the preamble in the docs for the MatchData class, you can
retrieve a MatchData object using Regexp.last_match, which you can call
inside a scan() block:
str = "_foo_ _bar_"
str.scan(/_(\w+)_/) do |match|
  md = Regexp.last_match
  p [md[0], md[1], md.offset(1)]
end
--output:--
["_foo_", "foo", [1, 4]]
["_bar_", "bar", [7, 10]]
--
Posted via http://www.ruby-forum.com/.
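Editorial sketch: building on the Regexp.last_match idea above, the following helper assembles a nested array shaped like the PHP PREG_SET_ORDER|PREG_OFFSET_CAPTURE output from the first post. The name match_sets is hypothetical. One caveat to keep in mind: Ruby's MatchData#begin counts characters, while PHP reports byte offsets, so the numbers may differ for multibyte input.

```ruby
# For each match, return [[full_match, offset], [capture1, offset], ...].
def match_sets(str, pattern)
  sets = []
  str.scan(pattern) do
    md = Regexp.last_match
    # Index 0 is the whole match; 1..n are the capture groups.
    sets << (0...md.size).map { |i| [md[i], md.begin(i)] }
  end
  sets
end

p match_sets("_foo_ _bar_", /_(\w+)_/)
# => [[["_foo_", 0], ["foo", 1]], [["_bar_", 6], ["bar", 7]]]
```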
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2011-04-06 04:42 -0500 |
| Message-ID | <BANLkTinrpMH314-e8hg_jqRY_wmZBhpwqw@mail.gmail.com> |
| In reply to | #2368 |
On Wed, Apr 6, 2011 at 3:37 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:
> Markus Fischer wrote in post #991092:
>>
>> But here I'm missing the further possible matches, "_bar_" and "bar". Or
>> the #scan approach:
>>
>> ruby-1.9.2-p180 :004 > m = "_foo_ _bar_".scan(/_(\w+)_/)
>> => [["foo"], ["bar"]]
>>
>> But in this case I've even less information, the match including _foo_
>> or _bar_ is not present and I can't get the offsets too.
>>
>> I re-read the documentation for Regexp#match
>
> If you look at the preamble in the docs for the MatchData class, you can
> retrieve a MatchData object using Regexp.last_match, which you can call
> inside a scan() block:
When doing nested matching it may be better to use $~ because that is
local to the current stack frame which Regexp.last_match isn't.
Example with relative offsets as well:
irb(main):022:0> str.scan /_(\w+)_/ do
irb(main):023:1* 2.times {|i| p [$~[i], $~.offset(i), $~.offset(i).map {|o| o - $~.offset(0)[0]}]}
irb(main):024:1> end
["_foo_", [0, 5], [0, 5]]
["foo", [1, 4], [1, 4]]
["_bar_", [6, 11], [0, 5]]
["bar", [7, 10], [1, 4]]
=> "_foo_ _bar_"
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
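Editorial sketch: the irb one-liner above spelled out step by step. It records, for the whole match and each capture, the text, the absolute [start, end] pair and the same pair relative to the start of the match. The variable names rows, md and base are illustrative.

```ruby
str = "_foo_ _bar_"
rows = []

str.scan(/_(\w+)_/) do
  md = $~                # keep the MatchData; a later match would replace $~
  base = md.begin(0)     # absolute start of the whole match
  md.size.times do |i|
    absolute = md.offset(i)                 # [start, end] within str
    relative = absolute.map { |o| o - base } # [start, end] within the match
    rows << [md[i], absolute, relative]
  end
end

rows.each { |row| p row }
```

Storing $~ in a local first is a small precaution: any further regexp match inside the block would otherwise overwrite it.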
| From | 7stud -- <bbxx789_05ss@yahoo.com> |
|---|---|
| Date | 2011-04-06 18:58 -0500 |
| Message-ID | <f56ffd49e8ba8202e431f9bb9d4620cd@ruby-forum.com> |
| In reply to | #2355 |
You can also get the relative offset like this:
str = "_foo_ _bar_"
str.scan(/_(\w+)_/) do |curr_match|
  md = Regexp.last_match
  whole_match = md[0]
  captures = md.captures
  captures.each do |capture|
    p [whole_match, capture, whole_match.index(capture)]
  end
end
--
Posted via http://www.ruby-forum.com/.
| From | Robert Klemme <shortcutter@googlemail.com> |
|---|---|
| Date | 2011-04-07 02:13 -0500 |
| Message-ID | <BANLkTimkAeiuN7jWHC9g75NHb-JcNonFeg@mail.gmail.com> |
| In reply to | #2420 |
On Thu, Apr 7, 2011 at 1:58 AM, 7stud -- <bbxx789_05ss@yahoo.com> wrote:
> You can also get the relative offset like this:
>
> str = "_foo_ _bar_"
>
> str.scan(/_(\w+)_/) do |curr_match|
>   md = Regexp.last_match
>   whole_match = md[0]
>   captures = md.captures
>   captures.each do |capture|
>     p [whole_match, capture, whole_match.index(capture)]
>   end
That's nice! I wasn't aware of this. Thanks for sharing!
I also just read this in the docs: "Note that the last_match is local to
the thread and method scope of the method that did the pattern match."
So forget my point about $~ being safer.
Kind regards
robert
--
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/
| From | Brian Candler <b.candler@pobox.com> |
|---|---|
| Date | 2011-04-07 03:39 -0500 |
| Message-ID | <e9517b4095df9291249c3c221e9a35bc@ruby-forum.com> |
| In reply to | #2420 |
7stud -- wrote in post #991338:
> You can also get relative beginning offsets like this:
>
> str = "_foo_ _bar_"
>
> str.scan(/_(\w+)_/) do |curr_match|
>   md = Regexp.last_match
>   whole_match = md[0]
>   captures = md.captures
>
>   captures.each do |capture|
>     p [whole_match, capture, whole_match.index(capture)]
>   end
>
> end
Using 'index' doesn't work if you have multiple captures which have the
same pattern, or one is a substring of the other.
Use captures.begin and captures.end instead.
>> md = /(...)(...)/.match "foofoo"
=> #<MatchData "foofoo" 1:"foo" 2:"foo">
>> md.captures
=> ["foo", "foo"]
>> md.begin(1)
=> 0
>> md.begin(2)
=> 3
--
Posted via http://www.ruby-forum.com/.
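Editorial sketch: a side-by-side comparison of the two ways to locate captures, using the "foofoo" example from the post above. Note that begin and end are methods of the MatchData object (as the irb session shows with md.begin), not of the captures array. The names by_index and by_begin are illustrative.

```ruby
md = /(...)(...)/.match("foofoo")

# Searching for the capture text finds the FIRST occurrence each time,
# so two identical captures report the same position.
by_index = md.captures.map { |capture| md[0].index(capture) }

# Asking the MatchData for each group's start gives the real positions;
# subtracting md.begin(0) makes them relative to the whole match.
by_begin = (1...md.size).map { |i| md.begin(i) - md.begin(0) }

p by_index  # => [0, 0]
p by_begin  # => [0, 3]
```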
| From | 7stud -- <bbxx789_05ss@yahoo.com> |
|---|---|
| Date | 2011-04-07 14:04 -0500 |
| Message-ID | <6d45afa0bb38b1d865423b633e201ee9@ruby-forum.com> |
| In reply to | #2437 |
Brian Candler wrote in post #991406:
> 7stud -- wrote in post #991338:
>> You can also get relative beginning offsets like this:
>>
>> str = "_foo_ _bar_"
>>
>> str.scan(/_(\w+)_/) do |curr_match|
>>   md = Regexp.last_match
>>   whole_match = md[0]
>>   captures = md.captures
>>
>>   captures.each do |capture|
>>     p [whole_match, capture, whole_match.index(capture)]
>>   end
>>
>> end
>
> Using 'index' doesn't work if you have multiple captures which have the
> same pattern, or one is a substring of the other.
>
> Use captures.begin and captures.end instead.
>
begin() and end() are the two elements of offset(), which we've already
discussed above: The idea was to get the relative offsets within a match,
not the absolute offsets within the string.
--
Posted via http://www.ruby-forum.com/.
| From | Brian Candler <b.candler@pobox.com> |
|---|---|
| Date | 2011-04-08 02:19 -0500 |
| Message-ID | <81699e38451e9fd5b75d6c84084309c6@ruby-forum.com> |
| In reply to | #2483 |
7stud -- wrote in post #991546:
> However, note that
> begin() and end() are the two elements of offset(), which we've already
> discussed above. The idea was to additionally provide the relative
> offsets within a match, not just the absolute offsets within the string.
That's easy - subtract begin(0) which is the absolute offset of the
start of the match.
>> "foo bar" =~ /ba(.)/
=> 4
>> $~.captures
=> ["r"]
>> $~.begin(1)
=> 6
>> $~.begin(1) - $~.begin(0)
=> 2
--
Posted via http://www.ruby-forum.com/.
| From | 7stud -- <bbxx789_05ss@yahoo.com> |
|---|---|
| Date | 2011-04-08 14:53 -0500 |
| Message-ID | <9e87ab4e65b386a00d2bed80ffd89a71@ruby-forum.com> |
| In reply to | #2514 |
Brian Candler wrote in post #991686:
> 7stud -- wrote in post #991546:
>> However, note that
>> begin() and end() are the two elements of offset(), which we've already
>> discussed above. The idea was to additionally provide the relative
>> offsets within a match, not just the absolute offsets within the string.
>
> That's easy - subtract begin(0) which is the absolute offset of the
> start of the match.
The "subtraction method" was thoroughly vetted earlier.
--
Posted via http://www.ruby-forum.com/.