From: Janis Papanagnou
Newsgroups: comp.lang.awk
Subject: Re: Experiences with match() subexpressions?
Date: Thu, 10 Apr 2025 13:55:07 +0200

On 10.04.2025 13:08, Kenny McCormack wrote:
> In article ,
> Janis Papanagnou wrote:
>> On 10.04.2025 09:06, Janis Papanagnou wrote:
>>> I'm looking for subexpressions of regexp-matches using GNU Awk's
>>> third parameter of match(). For example
>>>
>>>   data = "R=r1,R=r2,R=r3,E=e"
>>>   match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
>>>
>>> The result stored in 'arr' seems to be determined by the static
>>> parenthesis structure, so with the pattern repetition {2,5} only
>>> the last matched data in the subexpression (r3) seems to persist
>>> in arr. - I suppose there's no cute way to achieve what I wanted?
>>
>> To clarify; what I wanted is access of the values "r1", "r2", "r3",
>> and "e" through 'arr'.
>
> I have to admit that I (still) don't really understand how this match third
> arg stuff works.

I've never used that before but it seems to be quite simple; for
every parenthesis group expression in the regexp it provides
(statically, as the parentheses are written, from left to right)
an array element with the expanded matched subexpression.

> I.e., I can never predict what will happen, so I always
> just dump out the array and try to reverse-engineer it each time I need to
> use it.
>
> I adapted your code into the following test script:
>
> --- Cut Here ---
> #!/bin/sh
> gawk 'BEGIN {
> data = "R=r1,R=r2,R=r3,E=e"
> match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
> for (i in arr) print i,arr[i]
> }'
>
> # To clarify; what I wanted is access of the values "r1", "r2", "r3",
> # and "e" through 'arr'.
> --- Cut Here ---
>
> The output I get is:
>
> --- Cut Here ---
> 0start 1
> 0length 18
> 3start 18
> 1start 11
> 2start 13
> 3length 1
> 2length 2
> 1length 5

The output above appears because 'arr' also stores additional
elements holding the positions and lengths of the matches. I don't
need those, so I'm only interested in the data elements below and
iterate over them with an index-counted loop...

> 0 R=r1,R=r2,R=r3,E=e    the whole expression
> 1 R=r3,                 the expression in the first parenthesis
> 2 r3                    the expression in the second, embedded parenthesis
> 3 e                     the expression in the final parenthesis
> --- Cut Here ---
>
> After playing around a bit, I could not come up with any sensible way of
> getting what you want to get.

Yeah, Arnold just told me the same; that it's impossible because
the underlying GNU regexp library doesn't support what I'm looking
for.

What I considered a possible workaround (in this case) is to replace
the (...){2,5} repetition with an explicit sequence of (...)? groups,
so that each repetition gets its own array element. (But in the
general case, for larger ranges than 2-5, that's neither feasible
nor sensible any more.)

>
> As an alternative, it sounds like you could just split the
> string on the comma; that would get you:

Yes, that was also how I did such things in the past. Only when I
saw that "third argument" to match() I hoped the two-level parsing
could be simplified into one step. The reason was that I thought I
had seen other languages (Perl, maybe?) that support such a feature.

>
> R=r1
> R=r2
> R=r3
> E=e
>
> Or, for finer control, you could use patsplit().

I think I'll do the parsing the straightforward two-step way as I
did before the GNU Awk specific functions were available; it's
probably also the clearest way to program that functionality.

Janis
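
For illustration, the two-step way mentioned above could look roughly
like the sketch below. It is only one possible shape, not the code
actually used: the helper name parse_pairs is made up for this
example, and it assumes every item has the form KEY=VALUE. It sticks
to POSIX awk features (split, index, substr), so no GNU Awk
extensions are needed.

```shell
#!/bin/sh
# parse_pairs is a hypothetical helper name, not something from the thread.
# It reads comma-separated KEY=VALUE items on stdin and prints one
# "KEY VALUE" line per item, using POSIX awk features only.
parse_pairs() {
    awk '{
        n = split($0, item, ",")            # step 1: split the record on commas
        for (i = 1; i <= n; i++) {
            eq  = index(item[i], "=")       # step 2: cut at the first "="
            key = substr(item[i], 1, eq - 1)
            val = substr(item[i], eq + 1)
            print key, val
        }
    }'
}

# Usage with the sample data from the thread.
printf '%s\n' 'R=r1,R=r2,R=r3,E=e' | parse_pairs
```

Note that this sketch does not check the constraints the original
regexp expresses (two to five R items followed by a final E item);
such a check would have to be added separately, for example by
counting the keys inside the loop.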