Path: csiph.com!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Ed Morton <mortonspam@gmail.com>
Newsgroups: comp.lang.awk
Subject: Re: Experiences with match() subexpressions?
Date: Sun, 13 Apr 2025 12:52:27 -0500
Organization: A noiseless patient Spider
Lines: 62
Message-ID: <vtgtkr$3br8e$1@dont-email.me>
References: <vt7qlq$2ge70$1@dont-email.me> <vt7qs4$2gior$1@dont-email.me> <vt9q0n$70fm$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 13 Apr 2025 19:52:28 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="f1e9ff24612dbcb2d5cf0531a0dc237a"; logging-data="3534094"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18Bt7aD54npic5N2cc3NTCe"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:sXJt8Bjn2xNjA+XkcS5pIFKC3gI=
In-Reply-To: <vt9q0n$70fm$1@dont-email.me>
X-Antivirus-Status: Clean
X-Antivirus: Avast (VPS 250413-4, 4/13/2025), Outbound message
Content-Language: en-US
Xref: csiph.com comp.lang.awk:9947

On 4/10/2025 8:07 PM, Ed Morton wrote:
> On 4/10/2025 2:09 AM, Janis Papanagnou wrote:
>> On 10.04.2025 09:06, Janis Papanagnou wrote:
>>> I'm looking for subexpressions of regexp-matches using GNU Awk's
>>> third parameter of match(). For example
>>>
>>>    data = "R=r1,R=r2,R=r3,E=e"
>>>    match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
>>>
>>> The result stored in 'arr' seems to be determined by the static
>>> parenthesis structure, so with the pattern repetition {2,5} only
>>> the last matched data in the subexpression (r3) seems to persist
>>> in arr. - I suppose there's no cute way to achieve what I wanted?
>>
>> To clarify; what I wanted is access of the values "r1", "r2", "r3",
>> and "e" through 'arr'.
> 
> Correct, you can't do what you want using just `match()`, it's simply 
> matching a regexp with capture groups against a string, just like sed does.
> 
> There are, of course, several other ways to get `arr[]` populated the 
> way you want. e.g split(), patsplit(), while(match()), or dynamically 
> generating the regexp. The best one to choose will depend on the real 
> values that r1, etc. can have, for example it'd be hard to use split() 
> if `r1` can be a quoted string that might itself contain similar 
> substrings such as `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.

FWIW, probably more for the benefit of any awk newcomers reading this, 
if your data really could have quoted fields (otherwise a simple 
`split(data, arr, ",")` is all you need to separate the fields) then, 
assuming they follow the same quoting rules as CSVs, I'd use either of 
these or similar. With GNU awk (for `patsplit()`):

     data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
     delete arr
     nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
     for ( i in arr ) {
         # strip the leading "R=" or "E=" tag from each field
         sub(/[^=]+=/, "", arr[i])
     }

or any awk:

     data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
     nf = 0
     delete arr
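     while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
         # skip the 2-char "R=" or "E=" tag, keep the value
         arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
         # drop everything through this match (this consumes data)
         data = substr(data, RSTART+RLENGTH)
     }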

either of which would populate `arr[]` with:

     "R=r1,R=r2"
     r2
     r3
     e

and set `nf` to the number of entries in `arr[]`.
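
If you want to try that out from a shell, here's a minimal sketch of the 
"any awk" loop wrapped in a function. The names `extract_values` and 
`get_values` are just made up for this illustration, and it assumes one 
record per input line with the same `R=`/`E=` tags and CSV-style quoting 
as above:

```shell
# extract_values is a hypothetical helper name, used only for this sketch.
# It prints one extracted value per line for the record passed as $1.
extract_values() {
    printf '%s\n' "$1" | awk '
    # Fill arr[1..n] with the values from str and return n.
    # Works on a local copy of str, so the caller keeps the original.
    function get_values(str, arr,    n) {
        split("", arr)      # portable way to empty an array
        n = 0
        while ( match(str, /[RE]=([^,]*|"([^"]|"")*")/) ) {
            # skip the 2-char tag, keep the (possibly quoted) value
            arr[++n] = substr(str, RSTART+2, RLENGTH-2)
            str = substr(str, RSTART+RLENGTH)
        }
        return n
    }
    {
        nf = get_values($0, arr)
        for ( i = 1; i <= nf; i++ ) {
            print arr[i]
        }
    }'
}
```

Calling `extract_values 'R="R=r1,R=r2",R=r2,R=r3,E=e'` should print the 
four values listed above, one per line. Passing the string into a 
function also means the original `data` isn't consumed by the loop.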

Regards,

     Ed.