From: Ed Morton
Newsgroups: comp.lang.awk
Subject: Re: Experiences with match() subexpressions?
Date: Sun, 13 Apr 2025 12:52:27 -0500

On 4/10/2025 8:07 PM, Ed Morton wrote:
> On 4/10/2025 2:09 AM, Janis Papanagnou wrote:
>> On 10.04.2025 09:06, Janis Papanagnou wrote:
>>> I'm looking for subexpressions of regexp-matches using GNU Awk's
>>> third parameter of match(). For example
>>>
>>>    data = "R=r1,R=r2,R=r3,E=e"
>>>    match (data, /^(R=([^,]+),){2,5}E=(.+)$/, arr)
>>>
>>> The result stored in 'arr' seems to be determined by the static
>>> parenthesis structure, so with the pattern repetition {2,5} only
>>> the last matched data in the subexpression (r3) seems to persist
>>> in arr. - I suppose there's no cute way to achieve what I wanted?
>>
>> To clarify; what I wanted is access of the values "r1", "r2", "r3",
>> and "e" through 'arr'.
>
> Correct, you can't do what you want using just `match()`, it's simply
> matching a regexp with capture groups against a string, just like sed does.
>
> There are, of course, several other ways to get `arr[]` populated the
> way you want. e.g split(), patsplit(), while(match()), or dynamically
> generating the regexp.
> The best one to choose will depend on the real values that r1, etc.
> can have, for example it'd be hard to use split() if `r1` can be a
> quoted string that might itself contain similar substrings such as
> `data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"`.

FWIW, probably more for the benefit of any awk newcomers reading this,
if your data really could have quoted fields (otherwise a simple
`split(data, arr, ",")` is all you need) then, assuming they follow the
same quoting rules as for CSVs, I'd use either of these or similar.

With GNU awk (for `patsplit()`):

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = patsplit(data, arr, /[RE]=([^,]*|"([^"]|"")*")/)
    for ( i in arr ) {
        sub(/[^=]+=/, "", arr[i])
    }

or any awk:

    data = "R=\"R=r1,R=r2\",R=r2,R=r3,E=e"
    nf = 0
    delete arr
    while ( match(data, /[RE]=([^,]*|"([^"]|"")*")/) ) {
        arr[++nf] = substr(data, RSTART+2, RLENGTH-2)
        data = substr(data, RSTART+RLENGTH)
    }

either of which would populate `arr[]` with:

    "R=r1,R=r2"
    r2
    r3
    e

and set `nf` to the number of entries in `arr[]`.

Regards,

    Ed.
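
For the simpler case with no quoted fields, here is a minimal sketch of the split() approach in any POSIX awk. The shell wrapper and the function name `split_values` are made up for illustration and are not from the thread. It assumes every field has the form tag=value and that no value contains a comma.

```shell
# Hypothetical helper (not from the post): split an unquoted
# "tag=value,tag=value,..." string and print one value per line.
split_values() {
    awk -v data="$1" 'BEGIN {
        nf = split(data, arr, ",")        # arr[1..nf] = "R=r1", "R=r2", ...
        for (i = 1; i <= nf; i++) {
            sub(/^[^=]+=/, "", arr[i])    # drop the leading "R=" or "E="
            print arr[i]
        }
    }'
}

split_values 'R=r1,R=r2,R=r3,E=e'
```

Looping from 1 to nf rather than using `for (i in arr)` keeps the values in their original order, which `for ... in` does not guarantee.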