Re: Experiences with match() subexpressions?

From Kaz Kylheku <643-408-1753@kylheku.com>
Newsgroups comp.lang.awk
Subject Re: Experiences with match() subexpressions?
Date 2025-04-11 07:40 +0000
Organization A noiseless patient Spider
Message-ID <20250411000404.794@kylheku.com>
References <vt7qlq$2ge70$1@dont-email.me> <vt8bit$2uiq5$1@dont-email.me> <vt8j5u$1gmdg$1@news.xmission.com> <vt9dre$3t3po$1@dont-email.me> <67f8b7af$0$705$14726298@news.sunsite.dk>


On 2025-04-11, Aharon Robbins <arnold@freefriends.org> wrote:
> In article <vt9dre$3t3po$1@dont-email.me>,
> Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>>The feature can be very useful,
>>but not for the case I was looking for. - Actually, it could have
>>provided the functionality I was seeking, but since GNU Awk relies
>>on the GNU regexp functions as they are implemented I cannot expect
>>that any provided features gets extended by Awk. - If GNU Awk would
>>have an own RE implementation then we could think about using, e.g.,
>>another array dimension to store the (now only temporary existing,
>>and generally unavailable) subexpressions.
>
> Actually, this is not so trivial.  The data structures at the C level
> as mandated by POSIX are one dimensional; the submatches in parentheses
> are counted from left to right. There's no way to represent the
> subexpressions that are under control of interval expressions, which
> would essentially require a two-dimensional data structure.
>
> Mike Haertel is writing a new regexp matcher for gawk; it was announced
> here some time ago: https://github.com/mikehaertel/minrx. The code is
> in the feature/minrx branch of the gawk Git repository.
>
> I just opened an issue, https://github.com/mikehaertel/minrx/issues/43,
> about this question. We shall see what develops.

Unix and POSIX regular expressions have perpetrated a kind of
misfeature.  They took the purely algebraic parentheses described in
classic literature on regular expressions, whose only role is to
override the precedence and associativity of operators, and turned them
into active operators that perform a double duty: they still override
precedence, but also denote submatches associated with capture
registers.

Parentheses are enumerated and made to correspond with numbered capture
registers, I think, as follows:

 ( ( ) ( ( ) ) )
 1 2   3 4

Scanning left to right, we identify the open left parentheses
which have matching closing parentheses, and number these in order
starting from 1.
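
To make the numbering concrete, here is a minimal sketch in C. The helper number_groups is hypothetical (it is not part of any regex library), and it makes simplifying assumptions: the parentheses are balanced, and backslash escapes and bracket expressions are ignored.

```c
#include <stddef.h>

/* Hypothetical helper, for illustration only.
 * For each position i in pat, num[i] receives the 1-based group number
 * if pat[i] is an opening parenthesis, and 0 otherwise.
 * Returns the total number of groups.
 * Assumptions: parentheses are balanced; escapes and bracket
 * expressions are not handled. num must have room for strlen(pat) entries. */
size_t number_groups(const char *pat, size_t *num)
{
    size_t count = 0;

    for (size_t i = 0; pat[i] != '\0'; i++)
        num[i] = (pat[i] == '(') ? ++count : 0;

    return count;
}
```

Applied to the pattern "(()(()))", which is the diagram above without the spaces, this assigns 1, 2, 3 and 4 to the opening parentheses at offsets 0, 1, 3 and 4.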

There is a convention that capture register 0 is reserved for
the full match of the expression. This is how it works with
the array reported by POSIX's regexec. Thus the numbering of
the subexpressions is one-based.

The POSIX standard clearly says what happens when a parenthesized
subexpression matches something more than once.

This is spelled out in the documentation page on the regcomp,
regexec and regfree functions. Look for this text:

  "If subexpression i in a regular expression is not contained within
   another subexpression, and it participated in the match several times,
   then the byte offsets in pmatch[i] shall delimit the last such match."

This is exactly the last match behavior observed by Janis in Awk's
match function.
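
The rule can be checked directly against the C library's regexec. The following is only a sketch: the helper name capture and the fixed limit of 10 registers are my own choices, not anything from POSIX or gawk. It compiles an extended regular expression and copies one capture register into a buffer.

```c
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper, for illustration only.
 * Matches the POSIX ERE pat against s and copies capture register idx
 * into out (capacity outsz). Register 0 is the full match.
 * Returns 0 on success; -1 if the pattern fails to compile, the string
 * does not match, or the requested register did not participate. */
int capture(const char *pat, const char *s, size_t idx,
            char *out, size_t outsz)
{
    regex_t re;
    regmatch_t pm[10];   /* arbitrary limit chosen for this sketch */
    int rc = -1;

    if (idx >= 10 || outsz == 0)
        return -1;
    if (regcomp(&re, pat, REG_EXTENDED) != 0)
        return -1;

    /* regexec fills unused or non-participating entries with -1. */
    if (regexec(&re, s, 10, pm, 0) == 0 && pm[idx].rm_so != -1) {
        int len = (int)(pm[idx].rm_eo - pm[idx].rm_so);
        snprintf(out, outsz, "%.*s", len, s + pm[idx].rm_so);
        rc = 0;
    }

    regfree(&re);
    return rc;
}
```

With the pattern "([a-z])+" and the input "abc", register 0 should come back as "abc", while register 1 should hold only "c", the last iteration of the group, as the quoted text requires.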

Basically, subexpressions are a dumb hack. As the regex automaton
traverses its states in response to the input, it triggers
some anchor points associated with the original subexpression,
which copy some data, or keep track of pointers to the start and
end of the match. When the submatch is complete, there is a data
transfer which clobbers any previous such transfer.

There are some tricky rules for nested expressions.
Suppose that we have:

  ( ... ( ... ) ...)
  1     2

2 is nested inside 1.  Suppose that 1 matches multiple times.
Clearly, the corresponding register is left with the most
recent match when the matching is done.

But suppose that subexpression 2 sometimes matches when 1
matches, but sometimes does not match when 1 matches.

I think the obscurely worded POSIX rules are trying to prevent an
inconsistency.

In a nutshell, if a string is reported in register 2 from
matching subexpression 2, it has to be a substring of a match that is
concurrently happening for subexpression 1.

Now suppose that an iteration of 1 matches something,
but in that iteration, subexpression 2 does not match.
Then 2 has to be reset to indicate that it didn't match anything.

Probably, it's a good idea to implement the behavior as follows: whenever a
new capture iteration begins for 1, the register for 2 must also be
cleared, so that it doesn't retain stale data in the event that a match
for 2 is not encountered in the new iteration of 1.
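
Here is a toy model of that clearing rule in C. It is not how any real regex engine is written: match_iterations is a hypothetical, hand-coded matcher for the single pattern (a(b)?)+, and struct reg and UNSET are names invented for this sketch. The point is only the reset of register 2 at the top of each iteration of group 1.

```c
#include <stddef.h>

#define UNSET (-1L)

/* Hypothetical capture register: start and end byte offsets,
 * both UNSET when the subexpression did not participate. */
struct reg { long so, eo; };

/* Hand-written model of the ERE (a(b)?)+ anchored at the start of s.
 * Group 1 is "a" optionally followed by "b"; group 2 is that "b".
 * Returns the number of iterations of group 1 that matched. */
size_t match_iterations(const char *s, struct reg *r1, struct reg *r2)
{
    size_t i = 0, iters = 0;

    r1->so = r1->eo = UNSET;
    r2->so = r2->eo = UNSET;

    while (s[i] == 'a') {
        /* A new iteration of group 1 begins: clear the nested
         * register so that no stale data survives. */
        r2->so = r2->eo = UNSET;

        r1->so = (long)i;
        i++;
        if (s[i] == 'b') {          /* group 2 participates this time */
            r2->so = (long)i;
            i++;
            r2->eo = (long)i;
        }
        r1->eo = (long)i;           /* clobbers the previous iteration */
        iters++;
    }
    return iters;
}
```

For the input "aba", group 1 iterates twice ("ab", then "a"). Register 1 ends up delimiting the final "a" at offsets 2 to 3, and register 2 ends up unset, because the second iteration contained no "b". Without the reset at the top of the loop, register 2 would still report the "b" from the first iteration, which is not a substring of what register 1 reports.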

This stuff is not really that usable for repetition; captures
were clearly envisioned mainly for non-repeating matching without
any Kleene stars or {m,n} repetitions.

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca



