

Newsgroup: comp.lang.awk

GNU Awk's types of regular expressions

Started by: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
First post: 2024-11-28 19:18 +0100
Last post: 2024-12-02 23:13 +0100
Articles: 10 (4 participants)



Contents

  GNU Awk's types of regular expressions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-11-28 19:18 +0100
    Re: GNU Awk's types of regular expressions Kaz Kylheku <643-408-1753@kylheku.com> - 2024-11-29 04:13 +0000
      Re: GNU Awk's types of regular expressions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-11-29 09:33 +0100
      Re: GNU Awk's types of regular expressions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-11-30 12:41 +0100
    Re: GNU Awk's types of regular expressions arnold@freefriends.org (Aharon Robbins) - 2024-12-01 20:20 +0000
      Re: GNU Awk's types of regular expressions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-12-01 22:17 +0100
        Re: GNU Awk's types of regular expressions arnold@skeeve.com (Aharon Robbins) - 2024-12-01 23:18 +0000
          Re: GNU Awk's types of regular expressions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-12-02 08:00 +0100
            Re: GNU Awk's types of regular expressions arnold@skeeve.com (Aharon Robbins) - 2024-12-02 20:58 +0000
              Re: GNU Awk's types of regular expressions Janis Papanagnou <janis_papanagnou+ng@hotmail.com> - 2024-12-02 23:13 +0100

#9866 — GNU Awk's types of regular expressions

From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date: 2024-11-28 19:18 +0100
Subject: GNU Awk's types of regular expressions
Message-ID: <viac5m$l8oh$1@dont-email.me>

GNU Awk currently has three types of regular expressions: in
addition to the standard regexp constants (/regex/) and the dynamic
regexps ("regex", or variables containing "regex"), newer versions
also support first-class regexp objects (@/regex/, "Strongly Typed
Regexp Constants").

One principal advantage of regexp-constants is that the engine to
parse the regexp can be created in advance, while a dynamic regexp
may be constructed dynamically (from strings) and needs an explicit
runtime-step to create the engine before the matching can be done.
Now I assumed that  @/regex-const/  would in that respect behave as
 /regex-const/ ... - until I found in the GNU Awk manual this text:

|
| Thus, if you have something like this:
|
|   re = @/don't panic/
|   sub(/don't/, "do", re)
|   print typeof(re), re
|
| then re retains its type, but now attempts to match the string ‘do
| panic’. This provides a (very indirect) way to create regexp-typed
| variables at runtime.
|

(I'm astonished that first class regexp objects can be dynamically
changed. But that is not my point here; I'm interested in potential
pre-compiles of regexp constants...)

This would imply that the first class regexp constants can be changed
like dynamic regexps and that there's no regexp pre-compile involved.
This would also raise the suspicion that the "normal" regexp-constants are
probably also not precomputed.

So constant-regexps (both forms) have (only?) the advantage that the
regexp-syntax can be (initially during awk parsing) checked, e.g.,

 	re = @/don't panic[/
 	     ^ unterminated regexp

And dynamic regexps and first class regexps that got changed (e.g.
by code like

  sub(/don't/, "do[", re)

in the above sample snippet) would both create runtime errors, e.g.

  error: Unmatched [, [^, [:, [., or [=: /do[ panic/
  fatal: could not make typed regex

(as all ill-formed regexp-types will produce a runtime error).
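
A small sketch of the parse-time case, assuming gawk is on the PATH.
The "START" marker is a hypothetical addition: it only serves to show
that nothing in the program runs when the regexp constant is rejected
while the program text is being parsed.

```sh
# Capture stdout and the exit status; diagnostics on stderr are discarded.
if out=$(gawk 'BEGIN { print "START"; re = @/do not panic[/ }' 2>/dev/null)
then status=0
else status=$?
fi
printf "stdout=[%s] status=%s\n" "$out" "$status"
```

Standard output stays empty and the exit status is nonzero, because
the unterminated regexp is a syntax error and the BEGIN rule is never
executed.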

Janis



#9867

From: Kaz Kylheku <643-408-1753@kylheku.com>
Date: 2024-11-29 04:13 +0000
Message-ID: <20241128200247.439@kylheku.com>
In reply to: #9866

On 2024-11-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> GNU Awk currently has three types of regular expressions: in
> addition to the standard regexp constants (/regex/) and the dynamic
> regexps ("regex", or variables containing "regex"), newer versions
> also support first-class regexp objects (@/regex/, "Strongly Typed
> Regexp Constants").
>
> One principal advantage of regexp-constants is that the engine to
> parse the regexp can be created in advance, while a dynamic regexp
> may be constructed dynamically (from strings) and needs an explicit
> runtime-step to create the engine before the matching can be done.
> Now I assumed that  @/regex-const/  would in that respect behave as
>  /regex-const/ ... - until I found in the GNU Awk manual this text:
>
>|
>| Thus, if you have something like this:
>|
>|   re = @/don't panic/
>|   sub(/don't/, "do", re)
>|   print typeof(re), re
>|
>| then re retains its type, but now attempts to match the string ‘do
>| panic’. This provides a (very indirect) way to create regexp-typed
>| variables at runtime.
>|
>
> (I'm astonished that first class regexp objects can be dynamically
> changed. But that is not my point here; I'm interested in potential
> pre-compiles of regexp constants...)

I would flatly reject a commit to do such a thing.  Yikes!

What representation is it working on? If the regex contains
a match for a literal backslash using escaping, does that
count as two backslash characters when you operate on it?
Or is it a single backslash? Can you replace the second
backslash with an 'n' and have the pair turn into a newline?

Is it just tromboning back to printed representation,
and then parsing again?

I provide this:

  1> (regex-source #/a.*b(c|d)/)
  (compound #\a (0+ wild) #\b (or #\c #\d))

You can get the source code of the regex object as a nested
list with symbols, characters and other objects.

When you have this, you can analyze and transform it.

Then you can call regex-compile on the result.

For instance, prepend a match for the z character:

  2> (regex-compile ^(compound #\z ,*(cdr *1)))
  #/za.*b(c|d)/

This is robust; you're not dealing with any character-syntax issues like
escapes, because you have the abstract syntax tree of the regex.

> This would imply that the first class regexp constants can be changed
> like dynamic regexps and that there's no regexp pre-compile involved.

Not necessarily; it could be that a new regex is compiled, and put into
the re variable, clobbering the old regex, which is freed (if it
hits a refcount of zero or whatever mem management is used).

It could also (in combination with this) be lazy. So that is to say
@/abc/ will just store the textual source code of the regex into
the regex object, but not compile anything. When it comes time to
use the regex, on first use, it is compiled and then cached into
that object.  When the regex is edited, the cache is invalidated.

Someone will undoubtedly chime in confirming or refuting these
hypotheses.

It would be pretty silly if these regex objects didn't cache a compiled
regex across multiple uses.

> And dynamic regexps and first class regexps that got changed (e.g.
> by code like
>
>   sub(/don't/, "do[", re)
>
> in above sample snippet) would both create runtime errors, e.g.

Have you tried this? Do you get an error at sub() time, or when
you later try to use re?

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca



#9868

From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date: 2024-11-29 09:33 +0100
Message-ID: <vibu9m$10nck$1@dont-email.me>
In reply to: #9867

On 29.11.2024 05:13, Kaz Kylheku wrote:
> On 2024-11-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>> [...]
> 
>> And dynamic regexps and first class regexps that got changed (e.g.
>> by code like
>>
>>   sub(/don't/, "do[", re)
>>
>> in above sample snippet) would both create runtime errors, e.g.
> 
> Have you tried this?

Yes. (With a response that appeared in my post behind the "e.g." [that
you snipped].)

> Do you get an error at sub() time, or when you later try to use re?

It seems to appear with sub(); in the snippet
    ...
    print "PRE"
    sub(/don't/, "do[", re)
    print "POST"
    print typeof(re), re
    ...
"PRE" is printed but not "POST".

Janis



#9869

From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date: 2024-11-30 12:41 +0100
Message-ID: <vietm2$1munm$1@dont-email.me>
In reply to: #9867

Coming back to this...

On 29.11.2024 05:13, Kaz Kylheku wrote:
> [...]
> 
> It could also (in combination with this) be lazy. [...]

Yes. There's already something like "on-demand logic" there, where
in  print > "a_file"  the file won't be created or overwritten if
the statement doesn't get triggered, and subsequent calls won't
overwrite it. So it would indeed not be surprising if such a
mechanism were implemented. (But I haven't examined the awk code.)

> 
> Someone will undoubtedly chime in confirming or refuting these
> hypotheses.
> 
> It would be pretty silly if these regex objects didn't cache a compiled
> regex across multiple uses.

True. But, OTOH, in GNU Awk there are a couple of functions that are
just passed through to other (external) library functions. If these
functions happen to support only an interface like  match(re,str)
where match() supports no [thread-safe] static memory for "re",
the caller might have no choice. (I don't know how it's actually
implemented.)

Janis

> [...]



#9870

From: arnold@freefriends.org (Aharon Robbins)
Date: 2024-12-01 20:20 +0000
Message-ID: <674cc506$0$711$14726298@news.sunsite.dk>
In reply to: #9866

Hi. Mack The Knife pointed me at this question.

This kind of query should go to the bug list (where I'll see it).
I skim the help list occasionally but don't reply to mails there.

In article <viac5m$l8oh$1@dont-email.me> Janis writes:
>GNU Awk currently has three types of regular expressions: in
>addition to the standard regexp constants (/regex/) and the dynamic
>regexps ("regex", or variables containing "regex"), newer versions
>also support first-class regexp objects (@/regex/, "Strongly Typed
>Regexp Constants").
>
>One principal advantage of regexp-constants is that the engine to
>parse the regexp can be created in advance, while a dynamic regexp
>may be constructed dynamically (from strings) and needs an explicit
>runtime-step to create the engine before the matching can be done.

Even for such dynamically created regexps, the regexp is compiled once and
cached, not compiled each time it's used (as long as it doesn't change).

>Now I assumed that  @/regex-const/  would in that respect behave as
> /regex-const/ ... - until I found in the GNU Awk manual this text:
>
>| Thus, if you have something like this:
>|
>|   re = @/don't panic/
>|   sub(/don't/, "do", re)
>|   print typeof(re), re
>|
>| then re retains its type, but now attempts to match the string ‘do
>| panic’. This provides a (very indirect) way to create regexp-typed
>| variables at runtime.
>
>(I'm astonished that first class regexp objects can be dynamically
>changed. But that is not my point here; I'm interested in potential
>pre-compiles of regexp constants...)

Since `re' is a variable, it can be changed, just as when you do

	str = "don't panic"
	sub(/don't/, "do", str)

>This would imply that the first class regexp constants can be changed
>like dynamic regexps and that there's no regexp pre-compile involved.

"Not so, Watson! Not so!"  When you do

	re = @/don't panic/

gawk uses reference counted pointers to the original object; the
original strongly typed regexp is precompiled and remains that way.

As soon as you go to *change* `re', gawk makes a copy of the string
value of the original regexp, makes the substitution, notes that
it's a strongly typed regexp, and compiles the new regexp. From then
on, the cached compiled regexp is used for matching.

>This would also rise suspicion that the "normal" regexp-constants are
>probably also not precomputed.

Also not true.

>So constant-regexps (both forms) have (only?) the advantage that the
>regexp-syntax can be (initially during awk parsing) checked, e.g.,
>
> 	re = @/don't panic[/
> 	     ^ unterminated regexp

Incorrect, they are compiled when the program is parsed.

>And dynamic regexps and first class regexps that got changed (e.g.
>by code like
>
>  sub(/don't/, "do[", re)
>
>in above sample snippet) would both create runtime errors, e.g.
>
>  error: Unmatched [, [^, [:, [., or [=: /do[ panic/
>  fatal: could not make typed regex
>
>(as all ill-formed regexp-types will produce a runtime error).

Well, of course.

In short, I jump through a lot of hoops in order to avoid recompiling
regexps if it's not necessary.

Hope this helps,

Arnold
-- 
Aharon (Arnold) Robbins 		arnold AT skeeve DOT com



#9871

From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date: 2024-12-01 22:17 +0100
Message-ID: <viijof$2q4u6$1@dont-email.me>
In reply to: #9870

On 01.12.2024 21:20, Aharon Robbins wrote:
> Hi. Mack The Knife pointed me at this question.
> 
> This kind of query should go to the bug list (where I'll see it).

Oh, I didn't consider what I wrote and suspected to be a bug, so
it didn't occur to me to use a bug mailing list.

> [ explanations snipped ]
> 
> In short, I jump through a lot of hoops in order to avoid recompiling
> regexps if it's not necessary.
> 
> Hope this helps,

Yes. Thanks for shedding light on the internals. And glad to hear
how it's actually implemented.

Janis



#9872

From: arnold@skeeve.com (Aharon Robbins)
Date: 2024-12-01 23:18 +0000
Message-ID: <674ceed3$0$708$14726298@news.sunsite.dk>
In reply to: #9871

In article <viijof$2q4u6$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>> This kind of query should go to the bug list (where I'll see it).
>
>Oh, I didn't consider what I wrote and suspected to be a bug, so
>it didn't occur to me to use a bug mailing list.

Legitimate questions like this about how gawk works internally,
even if not bug reports, are welcome on the bug list. Sending them
there makes it easy for me to respond to them.

And of course, you can always look at the source code.

Arnold
-- 
Aharon (Arnold) Robbins 		arnold AT skeeve DOT com



#9873

From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date: 2024-12-02 08:00 +0100
Message-ID: <vijlu6$35un4$1@dont-email.me>
In reply to: #9872

On 02.12.2024 00:18, Aharon Robbins wrote:
> 
> And of course, you can always look at the source code.

While I do that occasionally with some [better known] software
packages I'm not familiar with the GNU Awk source code and it
would IME require quite some analysis, how it's structured,
what's going on, and in the end you are typically never quite
sure whether it does what you think it does.

This isn't meant as a statement on the quality of software design
or the existence of useful comments in GNU Awk. It's just that the
last time I looked into the sources (with the intention to add
new syntax and semantics for a feature I'd have liked) I wasn't
able to identify how to do it without doing harm to the code;
I'm lacking the familiarity with this source code. Of course I
could have looked into the source code instead of posting, but
the described experience led me to not take that path.

Re "(where I'll see it)": My post was not meant to
address/bother you personally - yet, all the more I appreciate
your reply! In this newsgroup there are also some folks who have
some expertise and might answer such questions. And I'm not a
"client" of the mailing list. (Just to make you understand why
I used this Usenet communication channel.) And finally, there
was some discussion recently in another newsgroup about regexps
and I wanted to initiate a potential discussion on the topic.

Janis



#9874

From: arnold@skeeve.com (Aharon Robbins)
Date: 2024-12-02 20:58 +0000
Message-ID: <674e1f7e$0$710$14726298@news.sunsite.dk>
In reply to: #9873

In article <vijlu6$35un4$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>This isn't meant as a statement on the quality of software design
>or the existence of useful comments in GNU Awk. It's just that the
>last time I looked into the sources (with the intention to add
>new syntax and semantics for a feature I'd have liked) I wasn't
>able to identify how to do it without doing harm to the code;
>I'm lacking the familiarity with this source code. Of course I
>could have looked into the source code instead of posting, but
>the described experience led me to not take that path.

You can always ask me directly.

>Re "(where I'll see it)": My post was not meant to
>address/bother you personally - yet, all the more I appreciate
>your reply! In this newsgroup there are also some folks who have
>some expertise and might answer such questions.

True, but ultimately I'm authoritative. :-)

>And I'm not a "client" of the mailing list.

You don't have to be subscribed to the bug list to send messages
there.

Arnold
-- 
Aharon (Arnold) Robbins 		arnold AT skeeve DOT com



#9875

From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Date: 2024-12-02 23:13 +0100
Message-ID: <vilbec$3jkjo$1@dont-email.me>
In reply to: #9874

On 02.12.2024 21:58, Aharon Robbins wrote:
> In article <vijlu6$35un4$1@dont-email.me>,
> Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>> [...]
> 
> You can always ask me directly.

Thanks :-)

(It's been so long since I wrote to you that I completely forgot about
that possibility. Shame on me.)

>
>> Re "(where I'll see it)": My post was not meant to
>> address/bother you personally - yet, all the more I appreciate
>> your reply! In this newsgroup there are also some folks who have
>> some expertise and might answer such questions.
> 
> True, but ultimately I'm authoritative. :-)

Undisputedly ;-)

> 
>> And I'm not a "client" of the mailing list.
> 
> You don't have to be subscribed to the bug list to send messages
> there.

Good to know.

See you,
Janis




