Groups | Search | Server Info | Login | Register


Groups > comp.lang.awk > #9820

Longest Common Substring

Path csiph.com!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From porkchop@invalid.foo (Mike Sanders)
Newsgroups comp.lang.awk
Subject Longest Common Substring
Date Fri, 23 Aug 2024 04:39:59 -0000 (UTC)
Organization A noiseless patient Spider
Lines 57
Sender Mike Sanders <busybox@sdf.org>
Message-ID <va93qv$pome$1@dont-email.me> (permalink)
Injection-Date Fri, 23 Aug 2024 06:39:59 +0200 (CEST)
Injection-Info dont-email.me; posting-host="b537a019881267c51ad5afadf5b95d49"; logging-data="844494"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+DS7a8XEF+lFOAZtVW1z8k"
User-Agent tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Cancel-Lock sha1:Pf32RaD6eKe8ijP8H0RYqPAb9cY=
Xref csiph.com comp.lang.awk:9820

Show key headers only | View raw


# largest q-gram (longest common substring) in AWK: Michael Sanders - 2024
# A q-gram is a contiguous sequence of characters of fixed length
# see also: https://en.wikipedia.org/wiki/N-gram

BEGIN {
  str1 = "MAWK"
  str2 = "gawk"
  q = largest_qgram(str1, str2)
  l = length(q)

  print str1 " : " str2

  if (length(q) > 0) {
    printf("largest q-gram (longest common substring): \"%s\" size: %d\n", q, l)
  } else {
    printf("no common substring found\n")
  }

}

# find largest q-gram (longest common substring) between 2 words
function largest_qgram(str1, str2, i, j, maxlen, largest, tmp) {
  maxlen = 0
  largest = ""

  # for convenience's sake
  str1 = tolower(str1)
  str2 = tolower(str2)

  # loop over each character of 1st string
  for (i = 1; i <= length(str1); i++) {
     # loop over each character of 2nd string
     for (j = 1; j <= length(str2); j++) {
        tmp = ""
        # check for matching substrings at positions i and j
        while (substr(str1, i + length(tmp), 1) == substr(str2,
              j + length(tmp), 1) && i + length(tmp) <=
              length(str1) && j + length(tmp) <= length(str2)) {
           tmp = tmp substr(str1, i + length(tmp), 1)
        }

        # if matching substring is longer than previous maximum
        if (length(tmp) > maxlen) {
           maxlen = length(tmp)
           largest = tmp
        }
     }
  }
  return largest
}

# eof

-- 
:wq
Mike Sanders

Back to comp.lang.awk | Previous | Next | Find similar


Thread

Longest Common Substring porkchop@invalid.foo (Mike Sanders) - 2024-08-23 04:39 +0000

csiph-web