Groups | Search | Server Info | Login | Register


Groups > comp.lang.awk > #9797

Re: (Long post) Metaphone Algorithm In AWK

From porkchop@invalid.foo (Mike Sanders)
Newsgroups comp.lang.awk
Subject Re: (Long post) Metaphone Algorithm In AWK
Date 2024-08-19 03:22 +0000
Organization A noiseless patient Spider
Message-ID <v9udpv$2nq9o$1@dont-email.me> (permalink)
References <v9qbgh$1u7qe$1@dont-email.me> <878qwts8bd.fsf@bsb.me.uk>

Show all headers | View raw


Ben Bacarisse <ben@bsb.me.uk> wrote:

> Using a word list, I found some odd matches.

Hey Ben. Probaly a ton of them (see refactored script below...)

Most often used invocation this time around was:

echo "taxes drunkeness indigestion" | awk -f metaphone.awk -v find=texas

Lengthened phoneme count, accounted for doublets like "ness",
added Levenshtein distance & note next url (& its note near
the bottom of the page):

https://tilores.io/metaphone-phonetic-algorithm-online-tool

Likely lots left to do. Have at it if you're in the mood,
I'll weave in your & others changes (with credit) & repost
here if & when they appear.

Will catch up soon & have fun =)

# Metaphone Algorithm in AWK v2: Michael Sanders - 2024

BEGIN { find_code = metaphone(find, 10) }

# -----------------------------------------------------------------

# entry point...
{
  for (x = 1; x <= NF; x++) {
    if (similarity(metaphone($x, 10), find_code) >= 80)
      print find " : " $x
  }
}

# -----------------------------------------------------------------

# check if character is a vowel
function isvowel(c, is_vowel) { return c ~ /[AEIOU]/ }

# -----------------------------------------------------------------

# combine array into a string
function combine(array, result, i) {
  for (i in array) { result = result array[i] }

  return result
}

# -----------------------------------------------------------------

# add a character or string to the result array
function phonize(s, result, p_idx, i) {
  for (i = 1; i <= length(s); i++) {
    result[p_idx++] = substr(s, i, 1)
  }
  return p_idx
}

# -----------------------------------------------------------------

# compute metaphone code
function metaphone(word, max_phonemes, result,
                   p_idx, w_idx, c, next_c, last_c) {
  w_idx = 1
  p_idx = 1

  while (w_idx <= length(word) && p_idx <= max_phonemes) {
    c = toupper(substr(word, w_idx, 1))
    next_c = toupper(substr(word, w_idx + 1, 1))

    # skip duplicate letters
    if (c == last_c) {
      w_idx++
      continue
    }

    if (c == "B") {
      if (w_idx == 1 ||
          toupper(substr(word, w_idx - 1, 1)) != "M") {
        p_idx = phonize("B", result, p_idx)
      }
    } else if (c == "C") {
      if (next_c == "I" &&
          toupper(substr(word, w_idx + 2, 1)) ~ /[AO]/) {
        p_idx = phonize("SH", result, p_idx)
      } else if (next_c == "H") {
        p_idx = phonize("X", result, p_idx)
        w_idx++
      } else {
        p_idx = phonize("K", result, p_idx)
      }
    } else if (c == "D") {
      if (next_c == "G" &&
          toupper(substr(word, w_idx + 2, 1)) ~ /[EIY]/) {
        p_idx = phonize("J", result, p_idx)
        w_idx++
      } else {
        p_idx = phonize("T", result, p_idx)
      }
    } else if (c == "G") {
      if (next_c == "I" &&
          toupper(substr(word, w_idx + 2, 1)) ~ /[EOY]/) {
        p_idx = phonize("J", result, p_idx)
      } else if (next_c == "H" &&
                 !(w_idx > 1 &&
                  toupper(substr(word, w_idx - 1, 1)) ~ /[BDH]/)) {
        p_idx = phonize("F", result, p_idx)
        w_idx++
      } else if (next_c != "N" ||
                 toupper(substr(word, w_idx + 2, 1)) != "E") {
        p_idx = phonize("K", result, p_idx)
      }
    } else if (c == "H") {
      if (isvowel(next_c) &&
          toupper(substr(word, w_idx - 1, 1)) !~ /[CGPST]/) {
        p_idx = phonize("H", result, p_idx)
      }
    } else if (c == "K") {
      if (w_idx == 1 ||
          toupper(substr(word, w_idx - 1, 1)) != "C") {
        p_idx = phonize("K", result, p_idx)
      }
    } else if (c == "N") {
      p_idx = phonize("N", result, p_idx)
    } else if (c == "P") {
      if (next_c == "H") {
        p_idx = phonize("F", result, p_idx)
      } else {
        p_idx = phonize("P", result, p_idx)
      }
    } else if (c == "Q") {
      p_idx = phonize("K", result, p_idx)
    } else if (c == "R") {
      p_idx = phonize("R", result, p_idx)
    } else if (c == "S") {
      if (next_c == "H") {
        p_idx = phonize("SH", result, p_idx)
        w_idx++
      } else if (next_c == "C" &&
                 toupper(substr(word, w_idx + 2, 1)) == "H") {
        p_idx = phonize("X", result, p_idx)
        w_idx += 2
      } else {
        p_idx = phonize("S", result, p_idx)
      }
    } else if (c == "T") {
      if (next_c == "I" &&
          toupper(substr(word, w_idx + 2, 1)) ~ /[AO]/) {
        p_idx = phonize("X", result, p_idx)
      } else if (next_c == "H") {
        p_idx = phonize("0", result, p_idx)
        w_idx++
      } else if (next_c != "C" ||
                 toupper(substr(word, w_idx + 2, 1)) != "H") {
        p_idx = phonize("T", result, p_idx)
      }
    } else if (c == "V") {
      p_idx = phonize("F", result, p_idx)
    } else if (c == "W" || c == "Y") {
      if (isvowel(next_c)) {
        p_idx = phonize(c, result, p_idx)
      }
    } else if (c == "X") {
      p_idx = phonize("KS", result, p_idx)
    } else if (c == "Z") {
      p_idx = phonize("S", result, p_idx)
    }

    last_c = c  # Track last character processed
    w_idx++
  }

  if (p_idx > max_phonemes) p_idx = max_phonemes

  return substr(combine(result), 1, p_idx)
}

# -----------------------------------------------------------------

# calculate levenshtein distance between 2 words
# https://en.wikipedia.org/wiki/Levenshtein_distance
# https://rosettacode.org/wiki/Levenshtein_distance#AWK

function levenshtein(s1, s2, l1, l2, cost, i, j, dist) {
  l1 = length(s1)
  l2 = length(s2)

  # initialize distance array
  for (i = 0; i <= l1; i++) dist[i, 0] = i
  for (j = 0; j <= l2; j++) dist[0, j] = j

  # compute distance
  for (i = 1; i <= l1; i++) {
    for (j = 1; j <= l2; j++) {
      cost = (substr(s1, i, 1) ==
              substr(s2, j, 1)) ? 0 : 1
      dist[i, j] = min3(dist[i-1, j] + 1,      # deletion
                        dist[i, j-1] + 1,      # insertion
                        dist[i-1, j-1] + cost) # substitution
    }
  }

  return dist[l1, l2]
}

# -----------------------------------------------------------------

# minimum of 3 numbers (levenshtein helper function)
function min3(a, b, c) {
  return (a < b ? (a < c ? a : c) : (b < c ? b : c))
}

# -----------------------------------------------------------------

# calculate similarity as a percentage
function similarity(word1, word2, lev_dist, max_len) {
  lev_dist = levenshtein(word1, word2)
  max_len  = (length(word1) >
             length(word2)) ?
             length(word1)  :
             length(word2)

  if (max_len == 0) return 100  # both strings empty

  return (1 - lev_dist / max_len) * 100
}

# eof

-- 
:wq
Mike Sanders

Back to comp.lang.awk | Previous | NextPrevious in thread | Next in thread | Find similar


Thread

(Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-17 14:18 +0000
  Re: (Long post) Metaphone Algorithm In AWK Ben Bacarisse <ben@bsb.me.uk> - 2024-08-19 00:46 +0100
    Re: (Long post) Metaphone Algorithm In AWK Ben Bacarisse <ben@bsb.me.uk> - 2024-08-19 02:15 +0100
    Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-19 03:22 +0000
      Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-19 04:34 +0000
    Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-20 05:45 +0000
      Re: (Long post) Metaphone Algorithm In AWK Ben Bacarisse <ben@bsb.me.uk> - 2024-08-21 00:58 +0100
        Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-21 01:07 +0000
          Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-21 02:50 +0000
          Re: (Long post) Metaphone Algorithm In AWK Ben Bacarisse <ben@bsb.me.uk> - 2024-08-21 09:15 +0100
            Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-21 19:13 +0000
  Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-20 11:33 +0000
  Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-21 02:42 +0000
    AWK language trivia (Was: (Long post) Metaphone Algorithm In AWK) gazelle@shell.xmission.com (Kenny McCormack) - 2024-08-21 03:13 +0000
      Re: AWK language trivia porkchop@invalid.foo (Mike Sanders) - 2024-08-21 05:32 +0000
  Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-21 19:03 +0000
  Re: (Long post) Metaphone Algorithm In AWK porkchop@invalid.foo (Mike Sanders) - 2024-08-23 06:13 +0000

csiph-web