Kofax

Kofax TotalAgility - How does fuzzy searching work?

Article # 3031083

Applies to

  • TotalAgility 7.5
  • TotalAgility 7.6
  • TotalAgility 7.7
  • TotalAgility 7.8
  • TotalAgility 7.9

Question

How does a fuzzy search find records that match a given query text/phrase?

Answer

Two factors come into play when performing a fuzzy search. The first is whether the query text/phrase is similar in length to a record. Depending on the length of the query text/phrase, the range of database word lengths taken into account is increased or reduced:

  • It will find words containing (+1) character for a text/phrase of length 4
  • It will find words containing (-/+ 1) characters for a text/phrase of length 5
  • It will find words containing (-/+ 2) characters for a text/phrase of length 6 or 7
  • It will find words containing (-2 / +3) characters for a text/phrase of length 8
  • It will find words containing (-/+ 3) characters for a text/phrase of length 9
  • It will find words containing (-3 / +4) characters for a text/phrase of length 10, 11 or 12
  • For a longer text/phrase (length 13 and above), it will find words containing (-4 / +5) characters
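
The length rules in the list above can be sketched as a small lookup. This is only an illustration of the list, not TotalAgility's actual implementation: the function name `length_window` is hypothetical, and the handling of queries shorter than 4 characters (which the list does not cover) is an assumption.

```python
def length_window(query_len: int) -> tuple[int, int]:
    """Return the (min, max) record length considered for a query length.

    Hypothetical helper that mirrors the list above; it is not a
    TotalAgility API.
    """
    if query_len < 4:
        # Assumption: lengths below 4 are not covered by the list,
        # so this sketch only accepts records of the same length.
        minus, plus = 0, 0
    elif query_len == 4:
        minus, plus = 0, 1
    elif query_len == 5:
        minus, plus = 1, 1
    elif query_len in (6, 7):
        minus, plus = 2, 2
    elif query_len == 8:
        minus, plus = 2, 3
    elif query_len == 9:
        minus, plus = 3, 3
    elif query_len in (10, 11, 12):
        minus, plus = 3, 4
    else:
        minus, plus = 4, 5
    return query_len - minus, query_len + plus


for n in (4, 8, 11, 20):
    print(n, length_window(n))
# 4 (4, 5)
# 8 (6, 11)
# 11 (8, 15)
# 20 (16, 25)
```

A record is a candidate only if its length, including spaces, falls inside the returned range.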

The second factor is whether the Levenshtein distance between the queried text/phrase and the text/phrase in the database, expressed as a percentage of matching characters, is greater than 70%.
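
For readers unfamiliar with the measure, the following is a minimal sketch of the textbook Levenshtein edit distance and one way to turn it into a percentage. The normalization used here (one minus the distance divided by the longer length) and the names `levenshtein` and `similarity_percent` are assumptions for illustration. The exact scoring used by TotalAgility is not documented in this article, so results from this sketch may not match the percentages quoted in the examples.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance.

    Counts the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b.
    """
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Assumed normalization: 100 * (1 - distance / longer length)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))      # 3
print(similarity_percent("abcd", "abcf"))    # 75.0
```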

Example 1

If a database has a record called "Hospital de Cataluna", searching the word "Hospital" would not return the record. This is because the query word is 8 characters, whereas the record in the database (including spaces) is 20 characters. For a query of length 8, only records of 6 to 11 characters are considered, so the record doesn't fall within the length requirements above.

Example 2

If a database has records called "Hospital de Valencia" and "Hospital de Cataluna", searching the text "Hospital de Cataluna" will return both records. "Hospital de Valencia" is returned for two reasons: first, it is the same length as the query (20 characters); second, its Levenshtein distance is greater than 70%.

The two phrases share the highlighted characters below (as well as the spaces).

  • Hospital de Valencia
  • Hospital de Cataluna

Including spaces, "Hospital de Valencia" shares 15 characters with "Hospital de Cataluna", which results in a Levenshtein distance of 75%, i.e. 15 / 20 * 100.

Example 3

If a database has records called "Hospital Trim" and "Hospital de Cataluna", searching the text "Hospital de" will return only "Hospital Trim".

"Hospital de Cataluna" is 20 characters, whereas the queried phrase is only 11 characters. For a query of length 11, only records of 8 to 15 characters are considered, so this record doesn't meet the length requirements above.

"Hospital Trim" is 13 characters, so it does meet the above length requirements for the queried phrase. Additionally, it has a Levenshtein distance of approximately 70%, as it shares 9 of the same characters (including spaces), i.e. 9 / 13 * 100 ≈ 69.2.
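
The length checks in the three examples can be reproduced with a short sketch. It covers only the first factor (length), not the Levenshtein comparison. The table `WINDOWS` and the function `passes_length_check` are hypothetical names, and treating queries shorter than 4 characters like length 4 is an assumption, since the article does not cover them.

```python
# (largest query length the row applies to, characters below, characters above),
# transcribed from the list in the Answer section.
WINDOWS = [(4, 0, 1), (5, 1, 1), (7, 2, 2), (8, 2, 3), (9, 3, 3), (12, 3, 4)]


def passes_length_check(query: str, record: str) -> bool:
    """Return True if the record length is within the window for the query."""
    n = len(query)
    minus, plus = 4, 5  # rule for queries longer than 12 characters
    for max_len, below, above in WINDOWS:
        if n <= max_len:  # assumption: queries under 4 use the length-4 rule
            minus, plus = below, above
            break
    return n - minus <= len(record) <= n + plus


# Example 1: 8-character query, 20-character record
print(passes_length_check("Hospital", "Hospital de Cataluna"))              # False
# Example 2: 20-character query, 20-character record
print(passes_length_check("Hospital de Cataluna", "Hospital de Valencia"))  # True
# Example 3: 11-character query against 13- and 20-character records
print(passes_length_check("Hospital de", "Hospital Trim"))                  # True
print(passes_length_check("Hospital de", "Hospital de Cataluna"))           # False
```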