Scripting - Locate words based on their content

Article # 3035477

Issue

How to locate words based on their content?

 

Solution

This example uses a Script Locator to find all words on a document that consist of an even number of digits. It demonstrates how to use XDocument word collections in conjunction with a Script Locator.

In the Add an Alternative article, we show how to add an alternative to Script Locator extraction results. In the LocateAlternatives event, we receive the XDocument that is currently being processed as input.

Inside an XDocument, we can access Full Text OCR results of the first representation:

pXDoc.Representations(0).Words

Words can also be accessed by addressing a certain page:

pXDoc.Representations(0).Pages(0).Words
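
As a minimal sketch of page-wise access, the words of every page could be visited as shown below. The routine name PrintWordsPerPage is hypothetical, and the sketch assumes that the Pages collection exposes a Count property in the same way the Words collection does and that Debug.Print is available in the script editor:

```vb
' Sketch only: prints the text of every word, page by page.
' Assumption: Pages.Count exists, analogous to Words.Count.
Private Sub PrintWordsPerPage(ByVal pXDoc As CASCADELib.CscXDocument)
  Dim p As Integer
  Dim w As Integer
  Dim pWord As CscXDocWord
  For p = 0 To pXDoc.Representations(0).Pages.Count - 1
    For w = 0 To pXDoc.Representations(0).Pages(p).Words.Count - 1
      Set pWord = pXDoc.Representations(0).Pages(p).Words(w)
      Debug.Print "Page " & p & ": " & pWord.Text
    Next
  Next
End Sub
```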

Now we loop over all words in the following manner:

Dim i As Integer
Dim pWord As CscXDocWord
For i = 0 To pXDoc.Representations(0).Words.Count - 1
    Set pWord = pXDoc.Representations(0).Words(i)
Next

Since we can now access all words, let's first check whether the current word has an even number of characters, and add an alternative for each such word. The CscXDocFieldAlternative object has a word collection of its own, but using it is optional.

At a minimum, an alternative needs its Text property set. To make the word coordinates appear in the Document Viewer after extraction or in TestValidation, we also have to specify the location of the word by setting the PageIndex, Left, Top, Width and Height properties.
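
Setting these properties by hand might look like the following sketch. The helper name AddAlternativeFromWord is hypothetical, and the sketch assumes that CscXDocWord exposes PageIndex, Left, Top, Width and Height properties matching those of the alternative:

```vb
' Sketch only: fills an alternative by hand, without its word collection.
' Assumption: the word object offers the same position properties
' (PageIndex, Left, Top, Width, Height) as the alternative.
Private Sub AddAlternativeFromWord(ByVal pLocator As CASCADELib.CscXDocField, ByVal pWord As CASCADELib.CscXDocWord)
  Dim pAlt As CscXDocFieldAlternative
  Set pAlt = pLocator.Alternatives.Create()
  pAlt.Text = pWord.Text
  pAlt.PageIndex = pWord.PageIndex
  pAlt.Left = pWord.Left
  pAlt.Top = pWord.Top
  pAlt.Width = pWord.Width
  pAlt.Height = pWord.Height
End Sub
```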

However, using the word collection can make the script code simpler. We can add the word to the alternative's word collection, and the PageIndex, Left, Top, Width and Height properties will be set automatically:

Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
  Dim i As Integer
  Dim pWord As CscXDocWord
  Dim pAlt As CscXDocFieldAlternative
  For i = 0 To pXDoc.Representations(0).Words.Count - 1
    Set pWord = pXDoc.Representations(0).Words(i)
    ' Keep only words with an even number of characters
    If Len(pWord.Text) Mod 2 = 0 Then
      Set pAlt = pLocator.Alternatives.Create()
      ' Appending the word sets the position properties automatically
      pAlt.Words.Append(pWord)
    End If
  Next
End Sub

When we now press the test button of the Script Locator, we should see all words with an even number of characters highlighted. Now that we know we have a word with an even number of characters, we have to make sure all characters are digits. We do this by looping over every single character of the word and checking whether its ASCII value is within the ASCII range of digits (48 to 57):

Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
  Dim i As Integer
  Dim j As Integer
  Dim pWord As CscXDocWord
  Dim pAlt As CscXDocFieldAlternative
  Dim sText As String
  Dim bIsNumber As Boolean
  For i = 0 To pXDoc.Representations(0).Words.Count - 1
    Set pWord = pXDoc.Representations(0).Words(i)
    sText = pWord.Text
    ' Assume the word is numeric until a non-digit character is found
    bIsNumber = True
    If Len(sText) Mod 2 = 0 Then
      For j = 1 To Len(sText)
        ' ASCII codes 48 to 57 are the digits 0 to 9
        If Not ((Asc(Mid(sText, j, 1)) < 58) And (Asc(Mid(sText, j, 1)) > 47)) Then
          bIsNumber = False
          Exit For
        End If
      Next
      If bIsNumber Then
        Set pAlt = pLocator.Alternatives.Create()
        pAlt.Words.Append(pWord)
      End If
    End If
  Next
End Sub
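
The digit test in the inner loop could also be factored out into a small helper function, which keeps the event handler shorter. The following is a minimal sketch; the function name IsAllDigits is hypothetical and not part of the original script. Unlike the inline check, it also rejects an empty string:

```vb
' Hypothetical helper: returns True only if sText is non-empty
' and every character is a digit (ASCII codes 48 to 57).
Private Function IsAllDigits(ByVal sText As String) As Boolean
  Dim j As Integer
  Dim c As Integer
  IsAllDigits = (Len(sText) > 0)
  For j = 1 To Len(sText)
    c = Asc(Mid(sText, j, 1))
    If c < 48 Or c > 57 Then
      IsAllDigits = False
      Exit For
    End If
  Next
End Function
```

With such a helper in place, the condition inside the word loop could be written as: If Len(sText) Mod 2 = 0 And IsAllDigits(sText) Then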

When we now press the test button of the Script Locator, all words that consist of an even number of digits are shown in the Document Viewer and on the test tab of the Script Locator. Note that we did not set a confidence, so the confidences of the alternatives are all 0. We can set the confidence by using the Confidence property:

Set pAlt = pLocator.Alternatives.Create()
pAlt.Confidence = 1.0

That way, we set the confidence to 100%. Now let's do that for every word we found:

Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
  Dim i As Integer
  Dim j As Integer
  Dim pWord As CscXDocWord
  Dim pAlt As CscXDocFieldAlternative
  Dim sText As String
  Dim bIsNumber As Boolean
  For i = 0 To pXDoc.Representations(0).Words.Count - 1
    Set pWord = pXDoc.Representations(0).Words(i)
    sText = pWord.Text
    bIsNumber = True
    If (Len(pWord.Text)) Mod 2 = 0 Then
      For j = 1 To Len(sText)
        If Not ((Asc(Mid(sText, j, 1)) < 58) And (Asc(Mid(sText, j, 1)) > 47)) Then
          bIsNumber = False
          Exit For
        End If
      Next
      If bIsNumber Then
        Set pAlt = pLocator.Alternatives.Create()
        pAlt.Words.Append(pWord)
        pAlt.Confidence = 1.0
      End If
    End If
  Next
End Sub

When we press the test button of the Script Locator, we will see that all alternatives now have 100% confidence.

 

Level of Complexity 

High

 

Applies to  

Product: Kofax Transformation Modules
Version: All

 

 
