Kofax

Locate Words Based on Their Content

QAID # 17758 Published

Question / Problem:

Locate Words Based on Their Content

Answer / Solution:

This example shows how to use a Script Locator to find all words in a document that consist of an even number of digits. It demonstrates how to use XDocument word collections in conjunction with a Script Locator.

In the Add an Alternative article, we show how to add an alternative to Script Locator extraction results. In the LocateAlternatives event, we get the XDocument that is currently being processed as input.

Inside an XDocument, we can access Full Text OCR results of the first representation:

pXDoc.Representations(0).Words

Words can also be accessed by addressing a certain page:

pXDoc.Representations(0).Pages(0).Words

Now we loop over all words in the following manner:

Dim i As Integer
Dim pWord As CscXDocWord
For i = 0 To pXDoc.Representations(0).Words.Count - 1
    Set pWord = pXDoc.Representations(0).Words(i)
Next

Now that we can access all words, let's first check whether the current word has an even number of characters, and add an alternative for each such word. The CscXDocFieldAlternative object has a word collection of its own, but using it is optional.

Basically, what an alternative needs is the Text property. To make the word coordinates appear in the Document Viewer after extraction or in TestValidation, we also have to specify the location of the word by setting the PageIndex, Left, Top, Width and Height properties.
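The manual approach described above could look roughly like the following sketch. It assumes that pLocator is the locator field passed to LocateAlternatives, that pWord already refers to the word of interest, and that CscXDocWord exposes location properties with the same names as the alternative; treat those names as assumptions to verify against your installed scripting help.

```vb
' Sketch only: copy text and location from a word to a new alternative.
' Assumption: pWord (CscXDocWord) exposes PageIndex, Left, Top, Width and
' Height under the same names as CscXDocFieldAlternative.
Dim pAlt As CscXDocFieldAlternative
Set pAlt = pLocator.Alternatives.Create()
pAlt.Text = pWord.Text            ' the extracted value
pAlt.PageIndex = pWord.PageIndex  ' page on which the word was found
pAlt.Left = pWord.Left            ' bounding box of the word
pAlt.Top = pWord.Top
pAlt.Width = pWord.Width
pAlt.Height = pWord.Height
```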

However, using the word collection can simplify the script code. We can simply add the word to the alternative's word collection, and the PageIndex, Left, Top, Width and Height properties are set automatically:

Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, _
                                               ByVal pLocator As CASCADELib.CscXDocField)
   Dim i As Integer
   Dim pWord As CscXDocWord
   Dim pAlt As CscXDocFieldAlternative
   For i = 0 To pXDoc.Representations(0).Words.Count - 1
      Set pWord = pXDoc.Representations(0).Words(i)
      ' Keep only words with an even number of characters
      If Len(pWord.Text) Mod 2 = 0 Then
         Set pAlt = pLocator.Alternatives.Create()
         ' Appending the word also sets the alternative's location
         pAlt.Words.Append(pWord)
      End If
   Next
End Sub

When we now press the test button of the Script Locator, we should see all words with an even number of characters highlighted. Now that we can find words with an even number of characters, we have to make sure that all of their characters are digits. We do this by looping over every character of the word and checking whether its ASCII value is within the ASCII range of digits (48 to 57):

Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, _
                                               ByVal pLocator As CASCADELib.CscXDocField)
   Dim i As Integer
   Dim j As Integer
   Dim pWord As CscXDocWord
   Dim pAlt As CscXDocFieldAlternative
   Dim sText As String
   Dim bIsNumber As Boolean
   For i = 0 To pXDoc.Representations(0).Words.Count - 1
      Set pWord = pXDoc.Representations(0).Words(i)
      sText = pWord.Text
      bIsNumber = True
      If Len(sText) Mod 2 = 0 Then
         ' Check that every character is an ASCII digit (codes 48 to 57)
         For j = 1 To Len(sText)
            If Not ((Asc(Mid(sText, j, 1)) < 58) And (Asc(Mid(sText, j, 1)) > 47)) Then
               bIsNumber = False
               Exit For
            End If
         Next
         If bIsNumber Then
            Set pAlt = pLocator.Alternatives.Create()
            pAlt.Words.Append(pWord)
         End If
      End If
   Next
End Sub
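The character check can also be factored into a small helper function, which keeps the event handler shorter. The following is a sketch only; IsAllDigits is a hypothetical name chosen for illustration and is not part of the Kofax scripting API. Unlike the inline check, it also rejects an empty string.

```vb
' Hypothetical helper: returns True only if sText is non-empty and every
' character is an ASCII digit (character codes 48 to 57).
Private Function IsAllDigits(ByVal sText As String) As Boolean
   Dim j As Integer
   Dim nCode As Integer
   IsAllDigits = (Len(sText) > 0)
   For j = 1 To Len(sText)
      nCode = Asc(Mid(sText, j, 1))
      If nCode < 48 Or nCode > 57 Then
         IsAllDigits = False
         Exit For
      End If
   Next
End Function
```

Inside the word loop, the test for the current word could then be written as If Len(sText) Mod 2 = 0 And IsAllDigits(sText) Then.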

When we now press the test button of the Script Locator, all words that consist of an even number of digits are shown in the Document Viewer and on the test tab of the Script Locator. Note that we did not set a confidence, so the confidence of every alternative is 0. We can set it with the Confidence property:

Set pAlt = pLocator.Alternatives.Create()
pAlt.Confidence = 1.0

That way, we set the confidence to 100%. Now let's do that for every word we found:

Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, _
                                               ByVal pLocator As CASCADELib.CscXDocField)
    Dim i As Integer
    Dim j As Integer
    Dim pWord As CscXDocWord
    Dim pAlt As CscXDocFieldAlternative
    Dim sText As String
    Dim bIsNumber As Boolean
    For i = 0 To pXDoc.Representations(0).Words.Count - 1
        Set pWord = pXDoc.Representations(0).Words(i)
        sText = pWord.Text
        bIsNumber = True
        If (Len(pWord.Text)) Mod 2 = 0 Then
            For j = 1 To Len(sText)
                If Not ((Asc(Mid(sText, j, 1)) < 58) And (Asc(Mid(sText, j, 1)) > 47)) Then
                    bIsNumber = False
                    Exit For
                End If
            Next
            If bIsNumber Then
                Set pAlt = pLocator.Alternatives.Create()
                pAlt.Words.Append(pWord)
                pAlt.Confidence = 1.0
            End If
        End If
    Next
End Sub

When we press the test button of the Script Locator, we will see that all alternatives now have 100% confidence.
