Scripting - Locate words based on their content
Issue
How to locate words based on their content?
Solution
This example uses a Script Locator to find all words on a document that consist of an even number of digits. It demonstrates how to use XDocument word collections in conjunction with a Script Locator.
The Add an Alternative article shows how to add an alternative to Script Locator extraction results. In the LocateAlternatives event, we receive the XDocument that is currently being processed as input.
Inside an XDocument, we can access Full Text OCR results of the first representation:
pXDoc.Representations(0).Words
Words can also be accessed by addressing a certain page:
pXDoc.Representations(0).Pages(0).Words
Now we loop over all words in the following manner:
Dim i As Integer
Dim pWord As CscXDocWord
For i = 0 To pXDoc.Representations(0).Words.Count - 1
    ' Object references must be assigned with Set
    Set pWord = pXDoc.Representations(0).Words(i)
Next
Now that we can access all words, let's first check whether the current word has an even number of characters, and add an alternative for each word that does. The CscXDocFieldAlternative object has a word collection of its own, but using it is optional.
Basically, what an alternative needs is the Text property. To make the word coordinates appear in the Document Viewer after extraction or in TestValidation, we also have to specify the location of the word by setting the PageIndex, Left, Top, Width and Height properties.
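As a rough illustration of the manual approach just described, the following sketch copies the text and position from a word to a new alternative. The helper name AddAlternativeFromWord is hypothetical and not part of the product. The sketch also assumes that CscXDocWord exposes Text, PageIndex, Left, Top, Width and Height properties that can be copied one to one.

```vb
' Hypothetical helper (not part of the product API): creates an alternative
' from a word by setting the text and location properties by hand.
' Assumption: CscXDocWord exposes Text, PageIndex, Left, Top, Width and Height.
Private Sub AddAlternativeFromWord(ByVal pLocator As CASCADELib.CscXDocField, ByVal pWord As CASCADELib.CscXDocWord)
    Dim pAlt As CscXDocFieldAlternative
    Set pAlt = pLocator.Alternatives.Create()
    ' The Text property is the minimum an alternative needs
    pAlt.Text = pWord.Text
    ' The location makes the result visible in the Document Viewer
    pAlt.PageIndex = pWord.PageIndex
    pAlt.Left = pWord.Left
    pAlt.Top = pWord.Top
    pAlt.Width = pWord.Width
    pAlt.Height = pWord.Height
End Sub
```

Setting the properties by hand can be useful when the alternative text should differ from the word text, for example after normalizing it.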
However, using the word collection can make the script code simpler. We can add the word to the alternative's word collection, and the PageIndex, Left, Top, Width and Height properties will be set automatically:
Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
    Dim i As Integer
    Dim pWord As CscXDocWord
    Dim pAlt As CscXDocFieldAlternative
    Dim sText As String
    For i = 0 To pXDoc.Representations(0).Words.Count - 1
        Set pWord = pXDoc.Representations(0).Words(i)
        sText = pWord.Text
        ' Keep only words with an even number of characters
        If Len(sText) Mod 2 = 0 Then
            Set pAlt = pLocator.Alternatives.Create()
            pAlt.Words.Append(pWord)
        End If
    Next
End Sub
When we now press the test button of the Script Locator, we should see all words with an even number of characters highlighted. Knowing that a word has an even number of characters, we still have to make sure that all of its characters are digits. We do this by looping over every character of the word and checking whether its ASCII value lies within the ASCII range of the digits (48 to 57):
Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
    Dim i As Integer
    Dim j As Integer
    Dim pWord As CscXDocWord
    Dim pAlt As CscXDocFieldAlternative
    Dim sText As String
    Dim bIsNumber As Boolean
    For i = 0 To pXDoc.Representations(0).Words.Count - 1
        Set pWord = pXDoc.Representations(0).Words(i)
        sText = pWord.Text
        ' Assume the word is numeric until a non-digit character is found
        bIsNumber = True
        If Len(sText) Mod 2 = 0 Then
            For j = 1 To Len(sText)
                ' ASCII 48 to 57 are the digits 0 to 9
                If Not ((Asc(Mid(sText, j, 1)) < 58) And (Asc(Mid(sText, j, 1)) > 47)) Then
                    bIsNumber = False
                    Exit For
                End If
            Next
            If bIsNumber Then
                Set pAlt = pLocator.Alternatives.Create()
                pAlt.Words.Append(pWord)
            End If
        End If
    Next
End Sub
When we now press the test button of the Script Locator, all words that consist of an even number of digits are shown in the Document Viewer and on the test tab of the Script Locator. Please note that we did not set a confidence, so the confidence of every alternative is 0. We can set the confidence by using the Confidence property:
Set pAlt = pLocator.Alternatives.Create()
pAlt.Confidence = 1.0
That way, we set the confidence to 100%. Now let's do that for every word we found:
Private Sub MyScriptLocator_LocateAlternatives(ByVal pXDoc As CASCADELib.CscXDocument, ByVal pLocator As CASCADELib.CscXDocField)
    Dim i As Integer
    Dim j As Integer
    Dim pWord As CscXDocWord
    Dim pAlt As CscXDocFieldAlternative
    Dim sText As String
    Dim bIsNumber As Boolean
    For i = 0 To pXDoc.Representations(0).Words.Count - 1
        Set pWord = pXDoc.Representations(0).Words(i)
        sText = pWord.Text
        bIsNumber = True
        If (Len(pWord.Text)) Mod 2 = 0 Then
            For j = 1 To Len(sText)
                If Not ((Asc(Mid(sText, j, 1)) < 58) And (Asc(Mid(sText, j, 1)) > 47)) Then
                    bIsNumber = False
                    Exit For
                End If
            Next
            If bIsNumber Then
                Set pAlt = pLocator.Alternatives.Create()
                pAlt.Words.Append(pWord)
                pAlt.Confidence = 1.0
            End If
        End If
    Next
End Sub
When we press test on the Script Locator, we will see that all alternatives now have 100% confidence.
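The length and digit checks can also be moved into a small helper function, which keeps the event handler short. The following is only a sketch: the function name HasEvenNumberOfDigits is hypothetical, and it uses nothing beyond the Len, Mid and Asc functions already used in this article.

```vb
' Hypothetical helper: returns True only if sText is non-empty, has an even
' number of characters, and every character is a digit (ASCII 48 to 57).
Private Function HasEvenNumberOfDigits(ByVal sText As String) As Boolean
    Dim j As Integer
    Dim iCode As Integer
    HasEvenNumberOfDigits = False
    ' Reject empty strings and strings with an odd number of characters
    If Len(sText) = 0 Then Exit Function
    If Len(sText) Mod 2 <> 0 Then Exit Function
    For j = 1 To Len(sText)
        iCode = Asc(Mid(sText, j, 1))
        ' Leave with False as soon as a non-digit character is found
        If iCode < 48 Or iCode > 57 Then Exit Function
    Next
    HasEvenNumberOfDigits = True
End Function
```

Inside the loop over the words, the test then becomes a single line such as If HasEvenNumberOfDigits(pWord.Text) Then. Unlike the inline version, this sketch rejects empty strings, which would otherwise pass the even-length test with a length of 0.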
Level of Complexity
High
Applies to
Product | Version | Build | Environment | Hardware |
---|---|---|---|---|
Kofax Transformation Modules | All | | | |
Article # 3035477