Kofax

How To Delete OCR Results After Extraction

Article # 3037104

Issue

How can OCR results be deleted from a document after extraction?

 

Solution

The following script shows how to remove OCR information after extraction. This is generally unnecessary and not recommended. It can make sense, however, when you are working with very large documents and do not know in advance which pages contain the information to extract.

In that case, you cannot restrict OCR to specific pages up front. You must run OCR on all pages and then perform the extraction.

If the documents are larger than 50 pages, it can make sense to discard the OCR information after extraction so that the Validation module does not have to load and save very large XDocuments unnecessarily.

The script below removes the OCR information from every page except the first. The words of the first page are saved in an array, all OCR information is then deleted, and the saved words are added back to the first page.

Type MyWord
    Text As String
    Left As Long
    Top As Long
    Width As Long
    Height As Long
End Type
Private Sub Document_AfterExtract(pXDoc As CASCADELib.CscXDocument)
    Dim i As Long
    Dim wordCount As Long
    Dim words() As MyWord
    ' Remember all words of page one; those are the ones we want to keep
    wordCount = pXDoc.Pages(0).Words.Count
    If wordCount > 0 Then
        ' Valid indices are 0 to wordCount - 1, so size the array to match.
        ' (Sizing it with wordCount would add one empty word on restore.)
        ReDim words(wordCount - 1)
        For i = 0 To wordCount - 1
            Dim wo As MyWord
            wo.Left = pXDoc.Pages(0).Words(i).Left
            wo.Top = pXDoc.Pages(0).Words(i).Top
            wo.Width = pXDoc.Pages(0).Words(i).Width
            wo.Height = pXDoc.Pages(0).Words(i).Height
            wo.Text = pXDoc.Pages(0).Words(i).Text
            words(i) = wo
        Next i
    End If
    ' Remove all representations (the OCR information is stored in these).
    ' Iterate backwards so that removing an item does not shift the
    ' indices of the items still to be removed.
    For i = pXDoc.Representations.Count - 1 To 0 Step -1
        pXDoc.Representations.Remove(i)
    Next i
    ' Create a new, empty representation
    Dim rep As CscXDocRepresentation
    Set rep = pXDoc.Representations.Create("New FR8")
    ' Add the saved words of page one again (skipped if there were none)
    Dim word As CscXDocWord
    For i = 0 To wordCount - 1
        Set word = New CscXDocWord
        word.Text = words(i).Text
        word.Left = words(i).Left
        word.Top = words(i).Top
        word.Width = words(i).Width
        word.Height = words(i).Height
        word.PageIndex = 0
        rep.Pages(0).AddWord(word)
    Next i
    ' Let the system analyze the text line structure
    rep.AnalyzeLines
End Sub
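
The 50-page guideline mentioned earlier can be enforced in the script itself, so that smaller documents keep their full OCR data. The following is a minimal sketch rather than a tested implementation. It assumes that the body of the handler above has been moved into a helper named RemoveOcrExceptFirstPage (a hypothetical name), that this handler replaces the one above, and that pXDoc.CDoc.Pages.Count returns the number of pages in the document. The threshold of 50 comes from the guideline and should be tuned for your project.

```vb
' Sketch only: this handler replaces the Document_AfterExtract shown above.
Private Sub Document_AfterExtract(pXDoc As CASCADELib.CscXDocument)
    ' Strip OCR data only from large documents.
    ' Assumption: 50 is the page threshold; adjust as needed.
    If pXDoc.CDoc.Pages.Count > 50 Then
        RemoveOcrExceptFirstPage pXDoc
    End If
End Sub

' Hypothetical helper: move the body of the original handler here.
Private Sub RemoveOcrExceptFirstPage(pXDoc As CASCADELib.CscXDocument)
    ' 1. Save the words of page one
    ' 2. Remove all representations
    ' 3. Create a new representation and re-add the saved words
    ' 4. Call AnalyzeLines on the new representation
End Sub
```

Keeping the check in the event handler and the cleanup in a separate routine makes it easy to change the threshold, or to disable the cleanup entirely, without touching the code that rebuilds the representation.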

 

Level of Complexity 

High

 

Applies to  

Product                         Version     Build   Environment   Hardware
Kofax Transformation Modules    6.3, 6.4    N/A     N/A           N/A


 
