QAID # 17333 Published
Question / Problem:
Delete OCR Results After Extraction
Answer / Solution:
The following script shows how to remove OCR information after extraction. Generally, this is not necessary, and we don't suggest doing it. It can make sense, however, if you are working with very large documents and you do not know on which page(s) the information is located.
In that case, you cannot restrict the OCR to certain pages up front. You need to do OCR on all pages and then perform the extraction.
If a document is longer than 50 pages, it can make sense to discard the OCR information after extraction, so that the Validation module does not have to load and save very large XDocuments unnecessarily.
The script below removes the OCR information from all pages except the first. It saves the words of the first page in an array, removes all OCR information, and then adds the first-page words back.
Type MyWord
   Text As String
   Left As Long
   Top As Long
   Width As Long
   Height As Long
End Type

Private Sub Document_AfterExtract(pXDoc As CASCADELib.CscXDocument)
   Dim i As Long
   Dim wordCount As Long
   Dim words() As MyWord
   Dim wo As MyWord

   ' Size the array to exactly the number of words on the first page.
   ' ReDim words(n) creates n + 1 elements, so use wordCount - 1 as the upper bound;
   ' otherwise an extra empty word would be added back later.
   wordCount = pXDoc.Pages(0).Words.Count
   If wordCount > 0 Then ReDim words(wordCount - 1)

   ' Remember all words of the first page, those are the ones we want to keep
   For i = 0 To wordCount - 1
      wo.Left = pXDoc.Pages(0).Words(i).Left
      wo.Top = pXDoc.Pages(0).Words(i).Top
      wo.Width = pXDoc.Pages(0).Words(i).Width
      wo.Height = pXDoc.Pages(0).Words(i).Height
      wo.Text = pXDoc.Pages(0).Words(i).Text
      words(i) = wo
   Next i

   ' Remove all representations (the OCR information is stored in these).
   ' Iterate backwards so the remaining indexes stay valid.
   For i = pXDoc.Representations.Count - 1 To 0 Step -1
      pXDoc.Representations.Remove(i)
   Next i

   ' Create a new representation
   Dim rep As CscXDocRepresentation
   Set rep = pXDoc.Representations.Create("New FR8")

   ' Add the words from the original first page again
   Dim word As CscXDocWord
   For i = 0 To wordCount - 1
      Set word = New CscXDocWord
      word.Text = words(i).Text
      word.Left = words(i).Left
      word.Top = words(i).Top
      word.Width = words(i).Width
      word.Height = words(i).Height
      word.PageIndex = 0
      rep.Pages(0).AddWord(word)
   Next i

   ' Let the system analyze the text line structure
   rep.AnalyzeLines
End Sub
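Because the cleanup is only worthwhile for large documents, the 50-page guideline mentioned above could be expressed as a guard. The following is a minimal sketch, not a tested implementation: it assumes that `pXDoc.Pages.Count` returns the number of pages in the document, and the names `MIN_PAGES_FOR_CLEANUP` and `ShouldRemoveOcr` are hypothetical, introduced here only for illustration.

```vb
' Hypothetical threshold, taken from the 50-page guideline in this article
Const MIN_PAGES_FOR_CLEANUP = 50

' Hypothetical helper: returns True if the document has more pages than the threshold
' (assumes pXDoc.Pages.Count is the page count of the document)
Private Function ShouldRemoveOcr(pXDoc As CASCADELib.CscXDocument) As Boolean
   ShouldRemoveOcr = (pXDoc.Pages.Count > MIN_PAGES_FOR_CLEANUP)
End Function
```

One way to use it would be to add `If Not ShouldRemoveOcr(pXDoc) Then Exit Sub` as the first statement of Document_AfterExtract, so that smaller documents keep their full OCR information.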