Kofax

Recommendations and supported input file formats

Article # 3023989

Supported file formats

Documents can be sent to ReadSoft Online in several formats: electronically generated PDFs, scanned images or PDFs, and XML files. A complete list of supported file formats can be found in the ReadSoft Online help.

Electronically created PDFs

Electronically created PDFs have high-quality images, an embedded text layer, and a minimal chance of character misinterpretation. Auto rotation is not performed for documents with text layers, since it is assumed that the file was created with the correct rotation.

Instead of performing OCR on the images, ReadSoft Online reads the embedded text layer.

Multiple overlapping text layers, certain fonts such as Identity-H, and fonts that are not embedded in the PDF may cause issues when reading the text layer. Problems with Identity-H are a well-known issue related to the Unicode lookup table in the PDF. More details can be found in the article "Problem with incorrect capture on a PDF".

PDF documents that require a password to open or decrypt are also not supported. Digitally signed PDF documents, however, should not cause problems with extraction and processing unless there are other underlying issues within the PDF.
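As a rough illustration of the points above, the following sketch pre-checks a PDF before it is sent in. It is a heuristic that only searches the raw bytes for well-known PDF markers; it is not a PDF parser and not part of ReadSoft Online. The function name `precheck_pdf` and the sample bytes are hypothetical. The markers can be missed when they are stored in compressed object streams, and the presence of `/Encrypt` does not always mean that a password is needed to open the file.

```python
def precheck_pdf(data: bytes) -> dict:
    """Heuristic pre-check on raw PDF bytes (hypothetical helper, not a parser)."""
    return {
        # A PDF header is expected near the start of the file.
        "looks_like_pdf": b"%PDF-" in data[:1024],
        # An /Encrypt entry in the trailer indicates an encrypted document.
        "maybe_encrypted": b"/Encrypt" in data,
        # /Identity-H appears as the /Encoding of some composite fonts.
        "uses_identity_h": b"/Identity-H" in data,
    }


# Hand-written sample bytes for illustration only; not a valid, complete PDF.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Font /Subtype /Type0 /Encoding /Identity-H >>\nendobj\n"
    b"trailer\n<< /Root 2 0 R /Encrypt 3 0 R >>\n%%EOF"
)

report = precheck_pdf(sample)
for name, flagged in report.items():
    print(f"{name}: {flagged}")
```

A file flagged here is only a candidate for closer inspection; a clean result does not guarantee that the text layer will be read correctly.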

Scanned images

A document may be scanned as an image (JPEG, TIFF, PNG) or as a PDF document. OCR accuracy depends on the quality of the scanned document and on the scanner settings.

If auto rotation is enabled, ReadSoft Online automatically rotates invoices that arrive in landscape orientation or upside down.

Scan quality

The recommended resolution is at least 300 DPI. For Asian languages, such as Chinese, at least 400 DPI is required.
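To relate these DPI figures to pixel dimensions, the sketch below works through the arithmetic for a US Letter page (8.5 x 11 inches), which is an assumed example page size. The helper names are hypothetical; only the 300 and 400 DPI thresholds come from the recommendation above.

```python
import math

MIN_DPI = 300        # recommended minimum from the text above
MIN_DPI_ASIAN = 400  # minimum stated for Asian languages such as Chinese


def required_pixels(page_inches: float, dpi: int) -> int:
    """Pixels needed along one side of the page to reach the given DPI."""
    return math.ceil(page_inches * dpi)


def effective_dpi(pixels: int, page_inches: float) -> float:
    """DPI actually achieved by a scan with the given pixel count."""
    return pixels / page_inches


def meets_recommendation(width_px: int, width_inches: float,
                         asian_language: bool = False) -> bool:
    minimum = MIN_DPI_ASIAN if asian_language else MIN_DPI
    return effective_dpi(width_px, width_inches) >= minimum


# US Letter: 8.5 in x 11 in (assumed example page size).
print(required_pixels(8.5, MIN_DPI), required_pixels(11, MIN_DPI))
print(required_pixels(8.5, MIN_DPI_ASIAN), required_pixels(11, MIN_DPI_ASIAN))
print(effective_dpi(1700, 8.5), meets_recommendation(1700, 8.5))
```

Under these assumptions a Letter page needs 2550 x 3300 pixels at 300 DPI and 3400 x 4400 pixels at 400 DPI, while a 1700-pixel-wide scan of the same page corresponds to 200 DPI and falls short of the recommendation.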

File compression is a type of data compression that creates a smaller version of a file to allow for easier sharing over a network or internet connection. Generally, it will have a greater impact on image-based PDFs or image files like JPEGs and PNGs, rather than text-only documents.

Image compression can be lossy or lossless. Lossless compression is preferred for OCR because it does not affect image quality; it is mostly used for TIFF. Lossy compression is mainly used for JPEG and, depending on the method, may affect OCR results. High compression rates increase the risk of character misinterpretation. Examples of text with lossy compression are shown below.

[Images: examples of text with lossy compression]
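The difference between lossless and lossy compression can be illustrated with a toy example. The sketch below uses Python's `zlib` module for the lossless round trip. The lossy step is only a stand-in: it quantizes pixel values coarsely, which is not how JPEG works internally but shows the same effect of discarding information that cannot be restored. The pixel values are invented for illustration.

```python
import zlib

# A tiny grayscale "scan line": dark text pixels on a light background.
row = bytes([250, 248, 30, 28, 32, 245, 251, 40, 35, 249])

# Lossless: decompression restores every byte exactly.
restored = zlib.decompress(zlib.compress(row))


def quantize(pixels: bytes, step: int = 64) -> bytes:
    """Toy stand-in for a lossy codec: round each pixel down to a multiple of step."""
    return bytes((p // step) * step for p in pixels)


# Lossy stand-in: the original values cannot be recovered afterwards.
lossy = quantize(row)
changed = sum(a != b for a, b in zip(row, lossy))

print("lossless round trip identical:", restored == row)
print("pixels altered by lossy step:", changed, "of", len(row))
```

In a real scan, the altered pixels show up as blurred or blocky character edges, which is what raises the risk of misinterpretation at high compression rates.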

Text layer in scanned documents

Some scanners have built-in OCR functionality that embeds a text layer in the scanned document. If this functionality is used, ReadSoft Online treats the file as an electronically generated PDF and reads the text layer instead of performing OCR. The quality of the text layer produced by scanners varies, so even if the image looks correct, the text layer may contain characters that the scanner recognized incorrectly.

If there is a problem with character interpretation and the scanner's OCR function is used, it is recommended to turn it off and let ReadSoft Online perform OCR. 
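The handling rules described so far for PDFs and images can be summarized in a small decision sketch. This is a reading aid based only on the behavior stated in this article, not ReadSoft Online's actual implementation; the class, function, and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IncomingDocument:
    has_text_layer: bool              # electronic PDF, or scan with scanner OCR
    password_protected: bool = False  # password needed to open or decrypt


@dataclass
class ProcessingPlan:
    supported: bool
    method: Optional[str]  # "text_layer", "ocr", or None if unsupported
    auto_rotate: bool


def plan_processing(doc: IncomingDocument,
                    auto_rotation_enabled: bool = True) -> ProcessingPlan:
    if doc.password_protected:
        # Password-protected PDFs are not supported.
        return ProcessingPlan(supported=False, method=None, auto_rotate=False)
    if doc.has_text_layer:
        # The text layer is read instead of OCR; no auto rotation is performed.
        return ProcessingPlan(supported=True, method="text_layer", auto_rotate=False)
    # Image-only documents go through OCR; rotation depends on the setting.
    return ProcessingPlan(supported=True, method="ocr",
                          auto_rotate=auto_rotation_enabled)


scan_with_scanner_ocr = IncomingDocument(has_text_layer=True)
plain_scan = IncomingDocument(has_text_layer=False)
print(plan_processing(scan_with_scanner_ocr))
print(plan_processing(plain_scan))
```

The sketch makes the consequence of scanner-side OCR visible: once a text layer is present, the document follows the text-layer path, so turning the scanner's OCR off is what moves it back to the OCR path.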

XML

XML documents contain structured information. Tags in the XML file are mapped to specific fields in ReadSoft Online, and no OCR is performed. Only mapped tags are available in ReadSoft Online; custom fields cannot be used to map additional content from the XML document. Different XML formats have different mappings. A full list of supported formats and mapped fields can be found in the ReadSoft Online help.

It is recommended to use the XML document type when receiving XML documents. 
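The idea of tag-to-field mapping can be sketched as follows. The XML structure, tag names, field names, and the `FIELD_MAP` table are all invented for illustration; the real mappings for each supported XML format are listed in the ReadSoft Online help.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XML tag names to field names.
FIELD_MAP = {
    "InvoiceNumber": "invoicenumber",
    "InvoiceDate": "invoicedate",
    "Total": "invoicetotal",
}


def extract_mapped_fields(xml_text: str, field_map: dict) -> dict:
    """Return only the values whose tags appear in the mapping."""
    root = ET.fromstring(xml_text)
    fields = {}
    for element in root.iter():
        target = field_map.get(element.tag)
        if target is not None and element.text:
            fields[target] = element.text.strip()
    return fields


# Invented sample document; CustomNote has no mapping and is therefore dropped.
sample = """<Invoice>
  <InvoiceNumber>INV-1001</InvoiceNumber>
  <InvoiceDate>2024-01-31</InvoiceDate>
  <Total>125.50</Total>
  <CustomNote>Deliver to dock 4</CustomNote>
</Invoice>"""

fields = extract_mapped_fields(sample, FIELD_MAP)
print(fields)
```

The unmapped `CustomNote` element does not appear in the result, which mirrors the limitation that content outside the mapping is not available.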

Level of Complexity 

Moderate

Applies to  

Product: ReadSoft Online
Version: Current