
Problem with incorrect capture on a PDF

Question / Problem: 

Readsoft Online (RSO) fails to capture the text shown on the invoice correctly, or captures it but adds extra symbols.

Answer / Solution: 

There are cases where the captured values do not resemble what is on the invoice. For instance, a value may show up as a random set of characters, or extra characters that do not seem to be on the document may be captured. In many of these cases the document has been encoded with Identity-H. This is easily verified by checking the font properties in the original PDF.

This is relatively common, and is caused when the application creating the PDF fails to correctly embed the Unicode lookup table for the font. Without that lookup table there is no relationship between the visible character on screen and the equivalent character code, so copying and pasting the text will lead to either a series of unknown markers, or a jumble of characters with a 1:1 relationship to the original text.

As a PDF stores the character codes rather than the human-readable text, the fact that you can see a letter "A" on the page doesn't mean Acrobat has any idea that it is an "A". The lookup tables make that connection, so if they are missing or corrupted there is no way to recreate the semantic connection unless the file is regenerated ("re-fried") with an original copy of the font.
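The role of the lookup table can be illustrated with a small Python sketch. It is a toy model only: the glyph codes and the mapping are invented for illustration and are not taken from any real PDF or from Readsoft Online.

```python
# Toy model: a PDF text run stores glyph codes, not readable characters.
# The codes and the mapping below are invented for illustration only.
glyph_codes = [0x0024, 0x0025, 0x0026]  # what the page content stores

# Hypothetical lookup table: glyph code -> Unicode character.
to_unicode = {0x0024: "A", 0x0025: "B", 0x0026: "C"}


def extract_text(codes, lookup=None):
    """Return the text a capture tool could recover from the glyph codes."""
    if lookup is None:
        # No table: the only option is to treat each code as a character code.
        return "".join(chr(code) for code in codes)
    # Codes missing from the table become U+FFFD, the replacement character.
    return "".join(lookup.get(code, "\ufffd") for code in codes)


with_table = extract_text(glyph_codes, to_unicode)
without_table = extract_text(glyph_codes)

print(with_table)     # ABC
print(without_table)  # $%&
```

With the table, the codes resolve to the text a reader sees on the page. Without it, the same codes come out as a jumble that still has a 1:1 relationship to the original text, which matches the symptom described above.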

There is nothing that can be done on the Readsoft Online side to improve interpretation for these cases. The supplier needs to create the PDF using another encoding or tool in order to improve capture.

How to check the font in a PDF:

  1. Open the PDF with Adobe Reader and right click on the document.
  2. Select "Properties", then choose "Fonts" and check whether Identity-H is found in the list.
  3. If it is, ask the supplier to create the PDF using another font encoding or to use another tool to create the PDF.
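When many invoices need checking, the manual steps above could be approximated with a script. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and it assumes the font dictionaries are stored uncompressed in the file. Many PDFs keep them in compressed object streams, in which case this scan finds nothing, so the font list in Adobe Reader remains the reliable check.

```python
import re


def font_encoding_hints(pdf_bytes: bytes) -> dict:
    """Count font-encoding markers in raw PDF bytes (heuristic only).

    Assumption: the font dictionaries are not inside compressed object
    streams. If they are, both counts will be zero.
    """
    return {
        "identity_h": len(re.findall(rb"/Identity-H\b", pdf_bytes)),
        "to_unicode": len(re.findall(rb"/ToUnicode\b", pdf_bytes)),
    }


def looks_risky(pdf_bytes: bytes) -> bool:
    """Flag files with Identity-H fonts that lack a Unicode lookup table."""
    hints = font_encoding_hints(pdf_bytes)
    return hints["identity_h"] > 0 and hints["to_unicode"] < hints["identity_h"]
```

To try it on a file, read the file in binary mode and pass the bytes in, for example `looks_risky(open("invoice.pdf", "rb").read())`, where `invoice.pdf` is a placeholder name. A result of `True` only suggests the file is worth inspecting; a table that is present but corrupted would not be detected.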