
Problem with capturing correctly


Question / Problem:  

In some cases the captured values do not resemble what is on the invoice. For instance, a value may show up as a seemingly random set of characters, or extra characters that do not appear on the document may be captured.

Answer / Solution:  

In many of these cases the fonts in the document are encoded with Identity-H. This can be verified by checking the font properties of the original PDF.

This is relatively common. It happens when the application that creates the PDF fails to correctly embed the Unicode lookup table for the font. Without that lookup table there is no relationship between the character visible on screen and the equivalent character code, so copying and pasting the text produces either a series of unknown-character markers or a jumble of characters with a 1:1 relationship to the original text.
A PDF stores character codes rather than human-readable text, so the fact that you can see a letter "A" on the page doesn't mean Acrobat has any idea that it's an "A". The lookup tables make that connection. If they are missing or corrupted, there is no way to recreate the semantic connection unless you can re-fry the file with an original copy of the font.
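The effect can be illustrated with a toy sketch in Python. This is not how a real PDF reader is implemented: the glyph codes, the table contents, and the function name `extract_text` are all made up for illustration. It only shows why text drawn as "ABC" can come out as unrelated characters or unknown markers when the lookup table (called a ToUnicode CMap in PDF terms) is missing or incomplete.

```python
# Toy model: a PDF text string is a sequence of glyph codes, not characters.
# These glyph codes are hypothetical and not taken from a real font.
GLYPH_CODES = [36, 37, 38]  # rendered on the page as "ABC"

# The lookup table maps each glyph code to the Unicode character it represents.
TO_UNICODE = {36: "A", 37: "B", 38: "C"}


def extract_text(codes, to_unicode=None):
    """Return the text a copy/paste or capture step would recover."""
    if to_unicode is None:
        # No table: the only option is to treat each code as a character code,
        # which gives one wrong character per original character.
        return "".join(chr(code) for code in codes)
    # With a table, codes it does not cover become the unknown marker U+FFFD.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


print(extract_text(GLYPH_CODES, TO_UNICODE))  # ABC
print(extract_text(GLYPH_CODES))              # $%&
```

The second call shows the "jumble with a 1:1 relationship" case: every visible letter still yields exactly one character, just not the right one.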

There is nothing that can be done on the Readsoft Online side to improve interpretation for these cases. The customer needs to create the PDF using another encoding or tool in order to improve capture.

How to:

  1. Download the document.
  2. Open the PDF in Adobe Reader and right-click the document.
  3. Choose "Properties" and then "Fonts". If Identity-H is listed for any font in the document, that is the cause.
  4. Inform the supplier and ask them to send invoices created with a different encoding or tool.
  5. If Identity-H can't be found, please create a case and state that Identity-H is not present in the PDF.
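When many documents need to be checked, the manual check above could be approximated with a rough first-pass script. The sketch below is an assumption-laden heuristic, not a Readsoft Online feature: the function names and the file name `invoice.pdf` are hypothetical, and it simply searches the raw file bytes for the PDF names `/Identity-H` and `/ToUnicode`. Font dictionaries stored in compressed object streams are invisible to this search, so a negative result is not conclusive, and the Fonts tab in Adobe Reader remains the reliable check.

```python
from pathlib import Path


def scan_pdf_bytes(data):
    """Report whether the raw PDF bytes mention Identity-H or a ToUnicode table.

    Heuristic only: names inside compressed object streams are not found.
    """
    return {
        "identity_h": b"/Identity-H" in data,
        "to_unicode": b"/ToUnicode" in data,
    }


def scan_pdf_file(pdf_path):
    """Read a PDF from disk and run the byte-level scan on it."""
    return scan_pdf_bytes(Path(pdf_path).read_bytes())


# Example usage (hypothetical file name):
# print(scan_pdf_file("invoice.pdf"))
```

A result with `identity_h` true and `to_unicode` false suggests the situation described in this article, but it should be confirmed in the font properties before contacting the supplier.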
