Kofax

Unicode Database Without Byte Order Mark

17778
3011435

Question / Problem: 

Is it possible to use a Unicode database file that has no Byte Order Mark (BOM)?

Answer / Solution: 

Kofax Transformation Modules (KTM) supports Unicode database files. However, these files MUST begin with the Unicode Byte Order Mark (BOM); otherwise KTM parses the file according to the local default code page.
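
As a quick illustration of why the BOM matters, here is a small Python sketch (not KTM code). It assumes Windows-1252 as the local default code page, which may differ on your system. UTF-8 bytes decoded with a single-byte code page turn each non-ASCII character into two or more wrong characters.

```python
# "Müller" stored as UTF-8 without a BOM
raw = "Müller".encode("utf-8")

# Misread using an assumed default code page of Windows-1252
misread = raw.decode("cp1252")

# Read with the intended encoding
correct = raw.decode("utf-8")

print(misread)  # MÃ¼ller
print(correct)  # Müller
```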

The following script helps when the database file does not start with a Unicode BOM, for example because the ERP system that generates the file cannot write one.

The sample script reads the first 3 bytes of the database import file and determines whether the file is UTF-8, UTF-16 (little-endian) or UTF-16 big-endian. If no BOM is present, it guesses the encoding from the position of the zero bytes and adds the matching BOM to the file so that KTM parses it correctly. The database is then compiled.

Note: The "Automatically update from input file" option in the Database Settings must be disabled. Otherwise the script exits as soon as it runs, because its up-to-date check at the top finds that the database is already current.

Private Sub Batch_Opened(ByVal ServiceNumber As Long)
   refreshUnicodeDatabase(Project.Databases(0))
End Sub

Private Sub refreshUnicodeDatabase(db As CscDatabase)

   'If the database is already up-to-date do nothing
   If FileDateTime(db.DatabasePath) > FileDateTime(db.ImportFilename) Then Exit Sub
   Dim temp As String
   temp = Environ("TEMP") & "\" & "db.txt"
   FileCopy db.ImportFilename, temp  'make a copy of the database, so we can edit the original
   Dim header() As Byte
   ReDim header(2)
   Open temp For Binary Access Read As #1
   Open db.ImportFilename For Binary Access Write As #2
   'read the first 3 bytes of the file and try to work out the Unicode encoding.
   'This supports only UTF-8 and UTF-16
   'This doesn't support UTF-7, UTF-32 and other encodings
   Get #1, , header

   Dim encoding As String
   If header(0) = &hEF And header(1) = &hBB And header(2) = &hBF Then
      encoding = "utf8"
   ElseIf header(0) = &hFF And header(1) = &hFE Then
      encoding = "unicode"
   ElseIf header(0) = &hFE And header(1) = &hFF Then
      encoding = "bigendian"
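   ElseIf header(0) <> 0 And header(1) = 0 And header(2) <> 0 Then 'Guess: no BOM, looks like UTF-16 little-endian
      encoding = "unicode"
      'Grow the header by 2 bytes and shift the data right to make room for the BOM
      ReDim Preserve header(4)
      header(4) = header(2)
      header(3) = header(1)
      header(2) = header(0)
   ElseIf header(0) = 0 And header(1) <> 0 And header(2) = 0 Then 'Guess: no BOM, looks like UTF-16 big-endian
      encoding = "bigendian"
      'Grow the header by 2 bytes and shift the data right to make room for the BOM
      ReDim Preserve header(4)
      header(4) = header(2)
      header(3) = header(1)
      header(2) = header(0)
   Else
      encoding = "utf8"   'Guess: no BOM found, assume UTF-8
      'Grow the header by 3 bytes and shift the data right to make room for the BOM
      ReDim Preserve header(5)
      header(5) = header(2)
      header(4) = header(1)
      header(3) = header(0)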
   End If

   Select Case encoding
      Case "utf8"
         header(0) = &hEF
         header(1) = &hBB
         header(2) = &hBF
      Case "unicode"
         header(0) = &hFF
         header(1) = &hFE
         'leave 3rd byte alone
      Case "bigendian"
         header(0) = &hFE
         header(1) = &hFF
         'leave 3rd byte alone
   End Select

   Put #2, , header   'Write header (BOM plus the first data bytes) to database file
   Dim buffer() As Byte
   Dim buffersize As Long
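   While Seek(1) <= LOF(1)         'copy rest of file in chunks of up to 2048 bytes
      buffersize = LOF(1) - Seek(1)   'upper bound of the buffer = remaining bytes - 1
      If buffersize > 2047 Then buffersize = 2047
      ReDim buffer(buffersize)        'indexes 0..buffersize
      Get #1, , buffer
      Put #2, , buffer
   Wend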
   Close #1
   Close #2
   Kill temp  'delete the copy

   db.ImportDatabase(True)
End Sub
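
To check an export file outside KTM, the detection heuristic of the script above can be sketched in Python. This is only an illustration under the script's own assumptions (UTF-8 and UTF-16 only, first three bytes inspected). The names `guess_encoding` and `add_bom` are hypothetical and not part of KTM.

```python
import codecs

# BOMs recognized by the script, keyed by the script's own encoding labels.
BOMS = {
    "utf8": codecs.BOM_UTF8,            # EF BB BF
    "unicode": codecs.BOM_UTF16_LE,     # FF FE
    "bigendian": codecs.BOM_UTF16_BE,   # FE FF
}


def guess_encoding(head):
    """Return (encoding label, has_bom) from the first three bytes of a file."""
    head = head[:3].ljust(3, b"\x00")  # pad very short input so indexing is safe
    for label, bom in BOMS.items():
        if head.startswith(bom):
            return label, True
    # No BOM: guess from where the zero bytes fall, as the script does.
    if head[0] != 0 and head[1] == 0 and head[2] != 0:
        return "unicode", False      # e.g. 41 00 42: ASCII text as UTF-16 LE
    if head[0] == 0 and head[1] != 0 and head[2] == 0:
        return "bigendian", False    # e.g. 00 41 00: ASCII text as UTF-16 BE
    return "utf8", False             # fallback guess


def add_bom(data):
    """Prepend the guessed BOM unless the data already starts with one."""
    label, has_bom = guess_encoding(data)
    return data if has_bom else BOMS[label] + data
```

For example, `add_bom(b"abc")` returns the input with the three UTF-8 BOM bytes prepended, while data that already starts with `FF FE` is returned unchanged. As in the script, the zero-byte test is only a guess: it works for text that begins with ASCII characters and can misjudge files that start with other characters.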

Applies to:  

Product   Version   Category
KTM       6.3       Scripting
KTM       6.2       Scripting

 

Author:  harold.gue@kofax.com

From legacy Kofax KnowledgeBase article QAID 17778: Unicode Database Without Byte Order Mark 
