
KCS on VMware - Information Gathering & Interpretation Guide


How to prevent problems

Before you even start with the virtualisation of your KCS environment you should read the latest issue of the “Platform System Manual”. Chapter “5 – KCS Server Package on VMware Server” contains the required performance information, and chapter “6.4 – KCS HW-Requirement Calculator” explains how to determine whether the available environment is up to the challenge. Please also read the KB article “Performance Monitoring”, as it is closely related to some parts of this article and may come in handy.

Please make sure that you’re running the latest KCS package. If not, consider updating, as the latest packages have been fine-tuned for better performance.

Possible scenarios / problems

Single KCS server:

  • Interruptions in the fax traffic.
  • Loss of connection to the LS1 and automatic, unexpected reboots of the LS1.
  • Unexpected TCOSS reboots.

Tandem KCS server:

In addition to the ‘Single KCS server’ issues mentioned above, your system can suffer from the following symptoms.

  • Remote Disk Access Timeout errors
  • Failed failover; unexpected reboots of the Secondary server.
  • Desynchronised Primary and Secondary file structures

In these situations it’s very important, and often tricky, to pinpoint the source of the problem. Usually these issues aren’t caused by a lack of CPU power or available RAM, but by disk or network delays. It’s therefore crucial to capture all the needed information and to know how to analyse it.

What to gather?

In short: Performance logs, traces, registry, event logs, general environment information and network information. Please note that if a Tandem system is involved we need this information from both the Primary AND the Secondary server.

To make sure that peaks of business communication are included, it is very important to capture all of this information over a longer period of time and to take it all from the same period of time.

Performance logs.

Format is <Object>/<Counter>, normally valid for “All Instances”. A sample collection command is shown after the counter lists below.

  • CPU
    Processor / % Processor Time
    Process / % Processor Time (to view specific processes)
  • Memory
    Memory / Available KBytes
    Memory / Page Faults/sec
  • Disk access/throughput
    Physical Disk / all counters for C: and D: or _Total

Especially interesting:
Physical Disk / % Read Time
Physical Disk / % Write Time
Physical Disk / % Idle Time
Physical Disk / Disk Read Bytes/sec
Physical Disk / Disk Reads/sec
Physical Disk / Disk Write Bytes/sec
Physical Disk / Disk Writes/sec
Physical Disk / Avg. Disk Write Queue Length

  • Network
    Network Interface / Bytes sent/sec (for the specific NIC)
    Network Interface / Bytes received/sec (for the specific NIC)
    TCPv4 / Segments Retransmitted/sec
     
  • TCOSS
    TCOSS Disk / All counters
    TCOSS Links / All counters for instance 3. Instance 3 is the dedicated LanLink to the secondary server. If there are also issues with LS1s, activate for all links.
    TCOSS Cache / All counters
  • ESX counters
    On the VMware ESX server you can also activate some counters in order to identify performance issues.
    Locate the ‘Performance’ tab and click on “Change Chart Options...”

Now you should be able to change the chart by selecting different counters for certain objects. Here you can also define the time frame that needs to be monitored.

Please enable the following counters for the Disk object:

Disk Read Rate
Disk Write Rate
Disk Read Latency
Disk Write Latency
Disk Write Requests
Disk Read Requests

For the Network Object activate all available counters.

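The Windows counters listed above can be captured over a longer period with a Performance Monitor data collector set, for example via logman. The lines below are only a sketch: the collector name, sample interval, output folder and counter selection are examples, and the exact TCOSS object and instance names should be verified in Performance Monitor on your system.

logman create counter KCS_Perf -f csv -si 00:00:15 -o C:\PerfLogs\KCS_Perf -c "\Processor(_Total)\% Processor Time" "\Memory\Available KBytes" "\Memory\Page Faults/sec" "\PhysicalDisk(*)\*" "\Network Interface(*)\Bytes Sent/sec" "\Network Interface(*)\Bytes Received/sec" "\TCPv4\Segments Retransmitted/sec" "\TCOSS Disk(*)\*" "\TCOSS Links(*)\*" "\TCOSS Cache(*)\*"
logman start KCS_Perf

Stop the collector with “logman stop KCS_Perf” once the capture period is over, and run the same collector on both the Primary and the Secondary server over the same time window.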

Traces, event logs and registry.

In addition to all these counters, also export the System and Application event logs and the registry, and enable the following TCOSS traces:

HKLM\SOFTWARE\TOPCALL\TCOSS\TraceLevel=0x1083 hex
HKLM\Software\TOPCALL\TCOSS\MaxTraceFileSize=0x1388 hex (5000 dec)
HKLM\Software\TOPCALL\TCOSS\MaxTraceFiles=0xa hex (10 dec)

This information is required from both primary and secondary servers.
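These settings can be applied from an elevated command prompt, for instance with reg add. This is only a sketch: it assumes the values are stored as REG_DWORD (as the hex notation above suggests), the values are given in decimal, and C:\Temp is just an example output folder for the exports.

reg add "HKLM\SOFTWARE\TOPCALL\TCOSS" /v TraceLevel /t REG_DWORD /d 4227 /f
reg add "HKLM\SOFTWARE\TOPCALL\TCOSS" /v MaxTraceFileSize /t REG_DWORD /d 5000 /f
reg add "HKLM\SOFTWARE\TOPCALL\TCOSS" /v MaxTraceFiles /t REG_DWORD /d 10 /f
reg export "HKLM\SOFTWARE\TOPCALL" C:\Temp\TOPCALL.reg
wevtutil epl System C:\Temp\System.evtx
wevtutil epl Application C:\Temp\Application.evtx

The last three commands export the TOPCALL registry branch and the System and Application event logs mentioned above.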

Environment information.

Understand the big picture.

  • It is important to know which other Guests are sharing the same VMware ESX server with the KCS and what their tasks are.
  • Are they running disk or network intensive operations?
  • Is it possible to isolate the KCS file structure?
  • Local or SAN disks? Raid configuration? Disk specifications (RPM, average latency, Avg. seek time, Writes/sec)?

Network information.

  • Network connection for the Data link between Primary and Secondary server.
    • Dedicated or shared connection?
    • Virtual or physical network and switches?
    • What does the setup look like exactly? Cross-over connections, separate switches?
    • Distance between Primary and Secondary and how many routers are in between the two?
    • Any routers that reduce the MTU? (= Maximum Transmission Unit, the largest physical packet size, measured in bytes, that a network can transmit.)
    • Speed and duplex mode, and do they match the switches in between?
    • Can it be switched to a Gbit connection?
    • Response time when pinging with -l 16384 (the packet size used by TCOSS); see the example after this list.
    • Teamed NICs and/or dedicated physical NICs?
    • Are there any retransmitted segments? (see “TCPv4 / Segments retransmitted” counter.)
  • Network connection between Primary and LS1.
    • All of the above but you can forget about the Gbit connection.
    • Installed image versions on the LS1?
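As a first, rough check of the data link you can run a large-packet ping from the Primary to the Secondary. A sketch (replace <secondary> with the actual host name or IP address of the Secondary server):

ping -l 16384 -n 20 <secondary>
ping -f -l 1472 <secondary>

The first command sends twenty 16384-byte probes; watch for high or strongly fluctuating round-trip times and for packet loss. The second, optional command sets the “don’t fragment” flag with a 1472-byte payload (1500 bytes including headers) and reports that the packet needs to be fragmented if a device along the path reduces the MTU below the standard 1500 bytes.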

What to look out for and how to interpret it.

From the Traces

  • Loads of “WARNING: TAWIS delay xxxx ms on TAM channel XX, cmd type 0”.
    TAWIS delays can be caused by a bad network connection between primary and secondary or by a slow disk that can’t keep up with the communication with the LS1. When these delays go over 5000 ms, a “DISK-Remote Disk Access Timeout” error will be thrown, which you’ll also be able to find in the traces and event logs. A quick way to search the traces for these entries is shown after this list.
  • “LanLink[xxx] Connecting failed. Winsock Error Code: 10060 Connection timed out.”
    Usually caused by network issues and accompanied by these trace entries: “Boot LS1(L.xx.DSP0): LS1 Update/Boot returned (error=4)” and “Boot LS1(L.xx.DSP0): Boot Procedure failed”.
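To get a quick idea of how often these warnings occur, the trace files can be searched from the command line. A sketch, where <trace directory> stands for the folder your TCOSS trace files are written to:

findstr /s /i /c:"TAWIS delay" /c:"Remote Disk Access Timeout" /c:"Winsock Error Code: 10060" "<trace directory>\*.*"

Counting the hits per hour gives a first impression of whether the delays correlate with peaks in business communication.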

From the Traces and Event Logs.

You can find a high occurrence of the following events in both Trace Files and Event Logs:

  • ID: 16002 “Warning: single disk operation on disk 1 started ok in TOS.”
    Can be found on both the primary and the secondary server and indicates a loss of connection between the two.
  • ID: 16002 “Warning: disk 2 deactivated;Write4 on Sec 2”
    Located on the primary, usually in combination with the 16002 warning mentioned above, and has the same cause.
  • ID: 16004 “DISK-Remote Disk Access Timeout”
    Only on the primary; indicates that writing to the remote disk is taking too much time. Can be caused by a slow network and/or slow disk access.
  • ID: 16005 “An unrecoverable error occurred. Parts of the system may not be available anymore. Error Message: Primary Master out of order, Secondary Master is running stand alone! in START.”
    Only on the secondary; points to a loss of the network connection, or the primary really is down.
  • ID: 16020 “LS1(L.xx.DSP0) has been stopped due to a link error between Sec. Master and LS1(L.xx.DSP0)”
    Can be found on the active server and points to network issues between the server and the LS1.
  • ID: 16022 “Reloading LS1(L.xx.DSP0) failed due to link error between Sec. Master and LS1(L.xx.DSP0)”
    The LS1 can’t be reached.
  • ID: 16053 “Process will be stopped due the following fatal problem: Connection to TCOSS Master lost! (Wrong State of Link to Master)”

In addition to the above events you can find a high number of the following events. Even if there are almost no peaks and only the average limits are exceeded, this indicates a fairly serious lack of performance. A query example for these event IDs follows the list.

  • ID: 16054 Avg. local disk time x ms exceeded avg. limit, peak was x ms, x perc. values exceeded peak limit (2000 ms), x perc. values exceeded avg. limit (20 ms) during last 60 sec
  • ID: 16056 Avg. remote disk time x ms exceeded avg. limit, peak was x ms, x perc. values exceeded peak limit (2000 ms), x perc. values exceeded avg. limit (25 ms) during last 60 sec
  • ID: 16058 Avg. disk network delay time x ms exceeded avg. limit, peak was x ms, x perc. values exceeded peak limit (500 ms), x perc. values exceeded avg. limit (10 ms) during last 60 sec
  • ID: 16063 Node 3 avg. round-trip time x ms exceeds avg. limit, peak was x ms, x perc. values exceeded peak limit (1000 ms), x perc. values exceeded avg. limit (300 ms) during last 60 sec
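A quick way to count these events is a wevtutil query. The sketch below assumes the TCOSS events end up in the Application log; adjust the log name and extend the EventID list (e.g. with 16002, 16004 or 16005) as needed:

wevtutil qe Application /q:"*[System[(EventID=16054 or EventID=16056 or EventID=16058 or EventID=16063)]]" /f:text /rd:true /c:50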

From the Performance Logs

Below you can find a listing and a description of the most important and most telling performance counters.

  • TCPv4
    • ‘Segments Retransmitted/sec’
      This counter should stay below 1 retransmitted segment per second. Higher values could indicate network, NIC driver and/or speed and duplex problems.
  • TCOSS Disk
    • ‘Avg. Remote Disk ms/Read’
      Should be very low or zero. Nothing is read from the secondary server.

    • ‘Avg. Remote Disk ms/Write’
      This one logs the time it takes to write data to the remote secondary disk; it is calculated by adding the ‘avg. remote network delay ms’ and the ‘avg. local disk ms/Write’ on the secondary server. Now let’s have a look at the “HW-RequirementCalculator_XXXXX.xlsx” document mentioned above.


Please note that calculations with this tool are based on the assumption that the ‘TCOSS Cache’ settings are set to adequate values. With this tool you can get an idea of the maximum allowed “average disk access time in ms”. You should actually halve this value to be on the safe side, and it is this halved value that you have to compare with the ‘Avg. Remote Disk ms/Write’. In this case you should be aiming for an average of around 5 ms.
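For example, with purely illustrative numbers: if the calculator reports a maximum allowed average disk access time of 10 ms, halving it gives a target of roughly 5 ms, and that is the figure the measured ‘Avg. Remote Disk ms/Write’ should stay below.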

  • ‘Peak remote Disk ms/Read’
    Should be very low or zero. Nothing is read from the secondary server.
  • ‘Peak remote Disk ms/Write’
    Peaks are bad. Here the maximum value is important.
  • ‘Avg. remote Disk Network Delay ms’
    Average values are important here and should be below 1ms for a dedicated Gbit connection.
  • ‘Peak remote Disk Network Delay ms’
    You don’t want peaks. The maximum value is important.
  • ‘Avg. local Disk ms/Write’, ‘Avg. local Disk ms/Read’, ‘Peak local Disk ms/Write’, ‘Peak local Disk ms/Read’

These counters give you an insight into the disk usage. The lower these values are, the better; you can expect values between 1.5 and 5 ms. Reading from the disk takes more time than writing, as write actions are usually cached. The avg. and peak local Disk ms/Read counters are closely related to the Cache Misses counters explained below. To limit or completely eliminate reading from the local disk you can change some Cache settings, which are also explained below.

  • TCOSS Cache
    • ‘Cache Misses/sec’ for Document, Directory and Database instances.
      If TCOSS isn’t able to read something from the cache it has to read from the disk, and a Cache Miss is logged. Since read actions are more disk intensive than write actions, it’s best to limit cache misses and disk reads. This can be done by increasing the TCOSS Cache values. Be aware that increasing the cache values increases the RAM usage by the same amount.

Here are the default Cache settings (values in KB, ~50 MB in total).

HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0\DocCacheSize=40960 decimal (40 MB)
HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0\DatabaseCacheSize=5120 decimal (5 MB)
HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0\DirCacheSize=2048 decimal (2 MB)

If you have enough free RAM you can ramp these settings up to 500 MB or 1 GB in total. Be sure to leave enough RAM for the system to work with. Divide the available amount according to the following ratios: DocCacheSize = 80 % of the desired total, DatabaseCacheSize = 13.3 % of the desired total and DirCacheSize = 6.6 % of the desired total.

Below you can find an example for 1 GB.

HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0\DocCacheSize=819200 decimal (800 MB)
HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0\DatabaseCacheSize=136192 decimal (133 MB)
HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0\DirCacheSize=67584 decimal (66 MB)
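As with the trace settings earlier, these values could be applied with reg add. A sketch for the 1 GB example above, under the same assumption that the values are REG_DWORDs (decimal KB, Drive0 as in the entries above):

reg add "HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0" /v DocCacheSize /t REG_DWORD /d 819200 /f
reg add "HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0" /v DatabaseCacheSize /t REG_DWORD /d 136192 /f
reg add "HKLM\SOFTWARE\TOPCALL\TCOSS\Drive0" /v DirCacheSize /t REG_DWORD /d 67584 /f

Check first that the free RAM on the server can accommodate the roughly 1 GB increase.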

Applies to

  • Microsoft Windows Server 2003 32-bit x86
  • Microsoft Windows Server 2008, 32-bit or 64-bit
  • KCS 8.0 – Current, TC/SP 7.82.00 – Current (Article based on KCS 8.2, TC/SP 7.86.00)

Keywords: Lineserver, Line server, HardwarePerformance, virtualisation, virtual
