KCS Tandem System - Design, Update and Failure States
Issue
Tandem Scenarios
By design, a Tandem System consists of three machines, each with a dedicated connection to the other two. The Primary machine (PRI) and the Secondary machine (SEC) should be connected to each other directly (cross-over cable). The connection to the Status machine (STATUS) is far less data-intensive, so even a dial-up modem connection would be enough, but it should be independent of the connection between PRI and SEC.
In today’s IT environments, this setup is less and less feasible, as more customers split their PRI and SEC between different datacenters. As it often does not make sense to have a third datacenter for the STATUS, this needs to be placed either in the same datacenter as the PRI or as the SEC.
Kofax recommends installing the STATUS in the same datacenter as the PRI.
Solution
Tandem Update
A version upgrade of a Tandem System always requires downtime. This ensures that both PRI and SEC run the same up-to-date version and that the system cannot become desynchronized during the update process.
For this reason, the recommended way to upgrade a KCS Tandem System is as follows:
- Shut down both PRI and SEC
- Update the PRI normally by running the setup and installing the config locally
- Start the PRI and ensure that the update works
- If PRI is working, update the SEC as well and bring it online. If the update did not work and you want to roll back, you just have to roll back the PRI.
Again, please be aware that there is NO supported way of upgrading a Tandem System without downtime.
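The order of the steps above can be sketched as a small Python routine. This is only an illustration of the sequence and the rollback rule. The callables (`stop`, `update`, `start`, `pri_is_healthy`, `rollback`) are hypothetical placeholders for whatever manual or scripted actions are used on site; they are not part of any KCS tooling.

```python
from typing import Callable


def upgrade_tandem(
    stop: Callable[[str], None],
    update: Callable[[str], None],
    start: Callable[[str], None],
    pri_is_healthy: Callable[[], bool],
    rollback: Callable[[str], None],
) -> bool:
    """Run the upgrade in the documented order; return True on success."""
    # Downtime begins: both machines are stopped before anything is changed.
    stop("PRI")
    stop("SEC")

    # The PRI is updated first, then started and verified.
    update("PRI")
    start("PRI")
    if not pri_is_healthy():
        # Only the PRI has been touched so far, so only the PRI is rolled back.
        rollback("PRI")
        return False

    # The PRI works: bring the SEC to the same version and back online.
    update("SEC")
    start("SEC")
    return True
```

Because the SEC is not modified until the PRI has been verified, a failed update never leaves both machines on the new version.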
Tandem failure states
Single Point of Failure (PoF) scenarios:
PRI Machine fails
When the PRI Machine fails, the SEC detects that it can no longer connect to the PRI. It also checks the STATUS to confirm that the PRI is really down and, if it is, starts in standalone mode after 100 seconds.
SEC Machine fails
If the SEC Machine fails during normal operation, no visible change occurs in the system. However, as long as the SEC is down, the PRI is running standalone and the disk is no longer mirrored. If the PRI also fails, the system is unavailable.
STATUS Machine fails
If only the STATUS Machine fails and the PRI can still reach the SEC, normal tandem operation is still ongoing.
Connection PRI-SEC fails
If the connection between PRI and SEC fails but the connection to the STATUS from both machines is intact, the PRI keeps running. However, as the PRI can no longer update the disk of the SEC, the data is no longer mirrored and the PRI is running in standalone mode. The SEC reboots once and then returns to the waiting state.
Connection PRI-STATUS fails
When the PRI can no longer reach the STATUS, tandem operation is still possible as normal, as the data on the SEC can be updated normally. Only the state of the PRI will no longer be visible on the STATUS machine.
Connection SEC-STATUS fails
This leads to a similar behavior as when the connection between PRI and STATUS is gone. The tandem operation will continue normally, but the STATUS will no longer be updated with the current state of the SEC machine.
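As a compact summary, the six single-failure cases above can be written as a lookup table. The following Python sketch only restates the text. The key names and the `Outcome` type are illustrative choices, not identifiers used by KCS.

```python
from typing import NamedTuple


class Outcome(NamedTuple):
    active: str     # machine that keeps the service running afterwards
    mirrored: bool  # whether the disk is still mirrored to the other machine


# One entry per single point of failure described above.
SINGLE_FAILURE_OUTCOMES = {
    "PRI": Outcome(active="SEC", mirrored=False),        # SEC starts standalone after 100 s
    "SEC": Outcome(active="PRI", mirrored=False),        # PRI runs standalone
    "STATUS": Outcome(active="PRI", mirrored=True),      # normal tandem operation
    "PRI-SEC": Outcome(active="PRI", mirrored=False),    # SEC reboots once and waits
    "PRI-STATUS": Outcome(active="PRI", mirrored=True),  # PRI state not visible on STATUS
    "SEC-STATUS": Outcome(active="PRI", mirrored=True),  # SEC state not visible on STATUS
}


def loses_redundancy(failure: str) -> bool:
    """Return True if this single failure leaves the system without a mirror."""
    return not SINGLE_FAILURE_OUTCOMES[failure].mirrored
```

For example, `loses_redundancy("PRI-SEC")` is true, while `loses_redundancy("STATUS")` is false: losing only the STATUS machine does not interrupt mirroring.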
Multiple PoF scenarios:
The scenarios in this section refer repeatedly to the registry value RunWithoutStatusAgent. It can be found under:
HKEY_LOCAL_MACHINE\SOFTWARE\Topcall\TCOSS\RunWithoutStatusAgent
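To check the current setting on a machine, the value can be read with Python's standard `winreg` module. This is a minimal sketch under assumptions: the function name and the `default` parameter are illustrative, and the article does not state the value type. On 64-bit Windows, registry redirection may place the key in a different view, which this sketch does not handle.

```python
try:
    import winreg  # standard library, available on Windows only
except ImportError:
    winreg = None

KEY_PATH = r"SOFTWARE\Topcall\TCOSS"
VALUE_NAME = "RunWithoutStatusAgent"


def read_run_without_status_agent(default=None):
    """Return the configured value, or `default` if it cannot be read."""
    if winreg is None:
        # Not running on Windows: there is no registry to query.
        return default
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, KEY_PATH) as key:
            value, _value_type = winreg.QueryValueEx(key, VALUE_NAME)
            return value
    except OSError:
        # Key or value is missing, or access was denied.
        return default
```

Calling `read_run_without_status_agent(default=0)` returns the stored value when it exists and `0` otherwise, so the caller decides how a missing value is treated.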
Also, these scenarios will assume that the PRI machine and SEC machine are located in separate datacenters and the STATUS machine is located in the same datacenter as the PRI.
A) PRI and STATUS fail, SEC alive
If both the PRI and STATUS machines fail, the behavior of the SEC depends on the registry value RunWithoutStatusAgent.
On RunWithoutStatusAgent = 1, the secondary master remains in the boot process until the first disk write occurs. This usually happens when the system startup is complete and the system message “TCOSS 7.xx.xx (using module TCOSS.EXE 7.xx.xx) started” is created. Then the secondary master tries to reach and update the status agent. If it fails, the secondary master continues and creates the warning message “single disk operation on disk 1 started without status agent update”. This behavior may lead to a “desynchronized” state later.
On RunWithoutStatusAgent = 0, the SEC will try to contact the status agent before TCOSS continues to boot. KCS Monitor displays the status line “Waiting for Status Agent”, with the information line “get permission to run stand-alone”. The SEC waits here until either the status agent is reachable again, or the process is shut down by the primary master.
During this potentially indefinite wait, the RunWithoutStatusAgent registry value is re-read at regular intervals, so an operator can change it to 1. This allows the secondary master to run standalone without a status agent update. The registry reload interval depends on the status agent access timeout (default 120 seconds). Therefore, it takes about 2 minutes after changing the registry value before any effect is visible.
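The waiting behavior described for RunWithoutStatusAgent = 0 can be pictured as a polling loop. The sketch below is a simplified model, not the TCOSS implementation. The function and parameter names are hypothetical, the 120-second interval is the default timeout mentioned above, and a shutdown by the primary master is not modeled.

```python
import time
from typing import Callable, Optional


def wait_for_standalone_permission(
    status_agent_reachable: Callable[[], bool],
    read_flag: Callable[[], int],
    interval_s: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
    max_cycles: Optional[int] = None,
) -> str:
    """Wait until the status agent answers or an operator sets the flag to 1."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        if status_agent_reachable():
            return "status agent reachable"
        # The registry value is re-read on every cycle, so a manual change
        # takes effect after at most one interval.
        if read_flag() == 1:
            return "operator override"
        sleep(interval_s)
        cycles += 1
    return "still waiting"
```

Passing `sleep` and `max_cycles` as parameters is only a convenience so that the loop can be exercised without real delays.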
B) SEC and STATUS fail, PRI alive
As in scenario A), the behavior of the PRI depends on the registry value of RunWithoutStatusAgent.
On RunWithoutStatusAgent = 1, the PRI continues running even though it can no longer reach the STATUS machine. However, it is now running in standalone mode without mirroring its disk to the SEC.
On RunWithoutStatusAgent = 0, the PRI stops with the error message "Model/22x sync stop (no write quorum)". It then tries to restart and contact the STATUS again with the status line “Waiting for Status Agent”, with the information line “get permission to run stand-alone”. The PRI waits here until either the status agent is reachable again, or the registry key is changed manually, see scenario A).
C) Connection between Datacenters fails, PRI, SEC and STATUS alive
In this scenario, all machines are still alive, but effectively, the SEC can no longer see PRI and STATUS, while PRI and STATUS can see each other but not the SEC. The PRI will continue running, regardless of the setting of RunWithoutStatusAgent, as the STATUS is still reachable. However, tandem operation is no longer maintained, and the disk is not mirrored. The behavior of the SEC is again dependent on the registry value RunWithoutStatusAgent.
With RunWithoutStatusAgent = 1, the SEC will start in standalone mode, probably creating a desync state.
With RunWithoutStatusAgent = 0, the SEC will try to update the STATUS but, because the STATUS is unreachable, will stay in the “Can't run stand-alone, tandem system would go desync” state. Once the connection between the datacenters is restored, the disk on the SEC is updated and normal tandem operation can resume.
D) Total network failure, PRI, SEC and STATUS machines alive
In this scenario, both PRI and SEC will behave according to RunWithoutStatusAgent.
When the PRI has RunWithoutStatusAgent = 1, it continues running even though it can no longer reach the STATUS machine. As the disk on the SEC can no longer be updated, the PRI is now running in standalone mode.
When the SEC has RunWithoutStatusAgent = 1, it will start in standalone mode after the wait time of 100 seconds. Paired with the RunWithoutStatusAgent = 1 on the PRI, this may lead to a desync state later on.
When the PRI has RunWithoutStatusAgent = 0, like in scenario B), the PRI stops with the error message "Model/22x sync stop (no write quorum)". It then tries to restart and contact the STATUS again with the status line “Waiting for Status Agent”, with the information line “get permission to run stand-alone”. The PRI waits here until either the status agent is reachable again, or the registry key is changed manually, see scenario A).
When the SEC has RunWithoutStatusAgent = 0, it will try to contact the status agent before TCOSS continues to boot. KCS Monitor displays the status line “Waiting for Status Agent”, with the information line “get permission to run stand-alone”. The SEC waits here until either the status agent is reachable again, the process is shut down by the primary master, or the registry value is changed, see scenario A).
If both PRI and SEC have RunWithoutStatusAgent = 0, then the service will no longer be available as both machines will not boot standalone.
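The four combinations in scenario D) can be tabulated with a short Python function. This is an illustrative sketch of the text above. The function name and result keys are made up for this example, and the desync risk is flagged only for the combination that the text names explicitly (both values set to 1).

```python
def total_network_failure(pri_flag: int, sec_flag: int) -> dict:
    """Model scenario D) for the RunWithoutStatusAgent values on PRI and SEC."""
    pri_runs = pri_flag == 1
    sec_runs = sec_flag == 1
    return {
        "pri": "standalone" if pri_runs else "waiting for status agent",
        "sec": "standalone" if sec_runs else "waiting for status agent",
        # The service stays available if at least one machine runs standalone.
        "service_available": pri_runs or sec_runs,
        # Both machines running standalone at the same time may desynchronize.
        "desync_risk": pri_runs and sec_runs,
    }
```

The result shows the trade-off directly: a value of 0 on both machines avoids a desync state at the cost of availability, while a value of 1 on both keeps the service up at the risk of desynchronization.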
E) PRI fails, network fails
In this scenario, the SEC behaves as in scenario A), because from the point of view of the SEC machine the situation is effectively the same.
On RunWithoutStatusAgent = 1, the secondary master remains in the boot process until the first disk write occurs. This usually happens when the system startup is complete and the system message “TCOSS 7.xx.xx (using module TCOSS.EXE 7.xx.xx) started” is created. Then the secondary master tries to reach and update the status agent. If it fails, the secondary master continues and creates the warning message “single disk operation on disk 1 started without status agent update”. This behavior may lead to a “desynchronized” state later.
On RunWithoutStatusAgent = 0, the SEC will try to contact the status agent before TCOSS continues to boot. KCS Monitor displays the status line “Waiting for Status Agent”, with the information line “get permission to run stand-alone”. The SEC waits here until either the status agent is reachable again, the process is shut down by the primary master, or the registry value is changed, see scenario A).
F) SEC fails, network fails
Here, the PRI behaves as described in scenario D).
On RunWithoutStatusAgent = 1, the PRI continues running even though it can no longer reach the STATUS machine. As the disk on the SEC can no longer be updated, the PRI is now running in standalone mode.
On RunWithoutStatusAgent = 0, like in scenario B), the PRI stops with the error message "Model/22x sync stop (no write quorum)". It then tries to restart and contact the STATUS again with the status line “Waiting for Status Agent”, with the information line “get permission to run stand-alone”. The PRI waits here until either the status agent is reachable again, or the registry key is changed manually, see scenario A).
Level of Complexity
Moderate
Applies to
Product | Version | Build | Environment | Hardware |
---|---|---|---|---|
Kofax Communication Server | 10.1 and higher | | | |