It’s been a month since NSX 6.3 was released. It just came to my attention that a behaviour of the NSX-Controller I had not been aware of changed with that release.
We have 3 NSX-Controllers, and at least 2 of them must be available for the cluster to remain completely functional. As we all hopefully know, the NSX-Controller manages the tables (VTEP, MAC, ARP) for Layer-2 VXLAN operations (if we have selected hybrid or unicast as the replication mode).
The following table tries to explain the impact of controller failures on Layer-2 switching functionality when we are in unicast or hybrid mode:
| NSX-Controllers available | Cluster status | NSX-Controller operations mode | Impact |
|---|---|---|---|
| 1 | degraded | read only | New VMs or vMotioned VMs will have networking issues |
| 0 | headless(-chicken) | – | New VMs or vMotioned VMs will have networking issues |
Make sure that at least 2 NSX-Controller nodes are available. As soon as you get into a degraded state, I would recommend changing DRS to partially automated mode.
If only 1 or 0 controllers are left, make sure you do not move or power on VMs in the environment.
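The table and the recommendations above can be condensed into a small lookup. This is only a sketch and not an NSX API: the names `ClusterState` and `controller_cluster_state` are made up for illustration, and the "normal" / "read-write" labels for two or more controllers are my assumption, since the table only lists the rows for 1 and 0 controllers.

```python
from typing import NamedTuple


class ClusterState(NamedTuple):
    status: str             # cluster status as named in the table
    mode: str               # NSX-Controller operations mode
    safe_to_move_vms: bool  # may we power on or vMotion VMs?


def controller_cluster_state(available: int) -> ClusterState:
    """Map the number of reachable controllers (out of 3) to the expected state."""
    if not 0 <= available <= 3:
        raise ValueError("expected between 0 and 3 available controllers")
    if available >= 2:
        # Assumption: labels for the healthy case are not taken from the table.
        return ClusterState("normal", "read-write", True)
    if available == 1:
        return ClusterState("degraded", "read only", False)
    return ClusterState("headless", "-", False)


if __name__ == "__main__":
    for count in (3, 2, 1, 0):
        print(count, controller_cluster_state(count))
```

A monitoring script could use the `safe_to_move_vms` flag as the trigger for switching DRS to partially automated mode.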
Pro tip: For proper NSX-Controller functionality, place the controllers on low-latency disks. Higher disk latency might lead to strange controller-cluster behaviour.
Comment: I need to verify the behaviour if only 1 NSX-Controller is available in 6.3 – any comments on that topic are appreciated.
UPDATE: Thanks to Robert Kloosterhuis I was made aware of a new technical preview within NSX 6.3 dealing with such a failure scenario: CDO (Controller Disconnected Operation). Please check out this blogpost for further information about the new control plane reliability mode, which should help to protect our VXLAN infrastructure after a complete controller-cluster outage.
If we lose 1 NSX-Controller completely, the recommended way of fixing it has changed according to the official documentation.
Up to 6.2:
We recommend deleting the controller cluster when one or more of the controllers encounter catastrophic, unrecoverable errors or when one or more of the controller VMs become inaccessible and cannot be fixed.
From 6.3:
We recommend replacing the controller when one of the controllers encounter catastrophic, unrecoverable errors or when one of the controller VMs become inaccessible and cannot be fixed. You must first delete the broken controller, and then deploy a new controller.
The behaviour before 6.3 was not really what I expected, nor what I would have done in such a situation or operations design (that’s why we all read documentation and release notes, right?!). Luckily this has changed, so we can just remove the failed controller and deploy a new one. That’s quite an enhancement if you ask me. I haven’t verified this new behaviour yet, but I will do so as soon as a project brings me to an NSX 6.3 design.
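For a runbook, the difference between the two documented recommendations could be captured roughly like this. It is a sketch based purely on the quoted documentation, not something I have tested: the function name `controller_recovery_steps` is hypothetical, and the redeploy step for releases up to 6.2 is my assumption, because the quote only mentions deleting the cluster.

```python
from typing import List


def controller_recovery_steps(nsx_version: str) -> List[str]:
    """Return the documented recovery steps for a lost controller, by NSX version."""
    major, minor = (int(part) for part in nsx_version.split(".")[:2])
    if (major, minor) >= (6, 3):
        # From 6.3: replace only the broken controller.
        return ["delete the broken controller", "deploy a new controller"]
    # Up to 6.2: the whole cluster goes; the redeploy step is assumed.
    return ["delete the controller cluster", "deploy a new controller cluster"]


if __name__ == "__main__":
    for version in ("6.2.4", "6.3.0"):
        print(version, "->", controller_recovery_steps(version))
```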