VMware #NSX 6.3 Controller – Failure / Data loss behaviour

NSX 6.3 was released about a month ago. It just came to my attention that a little-known behaviour of the NSX-Controller changed with that release.

We have 3 NSX-Controllers, and at least 2 of them must be available for the cluster to remain fully functional. As we all hopefully know, the NSX-Controllers manage the tables (VTEP, MAC, ARP) for Layer-2 VXLAN operations (if we have selected hybrid or unicast as the replication mode).

NSX-Controller Failure

The following table tries to explain the impact of controller failures on Layer-2 switching functionality when we are in Unicast or Hybrid mode:

NSX-Controllers available | Cluster status     | NSX-Controller operations mode | Impact
3                         | healthy            | read/write                     | No impact
2                         | degraded           | read/write                     | No impact
1                         | degraded           | read only                      | New VMs or vMotioned VMs will have networking issues
0                         | headless(-chicken) | -                              | New VMs or vMotioned VMs will have networking issues
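The table can also be read as a simple lookup. The following is only a minimal sketch of that mapping, assuming the usual three-node cluster; the names `ControllerState`, `cluster_state` and `has_majority` are made up for this illustration and are not part of any NSX API.

```python
from typing import NamedTuple


class ControllerState(NamedTuple):
    cluster_status: str
    operations_mode: str
    impact: str


CLUSTER_SIZE = 3  # assumption: the three-node controller cluster from this post

NO_IMPACT = "No impact"
VM_IMPACT = "New VMs or vMotioned VMs will have networking issues"

# One entry per table row, keyed by the number of available controllers.
STATE_BY_AVAILABLE = {
    3: ControllerState("healthy", "read/write", NO_IMPACT),
    2: ControllerState("degraded", "read/write", NO_IMPACT),
    1: ControllerState("degraded", "read only", VM_IMPACT),
    # The table lists no operations mode for the headless case.
    0: ControllerState("headless", "", VM_IMPACT),
}


def cluster_state(available: int) -> ControllerState:
    """Return the table row for the given number of available controllers."""
    if available not in STATE_BY_AVAILABLE:
        raise ValueError(f"expected 0..{CLUSTER_SIZE} controllers, got {available}")
    return STATE_BY_AVAILABLE[available]


def has_majority(available: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """True if more than half of the cluster nodes are available."""
    return available >= cluster_size // 2 + 1
```

With three nodes the majority is two, which matches the point in the table where the cluster drops from read/write to read only.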

Make sure that at least 2 NSX-Controller nodes are available. As soon as you get into a degraded state, I would recommend switching DRS to partially automated mode, so that DRS only suggests vMotions instead of executing them automatically.

If only 1 or 0 controllers are left, make sure you do not move or power on VMs in the environment.
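These two recommendations boil down to a small decision rule. The sketch below is a hypothetical helper (the function name and the advice strings are mine, not taken from any VMware tool) that returns the actions suggested in this post for a given number of available controllers, again assuming a three-node cluster.

```python
from typing import List


def operational_advice(available: int, cluster_size: int = 3) -> List[str]:
    """Return this post's recommendations for the given controller count."""
    if not 0 <= available <= cluster_size:
        raise ValueError(f"expected 0..{cluster_size} controllers, got {available}")

    advice = []
    if available < cluster_size:
        # Cluster is no longer healthy: stop DRS from moving VMs on its own.
        advice.append("Switch DRS to partially automated mode")
    if available < 2:
        # Fewer than 2 controllers: new or moved VMs will have networking issues.
        advice.append("Do not power on or vMotion VMs")
    return advice


if __name__ == "__main__":
    for count in (3, 2, 1, 0):
        print(count, operational_advice(count))
```

A fully healthy cluster returns an empty list, so the result could be dropped into a monitoring check that only alerts when there is something to do.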

Pro tip: For proper NSX-Controller functionality, place the controllers on low-latency disks. Higher disk latency might lead to strange controller-cluster behaviour.

Comment: I need to verify the behaviour if only 1 NSX-Controller is available in 6.3 – any comments on that topic are appreciated.

UPDATE: Thanks to Robert Kloosterhuis I was made aware of a new technical preview within NSX 6.3 that deals with such a failure scenario: CDO (Controller Disconnected Operation) mode. Please check out this blog post for further information about the new control plane reliability mode, which will help to protect our VXLAN infrastructure in the future after a complete controller-cluster outage.

NSX-Controller data-loss

If we lose 1 NSX-Controller completely, the recommended way of fixing it has changed according to the official documentation.

Up to 6.2:

We recommend deleting the controller cluster when one or more of the controllers encounter catastrophic, unrecoverable errors or when one or more of the controller VMs become inaccessible and cannot be fixed.

6.3++:

We recommend replace the controller when one of the controllers encounter catastrophic, unrecoverable errors or when one of the controller VMs become inaccessible and cannot be fixed. You must first delete the broken controller, and then deploy a new controller.

The behaviour before 6.3 was not really what I would have expected, or done, in such a situation or operations design (that’s why we all read documentation and release notes, right?!). Luckily this has changed, so we can just remove the failed controller and deploy a new one. That’s quite an enhancement if you ask me. I haven’t verified this new behaviour yet, but I will do so as soon as a project brings me to an NSX 6.3 design.

