During the week I was at a customer site that is using vSAN 6.2 as the foundation for their upcoming virtual desktop infrastructure (seems like 2016 really, really is the year of VDI). I love vSAN and believe that at the moment it’s a great fit for many dedicated use cases within the virtualization field.
During some load and failover tests of the vSAN installation I noticed something regarding the IO queues within the vSAN stack and, to be honest, I am not quite sure what the risks, the mitigations and therefore the correct actions are.
We opened a VMware ticket in parallel, but if you have any more in-depth knowledge about this topic, please let me know, since this might be interesting to more people (the number of vSAN implementations is increasing).
Since the integration of flash/SSD in the performance/cache tier of vSAN, the performance is great compared to classical HDD-based solutions.
Going through the documents, I am still missing the consequences or impact when the adapter queue depth is smaller than the sum of the device queue depths, i.e. adapter queue depth < ∑ (device queue depth). This can happen when you use a single SAS controller and fully equip your ESXi host to reach the vSAN configuration maximums for disk groups, disks, etc.
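To make the comparison concrete, here is a minimal sketch that sums the device queue depths of a fully equipped host and checks them against the adapter queue depth. The layout (5 disk groups, each with 1 cache device and 7 capacity devices) matches the vSAN 6.2 per-host maximums; the queue depth values (64 per device, 1024 for the adapter) and the function names are purely illustrative assumptions and not the numbers from this customer setup.

```python
def total_device_queue_depth(disk_groups, capacity_per_group,
                             cache_depth, capacity_depth):
    """Sum of all device queue depths behind one adapter.

    Assumes one cache device per disk group.
    """
    cache_total = disk_groups * cache_depth
    capacity_total = disk_groups * capacity_per_group * capacity_depth
    return cache_total + capacity_total


def is_oversubscribed(adapter_depth, device_total):
    """True if the devices together can queue more IOs than the adapter."""
    return device_total > adapter_depth


# Illustrative values only (assumptions, not measured on the customer host)
device_total = total_device_queue_depth(
    disk_groups=5, capacity_per_group=7, cache_depth=64, capacity_depth=64
)
adapter_depth = 1024

print("sum of device queues:", device_total)
print("adapter queue depth: ", adapter_depth)
print("oversubscribed:      ", is_oversubscribed(adapter_depth, device_total))
```

With these assumed values the 40 devices could together queue far more outstanding IOs than the single adapter accepts, which is the situation I am asking about.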
What I noticed during the load tests was excessive kernel queuing (ESXTOP: K/QAVG > 5ms) during IO peaks. I have seen in the past that when kernel queuing is excessive, the whole ESXi host reacts a little bit ‘sluggish’.
I am not sure whether it is necessary to take action here or what the risk might be if we leave the setup as it is (the design decision to use a single SCSI controller is a non-negotiable constraint).
I can imagine that reducing the queue depth of the capacity devices to 54 might be suitable, so that the sum of the device queues does not reach the adapter limit. As a consequence, the queuing would take place not within ESXi but within the guest OS of the VMs, which moves stress away from the ESXi IO stack.
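The arithmetic behind such a cap can be sketched as follows: keep the cache devices at their queue depth and divide what is left of the adapter queue evenly across the capacity devices. The function name and all input values are hypothetical and chosen for illustration; they are not the values that lead to the 54 mentioned above.

```python
def max_capacity_queue_depth(adapter_depth, disk_groups,
                             capacity_per_group, cache_depth):
    """Largest per-capacity-device queue depth so that the sum of all
    device queues stays within the adapter queue depth.

    Assumes one cache device per disk group whose depth is left unchanged.
    """
    remaining = adapter_depth - disk_groups * cache_depth
    if remaining <= 0:
        raise ValueError("cache devices alone exceed the adapter queue depth")
    return remaining // (disk_groups * capacity_per_group)


# Illustrative values only (assumptions, not taken from the customer host)
cap = max_capacity_queue_depth(
    adapter_depth=1024, disk_groups=5, capacity_per_group=7, cache_depth=64
)
print("max queue depth per capacity device:", cap)
```

Whether capping the device queues this way is actually the right mitigation is exactly the open question; the sketch only shows how a value would be derived once the real adapter and device queue depths are known.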
But that’s only one option, I guess, and I am really looking forward to hearing from one of the vSAN deep-tech experts out there, or about your experience with this topic. If any further findings show up, I will update the post.