Hello Bob,
I appreciate you taking the time to write such an elaborate reply. I've been using vSAN for a long time and a lot of frustration has built up. It has gotten a lot better since 6.6, I must admit.
"As you have access to multiple sites that have/haven't hit this alert constantly I would imagine you have built up an extensive amount of knowledge with regard to what each set-up has in common at a granular network configuration level"
As we believe in standardisation, our network setups are all the same. They are either Cisco UCS based (where vSAN traffic stays within the FI and does not go north to the core) or, if not, they use dedicated vSAN switches. Customers who "rolled their own network" I leave out of consideration, as you never know what lurks around on those nets.
So going back to standardised networks: these errors pop up randomly, and we never see issues with vSAN in general on those sites. I was never able to find a pattern in why some customers see them sometimes and others don't. I also never see them in back-to-back ROBOs, so it must have something to do with switches. When such an error is triggered and you dive into the WebGUI, you see that Node 2 could not talk to Node 7, Node 5 could not talk to Node 3, and whatnot. Then you press "re-test" and poof, all errors are gone. Everything talks to everything else just happily.
"In fact, you can disable several VSAN alarms as VMware has not shown any interest in solving these glitches..."
I say that because it has been happening for years. For years. And so many installations are affected (and so many are not). And it's still happening. It gives the impression that these things do not have priority.
"Which tests are you referring to?"
The VM creation test. Even on a perfectly running cluster with no issues, it regularly fails. Again, for some it works, for others it doesn't. I never could find a pattern.
The multicast test has been a dog too. Some customers are on older 6.x releases (for various reasons; VxRail has only recently become available with 6.5), and even though I tell them to ignore a failed MC test (I tell them not to even run it), they will still open a ticket saying "we have a problem with MC, the test does not complete without errors", and then I explain again that the test is broken. Then they want me to open an SR to get the test fixed, which I know will not happen. Try explaining that... Such a drag.
"If you can provide some other examples of health checks you consider broken"
- MTU check (ping with large packet size). It has always been broken: MTUs don't suddenly change, and after a manual re-test all is good again all of a sudden.
- Disk Balance (clusters with a lot of stuff happening are often imbalanced to some degree, so raising the warning percentage a lot would help)
- Site latency health (I never figured out why this alert trips sometimes; it's quite rare, but it does happen even though latency was fine and the cluster hums along nicely)
And not broken, but simply annoying:
- Customer Experience Improvement Program
- Build Recommendation Engine
These 5 tests we disable by default as they are just a pain in the butt.
And in general, the tests concerning controllers and their firmware. We have customers who use controllers that were certified in 5.5.x and, since 6.x, suddenly are not anymore. Take the LSI 9207-8i for example. Since 6.x it's off the HCL, so vCenter complains all the time. But its identical twin, the HP H220, is still on the HCL. It's the same friggin card. There are Dell and Fujitsu cards that are OEM clones of this card, and they are on the HCL too.
I had a discussion about this with Duncan a while back. The people responsible, mostly HW vendors like LSI, simply don't bother testing existing validated hardware on newer vSAN versions. Do you honestly think that customers will either "not upgrade" or "rip out and replace all their cards in all nodes"? Hell no. And the cards work fine. If an H220 works fine, a 9207-8i does too. And customers give me heat over it.
And the worst thing: it will happen again with the next major release of vSAN. Some hardware will not be on the HCL anymore, or it will take a while before it appears on the HCL (with a complaining vCenter in the meantime), and existing customers can stick it...
vSAN major versions appear(ed) faster than hardware lifecycles turn over. Hardware is supposed to last 3 to 5 years at most customers, so they will run into this issue sooner or later. That is why we turn the HW compatibility alerts off and tell customers that "that yellow triangle" is ok.