Sunday, October 14, 2018

The one with lost access to volumes which turned out to be SAN congestion and Zero buffer-to-buffer credits (Cisco Slow-Drain) nightmare

Does it sound familiar that all major issues always happen on a Friday?

Past Friday, we ran into an issue with random ESXi Hosts where we would see Host disconnected in vCenter and several VMs on it as orphaned or disconnected. In addition some VMs would be unresponsive, VM console would not work and RDP sessions would either freeze or get disconnected though the VM was still pingable.

The Host would self heal after a while and everything would return to normal with few VMs hung that would require a reboot to fix them.

First stop, vCenter Events -

There were several "lost access to volume due to connectivity issues" messages.




Next stop, VMKernel Logs - 

This is what was found in the VMkernel logs -

018-10-12T14:11:19.584Z cpu2:390228)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2
2018-10-12T14:11:19.584Z cpu3:390230)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED

2018-10-12T14:11:19.584Z cpu3:390230)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2

Seems like a pretty straightforward error. The problem seems to be with the fnic driver for vmhba2.
But we are already running the latest supported drivers from Cisco and there is no know issues reported for this driver. There was a know issue in the previous version and here is the KB from Cisco - https://quickview.cloudapps.cisco.com/quickview/bug/CSCux90320

You can check the fnic drivers as follows -


Next stop was the Cisco Fabric Interconnects - Not a single error reported there.

Back to VMkernel Logs to see if more can be found -

There were numerous SCSI sense code errors. Here is how they looked -

2018-10-12T21:14:26.966Z cpu0:66222)ScsiDeviceIO: 2968: Cmd(0x439d44f24800) 0x89, CmdSN 0x245a3 from world 67160 to dev "naa.514f0c5595a00104" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
2018-10-12T21:23:00.565Z cpu24:66691)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125c from world 67651 to dev "naa.514f0c5595a00002" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-10-12T21:23:00.568Z cpu27:105197)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125d from world 67651 to dev "naa.514f0c5595a00003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

You can decipher these codes and their meanings here - http://www.t10.org/lists/1spc-lst.htm

Highly recommend using this Blog which makes it a piece of cake to decipher these codes -

https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

Here is the deciphered version -

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 Cmd(0x439d44f24800) 0x89

Type
Code
Name
Description
Host Status
[0x0]
OK
This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense Data.
Device Status
[0x2]
CHECK_CONDITION
This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
Plugin Status
[0x0]
GOOD
No error. (ESXi 5.x / 6.x only)
Sense Key
[0xE]
MISCOMPARE
Additional Sense Data
1D/00
MISCOMPARE DURING VERIFY OPERATION
OP Code
0x89
COMPARE AND WRITE

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0 Cmd(0x439d457db340) 0x85


Type
Code
Name
Description
Host Status
[0x0]
OK
This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense Data.
Device Status
[0x2]
CHECK_CONDITION
This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
Plugin Status
[0x0]
GOOD
No error. (ESXi 5.x / 6.x only)
Sense Key
[0x5]
ILLEGAL REQUEST
Additional Sense Data
20/00
INVALID COMMAND OPERATION CODE
OP Code
0x85
ATA PASS-THROUGH(16)

This sounds very similar to the vSphere 6.0 ATS miscompare issue. Read more about it here.

We are running "vSphere ESXi 6.5 U2 8294253" so this is not valid in our case.

So far we had checked the vSphere logs, VMkernel logs, Fabric Interconnects. Next stop would be one level higher - the Nexus 5Ks which are being used as our Fiber channel uplinks. In parallel the Storage team was involved to see if any errors are seen on the XtremIO Array.

Storage Team did not see any errors on the XtremIO Array.

Cisco was involved to look at the Nexus configuration. They inspected the FC ports on the Nexus 5K. Since there are 8 paths we need to inspect each one of them on Nexus A and Nexus B.

Side A looked clean but something on Side B stood right in the face. Upon inspecting one of the FC interface we noticed that the remaining B2B credits were ZERO. Here is how it looks -



Most of us had no idea what these B2B credits were. Here is a quick summary of what it means and does -

Buffer credits, also called buffer-to-buffer credits are used as a flow control method by Fibre Channel technology and represent the number of frames a port can store.
Each time a port transmits a frame that port’s BB Credit is decremented by one; for each R RDY received, that port’s BB Credit is incremented by one. If the BB Credit is zero the corresponding node cannot transmit until an R_RDY is received back.


This is also referred to as the slow drain issue

So in our case the FC could receive but could not transmit anything because there were no credits left. When ever an ESXi Host tried to send Data and used this specific path, storage could not receive that data and the data was lost in space. As we know FC is a lossless data delivery and hence the data loss would send the ESXi Kernel in a panic state. 

Next Cisco showed us how to find the loss events - 

Just run this command on the Nexus and you would see all the events that occurred.

Here is how it looks on a switch with no loss - 

nx5k-A# show process creditmon credit-loss-events

        Credit Loss Events: NO

Here is how it looks on the switch with a problem FC port - 


Everything on the physical layer of this port looked OK, so Cisco suspected that the XtremIO Storage is having an issue return/sending back an R_RDY B2B credit, causing the switch to stop processing traffic. 

Temporary solution was to just disable this port so no traffic is sent to it and thus avoiding further damage to the VMs as well as any escalations over the weekend ;-) 

EMC was involved again and they came back confirming that the Rx Power column is showing a status of Fault on the suspected XtremIO port. This indicated that there maybe an improperly seated cable/SFP, the cable might be bad or that the SFP on the switch is bad. 

We ended up replacing both the SFP and the cable.

This ended our 15 hour long call and nobody was called over the weekend !

No comments:

Post a Comment