Virtually Curious: XtremIO

Does it sound familiar that all major issues always happen on a Friday?

Last Friday, we ran into an issue with random ESXi hosts: a host would show as disconnected in vCenter, and several VMs on it would show as orphaned or disconnected. In addition, some VMs would become unresponsive: the VM console would not work and RDP sessions would either freeze or get disconnected, even though the VM was still pingable.

The host would self-heal after a while and everything would return to normal, except for a few hung VMs that needed a reboot.

First stop, vCenter Events -

There were several "lost access to volume due to connectivity issues" messages.

Next stop, VMkernel Logs -

This is what was found in the VMkernel logs -

2018-10-12T14:11:19.584Z cpu2:390228)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2

2018-10-12T14:11:19.584Z cpu3:390230)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED

2018-10-12T14:11:19.584Z cpu3:390230)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2

This looks like a pretty straightforward error: the problem seems to be with the fnic driver for vmhba2.
But we were already running the latest supported drivers from Cisco, and there were no known issues reported for this driver version. There was a known issue in the previous version, and here is the bug report from Cisco - https://quickview.cloudapps.cisco.com/quickview/bug/CSCux90320
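To see whether the abort failures cluster on a single HBA, the warnings can be tallied per driver and adapter. The following is only a minimal sketch: the function name `count_abort_failures` and the regular expression are my own, written against the three log lines shown above, and may need adjusting for other driver or ESXi versions.

```python
import re
from collections import Counter

# Pattern written against the WARNING lines quoted in this post (an assumption,
# not an official log format specification).
ABORT_RE = re.compile(
    r"SCSILinuxAbortCommands:\d+: Failed, Driver (?P<driver>\w+), for (?P<hba>vmhba\d+)"
)


def count_abort_failures(lines):
    """Count failed SCSI abort warnings per (driver, HBA) pair."""
    counts = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            counts[(match["driver"], match["hba"])] += 1
    return counts


# The three VMkernel log lines from this incident.
SAMPLE = [
    "2018-10-12T14:11:19.584Z cpu2:390228)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2",
    "2018-10-12T14:11:19.584Z cpu3:390230)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED",
    "2018-10-12T14:11:19.584Z cpu3:390230)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2",
]

for (driver, hba), n in count_abort_failures(SAMPLE).items():
    print(f"{driver} {hba}: {n} failed aborts")
```

On a real host the same function could be fed the lines of a saved VMkernel log instead of `SAMPLE`.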

You can check the fnic drivers as follows -

Next stop was the Cisco Fabric Interconnects - Not a single error reported there.

Back to VMkernel Logs to see if more can be found -

There were numerous SCSI sense code errors. Here is how they looked -

2018-10-12T21:14:26.966Z cpu0:66222)ScsiDeviceIO: 2968: Cmd(0x439d44f24800) 0x89, CmdSN 0x245a3 from world 67160 to dev "naa.514f0c5595a00104" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
2018-10-12T21:23:00.565Z cpu24:66691)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125c from world 67651 to dev "naa.514f0c5595a00002" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-10-12T21:23:00.568Z cpu27:105197)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125d from world 67651 to dev "naa.514f0c5595a00003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

You can decipher these codes and their meanings here - http://www.t10.org/lists/1spc-lst.htm

I highly recommend this blog, which makes deciphering these codes a piece of cake -

https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

Here is the deciphered version -

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 Cmd(0x439d44f24800) 0x89

Type	Code	Name	Description
Host Status	[0x0]	OK	This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense Data.
Device Status	[0x2]	CHECK_CONDITION	This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
Plugin Status	[0x0]	GOOD	No error. (ESXi 5.x / 6.x only)
Sense Key	[0xE]	MISCOMPARE
Additional Sense Data	1D/00	MISCOMPARE DURING VERIFY OPERATION

OP Code	0x89	COMPARE AND WRITE

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0 Cmd(0x439d457db340) 0x85

Type	Code	Name	Description
Host Status	[0x0]	OK	This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense Data.
Device Status	[0x2]	CHECK_CONDITION	This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
Plugin Status	[0x0]	GOOD	No error. (ESXi 5.x / 6.x only)
Sense Key	[0x5]	ILLEGAL REQUEST
Additional Sense Data	20/00	INVALID COMMAND OPERATION CODE
OP Code	0x85	ATA PASS-THROUGH(16)
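The decoding above can also be scripted. This is a rough sketch rather than a full decoder: the lookup tables contain only the codes that appear in this post, the regular expression is written against the three ScsiDeviceIO lines quoted earlier, and the name `decode` is my own.

```python
import re

# Lookup tables limited to the values seen in this post; a complete decoder
# would need the full T10 lists.
HOST_STATUS = {0x0: "OK"}
DEVICE_STATUS = {0x2: "CHECK_CONDITION"}
PLUGIN_STATUS = {0x0: "GOOD"}
SENSE_KEY = {0x5: "ILLEGAL REQUEST", 0xE: "MISCOMPARE"}
ASC_ASCQ = {
    (0x1D, 0x00): "MISCOMPARE DURING VERIFY OPERATION",
    (0x20, 0x00): "INVALID COMMAND OPERATION CODE",
}
OPCODE = {0x85: "ATA PASS-THROUGH(16)", 0x89: "COMPARE AND WRITE"}

LINE_RE = re.compile(
    r'Cmd\(0x[0-9a-f]+\) (?P<op>0x[0-9a-f]+), .*?to dev "(?P<dev>[^"]+)" failed '
    r'H:(?P<h>0x[0-9a-f]+) D:(?P<d>0x[0-9a-f]+) P:(?P<p>0x[0-9a-f]+) '
    r'Valid sense data: (?P<key>0x[0-9a-f]+) (?P<asc>0x[0-9a-f]+) (?P<ascq>0x[0-9a-f]+)',
    re.IGNORECASE,
)


def decode(line):
    """Return a dict of decoded fields, or None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    # Convert every hex field to an integer; the device name stays a string.
    num = {k: int(v, 16) for k, v in m.groupdict().items() if k != "dev"}
    return {
        "device": m["dev"],
        "opcode": OPCODE.get(num["op"], hex(num["op"])),
        "host_status": HOST_STATUS.get(num["h"], hex(num["h"])),
        "device_status": DEVICE_STATUS.get(num["d"], hex(num["d"])),
        "plugin_status": PLUGIN_STATUS.get(num["p"], hex(num["p"])),
        "sense_key": SENSE_KEY.get(num["key"], hex(num["key"])),
        "asc_ascq": ASC_ASCQ.get(
            (num["asc"], num["ascq"]), f'{num["asc"]:02X}/{num["ascq"]:02X}'
        ),
    }


SAMPLE = (
    '2018-10-12T21:14:26.966Z cpu0:66222)ScsiDeviceIO: 2968: Cmd(0x439d44f24800) 0x89, '
    'CmdSN 0x245a3 from world 67160 to dev "naa.514f0c5595a00104" failed '
    'H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.'
)

for field, value in decode(SAMPLE).items():
    print(f"{field}: {value}")
```

Codes that are not in the tables are printed back in hex, so an unknown value is never silently mislabelled.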

This sounds very similar to the vSphere 6.0 ATS miscompare issue. Read more about it here.

We are running "vSphere ESXi 6.5 U2 8294253", so this did not apply in our case.

So far we had checked the vCenter events, the VMkernel logs and the Fabric Interconnects. The next stop was one level higher - the Nexus 5Ks, which serve as our Fibre Channel uplinks. In parallel, the Storage team was brought in to check for errors on the XtremIO array.

Storage Team did not see any errors on the XtremIO Array.

Cisco was brought in to look at the Nexus configuration. They inspected the FC ports on the Nexus 5Ks. Since there are 8 paths, we needed to inspect each one of them on Nexus A and Nexus B.

Side A looked clean, but something on Side B stood out immediately. Upon inspecting one of the FC interfaces, we noticed that the remaining B2B credits were ZERO. Here is how it looks -

Most of us had no idea what these B2B credits were. Here is a quick summary of what it means and does -

Buffer credits, also called buffer-to-buffer credits, are used as a flow control method by Fibre Channel technology and represent the number of frames a port can store.
Each time a port transmits a frame, that port’s BB Credit is decremented by one; for each R_RDY received, that port’s BB Credit is incremented by one. If the BB Credit is zero, the corresponding node cannot transmit until an R_RDY is received back.
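The credit mechanism described above can be illustrated with a toy model. This is only a sketch of the counting rule, not of a real FC stack; the class name `FcPort` and its methods are made up for illustration.

```python
class FcPort:
    """Toy model of buffer-to-buffer credit flow control on one FC link."""

    def __init__(self, bb_credit: int) -> None:
        self.max_credit = bb_credit  # credits agreed for this link
        self.bb_credit = bb_credit   # credits currently available
        self.sent = 0                # frames actually transmitted

    def transmit_frame(self) -> bool:
        """Send one frame if a credit is available; return False if stalled."""
        if self.bb_credit == 0:
            return False             # no credit left: the port must wait
        self.bb_credit -= 1
        self.sent += 1
        return True

    def receive_r_rdy(self) -> None:
        """The peer freed a buffer and returned one credit."""
        self.bb_credit = min(self.bb_credit + 1, self.max_credit)


port = FcPort(bb_credit=3)

# The peer never answers with R_RDY: after three frames the port stalls.
print([port.transmit_frame() for _ in range(5)])  # [True, True, True, False, False]

# A single R_RDY unblocks exactly one more frame.
port.receive_r_rdy()
print(port.transmit_frame())  # True
print(port.transmit_frame())  # False
```

A port whose peer stops returning R_RDY ends up in the stalled state permanently, which is the zero remaining credits condition seen on the Nexus interface.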

This is also referred to as the slow drain issue.

So in our case the FC port could receive but could not transmit anything because there were no credits left. Whenever an ESXi host tried to send data over this specific path, the storage could not receive it and the data was lost in space. FC is meant to be a lossless transport, so this loss would send the ESXi kernel into a panic state.

Next Cisco showed us how to find the loss events -

Just run this command on the Nexus and you will see all the credit loss events that occurred.

Here is how it looks on a switch with no loss -

nx5k-A# show process creditmon credit-loss-events

Credit Loss Events: NO

Here is how it looks on the switch with a problem FC port -

Everything on the physical layer of this port looked OK, so Cisco suspected that the XtremIO array was having trouble sending R_RDY (B2B credit) back, causing the switch to stop processing traffic.

The temporary solution was to simply disable this port so that no traffic was sent to it, avoiding further damage to the VMs as well as any escalations over the weekend ;-)

EMC was involved again and came back confirming that the Rx Power column showed a status of Fault on the suspected XtremIO port. This indicated that there may be an improperly seated cable or SFP, that the cable might be bad, or that the SFP on the switch might be bad.

We ended up replacing both the SFP and the cable.

This ended our 15-hour-long call, and nobody was called over the weekend!

I’m working on a plan to upgrade our existing VMware Stack and wanted to write a detailed post about all components involved and the order of upgrade.

Current State –

· Primary and Recovery site with XtremIO Arrays on both ends with physical RecoverPoint appliances for array based replication.

· vCenter Servers on both sites (with embedded Platform Services Controller) running on Windows Server 2012 R2 with an external SQL Database.

· ESXi 6.0.0, 3825889

· Site Recovery Manager (SRM) 6.1.1.13825

· Storage Replication Adapter (SRA) 2.2.0.3

· XtremIO (XIOS) 4.0.15-24, XMS 4.2.1

· RecoverPoint (RPA) 4.4.1

· Avamar 7.3

Future State –

· vCenter Server Appliance (VCSA) 6.5 U1f

· ESXi 6.5, 7388607

· SRM 6.5.1, 6014840

· SRA 2.2.0.3

· XIOS 6.0.1, XMS 6.0.1

· RecoverPoint (RPA) 5.1

· Avamar 7.5

There are numerous inter-dependencies to get to the future state.

· Avamar 7.3 is not compatible with vCenter 6.5

· RPA 4.4.1 is not compatible with ESXi 6.5

· Upgrading SRM directly from 6.0.x to 6.5 is not supported. You have to upgrade to 6.1.x before you upgrade to 6.5 (not required in my case since we are already running 6.1.1.x)

· Any vCenter upgrade will break SRM until both sides are on the same level. I have Array replication active in case of a disaster while upgrading SRM.

Prerequisites –

· Backup of vCenter Database on primary and recovery site.

· Backup of SRM vPostgres Database on primary and recovery site. (Explained in detail below)

· Primary and Recovery Site Platform Services Controller and vCenter server instances must be running.

Order of Upgrade –

· Upgrade Avamar to 7.5

· Upgrade RPA to 5.1, since RPA 4.4.1 is not compatible with ESXi 6.5

· Upgrade vCenter Server to VCSA 6.5 GA at primary site.

· Upgrade SRM to 6.5 at primary site. Note: SRM cannot be upgraded directly from 6.1.1 to 6.5.1

· Upgrade SRA at primary site – Not required since running on latest

· Upgrade vCenter Server to VCSA 6.5 GA at recovery site.

· Upgrade SRM to 6.5 at recovery site.
· Upgrade SRA at recovery site – Not required since running on latest.

· Upgrade vCenter from 6.5 GA to 6.5U1g at primary site.

· Upgrade SRM from 6.5 to 6.5.1 at primary site.

· Upgrade vCenter Server to VCSA 6.5U1g at recovery site.

· Upgrade SRM from 6.5 to 6.5.1 at recovery site.

· Verify connection between SRM. Verify Protection groups and recovery plans are valid.

· Upgrade ESXi to 6.5, 7388607 at recovery site.

· Upgrade ESXi to 6.5, 7388607 at primary site.

· Upgrade VMware Tools and then virtual hardware on Virtual Machines – Can be scheduled during the next available outage window.

· Upgrade XIOS and XMS to 6.0.1
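A planned order like the one above can be sanity-checked against the inter-dependencies before anything is touched. The sketch below is only an illustration: the step labels are shorthand I made up, and the constraint list is my reading of the dependencies and upgrade order in this post, not an official compatibility matrix.

```python
# Planned sequence, using made-up shorthand labels for the steps listed above.
ORDER = [
    "avamar-7.5",
    "rpa-5.1",
    "vcenter-6.5-primary",
    "srm-6.5-primary",
    "vcenter-6.5-recovery",
    "srm-6.5-recovery",
    "vcenter-6.5u1-primary",
    "srm-6.5.1-primary",
    "vcenter-6.5u1-recovery",
    "srm-6.5.1-recovery",
    "esxi-6.5-recovery",
    "esxi-6.5-primary",
    "xios-6.0.1",
]

# (first, second) means "first must be finished before second starts".
CONSTRAINTS = [
    ("avamar-7.5", "vcenter-6.5-primary"),       # Avamar 7.3 vs vCenter 6.5
    ("rpa-5.1", "esxi-6.5-recovery"),            # RPA 4.4.1 vs ESXi 6.5
    ("rpa-5.1", "esxi-6.5-primary"),
    ("vcenter-6.5-primary", "srm-6.5-primary"),  # vCenter before SRM per site
    ("vcenter-6.5-recovery", "srm-6.5-recovery"),
    ("srm-6.5-primary", "srm-6.5.1-primary"),    # no direct 6.1.1 to 6.5.1
    ("srm-6.5-recovery", "srm-6.5.1-recovery"),
]


def violations(order, constraints):
    """Return every constraint that the given order breaks."""
    position = {step: index for index, step in enumerate(order)}
    return [(a, b) for a, b in constraints if position[a] > position[b]]


broken = violations(ORDER, CONSTRAINTS)
print("order is consistent" if not broken else f"broken constraints: {broken}")
```

Moving a step in `ORDER` (for example putting an ESXi upgrade first) makes the corresponding pair show up in the returned list.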

Backup & Restore (if required) the SRM Embedded vPostgres Database -

1) Log into the system on which you installed Site Recovery Manager Server.

2) Stop the Site Recovery Manager service.

3) Navigate to the folder that contains the vPostgres commands.

4) If you installed Site Recovery Manager Server in the default location, you find the vPostgres commands in C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin.

5) Create a backup of the embedded vPostgres database by using the pg_dump command.

pg_dump -Fc --host 127.0.0.1 --port port_number --username=db_username srm_db > srm_backup_name

To create a backup you need the admin password. We did not have the admin password documented. Here is a link on how to reset the admin password - http://virtuallycurious.blogspot.com/2018/06/the-case-of-forgotten-site-recovery.html

You set the port number, username, and password for the embedded vPostgres database when you installed Site Recovery Manager. The default port number is 5678. The database name is srm_db and cannot be changed.

6) Start the Site Recovery Manager service.

7) Restore (if things go south) by using the pg_restore command

pg_restore -Fc --host 127.0.0.1 --port port_number --username=db_username --dbname=srm_db srm_backup_name
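If the backup has to be repeated on both sites, the two commands above can be wrapped in a small script. This is a hedged sketch: the function names are my own, the arguments simply mirror the pg_dump and pg_restore command lines shown above, and it assumes both tools are on the PATH (or that the script is run from the vPostgres bin folder) with the SRM service stopped.

```python
import subprocess


def pg_dump_cmd(port, username, db="srm_db", host="127.0.0.1"):
    """Build the pg_dump argument list used in step 5."""
    return ["pg_dump", "-Fc", "--host", host, "--port", str(port),
            f"--username={username}", db]


def pg_restore_cmd(port, username, backup, db="srm_db", host="127.0.0.1"):
    """Build the pg_restore argument list used in step 7."""
    return ["pg_restore", "-Fc", "--host", host, "--port", str(port),
            f"--username={username}", f"--dbname={db}", str(backup)]


def backup_srm_db(port, username, backup):
    """Run pg_dump and write its output to the backup file."""
    with open(backup, "wb") as out:
        # Equivalent of the shell redirect "> srm_backup_name".
        subprocess.run(pg_dump_cmd(port, username), stdout=out, check=True)


def restore_srm_db(port, username, backup):
    """Run pg_restore against the existing srm_db database."""
    subprocess.run(pg_restore_cmd(port, username, backup), check=True)


# Print the commands without running them; 5678 is the default port noted
# above, and "db_username" is a placeholder.
print(" ".join(pg_dump_cmd(5678, "db_username")))
print(" ".join(pg_restore_cmd(5678, "db_username", "srm_backup_name")))
```

Only the last two lines execute when the script is run as-is; `backup_srm_db` and `restore_srm_db` have to be called explicitly, and the tools will still prompt for the database password.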

References –

· VMware Product Interoperability Matrices - http://partnerweb.vmware.com/comp_guide2/sim/interop_matrix.php

· Update sequence for vSphere 6.5 and its compatible VMware products (2147289) - https://kb.vmware.com/s/article/2147289

· Backup and Restore the embedded vPostgres Database - https://docs.vmware.com/en/Site-Recovery-Manager/6.5/srm-install-config-6-5.pdf

· EMC Recoverpoint SRA compatibility Matrix - https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=sra&productid=39129

· Compatibility Matrix for SRM 6.5 - https://www.vmware.com/support/srm/srm-compat-matrix-6-5.html

· Order of upgrading vSphere and SRM components - https://docs.vmware.com/en/Site-Recovery-Manager/6.5/com.vmware.srm.install_config.doc/GUID-E7B47738-C63D-4A05-9A13-7C5FF20801A7.html

· Avamar Compatibility Matrix - https://support.emc.com/docu32263_Avamar-Compatibility-and-Interoperability-Matrix.pdf?language=en_US

Sunday, October 14, 2018

The one with lost access to volumes which turned out to be SAN congestion and Zero buffer-to-buffer credits (Cisco Slow-Drain) nightmare

Thursday, March 15, 2018

VMware Stack Upgrade