
Sunday, October 14, 2018

The one with lost access to volumes which turned out to be SAN congestion and Zero buffer-to-buffer credits (Cisco Slow-Drain) nightmare

Does it sound familiar that all major issues always happen on a Friday?

This past Friday, we ran into an issue with random ESXi hosts: the host would show as disconnected in vCenter, with several VMs on it orphaned or disconnected. In addition, some VMs would be unresponsive, the VM console would not work, and RDP sessions would either freeze or get disconnected even though the VM was still pingable.

The host would self-heal after a while and everything would return to normal, with a few hung VMs that required a reboot to fix them.

First stop, vCenter Events -

There were several "lost access to volume due to connectivity issues" messages.




Next stop, VMKernel Logs - 

This is what was found in the VMkernel logs -

2018-10-12T14:11:19.584Z cpu2:390228)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2
2018-10-12T14:11:19.584Z cpu3:390230)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED

2018-10-12T14:11:19.584Z cpu3:390230)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2

Seems like a pretty straightforward error: the problem appears to be with the fnic driver for vmhba2.
But we were already running the latest supported drivers from Cisco, and there are no known issues reported for this driver version. There was a known issue in the previous version; here is the bug report from Cisco - https://quickview.cloudapps.cisco.com/quickview/bug/CSCux90320

You can check the fnic drivers as follows -
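(A quick sketch from the ESXi shell; vmhba2 is the adapter from the log above, and the grep patterns are only there to trim the output.)

# List the installed fnic driver VIB and its version
esxcli software vib list | grep -i fnic

# Show the loaded fnic module details, including the version string
esxcli system module get -m fnic

# Confirm which HBAs are claimed by the fnic driver
esxcfg-scsidevs -a | grep -i fnic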


Next stop was the Cisco Fabric Interconnects - Not a single error reported there.

Back to VMkernel Logs to see if more can be found -

There were numerous SCSI sense code errors. Here is how they looked -

2018-10-12T21:14:26.966Z cpu0:66222)ScsiDeviceIO: 2968: Cmd(0x439d44f24800) 0x89, CmdSN 0x245a3 from world 67160 to dev "naa.514f0c5595a00104" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
2018-10-12T21:23:00.565Z cpu24:66691)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125c from world 67651 to dev "naa.514f0c5595a00002" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-10-12T21:23:00.568Z cpu27:105197)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125d from world 67651 to dev "naa.514f0c5595a00003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

You can decipher these codes and their meanings here - http://www.t10.org/lists/1spc-lst.htm

I highly recommend this blog, which makes it a piece of cake to decipher these codes -

https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

Here is the deciphered version -

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 Cmd(0x439d44f24800) 0x89

  • Host Status [0x0] OK - This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense data.
  • Device Status [0x2] CHECK_CONDITION - This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
  • Plugin Status [0x0] GOOD - No error. (ESXi 5.x / 6.x only)
  • Sense Key [0xE] MISCOMPARE
  • Additional Sense Data 1D/00 - MISCOMPARE DURING VERIFY OPERATION
  • OP Code 0x89 - COMPARE AND WRITE

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0 Cmd(0x439d457db340) 0x85


  • Host Status [0x0] OK - This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense data.
  • Device Status [0x2] CHECK_CONDITION - This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
  • Plugin Status [0x0] GOOD - No error. (ESXi 5.x / 6.x only)
  • Sense Key [0x5] ILLEGAL REQUEST
  • Additional Sense Data 20/00 - INVALID COMMAND OPERATION CODE
  • OP Code 0x85 - ATA PASS-THROUGH(16)

This sounds very similar to the vSphere 6.0 ATS miscompare issue. Read more about it here.

We are running vSphere ESXi 6.5 U2 (build 8294253), so this issue does not apply in our case.

So far we had checked the vCenter events, the VMkernel logs and the Fabric Interconnects. The next stop would be one level higher - the Nexus 5Ks that serve as our Fibre Channel uplinks. In parallel, the storage team was engaged to see if any errors were visible on the XtremIO array.

Storage Team did not see any errors on the XtremIO Array.

Cisco was brought in to look at the Nexus configuration and inspect the FC ports on the Nexus 5Ks. Since there are 8 paths, each one had to be inspected on both Nexus A and Nexus B.

Side A looked clean, but something on Side B stood right out. Upon inspecting one of the FC interfaces we noticed that the remaining B2B credits were ZERO. Here is how it looks -



Most of us had no idea what these B2B credits were. Here is a quick summary of what they mean and do -

Buffer credits, also called buffer-to-buffer credits, are used as a flow control method by Fibre Channel technology and represent the number of frames a port can store.
Each time a port transmits a frame, that port's BB credit is decremented by one; for each R_RDY received, that port's BB credit is incremented by one. If the BB credit is zero, the corresponding node cannot transmit until an R_RDY is received back.
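If you want to check this on your own switch, the per-port credit counters show up in the interface detail (a sketch; fc2/3 is just a placeholder for the affected FC port - look for the "B2B credit remaining" lines in the output):

nx5k-B# show interface fc2/3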


This is also referred to as the slow-drain issue.

So in our case the FC port could receive but could not transmit anything because there were no credits left. Whenever an ESXi host tried to send data down this specific path, the storage could not receive it and the data was lost in space. As we know, FC is designed to be a lossless transport, hence the data loss would send the ESXi kernel into a panic state.

Next, Cisco showed us how to find the credit-loss events -

Just run this command on the Nexus and you will see all the events that have occurred.

Here is how it looks on a switch with no loss - 

nx5k-A# show process creditmon credit-loss-events

        Credit Loss Events: NO

Here is how it looks on the switch with a problem FC port - 


Everything on the physical layer of this port looked OK, so Cisco suspected that the XtremIO storage was having an issue returning an R_RDY B2B credit, causing the switch to stop processing traffic.

The temporary solution was to simply disable this port so no traffic would be sent down it, thus avoiding further damage to the VMs as well as any escalations over the weekend ;-)
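For reference, taking the port out of service on the Nexus is the usual interface shutdown (a sketch; fc2/3 is again just a placeholder for the problem port):

nx5k-B# configure terminal
nx5k-B(config)# interface fc2/3
nx5k-B(config-if)# shutdown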

EMC was involved again and came back confirming that the Rx Power column was showing a status of Fault on the suspected XtremIO port. This indicated that there may be an improperly seated cable/SFP, a bad cable, or a bad SFP on the switch.

We ended up replacing both the SFP and the cable.

This ended our 15-hour-long call, and nobody was called over the weekend!

Sunday, August 26, 2018

Curious Case of Network Latency

It all started with a colleague noticing ping response times higher than usual on a RHEL VM. The normal value is considered to be below 0.5 ms; we were seeing values of up to 8-10 ms.


We were comparing these values with another VM on the same subnet.

So I tried the basic troubleshooting -

  1. Reboot VM.
  2. Move VM to another Host.
  3. Move VM to same Host as the normal VM.
  4. Remove network adapter and add new adapter and reconfigure the IP.
  5. Use a new IP.
  6. Block port on the Port Group and unblock it.
  7. The network team was involved.
  8. Clear the ARP - this is an internal joke ;-) 
  9. Network team tried to ping from the Nexus 5K. Same response time to this specific VM.
  10. Traced MACs to make sure there are no duplicates, traced vNICs on UCS and vmnics on the ESXi Host.
  11. Decided to contact Cisco, VMware, RHEL, etc.
We often miss the little details and always think it's a bigger issue :)

So I decided to start from the little details.

The first step was to download the .vmx files for the problem VM and the VM that was responding fine.

Using Notepad++, I compared the VMX files line by line. As expected there were a lot of differences - virtual hardware version, number of drives, CPU, memory, UUID, etc. - but one stood right out at me: CPU Latency Sensitivity.


The CPU Latency sensitivity was set to "low" on the problem VM.

Hmmm, why would anyone change the CPU latency sensitivity setting? And to "low" of all values? If it had to be changed at all, it would be set to "high", not to "low". Obviously whoever changed this (by accident or deliberately) did not know what they were doing.

CPU latency sensitivity was first introduced in vSphere 5.5 with a few caveats:
  1. 100% of the allocated memory must be reserved.
  2. vCPUs are given exclusive access to physical CPUs.
  3. Network frames will not be coalesced when enabled.
Read more about this feature here

So back to our problem, how and where do we change this setting? 

There are two ways to do it - good ol' PowerCLI or the vSphere Web Client.

Note: You can change the setting while the VM is powered ON but it will take effect on the next reboot.

PowerCLI:

Here is a one liner to find out if other VMs in your environment have these Advanced settings changed - 

Get-VM * | Get-AdvancedSetting -Name sched.cpu.latencysensitivity | ?{$_.Value -eq 'low' -or $_.Value -eq 'High' -or $_.Value -eq 'Medium'} | select Entity, Value | ft -AutoSize

And here is how to change it to the desired value.

Get-VM vm_name | Get-AdvancedSetting -Name sched.cpu.latencysensitivity | Set-AdvancedSetting -Value Normal

Note: With PowerCLI the setting is changed in the VMX file and will not be visible in the GUI until you reboot the VM.

vSphere Web Client: 

This setting can be found under "Edit Settings > VM Options > Advanced".


Once these changes were in place and the VM rebooted, the ping response time was back to below 0.5 ms.

Happy days !!

Wednesday, August 22, 2018

VMware VirtualCenter Operational Dashboard

There is a not-so-popular feature in vCenter that gives you a lot of details and stats. It's called the VMware VirtualCenter Operational Dashboard.

Browse to the following URL and replace the vCenter name. This needs authentication. 

  https://vCENTER_SERVER_FQDN/vod/index.html
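If you prefer the command line, the same page can be fetched with curl by passing vCenter credentials (a sketch; the SSO account shown is just an example, and -k skips certificate validation):

curl -k -u 'administrator@vsphere.local' https://vCENTER_SERVER_FQDN/vod/index.html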

On the Home page, you can get detailed stats about - 
  • vCenter Uptime
  • Virtual Machine & Host Operations (invocations/min)
  • Client Communication 
  • Agent Communication
There are 6 detail pages available.

Example - if you click on "Host Status" you will get detailed info about the hostname, IPs, MOID, last heartbeat time, etc.



Thursday, March 29, 2018

How to find the disk size for an Avamar BMR restore

We had a scenario where a BMR (Bare Metal Restore) system state restore was required.

A skeleton VM was created with drives matching the original server and booted from the Avamar BMR ISO. The restore was kicked off and failed at around 90%. Why?

Because the disk sizes did not match. The VM had several disks; some were striped volumes, some spanned volumes. When the restore skeleton VM was created, the admins just matched the volumes to the old VM. Avamar has no visibility into how the disks are laid out within the OS.

Example of how the disks were laid out in VMware and inside the OS - 












So the question is how do you find the disk size if the VM is down and you need a BMR restore?

Avamar has a very nice avtar command-line utility. I could not find an official document from EMC, but there are bits and pieces of information on the web.

avtar.exe --help will give you a wide range of options.

Steps to find the drive size for a BMR restore -
  • On your Windows workstation, install the Avamar client software and do not register it with the Avamar console.
  • The default installation path will be - C:\Program Files\avs\
  • Run the following command 
C:\Program Files\avs\bin>avtar.exe -x --server=IP_of_Avamar_Server --id=Username --ap=Password --path=/clients/servername_FQDN --labelnum=label_number_of_the_backup_to_be_restored --internal --target=.\tmp\ .system_info

--target=.\tmp will create a tmp directory under avs\bin and the output of the command will be under this directory.
  • There will be numerous XML files under \tmp.
  • The useful files are CriticalVolumesMapping.xml and partitiontables.xml
CriticalVolumesMapping.xml will give you the details of how the disks are laid out within the OS. 

Example - 
<VolumeMappings Version="2.0">
<Volume DiskNumbers="0" SubwidIdx="1" DisplayName="c:\" UniqueID="\\?\Volume{6e1e483e-8e40-11e1-a235-806e6f6e6963}\"/>
<Volume DiskNumbers="8,9" SubwidIdx="2" DisplayName="i:\" UniqueID="\\?\Volume{106195a0-5f22-11e4-b3e6-005056bc0037}\"/>
<Volume DiskNumbers="3,5,6,4" SubwidIdx="3" DisplayName="e:\" UniqueID="\\?\Volume{69f9f6c7-8e4f-11e1-a46f-005056bc0037}\"/>
<Volume DiskNumbers="1" SubwidIdx="4" DisplayName="g:\" UniqueID="\\?\Volume{a2f72c19-d096-11e5-a54e-005056bc0037}\"/>
<Volume DiskNumbers="0" SubwidIdx="5" DisplayName="\\?\volume{6e1e483d-8e40-11e1-a235-806e6f6e6963}\" UniqueID="\\?\Volume{6e1e483d-8e40-11e1-a235-806e6f6e6963}\"/>
</VolumeMappings>

We can clearly see that -

Disks 3,4,5,6 make up Logical volume E:
Disks 8,9 make up Logical volume I:

partitiontables.xml will provide you with the disk sizes

Example - 

<PhysicalDisk NumPartitions="4" DiskSize_bytes="274872407040" PartioningScheme="MBR" MBRSignature="1720029347" DiskType="Fixed" DiskNumber="8" DiskSize_Gbytes="255" SectorSize_bytes="512">
<PartitionList>
<Partition Size_bytes="274876826112" Start_bytes="32256" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="0" Size_Gbytes="255" Type="Alternate Linux swap" Bootable="false"/>
<Partition Size_bytes="0" Start_bytes="0" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="1" Size_Gbytes="0" Type="Empty" Bootable="false"/>
<Partition Size_bytes="0" Start_bytes="0" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="2" Size_Gbytes="0" Type="Empty" Bootable="false"/>
<Partition Size_bytes="0" Start_bytes="0" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="3" Size_Gbytes="0" Type="Empty" Bootable="false"/>
</PartitionList>
</PhysicalDisk>

<PhysicalDisk NumPartitions="4" DiskSize_bytes="274872407040" PartioningScheme="MBR" MBRSignature="1720029346" DiskType="Fixed" DiskNumber="9" DiskSize_Gbytes="255" SectorSize_bytes="512">
<PartitionList>
<Partition Size_bytes="274876826112" Start_bytes="32256" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="0" Size_Gbytes="255" Type="Alternate Linux swap" Bootable="false"/>
<Partition Size_bytes="0" Start_bytes="0" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="1" Size_Gbytes="0" Type="Empty" Bootable="false"/>
<Partition Size_bytes="0" Start_bytes="0" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="2" Size_Gbytes="0" Type="Empty" Bootable="false"/>
<Partition Size_bytes="0" Start_bytes="0" partStyle="MBR" SerialNumber_dec="0" SerialNumber_hex="0" PartitionNumber="3" Size_Gbytes="0" Type="Empty" Bootable="false"/>

In the above scenario, disks 8 and 9 together make up volume I:, and from the partitiontables.xml file you know the total size of I: will be 510 GB. When creating the skeleton VM you will create a 510 GB drive.
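If you just want the per-disk sizes at a glance, a quick filter on partitiontables.xml does the trick (a sketch, run from the avs\bin directory used above; it prints the PhysicalDisk lines, which carry the DiskNumber and DiskSize_Gbytes attributes):

C:\Program Files\avs\bin>findstr /i "DiskNumber= DiskSize_Gbytes=" tmp\partitiontables.xml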

Once you create the exact size and number of drives required, Avamar will perform a restore without any hiccups. 

Problem solved !

Sunday, November 26, 2017

Sharing NFS-backed Volumes Between Containers

vSphere Integrated Containers supports two types of volumes, each of which has different characteristics.
  • VMFS virtual disks (VMDKs), mounted as formatted disks directly on container VMs. These volumes are supported on multiple vSphere datastore types, including NFS, iSCSI and VMware vSAN. They are thin, lazy zeroed disks. 
  • NFS shared volumes. These volumes are distinct from a block-level VMDK on an NFS datastore. They are Linux guest-level mounts of an NFS file-system share.
VMDKs are locked while a container VM is running and other containers cannot share them.

NFS volumes on the other hand are useful for scenarios where two containers need read-write access to the same volume.

To use container volumes, you must first declare or create a volume store at the time of VCH creation.

You must use the vic-machine create --volume-store option to create a volume store at the time of VCH creation.

You can add a volume store to an existing VCH by using the vic-machine configure --volume-store option. If you are adding volume stores to a VCH that already has one or more volume stores, you must specify each existing volume store in a separate instance of --volume-store.

Note: If you do not specify a volume store, no volume store is created by default and container developers cannot create or run containers that use volumes.

In my example, I have assigned a whole vSphere datastore as a volume store and would like to add a new NFS volume store to the VCH. The syntax is as follows - 

$ vic-machine-operating_system configure \
    --target vcenter_server_username:password@vcenter_server_address \
    --thumbprint certificate_thumbprint --id vch_id \
    --volume-store datastore_name/datastore_path:default \
    --volume-store nfs://datastore_name/path_to_share_point:nfs_volume_store_label

The nfs://datastore_name/path_to_share_point:nfs_volume_store_label option above adds an NFS datastore in vSphere as the volume store, which means that when a volume is created on it, it will be a VMDK file that cannot be shared between containers.

To share a volume between containers, add an NFS mount point instead. You need to specify the URL, UID, GID and access protocol.

Note: You cannot specify the root folder of an NFS server as a volume store.

The syntax is as follows - 

--volume-store nfs://datastore_address/path_to_share_point?uid=1234&gid=5678&proto=tcp:nfs_volume_store_label

If you do not specify the UID and GID, the default is 1000. Read more about UID and GID here.

Before adding NFS Volume Store

After adding NFS Volume Store


Two things to note in my example -

1) I am running NFS served by a RHEL VM, and the UID and GID did not work for me.
2) The workaround was to manually change permissions on the test2 folder on the NFS share point by using chmod 777 test2.

Now that the NFS volume store is added, let's create a volume -
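Creating it is a one-liner against the VCH's docker endpoint (a sketch; it assumes your docker client is already pointed at the VCH and reuses the nfs_volume_store_label and My_NFS_Volume names from this example):

docker volume create --opt VolumeStore=nfs_volume_store_label --name My_NFS_Volume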


Next, deploy two containers with My_NFS_Volume  mounted on both.
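Something along these lines (a sketch; busybox and the /mydata mount point are just placeholders for the test):

docker run -d --name container1 -v My_NFS_Volume:/mydata busybox sleep 3600
docker run -d --name container2 -v My_NFS_Volume:/mydata busybox sleep 3600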


To check whether the volume was mounted, run docker inspect containername and you will see the details under Mounts, including the volume name as well as the read/write mode.
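For example, to pull just the Mounts section for container1 from the sketch above:

docker inspect -f '{{ json .Mounts }}' container1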



Now let's create a .txt file from one container and check from the other whether it can be seen.
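With the placeholder names from the sketch above, that boils down to:

docker exec container1 sh -c 'echo "hello from container1" > /mydata/test.txt'
docker exec container2 cat /mydata/test.txt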



This confirms that we can share NFS-backed volumes between containers.

Friday, November 17, 2017

Configure Static IP on PhotonOS

To obtain the name of your Ethernet link run the following command:  networkctl




If this is the first time you are using Photon OS, you will only see the first two links. The others were created because I ran some Docker swarms and created custom network bridges.

The network configuration file is located at -

                   /etc/systemd/network/



You might see the file 10-dhcp-eth0.network. I renamed this file to reflect a static configuration.

You can do this by running the following command -

root@photon [ ~ ]# mv /etc/systemd/network/10-dhcp-eth0.network  /etc/systemd/network/10-static-eth0.network

Use vi editor to edit the file and add your static IP, Gateway, DNS, Domain and NTP.

This is what the file would look like.

root@photon [ ~ ]# cat /etc/systemd/network/10-static-eth0.network
[Match]
Name=eth0  <<<<<<< "Make sure to change this to your adapter. Use networkctl or ip addr to check the adapter name."

[Network]
Address=10.xx.xx.xx/24
Gateway=10.xx.xx.1
DNS=10.xx.xx.xx 10.xx.xx.xx
Domains=na.xx.com
NTP=time.nist.gov

Apply the changes by running -

systemctl restart systemd-networkd

Try to ping out from the OS.

Note: You will not be able to ping this VM, as by default the iptables firewall blocks everything except SSH. In my next blog I will explain how to allow ping through iptables.
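Until then, the rule itself boils down to allowing ICMP echo requests on the INPUT chain (a sketch; note that this is not persistent across reboots unless you also save it in the firewall configuration):

iptables -A INPUT -p icmp --icmp-type echo-request -j ACCEPT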