
Sunday, October 14, 2018

The one with lost access to volumes which turned out to be SAN congestion and Zero buffer-to-buffer credits (Cisco Slow-Drain) nightmare

Does it sound familiar that major issues always seem to happen on a Friday?

This past Friday, we ran into an issue where random ESXi hosts would show up as disconnected in vCenter, with several of their VMs orphaned or disconnected. In addition, some VMs would become unresponsive, the VM console would not work, and RDP sessions would either freeze or get disconnected even though the VM was still pingable.

The host would self-heal after a while and everything would return to normal, except for a few hung VMs that required a reboot to fix.

First stop, vCenter Events -

There were several "lost access to volume due to connectivity issues" messages.




Next stop, VMKernel Logs - 

This is what was found in the VMkernel logs -

2018-10-12T14:11:19.584Z cpu2:390228)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2
2018-10-12T14:11:19.584Z cpu3:390230)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED

2018-10-12T14:11:19.584Z cpu3:390230)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2

Seems like a pretty straightforward error: the problem appears to be with the fnic driver for vmhba2.
But we were already running the latest supported driver from Cisco, and there are no known issues reported for this driver version. There was a known issue in the previous version; here is the Cisco bug report - https://quickview.cloudapps.cisco.com/quickview/bug/CSCux90320

You can check the fnic drivers as follows -
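
From the ESXi shell, the check looks roughly like this; a quick sketch (the vmhba name is from our environment, and the esxcli namespaces assume ESXi 6.x) -

# List the storage adapters and which driver each one uses (vmhba2 -> fnic)
esxcfg-scsidevs -a

# Show the installed fnic driver VIB and its version
esxcli software vib list | grep -i fnic

# Show details of the loaded fnic module, including the version string
esxcli system module get -m fnic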


Next stop was the Cisco Fabric Interconnects - Not a single error reported there.

Back to VMkernel Logs to see if more can be found -

There were numerous SCSI sense code errors. Here is how they looked -

2018-10-12T21:14:26.966Z cpu0:66222)ScsiDeviceIO: 2968: Cmd(0x439d44f24800) 0x89, CmdSN 0x245a3 from world 67160 to dev "naa.514f0c5595a00104" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
2018-10-12T21:23:00.565Z cpu24:66691)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125c from world 67651 to dev "naa.514f0c5595a00002" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-10-12T21:23:00.568Z cpu27:105197)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125d from world 67651 to dev "naa.514f0c5595a00003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
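
If you want to pull these out of a host yourself, a quick grep against the VMkernel log does the trick; a minimal sketch, assuming the default log location -

# Show all SCSI sense-code entries in the current VMkernel log
grep "Valid sense data" /var/log/vmkernel.log

# Or watch new ones arrive in real time
tail -f /var/log/vmkernel.log | grep ScsiDeviceIO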

You can decipher these codes and their meanings here - http://www.t10.org/lists/1spc-lst.htm

I highly recommend this blog, which makes deciphering these codes a piece of cake -

https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

Here is the deciphered version -

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 Cmd(0x439d44f24800) 0x89

  • Host Status [0x0] OK - This status is returned when there is no error on the host side. This is when you will see if there is a status for a Device or Plugin. It is also when you will see Valid sense data instead of Possible sense data.
  • Device Status [0x2] CHECK_CONDITION - This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESX storage stack will send out a SCSI command 0x3 (REQUEST SENSE) in order to get the SCSI sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after Valid sense data in the order of Sense Key, Additional Sense Code, and ASC Qualifier.
  • Plugin Status [0x0] GOOD - No error. (ESXi 5.x / 6.x only)
  • Sense Key [0xE] MISCOMPARE
  • Additional Sense Data 1D/00 - MISCOMPARE DURING VERIFY OPERATION
  • OP Code 0x89 - COMPARE AND WRITE

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0 Cmd(0x439d457db340) 0x85


  • Host Status [0x0] OK - same as above
  • Device Status [0x2] CHECK_CONDITION - same as above
  • Plugin Status [0x0] GOOD - No error. (ESXi 5.x / 6.x only)
  • Sense Key [0x5] ILLEGAL REQUEST
  • Additional Sense Data 20/00 - INVALID COMMAND OPERATION CODE
  • OP Code 0x85 - ATA PASS-THROUGH(16)

This sounds very similar to the vSphere 6.0 ATS miscompare issue. Read more about it here.

We are running vSphere ESXi 6.5 U2 (build 8294253), so this issue does not apply in our case.

So far we had checked the vCenter events, the VMkernel logs, and the Fabric Interconnects. The next stop would be one level higher - the Nexus 5Ks that serve as our Fibre Channel uplinks. In parallel, the storage team was engaged to see if any errors were being seen on the XtremIO array.

Storage Team did not see any errors on the XtremIO Array.

Cisco was engaged to look at the Nexus configuration, and they inspected the FC ports on the Nexus 5Ks. Since there are 8 paths, each one of them had to be inspected on both Nexus A and Nexus B.

Side A looked clean, but something on Side B stood out right away. Upon inspecting one of the FC interfaces, we noticed that the remaining B2B credits were ZERO. Here is how it looks -
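
On NX-OS, the counters come straight out of the interface output; a rough sketch (fc2/3 is just an example port, and the exact counter wording varies a bit between releases) -

nx5k-B# show interface fc2/3 | include B2B

The value to watch is the remaining transmit B2B credit - on a healthy port it sits at the configured credit count, while on our problem port it was stuck at zero.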



Most of us had no idea what these B2B credits were. Here is a quick summary of what they are and what they do -

Buffer credits, also called buffer-to-buffer (B2B) credits, are used as a flow-control mechanism by Fibre Channel and represent the number of frames a port can store.
Each time a port transmits a frame, that port's B2B credit is decremented by one; for each R_RDY received, that port's B2B credit is incremented by one. If the B2B credit reaches zero, the port cannot transmit until an R_RDY is received back.


This is also referred to as the slow-drain issue.

So in our case, the FC port could receive but could not transmit anything because there were no credits left. Whenever an ESXi host tried to send data over this specific path, the storage could not receive it and the data was lost in space. As we know, FC is meant to be a lossless transport, so this data loss would send the ESXi kernel into a panic state.

Next Cisco showed us how to find the loss events - 

Just run this command on the Nexus and you will see all the credit-loss events that have occurred.

Here is how it looks on a switch with no loss - 

nx5k-A# show process creditmon credit-loss-events

        Credit Loss Events: NO

Here is how it looks on the switch with a problem FC port - 


Everything on the physical layer of this port looked OK, so Cisco suspected that the XtremIO storage was having an issue returning R_RDY B2B credits, causing the switch to stop processing traffic on that port.

The temporary solution was to simply disable this port so no traffic would be sent over it, avoiding further damage to the VMs as well as any escalations over the weekend ;-)

EMC was engaged again, and they came back confirming that the Rx Power column was showing a status of Fault on the suspect XtremIO port. This indicated that there may be an improperly seated cable/SFP, that the cable might be bad, or that the SFP on the switch is bad.

We ended up replacing both the SFP and the cable.

This ended our 15-hour-long call, and nobody was called over the weekend!

Wednesday, August 22, 2018

VMware VirtualCenter Operational Dashboard

There is a not-so-well-known feature in vCenter that gives you a lot of details and stats. It's called the VMware VirtualCenter Operational Dashboard.

Browse to the following URL, replacing the vCenter name with your own. It requires authentication.

  https://vCENTER_SERVER_FQDN/vod/index.html
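
If you prefer the command line, something along these lines may also work; a sketch assuming the page accepts basic authentication with an SSO account (if your build redirects to the SSO login page instead, just use a browser) -

curl -k -u 'administrator@vsphere.local' https://vCENTER_SERVER_FQDN/vod/index.html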

On the Home page, you can get detailed stats about - 
  • vCenter Uptime
  • Virtual Machine & Host Operations (invocations/min)
  • Client Communication 
  • Agent Communication
There are 6 detail pages available.

Example - If you click on "Host Status", you will get detailed info about the hostname, IPs, MOID, last heartbeat time, etc.



Monday, June 25, 2018

The case of a forgotten Site Recovery Manager (SRM) DB Admin password

While getting ready to upgrade our vCenter from 6.0 to 6.5 and SRM from 6.1.1 to 6.5, we faced an issue where the SRM embedded DB admin password was not documented.

Follow these steps to reset the forgotten password - 

1) You will need to edit the pg_hba.conf file, which by default is located under -

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data\pg_hba.conf

We had our install on the E:\ drive, but the data folder was nowhere to be found there.

So I searched for the pg_hba.conf file and found it under - 

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data

2) Make a backup of pg_hba.conf

3) Stop service -
       Display Name -  VMware vCenter Site Recovery Manager Embedded Database
       Service Name - vmware-dr-vpostgres

4) Edit pg_hba.conf in WordPad (or any other text editor).

5) Locate the following in the file - 

             # TYPE DATABASE USER ADDRESS METHOD
             # IPv4 local connections:
                 host all all 127.0.0.1/32 md5
             # IPv6 local connections:
                 host all all ::1/128 md5

6) Replace md5 with trust so the changes look like the following - 

             # TYPE DATABASE USER ADDRESS METHOD
             # IPv4 local connections:
                 host all all 127.0.0.1/32 trust
             # IPv6 local connections:
                 host all all ::1/128 trust

7) Save the file and start VMware vCenter Site Recovery Manager Embedded Database service.

8) Open a command prompt as Administrator and navigate to - 
     E:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin
  Note: You might have installed on a different drive. 

9) Connect to the postgres database using - 
        psql -U postgres -p 5678 
     This will bring you to a prompt - postgres=#
     Note: 5678 is the default port. If you chose a different port during installation, replace it accordingly. If unsure, open ODBC and check under System DSN.

10) Run the following to change the password - 
       ALTER USER "enter srm db user here" PASSWORD 'new_password';
     Note: srm db user can be found under ODBC - System DSN. new_password should be in single quotes.
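
     For example, if the System DSN shows dbaadmin as the SRM DB user (as it does later in this post), the command would look like this - the password is just a placeholder -
        ALTER USER "dbaadmin" PASSWORD 'MyN3wPassw0rd';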

11) If the command is successful, you will see the following output - ALTER ROLE


12) Your password has now been reset, and you can now take a backup of the database.
       Note: You cannot take a backup without resetting the password, because it will prompt you for a password 😊

13) To take a backup, run the following - 
       pg_dump.exe -Fc --host 127.0.0.1 --port 5678 --username=dbaadmin srm_db > e:\destination_location
 

14) Revert the changes made to the pg_hba.conf file (replace trust with md5, or just restore the copy of the file you backed up earlier).

15) Restart the VMware vCenter Site Recovery Manager Embedded Database service.

16) Go to Add/Remove Programs, select VMware vCenter Site Recovery Manager, and click Change.


17) Be patient, it takes a while to load. On the next screen, select Modify and click Next.

18) Enter the PSC address and the username/password.


19) Accept the Certificate

20) Enter information to register the SRM extension

21) Next, choose whether to create a new certificate or use the existing one. Mine was still valid, so I used the existing one.

22) And finally, enter the new password you set in Step 10. 

23) Once the setup is complete, make sure the VMware vCenter Site Recovery Manager Server service has started. If not, start it manually. 

Reference: A quick Google search brought up the blog below, and VMware support (we also had a case open with them) suggested following the exact same steps - http://www.virtuallypeculiar.com/2018/01/resetting-site-recovery-managers.html

Thursday, March 15, 2018

VMware Stack Upgrade

I’m working on a plan to upgrade our existing VMware stack and wanted to write a detailed post about all the components involved and the order of the upgrade.

Current State –

·        Primary and Recovery site with XtremIO Arrays on both ends with physical RecoverPoint appliances for array based replication.
·        vCenter Servers on both sites (with embedded Platform Services Controller) running on Windows Server 2012 R2 with an external SQL Database.
·         ESXi 6.0.0, 3825889
·         Site Recovery Manager (SRM) 6.1.1.13825
·         Storage Replication Adapter (SRA) 2.2.0.3
·         XtremIO (XIOS) 4.0.15-24, XMS 4.2.1
·         RecoverPoint (RPA) 4.4.1
·         Avamar 7.3

Future State –
·         vCenter Server Appliance (VCSA) 6.5 U1f
·         ESXi 6.5, 7388607
·         SRM 6.5.1, 6014840
·         SRA 2.2.0.3
·         XIOS 6.0.1, XMS 6.0.1
·         RecoverPoint (RPA) 5.1
·         Avamar 7.5

There are numerous interdependencies involved in getting to the future state.

·         Avamar 7.3 is not compatible with vCenter 6.5
·         RPA 4.4.1 is not compatible with ESXi 6.5
·         Upgrading SRM from 6.0.x to 6.5 is not supported. You have to upgrade to 6.1.x before you upgrade to 6.5 (in my case that’s not required since we are already running 6.1.1.x)
·         Any vCenter upgrade will break SRM until both sides are on the same level. I will keep array-based replication active in case a disaster strikes while SRM is being upgraded.

Prerequisites –

·         Backup of vCenter Database on primary and recovery site.
·         Backup of SRM vPostgres Database on primary and recovery site. (Explained in detail below)
·         Primary and Recovery Site Platform Services Controller and vCenter server instances must be running.

Order of Upgrade –

·         Upgrade Avamar to 7.5
·         RPA 4.4.1 is not compatible with ESXi 6.5 hence upgrade that to RPA 5.1
·         Upgrade vCenter Server to VCSA 6.5 GA at primary site.
·         Upgrade SRM to 6.5 at primary site. Note: SRM cannot be upgraded directly from 6.1.1 to 6.5.1, hence the two-step approach.
·         Upgrade SRA at primary site – Not required since running on latest
·         Upgrade vCenter Server to VCSA 6.5 GA at recovery site.
·         Upgrade SRM to 6.5 at recovery site.
·         Upgrade SRA at recovery site – Not required since running on latest.
·         Upgrade vCenter from 6.5 GA to 6.5U1g at primary site.
·         Upgrade SRM from 6.5 to 6.5.1 at primary site.
·         Upgrade vCenter Server to VCSA 6.5U1g at recovery site.
·         Upgrade SRM from 6.5 to 6.5.1 at recovery site.
·         Verify the connection between the SRM sites. Verify that protection groups and recovery plans are valid.
·         Upgrade ESXi to 6.5, 7388607 at recovery site.
·         Upgrade ESXi to 6.5, 7388607 at primary site.
·         Upgrade virtual hardware and then VMware Tools on virtual machines – can be scheduled during the next available outage window.
·         Upgrade XIOS and XMS to 6.0.1

Backup & Restore (if required) the SRM Embedded vPostgres Database -

1)      Log into the system on which you installed Site Recovery Manager Server.
2)      Stop the Site Recovery Manager service.
3)      Navigate to the folder that contains the vPostgres commands.
4)      If you installed Site Recovery Manager Server in the default location, you will find the vPostgres commands in C:\Program Files\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin.
5)      Create a backup of the embedded vPostgres database by using the pg_dump command.
pg_dump -Fc --host 127.0.0.1 --port port_number --username=db_username srm_db > srm_backup_name. To create a backup you need the DB admin password, which we did not have documented. Here is a link on how to reset it - http://virtuallycurious.blogspot.com/2018/06/the-case-of-forgotten-site-recovery.html
You set the port number, username, and password for the embedded vPostgres database when you installed Site Recovery Manager. The default port number is 5678. The database name is srm_db and cannot be changed.
6)      Start the Site Recovery Manager service.
7)      Restore (if things go south) by using the pg_restore command
pg_restore -Fc --host 127.0.0.1 --port port_number --username=db_username --dbname=srm_db srm_backup_name
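
Filled in with the defaults mentioned above (port 5678, database srm_db) and the dbaadmin user from our environment, the two commands look roughly like this - the backup path and file name are just examples -

pg_dump -Fc --host 127.0.0.1 --port 5678 --username=dbaadmin srm_db > E:\Backups\srm_backup.dump
pg_restore -Fc --host 127.0.0.1 --port 5678 --username=dbaadmin --dbname=srm_db E:\Backups\srm_backup.dump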

References –

·         VMware Product Interoperability Matrices - http://partnerweb.vmware.com/comp_guide2/sim/interop_matrix.php
·         Update sequence for vSphere 6.5 and its compatible VMware products (2147289) - https://kb.vmware.com/s/article/2147289
·         Backup and Restore the embedded vPostgres Database - https://docs.vmware.com/en/Site-Recovery-Manager/6.5/srm-install-config-6-5.pdf
·        EMC Recoverpoint SRA compatibility Matrix - https://www.vmware.com/resources/compatibility/detail.php?deviceCategory=sra&productid=39129
·        Compatibility Matrix for SRM 6.5 - https://www.vmware.com/support/srm/srm-compat-matrix-6-5.html