Wednesday, December 12, 2018

VIC appliance disk space is filled by Registry blobs / Cleanup VIC Registry

We have been using vSphere Integrated Containers for almost a year now, and whenever the developers have trouble pushing or pulling images from the private registry, it's usually because the disk is full.

So we just add more space to the appliance and reboot it to expand the filesystem.

Here is how it is done - 

Log in to the VIC appliance and run df -h

You will see the filesystem /storage/data is almost full 


Go to the vSphere Console and expand the drive to the desired value. 

Note: You can hot-add disk space.

The VIC appliance has 4 virtual disks attached to it. Refer to this table to find out which disk to expand.


This is a quick and dirty solution, but in our case we kept adding space and went well over 1 TB. It was time to clean up and put a permanent fix in place.

Harbor is the open source registry that runs in the VIC appliance. Behind the scenes it is essentially the open source Docker Distribution registry with added functionality such as security, identity and management.

As per the VIC documentation, you can simply enable garbage collection and it should take effect after you reboot the appliance. Unfortunately, this did not work for us.



Here is how I got around this issue and ran the garbage collection manually -

If you have not noticed yet, within the VIC appliance all the components run as individual containers (pretty cool).

SSH to the VIC appliance and run the simple command 

                       docker ps -a 


You will notice that all services are running as individual containers. 

We are going to work with the Harbor registry container (named registry in the output above).

The garbage collector is what actually deletes the leftover blobs from deleted image tags.

If you are familiar with the docker exec command, we will use it against the registry container to see which blobs can be marked for deletion.

Update: If you have upgraded to VIC 1.5, you need to su to the harbor user.

Older VIC versions - docker exec registry bin/registry garbage-collect --dry-run  /etc/registry/config.yml

VIC 1.5 and above - docker exec registry su -c "registry garbage-collect --dry-run  /etc/registry/config.yml" harbor

This will not actually delete anything, but it will output all the blobs that can be marked for deletion.


To mark these blobs for deletion, run the above command without the --dry-run flag - 

Older VIC versions - docker exec registry bin/registry garbage-collect /etc/registry/config.yml

VIC 1.5 and above - docker exec registry su -c "registry garbage-collect /etc/registry/config.yml" harbor


Eligible blobs will be marked for deletion. 

Next, just restart the harbor service and the blobs will actually be deleted from the disk.

systemctl restart harbor

If that still does not free up space, reboot the VIC appliance, SSH back in and check again with df -h.
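Putting it all together, here is a minimal sketch of the whole cleanup pass (assuming VIC 1.5 or later and the container name registry from the docker ps -a output above) -

                       docker exec registry su -c "registry garbage-collect --dry-run /etc/registry/config.yml" harbor    # preview what would be removed
                       docker exec registry su -c "registry garbage-collect /etc/registry/config.yml" harbor              # mark the eligible blobs
                       systemctl restart harbor        # restart so the space is actually reclaimed
                       df -h /storage/data             # confirm the usage has dropped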

Happy days !

Sunday, October 14, 2018

The one with lost access to volumes which turned out to be SAN congestion and Zero buffer-to-buffer credits (Cisco Slow-Drain) nightmare

Does it sound familiar that all major issues always happen on a Friday?

This past Friday, we ran into an issue with random ESXi hosts where the host would show as disconnected in vCenter and several VMs on it would show as orphaned or disconnected. In addition, some VMs would be unresponsive, the VM console would not work, and RDP sessions would either freeze or get disconnected even though the VM was still pingable.

The host would self-heal after a while and everything would return to normal, except for a few hung VMs that required a reboot to fix them.

First stop, vCenter Events -

There were several "lost access to volume due to connectivity issues" messages.




Next stop, VMKernel Logs - 

This is what was found in the VMkernel logs -

2018-10-12T14:11:19.584Z cpu2:390228)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2
2018-10-12T14:11:19.584Z cpu3:390230)<7>fnic : 2 :: Returning from abort cmd type 2 FAILED

2018-10-12T14:11:19.584Z cpu3:390230)WARNING: LinScsi: SCSILinuxAbortCommands:1909: Failed, Driver fnic, for vmhba2

Seems like a pretty straightforward error: the problem appears to be with the fnic driver for vmhba2.
But we are already running the latest supported driver from Cisco, and there are no known issues reported for this version. There was a known issue in the previous version; here is the Cisco bug report - https://quickview.cloudapps.cisco.com/quickview/bug/CSCux90320

You can check the installed fnic driver version from the ESXi shell as follows -
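For example, these two standard commands show the installed fnic VIB and the loaded module version (a rough sketch - the exact output varies by build) -

                       esxcli software vib list | grep -i fnic      # installed fnic driver VIB and its version
                       vmkload_mod -s fnic | grep -i version        # version of the fnic module currently loaded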


Next stop was the Cisco Fabric Interconnects - not a single error was reported there.

Back to VMkernel Logs to see if more can be found -

There were numerous SCSI sense code errors. Here is how they looked -

2018-10-12T21:14:26.966Z cpu0:66222)ScsiDeviceIO: 2968: Cmd(0x439d44f24800) 0x89, CmdSN 0x245a3 from world 67160 to dev "naa.514f0c5595a00104" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0.
2018-10-12T21:23:00.565Z cpu24:66691)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125c from world 67651 to dev "naa.514f0c5595a00002" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.
2018-10-12T21:23:00.568Z cpu27:105197)ScsiDeviceIO: 2954: Cmd(0x439d457db340) 0x85, CmdSN 0x125d from world 67651 to dev "naa.514f0c5595a00003" failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0.

You can decipher these codes and their meanings here - http://www.t10.org/lists/1spc-lst.htm

I highly recommend this blog, which makes it a piece of cake to decipher these codes -

https://www.virten.net/vmware/esxi-scsi-sense-code-decoder/

Here is the deciphered version -

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0xe 0x1d 0x0 Cmd(0x439d44f24800) 0x89

Host Status [0x0] - OK: This status is returned when there is no error on the host side. This is when you will see a status for the Device or Plugin, and Valid sense data instead of Possible sense data.
Device Status [0x2] - CHECK_CONDITION: This status is returned when a command fails for a specific reason. When a CHECK CONDITION is received, the ESXi storage stack sends SCSI command 0x3 (REQUEST SENSE) to retrieve the sense data (Sense Key, Additional Sense Code, ASC Qualifier, and other bits). The sense data is listed after "Valid sense data" in the order Sense Key, Additional Sense Code, ASC Qualifier.
Plugin Status [0x0] - GOOD: No error. (ESXi 5.x / 6.x only)
Sense Key [0xE] - MISCOMPARE
Additional Sense Data 1D/00 - MISCOMPARE DURING VERIFY OPERATION
OP Code 0x89 - COMPARE AND WRITE

failed H:0x0 D:0x2 P:0x0 Valid sense data: 0x5 0x20 0x0 Cmd(0x439d457db340) 0x85


Host Status [0x0] - OK, Device Status [0x2] - CHECK_CONDITION, Plugin Status [0x0] - GOOD: same as above.
Sense Key [0x5] - ILLEGAL REQUEST
Additional Sense Data 20/00 - INVALID COMMAND OPERATION CODE
OP Code 0x85 - ATA PASS-THROUGH(16)

This sounds very similar to the vSphere 6.0 ATS miscompare issue. Read more about it here.

We are running vSphere ESXi 6.5 U2 (build 8294253), so this is not applicable in our case.

So far we had checked the vCenter events, the VMkernel logs and the Fabric Interconnects. The next stop was one level higher - the Nexus 5Ks that serve as our Fibre Channel uplinks. In parallel, the storage team was engaged to check for errors on the XtremIO array.

Storage Team did not see any errors on the XtremIO Array.

Cisco was brought in to look at the Nexus configuration and inspect the FC ports on the Nexus 5Ks. Since there are 8 paths, each one had to be inspected on both Nexus A and Nexus B.

Side A looked clean, but something on Side B stood out immediately. Upon inspecting one of the FC interfaces, we noticed that the remaining B2B credits were ZERO. Here is how it looks -



Most of us had no idea what these B2B credits were. Here is a quick summary of what they mean and what they do -

Buffer credits, also called buffer-to-buffer (B2B) credits, are a flow control mechanism in Fibre Channel and represent the number of frames a port can store.
Each time a port transmits a frame, that port's BB credit is decremented by one; for each R_RDY received, that port's BB credit is incremented by one. If the BB credit reaches zero, the port cannot transmit until an R_RDY is received back.
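For reference, the credit state of an individual FC interface can be pulled straight from the NX-OS CLI - a rough sketch, with a made-up interface number -

        nx5k-B# show interface fc2/1 bbcredit

This shows the configured transmit/receive B2B credits for the port along with how many remain.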


Running a port's credits down to zero like this is also referred to as the slow-drain issue.

So in our case, the FC port could receive but could not transmit anything because there were no credits left. Whenever an ESXi host sent data down this specific path, the storage could not receive it and the frames were effectively lost. Fibre Channel is designed to be lossless, so this loss is what sent the ESXi storage stack into the frantic abort-and-retry state we saw in the logs.

Next Cisco showed us how to find the loss events - 

Just run this command on the Nexus and you will see all the credit-loss events that have occurred.

Here is how it looks on a switch with no loss - 

nx5k-A# show process creditmon credit-loss-events

        Credit Loss Events: NO

Here is how it looks on the switch with a problem FC port - 


Everything at the physical layer of this port looked OK, so Cisco suspected that the XtremIO array was having an issue returning R_RDY B2B credits, causing the switch to stop processing traffic on that port.

The temporary solution was to simply disable this port so no traffic would be sent down that path, avoiding further damage to the VMs as well as any escalations over the weekend ;-)

EMC was engaged again and came back confirming that the Rx Power column showed a status of Fault on the suspected XtremIO port. This indicated that the cable/SFP might be improperly seated, the cable might be bad, or the SFP on the switch side might be bad.

We ended up replacing both the SFP and the cable.

This ended our 15-hour-long call, and nobody got called over the weekend!

Saturday, September 8, 2018

Provide Secure Remote access to on-premises applications using Azure Active Directory Application Proxy

I had the pleasure of attending an Azure Active Directory overview class last week.

I learnt some really cool stuff, and one of the features that stood out was the Application Proxy.

Allowing access to internal (on-premises) applications has always involved a lot of moving parts - VPN, DMZ, firewall rules, port numbers, etc. On top of that comes the worry: is the application really secure? By allowing access from the internet, are we creating a backdoor into the organization?

That's where Azure AD Application Proxy comes into the picture - a modern way of letting your employees access internal applications. In short, remote access as a service.

So I thought, let's try this with our lab vSphere Web Client. Wouldn't it be cool to access the whole infrastructure without a VPN and without having a proxy sitting in the DMZ forwarding ports?

Azure AD Application Proxy supports the following types of applications -
  • Web applications that use Integrated Windows Authentication for authentication
  • Web applications that use form-based or header-based access
  • Web APIs that you want to expose to rich applications on different devices
  • Applications hosted behind a Remote Desktop Gateway
  • Rich client apps that are integrated with the Active Directory Authentication Library (ADAL)
To find out how the Application Proxy works, please refer to the Microsoft documentation - https://docs.microsoft.com/en-us/azure/active-directory/manage-apps/application-proxy


Ports - you only need 80 and 443 open for outbound traffic.


Now that the prerequisites are taken care of, let's start publishing our application - in this case, the vSphere Web Client.

Log in to the Azure portal as an administrator - portal.azure.com

Select Azure Active Directory > Enterprise applications > New application



Select On-premises application


Next, provide the following information and click Add.


Note: Make sure you set Translate URLs in Application Body to YES if your external name is different from the internal name.

I initially had this option set to NO. After publishing the application, I could get to the vSphere login screen, but after entering credentials it would redirect me to the internal name (internal URL). Because there is no VPN involved, I could not resolve the internal name and would get a DNS error.

Setting this option to YES does the following: after authentication, when the proxy server passes the application data to the user, Application Proxy scans the application for hardcoded links and replaces them with their respective published external URLs.

The other way of doing this is to use a custom domain name that matches your internal domain. To access your application using a custom domain, you must configure a CNAME record with your DNS provider that points your internal URL to your external URL.

Example - Your internal URL is https://vcenter_server.contoso.com. Configure a CNAME entry in your DNS which points "vcenter_server.contoso.com" to "vcenter_server.msappproxy.net"
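Once the CNAME record is in place, a quick lookup from any client should confirm that it resolves to the App Proxy endpoint (the hostnames below are just the example ones from above) -

        nslookup -type=CNAME vcenter_server.contoso.com

The answer should point at vcenter_server.msappproxy.net.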

Next, select Users and groups, click Add user, and grant access to the appropriate users.


To add additional security we will enable Conditional Access with Multi-factor Authentication. 

To do this click on Conditional Access and create a new Policy


  • Enter a Name for your Policy 
  • Select the applicable Users
  • Select the newly published app by clicking on Cloud Apps 
  • Conditions - Select Browser and Mobile Apps and Desktop clients & Modern Auth clients.

  • Access Control - Grant - Select Grant Access and check Require MFA
  • Enable the Policy and Save it.
There are two ways to access this published app. 
  1. Go to myapps.microsoft.com and you will see the published app (only visible to the user that was granted access) OR
  2. Just browse to the external URL specified when you created the application.
If you try 2) then you will first be redirected to https://login.microsoftonline.com. Enter your credentials and you will be asked to approve the request on your phone through MFA.

If you have the MFA application set up on your phone, you will get an approval request there; if you don't have MFA set up yet, it will walk you through setting it up by scanning a QR code.

That's all Folks !!

Sunday, August 26, 2018

Curious Case of Network Latency

It all started with a colleague noticing ping response times higher than usual on a RHEL VM. The normal value is considered to be below 0.5 ms; we were seeing values of up to 8-10 ms.


We were comparing these values with another VM on the same subnet.

So I tried the basic troubleshooting -

  1. Reboot VM.
  2. Move VM to another Host.
  3. Move VM to same Host as the normal VM.
  4. Remove network adapter and add new adapter and reconfigure the IP.
  5. Use a new IP.
  6. Block port on the Port Group and unblock it.
  7. The network team was involved.
  8. Clear the ARP - this is an internal joke ;-) 
  9. Network team tried to ping from the Nexus 5K. Same response time to this specific VM.
  10. Traced MACs to make sure there are no duplicates, traced vNICs on UCS and vmnics on the ESXi Host.
  11. Decided to contact Cisco, VMware, RHEL, etc.
We often miss the little details and always think it's a bigger issue :)

So I decided to start from the little details.

The first step was to download the .vmx files for the problem VM and for the VM that was responding normally.

Using Notepad++, I compared the VMX files line by line. As expected, there were a lot of differences - virtual hardware version, number of drives, CPU, memory, UUID, etc. - but one stood right out at me: CPU latency sensitivity.


The CPU Latency sensitivity was set to "low" on the problem VM.
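You do not really need Notepad++ for this - a quick grep of the two downloaded .vmx files (the filenames below are placeholders) surfaces the parameter right away -

        grep -i "sched.cpu.latencysensitivity" problem_vm.vmx healthy_vm.vmx

On the problem VM this returned sched.cpu.latencySensitivity = "low".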

Hmmm, why would anyone change the CPU latency sensitivity setting? And to "low" of all things? If anything, you would set it to "high" for a latency-sensitive workload, never to "low". Obviously whoever changed this (by accident or deliberately) did not know what they were doing.

CPU latency sensitivity was first introduced in vSphere 5.5, and setting it to high comes with a few caveats -
  1. It requires 100% of the allocated memory to be reserved.
  2. vCPUs are given exclusive access to physical CPUs (pCPUs).
  3. Network frames are not coalesced when it is enabled.
Read more about this feature here

So back to our problem, how and where do we change this setting? 

There are two ways to do it - good ol' PowerCLI or the vSphere Web Client.

Note: You can change the setting while the VM is powered ON but it will take effect on the next reboot.

PowerCLI:

Here is a one-liner to find out if any other VMs in your environment have this advanced setting changed -

Get-VM * | Get-AdvancedSetting -Name sched.cpu.latencysensitivity | ?{$_.Value -eq 'low' -or $_.Value -eq 'High' -or $_.Value -eq 'Medium'} | select Entity, Value | ft -AutoSize

And here is how to change it to the desired value -

Get-VM vm_name | Get-AdvancedSetting -Name sched.cpu.latencysensitivity | Set-AdvancedSetting -Value Normal

Note: With PowerCLI, the setting is changed in the VMX file but will not show in the GUI until you reboot the VM.

vSphere Web Client: 

This setting can be found under Edit Settings > VM Options > Advanced.


Once these changes were in place and the VM rebooted, the ping response time was back to below 0.5 ms.

Happy days !!

Wednesday, August 22, 2018

VMware VirtualCenter Operational Dashboard

There is a not-so-well-known feature in vCenter that gives you a lot of details and stats. It's called the VMware VirtualCenter Operational Dashboard.

Browse to the following URL, replacing the vCenter name with your own. It requires authentication.

  https://vCENTER_SERVER_FQDN/vod/index.html

On the Home page, you can get detailed stats about - 
  • vCenter Uptime
  • Virtual Machine & Host Operations (invocations/min)
  • Client Communication 
  • Agent Communication
There are 6 detail pages available.

Example - if you click on "Host Status", you will get detailed info such as hostname, IPs, MOID, last heartbeat time, etc.



Monday, June 25, 2018

The case of a forgotten Site Recovery Manager (SRM) DB Admin password

While getting ready to upgrade our vCenter from 6.0 to 6.5 and SRM from 6.1.1 to 6.5, we ran into an issue where the SRM embedded DB admin password was not documented.

Follow these steps to reset the forgotten password -

1) You will need to edit the pg_hba.conf file under -

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data\pg_hba.conf

We had our install on the E:\ drive, but the data folder was nowhere to be found there.

So I searched for the pg_hba.conf file and found it under -

C:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\data

2) Make a backup of pg_hba.conf

3) Stop the service -
       Display Name -  VMware vCenter Site Recovery Manager Embedded Database
       Service Name - vmware-dr-vpostgres

4) Edit pg_hba.conf in WordPad.

5) Locate the following in the file - 

             # TYPE DATABASE USER ADDRESS METHOD
             # IPv4 local connections:
                 host all all 127.0.0.1/32 md5
             # IPv6 local connections:
                 host all all ::1/128 md5

6) Replace md5 with trust so the changes look like the following - 

             # TYPE DATABASE USER ADDRESS METHOD
             # IPv4 local connections:
                 host all all 127.0.0.1/32 trust
             # IPv6 local connections:
                 host all all ::1/128 trust

7) Save the file and start VMware vCenter Site Recovery Manager Embedded Database service.

8) Open a command prompt as Administrator and navigate to -
     E:\ProgramData\VMware\VMware vCenter Site Recovery Manager Embedded Database\bin
  Note: Your install might be on a different drive.

9) Connect to the postgres database using -
        psql -U postgres -p 5678
     This will bring you to the prompt - postgres=#
     Note: 5678 is the default port. If you chose a different port during installation, replace it accordingly. If unsure, open ODBC and check under System DSN.

10) Run the following to change the password - 
       ALTER USER "enter srm db user here" PASSWORD 'new_password';
     Note: The SRM DB user can be found under ODBC - System DSN. The new password must be enclosed in single quotes.

11) If the command is successful, you will see the following output - ALTER ROLE
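For reference, the whole exchange from steps 9 through 11 looks roughly like this (the DB user name and password below are made-up placeholders - use the user from your ODBC System DSN) -

       psql -U postgres -p 5678
       postgres=# ALTER USER "srm_db_user" PASSWORD 'N3w_P@ssw0rd';
       ALTER ROLE
       postgres=# \q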


12) Your password has now been reset. Now you can take a backup of the database.
       Note: You cannot take a backup without first resetting the password, because pg_dump will prompt you for it 😊

13) To take a backup run the following - 
       pg_dump.exe -Fc --host 127.0.0.1 --port 5678 --username=dbaadmin srm_db > e:\destination_location
 

14) Revert the changes made to the pg_hba.conf file (replace trust with md5, or just restore the file you backed up earlier).

15) Restart the VMware vCenter Site Recovery Manager Embedded Database service.

16) Go to Add/Remove Programs, select VMware vCenter Site Recovery Manager and click Change.


17) Be patient, it takes a while to load. On the next screen, select Modify and click Next.

18) Enter PSC address and the username/password.


19) Accept the Certificate

20) Enter information to register the SRM extension

21) Next, choose whether to create a new certificate or use the existing one. Mine was still valid, so I used the existing one.

22) Finally, enter the new password that you set in step 10.

23) Once the setup is complete, make sure the VMware vCenter Site Recovery Manager Server service has started. If not, start it manually.

Reference: A quick Google search brought up the following blog, and the VMware support case we had open suggested following the exact same steps - http://www.virtuallypeculiar.com/2018/01/resetting-site-recovery-managers.html

Tuesday, May 8, 2018

docker: Error response from daemon: Server error from portlayer

docker: Error response from daemon: Server error from portlayer:

I got this error while running a new container in VIC.


I tried to move the VCH to a different ESXi Host, but no luck.

It is not advisable to reboot the VCH from vCenter, so I skipped that.

A search on the internet turned up a similar error, but with a different cause - https://github.com/vmware/vic/issues/5782

This did not get me anywhere, so I started looking at things from a higher level, and this is what I noticed.


It turned out I had upgraded VIC from 1.2.1 to 1.3 the previous weekend and completely forgot to upgrade this one last VCH.

As you can see in the image above, the VCH I was trying to run a container against was still running 1.2.1. Duh!
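If you are not sure of a VCH's id (or the target details the upgrade command also expects), vic-machine can list the deployed VCHs for you - a rough sketch with placeholder values -

./vic-machine-linux ls --target vcenter.example.com --user administrator@vsphere.local --thumbprint <vc-thumbprint>

The id from that listing is what goes into --id below.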

So I upgraded the VCH by running the following command -

./vic-machine-linux upgrade --id xxxxxx

And here is the result -


And ta-da I was able to run the new nginx container I was trying to deploy earlier.