How to HA Citrix PVS TFTP services via NetScaler 10.1 using RNAT

I was recently working on a project to migrate a pair of NetScalers from firmware 9.3 to a new set of MPX appliances running 10.1: Build 122.17.nc.

I was very pleased to learn that Citrix added a native way to load balance TFTP traffic in NetScaler 10.1, primarily to HA PVS TFTP traffic.  You can read how to do this in this post by Adam Gamble.

Be cautious! Citrix confirmed that the native TFTP load balancing feature has an issue on 10.1: when TFTP traffic is pushed through the NetScaler, the Packet Processing Engine can crash, which will cause your NetScaler to reboot and, in some cases, corrupt the NetScaler config file (ns.conf).  It has been fixed in 10.1 build 123.x and later, and Citrix's suggestion is to upgrade to the latest 10.1 build (128.8).  I have not tested with 10.5 yet.

Below is a way to get the darn TFTP process working via UDP load balancing, using an RNAT and USIP mode.

Environment

  • MPX 8200 NetScaler 10.1: Build 122.17.nc

Solution:

In my case I created a new subnet for this.  The reason is that, as you may already know, the source IP of load-balanced traffic is usually your SNIP; however, when using an RNAT, the source IP will be the configured RNAT IP and not the SNIP.  So whether you dedicate a subnet is up to you.  Where I work now, source IP is a big deal.

For this example, let's use a 172.16.88.0/24 subnet.

1. Create a new NetScaler SNIP for the new subnet 172.16.88.0/24 under NetScaler > System > Network > IPV4s.

Then create your VIP that you will use for your DHCP Option 066 reservation.

SNIP: 172.16.88.99 

VIP: 172.16.88.35

TFTP SNIP
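If you prefer the CLI over the GUI, the SNIP can be created with a single command (a sketch using the example addressing above):

add ns ip 172.16.88.99 255.255.255.0 -type SNIP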

2. Let's create a new VLAN 999 (your VLAN tag) for the 172.16.88.0/24 subnet (yes, you can shrink the subnet size; I'm just using this as an example :) ) and, in my case, tag it on interface 1/1 to save some ports.  You can do this under NetScaler > System > Network > VLANs (interface 1/1 is set up as a full trunk).

VLAN TFTP
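The CLI equivalent would look roughly like this (a sketch; VLAN 999 and interface 1/1 are just my example values, and binding the SNIP's subnet is what places the new network on the VLAN):

add vlan 999
bind vlan 999 -ifnum 1/1 -tagged
bind vlan 999 -IPAddress 172.16.88.99 255.255.255.0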

3. To be safe, I like to ensure a new DIRECT route is created for the new subnet so it utilizes the new SNIP.  This is under NetScaler > System > Network > Routes > Basic.

In this case it would be Network: 172.16.88.0, Netmask: 255.255.255.0, Gateway (your SNIP): 172.16.88.99

TFTP Route SNIP
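From the CLI, the same direct route would be (a sketch with the example values):

add route 172.16.88.0 255.255.255.0 172.16.88.99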

 

4. Create the RNAT under NetScaler > System > Network > Routes > RNAT

set ns rnat <Host/Subnet IP> <Host/Subnet Mask> -natip <VIP IP>

In this case: Network: 172.16.88.0, Netmask: 255.255.255.0, NAT IP (the VIP that will load balance the TFTP process): 172.16.88.35

TFTP RNAT
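With the example values plugged into the syntax above, the command becomes (a sketch):

set ns rnat 172.16.88.0 255.255.255.0 -natip 172.16.88.35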

5. OK, the hard part is done.  Let's now create the TFTP servers under NetScaler > Traffic Management > Load Balancing > Servers

TFTP01 = 172.16.88.212

TFTP02 = 172.16.88.213

TFTP Servers
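CLI equivalent (the server names are simply the ones used in this example):

add server TFTP01 172.16.88.212
add server TFTP02 172.16.88.213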

6. Create your service group under NetScaler > Traffic Management > Load Balancing > Service Groups. Under Advanced, ensure "Use Proxy Port" is set to "No" and "USIP" mode is enabled.  If these are not set, your TFTP traffic will not function.

TFTP Service Group

 

TFTP USIP PROXY NO
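From the CLI, a UDP service group with USIP on and proxy port off would look roughly like this (a sketch; the service group name is my own, and 69 is the standard TFTP UDP port):

add serviceGroup svcgrp-pvs-tftp UDP -usip YES -useproxyport NO
bind serviceGroup svcgrp-pvs-tftp TFTP01 69
bind serviceGroup svcgrp-pvs-tftp TFTP02 69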

7. Create VIP (172.16.88.35) under NetScaler > Traffic Management > Load Balancing > Virtual Servers

Bind the service group created previously.

TFTP Service bind
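CLI sketch (the virtual server name is my own; it reuses the service group from the previous step):

add lb vserver vsrv-pvs-tftp UDP 172.16.88.35 69
bind lb vserver vsrv-pvs-tftp svcgrp-pvs-tftp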

8. OK, no more NetScaler work.  Since we are using "Use Source IP" (USIP), we need to set the default gateway on the TFTP servers (TFTP01/02) to the NetScaler SNIP, 172.16.88.99.
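If you want to script that gateway change, something like the following works on 2003/2008 (a sketch; the connection name and addressing are assumptions, and most people will simply set it in the NIC's TCP/IP properties):

netsh interface ip set address "Local Area Connection" static 172.16.88.212 255.255.255.0 172.16.88.99 1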

9. Lastly, change your DHCP scope to set the Boot Server Host Name (option 066) to your VIP, 172.16.88.35.
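If your DHCP runs on Windows, the option can also be set from the command line (a sketch; replace the scope address with your own and run it on the DHCP server):

netsh dhcp server scope 172.16.88.0 set optionvalue 066 STRING "172.16.88.35"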

You are all set; now to update to 10.5, I guess 😛

TFTP Boot

PVS Gold image WMI issues fix

As part of a PVS image issue discovery project, I determined that WMI was not working on several gold images, which was causing memory leaks as well as Event Viewer errors just about every 30 seconds.

Problems escalated when XenApp hosts would completely run out of virtual memory, which would end up affecting the overall user experience.

Environment:

  • Windows 2003 SP3
  • Citrix XenApp 5.0
  • PVS 6.1.16

Issue:

WMI was not working on several golden images.  My guess is the image was copied in a broken state and then replicated to many different images.

Fix:

Run the following from a command prompt:

  • Regsvr32 %SystemRoot%\System32\wbem\wmidcprv.dll
  • cd /d %windir%\system32\wbem
  • for %i in (*.dll) do RegSvr32 -s %i
  • for %i in (*.exe) do %i /RegServer

The Windows Management Instrumentation Tester window may appear; this is normal and you can go ahead and close it.
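If you would rather drop these commands into a .cmd file (handy when fixing several gold images), remember that the for-loop variable must be doubled inside a batch file (%%i instead of %i). A minimal sketch:

@echo off
regsvr32 /s %SystemRoot%\System32\wbem\wmidcprv.dll
cd /d %windir%\system32\wbem
for %%i in (*.dll) do regsvr32 -s %%i
for %%i in (*.exe) do %%i /RegServer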

If that does not work, I also suggest running the following commands to repair the WMI namespace:

  • net stop winmgmt
  • wmic /NAMESPACE:\\root path "__namespace.name='wmi'" delete
  • mofcomp %windir%\system32\wbem\wmi.mof
  • net start winmgmt

Restart the computer to check the result. If the issue persists, try the following steps:

  • winmgmt /verifyrepository
  • winmgmt /salvagerepository

PVS Gold image Admin shares fix

I was recently working on some PVS image issues where the admin shares stopped working.  If that does not ring a bell, you access them like \\hostname\c$.

This was crucial for us for several reasons, including AV scans as well as monitoring solutions such as ControlUp, which is a great utility for monitoring your XenApp infrastructure.  You can read more about it by following this link.

Environment:

  • Windows 2008 R2/Windows 2003 SP3
  • Citrix XenApp 6.5/5.0 & Citrix XenDesktop 5.6
  • Windows 2003/2008/Win 7

Issue:

Windows admin shares stop working when the image is set to read-only mode.

Fix:

  • Open the Registry Editor on the machine you cannot connect to by clicking Start, Run
  • Type REGEDIT and press Enter, then go to the key below

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Lanmanserver\Parameters

  • In the right-hand portion of the Registry Editor, look for a value called IRPStackSize
  • If the value exists, double-click it, increase the decimal value to 15, and click OK.
  • Close the Registry Editor, reboot the computer, and try to connect to the network share. If you are still unable to connect, follow the steps above again, increase the decimal number for IRPStackSize by 5, and try again.

Continue to do this until the stack size is large enough to permit access. Personally, I had to increase the number to 25 before I could connect. The decimal range for the parameter is between 11 and 50.

  • If the IRPStackSize value DOES NOT exist in the right-hand column of the Registry Editor, right-click in the blank area in the right-hand column and choose New
  • Then click DWORD Value
  • For the name of the new value, type IRPStackSize and press Enter. Type the name with the correct capitalization as shown above.
  • Now double-click the IRPStackSize value, type 15 in the value data box, select Decimal, and click OK.
  • Close the Registry Editor and reboot the computer. Try to access the network share again. If the same error appears, follow the steps above to increase the value and reboot again. Continue this procedure until the problem is resolved (a command-line equivalent follows this list).
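For what it is worth, the same change can also be made from an elevated command prompt instead of the Registry Editor (a sketch; adjust the /d value as you step it up, and reboot afterwards):

reg add "HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters" /v IRPStackSize /t REG_DWORD /d 15 /f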

Citrix Provisioning Services DB Offline support

 

With the introduction of Citrix Provisioning Services 5.6 and 6.x, there is a feature that is sometimes overlooked which allows your provisioned hosts to remain up in the event that the SQL database goes down.

Environment:

  • Windows 2008 R2
  • XenApp 6.5
  • XenDesktop 5.6
  • UPM 4.1
  • PVS 6.1

Issue:

  • The PVS Stream service is down due to unavailability of the PVS SQL database causing ALL streamed servers to stop responding and bringing all your XenApp/XenDesktop hosts down
  • This issue also occurs in a highly available environment (2 provisioning servers or more).

Now let's remember that one of the responsibilities of the PVS Stream service is to communicate between the Provisioning Servers and the SQL database; when the SQL database is down and offline support is not enabled, the Stream service cannot function properly.

When this service is down, all streamed servers/desktops stop functioning.

Solution:

Enable the DB Offline Support feature in the Provisioning Services Farm

  • Using the Provisioning Services console, connect to your Provisioning Services farm. Right-click the farm and select Properties.


  • Select the Options tab, and select DB Offline.

  • Click OK; you will be prompted to restart the Stream Service on each server for the setting to take effect.

 

Provisioning Services antivirus exclusions

Citrix wrote an excellent article about the PVS antivirus exclusions, and I thought I would share it with you.  In my experience with PVS, this is a very crucial step, as it ensures you don't interrupt the streaming process and/or slow things down.

If you are like me and decide to run the TFTP boot process on separate servers, you can create TFTP HA utilizing the NetScalers.  If you decide to go this route, which I recommend, you will need to exclude the TFTP directory on the separate TFTP hosts.  You can read about how to HA the PVS TFTP boot process via the NetScalers in a previous post I wrote, which, let me tell you, was no easy task.

To read the entire Citrix article about the Antivirus exclusions, click here.

A few recommended Server Side file exclusions.

C:\Windows\System32\drivers\CVhdBusP6.sys => (PVS 6.1)
C:\Windows\System32\drivers\CVhdBus2.sys => (PVS 5.6)
C:\Windows\System32\drivers\CFsDep2.sys => (PVS 5.6 and PVS 6.1)
C:\Program Files\Citrix\Provisioning Services\BNTFTP.EXE => (PVS 5.6 and PVS 6.1)
C:\ProgramData\Citrix\Provisioning Services\Tftpboot\ARDBP32.BIN => (PVS 5.6 and PVS 6.1)
D:\Store => ( i.e. local vdisk store)

A few recommended Server Side processes to be excluded.
C:\Program Files\Citrix\Provisioning Services\StreamService.exe => (All versions)
C:\Program Files\Citrix\Provisioning Services\StreamProcess.exe => (All versions)
C:\Program Files\Citrix\Provisioning Services\soapserver.exe => (All versions)

A few recommended Target Device exclusions.
C:\Windows\System32\drivers\bnistack.sys => (Only targets, Win2003/XP)
C:\Windows\System32\drivers\bnistack6.sys => (Only targets, 2008/Win7)
C:\Windows\System32\drivers\BNNF.sys => (Only targets)
C:\Windows\System32\drivers\BNNS.sys => (Only targets, Win2003/XP)
C:\Windows\System32\drivers\BNNS6.sys => (Doesn’t exist anymore with PVS6.1 Agent)
C:\Windows\System32\drivers\BNPort.sys => (Only targets)
C:\Windows\System32\drivers\CFsDep2.sys => (Only targets, Win2003/XP)
C:\Windows\System32\drivers\CVhdBusP52.sys => (Only targets, Win2003/XP)
C:\Program Files\Citrix\Provisioning Services\BNDevice.exe => (Only targets, 2008/Win7)
C:\Program Files\Citrix\Provisioning Services\TargetOSOptimizer.exe => (Only targets, 2008/Win7)

 

Determining the size of your Provisioning Services Write-Cache

I have been prepping to get a new XenDesktop 5.6 FP1 / XenApp 6.5 environment to run with Provisioning Services 6.1.  In the last few weeks, I have been collecting lots of information regarding best practices, etc.   In the past I worked with PVS versions 5.0, 5.1, 5.2 and 5.6.  However, I always ask myself and others, "What is the correct size for my write cache?" A good question, and the answer, well, "it depends."  Below is some great information to use for determining the write-cache drive size.

What is Provisioning Services, a write-cache, and why does it matter?

Before we begin a little background on Citrix Provisioning Services (PVS) and how it works. Provisioning Services provides administrators the ability to virtualize a hard disk or workload and then stream it back out to multiple devices. The workloads, which can be server or desktop, are ripped from a physical or virtual disk into Microsoft’s virtual hard disk (VHD) format and treated as a golden master image called a vDisk. This master image is then streamed over the network from a Windows server running the stream service to multiple target devices that were PXE booted.

Device drivers (network and disk) installed on the target devices (physical or virtual) have the intelligence to route disk file requests over the network to the PVS streaming servers which in turn provide the requested files. The entire vDisk is not streamed across the network. Only the files requested by the operating system are streamed across the network. This means that a 30 GB Windows Server 2008 R2 workload that boots off a streamed vDisk may only see 200 MB of files transfer across the network.

When a vDisk is in private mode, it can be edited, all reads and writes go directly to the vDisk, and only one target device may access the vDisk at a time. When a vDisk is in standard mode, it is read-only and no changes can be made to it. Instead, all disk write operations are redirected to what is referred to as a write-cache file. The intelligent device drivers are smart enough to redirect writes to the write-cache file and, when necessary, read newly written files from the write-cache file instead of the server. Also, whenever a target device is rebooted, the write-cache is deleted and recreated, and the device boots in a pristine state.

What factors are important in determining the write-cache file size?

When using Citrix Provisioning Services with the vDisk in standard mode, you have a write-cache drive location that holds all the writes for the operating system. If the write-cache file fills up unexpectedly, the operating system will behave the same as if the drive ran out of space without any warning; in other words, it will blue screen.

The optimum size of write-cache drive does depend on several factors:

Frequency of server reboots. The write-cache file is reset upon each server boot so the size only needs to be large enough to handle the volume between reboots.

Amount of free space available on the c: drive. The space that will be used for new files written to the c: drive is considered the free space available. This is a key value when determining the write-cache drive size.

Amount of data being saved to the c: drive. Data that is written to the c: drive during operation will get stored automatically in the write-cache drive. New files will be stored in the write-cache file and decrease the amount of available space. Replacements for existing files will also be written to the write-cache file but will not marginally affect the amount of free space. For instance, a service pack install on a standard-mode disk will result in the write-cache file holding all the updated files, with very little change in available space.

Size and location of the pagefile. When a local NTFS-formatted drive is found, Provisioning Services moves the Windows pagefile off of the c: drive to the first available NTFS drive, which is also the location of the write-cache file. Therefore, in the default configuration, the write-cache drive will end up holding both the write-cache file and the pagefile. To learn more about correctly sizing your pagefile, see Nick Rintalan’s blog, “The Pagefile Done Right!”.

Location of the write-cache file. The location of the write-cache file is also a factor in determining its size. The write-cache file can be held on the target device’s local disk, the target device’s RAM, or on the streaming server.

  • Target device disk: If the write-cache file is held on the target device’s disk, it could be a local disk to client, local disk to the hypervisor, network storage to the hypervisor, or SAN storage to the hypervisor.
  • Target device RAM: If the write-cache file is held in the target device’s RAM the response time will be faster and in some cases the additional RAM is less expensive than SAN disk.
  • Streaming Server: If the write-cache file is on the server, no preset size is necessary. When using server-side write-cache file, the Provisioning Services streaming server must have enough disk space to hold the write-cache files for all target devices managed.

Determining the correct write-cache drive size is mostly a logical exercise once you understand the relationship of the write-cache file and the pagefile with the write-cache drive.

Guidelines for determining write-cache size

In the old days we would recommend running with server-side write cache for the duration of the pilot project and then find the largest write-cache file on the server before the target devices were rebooted. From there we would just double or triple the size and make that the default size for a write-cache file. That approach works most of the time, but it is not so efficient with disk space.

Below are the few guidelines I use when recommending a size for the client-side write-cache drive.

  1. Write-cache drive = write-cache file + pagefile (if pagefile is stored on the write-cache drive)
  2. Write-cache file size should be equal to the amount of free space left on the vDisk image. This will work in most situations, except those where servers receive large file updates immediately after booting. As a rule, your vDisk should not be getting updated while running in standard-mode.
  3. Always account for the pagefile location and size. If it is configured to reside on the c: or d: drive, include it in all size calculations.
  4. Set the pagefile to a predetermined size to make it easier to account for it. Letting Windows manage the pagefile size starts with 1x RAM but it could vary. Manually setting it to a known value will provide a static number to use for calculations.
  5. During the pilot, use server-side write caching to get an idea of the maximum size you might see a file reach between server reboots. Obviously, the server should have a full load and should be subject to the normal production reboot cycle for this to be of value.
  6.  If people die when servers blue screen, set the write-cache drive to the size of the vDisk plus the pagefile size.

In most situations, the recommended write-cache drive size will be free space available on vDisk image plus the pagefile size. For instance, if you have a 30GB Windows Server 2008 R2 vDisk with 16GB used (14GB free) and are running with an 8GB pagefile, I would recommend using a write-cache drive of 22GB calculated as 14GB free space + 8GB for the pagefile. If space doesn’t permit, you have a few options, not all of which may be available to you.

  1. If storage location for the write-cache drive supports thin-provisioning, configure thin-provisioned drives for the write-cache drive to save space.
  2. Use dynamic VHDs (instead of fixed VHDs) though this approach is generally only recommended for XenDesktop workloads. If you choose this approach, you will probably need to periodically reset the size of the dynamic VHD, which can be done with a PowerShell script.
  3. Reboot the servers more frequently which in turn will reduce the maximum size of the write-cache file.
  4. Move the pagefile to a different drive or run without a pagefile.
  5. Use the old school method mentioned earlier to select a write-cache file size that is equal to or larger than the largest write-cache file recorded during the pilot stage. Using this option though may still result in blue screen events.

Of course, if you require 100% uptime and you have the disk space available, the sure-fire write-cache drive size is to set it to the size of the vDisk plus the pagefile size when the pagefile will get placed on the write-cache drive. In other words, if the Windows Server 2008 R2 vDisk image is 30GB and you have an 8GB pagefile configured, setting the write-cache drive size to 38GB will protect against any unforeseen blue screens. However, not everyone has that kind of space available, especially when using the expensive SAN storage for the write-cache drives.

Scalability implications

Just a quick note: in large-scale environments, the best-practice recommendation is to place the write-cache drive on the client hard disk rather than on the server. Generally speaking, you get about 40-60% more target devices on a single Provisioning Server with client-side write cache than you do with server-side write-cache drives.  In addition, failover works better, as the client target device has its write cache available no matter which server is streaming the vDisk.

The use of client-side write-cache provides the maximum scalability of the Provisioning Services streaming server because the server does not need to perform both reads and writes for all target devices; rather the server is only required to read the vDisk once, cache the contents, and then stream it out over the network. This saves both CPU and network bandwidth on the streaming server allowing it to manage more target devices.

Citrix Provisioning 6.1 and ESX 4.1 / 5.0 VMXNET3 NIC driver issue

Before you roll out Citrix Provisioning 6.1, make sure you install the latest Citrix PVS 6.1 hotfixes. I had some really weird behavior when attempting to boot up VMs on both ESX 4.1 and 5.0.

Issue:

I was not able to get target VMs working with ESX 4.1 / 5.0 using the VMXNET3 NIC driver.  As most of you know, this is the recommended NIC to use in a VMware virtual infrastructure.  Unfortunately, the VMs would not boot, and I had to roll back to the E1000 driver and run the 5.6 target agent (what?).  After successfully booting the VM, I decided to do more digging into the VMXNET3 issue.

So after a few hours of pulling my hair out, this is what I found:

Install the latest PVS hotfixes, in my case CPVS61E010 for PVS 6.1 (basically it is PVS 6.2).  However, the following steps need to be followed:

  • Completely remove the existing install and re-install the core software
  • The database gets upgraded, so make sure you have a good backup (the install process takes care of the upgrade for you)
  • Install the new target agent
  • This all works under ESX 5.x; for some reason, if you are running 4.1 like a lot of people, you will also need to apply hotfix CPVS56SP1E011 (what the heck?).  Once this was done, I was able to use the VMXNET3 NIC driver on my provisioned VMs.