How to install an Ubuntu 18.04 LTS guest/virtual machine (VM) on Windows Server 2019 Hyper-V, using Discrete Device Assignment (DDA) to attach an NVIDIA Tesla V100 32GB on a Dell PowerEdge R740

Background

Commodity NVIDIA cards such as the GeForce GTX 1080 Ti are not intended by their manufacturer to be used in virtualized environments. Specifically, there are features built into the drivers that make passing the PCI-E device through from the host into the VM a challenge. This can be bypassed if the host is running Linux/KVM (I will cover this in another post); however, I am not aware of a solution when the host is running Windows Server / Hyper-V as the hypervisor.

I will discuss why Hyper-V is far more appealing than KVM or VirtualBox for this particular deployment in another post. For the purposes of this post, I am also deliberately ignoring the performance impact of running GPU-enabled deep learning inside a VM.

The NVIDIA cards that do support PCI-E passthrough (or Discrete Device Assignment, DDA, in the Hyper-V world) are the Quadro, GRID and Tesla lines (amongst others). I am using a 14th-generation Dell PowerEdge R740 with riser config 4 and the GPU cooling trays and power kits as my bare metal. After a previous experience using the same kit with commodity NVIDIA cards, for this deployment we chose Dell pre-installed Tesla V100 32GB cards. These are passively cooled devices and support DDA out of the box (as does the host).

The R740 host runs BIOS version 2.3.10, is in UEFI boot mode with defaults loaded, the performance profile enabled and SR-IOV Global Enable switched on. Windows Server 2019 with the Hyper-V role is installed. N.B. the BIOS version matters: earlier BIOS versions on 14G Dells had an issue that prevented the 2080 Ti and newer cards from working correctly.
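As a quick sanity check before going further, you can ask Hyper-V whether it sees the SR-IOV/IOMMU support that those BIOS settings enable. This is only indicative (the SurveyDDA.ps1 script mentioned later is the proper DDA readiness check), but it is a cheap first test:

# Ask the hypervisor whether SR-IOV / IOMMU support is visible.
# Indicative only; DDA readiness is properly verified with SurveyDDA.ps1.
Get-VMHost | Format-List -Property IovSupport, IovSupportReasons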

Aim

The aim is to pass the Tesla through from Server 2019 into a VM running either CentOS or Ubuntu and get CUDA working.

Issues

I decided to start with Ubuntu 18.04.3 LTS. I am not a fan of Ubuntu in general (I think the code base is far too volatile and represents the most unstable and inconsistent mainstream LTS out there); however, whilst I think running Ubuntu LTS Server in production is an act of utter madness, I do recognise that Ubuntu Desktop fulfils a vital role in the desktop space, especially for a bit of data science. It is also on Microsoft's Hyper-V list of guests with DDA support.

We first create a generation 1 VM in Hyper-V with 64GB of RAM and 8 CPU cores; NUMA spanning is off and these resources fit well within a single NUMA node. Following the Microsoft DDA notes (https://docs.microsoft.com/en-us/windows-server/virtualization/hyper-v/plan/plan-for-deploying-devices-using-discrete-device-assignment) and this excellent guide (https://blog.workinghardinit.work/2016/04/11/discrete-device-assignment-in-windows-server-2016-hyper-v/), we run the following to identify the device, dismount it from the host and attach it to the VM:

# List the display-class PCI devices on the host
$MyDisplays = Get-PnpDevice | Where-Object {$_.Class -eq "Display"}
$MyDisplays | ft -AutoSize

# Narrow to devices bound to the NVIDIA display driver (nvlddmkm)
$MyNVIDIA = Get-PnpDevice | Where-Object {$_.Class -eq "Display"} |
Where-Object {$_.Service -eq "nvlddmkm"}
$MyNVIDIA | ft -AutoSize

# Grab the PCIe location path of the first NVIDIA device
$DataOfGPUToDDismount = Get-PnpDeviceProperty DEVPKEY_Device_LocationPaths -InstanceId $MyNVIDIA[0].InstanceId
$DataOfGPUToDDismount | ft -AutoSize

$locationpath = ($DataOfGPUToDDismount).data[0]
$locationpath | ft -AutoSize

# The device must be disabled on the host first, either in Device Manager
# or directly from PowerShell:
Disable-PnpDevice -InstanceId $MyNVIDIA[0].InstanceId -Confirm:$false

# Dismount it from the host, reserve the MMIO space and hand it to the VM
Dismount-VMHostAssignableDevice -LocationPath $locationpath
Set-VM -LowMemoryMappedIoSpace 3Gb -VMName test
Set-VM -HighMemoryMappedIoSpace 33280Mb -VMName test
Add-VMAssignableDevice -LocationPath $locationpath -VMName test
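For completeness, here is a sketch of reversing the process when you want the GPU back on the host (same VM name and location path as above; the VM must be powered off first):

# Detach the GPU from the (stopped) VM, remount it on the host and
# re-enable it so the host driver can bind to it again
Remove-VMAssignableDevice -LocationPath $locationpath -VMName test
Mount-VMHostAssignableDevice -LocationPath $locationpath
Enable-PnpDevice -InstanceId $MyNVIDIA[0].InstanceId -Confirm:$false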

Note that the MMIO space has been set as per the guide. Running the SurveyDDA.ps1 script (see the aside below for where to find it) confirms the Tesla is suitable for DDA and that it requires a 48MB MMIO gap; nevertheless, we used the guide's suggested settings. Ubuntu 18.04.3 LTS amd64 is installed from the ISO, and on first boot this message is seen in dmesg:

hv_pci *UUID*: Need 0x802000000 of high MMIO space. Consider reconfiguring the VM.
hv_vmbus: probe failed for device *UUID* (-6)

This is with the ISO kernel (5.0.0-23-generic), and lspci shows no Tesla on the virtual PCI bus. It appears that we have the same issue as https://gridforums.nvidia.com/default/topic/10090/tesla-boards/tesla-v100-hyper-v-2019-device-cannot-find-enough-resources-code12-mmio-config-/?offset=2#16249 .
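As an aside, the SurveyDDA.ps1 script mentioned above lives in Microsoft's Virtualization-Documentation samples on GitHub. A sketch of fetching and running it on the host follows; note the path within the repo may have changed since this was written:

# Download and run Microsoft's DDA survey script on the host
# (check the Virtualization-Documentation repo if this path no longer exists)
$uri = "https://raw.githubusercontent.com/MicrosoftDocs/Virtualization-Documentation/master/hyperv-samples/benarm-powershell/DDA/survey-dda.ps1"
Invoke-WebRequest -Uri $uri -OutFile .\SurveyDDA.ps1
.\SurveyDDA.ps1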

Interestingly, if you wipe the vhdx at this point, change nothing else and install Windows 10, the device pops up on the bus and the driver installs and works. However, wiping the vhdx and installing CentOS 7 or 8 hits the same issue as Ubuntu. So what happens if you increase the MMIO space? First, decode the size the kernel asked for:

0x802000000 (hex) = 34,393,292,800 bytes (decimal), which is just over 32GB
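A quick way to check that arithmetic in PowerShell:

# The kernel is asking for just over 32GB of high MMIO space
[uint64]0x802000000        # 34393292800 bytes
[uint64]0x802000000 / 1GB  # 32.03125, i.e. 32GB plus a 32MB gap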

Therefore, if you increase the high MMIO space accordingly and go back to Ubuntu using these parameters:

Set-VM -LowMemoryMappedIoSpace 3Gb -VMName test
Set-VM -HighMemoryMappedIoSpace 33GB -VMName test
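You can then confirm what the VM actually ended up with. A small sketch: I believe the configured values are exposed as properties on the VM object, and the wildcard guards against property-name differences between Hyper-V releases:

# Show the MMIO space now configured on the VM
Get-VM -VMName test | Format-List -Property *MemoryMappedIoSpace*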

Ubuntu promptly hangs during boot-up:

/dev/sda1: Clean, ****

However, if you install the latest Azure kernel (5.0.0-1028-azure) or the very latest Linux kernel (5.3.0-26-generic) before you increase the high MMIO space, then it works and the card appears on the bus. After adding the NVIDIA Ubuntu Tesla repo and installing CUDA, all is working!
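For reference, inside the guest that amounts to something like the following sketch (assuming the stock 18.04 archives; the Tesla driver and CUDA packages themselves come from NVIDIA's repo, per their own instructions):

# Inside the Ubuntu guest, before raising the high MMIO limit:
# install the Azure-tuned kernel and reboot into it
sudo apt-get update
sudo apt-get install -y linux-azure
sudo reboot

# After raising the high MMIO space, the card should now be visible
lspci | grep -i nvidia
nvidia-smi    # once the NVIDIA Tesla driver and CUDA are installed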

Does anyone know what has changed in these newer kernels? Hope this helps.
