As you may have noticed, the cost of using GPUs on public clouds is high, even very high.
So, if your application only needs GPU nodes occasionally, you'll want to make sure those nodes are only billed while they are actually running your workload!
We'll show you how to do this using an Azure Kubernetes Service (AKS) cluster and the nodepool concept.
What we will achieve here:
- Create an AKS cluster
- Get the prerequisites to use GPU nodes
- Create an additional nodepool with a GPU node
- Create a Kubernetes job that requests a GPU
Let's do it!
We will assume that you have an account with sufficient rights to create and manage resource groups in your subscription.
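If you want to double-check which account and subscription you are about to work in, these standard commands show the current context and, if needed, switch it (the subscription value below is just a placeholder):
az account show -o table
az account set --subscription "<your subscription name or id>"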
Create the AKS cluster
We will log in to the Azure subscription and create a single-node AKS cluster in the Switzerland North location.
az login
$resourcegrouplocation="switzerlandnorth"
$resourcegroupname="aks-gpu-demo-01"
$aksclustername="aks-gpu-demo-01"
az group create --name $resourcegroupname --location $resourcegrouplocation
az aks create -g $resourcegroupname -n $aksclustername --node-count 1 --generate-ssh-keys
And voilà, your AKS cluster is being deployed in your subscription. When the cluster is up and running, you can kubectl it.
az aks get-credentials -g $resourcegroupname -n $aksclustername
kubectl get nodes -o wide
Get the prerequisites to use GPU nodes
There is a more administrative step to complete before you can continue.
You have to request a quota increase for NC6s_v3 VMs in your region. This quota request doesn't incur any additional cost, it's purely administrative.
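If you want to check your current quota for the NCSv3 family in the region, a query like the one below should work; the 'NCS' filter on the family name is an assumption based on how the quota entry is usually named, so adjust it if your output differs:
az vm list-usage --location $resourcegrouplocation -o table --query "[?contains(name.value, 'NCS')].{Name:name.localizedValue, Current:currentValue, Limit:limit}"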
Once your quota request is approved, you can proceed with the GPU feature registration.
az feature register --name GPUDedicatedVHDPreview --namespace Microsoft.ContainerService
az feature list -o table --query "[?contains(name, 'Microsoft.ContainerService/GPUDedicatedVHDPreview')].{Name:name,State:properties.state}"
az provider register --namespace Microsoft.ContainerService
az extension add --name aks-preview
az extension update --name aks-preview
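Feature registration can take several minutes. Before creating the GPU nodepool, you can confirm that both the feature and the resource provider report as registered (standard commands, nothing assumed here):
az feature show --namespace Microsoft.ContainerService --name GPUDedicatedVHDPreview -o table
az provider show --namespace Microsoft.ContainerService --query registrationState -o tsv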
Create an additional nodepool with a GPU node
Here we will create the new GPU nodepool using the standard_nc6s_v3 VM size. We will specify a maximum of 1 node and a minimum of 0. This lets the cluster autoscaler keep the GPU nodepool at 0 nodes by default, so we don't have a GPU node running (and billing) all the time.
$gpunodepoolname="aksgpudemo01"
$gpunodesize="standard_nc6s_v3"
$mingpunodecount="0"
$maxgpunodecount="1"
az aks nodepool add --resource-group $resourcegroupname --cluster-name $aksclustername --name $gpunodepoolname --node-count 0 --node-vm-size $gpunodesize --node-taints sku=gpu:NoSchedule --aks-custom-headers UseGPUDedicatedVHD=true --enable-cluster-autoscaler --min-count $mingpunodecount --max-count $maxgpunodecount
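At this point the GPU nodepool exists but holds 0 nodes, so it doesn't cost anything yet. You can verify it with the standard nodepool listing (nothing assumed here beyond the variables defined above):
az aks nodepool list --resource-group $resourcegroupname --cluster-name $aksclustername -o table
kubectl get nodes -o wide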
Create a Kubernetes job that requests a GPU
To use the GPU node, schedule a GPU-enabled workload with the appropriate resource request. Below we run a TensorFlow job against the MNIST dataset. Create a file named aks-gpu-demo-01.yaml and paste the following YAML manifest. The important parts are the nvidia.com/gpu: 1 resource limit, which requests scheduling on a GPU worker node, and the toleration matching the sku=gpu:NoSchedule taint we put on the GPU nodepool.
apiVersion: batch/v1
kind: Job
metadata:
  labels:
    app: aks-gpu-demo-01
  name: aks-gpu-demo-01
spec:
  template:
    metadata:
      labels:
        app: aks-gpu-demo-01
    spec:
      containers:
      - name: aks-gpu-demo-01
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        args: ["--max_steps", "500"]
        imagePullPolicy: IfNotPresent
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: OnFailure
      tolerations:
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
Deploy the job and watch the nodes
kubectl apply -f aks-gpu-demo-01.yaml
watch kubectl get nodes -o wide
You can now watch the cluster autoscaler add the GPU node when the job's pod is pending, run the job on it, and remove the node again (after a short scale-down delay) once the job has finished.
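If you also want to follow the job itself rather than just the nodes, these standard kubectl commands work as well (the job name comes from the manifest above):
kubectl get pods -o wide -w
kubectl logs -f job/aks-gpu-demo-01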
With this method we optimize and control the costs generated by the use of GPU nodes ;-)
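And if this was only a demo for you, deleting the resource group removes the cluster and everything in it, so nothing keeps generating costs:
az group delete --name $resourcegroupname --yes --no-wait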
Feel free to comment on this article if you have questions.