enable GPU support
Before installing nos, you must enable GPU support in your Kubernetes cluster.
using the GPU Operator
Nvidia Docs · Nvidia Repo
The easiest way is to install only the NVIDIA GPU Operator, which automatically installs all the necessary components for you.
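The chart below comes from the nvidia Helm repository; if it is not configured yet, it can typically be added first (standard NVIDIA chart repo URL):
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update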
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator --version v22.9.0 \
--set driver.enabled=true \
--set migManager.enabled=false \
--set mig.strategy=mixed \
--set toolkit.enabled=true
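Once the chart is installed, the operator components should come up in the gpu-operator namespace; a quick check:
kubectl get pods -n gpu-operator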
microk8s
Save the following as values.yaml:
migManager:
  enabled: false
mig:
  strategy: mixed
toolkit:
  enabled: true
microk8s disable gpu
kubectl delete crd clusterpolicies.nvidia.com || true
kubectl delete crd nvidiadrivers.nvidia.com || true
microk8s enable gpu --driver operator --version 22.9.0 --values values.yaml
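To confirm the operator pods came up (the microk8s add-on installs them into the gpu-operator-resources namespace, the same one used in the troubleshooting section below):
microk8s kubectl get pods -n gpu-operator-resources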
manual
Alternatively, you can manually install the required components individually.
install Nebuly's device plugin
nos device plugin Docs · Custom Values
turn off the existing NVIDIA device plugin (if any)
kubectl label node arman-gpu nvidia.com/gpu.deploy.device-plugin=false --overwrite
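To double-check that the label landed on the node before installing Nebuly's plugin:
kubectl get node arman-gpu --show-labels | grep gpu.deploy.device-plugin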
Alternative to turning off the NVIDIA device plugin: apply the following to the gpu-operator Helm chart to disable the NVIDIA device plugin on the nebuly target nodes:
nodeAffinity:
  requiredDuringSchedulingIgnoredDuringExecution:
    nodeSelectorTerms:
      - matchExpressions:
          - key: nos.nebuly.com/gpu-partitioning
            operator: NotIn
            values:
              - mps
migManager:
  enabled: false
mig:
  strategy: mixed
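Assuming the snippet above is saved as a values file (for example device-plugin-affinity.yaml, a name chosen here; where exactly the affinity nests depends on the gpu-operator chart version), it could be applied to the existing release with something like this (release name and namespace as in the troubleshooting section below, adjust if yours differ):
helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator-resources --reuse-values -f device-plugin-affinity.yaml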
enable automatic MPS partitioning on a node
kubectl label nodes arman-gpu "nos.nebuly.com/gpu-partitioning=mps"
NOTE: ONLY run the following if the plugin failed to come up
kubectl label node arman-gpu nvidia.com/device-plugin.config-
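If you later need to turn MPS partitioning off again on a node, the label can be removed with the same trailing-dash syntax:
kubectl label nodes arman-gpu nos.nebuly.com/gpu-partitioning-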
install plugin
helm install oci://ghcr.io/nebuly-ai/helm-charts/nvidia-device-plugin \
--version 0.13.0 \
--generate-name \
-n nebuly-nvidia \
--create-namespace
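To confirm the plugin pods are running (same namespace used by the logging commands later in this document):
kubectl get pods -n nebuly-nvidia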
install nos
Docs · Configuration Values · Configuration Values Description
install cert-manager
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.2/cert-manager.yaml
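cert-manager needs a moment to become ready; one way to wait for it before installing nos (adjust the timeout as needed):
kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=300s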
install nos using helm
helm install oci://ghcr.io/nebuly-ai/helm-charts/nos \
--version 0.1.2 \
--namespace nebuly-nos \
--generate-name \
--create-namespace
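Then verify the nos components came up:
kubectl get pods -n nebuly-nos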
uninstall nos
# the install above used --generate-name, so find the release name first
helm list -n nebuly-nos
helm uninstall <nos-release-name> -n nebuly-nos
create test pods
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: mps-partitioning-example
spec:
  hostIPC: true # (2)
  securityContext:
    runAsUser: 1000 # (3)
  containers:
    - name: sleepy
      image: "busybox:latest"
      command: ["sleep", "120"]
      resources:
        limits:
          nvidia.com/gpu-10gb: 1 # (1)
EOF
- (1) Fraction of a GPU with 10 GB of memory
- (2) hostIPC must be set to true (allows containers to access the host's IPC namespace)
- (3) Containers must run as the same user as the MPS server (1000)
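Once the nos gpu-partitioner has processed the pending pod, the requested MPS slice should show up among the node's resources and the pod should start; a quick check (node name as used in the examples above):
kubectl describe node arman-gpu | grep nvidia.com/gpu-10gb
kubectl get pod mps-partitioning-example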
logging
kubectl logs -n nebuly-nos pod/nos-1707662485-gpu-partitioner-846f9dc94f-w7n7f -f
kubectl logs -n nebuly-nvidia -l app.kubernetes.io/name=nebuly-nvidia-device-plugin -f -c nvidia-device-plugin-ctr
kubectl logs -n nebuly-nvidia -l app.kubernetes.io/name=nebuly-nvidia-device-plugin -f -c nvidia-mps-server
spw   # shell alias (presumably for watching pods)
# or
watch -n 1 kubectl get all
troubleshooting
Common errors are covered here.
error in 'microk8s enable gpu' after having run different operators (from the nos repo or the NVIDIA website)
kubectl delete crd nodefeaturerules.nfd.k8s-sigs.io
helm repo update   # possibly not needed
"Error: specified config arman-gpu-1707430327 does not exist"
You need to remove the stale label from your node (it references a device plugin config that no longer exists):
kubectl label node arman-gpu nvidia.com/device-plugin.config-
Full Error Message:
msg="Label change detected: nvidia.com/device-plugin.config=arman-gpu-1707430327"
time="2024-02-09T16:25:33Z"
level=info msg="Error: specified config arman-gpu-1707430327 does not exist"
Error from server (BadRequest): container "nvidia-device-plugin-sidecar" in pod "nvidia-device-plugin-nebu
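If the plugin pod does not pick up the label removal on its own, deleting it lets the DaemonSet recreate it cleanly (pod selector taken from the logging commands above; adjust if your install differs):
kubectl delete pod -n nebuly-nvidia -l app.kubernetes.io/name=nebuly-nvidia-device-plugin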
check that the values have been applied
You can verify that the custom Helm values have been set correctly with:
helm get values gpu-operator -n gpu-operator-resources
The output should look like this:
driver:
  enabled: "true"
mig:
  strategy: mixed
migManager:
  enabled: false
operator:
  defaultRuntime: containerd
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /var/snap/microk8s/current/args/containerd-template.toml
    - name: CONTAINERD_SOCKET
      value: /var/snap/microk8s/common/run/containerd.sock
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "1"
Or check all of the values with:
helm get values gpu-operator -n gpu-operator-resources --all
Or get all information about the release:
helm get all gpu-operator -n gpu-operator-resources
Or check only the chart's default values with:
helm show values nvidia/gpu-operator