NVIDIA/gpu-operator issues and pull requests

#545 - gpu-operator on RKE2 (RH8) - nvidia-cuda-validator component crashing continuously.

Issue - State: open - Opened by frjaraur over 1 year ago - 5 comments

#544 - nvidia-fabricmanager.service can not start due to CUDA Version mismatch

Issue - State: open - Opened by bo-zeng-ml over 1 year ago

#543 - Update blossom-ci.yml

Pull Request - State: open - Opened by rorajani over 1 year ago - 2 comments
Labels: invalid

#542 - GPU-operator not applying driver version changes on EKS

Issue - State: closed - Opened by sjkoelle over 1 year ago - 8 comments

#541 - GPU operator 22.9.2 installation is failing

Issue - State: closed - Opened by likku123 over 1 year ago - 5 comments

#540 - Unable to deploy GPU Operator on MicroShift 4.13

Issue - State: closed - Opened by sjug over 1 year ago - 3 comments

#539 - GPU operator validator fails to create host device symlinks

Issue - State: open - Opened by adamancini over 1 year ago - 2 comments

#538 - GPU Operator Install with Terraform Not Working - Chart Not found

Issue - State: closed - Opened by nirajdesai2909 over 1 year ago - 3 comments

#537 - Relabelings for ServiceMonitor

Issue - State: closed - Opened by faurik over 1 year ago - 1 comment

#536 - Unable to cordon nodes

Issue - State: open - Opened by guyst16 over 1 year ago - 1 comment

#535 - dcgm-exporter missing metrics for A100 when mig enabled

Issue - State: closed - Opened by alloydm over 1 year ago

#534 - TEST CI

Pull Request - State: open - Opened by rorajani over 1 year ago - 1 comment
Labels: invalid

#533 - Node-by-node migration of k8s from docker to containerd with running toolkit leading to the unavailability of not-yet-upgraded nodes.

Issue - State: closed - Opened by punkerpunker over 1 year ago - 3 comments

#532 - Does this tool support windows nodes?

Issue - State: open - Opened by skiwheelr over 1 year ago - 1 comment

#531 - Toolkit containers in crash loop with "unresolvable CDI devices management.nvidia.com/gpu=all: unknown"

Issue - State: closed - Opened by benlsheets over 1 year ago - 6 comments

#530 - Not able to deploy Nvidia GPU Operator in Managed Kubernetes Service provided by OVH Cloud

Issue - State: closed - Opened by altruistcoder over 1 year ago - 2 comments

#529 - GPU Operator installation failure in AKS

Issue - State: closed - Opened by sidharthkumarpradhan over 1 year ago - 9 comments

#528 - nvidia-device-plugin-validator and nvidia-operator-validator in CrashLoopBackOff

Issue - State: closed - Opened by visla-xugeng over 1 year ago - 8 comments

#527 - Problem configuring vGPU access using Kubevirt

Issue - State: open - Opened by nadav213000 over 1 year ago - 14 comments

#526 - Could not resolve Linux kernel version on GKE 1.25.* + GPU Operator version: 23.3.1

Issue - State: closed - Opened by xcheng85 over 1 year ago - 9 comments

#525 - GPU-Operator does not install the specified driver version in AKS GPU Node

Issue - State: closed - Opened by xcheng85 over 1 year ago - 3 comments

#524 - Overriding The PrometheusRule Objects Alerts

Issue - State: open - Opened by guyst16 over 1 year ago - 1 comment
Labels: enhancement

#523 - nvidia gpu-operator in crashloopbackoff continuously on A100 nodes with 8 gpus.

Issue - State: open - Opened by hsuvarna over 1 year ago - 5 comments

#522 - Problem installing gpu-operator on rke2

Issue - State: closed - Opened by aavbsouza over 1 year ago - 5 comments

#521 - gpu-operator as OCI artifact

Issue - State: open - Opened by dioguerra over 1 year ago - 5 comments

#520 - gpu-operator console widget & gpu dashboard not reporting (correctly) after configuring mig

Issue - State: open - Opened by vittico over 1 year ago

#519 - Documentation clarification about containerd tweaks

Issue - State: open - Opened by aavbsouza over 1 year ago - 5 comments

#518 - Rename mig resources

Issue - State: closed - Opened by maaft over 1 year ago - 4 comments

#517 - rename

Issue - State: closed - Opened by maaft over 1 year ago

#516 - nvidia-settings and nvidia-xconfig not mounted to Pods

Issue - State: open - Opened by elgalu over 1 year ago - 2 comments

#515 - Add priorityClassName to nfd's pods

Pull Request - State: open - Opened by boniek83 over 1 year ago

#514 - Add priorityClassName to nfd's pods

Issue - State: open - Opened by boniek83 over 1 year ago - 1 comment

#513 - Openshift: NVIDIA GPU Operator: nvidia-container-toolkit-daemonset: InvalidImageName

Issue - State: closed - Opened by tormig-softronic over 1 year ago - 7 comments

#512 - feature discovery worker pod unable to connect to worker node

Issue - State: closed - Opened by gakshat14 over 1 year ago - 2 comments

#511 - Daemonset pods fail with: "nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown"

Issue - State: closed - Opened by ianblitz over 1 year ago - 4 comments

#510 - I cannot install because plugin-validation and cuda-validation fail.

Issue - State: closed - Opened by koh-hr over 1 year ago - 8 comments

#509 - GPU operator component compatibility matrix does not work for all combinations it specifies

Issue - State: open - Opened by lrotim over 1 year ago - 2 comments

#508 - Interaction between operator-validator and device-plugin causes error state.

Issue - State: open - Opened by neggert over 1 year ago - 3 comments

#507 - Feature request: More support for Ada/Hopper Generation gpus

Issue - State: open - Opened by hy-tomas-terala over 1 year ago - 2 comments

#506 - Does gpu-operator's MIG work with AWS A10G?

Issue - State: closed - Opened by randxie over 1 year ago - 2 comments

#505 - DCGM Exporter breaks after upgrade to 22.9.12

Issue - State: closed - Opened by dcarrion87 over 1 year ago - 5 comments

#504 - Failing to install nvidia drivers on a new GPU node on a fresh LTS Ubuntu 22.04

Issue - State: open - Opened by denissabramovs over 1 year ago - 26 comments

#503 - Kubernetes containers using NVIDIA_VISIBLE_DEVICES lose device access after systemctl daemon-reload

Issue - State: open - Opened by dcarrion87 over 1 year ago - 7 comments

#502 - install nvidia/tao/tao-getting-started:4.0.0 (TAO Toolkit API) and get error message: Back-off restarting failed container

Issue - State: closed - Opened by ShangWeiKuo over 1 year ago - 1 comment

#501 - notebook nvidia-smi command show nothing

Issue - State: closed - Opened by tvtv511 over 1 year ago

#500 - K8 Job does not get marked as completed after the pod succeeds in AKS version 1.25.5

Issue - State: open - Opened by rajivml over 1 year ago

#499 - Panic in reconciler when daemonsets.annotations contain value with colons

Issue - State: open - Opened by sole0bserver over 1 year ago - 1 comment

#498 - Not able to obtain metrics for pods in GPU node using DCGM Exporter. nv-hostengine debug logs give Error: Could not load NSCQ.

Issue - State: open - Opened by suchisur over 1 year ago - 4 comments

#497 - Operator does not work with signed drivers and secure boot mode enabled

Issue - State: open - Opened by bogdan-mitrea-ds over 1 year ago - 9 comments

#496 - [BUG]: console-plugin-nvidia-gpu

Issue - State: closed - Opened by grvn over 1 year ago - 6 comments

#495 - Which driver image version is best suitable for gpu nvidia rtx A4000 and A2000 in redhat8.6

Issue - State: open - Opened by carlwang87 over 1 year ago - 1 comment

#494 - I'm using Kubernetes 1.19.9 and I have a GPU machine in the cluster. The kubectl top nodes / kubectl top pod gives only CPU and Memory usage as follows

Issue - State: open - Opened by vsadanala over 1 year ago - 2 comments

#493 - dcgmproftester pod from install docs using outdated cuda

Issue - State: open - Opened by benlsheets over 1 year ago - 3 comments

#492 - Cannot load all gpus on a worker node when installing gpu-operator with helm on rke2 cluster.

Issue - State: closed - Opened by zeddit over 1 year ago - 6 comments

#491 - Openshift GPU Operator v22.9 doesn't set nvidia.com/gpu.deploy.driver appropriately on node (IBM ROKS RHEL 8 4.10)

Issue - State: closed - Opened by relyt0925 over 1 year ago - 3 comments

#490 - Deprecated API used

Issue - State: closed - Opened by tormig-softronic over 1 year ago - 5 comments

#489 - Bump golang.org/x/net from 0.1.0 to 0.7.0

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 2 comments
Labels: dependencies

#488 - NVIDIA GPU Operator installation failed with Helm

Issue - State: closed - Opened by somethingwentwell almost 2 years ago - 2 comments

#487 - Some pods are stuck in init on one of our clusters

Issue - State: open - Opened by Alwinator almost 2 years ago - 8 comments

#486 - if devicePlugin.enabled is set to disable nvidia-operator-validator stay in CrashLoopBackOff

Issue - State: closed - Opened by nneram almost 2 years ago - 5 comments

#485 - NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"

Issue - State: open - Opened by cdesiniotis almost 2 years ago

#484 - gpu-operator fails to start due to deletion of nonexistent resources

Issue - State: closed - Opened by xknight almost 2 years ago - 8 comments

#483 - How do I install using Kustomize?

Issue - State: open - Opened by choyuansu almost 2 years ago - 1 comment

#482 - gpu-operator cannot discover the newly added GPU

Issue - State: open - Opened by zhouhao3 almost 2 years ago - 3 comments

#481 - GPU Operator reconciliation loop failed

Issue - State: closed - Opened by arpitsharma-hexad almost 2 years ago - 3 comments

#480 - device-plugin-validator fails if all gpu resources are allocated on a node

Issue - State: open - Opened by dcarrion87 almost 2 years ago - 11 comments

#479 - A problem that labels are not normally created when using custom-config

Issue - State: open - Opened by brinst07 almost 2 years ago - 1 comment

#478 - ClusterPolicy generated by Helm chart is not valid

Issue - State: closed - Opened by mkjpryor almost 2 years ago - 1 comment

#477 - gpu-operator injecting runtimeClass after transitioning containerd runtime node

Issue - State: open - Opened by dcarrion87 almost 2 years ago - 10 comments

#476 - Forced Driver Update with v22.9.1

Issue - State: open - Opened by BCJuan almost 2 years ago - 2 comments

#475 - [Feature Request] Make nvidia-operator-validator add a validation successful label or taint on the node

Issue - State: open - Opened by chiragjn almost 2 years ago - 4 comments
Labels: enhancement

#474 - Failed to remove GPU in nvidia-driver container

Issue - State: open - Opened by zhouhao3 almost 2 years ago - 5 comments

#473 - NVIDIA Container Toolkit fails to set default runtime on RKE2

Issue - State: closed - Opened by eabochasjauregui almost 2 years ago - 13 comments

#472 - PODNAME not populated in DCGM metrics

Issue - State: open - Opened by harjitdotsingh almost 2 years ago

#471 - HPA using gpu-operator

Issue - State: open - Opened by harjitdotsingh almost 2 years ago

#470 - OpenShift GPU operator only working on 1/2 physical nodes properly

Issue - State: closed - Opened by mgiessing almost 2 years ago - 2 comments

#469 - GPUs are showing as 0 while we have GPUs in cluster/NVIDIA GPU administration dashboard

Issue - State: closed - Opened by arpitsharma-hexad almost 2 years ago - 13 comments

#468 - Changing Node workload type on running node

Issue - State: open - Opened by nadav213000 almost 2 years ago - 1 comment

#467 - Not able to see DCGM Metrics in prometheus

Issue - State: closed - Opened by harjitdotsingh almost 2 years ago - 3 comments

#466 - Gpu operator does not work with cri-o user namespaces

Issue - State: open - Opened by robertdavidsmith almost 2 years ago - 1 comment

#465 - Time-slicing with multiple GPUs - asking for ability to block single GPU

Issue - State: open - Opened by Alexbay218 almost 2 years ago - 1 comment
Labels: enhancement

#464 - Will gpu-operator support Rocky linux in the future?

Issue - State: open - Opened by carlwang87 almost 2 years ago

#463 - Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured

Issue - State: open - Opened by captainsk7 almost 2 years ago

#462 - [Feature Request] console-plugin-nvidia-gpu / GPU Operator Dashboard per project

Issue - State: closed - Opened by Alwinator almost 2 years ago - 2 comments

#461 - Cluster policy templating broken with default values

Issue - State: closed - Opened by danmx almost 2 years ago - 5 comments

#460 - Changing MIG strategy while Kubernetes cluster and gpu-operator running

Issue - State: closed - Opened by esparig almost 2 years ago - 2 comments

#458 - BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation

Issue - State: open - Opened by sfxworks almost 2 years ago - 16 comments

#457 - v22.9.0 - nvidia-driver-daemonset/nvidia-driver-ctr fails to start

Issue - State: closed - Opened by jeremy-london almost 2 years ago - 11 comments

#455 - Possible incompatibility with cpumanager, memorymanager, or topologymanager.

Issue - State: open - Opened by benlsheets almost 2 years ago - 3 comments

#454 - About the behavior of GPU-Operator when updating EUS

Issue - State: open - Opened by kousui-dev almost 2 years ago - 13 comments

#452 - Chcon command fails in nvidia-driver init - nvidia driver installation aborts

Issue - State: closed - Opened by snirkatriel almost 2 years ago - 5 comments

#451 - gpu-operator - deprecated API 1.25 call in audit log

Issue - State: closed - Opened by jpeimer almost 2 years ago - 3 comments

#443 - Getting Error: "stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown" while deploying gpu operator -22.9.0 on SLES 15 SP4

Issue - State: open - Opened by ATP-55 about 2 years ago - 9 comments

#441 - Error: failed to create FS watcher: too many open files

Issue - State: closed - Opened by EajksEajks about 2 years ago - 6 comments

#439 - DCGM exporter NodePort vs ClusterIP

Issue - State: closed - Opened by dcarrion87 about 2 years ago - 1 comment

#432 - Failed to get sandbox runtime: no runtime for nvidia is configured

Issue - State: open - Opened by denissabramovs about 2 years ago - 32 comments

#430 - Failed to initialize NVML: Unknown Error

Issue - State: open - Opened by hoangtnm about 2 years ago - 27 comments

#429 - gpu-operator-nfd-worker fails to read net interface attribute speed

Issue - State: closed - Opened by yotama-anv about 2 years ago - 13 comments

#428 - entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag)

Issue - State: open - Opened by relyt0925 about 2 years ago - 9 comments

#422 - console-plugin-nvidia-gpu / GPU Operator Dashboard not showing

Issue - State: closed - Opened by Alwinator about 2 years ago - 8 comments
