Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / NVIDIA/gpu-operator issues and pull requests
#545 - gpu-operator on RKE2 (RH8) - nvidia-cuda-validator component crashing continuously.
Issue -
State: open - Opened by frjaraur over 1 year ago
- 5 comments
#544 - nvidia-fabricmanager.service can not start due to CUDA Version mismatch
Issue -
State: open - Opened by bo-zeng-ml over 1 year ago
#543 - Update blossom-ci.yml
Pull Request -
State: open - Opened by rorajani over 1 year ago
- 2 comments
Labels: invalid
#542 - GPU-operator not applying driver version changes on EKS
Issue -
State: closed - Opened by sjkoelle over 1 year ago
- 8 comments
#541 - GPU operator 22.9.2 installation is failing
Issue -
State: closed - Opened by likku123 over 1 year ago
- 5 comments
#540 - Unable to deploy GPU Operator on MicroShift 4.13
Issue -
State: closed - Opened by sjug over 1 year ago
- 3 comments
#539 - GPU operator validator fails to create host device symlinks
Issue -
State: open - Opened by adamancini over 1 year ago
- 2 comments
#538 - GPU Operator Install with Terraform Not Working - Chart Not found
Issue -
State: closed - Opened by nirajdesai2909 over 1 year ago
- 3 comments
#537 - Relabelings for ServiceMonitor
Issue -
State: closed - Opened by faurik over 1 year ago
- 1 comment
#536 - Unable to cordon nodes
Issue -
State: open - Opened by guyst16 over 1 year ago
- 1 comment
#535 - dcgm-exporter missing metrics for A100 when mig enabled
Issue -
State: closed - Opened by alloydm over 1 year ago
#534 - TEST CI
Pull Request -
State: open - Opened by rorajani over 1 year ago
- 1 comment
Labels: invalid
#533 - Node-by-node migration of k8s from docker to containerd with running toolkit leading to the unavailability of not-yet-upgraded nodes.
Issue -
State: closed - Opened by punkerpunker over 1 year ago
- 3 comments
#532 - Does this tool support windows nodes?
Issue -
State: open - Opened by skiwheelr over 1 year ago
- 1 comment
#531 - Toolkit containers in crash loop with "unresolvable CDI devices management.nvidia.com/gpu=all: unknown"
Issue -
State: closed - Opened by benlsheets over 1 year ago
- 6 comments
#530 - Not able to deploy Nvidia GPU Operator in Managed Kubernetes Service provided by OVH Cloud
Issue -
State: closed - Opened by altruistcoder over 1 year ago
- 2 comments
#529 - GPU Operator installation failure in AKS
Issue -
State: closed - Opened by sidharthkumarpradhan over 1 year ago
- 9 comments
#528 - nvidia-device-plugin-validator and nvidia-operator-validator in CrashLoopBackOff
Issue -
State: closed - Opened by visla-xugeng over 1 year ago
- 8 comments
#527 - Problem configuring vGPU access using Kubevirt
Issue -
State: open - Opened by nadav213000 over 1 year ago
- 14 comments
#526 - Could not resolve Linux kernel version on GKE 1.25.* + GPU Operator version: 23.3.1
Issue -
State: closed - Opened by xcheng85 over 1 year ago
- 9 comments
#525 - GPU-Operator does not install the specified driver version in AKS GPU Node
Issue -
State: closed - Opened by xcheng85 over 1 year ago
- 3 comments
#524 - Overriding The PrometheusRule Objects Alerts
Issue -
State: open - Opened by guyst16 over 1 year ago
- 1 comment
Labels: enhancement
#523 - nvidia gpu-operator in crashloopbackoff continuously on A100 nodes with 8 gpus.
Issue -
State: open - Opened by hsuvarna over 1 year ago
- 5 comments
#522 - Problem installing gpu-operator on rke2
Issue -
State: closed - Opened by aavbsouza over 1 year ago
- 5 comments
#521 - gpu-operator as OCI artifact
Issue -
State: open - Opened by dioguerra over 1 year ago
- 5 comments
#520 - gpu-operator console widget & gpu dashboard not reporting (correctly) after configuring mig
Issue -
State: open - Opened by vittico over 1 year ago
#519 - Documentation clarification about containerd tweaks
Issue -
State: open - Opened by aavbsouza over 1 year ago
- 5 comments
#518 - Rename mig resources
Issue -
State: closed - Opened by maaft over 1 year ago
- 4 comments
#517 - rename
Issue -
State: closed - Opened by maaft over 1 year ago
#516 - nvidia-settings and nvidia-xconfig not mounted to Pods
Issue -
State: open - Opened by elgalu over 1 year ago
- 2 comments
#515 - Add priorityClassName to nfd's pods
Pull Request -
State: open - Opened by boniek83 over 1 year ago
#514 - Add priorityClassName to nfd's pods
Issue -
State: open - Opened by boniek83 over 1 year ago
- 1 comment
#513 - Openshift: NVIDIA GPU Operator: nvidia-container-toolkit-daemonset: InvalidImageName
Issue -
State: closed - Opened by tormig-softronic over 1 year ago
- 7 comments
#512 - feature discovery worker pod unable to connect to worker node
Issue -
State: closed - Opened by gakshat14 over 1 year ago
- 2 comments
#511 - Daemonset pods fail with: "nvidia-container-cli.real: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown"
Issue -
State: closed - Opened by ianblitz over 1 year ago
- 4 comments
#510 - I cannot install because plugin-validation and cuda-validation fail.
Issue -
State: closed - Opened by koh-hr over 1 year ago
- 8 comments
#509 - GPU operator component compatibility matrix does not work for all combinations it specifies
Issue -
State: open - Opened by lrotim over 1 year ago
- 2 comments
#508 - Interaction between operator-validator and device-plugin causes error state.
Issue -
State: open - Opened by neggert over 1 year ago
- 3 comments
#507 - Feature request: More support for Ada/Hopper Generation gpus
Issue -
State: open - Opened by hy-tomas-terala over 1 year ago
- 2 comments
#506 - Does gpu-operator's MIG work with AWS A10G?
Issue -
State: closed - Opened by randxie over 1 year ago
- 2 comments
#505 - DCGM Exporter breaks after upgrade to 22.9.12
Issue -
State: closed - Opened by dcarrion87 over 1 year ago
- 5 comments
#504 - Failing to install nvidia drivers on a new GPU node on a fresh LTS Ubuntu 22.04
Issue -
State: open - Opened by denissabramovs over 1 year ago
- 26 comments
#503 - Kubernetes containers using NVIDIA_VISIBLE_DEVICES lose device access after systemctl daemon-reload
Issue -
State: open - Opened by dcarrion87 over 1 year ago
- 7 comments
#502 - install nvidia/tao/tao-getting-started:4.0.0 (TAO Toolkit API) and get error message: Back-off restarting failed container
Issue -
State: closed - Opened by ShangWeiKuo over 1 year ago
- 1 comment
#501 - notebook nvidia-smi command show nothing
Issue -
State: closed - Opened by tvtv511 over 1 year ago
#500 - K8 Job does not get marked as completed after the pod succeeds in AKS version 1.25.5
Issue -
State: open - Opened by rajivml over 1 year ago
#499 - Panic in reconciler when daemonsets.annotations contain value with colons
Issue -
State: open - Opened by sole0bserver over 1 year ago
- 1 comment
#498 - Not able to obtain metrics for pods in GPU node using DCGM Exporter. nv-hostengine debug logs give Error: Could not load NSCQ.
Issue -
State: open - Opened by suchisur over 1 year ago
- 4 comments
#497 - Operator does not work with signed drivers and secure boot mode enabled
Issue -
State: open - Opened by bogdan-mitrea-ds over 1 year ago
- 9 comments
#496 - [BUG]: console-plugin-nvidia-gpu
Issue -
State: closed - Opened by grvn over 1 year ago
- 6 comments
#495 - Which driver image version is best suitable for gpu nvidia rtx A4000 and A2000 in redhat8.6
Issue -
State: open - Opened by carlwang87 over 1 year ago
- 1 comment
#494 - I'm using Kubernetes 1.19.9 and I have a GPU machine in the cluster. The kubectl top nodes / kubectl top pod gives only CPU and Memory usage as follows
Issue -
State: open - Opened by vsadanala over 1 year ago
- 2 comments
#493 - dcgmproftester pod from install docs using outdated cuda
Issue -
State: open - Opened by benlsheets over 1 year ago
- 3 comments
#492 - Cannot load all gpus on a worker node when installing gpu-operator with helm on rke2 cluster.
Issue -
State: closed - Opened by zeddit over 1 year ago
- 6 comments
#491 - Openshift GPU Operator v22.9 doesn't set nvidia.com/gpu.deploy.driver appropriately on node (IBM ROKS RHEL 8 4.10)
Issue -
State: closed - Opened by relyt0925 over 1 year ago
- 3 comments
#490 - Deprecated API used
Issue -
State: closed - Opened by tormig-softronic over 1 year ago
- 5 comments
#489 - Bump golang.org/x/net from 0.1.0 to 0.7.0
Pull Request -
State: closed - Opened by dependabot[bot] over 1 year ago
- 2 comments
Labels: dependencies
#488 - NVIDIA GPU Operator installation failed with Helm
Issue -
State: closed - Opened by somethingwentwell almost 2 years ago
- 2 comments
#487 - Some pods are stuck in init on one of our clusters
Issue -
State: open - Opened by Alwinator almost 2 years ago
- 8 comments
#486 - If devicePlugin.enabled is set to disable, nvidia-operator-validator stays in CrashLoopBackOff
Issue -
State: closed - Opened by nneram almost 2 years ago
- 5 comments
#485 - NOTICE: Containers losing access to GPUs with error: "Failed to initialize NVML: Unknown Error"
Issue -
State: open - Opened by cdesiniotis almost 2 years ago
#484 - gpu-operator fails to start due to deletion of nonexistent resources
Issue -
State: closed - Opened by xknight almost 2 years ago
- 8 comments
#483 - How do I install using Kustomize?
Issue -
State: open - Opened by choyuansu almost 2 years ago
- 1 comment
#482 - gpu-operator cannot discover the newly added GPU
Issue -
State: open - Opened by zhouhao3 almost 2 years ago
- 3 comments
#481 - GPU Operator reconciliation loop failed
Issue -
State: closed - Opened by arpitsharma-hexad almost 2 years ago
- 3 comments
#480 - device-plugin-validator fails if all gpu resources are allocated on a node
Issue -
State: open - Opened by dcarrion87 almost 2 years ago
- 11 comments
#479 - A problem that labels are not normally created when using custom-config
Issue -
State: open - Opened by brinst07 almost 2 years ago
- 1 comment
#478 - ClusterPolicy generated by Helm chart is not valid
Issue -
State: closed - Opened by mkjpryor almost 2 years ago
- 1 comment
#477 - gpu-operator injecting runtimeClass after transitioning containerd runtime node
Issue -
State: open - Opened by dcarrion87 almost 2 years ago
- 10 comments
#476 - Forced Driver Update with v22.9.1
Issue -
State: open - Opened by BCJuan almost 2 years ago
- 2 comments
#475 - [Feature Request] Make nvidia-operator-validator add a validation successful label or taint on the node
Issue -
State: open - Opened by chiragjn almost 2 years ago
- 4 comments
Labels: enhancement
#474 - Failed to remove GPU in nvidia-driver container
Issue -
State: open - Opened by zhouhao3 almost 2 years ago
- 5 comments
#473 - NVIDIA Container Toolkit fails to set default runtime on RKE2
Issue -
State: closed - Opened by eabochasjauregui almost 2 years ago
- 13 comments
#472 - PODNAME not populated in DCGM metrics
Issue -
State: open - Opened by harjitdotsingh almost 2 years ago
#471 - HPA using gpu-operator
Issue -
State: open - Opened by harjitdotsingh almost 2 years ago
#470 - OpenShift GPU operator only working on 1/2 physical nodes properly
Issue -
State: closed - Opened by mgiessing almost 2 years ago
- 2 comments
#469 - GPU's are showing as 0 while we have GPU's in cluster/NVIDIA GPU administration dashboard
Issue -
State: closed - Opened by arpitsharma-hexad almost 2 years ago
- 13 comments
#468 - Changing Node workload type on running node
Issue -
State: open - Opened by nadav213000 almost 2 years ago
- 1 comment
#467 - Not able to see DCGM Metrics in prometheus
Issue -
State: closed - Opened by harjitdotsingh almost 2 years ago
- 3 comments
#466 - Gpu operator does not work with cri-o user namespaces
Issue -
State: open - Opened by robertdavidsmith almost 2 years ago
- 1 comment
#465 - Time-slicing with multiple GPUs - asking for ability to block single GPU
Issue -
State: open - Opened by Alexbay218 almost 2 years ago
- 1 comment
Labels: enhancement
#464 - Will gpu-operator support Rocky Linux in the future?
Issue -
State: open - Opened by carlwang87 almost 2 years ago
#463 - Failed to create pod sandbox: rpc error: code = Unknown desc = failed to get sandbox runtime: no runtime for "nvidia" is configured
Issue -
State: open - Opened by captainsk7 almost 2 years ago
#462 - [Feature Request] console-plugin-nvidia-gpu / GPU Operator Dashboard per project
Issue -
State: closed - Opened by Alwinator almost 2 years ago
- 2 comments
#461 - Cluster policy templating broken with default values
Issue -
State: closed - Opened by danmx almost 2 years ago
- 5 comments
#460 - Changing MIG strategy while Kubernetes cluster and gpu-operator running
Issue -
State: closed - Opened by esparig almost 2 years ago
- 2 comments
#458 - BREAKS ON 1.25: Does not work on k8s 1.25 due to node API deprecation
Issue -
State: open - Opened by sfxworks almost 2 years ago
- 16 comments
#457 - v22.9.0 - nvidia-driver-daemonset/nvidia-driver-ctr fails to start
Issue -
State: closed - Opened by jeremy-london almost 2 years ago
- 11 comments
#455 - Possible incompatibility with cpumanager, memorymanager, or topologymanager.
Issue -
State: open - Opened by benlsheets almost 2 years ago
- 3 comments
#454 - About the behavior of GPU-Operator when updating EUS
Issue -
State: open - Opened by kousui-dev almost 2 years ago
- 13 comments
#452 - Chcon command fails in nvidia-driver init - nvidia driver installation aborts
Issue -
State: closed - Opened by snirkatriel almost 2 years ago
- 5 comments
#451 - gpu-operator - deprecated API 1.25 call in audit log
Issue -
State: closed - Opened by jpeimer almost 2 years ago
- 3 comments
#443 - Getting Error: "stderr: Auto-detected mode as 'legacy' nvidia-container-cli.real: initialization error: nvml error: insufficient permissions: unknown" while deploying gpu operator -22.9.0 on SLES 15 SP4
Issue -
State: open - Opened by ATP-55 about 2 years ago
- 9 comments
#441 - Error: failed to create FS watcher: too many open files
Issue -
State: closed - Opened by EajksEajks about 2 years ago
- 6 comments
#439 - DCGM exporter NodePort vs ClusterIP
Issue -
State: closed - Opened by dcarrion87 about 2 years ago
- 1 comment
#432 - Failed to get sandbox runtime: no runtime for nvidia is configured
Issue -
State: open - Opened by denissabramovs about 2 years ago
- 32 comments
#430 - Failed to initialize NVML: Unknown Error
Issue -
State: open - Opened by hoangtnm about 2 years ago
- 27 comments
#429 - gpu-operator-nfd-worker fails to read net interface attribute speed
Issue -
State: closed - Opened by yotama-anv about 2 years ago
- 13 comments
#428 - entitlement free system documentation out of date to work with RHCOS on OCP 4.11 (need to do manual tag)
Issue -
State: open - Opened by relyt0925 about 2 years ago
- 9 comments
#422 - console-plugin-nvidia-gpu / GPU Operator Dashboard not showing
Issue -
State: closed - Opened by Alwinator about 2 years ago
- 8 comments