NVIDIA/k8s-device-plugin issues and pull requests

#502 - How to trigger gpu failure, the gpu count of node's allocatable field will be dynamically decrease

Issue - State: open - Opened by yizhouv5 8 months ago - 4 comments

#443 - MPS with Kubernetes on NVIDIA GPU

Issue - State: open - Opened by selinnilesy 12 months ago - 27 comments
Labels: feature

#426 - GPU gets marked as unhealthy on systemctl daemon-reloads + kubelet restarts (on Kubernetes Upgrades)

Issue - State: open - Opened by sstrk about 1 year ago - 5 comments
Labels: needs-triage

#404 - Questions about GPU time-sharing on Kubernetes

Issue - State: open - Opened by jxl4650152 over 1 year ago

#403 - Add ability to restart container on device failures

Pull Request - State: closed - Opened by lxpbl over 1 year ago

#402 - Clarify FS watcher error with path

Pull Request - State: open - Opened by Dentrax over 1 year ago - 1 comment

#401 - Who is maintaining this repo??

Issue - State: closed - Opened by maaft over 1 year ago - 3 comments

#400 - Error: template: nvidia-device-plugin/templates/gfd.yml:22:19: executing "nvidia-device-plugin/templates/gfd.yml" at <.Subcharts.gfd>: nil pointer evaluating interface {}.gfd

Issue - State: open - Opened by hanzhc over 1 year ago - 7 comments

#399 - On the node with mig enabled, nvidia-device-plugin reports an error when it starts

Issue - State: closed - Opened by yeqiugt over 1 year ago - 2 comments

#398 - feat(plugin): Make resource name configurable

Pull Request - State: closed - Opened by YitzyD over 1 year ago

#397 - failed to construct NVML resource managers

Issue - State: closed - Opened by shortwavedave over 1 year ago - 1 comment

#396 - Bump github.com/opencontainers/runc from 1.1.4 to 1.1.5

Pull Request - State: open - Opened by dependabot[bot] over 1 year ago
Labels: dependencies

#395 - update readme to state that gitlab is the location where development …

Pull Request - State: open - Opened by kannon92 over 1 year ago

#394 - ubuntu 22.04: pods get killed when any pod resources differ between limits and requests

Issue - State: open - Opened by maaft over 1 year ago

#393 - gitlab location is where developers should open PRs on

Pull Request - State: closed - Opened by kannon92 over 1 year ago

#392 - K8 Job does not get marked as completed after the pod succeeds in AKS version 1.25.5

Issue - State: closed - Opened by narendrakumar-nj over 1 year ago - 2 comments
Labels: lifecycle/stale

#391 - Add prestart-hook for device plugin

Pull Request - State: closed - Opened by kannon92 over 1 year ago - 7 comments

#390 - NVIDIA device plugin isn't advertising the GPUs

Issue - State: closed - Opened by glopezdiest over 1 year ago - 12 comments

#389 - Is restarting the plugin the only way to update the node GPU profile after mig-enabled GPUs get repartitioned?

Issue - State: open - Opened by WindowsXp-Beta over 1 year ago - 1 comment

#388 - Plugin usage on a mixed node group EKS infrastructure

Issue - State: open - Opened by kaustubh-reinvent over 1 year ago

#387 - ETA on new release to address 0.13.0 Security Vulnerabilities?

Issue - State: open - Opened by jhawkins1 over 1 year ago

#386 - report index as RegisteredDevices when device-id-strategy set to index

Pull Request - State: closed - Opened by borgerli over 1 year ago - 4 comments

#385 - Bump golang.org/x/text from 0.3.3 to 0.3.8

Pull Request - State: closed - Opened by dependabot[bot] over 1 year ago - 2 comments
Labels: dependencies

#384 - Add documentation for CRI-O

Issue - State: open - Opened by AndreasMurk over 1 year ago

#383 - Feature: Deleting memory of previous GPU runs before running the existing job

Issue - State: open - Opened by kannon92 over 1 year ago

#382 - 0/1 nodes are available: 1 Insufficient nvidia.com/gpu.

Issue - State: open - Opened by liufangpeng over 1 year ago - 2 comments

#381 - nvidia-cuda-mps-control: command not found

Issue - State: closed - Opened by LoRKaa over 1 year ago

#380 - fix expired link

Pull Request - State: closed - Opened by zhouhao3 over 1 year ago - 2 comments

#379 - About GPU cleanup features

Issue - State: open - Opened by zhouhao3 over 1 year ago

#378 - can not distinguish t4 and a100 ?

Issue - State: open - Opened by ggjjlldd over 1 year ago - 1 comment

#377 - Plug in does not detect Tegra device Jetson Nano

Issue - State: open - Opened by VladoPortos over 1 year ago - 9 comments

#376 - The plugin container registry is inaccessible via IPv6

Issue - State: open - Opened by osipov over 1 year ago

#375 - Update README.md

Pull Request - State: closed - Opened by HeGaoYuan over 1 year ago - 1 comment

#374 - Update README.md

Pull Request - State: closed - Opened by HeGaoYuan over 1 year ago

#373 - container failed to start after the VM node migrated to another host

Issue - State: open - Opened by borgerli over 1 year ago - 3 comments

#372 - nvidia-device-plugin getting CrashLoopBackOff while installing using helm

Issue - State: open - Opened by captainsk7 over 1 year ago - 2 comments

#370 - Update README.md

Pull Request - State: open - Opened by hholst80 almost 2 years ago

#369 - apt-key is deprecated

Issue - State: open - Opened by hholst80 almost 2 years ago

#368 - k8s-device-plugin restarts on k3s deployment (on top of containerd)

Issue - State: open - Opened by hholst80 almost 2 years ago - 15 comments
Labels: lifecycle/stale

#365 - #364 Check config symlink instead of file existence in config-manager

Pull Request - State: open - Opened by Telemaco019 almost 2 years ago

#364 - Time-slicing config update: "Error: error creating symlink: file exists"

Issue - State: open - Opened by Telemaco019 almost 2 years ago

#354 - timeslice config

Issue - State: open - Opened by segamishuichi almost 2 years ago

#353 - Enable resource renaming in time-slicing shared GPUs

Issue - State: closed - Opened by Telemaco019 almost 2 years ago - 4 comments

#352 - k3s nvidia-device-plugin-daemonset report error - Fixed

Issue - State: open - Opened by liujie1008cn almost 2 years ago - 2 comments

#351 - compatible gpu type with k8s gpu time slicing

Issue - State: open - Opened by alirezadaghigh99 almost 2 years ago

#350 - compatible gpu type with k8s gpu time slicing

Issue - State: open - Opened by alirezadaghigh99 almost 2 years ago

#349 - How do I know that the upgrade of the NVIDIA device plugin went well?

Issue - State: closed - Opened by Borchies almost 2 years ago - 4 comments

#348 - Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.

Issue - State: open - Opened by somethingwentwell almost 2 years ago - 4 comments

#347 - how to support nvlink between several k8s pods?

Issue - State: open - Opened by pokerc almost 2 years ago - 3 comments

#346 - "nvidia-smi": executable file not found in $PATH: unknown

Issue - State: open - Opened by devriesewouter89 almost 2 years ago - 3 comments

#345 - How does GPU Pod dynamically schedule clusters

Issue - State: open - Opened by Kry1702 almost 2 years ago - 1 comment

#344 - GPU is not available with a GPU EC2 instance in EKS cluster (1.23)

Issue - State: open - Opened by garyyang6 almost 2 years ago - 1 comment

#343 - Question about MIG config persistent

Issue - State: open - Opened by slow-zhang almost 2 years ago - 11 comments

#342 - Share GPU in same pod using volume-mounts strategy

Issue - State: open - Opened by dcarrion87 almost 2 years ago - 5 comments

#341 - nvdi-smi hogs CPU

Issue - State: open - Opened by duk0011 almost 2 years ago - 2 comments

#340 - MIG for A6000

Issue - State: closed - Opened by arijitthegame almost 2 years ago - 2 comments

#339 - Question about safely upgrading device plugin

Issue - State: open - Opened by henrysecond1 almost 2 years ago

#338 - Device driver panics randomly with unknown error

Issue - State: open - Opened by olemarkus almost 2 years ago - 2 comments

#337 - Update README.md

Pull Request - State: closed - Opened by tico88612 almost 2 years ago - 1 comment

#336 - How do I run my application on the GPU assigned by k8s

Issue - State: closed - Opened by c-android almost 2 years ago - 2 comments

#335 - Pods with GPU terminating very slowly

Issue - State: open - Opened by Zhurik about 2 years ago

#334 - Previous issue - #297, Is plugin ready for jetson nano devices

Issue - State: open - Opened by ravinayag about 2 years ago - 8 comments

#333 - 4pdvGPU: ERROR get_vdevice_index: Assertion `0' failed and Aborted (core dumped)

Issue - State: open - Opened by Chenxs1122 about 2 years ago

#332 - Getting GPU device minor number: Not Supported

Issue - State: open - Opened by zengzhengrong about 2 years ago - 13 comments

#331 - Function not found for nvml methods

Issue - State: open - Opened by kbkartik about 2 years ago - 3 comments

#330 - Startup race condition with dcgm-exporter

Issue - State: closed - Opened by skraga about 2 years ago - 3 comments

#329 - How can I use nvidia gpu in kubernetes pod?

Issue - State: open - Opened by misupopo about 2 years ago - 1 comment

#328 - Pods are not scheduled in all GPUs of a physical server.

Issue - State: closed - Opened by shan100github about 2 years ago - 25 comments

#327 - Make repo go install friendly

Issue - State: open - Opened by anthonyrisinger about 2 years ago

#326 - undefined symbol nvmlGpuInstanceGetComputeInstanceProfileInfoV in v12+

Issue - State: closed - Opened by anthonyrisinger about 2 years ago - 3 comments

#325 - helm 0.12.2 - nfd-worker logs permission denied on selinux and gfd

Issue - State: open - Opened by RichardSufliarsky about 2 years ago

#324 - Failure: nvidia-container-cli.real: container error: cgroup subsystem devices not found

Issue - State: open - Opened by mpu-creare about 2 years ago - 1 comment

#323 - CUDA memory error

Issue - State: closed - Opened by Borchies about 2 years ago - 8 comments

#322 - Failed to initialize NVML: Unknown Error for when changed runtime from docker to containerd

Issue - State: open - Opened by zvier about 2 years ago - 13 comments

#321 - Unable to get nvidia.com/gpu: "1" greater than 1 for Quadro P2000

Issue - State: closed - Opened by brianbrady about 2 years ago - 13 comments

#320 - Incorrect indentation for securityContext > capability in daemonset

Issue - State: closed - Opened by waynelwh about 2 years ago - 2 comments

#319 - "CUDA unknown error" when using pytorch, and recovered by restarting the nvidia plugin pod

Issue - State: open - Opened by chxk about 2 years ago

#318 - How to check the NVIDIA k8-device-plugin version?

Issue - State: closed - Opened by esparig about 2 years ago - 7 comments

#317 - NVIDIA A10 GPUs - are these drivers in the NVIDIA / k8s-device-plugin

Issue - State: open - Opened by jeffreydahan about 2 years ago - 9 comments

#315 - nvidia-device-plugin daemonset has 0 desired and no pod is launched

Issue - State: open - Opened by blackjack2015 over 2 years ago - 4 comments

#314 - device plugin default_runtime_name requirement and documentation

Issue - State: closed - Opened by rptaylor over 2 years ago - 2 comments

#311 - Fix containerd config in README.md

Pull Request - State: closed - Opened by gmrukwa over 2 years ago - 2 comments

#302 - How to use the device plugin with new k8s 1.24 version?

Issue - State: open - Opened by Zigko over 2 years ago - 21 comments

#300 - [add]: support for hostNetwork parameter in daemonset deployment

Pull Request - State: closed - Opened by vasudev-singhc-by over 2 years ago - 2 comments

#298 - fix: Fixed a build error on io.ReadAll

Pull Request - State: closed - Opened by aelgasser over 2 years ago - 6 comments

#297 - Cannot run nvidia-device-plugins on arm64 with cuda 10.2

Issue - State: closed - Opened by Fvoiretryzig over 2 years ago - 3 comments

#292 - how to schedule jobs to different type of gpus?

Issue - State: closed - Opened by silverlining21 over 2 years ago - 6 comments

#289 - pod fail to find gpu some time after created

Issue - State: closed - Opened by JuHyung-Son over 2 years ago - 14 comments

#284 - update the Dockerfile: NVIDIA_DRIVER_CAPABILITIES=utility,compute

Pull Request - State: closed - Opened by alex337 almost 3 years ago - 2 comments

#274 - Cannot find GPU information in Capacity when kubectl describe a K8s GPU node.

Issue - State: open - Opened by Zeyu-ZEYU almost 3 years ago - 3 comments

#267 - k8s-device-plugin seems to think gpu healthy when it is not usable due to Uncorrectable ECC Error

Issue - State: open - Opened by tingweiwu about 3 years ago - 8 comments
Labels: lifecycle/stale

#266 - spec.template.metadata.annotations[scheduler.alpha.kubernetes.io/critical-pod]: non-functional in v1.16+; use the "priorityClassName" field instead daemonset.apps/nvidia-device-plugin-daemonset created

Issue - State: closed - Opened by wajeehulhassanvii about 3 years ago - 4 comments

#258 - Multi card

Pull Request - State: closed - Opened by archlitchi about 3 years ago - 2 comments

#253 - Installation failed k8s-device-plugin(v0.9.0)

Issue - State: open - Opened by Kwonho over 3 years ago - 12 comments

#240 - Device-plugin does not bother to properly do a cleanup of the info about GPUs after MIG enable/disable or after reconfiguration

Issue - State: open - Opened by dchirikov over 3 years ago - 15 comments

#203 - With volume-mounts strategy, pod shouldn't fail when no permission to read NVIDIA_VISIBLE_DEVICES

Issue - State: open - Opened by zhsj almost 4 years ago - 3 comments

#199 - Setting "failOnInitError" unexpectedly "works" with a small 2 node cluster.

Issue - State: closed - Opened by supertetelman almost 4 years ago - 2 comments

#169 - Support sharing GPUs

Issue - State: open - Opened by ktarplee over 4 years ago - 37 comments

#151 - Update nvidia-device-plugin.yml

Pull Request - State: closed - Opened by frankenstien-831 almost 5 years ago - 3 comments

#143 - How to use specific NVIDIA GPU type(model) in pod yaml

Issue - State: closed - Opened by estherxyz almost 5 years ago - 8 comments
