Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / leptonai/gpud issues and pull requests
#204 - fix(nvidia/infiniband): match mellanox to count PCI devices
Pull Request -
State: open - Opened by gyuho 4 days ago
#203 - can't get gpu info with wsl platform
Issue -
State: open - Opened by zhuima 5 days ago
- 1 comment
Labels: awaiting feedback
#202 - feat(nvidia): reorder nvidia-smi collect after NVML calls
Pull Request -
State: open - Opened by gyuho 5 days ago
#201 - feat(nvidia/query): helpful debugging lines for nvml device list call failures
Pull Request -
State: closed - Opened by gyuho 5 days ago
#200 - fix(nvidia/infiniband): use sysclass ib directory count as default port state checks, use Infiniband PCI bus count to decide whether Infiniband is enabled or not
Pull Request -
State: closed - Opened by gyuho 5 days ago
Labels: bug
#199 - fix(nvidia/infiniband): use "<" to evaluate ip port rates
Pull Request -
State: closed - Opened by gyuho 5 days ago
Labels: bug
#198 - fix(nvidia/infiniband): adjust default port rate based on GPU product
Pull Request -
State: closed - Opened by gyuho 5 days ago
#197 - fix(join): remove space in provider
Pull Request -
State: closed - Opened by cardyok 5 days ago
#196 - feat(nvidia/infiniband): make port states configurable
Pull Request -
State: closed - Opened by gyuho 5 days ago
#195 - fix(config/default): the flag "kubelet-ignore-connection-errors" is n…
Pull Request -
State: closed - Opened by popsiclexu 5 days ago
#194 - feat(session): add idle session timeout
Pull Request -
State: closed - Opened by cardyok 6 days ago
#193 - feat(components/os): use 20% of system descriptor limit for zombie process alerts
Pull Request -
State: closed - Opened by gyuho 6 days ago
#192 - fix(log/tail): correctly collect xid/sxid events from log scanner
Pull Request -
State: closed - Opened by gyuho 7 days ago
Labels: bug
#191 - feat(component/kernel-module): initial commit (track /etc/modules)
Pull Request -
State: closed - Opened by gyuho 7 days ago
#190 - nit(nvidia/xid): add more Xid 119 test case, simpler detection logging
Pull Request -
State: closed - Opened by gyuho 7 days ago
#189 - feat(nvidia): parse infiniband ibstat for error checking based on GPU card counts
Pull Request -
State: closed - Opened by gyuho 8 days ago
#188 - feat(nvidia, dmesg): use dmesg iso for millisecond level, merge peermem events by minute level
Pull Request -
State: closed - Opened by gyuho 11 days ago
#187 - fix(nvidia/xid-sxid-state): persist xid/sxid in tail scan, better logging
Pull Request -
State: closed - Opened by gyuho 11 days ago
#186 - feat(internal/server): periodic status check logs in debug level
Pull Request -
State: closed - Opened by gyuho 11 days ago
#185 - fix(internal/server): handle poller events no data error (don't error level log)
Pull Request -
State: closed - Opened by gyuho 11 days ago
#184 - fix(accelerator/nvidia): add missing poller initialization
Pull Request -
State: closed - Opened by gyuho 11 days ago
Labels: critical-bug
#183 - feat(query/log/tail): log stream with deduper
Pull Request -
State: closed - Opened by gyuho 12 days ago
#182 - fix(components/dmesg): do not read raw dmesg file with unix time
Pull Request -
State: closed - Opened by gyuho 12 days ago
Labels: bug
#181 - fix(nvidia/query): quote unusual process name for nvidia-smi parsing
Pull Request -
State: closed - Opened by gyuho 12 days ago
Labels: bug
#157 - feat(nvidia/error-xid-sxid): new component based on persistent xid, sxid event history
Pull Request -
State: closed - Opened by gyuho 21 days ago
#100 - feat(nvidia/xid): add check user app and GPU action type, apply "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning by DeepSeek AI"
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#99 - feat(nvidia/ibstat): check "Physical state" as fallback
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#98 - feat(session): support reboot method
Pull Request -
State: closed - Opened by cardyok about 2 months ago
#97 - feat(build, release): support Amazon Linux 2 and 2023 (experimental)
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#96 - feat(pkg/reboot): initial commit
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#95 - feat(components): add accelerator detect func, "gpud accelerator" subcommand
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#94 - feat(server): allow custom uid with cli
Pull Request -
State: closed - Opened by cardyok about 2 months ago
#93 - fix(components/fd): rename "fd_max_file_exists" to "fd_limit_supported", fix get limit on darwin
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#92 - feat(gpud): add "file" component that returns healthy when all specified files exist
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#91 - doc(sxid): add more example events for gpu-operator
Pull Request -
State: closed - Opened by gyuho about 2 months ago
- 1 comment
#90 - Installation on Amazon Linux2 version `GLIBC_2.28' not found
Issue -
State: closed - Opened by chatter92 about 2 months ago
- 7 comments
Labels: question, dependency-issue, awaiting feedback
#89 - feat(nvidia/xid,sxid,remapped rows): add required actions field to /states, /events
Pull Request -
State: closed - Opened by gyuho about 2 months ago
- 1 comment
#88 - feat(nvidia/query): shorter timeouts for "nvidia-smi" calls
Pull Request -
State: closed - Opened by gyuho about 2 months ago
- 1 comment
#87 - feat(nvidia/ecc): rename state name key to "ecc" (from ecc_errors)
Pull Request -
State: closed - Opened by gyuho about 2 months ago
- 2 comments
#86 - feat(nvidia): track "ECC mode" (enabled/disabled) using nvidia-smi and NVML
Pull Request -
State: closed - Opened by gyuho about 2 months ago
- 3 comments
#85 - doc(nvidia/sxid): README to explain xid 79, sxid 20034 as an example
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#84 - feat(nvidia): add non-fatal sxid "20012" code, rename Detail.ID to SXID
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#83 - fix(nvidia): return empty output object if smi/nvml is nil
Pull Request -
State: closed - Opened by gyuho about 2 months ago
#82 - Update mothership endpoint
Pull Request -
State: closed - Opened by cardyok 2 months ago
#81 - feat(nvidia/xid,sxid): rename Detail.ID to XID, add required actions for XID/SXID events
Pull Request -
State: closed - Opened by gyuho 2 months ago
#80 - feat(nvidia): track row remapping, RMA/GPU reset status
Pull Request -
State: closed - Opened by gyuho 2 months ago
#79 - nits(nvidia/query/nvml): remove unused GPUID fields
Pull Request -
State: closed - Opened by gyuho 2 months ago
#78 - feat(internal/server): dynamically refresh containerd, docker, kubelet components
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 1 comment
#77 - fix(nvidia/peermem): do not decide health based on ibcore peermem module
Pull Request -
State: closed - Opened by gyuho 2 months ago
#76 - fix(power): fix power segfault
Pull Request -
State: closed - Opened by cardyok 2 months ago
#75 - Question Regarding Remediation
Issue -
State: closed - Opened by ivelichkovich 2 months ago
- 1 comment
Labels: question
#74 - feat(nvidia/peermem): track dmesg events for invalid context errors
Pull Request -
State: closed - Opened by gyuho 2 months ago
#73 - feat(pkg/process): change "New" function signature with op options, add more examples
Pull Request -
State: closed - Opened by gyuho 2 months ago
#72 - fix(pkg/process): panic on wait before process initialization
Pull Request -
State: closed - Opened by gyuho 2 months ago
#71 - feat(nvidia/fabric-manager): alert on nvlink multicast failures
Pull Request -
State: closed - Opened by gyuho 2 months ago
#70 - feat(dmesg): add oom-kill:constraint regex for cri-containerd events
Pull Request -
State: closed - Opened by gyuho 2 months ago
#69 - feat(nvidia/query): fabric manager debugging info from journalctl
Pull Request -
State: closed - Opened by gyuho 2 months ago