Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / leptonai/gpud issues and pull requests
#261 - feat(nvidia/query): bump up nvidia-smi cmd timeout, better debugging info
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 1 comment
#260 - fix(query/log/tail): fix time parser for initial lines, use correct time for fabric manager /events
Pull Request -
State: closed - Opened by gyuho 2 months ago
#259 - nit(containerd/pod): use id package for state name
Pull Request -
State: closed - Opened by gyuho 2 months ago
#258 - fix(containerd): use consistent state name
Pull Request -
State: closed - Opened by cardyok 2 months ago
#257 - feat(query): support getErrHandler func, log/ignore disk component error
Pull Request -
State: closed - Opened by gyuho 2 months ago
#256 - fix(disk): add retries for lsblk
Pull Request -
State: closed - Opened by gyuho 2 months ago
#255 - fix(nvidia): report installed when nvml return unknown error on device
Pull Request -
State: closed - Opened by cardyok 2 months ago
#254 - nit(disk): rename state key to disk_ext_partition
Pull Request -
State: closed - Opened by gyuho 2 months ago
#253 - fix(controller): only read stdout for run command
Pull Request -
State: closed - Opened by gyuho 2 months ago
#252 - revert(package_controller): revert read all changes
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 1 comment
#251 - feat(lsblk): add more test case, clarify parse error
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 1 comment
#250 - fix(os): run machine/boot id get calls only for linux, gpud run exit 1 on non-linux platform
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 4 comments
#249 - feat(disk): use "findmnt --target" to find filesystem usage
Pull Request -
State: closed - Opened by gyuho 2 months ago
#248 - feat(process): read stderr in case of command failures, improve disk get error handling
Pull Request -
State: closed - Opened by gyuho 2 months ago
#247 - feat(poll): set default "get" operation timeout, higher timeout for latency checks
Pull Request -
State: closed - Opened by gyuho 2 months ago
#246 - feat(components): define event type enum, fix os component context setup, adjust hw slowdown event type, simplify PCI reason message
Pull Request -
State: closed - Opened by gyuho 2 months ago
#245 - feat(pci): move /states to /events for acs srv-valid checks
Pull Request -
State: closed - Opened by gyuho 2 months ago
#244 - chore(deps): bump golang.org/x/crypto from 0.25.0 to 0.31.0
Pull Request -
State: closed - Opened by dependabot[bot] 2 months ago
Labels: dependency-issue
#243 - fix(process/virt): handle systemd-detect-virt exit code 1, simplify process calls
Pull Request -
State: closed - Opened by gyuho 2 months ago
#242 - feat(components/dmesg): catch EDAC correctable errorrs in dmesg
Pull Request -
State: closed - Opened by gyuho 2 months ago
#241 - feat(go.mod): upgrade go sqlite3
Pull Request -
State: closed - Opened by gyuho 2 months ago
#240 - feat(components/os): use os machine id for uuid as fallback, support reboot events using boot id
Pull Request -
State: closed - Opened by gyuho 2 months ago
#239 - nit(containerd/pod): rename state keys
Pull Request -
State: closed - Opened by gyuho 2 months ago
#238 - nit(dmesg): add more regex OOM matcher test cases with timestamps
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 1 comment
#237 - nit(diagnose): print matched dmesg line in scan command
Pull Request -
State: closed - Opened by gyuho 2 months ago
#236 - feat(components/pci): check PCI access control services for baremetal systems
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 3 comments
#235 - feat(components/os): detect virt environment, system manufacturer
Pull Request -
State: closed - Opened by gyuho 2 months ago
- 1 comment
#234 - feat(components/dmesg): simplify /events fields
Pull Request -
State: closed - Opened by gyuho 2 months ago
#233 - feat(components): add missing event type in /events
Pull Request -
State: closed - Opened by gyuho 3 months ago
#232 - feat(components/disk): track total mounted ext partitions, block "disk" devices, "scan --diskcheck"
Pull Request -
State: closed - Opened by gyuho 3 months ago
- 3 comments
#231 - nit(gpud): fix flag description --expected-port-states-nvidia-infiniband
Pull Request -
State: closed - Opened by gyuho 3 months ago
- 1 comment
#230 - feat(nvidia/infiniband): better ib ports/rate checking based on port physical/state
Pull Request -
State: closed - Opened by gyuho 3 months ago
#229 - feat(nvidia/hw-slowdown): include GPU UUID in /events, persist smi for /events, dedup hw slowdown events by data source and nearest minutes, do not return hw slowdown clock events in /states, fix nvidia query get function context timeout
Pull Request -
State: closed - Opened by gyuho 3 months ago
#228 - fix(components): separate timeout for poller get function calls
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#227 - feat(nvidia): set components/events timestamp in UTC explicitly
Pull Request -
State: closed - Opened by gyuho 3 months ago
#226 - feat(server): send components in gossip
Pull Request -
State: closed - Opened by cardyok 3 months ago
#225 - fix(nvidia/hw-slowdown): rename from "clock" to only expose hardware slowdown issues, convert to events
Pull Request -
State: closed - Opened by gyuho 3 months ago
- 4 comments
Labels: bug
#224 - feat(fd): monitor VFS file-max limit with allocated file handles on Linux
Pull Request -
State: closed - Opened by gyuho 3 months ago
#223 - feat(session): make context local to each session for flexibility
Pull Request -
State: closed - Opened by cardyok 3 months ago
#222 - fix(nvidia): derive product name using NVML results first
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#221 - fix(nvidia/query): only evaluate memory error management capabilities when product name found, add missing GPU ID in nvidia-smi parsing remapped rows
Pull Request -
State: closed - Opened by gyuho 3 months ago
- 1 comment
#220 - fix(nvidia/clock): use nvml clock events, fall back to nvidia-smi parsing
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#219 - fix(nvidia): remove error count "8" threshold for row remapping failures to qualify for RMA
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#218 - fix(nvml/clock_events): enable clock events component when a single GPU device supports it
Pull Request -
State: closed - Opened by gyuho 3 months ago
#217 - feat(nvidia/xid-sxid): increase xid/sxid table retention period to 3-hour
Pull Request -
State: closed - Opened by gyuho 3 months ago
#216 - fix(nvidia/remapped-rows): surface product name as reason regardless of its healthy-ness
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#215 - fix(nvidia/nvml): correct boolean checks on whether clock events supported
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#214 - fix(session): close reader channel on fast return
Pull Request -
State: closed - Opened by cardyok 3 months ago
#213 - feat(components/memory): track current jit alloc buffer size, vm alloc status
Pull Request -
State: closed - Opened by gyuho 3 months ago
- 1 comment
#212 - fix(cmd/gpud): handle "run --expected-port-states-nvidia-infiniband" flag
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#211 - fix(client): adding get states decode call, status command to check local gpud "/states", add sub-command aliases
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#210 - feat(session): optimize default transport config
Pull Request -
State: closed - Opened by cardyok 3 months ago
#209 - nit(k8s/pod): quote string node name in case it's empty
Pull Request -
State: closed - Opened by gyuho 3 months ago
#208 - feat(nvidia/temperature): port DCGM_FR_TEMP_VIOLATION logic for high temperature alerts
Pull Request -
State: closed - Opened by gyuho 3 months ago
#207 - fix(pkg/systemd): handle "n/a" in uptime with trailing characters
Pull Request -
State: closed - Opened by gyuho 3 months ago
#206 - fix(cmd/gpud): add --log-level flag to "scan", fix flag parsing for "run" commands, remove "scan --debug" flag
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#205 - fix(session): disable http keep alive
Pull Request -
State: closed - Opened by cardyok 3 months ago
- 2 comments
#204 - fix(nvidia/infiniband): match mellanox to count PCI devices
Pull Request -
State: closed - Opened by gyuho 3 months ago
#203 - can't get gpu info with wsl platform
Issue -
State: closed - Opened by zhuima 3 months ago
- 11 comments
Labels: feature, awaiting feedback
#202 - feat(nvidia): re-order nvidia-smi collect after NVML calls
Pull Request -
State: closed - Opened by gyuho 3 months ago
#201 - feat(nvidia/query): helpful debugging lines for nvml device list call failures
Pull Request -
State: closed - Opened by gyuho 3 months ago
#200 - fix(nvidia/infiniband): use sysclass ib directory count as default port state checks, use Infiniband PCI bus count to decide whether Infiniband is enabled or not
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#199 - fix(nvidia/infiniband): use "<" to evaluate ip port rates
Pull Request -
State: closed - Opened by gyuho 3 months ago
Labels: bug
#198 - fix(nvidia/infiniband): adjust default port rate based on GPU product
Pull Request -
State: closed - Opened by gyuho 3 months ago
#197 - fix(join): remove space in provider
Pull Request -
State: closed - Opened by cardyok 3 months ago
#196 - feat(nvidia/infiniband): make port states configurable
Pull Request -
State: closed - Opened by gyuho 3 months ago
#195 - fix(config/default): the flag "kubelet-ignore-connection-errors" is n…
Pull Request -
State: closed - Opened by popsiclexu 3 months ago