Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / CentaurusInfra/alnair issues and pull requests
#150 - Add a new ddp training script
Pull Request -
State: closed - Opened by YHDING23 almost 2 years ago
#147 - GDS traffic monitor
Issue -
State: open - Opened by pint1022 about 2 years ago
#146 - Major updates to the README files - restructured and separated them, …
Pull Request -
State: closed - Opened by np-ftrwei about 2 years ago
#145 - fix notebook errors
Pull Request -
State: closed - Opened by YHDING23 about 2 years ago
#144 - rewrite pod metadata and utils data collection and storing
Pull Request -
State: closed - Opened by Fizzbb about 2 years ago
#143 - [alnair device plugin] feature request -- support GPU selection
Issue -
State: open - Opened by Fizzbb about 2 years ago
#142 - Create nerf_ddp.py
Pull Request -
State: closed - Opened by nwangfw about 2 years ago
#141 - Add Neural Avatar as use-case
Pull Request -
State: closed - Opened by YHDING23 about 2 years ago
#140 - remove log.Fatalf from exiting programs
Pull Request -
State: closed - Opened by Fizzbb about 2 years ago
#139 - Alluxio data orchestration
Pull Request -
State: closed - Opened by np-ftrwei about 2 years ago
#138 - Update mnist-distributed.py
Pull Request -
State: closed - Opened by nwangfw about 2 years ago
#137 - Intercept hook
Pull Request -
State: closed - Opened by pint1022 about 2 years ago
#136 - GPUDirect to local SSD
Issue -
State: open - Opened by Fizzbb over 2 years ago
#135 - removed a space; changed memory size type long long
Pull Request -
State: closed - Opened by pint1022 over 2 years ago
#134 - Exporter dev
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#133 - Cuda met
Pull Request -
State: closed - Opened by pint1022 over 2 years ago
#132 - intercept-lib test instruction doesn't work.
Issue -
State: open - Opened by awang088 over 2 years ago
- 1 comment
#131 - Add prometheus export to report process-level GPU utilization and memory used size
Issue -
State: open - Opened by Fizzbb over 2 years ago
#130 - vgpu-server get cgroup pid from docker top instead of copy file, and …
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#129 - fix bug convert timestamp to float unexpected
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#128 - scheduling needs
Issue -
State: open - Opened by Fizzbb over 2 years ago
#127 - A bad case for dlsym real func acquirement.
Issue -
State: closed - Opened by CalvinXKY over 2 years ago
- 3 comments
#126 - Add IsSharingGPU function
Pull Request -
State: closed - Opened by YHDING23 over 2 years ago
#125 - vGPU scheduler assume all the nodes have GPU information annotation. Cannot handle cpu node or the period before annotation got patched
Issue -
State: open - Opened by Fizzbb over 2 years ago
- 1 comment
#124 - remove potential .so directory in /opt/alnair to avoid Init:crashloop…
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#123 - Containerize vGPU server leads cgroup.procs content invisible (leads to process util inquiry always 0, compute control failed)
Issue -
State: closed - Opened by Fizzbb over 2 years ago
- 4 comments
#122 - device-plugin installation error, Init:crashloopback
Issue -
State: open - Opened by Fizzbb over 2 years ago
- 1 comment
#121 - Add binpack and spread policy
Pull Request -
State: closed - Opened by YHDING23 over 2 years ago
#120 - change alnair socket path, so it does not need to mount /run causing …
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#119 - vgpu-server container failed to start, "run/nvidia-persistenced/socket" no such device or address
Issue -
State: open - Opened by Fizzbb over 2 years ago
#118 - add max memory bandwidth utils to pod metrics
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#116 - comment out remove annotations
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#115 - profiler add mem-copy-utils from DCGM to reflect application's io requests
Issue -
State: closed - Opened by Fizzbb over 2 years ago
- 1 comment
#114 - intercept lib launched through LD_PRELOAD cannot intercept cuda driver API calls with pytorch version >=1.10
Issue -
State: open - Opened by Fizzbb over 2 years ago
- 1 comment
#113 - profiler remove all pod annotation under ai.centaurus.io domain after gpu process is done, which affects scheduler and device plugin
Issue -
State: closed - Opened by Fizzbb over 2 years ago
- 1 comment
#112 - use nsight system inside containers
Issue -
State: open - Opened by Fizzbb over 2 years ago
- 1 comment
#111 - update single file deployment, mount /run, require no nvidia-docker2 …
Pull Request -
State: closed - Opened by Fizzbb over 2 years ago
#110 - Add pre-start hook to all containers in container runtime to support GPU access
Issue -
State: open - Opened by Fizzbb over 2 years ago
#109 - same node pods communication through unix socket
Issue -
State: open - Opened by Fizzbb over 2 years ago
- 2 comments
#108 - create an exporter to export burst, overuse and window-size metrics to prometheus.
Issue -
State: open - Opened by pint1022 over 2 years ago
#107 - setup multiple nodes cluster for kubeshare performance testing
Issue -
State: open - Opened by pint1022 over 2 years ago
- 1 comment
#106 - setup tf-serving testing environment for kubeshare throughput testing
Issue -
State: open - Opened by pint1022 over 2 years ago
#105 - horovod mnist.py has higher utilization number. what does it do?
Issue -
State: open - Opened by pint1022 over 2 years ago
#104 - Add the vGPUScheduler to support Alnair Virtual GPUs
Pull Request -
State: closed - Opened by YHDING23 almost 3 years ago
#103 - Revert "Add the vGPUScheduler to support Alnair Virtual GPUs"
Pull Request -
State: closed - Opened by YHDING23 almost 3 years ago
#102 - Add the vGPUScheduler to support Alnair Virtual GPUs
Pull Request -
State: closed - Opened by YHDING23 almost 3 years ago
#101 - Revert "Add the vGPUScheduler to support alnair virtual gpu"
Pull Request -
State: closed - Opened by YHDING23 almost 3 years ago
#100 - Add the vGPUScheduler to support alnair virtual gpu
Pull Request -
State: closed - Opened by YHDING23 almost 3 years ago
#99 - modify getPreferredDeviceIDs function to make sure vGPU IDs are all f…
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
- 1 comment
#98 - GPU sharing corner case: vGPUs spread to two or more physical GPUs
Issue -
State: open - Opened by Fizzbb almost 3 years ago
#97 - minor changes in annotation to handle local test scenario, include nod…
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
#96 - add pod and node annotation for gpu usage info
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
- 1 comment
#95 - Alnair device plugin find the pod associated with the allocate request
Issue -
State: closed - Opened by Fizzbb almost 3 years ago
#94 - Profiler dev
Pull Request -
State: closed - Opened by zliu722 almost 3 years ago
#93 - Design and Implement a good GPU utilization metrics
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 1 comment
#92 - revise alnair devicepluginserver to connect the running pod/container info with the device
Issue -
State: open - Opened by YHDING23 almost 3 years ago
- 1 comment
#91 - Add GPU metrics to Pod metrics for Job metadata
Issue -
State: open - Opened by Fizzbb almost 3 years ago
#90 - add IndexField for running pod selection and fix the allocatable gpu bug
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
#89 - Patch Pod Spec Annotations
Issue -
State: open - Opened by YHDING23 almost 3 years ago
#88 - Kubeshare prototyping and compute sharing deep dive
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 4 comments
#87 - add a transformer translation example, the valid dataset valid.de.raw…
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
#86 - how to run the throughput testing on Kubeshare?
Issue -
State: open - Opened by pint1022 almost 3 years ago
- 2 comments
#85 - how to replace kubeshare executables with the local build?
Issue -
State: closed - Opened by pint1022 almost 3 years ago
- 2 comments
#84 - what does code-gen.sh do?
Issue -
State: closed - Opened by pint1022 almost 3 years ago
- 1 comment
#83 - fairseq multihead_attention, torch.cat cause RuntimeError: CUDA out of memory
Issue -
State: open - Opened by Fizzbb almost 3 years ago
#82 - delete print database entries; change print to logging for exception handling
Pull Request -
State: closed - Opened by zliu722 almost 3 years ago
#80 - Kubeshare test cases:
Issue -
State: open - Opened by pint1022 almost 3 years ago
- 8 comments
#79 - Kubeshare is not targeted on a node with multiple gpus?
Issue -
State: closed - Opened by pint1022 almost 3 years ago
- 1 comment
#78 - Profiler dev
Pull Request -
State: closed - Opened by zliu722 almost 3 years ago
#77 - Kubeshare can NOT generate running Pods with Kubectl v1.23.1
Issue -
State: closed - Opened by pint1022 almost 3 years ago
- 1 comment
#76 - Serverless for AI
Issue -
State: open - Opened by Fizzbb almost 3 years ago
Labels: investigation
#75 - Multi-tenant platform security
Issue -
State: open - Opened by Fizzbb almost 3 years ago
Labels: investigation
#74 - Investigate distributed storage/cache for fast training dataset loading
Issue -
State: open - Opened by Fizzbb almost 3 years ago
Labels: investigation
#73 - add one deep learning training job
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
#72 - Ziyu's progress update 1/19/2022
Issue -
State: closed - Opened by zliu722 almost 3 years ago
#71 - pytorch dataloader raise bus error: out of shared memory, in kubernetes container
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 1 comment
#70 - how to debug kubeshare and measure the performance?
Issue -
State: open - Opened by pint1022 almost 3 years ago
- 4 comments
#69 - A To-do List for the design of vGPU-Scheduler
Issue -
State: open - Opened by YHDING23 almost 3 years ago
#68 - how can the scheduler track used/free vGPU info?
Issue -
State: closed - Opened by YHDING23 almost 3 years ago
- 1 comment
#67 - Alnair vGPU compute resource
Issue -
State: open - Opened by hxhp almost 3 years ago
- 2 comments
#66 - rename profiler pod name and add a debug command to start container a…
Pull Request -
State: closed - Opened by Fizzbb almost 3 years ago
#65 - add get job metrics feature
Pull Request -
State: closed - Opened by zliu722 almost 3 years ago
- 2 comments
#64 - Device plugin, Kubelet (device plugin manager), API server, Scheduler, who knows which GPU is used/free
Issue -
State: open - Opened by Fizzbb almost 3 years ago
#63 - Scheduler extension point data sharing with CycleState
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 1 comment
#62 - Relevant platform trial
Issue -
State: open - Opened by Fizzbb almost 3 years ago
#61 - build docker to add python startup hook
Issue -
State: closed - Opened by pint1022 almost 3 years ago
- 1 comment
#60 - explore/implement ml-profiler UI in prometheus
Issue -
State: open - Opened by pint1022 almost 3 years ago
#59 - custom tensorflow fit or other tensorboard callback to get layer metrics
Issue -
State: open - Opened by pint1022 almost 3 years ago
#49 - error: unable to recognize "crd.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"
Issue -
State: closed - Opened by pint1022 almost 3 years ago
- 1 comment
#48 - deploy MongoDB with pv and pvc
Pull Request -
State: closed - Opened by zliu722 almost 3 years ago
#47 - ML profiling metrics set (raw)
Issue -
State: open - Opened by pint1022 almost 3 years ago
- 2 comments
#44 - scheduler policy configuration
Issue -
State: closed - Opened by Fizzbb almost 3 years ago
- 2 comments
#43 - Investigate how to get device plugin (virtual device id) info from scheduler plugin
Issue -
State: open - Opened by Fizzbb almost 3 years ago
#41 - Design and Implement Alnair GPU device plugin
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 1 comment
#39 - Wrap MLPerf training scripts into Unified Job yaml format and create testing script
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 5 comments
#37 - GPU compute sharing/isolation methods survey/review
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 8 comments
#36 - check in pv,pvc, mongoDB deployment yaml file
Issue -
State: closed - Opened by Fizzbb almost 3 years ago
#35 - Fractional GPU scheduling design review & implementation
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 1 comment
#34 - Unified Training scheduler incorrectly use allocatable for free GPU resources
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 3 comments
#33 - Periodically check cluster job and update job info to database
Issue -
State: open - Opened by Fizzbb almost 3 years ago
- 1 comment