CentaurusInfra/alnair issues and pull requests

#150 - Add a new ddp training script

Pull Request - State: closed - Opened by YHDING23 about 2 years ago

#147 - GDS traffic monitor

Issue - State: open - Opened by pint1022 over 2 years ago

#146 - Major updates to the README files - restructured and separated them, …

Pull Request - State: closed - Opened by np-ftrwei over 2 years ago

#145 - fix notebook errors

Pull Request - State: closed - Opened by YHDING23 over 2 years ago

#144 - rewrite pod metadata and utils data collection and storing

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#143 - [alnair device plugin] feature request -- support GPU selection

Issue - State: open - Opened by Fizzbb over 2 years ago

#142 - Create nerf_ddp.py

Pull Request - State: closed - Opened by nwangfw over 2 years ago

#141 - Add Neural Avatar as use-case

Pull Request - State: closed - Opened by YHDING23 over 2 years ago

#140 - remove log.Fatalf from exiting programs

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#139 - Alluxio data orchestration

Pull Request - State: closed - Opened by np-ftrwei over 2 years ago

#138 - Update mnist-distributed.py

Pull Request - State: closed - Opened by nwangfw over 2 years ago

#137 - Intercept hook

Pull Request - State: closed - Opened by pint1022 over 2 years ago

#136 - GPUDirect to local SSD

Issue - State: open - Opened by Fizzbb over 2 years ago

#135 - removed a space; changed memory size type long long

Pull Request - State: closed - Opened by pint1022 over 2 years ago

#134 - Exporter dev

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#133 - Cuda met

Pull Request - State: closed - Opened by pint1022 over 2 years ago

#132 - intercept-lib test instruction doesn't work.

Issue - State: open - Opened by awang088 over 2 years ago - 1 comment

#131 - Add prometheus export to report process-level GPU utilization and memory used size

Issue - State: open - Opened by Fizzbb over 2 years ago

#130 - vgpu-server get cgroup pid from docker top instead of copy file, and …

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#129 - fix bug convert timestamp to float unexpected

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#128 - scheduling needs

Issue - State: open - Opened by Fizzbb almost 3 years ago

#127 - A bad case for dlsym real func acquirement.

Issue - State: closed - Opened by CalvinXKY almost 3 years ago - 3 comments

#126 - Add IsSharingGPU function

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#125 - vGPU scheduler assume all the nodes have GPU information annotation. Cannot handle cpu node or the period before annotation got patched

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#124 - remove potential .so directory in /opt/alnair to avoid Init:crashloop…

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#123 - Containerize vGPU server leads cgroup.procs content invisible (leads to process util inquiry always 0, compute control failed)

Issue - State: closed - Opened by Fizzbb almost 3 years ago - 4 comments

#122 - device-plugin installation error, Init:crashloopback

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#121 - Add binpack and spread policy

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#120 - change alnair socket path, so it does not need to mount /run causing …

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#119 - vgpu-server container failed to start, "run/nvidia-persistenced/socket" no such device or address

Issue - State: open - Opened by Fizzbb almost 3 years ago

#118 - add max memory bandwidth utils to pod metrics

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#116 - comment out remove annotations

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#115 - profiler add mem-copy-utils from DCGM to reflect application's io requests

Issue - State: closed - Opened by Fizzbb almost 3 years ago - 1 comment

#114 - intercept lib launched through LD_PRELOAD cannot intercept cuda driver API calls with pytorch version >=1.10

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#113 - profiler remove all pod annotation under ai.centaurus.io domain after gpu process is done, which affects scheduler and device plugin

Issue - State: closed - Opened by Fizzbb almost 3 years ago - 1 comment

#112 - use nsight system inside containers

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#111 - update single file deployment, mount /run, require no nvidia-docker2 …

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#110 - Add pre-start hook to all containers in container runtime to support GPU access

Issue - State: open - Opened by Fizzbb almost 3 years ago

#109 - same node pods communication through unix socket

Issue - State: open - Opened by Fizzbb almost 3 years ago - 2 comments

#108 - create an exporter to export burst, overuse and window-size metrics to prometheus.

Issue - State: open - Opened by pint1022 almost 3 years ago

#107 - setup multiple nodes cluster for kubeshare performance testing

Issue - State: open - Opened by pint1022 almost 3 years ago - 1 comment

#106 - setup tf-serving testing environment for kubeshare throughput testing

Issue - State: open - Opened by pint1022 almost 3 years ago

#105 - horovod mnist.py has higher utilization number. what does it do?

Issue - State: open - Opened by pint1022 almost 3 years ago

#104 - Add the vGPUScheduler to support Alnair Virtual GPUs

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#103 - Revert "Add the vGPUScheduler to support Alnair Virtual GPUs"

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#102 - Add the vGPUScheduler to support Alnair Virtual GPUs

Pull Request - State: closed - Opened by YHDING23 about 3 years ago

#101 - Revert "Add the vGPUScheduler to support alnair virtual gpu"

Pull Request - State: closed - Opened by YHDING23 about 3 years ago

#100 - Add the vGPUScheduler to support alnair virtual gpu

Pull Request - State: closed - Opened by YHDING23 about 3 years ago

#99 - modify getPreferredDeviceIDs function to make sure vGPU IDs are all f…

Pull Request - State: closed - Opened by Fizzbb about 3 years ago - 1 comment

#98 - GPU sharing corner case: vGPUs spread to two or more physical GPUs

Issue - State: open - Opened by Fizzbb about 3 years ago

#97 - minor changes in annotation to handle local test scenario, include nod…

Pull Request - State: closed - Opened by Fizzbb about 3 years ago

#96 - add pod and node annotation for gpu usage info

Pull Request - State: closed - Opened by Fizzbb about 3 years ago - 1 comment

#95 - Alnair device plugin find the pod associated with the allocate request

Issue - State: closed - Opened by Fizzbb about 3 years ago

#94 - Profiler dev

Pull Request - State: closed - Opened by zliu722 about 3 years ago

#93 - Design and Implement a good GPU utilization metrics

Issue - State: open - Opened by Fizzbb about 3 years ago - 1 comment

#92 - revise alnair devicepluginserver to connect the running pod/container info with the device

Issue - State: open - Opened by YHDING23 about 3 years ago - 1 comment

#91 - Add GPU metrics to Pod metrics for Job metadata

Issue - State: open - Opened by Fizzbb about 3 years ago

#90 - add IndexField for running pod selection and fix the allocatable gpu bug

Pull Request - State: closed - Opened by Fizzbb about 3 years ago

#89 - Patch Pod Spec Annotations

Issue - State: open - Opened by YHDING23 about 3 years ago

#88 - Kubeshare prototyping and compute sharing deep dive

Issue - State: open - Opened by Fizzbb about 3 years ago - 4 comments

#87 - add a transformer translation example, the valid dataset valid.de.raw…

Pull Request - State: closed - Opened by Fizzbb about 3 years ago

#86 - how to run the throughput testing on Kubeshare?

Issue - State: open - Opened by pint1022 about 3 years ago - 2 comments

#85 - how to replace kubeshare executables with the local build?

Issue - State: closed - Opened by pint1022 about 3 years ago - 2 comments

#84 - what does code-gen.sh do?

Issue - State: closed - Opened by pint1022 about 3 years ago - 1 comment

#83 - fairseq multihead_attention, torch.cat cause RuntimeError: CUDA out of memory

Issue - State: open - Opened by Fizzbb about 3 years ago

#82 - delete print database entries; change print to logging for exception handling

Pull Request - State: closed - Opened by zliu722 about 3 years ago

#80 - Kubeshare test cases:

Issue - State: open - Opened by pint1022 about 3 years ago - 8 comments

#79 - Kubeshare is not targeted on a node with multiple gpus?

Issue - State: closed - Opened by pint1022 about 3 years ago - 1 comment

#78 - Profiler dev

Pull Request - State: closed - Opened by zliu722 about 3 years ago

#77 - Kubeshare can NOT generate running Pods with Kubectl v1.23.1

Issue - State: closed - Opened by pint1022 about 3 years ago - 1 comment

#76 - Serverless for AI

Issue - State: open - Opened by Fizzbb about 3 years ago
Labels: investigation

#75 - Multi-tenant platform security

Issue - State: open - Opened by Fizzbb about 3 years ago
Labels: investigation

#74 - Investigate distributed storage/cache for fast training dataset loading

Issue - State: open - Opened by Fizzbb about 3 years ago
Labels: investigation

#73 - add one deep learning training job

Pull Request - State: closed - Opened by Fizzbb about 3 years ago

#72 - Ziyu's progress update 1/19/2022

Issue - State: closed - Opened by zliu722 about 3 years ago

#71 - pytorch dataloader raise bus error: out of shared memory, in kubernetes container

Issue - State: open - Opened by Fizzbb about 3 years ago - 1 comment

#70 - how to debug kubeshare and measure the performance?

Issue - State: open - Opened by pint1022 about 3 years ago - 4 comments

#69 - A To-do List for the design of vGPU-Scheduler

Issue - State: open - Opened by YHDING23 about 3 years ago

#68 - how can the scheduler track used/free vGPU info?

Issue - State: closed - Opened by YHDING23 about 3 years ago - 1 comment

#67 - Alnair vGPU compute resource

Issue - State: open - Opened by hxhp about 3 years ago - 2 comments

#66 - rename profiler pod name and add a debug command to start container a…

Pull Request - State: closed - Opened by Fizzbb about 3 years ago

#65 - add get job metrics feature

Pull Request - State: closed - Opened by zliu722 about 3 years ago - 2 comments

#64 - Device plugin, Kubelet (device plugin manager), API server, Scheduler, who knows which GPU is used/free

Issue - State: open - Opened by Fizzbb about 3 years ago

#63 - Scheduler extension point data sharing with CycleState

Issue - State: open - Opened by Fizzbb about 3 years ago - 1 comment

#62 - Relevant platform trial

Issue - State: open - Opened by Fizzbb about 3 years ago

#61 - build docker to add python startup hook

Issue - State: closed - Opened by pint1022 about 3 years ago - 1 comment

#60 - explore/implement ml-profiler UI in prometheus

Issue - State: open - Opened by pint1022 about 3 years ago

#59 - custom tensorflow fit or other tensorboard callback to get layer metrics

Issue - State: open - Opened by pint1022 about 3 years ago

#49 - error: unable to recognize "crd.yaml": no matches for kind "CustomResourceDefinition" in version "apiextensions.k8s.io/v1beta1"

Issue - State: closed - Opened by pint1022 about 3 years ago - 1 comment

#48 - deploy MongoDB with pv and pvc

Pull Request - State: closed - Opened by zliu722 about 3 years ago

#47 - ML profiling metrics set (raw)

Issue - State: open - Opened by pint1022 about 3 years ago - 2 comments

#44 - scheduler policy configuration

Issue - State: closed - Opened by Fizzbb about 3 years ago - 2 comments

#43 - Investigate how to get device plugin (virtual device id) info from scheduler plugin

Issue - State: open - Opened by Fizzbb about 3 years ago

#41 - Design and Implement Alnair GPU device plugin

Issue - State: open - Opened by Fizzbb about 3 years ago - 1 comment

#39 - Wrap MLPerf training scripts into Unified Job yaml format and create testing script

Issue - State: open - Opened by Fizzbb about 3 years ago - 5 comments

#37 - GPU compute sharing/isolation methods survey/review

Issue - State: open - Opened by Fizzbb about 3 years ago - 8 comments

#36 - check in pv,pvc, mongoDB deployment yaml file

Issue - State: closed - Opened by Fizzbb about 3 years ago

#35 - Fractional GPU scheduling design review & implementation

Issue - State: open - Opened by Fizzbb about 3 years ago - 1 comment

#34 - Unified Training scheduler incorrectly use allocatable for free GPU resources

Issue - State: open - Opened by Fizzbb about 3 years ago - 3 comments

#33 - Periodically check cluster job and update job info to database

Issue - State: open - Opened by Fizzbb about 3 years ago - 1 comment
