Ecosyste.ms: Issues

An open API service that provides issue and pull request metadata for open source projects.

GitHub / CentaurusInfra/alnair issues and pull requests

#150 - Add a new ddp training script

Pull Request - State: closed - Opened by YHDING23 almost 2 years ago

#147 - GDS traffic monitor

Issue - State: open - Opened by pint1022 about 2 years ago

#146 - Major updates to the README files - restructured and separated them, …

Pull Request - State: closed - Opened by np-ftrwei about 2 years ago

#145 - fix notebook errors

Pull Request - State: closed - Opened by YHDING23 about 2 years ago

#144 - rewrite pod metadata and utils data collection and storing

Pull Request - State: closed - Opened by Fizzbb about 2 years ago

#142 - Create nerf_ddp.py

Pull Request - State: closed - Opened by nwangfw about 2 years ago

#141 - Add Neural Avatar as use-case

Pull Request - State: closed - Opened by YHDING23 about 2 years ago

#140 - remove log.Fatalf from exiting programs

Pull Request - State: closed - Opened by Fizzbb about 2 years ago

#139 - Alluxio data orchestration

Pull Request - State: closed - Opened by np-ftrwei about 2 years ago

#138 - Update mnist-distributed.py

Pull Request - State: closed - Opened by nwangfw about 2 years ago

#137 - Intercept hook

Pull Request - State: closed - Opened by pint1022 about 2 years ago

#136 - GPUDirect to local SSD

Issue - State: open - Opened by Fizzbb over 2 years ago

#135 - removed a space; changed memory size type long long

Pull Request - State: closed - Opened by pint1022 over 2 years ago

#134 - Exporter dev

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#133 - Cuda met

Pull Request - State: closed - Opened by pint1022 over 2 years ago

#132 - intercept-lib test instruction doesn't work.

Issue - State: open - Opened by awang088 over 2 years ago - 1 comment

#130 - vgpu-server get cgroup pid from docker top instead of copy file, and …

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#129 - fix bug convert timestamp to float unexpected

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#128 - scheduling needs

Issue - State: open - Opened by Fizzbb over 2 years ago

#127 - A bad case for dlsym real func acquirement.

Issue - State: closed - Opened by CalvinXKY over 2 years ago - 3 comments

#126 - Add IsSharingGPU function

Pull Request - State: closed - Opened by YHDING23 over 2 years ago

#124 - remove potential .so directory in /opt/alnair to avoid Init:crashloop…

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#122 - device-plugin installation error, Init:crashloopback

Issue - State: open - Opened by Fizzbb over 2 years ago - 1 comment

#121 - Add binpack and spread policy

Pull Request - State: closed - Opened by YHDING23 over 2 years ago

#120 - change alnair socket path, so it does not need to mount /run causing …

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#118 - add max memory bandwidth utils to pod metrics

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#116 - comment out remove annotations

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#115 - profiler add mem-copy-utils from DCGM to reflect application's io requests

Issue - State: closed - Opened by Fizzbb over 2 years ago - 1 comment

#112 - use nsight system inside containers

Issue - State: open - Opened by Fizzbb over 2 years ago - 1 comment

#111 - update single file deployment, mount /run, require no nvidia-docker2 …

Pull Request - State: closed - Opened by Fizzbb over 2 years ago

#109 - same node pods communication through unix socket

Issue - State: open - Opened by Fizzbb over 2 years ago - 2 comments

#107 - setup multiple nodes cluster for kubeshare performance testing

Issue - State: open - Opened by pint1022 over 2 years ago - 1 comment

#104 - Add the vGPUScheduler to support Alnair Virtual GPUs

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#103 - Revert "Add the vGPUScheduler to support Alnair Virtual GPUs"

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#102 - Add the vGPUScheduler to support Alnair Virtual GPUs

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#101 - Revert "Add the vGPUScheduler to support alnair virtual gpu"

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#100 - Add the vGPUScheduler to support alnair virtual gpu

Pull Request - State: closed - Opened by YHDING23 almost 3 years ago

#99 - modify getPreferredDeviceIDs function to make sure vGPU IDs are all f…

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago - 1 comment

#97 - minor changes in annotation to handle local test scenario, include nod…

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#96 - add pod and node annotation for gpu usage info

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago - 1 comment

#94 - Profiler dev

Pull Request - State: closed - Opened by zliu722 almost 3 years ago

#93 - Design and Implement a good GPU utilization metrics

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#91 - Add GPU metrics to Pod metrics for Job metadata

Issue - State: open - Opened by Fizzbb almost 3 years ago

#90 - add IndexField for running pod selection and fix the allocatable gpu bug

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#89 - Patch Pod Spec Annotations

Issue - State: open - Opened by YHDING23 almost 3 years ago

#88 - Kubeshare prototyping and compute sharing deep dive

Issue - State: open - Opened by Fizzbb almost 3 years ago - 4 comments

#87 - add a transformer translation example, the valid dataset valid.de.raw…

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#86 - how to run the throughput testing on Kubeshare?

Issue - State: open - Opened by pint1022 almost 3 years ago - 2 comments

#85 - how to replace kubeshare executables with the local build?

Issue - State: closed - Opened by pint1022 almost 3 years ago - 2 comments

#84 - what does code-gen.sh do?

Issue - State: closed - Opened by pint1022 almost 3 years ago - 1 comment

#82 - delete print database entries; change print to logging for exception handling

Pull Request - State: closed - Opened by zliu722 almost 3 years ago

#80 - Kubeshare test cases:

Issue - State: open - Opened by pint1022 almost 3 years ago - 8 comments

#79 - Kubeshare is not targeted on a node with multiple gpus?

Issue - State: closed - Opened by pint1022 almost 3 years ago - 1 comment

#78 - Profiler dev

Pull Request - State: closed - Opened by zliu722 almost 3 years ago

#77 - Kubeshare can NOT generate running Pods with Kubectl v1.23.1

Issue - State: closed - Opened by pint1022 almost 3 years ago - 1 comment

#76 - Serverless for AI

Issue - State: open - Opened by Fizzbb almost 3 years ago
Labels: investigation

#75 - Multi-tenant platform security

Issue - State: open - Opened by Fizzbb almost 3 years ago
Labels: investigation

#74 - Investigate distributed storage/cache for fast training dataset loading

Issue - State: open - Opened by Fizzbb almost 3 years ago
Labels: investigation

#73 - add one deep learning training job

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#72 - Ziyu's progress update 1/19/2022

Issue - State: closed - Opened by zliu722 almost 3 years ago

#71 - pytorch dataloader raise bus error: out of shared memory, in kubernetes container

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#70 - how to debug kubeshare and measure the performance?

Issue - State: open - Opened by pint1022 almost 3 years ago - 4 comments

#69 - A To-do List for the design of vGPU-Scheduler

Issue - State: open - Opened by YHDING23 almost 3 years ago

#68 - how can the scheduler track used/free vGPU info?

Issue - State: closed - Opened by YHDING23 almost 3 years ago - 1 comment

#67 - Alnair vGPU compute resource

Issue - State: open - Opened by hxhp almost 3 years ago - 2 comments

#66 - rename profiler pod name and add a debug command to start container a…

Pull Request - State: closed - Opened by Fizzbb almost 3 years ago

#65 - add get job metrics feature

Pull Request - State: closed - Opened by zliu722 almost 3 years ago - 2 comments

#63 - Scheduler extension point data sharing with CycleState

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#62 - Relevant platform trial

Issue - State: open - Opened by Fizzbb almost 3 years ago

#61 - build docker to add python startup hook

Issue - State: closed - Opened by pint1022 almost 3 years ago - 1 comment

#60 - explore/implement ml-profiler UI in prometheus

Issue - State: open - Opened by pint1022 almost 3 years ago

#48 - deploy MongoDB with pv and pvc

Pull Request - State: closed - Opened by zliu722 almost 3 years ago

#47 - ML profiling metrics set (raw)

Issue - State: open - Opened by pint1022 almost 3 years ago - 2 comments

#44 - scheduler policy configuration

Issue - State: closed - Opened by Fizzbb almost 3 years ago - 2 comments

#41 - Design and Implement Alnair GPU device plugin

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#39 - Wrap MLPerf training scripts into Unified Job yaml format and create testing script

Issue - State: open - Opened by Fizzbb almost 3 years ago - 5 comments

#37 - GPU compute sharing/isolation methods survey/review

Issue - State: open - Opened by Fizzbb almost 3 years ago - 8 comments

#36 - check in pv,pvc, mongoDB deployment yaml file

Issue - State: closed - Opened by Fizzbb almost 3 years ago

#35 - Fractional GPU scheduling design review & implementation

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment

#34 - Unified Training scheduler incorrectly use allocatable for free GPU resources

Issue - State: open - Opened by Fizzbb almost 3 years ago - 3 comments

#33 - Periodically check cluster job and update job info to database

Issue - State: open - Opened by Fizzbb almost 3 years ago - 1 comment