Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / sujeethjinesh/trainingthroughfailure issues and pull requests

#85 - Add Final Paper

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#84 - Add final prometheus metrics

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#83 - Add num processed gradients csv

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#82 - Add System Diagrams

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#81 - add new metrics graph

Pull Request - State: closed - Opened by wvshelu 8 months ago

#80 - Sujinesh/collect metrics 2 kill

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#79 - Sujinesh/collect metrics 2 kill

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#78 - Upload Metrics of Various Server Crashes

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#77 - Fix run experiment script and update README

Pull Request - State: closed - Opened by zgan1 8 months ago

#76 - Undo random transform to test performance

Pull Request - State: closed - Opened by zgan1 8 months ago

#75 - Larger Fashion Mnist model

Pull Request - State: closed - Opened by zgan1 8 months ago

#74 - Sujinesh/collect 2 kill metrics

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#73 - add kill times argument to script

Pull Request - State: closed - Opened by zgan1 8 months ago

#72 - run all experiments in script

Pull Request - State: closed - Opened by zgan1 8 months ago

#71 - Add Kill Times for Experiments

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#70 - Add a run experiment bash script

Pull Request - State: closed - Opened by zgan1 8 months ago

#69 - Add L2 Regularization and random source image transform & resize

Pull Request - State: closed - Opened by oomyduoy 8 months ago

#68 - Update setup environment and update print formatting

Pull Request - State: closed - Opened by zgan1 8 months ago

#67 - Fix chain replication

Pull Request - State: closed - Opened by SujeethJinesh 8 months ago

#66 - Add a thread for chain replication server recovery

Pull Request - State: closed - Opened by zgan1 8 months ago

#65 - Add cpu utilization metrics

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#64 - Sync/Async chain replication metrics no kill

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#63 - Chain replication frequency

Pull Request - State: closed - Opened by zgan1 9 months ago

#62 - Add sync async checkpoint metrics

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#61 - fix apply gradients

Pull Request - State: closed - Opened by zgan1 9 months ago

#60 - Experiment merge conflict

Pull Request - State: closed - Opened by zgan1 9 months ago

#59 - Async control no kill metrics

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#58 - Sync Control No Kill Metrics

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#57 - Add async relaxed consistency with kill

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#56 - Async Relaxed Consistency No Kill metrics

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#55 - Sujinesh/run experiments

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#54 - Sujinesh/update exp4 lr

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#53 - Add async relaxed consistency dashboard

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#52 - Use SGD instead of Adam

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#51 - Pass Metric Exporter to sync/async

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#50 - Add Cumulative Gradients and ZK Read/Write Metrics

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#49 - Refactor to enable model passing by name

Pull Request - State: open - Opened by zgan1 9 months ago

#45 - Fix runtime_env

Pull Request - State: closed - Opened by wvshelu 9 months ago

#44 - Update README.md with GCP set up instructions

Pull Request - State: closed - Opened by wvshelu 9 months ago

#43 - Add kazoo as dependency to work with GCP worker nodes

Pull Request - State: closed - Opened by wvshelu 9 months ago

#42 - Mostly fix exp 4 memory leak

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#40 - Update dashboard.json

Pull Request - State: closed - Opened by wvshelu 9 months ago

#39 - Optimize chain replication weight passing

Pull Request - State: closed - Opened by zgan1 9 months ago

#38 - Fix CIFAR10 download issue

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#36 - Add Grafana dashboard README instructions

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#35 - Add cifar10 training workload

Pull Request - State: closed - Opened by zgan1 9 months ago

#34 - Bring up new node when chain node fails

Pull Request - State: closed - Opened by zgan1 9 months ago

#32 - Add Loss Metric for exps 1, 2, 4

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#31 - [WIP] MPS to get macbook training working

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#30 - Add loss metric and instrument it to chain replication

Pull Request - State: closed - Opened by zgan1 9 months ago

#29 - README Update

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#28 - Correct loss function for training

Pull Request - State: closed - Opened by zgan1 9 months ago - 1 comment

#27 - Instrument Metrics for sync, async, and exp 4

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#26 - Add fashion mnist model

Pull Request - State: closed - Opened by zgan1 9 months ago

#25 - Update Sync Experiment & Use Proper Metrics Versions

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#24 - Refactor chain node experiment

Pull Request - State: closed - Opened by zgan1 9 months ago

#23 - Resolve zookeeper conflict between exp3 and exp4

Pull Request - State: closed - Opened by zgan1 9 months ago

#22 - Add further instructions for custom metrics

Pull Request - State: closed - Opened by zgan1 9 months ago

#21 - Add instrumentation for custom metrics

Pull Request - State: closed - Opened by zgan1 9 months ago

#20 - Refactor Experiment 4 and add more flags

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#19 - Refactor relaxed consistency experiment

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#18 - Drill into experiment 4

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#17 - Remove env

Pull Request - State: closed - Opened by zgan1 9 months ago

#16 - Keep object references around using a ref store

Pull Request - State: closed - Opened by zgan1 9 months ago

#15 - Add Fault Tolerance To Experiment 4

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#13 - Experiment 4 without Fault Tolerance

Pull Request - State: closed - Opened by SujeethJinesh 9 months ago

#12 - Remove Zookeeper installation directory from the project

Pull Request - State: closed - Opened by zgan1 9 months ago

#11 - Chain node exp

Pull Request - State: closed - Opened by zgan1 9 months ago

#10 - Add chain node experiment zookeeper README

Pull Request - State: closed - Opened by zgan1 9 months ago

#9 - Chain node exp

Pull Request - State: closed - Opened by zgan1 9 months ago

#8 - update steps to set up metrics monitoring

Pull Request - State: closed - Opened by oomyduoy 9 months ago

#7 - Working checkpointing via global actor.

Pull Request - State: closed - Opened by wvshelu 10 months ago

#6 - Add Ray Object Store Logic with Zookeeper for Experiment 3

Pull Request - State: closed - Opened by SujeethJinesh 10 months ago

#5 - disrupt training by exiting the actor

Pull Request - State: closed - Opened by zgan1 10 months ago

#4 - Implement zookeeper chain node that handles single node failure

Pull Request - State: closed - Opened by zgan1 10 months ago

#3 - Refactor experiments param server

Pull Request - State: closed - Opened by SujeethJinesh 10 months ago

#2 - Implement Synch and Asynch Parameter Servers

Pull Request - State: closed - Opened by SujeethJinesh 10 months ago

#1 - Add venv

Pull Request - State: closed - Opened by SujeethJinesh 10 months ago