Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / sujeethjinesh/trainingthroughfailure issues and pull requests
#85 - Add Final Paper
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#84 - Add final prometheus metrics
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#83 - Add num processed gradients csv
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#82 - Add System Diagrams
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#81 - add new metrics graph
Pull Request -
State: closed - Opened by wvshelu 8 months ago
#80 - Sujinesh/collect metrics 2 kill
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#79 - Sujinesh/collect metrics 2 kill
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#78 - Upload Metrics of Various Server Crashes
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#77 - Fix run experiment script and update README
Pull Request -
State: closed - Opened by zgan1 8 months ago
#76 - Undo random transform to test performance
Pull Request -
State: closed - Opened by zgan1 8 months ago
#75 - Larger Fashion Mnist model
Pull Request -
State: closed - Opened by zgan1 8 months ago
#74 - Sujinesh/collect 2 kill metrics
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#73 - add kill times argument to script
Pull Request -
State: closed - Opened by zgan1 8 months ago
#72 - run all experiments in script
Pull Request -
State: closed - Opened by zgan1 8 months ago
#71 - Add Kill Times for Experiments
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#70 - Add a run experiment bash script
Pull Request -
State: closed - Opened by zgan1 8 months ago
#69 - Add L2 Regularization and random source image transform & resize
Pull Request -
State: closed - Opened by oomyduoy 8 months ago
#68 - Update setup environment and update print formatting
Pull Request -
State: closed - Opened by zgan1 8 months ago
#67 - Fix chain replication
Pull Request -
State: closed - Opened by SujeethJinesh 8 months ago
#66 - Add a thread for chain replication server recovery
Pull Request -
State: closed - Opened by zgan1 8 months ago
#65 - Add cpu utilization metrics
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#64 - Sync/Async chain replication metrics no kill
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#63 - Chain replication frequency
Pull Request -
State: closed - Opened by zgan1 9 months ago
#62 - Add sync async checkpoint metrics
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#61 - fix apply gradients
Pull Request -
State: closed - Opened by zgan1 9 months ago
#60 - Experiment merge conflict
Pull Request -
State: closed - Opened by zgan1 9 months ago
#59 - Async control no kill metrics
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#58 - Sync Control No Kill Metrics
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#57 - Add async relaxed consistency with kill
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#56 - Async Relaxed Consistency No Kill metrics
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#55 - Sujinesh/run experiments
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#54 - Sujinesh/update exp4 lr
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#53 - Add async relaxed consistency dashboard
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#52 - Use SGD instead of Adam
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#51 - Pass Metric Exporter to sync/async
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#50 - Add Cumulative Gradients and ZK Read/Write Metrics
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#49 - Refactor to enable model passing by name
Pull Request -
State: open - Opened by zgan1 9 months ago
#48 - use iter based evaluation instead of async thread in sync/async exper…
Pull Request -
State: closed - Opened by zgan1 9 months ago
#47 - use iter based evaluation instead of async thread in sync/async exper…
Pull Request -
State: open - Opened by zgan1 9 months ago
#46 - Add optional MPS support and cumulative gradient support for exp 4
Pull Request -
State: open - Opened by SujeethJinesh 9 months ago
#45 - Fix runtime_env
Pull Request -
State: closed - Opened by wvshelu 9 months ago
#44 - Update README.md with GCP set up instructions
Pull Request -
State: closed - Opened by wvshelu 9 months ago
#43 - Add kazoo as dependency to work with GCP worker nodes
Pull Request -
State: closed - Opened by wvshelu 9 months ago
#42 - Mostly fix exp 4 memory leak
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#41 - Update readme to use a single setup script for experiment environments
Pull Request -
State: closed - Opened by zgan1 9 months ago
#40 - Update dashboard.json
Pull Request -
State: closed - Opened by wvshelu 9 months ago
#39 - Optimize chain replication weight passing
Pull Request -
State: closed - Opened by zgan1 9 months ago
#38 - Fix CIFAR10 download issue
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#37 - [WIP] Add separate thread evaluator for Async Relaxed Consistency
Pull Request -
State: open - Opened by SujeethJinesh 9 months ago
#36 - Add Grafana dashboard README instructions
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#35 - Add cifar10 training workload
Pull Request -
State: closed - Opened by zgan1 9 months ago
#34 - Bring up new node when chain node fails
Pull Request -
State: closed - Opened by zgan1 9 months ago
#33 - Add Evaluator Thread that executes every 2 seconds (sync + async)
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#32 - Add Loss Metric for exps 1, 2, 4
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#31 - [WIP] MPS to get macbook training working
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#30 - Add loss metric and instrument it to chain replication
Pull Request -
State: closed - Opened by zgan1 9 months ago
#29 - README Update
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#28 - Correct loss function for training
Pull Request -
State: closed - Opened by zgan1 9 months ago - 1 comment
#27 - Instrument Metrics for sync, async, and exp 4
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#26 - Add fashion mnist model
Pull Request -
State: closed - Opened by zgan1 9 months ago
#25 - Update Sync Experiment & Use Proper Metrics Versions
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#24 - Refactor chain node experiment
Pull Request -
State: closed - Opened by zgan1 9 months ago
#23 - Resolve zookeeper conflict between exp3 and exp4
Pull Request -
State: closed - Opened by zgan1 9 months ago
#22 - Add further instructions for custom metrics
Pull Request -
State: closed - Opened by zgan1 9 months ago
#21 - Add instrumentation for custom metrics
Pull Request -
State: closed - Opened by zgan1 9 months ago
#20 - Refactor Experiment 4 and add more flags
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#19 - Refactor relaxed consistency experiment
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#18 - Drill into experiment 4
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#17 - Remove env
Pull Request -
State: closed - Opened by zgan1 9 months ago
#16 - Keep object references around using a ref store
Pull Request -
State: closed - Opened by zgan1 9 months ago
#15 - Add Fault Tolerance To Experiment 4
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#14 - Set up `ft-distributed-ml-training` and add instructions to run Ray on GCP
Pull Request -
State: closed - Opened by wvshelu 9 months ago
#13 - Experiment 4 without Fault Tolerance
Pull Request -
State: closed - Opened by SujeethJinesh 9 months ago
#12 - Remove Zookeeper installation directory from the project
Pull Request -
State: closed - Opened by zgan1 9 months ago
#11 - Chain node exp
Pull Request -
State: closed - Opened by zgan1 9 months ago
#10 - Add chain node experiment zookeeper README
Pull Request -
State: closed - Opened by zgan1 9 months ago
#9 - Chain node exp
Pull Request -
State: closed - Opened by zgan1 9 months ago
#8 - update steps to set up metrics monitoring
Pull Request -
State: closed - Opened by oomyduoy 9 months ago
#7 - Working checkpointing via global actor.
Pull Request -
State: closed - Opened by wvshelu 10 months ago
#6 - Add Ray Object Store Logic with Zookeeper for Experiment 3
Pull Request -
State: closed - Opened by SujeethJinesh 10 months ago
#5 - disrupt training by exiting the actor
Pull Request -
State: closed - Opened by zgan1 10 months ago
#4 - Implement zookeeper chain node that handles single node failure
Pull Request -
State: closed - Opened by zgan1 10 months ago
#3 - Refactor experiments param server
Pull Request -
State: closed - Opened by SujeethJinesh 10 months ago
#2 - Implement Synch and Asynch Parameter Servers
Pull Request -
State: closed - Opened by SujeethJinesh 10 months ago
#1 - Add venv
Pull Request -
State: closed - Opened by SujeethJinesh 10 months ago