Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / intelligent-machine-learning/dlrover issues and pull requests
#1089 - Use torch.save to save the training args.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1088 - Diagnosis monitor implementation
Pull Request -
State: closed - Opened by samplise 6 months ago
- 1 comment
#1087 - Job master does not wait for workers if there are exited workers.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1086 - The job master hangs when there is only one worker and the worker is preempted.
Issue -
State: closed - Opened by workingloong 6 months ago
#1085 - Megatron-LM checkpointer loads the checkpoint from the memory.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1084 - Fix the performance table of flash checkpoint in Megatron-LM.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1083 - Add the commit link of Megatron-LM in the benchmark of flash checkpoint.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1082 - Simplify the logs to show node ranks in the rendezvous.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1081 - Set the rdzv nodes as an OrderedDict.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1080 - ElasticAgent does not sort ranks.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1079 - hf trainer with flash checkpoint hang when save_to_memory
Issue -
State: closed - Opened by Lzhang-hub 6 months ago
- 12 comments
#1078 - Save the FSDP checkpoint with full state dict.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1077 - Speed up the unit test cases.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1076 - Sort the ranks of nodes by the switches.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1075 - Use Gang Scheduling in ElasticJob of DLRover.
Issue -
State: open - Opened by workingloong 6 months ago
#1074 - Fault Diagnosis Design
Pull Request -
State: closed - Opened by samplise 6 months ago
- 2 comments
#1073 - The node check finishes if the record files are ready.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1072 - Implement the FlashCkptTrainer to async save checkpoint of hf trainer.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1071 - set checkpoint config no_reentrant default as False
Pull Request -
State: closed - Opened by skydoorkai 6 months ago
#1070 - Skip TestLLama2Util on cpu
Pull Request -
State: closed - Opened by adamantboy 6 months ago
#1069 - Don't release the lock of shared memory before restarting workers.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1068 - The job stops restarting workers and exits if the traceback is a code bug.
Issue -
State: open - Opened by workingloong 6 months ago
- 2 comments
Labels: enhancement, question
#1067 - add npu examples
Pull Request -
State: closed - Opened by cailun01 6 months ago
#1066 - Skip saving checkpoint at the breakpoint if some nodes fail.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1065 - load_checkpoint failed when using Megatron flash checkpoint because tracker_file is not saved by dlrover
Issue -
State: closed - Opened by deepcoldfish 6 months ago
- 2 comments
#1064 - Support passing load_strategy to AtorchTrainer.
Pull Request -
State: closed - Opened by rockychan724 6 months ago
- 1 comment
#1063 - Fix the figure of ElasticJob.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1062 - Fix the image path of elasticjob architecture.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1061 - Optimize doc for dev.
Pull Request -
State: closed - Opened by BalaBalaYi 6 months ago
- 1 comment
Labels: documentation
#1060 - Fix the module path in the tutorial of node detection.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1059 - Remove the lock to check all workers are completed.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1058 - Try the exception to list nodes.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1057 - Kill the process to execute nvidia_gpu.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1056 - Set timeout seconds in list_namespaced_pod
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1055 - Fix the image of the job to test fault tolerance.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
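#1054 - Does the elastic training mentioned here have to use the PS architecture? Given the bandwidth limits of the PS architecture, it is probably rarely used for training large models nowadays, right?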
Issue -
State: closed - Opened by liubowen8 6 months ago
- 7 comments
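#1054 - Does the elastic training mentioned here have to use the PS architecture? Given the bandwidth limits of the PS architecture, it is probably rarely used for training large models nowadays, right?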
Issue -
State: open - Opened by liubowen8 6 months ago
- 7 comments
#1053 - Fix the iteration step in epochs.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1052 - Can you share the training cases on Huawei acceleration card?
Issue -
State: closed - Opened by felix0080 6 months ago
- 1 comment
#1052 - Can you share the training cases on Huawei acceleration card?
Issue -
State: open - Opened by felix0080 6 months ago
- 1 comment
#1051 - Fix the bug to save and load DDP checkpoints.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1050 - Fix the typo errors.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1049 - Explanation of Figure 3 in the case introduction
Issue -
State: closed - Opened by MercuryLc 6 months ago
#1048 - Update 20240313
Pull Request -
State: closed - Opened by adamantboy 6 months ago
- 1 comment
#1047 - Do not remove exited node if remove_exited_node is not enabled.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1046 - Set the shm as the memory in the job of nanogpt.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1045 - how to use Flash Checkpoint for huggingface trainer job
Issue -
State: closed - Opened by Lzhang-hub 6 months ago
#1044 - add util for loss spike save and decode.
Pull Request -
State: open - Opened by haikuotiankong1212 7 months ago
- 2 comments
#1043 - The node reports a socket.gaierror to the master.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1042 - Flash Checkpoint supports saving the checkpoint of moe optimizers.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1041 - Early stop if the available node is less than min nodes.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1040 - Fix the module path of accelerator.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1039 - Do not relaunch node if the relaunchable=False
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1038 - Add WeChat QR of AI Infra into the Readme.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1037 - llama2 failed
Issue -
State: open - Opened by wwj-2017-1117 7 months ago
- 1 comment
#1036 - Set the relaunchable to False after relaunching a node.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1035 - Doc: Design to implement checkpoint in memory of nodes.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1034 - Fix the checkpoint name to load for DDP.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1033 - Update atorch precommit mypy version to v0.981
Pull Request -
State: closed - Opened by adamantboy 7 months ago
- 1 comment
#1032 - Read the job node under a lock.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1031 - Update atorch precommit mypy version to v0.981
Pull Request -
State: closed - Opened by adamantboy 7 months ago
- 2 comments
#1030 - save Megatron-LM MOE model error with flash checkpoint
Issue -
State: closed - Opened by TING2938 7 months ago
- 2 comments
#1029 - need example of llama with flash checkpoint
Issue -
State: closed - Opened by yiyuanyu17 7 months ago
- 1 comment
#1028 - Fix the invalid link in README.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1027 - ATorch pre-commit fails due to mypy
Issue -
State: closed - Opened by skydoorkai 7 months ago
#1026 - A tutorial to extend the fault tolerance in DLRover.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1025 - Mark the node as unschedulable if the node fails.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1024 - Fix atorch pre-commit
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1023 - Implement the scripts to check node for different AI chips.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1022 - Remove the existing nodes and simplify the log of scaleplan.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1021 - Do not reset the meta of checkpoint to empty before restarting workers.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1020 - Polish the blog to introduce the fault-tolerance of DLRover.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1019 - Fix the order of file names in the test of DeepSpeed CKPT.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1018 - Set the link of the script to check node.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1017 - [RFC] Welcome to give requirements to use DLRover on nodes with AI chips.
Issue -
State: open - Opened by workingloong 7 months ago
#1016 - Always relaunch the failed pod for allreduce jobs.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1015 - Set an ENV to start the monitor in the agent.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1014 - Add figures into megatron flash ckpt blogs.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1013 - The job master diagnoses whether the worker is dead by heartbeat.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1012 - Update the Flash Checkpoint APIs for Megatron in blogs.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1011 - Mark the error as node error if the pod fails due to hardware errors.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1010 - The Job master monitors the worker status by heartbeat.
Issue -
State: closed - Opened by workingloong 7 months ago
#1009 - Blogs to introduce saving/loading Megatron-LM checkpoint.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1008 - Retry to check the master service.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1007 - Relaunch the node if the node breaks down when checking the network.
Pull Request -
State: closed - Opened by workingloong 7 months ago
- 1 comment
#1006 - The ElasticJob can recover the master pod if it is evicted.
Pull Request -
State: closed - Opened by workingloong 8 months ago
- 1 comment
#1005 - We are going to build an LLM training agent to help search for training strategies and babysit model training to maximize MFU and effective training time.
Issue -
State: open - Opened by hxdtest 8 months ago
- 1 comment
#1004 - Use master Pod IP if the master service is not available.
Pull Request -
State: closed - Opened by workingloong 8 months ago
- 1 comment
#1003 - Fail to connect the master Pod.
Issue -
State: closed - Opened by workingloong 8 months ago
#1002 - Cast the factor to cut CPU and memory to a float.
Pull Request -
State: closed - Opened by workingloong 8 months ago
- 1 comment
#1001 - fix the missing arguments max-nodes.
Pull Request -
State: closed - Opened by workingloong 8 months ago
- 1 comment
#1000 - Fix failure node report
Pull Request -
State: closed - Opened by samplise 8 months ago
- 1 comment
#999 - The job master only prints the exit reason if the pod fails.
Pull Request -
State: closed - Opened by workingloong 8 months ago
- 1 comment
#998 - fix saver factory for creation after training process restarts.
Pull Request -
State: closed - Opened by liyzcj 8 months ago
- 1 comment
#996 - Set the owner reference of Pod as the elastic job.
Pull Request -
State: closed - Opened by workingloong 8 months ago
- 1 comment