intelligent-machine-learning/dlrover issues and pull requests

#1089 - Use torch.save to save the training args.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1088 - Diagnosis monitor implementation

Pull Request - State: closed - Opened by samplise 6 months ago - 1 comment

#1087 - Job master does not wait for workers if there are exited workers.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1086 - The job master hangs when there is only one worker and the worker is preempted.

Issue - State: closed - Opened by workingloong 6 months ago

#1085 - Megatron-LM checkpointer loads the checkpoint from the memory.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1084 - Fix the performance table of flash checkpoint in Megatron-LM.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1083 - Add the commit link of Megatron-LM in the benchmark of flash checkpoint.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1082 - Simplify the logs to show node ranks in the rendezvous.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1081 - Set the rdzv nodes as an OrderedDict.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1080 - ElasticAgent does not sort ranks.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1079 - hf trainer with flash checkpoint hangs when save_to_memory

Issue - State: closed - Opened by Lzhang-hub 6 months ago - 12 comments

#1078 - Save the FSDP checkpoint with full state dict.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1077 - Speed up the unit test cases.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1076 - Sort the ranks of nodes by the switches.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1075 - Use Gang Scheduling in ElasticJob of DLRover.

Issue - State: open - Opened by workingloong 6 months ago

#1074 - Fault Diagnosis Design

Pull Request - State: closed - Opened by samplise 6 months ago - 2 comments

#1073 - The node check finishes if the record files are ready.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1072 - Implement the FlashCkptTrainer to async save checkpoint of hf trainer.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1071 - set checkpoint config no_reentrant default as False

Pull Request - State: closed - Opened by skydoorkai 6 months ago

#1070 - Skip TestLLama2Util on cpu

Pull Request - State: closed - Opened by adamantboy 6 months ago

#1069 - Don't release the lock of shared memory before restarting workers.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1068 - The job stops restarting workers and exits if the traceback is a code bug.

Issue - State: open - Opened by workingloong 6 months ago - 2 comments
Labels: enhancement, question

#1067 - add npu examples

Pull Request - State: closed - Opened by cailun01 6 months ago

#1066 - Skip saving checkpoint at the breakpoint if some nodes fail.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1065 - load_checkpoint failed when using Megatron flash checkpoint because tracker_file is not saved by dlrover

Issue - State: closed - Opened by deepcoldfish 6 months ago - 2 comments

#1064 - Support passing load_strategy to AtorchTrainer.

Pull Request - State: closed - Opened by rockychan724 6 months ago - 1 comment

#1063 - Fix the figure of ElasticJob.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1062 - Fix the image path of elasticjob architecture.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1061 - Optimize doc for dev.

Pull Request - State: closed - Opened by BalaBalaYi 6 months ago - 1 comment
Labels: documentation

#1060 - Fix the module path in the tutorial of node detection.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1059 - Remove the lock to check all workers are completed.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1058 - Try the exception to list nodes.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1057 - Kill the process to execute nvidia_gpu.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1056 - Set timeout seconds in list_namespaced_pod

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1055 - Fix the image of the job to test fault tolerance.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

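#1054 - Does the elastic training mentioned here have to use the PS architecture? Given the bandwidth limits of the PS architecture, it is probably rarely used for training large models now, right?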

Issue - State: closed - Opened by liubowen8 6 months ago - 7 comments

#1053 - Fix the iteration step in epochs.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1052 - Can you share the training cases on Huawei acceleration card?

Issue - State: closed - Opened by felix0080 6 months ago - 1 comment

#1051 - Fix the bug to save and load DDP checkpoints.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1050 - Fix the typo errors.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1049 - Explanation of Figure 3 in the case introduction

Issue - State: closed - Opened by MercuryLc 6 months ago

#1048 - Update 20240313

Pull Request - State: closed - Opened by adamantboy 6 months ago - 1 comment

#1047 - Do not remove exited node if remove_exited_node is not enabled.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1046 - Set the shm as the memory in the job of nanogpt.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1045 - how to use Flash Checkpoint for huggingface trainer job

Issue - State: closed - Opened by Lzhang-hub 6 months ago

#1044 - add util for loss spike save and decode.

Pull Request - State: open - Opened by haikuotiankong1212 7 months ago - 2 comments

#1043 - The node reports a socket.gaierror to the master.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1042 - Flash Checkpoint supports saving the checkpoint of moe optimizers.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1041 - Early stop if the available node is less than min nodes.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1040 - Fix the module path of accelerator.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1039 - Do not relaunch node if the relaunchable=False

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1038 - Add WeChat QR of AI Infra into the Readme.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1037 - llama2 failed

Issue - State: open - Opened by wwj-2017-1117 7 months ago - 1 comment

#1036 - Set the relaunchable to False after relaunching a node.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1035 - Doc: Design to implement checkpoint in memory of nodes.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1034 - Fix the checkpoint name to load for DDP.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1033 - Update atorch precommit mypy version to v0.981

Pull Request - State: closed - Opened by adamantboy 7 months ago - 1 comment

#1032 - Read the job node under a lock.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1031 - Update atorch precommit mypy version to v0.981

Pull Request - State: closed - Opened by adamantboy 7 months ago - 2 comments

#1030 - save Megatron-LM MOE model error with flash checkpoint

Issue - State: closed - Opened by TING2938 7 months ago - 2 comments

#1029 - need example of llama with flash checkpoint

Issue - State: closed - Opened by yiyuanyu17 7 months ago - 1 comment

#1028 - Fix the invalid link in README.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1027 - ATorch pre-commit fails due to mypy

Issue - State: closed - Opened by skydoorkai 7 months ago

#1026 - A tutorial to extend the fault tolerance in DLRover.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1025 - Mark the node as unschedulable if the node fails.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1024 - Fix atorch pre-commit

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1023 - Implement the scripts to check node for different AI chips.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1022 - Remove the existing nodes and simplify the log of scaleplan.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1021 - Do not reset the meta of checkpoint to empty before restarting workers.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1020 - Polish the blog to introduce the fault-tolerance of DLRover.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1019 - Fix the order of file names in the test of DeepSpeed CKPT.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1018 - Set the link of the script to check node.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1017 - [RFC] Welcome to give requirements to use DLRover on nodes with AI chips.

Issue - State: open - Opened by workingloong 7 months ago

#1016 - Always relaunch the failed pod for allreduce jobs.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1015 - Set an ENV to start the monitor in the agent.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1014 - Add figures into megatron flash ckpt blogs.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1013 - The job master diagnoses whether the worker is dead by heartbeat.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1012 - Update the Flash Checkpoint APIs for Megatron in blogs.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1011 - Mark the error as node error if the pod fails due to hardware errors.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1010 - The Job master monitors the worker status by heartbeat.

Issue - State: closed - Opened by workingloong 7 months ago

#1009 - Blogs to introduce saving/loading Megatron-LM checkpoint.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1008 - Retry to check the master service.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1007 - Relaunch the node if the node breaks down when checking the network.

Pull Request - State: closed - Opened by workingloong 7 months ago - 1 comment

#1006 - The ElasticJob can recover the master pod if it is evicted.

Pull Request - State: closed - Opened by workingloong 8 months ago - 1 comment

#1005 - We are going to build an LLM training agent to help search for training strategies and babysit model training to maximize MFU and effective training time.

Issue - State: open - Opened by hxdtest 8 months ago - 1 comment

#1004 - Use master Pod IP if the master service is not available.

Pull Request - State: closed - Opened by workingloong 8 months ago - 1 comment

#1003 - Fail to connect the master Pod.

Issue - State: closed - Opened by workingloong 8 months ago

#1002 - Cast the factor to cut CPU and memory to a float.

Pull Request - State: closed - Opened by workingloong 8 months ago - 1 comment

#1001 - fix the missing arguments max-nodes.

Pull Request - State: closed - Opened by workingloong 8 months ago - 1 comment

#1000 - Fix failure node report

Pull Request - State: closed - Opened by samplise 8 months ago - 1 comment

#999 - The job master only prints the exit reason if the pod fails.

Pull Request - State: closed - Opened by workingloong 8 months ago - 1 comment

#998 - fix saver factory for creation after training process restarts.

Pull Request - State: closed - Opened by liyzcj 8 months ago - 1 comment

#996 - Set the owner reference of Pod as the elastic job.

Pull Request - State: closed - Opened by workingloong 8 months ago - 1 comment
