Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / intelligent-machine-learning/dlrover issues and pull requests
#1283 - kill all the children processes in ascend npu
Pull Request -
State: closed - Opened by majieyue 6 days ago
#1282 - Compatible torch 2.4.x (0926)
Pull Request -
State: closed - Opened by BalaBalaYi 7 days ago
- 1 comment
#1281 - Upgrade version 0.3.8
Pull Request -
State: closed - Opened by BalaBalaYi 9 days ago
- 1 comment
#1280 - upgrade version 0.3.8
Pull Request -
State: closed - Opened by BalaBalaYi 9 days ago
- 1 comment
#1279 - Compatible with torch2.4
Pull Request -
State: closed - Opened by BalaBalaYi 9 days ago
- 1 comment
#1278 - Enlarge heartbeat timeout
Pull Request -
State: closed - Opened by BalaBalaYi 12 days ago
- 1 comment
#1277 - Worker get elastic run config from master
Pull Request -
State: open - Opened by samplise 12 days ago
- 3 comments
#1276 - Optimize logging
Pull Request -
State: closed - Opened by BalaBalaYi 16 days ago
- 1 comment
Labels: enhancement
#1275 - DLRover - Flyte integration
Issue -
State: open - Opened by davidmirror-ops 16 days ago
- 2 comments
Labels: collaboration
#1274 - add RELEASES.md MAINTAINERS.md CONTRIBUTING.md CODE_OF_CONDUCT.md
Pull Request -
State: closed - Opened by majieyue 17 days ago
- 1 comment
#1273 - Fix bug.
Pull Request -
State: closed - Opened by BalaBalaYi 18 days ago
Labels: bug
#1272 - Filter error code log
Pull Request -
State: closed - Opened by samplise 23 days ago
- 1 comment
#1271 - Skip empty error codes when check node failures
Pull Request -
State: closed - Opened by samplise 24 days ago
- 1 comment
#1270 - Add ut for log collecting.
Pull Request -
State: closed - Opened by BalaBalaYi 24 days ago
- 1 comment
#1269 - Filter non training logs
Pull Request -
State: closed - Opened by samplise 24 days ago
- 1 comment
#1268 - Make diagnosis agent singleton
Pull Request -
State: closed - Opened by samplise 24 days ago
- 1 comment
#1267 - Fix diagnosis configure bugs
Pull Request -
State: closed - Opened by samplise 24 days ago
- 1 comment
#1266 - missing elastic_training_pb2
Issue -
State: open - Opened by NiushanDong 24 days ago
- 1 comment
#1265 - optimize check abnormal nodes
Pull Request -
State: closed - Opened by BalaBalaYi 24 days ago
- 1 comment
Labels: enhancement
#1264 - optimize heartbeat collect
Pull Request -
State: closed - Opened by BalaBalaYi 25 days ago
- 1 comment
Labels: enhancement
#1263 - Flash checkpoint does not support safetensors
Issue -
State: open - Opened by Alex-Ruan 26 days ago
#1262 - optimize diagnose logging
Pull Request -
State: closed - Opened by BalaBalaYi 26 days ago
- 1 comment
#1261 - fix socket error
Pull Request -
State: closed - Opened by BalaBalaYi 26 days ago
- 1 comment
Labels: bug
#1260 - Errors in dlrover, after pip installed the dlrover package
Issue -
State: open - Opened by Desperadoze 27 days ago
- 2 comments
#1259 - Fix duplicate pod relaunching for some cases(with internal k8s).
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: bug
#1258 - Optimize pending timeout using
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: enhancement
#1257 - Optimize and fix events expose
Pull Request -
State: closed - Opened by samplise about 1 month ago
- 1 comment
#1256 - deepspeed zero3 also save ckpt only in rank 0?
Issue -
State: closed - Opened by Alex-Ruan about 1 month ago
- 1 comment
#1255 - Skip pending timeout when timeout=0.
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: enhancement
#1254 - Fix several issues when using fsdp checkpointer.
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: bug, enhancement
#1253 - Optimize network-check.
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: enhancement
#1252 - Optimize training ending
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: enhancement
#1251 - Fix path creation in fsdp dcp saver.
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: bug
#1250 - Sse rdzv timeout as insufficient timeout
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: enhancement
#1249 - Fix rdzv updating in concurrency
Pull Request -
State: closed - Opened by BalaBalaYi about 1 month ago
- 1 comment
Labels: bug
#1248 - Can you create a dlrover arm64 image for Ascend NPU?
Issue -
State: open - Opened by xmarker about 1 month ago
- 1 comment
#1247 - Add alive pod stats variable.
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
Labels: enhancement
#1246 - Optimize logging and revert using random to create socket
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
#1245 - Revert "【WIP】Temp solution for socket conflict."
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
#1244 - Question: How DLRover integrate with Llama Factory?
Issue -
State: open - Opened by hetingyou about 2 months ago
- 1 comment
#1243 - What is the relationship with DLRover and Megatron? Can I integrate DLRover with Megatron with fault-tolerance and monitoring capabilities. How DLRover can recover from GPU offline problems with TP and PP needing to be reorganized?
Issue -
State: open - Opened by dotsonliu about 2 months ago
- 1 comment
#1242 - Add validation for 'critical_worker_index'
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
Labels: enhancement
#1241 - Update dignding
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
#1240 - update ding group
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
#1239 - Fix type.
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
Labels: bug
#1238 - Skip 'should early stop' for non all reduce job.
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
#1237 - Remove error code 128 from 'hardware-error'
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
Labels: enhancement
#1236 - Fix master client setup in ckpt saver.
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
Labels: bug
#1235 - Optimize checkpointing.
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
Labels: enhancement
#1234 - Refactor diagnose agent
Pull Request -
State: closed - Opened by samplise about 2 months ago
- 1 comment
#1233 - while using megatron distributed flash-checkpoint to recover, error occurs when load_checkpoint
Issue -
State: open - Opened by deepcoldfish about 2 months ago
#1232 - add logging for writing error
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
#1231 - Update dlrover event action
Pull Request -
State: closed - Opened by samplise about 2 months ago
- 1 comment
#1230 - Fix fsdp dcp saver
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
Labels: bug
#1229 - Fix flash ckpt
Pull Request -
State: closed - Opened by BalaBalaYi about 2 months ago
- 1 comment
Labels: bug, enhancement, flash checkpoint
#1228 - Optimize diagnosis structure
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: dlrover-master
#1227 - 【WIP】Temp solution for socket conflict.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: bug
#1226 - fix unittest error: AttributeError: ElasticLaunchConfig object has no attribute tee
Pull Request -
State: closed - Opened by majieyue 2 months ago
- 1 comment
#1225 - Why model_optim_rng.pt is saved in a separate directory?
Issue -
State: open - Opened by zhaoyang-star 2 months ago
- 7 comments
#1224 - Optimize test for ascend NPU.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: enhancement
#1223 - Why model_optim_rng.pt is not saved when enable dlrover?
Issue -
State: closed - Opened by zhaoyang-star 2 months ago
#1222 - easydl/elasticjob-controller:master image pull error
Issue -
State: open - Opened by xywangbuaa 2 months ago
- 1 comment
#1221 - transformers version?
Issue -
State: closed - Opened by Alex-Ruan 2 months ago
- 1 comment
#1220 - Optimize pending judgement: when all nodes pending
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: enhancement
#1219 - 【WIP】add pod diagnosis feature
Pull Request -
State: open - Opened by xiaochaoren 2 months ago
- 1 comment
#1218 - remove xpu-timer
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 3 comments
#1217 - Fix scaler async execution.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: bug
#1216 - Optimize logging in rdzf manager.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: enhancement
#1215 - scale down allreduce pytorch job won't complete and report error
Issue -
State: open - Opened by cocodee 2 months ago
- 1 comment
#1214 - Copy tensor to the shared memory without grad.
Pull Request -
State: closed - Opened by workingloong 2 months ago
- 1 comment
#1213 - Add signal timeout for 'stop_workers'
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 2 comments
Labels: enhancement
#1212 - Support action timeout processing.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: enhancement
#1211 - Fix user-agent issue.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: bug
#1210 - Resolve pending and insufficient nodes issue.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: enhancement, dlrover-master
#1209 - Keep conflict processing when env set
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
- 1 comment
Labels: enhancement
#1208 - When performing multi-node, multi-GPU training with Megatron-LM, if the 'rank' is only input in the startup script and not set in the environment variables, an exception may occur (storage type is disk)
Issue -
State: open - Opened by lkq51 2 months ago
- 2 comments
#1207 - Optimize node status from pod phase.
Pull Request -
State: closed - Opened by BalaBalaYi 2 months ago
Labels: enhancement, dlrover-master
#1206 - Support optional debug log level.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement, dlrover-master
#1205 - Enhancement of hccl port config resolution.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1204 - Increase heartbeat timeout
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1203 - Expose important event
Pull Request -
State: closed - Opened by samplise 3 months ago
- 1 comment
#1202 - Add default owners.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
#1201 - add more network check log
Pull Request -
State: closed - Opened by alpha-baby 3 months ago
- 2 comments
#1200 - Fix heartbeat when there is node relaunched.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: bug
#1199 - [Error] When using deepspeed to start a megatron training task, only rank 0 of the flash checkpoint saves the model
Issue -
State: open - Opened by liangxuZhang 3 months ago
- 1 comment
#1198 - Add try except for getting dead node.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
Labels: enhancement
#1197 - fix time type issue
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: bug
#1196 - Add failure reporting for async ckpt saver.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 2 comments
Labels: enhancement
#1195 - What's the difference between MegatronCheckpointEngine and MegatronDistCheckpointEngine?
Issue -
State: closed - Opened by liangxuZhang 3 months ago
#1194 - optimize heartbeat logging
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1193 - optimize elastic-run logging
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1192 - Fix log file bugs
Pull Request -
State: closed - Opened by samplise 3 months ago
Labels: bug
#1191 - Optimize hccl port detection
Pull Request -
State: closed - Opened by samplise 3 months ago
- 1 comment
#1190 - Optimize failure node detection
Pull Request -
State: closed - Opened by samplise 3 months ago
- 1 comment
#1189 - Fix heart beat for concurrency.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 2 comments
Labels: bug
#1188 - Add std version output for agent.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1187 - Why can't checkpoint be copied asynchronously to shared memory when using Flash Checkpoint?
Issue -
State: closed - Opened by Reflect0 3 months ago
- 1 comment
#1186 - fix exception when plan is none
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: bug
#1185 - Skip restart training process on failure nodes
Pull Request -
State: closed - Opened by samplise 3 months ago
- 1 comment
#1184 - Unify job manager's stop status field
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement