Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / intelligent-machine-learning/dlrover issues and pull requests

#1283 - kill all the children processes in ascend npu

Pull Request - State: closed - Opened by majieyue 6 days ago

#1282 - Compatible torch 2.4.x (0926)

Pull Request - State: closed - Opened by BalaBalaYi 7 days ago - 1 comment

#1281 - Upgrade version 0.3.8

Pull Request - State: closed - Opened by BalaBalaYi 9 days ago - 1 comment

#1280 - upgrade version 0.3.8

Pull Request - State: closed - Opened by BalaBalaYi 9 days ago - 1 comment

#1279 - Compatible with torch2.4

Pull Request - State: closed - Opened by BalaBalaYi 9 days ago - 1 comment

#1278 - Enlarge heartbeat timeout

Pull Request - State: closed - Opened by BalaBalaYi 12 days ago - 1 comment

#1277 - Worker get elastic run config from master

Pull Request - State: open - Opened by samplise 12 days ago - 3 comments

#1276 - Optimize logging

Pull Request - State: closed - Opened by BalaBalaYi 16 days ago - 1 comment
Labels: enhancement

#1275 - DLRover - Flyte integration

Issue - State: open - Opened by davidmirror-ops 16 days ago - 2 comments
Labels: collaboration

#1274 - add RELEASES.md MAINTAINERS.md CONTRIBUTING.md CODE_OF_CONDUCT.md

Pull Request - State: closed - Opened by majieyue 17 days ago - 1 comment

#1273 - Fix bug.

Pull Request - State: closed - Opened by BalaBalaYi 18 days ago
Labels: bug

#1272 - Filter error code log

Pull Request - State: closed - Opened by samplise 23 days ago - 1 comment

#1271 - Skip empty error codes when check node failures

Pull Request - State: closed - Opened by samplise 23 days ago - 1 comment

#1270 - Add ut for log collecting.

Pull Request - State: closed - Opened by BalaBalaYi 23 days ago - 1 comment

#1269 - Filter non training logs

Pull Request - State: closed - Opened by samplise 24 days ago - 1 comment

#1268 - Make diagnosis agent singleton

Pull Request - State: closed - Opened by samplise 24 days ago - 1 comment

#1267 - Fix diagnosis configure bugs

Pull Request - State: closed - Opened by samplise 24 days ago - 1 comment

#1266 - missing elastic_training_pb2

Issue - State: open - Opened by NiushanDong 24 days ago - 1 comment

#1265 - optimize check abnormal nodes

Pull Request - State: closed - Opened by BalaBalaYi 24 days ago - 1 comment
Labels: enhancement

#1264 - optimize heartbeat collect

Pull Request - State: closed - Opened by BalaBalaYi 25 days ago - 1 comment
Labels: enhancement

#1263 - Flash checkpoint does not support safetensors

Issue - State: open - Opened by Alex-Ruan 25 days ago

#1262 - optimize diagnose logging

Pull Request - State: closed - Opened by BalaBalaYi 26 days ago - 1 comment

#1261 - fix socket error

Pull Request - State: closed - Opened by BalaBalaYi 26 days ago - 1 comment
Labels: bug

#1260 - Errors in dlrover, after pip installed the dlrover package

Issue - State: open - Opened by Desperadoze 27 days ago - 2 comments

#1259 - Fix duplicate pod relaunching for some cases(with internal k8s).

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: bug

#1258 - Optimize pending timeout using

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: enhancement

#1257 - Optimize and fix events expose

Pull Request - State: closed - Opened by samplise about 1 month ago - 1 comment

#1256 - deepspeed zero3 also save ckpt only in rank 0?

Issue - State: closed - Opened by Alex-Ruan about 1 month ago - 1 comment

#1255 - Skip pending timeout when timeout=0.

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: enhancement

#1254 - Fix several issues when using fsdp checkpointer.

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: bug, enhancement

#1253 - Optimize network-check.

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: enhancement

#1252 - Optimize training ending

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: enhancement

#1251 - Fix path creation in fsdp dcp saver.

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: bug

#1250 - Sse rdzv timeout as insufficient timeout

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: enhancement

#1249 - Fix rdzv updating in concurrency

Pull Request - State: closed - Opened by BalaBalaYi about 1 month ago - 1 comment
Labels: bug

#1248 - Can you create a dlrover arm64 image for Ascend NPU?

Issue - State: open - Opened by xmarker about 1 month ago - 1 comment

#1247 - Add alive pod stats variable.

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment
Labels: enhancement

#1246 - Optimize logging and revert using random to create socket

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment

#1245 - Revert "【WIP】Temp solution for socket conflict."

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment

#1244 - Question: How DLRover integrate with Llama Factory?

Issue - State: open - Opened by hetingyou about 2 months ago - 1 comment

#1242 - Add validation for 'critical_worker_index'

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment
Labels: enhancement

#1241 - Update dignding

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment

#1240 - update ding group

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment

#1239 - Fix type.

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment
Labels: bug

#1238 - Skip 'should early stop' for non all reduce job.

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment

#1237 - Remove error code 128 from 'hardware-error'

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment
Labels: enhancement

#1236 - Fix master client setup in ckpt saver.

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment
Labels: bug

#1235 - Optimize checkpointing.

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago
Labels: enhancement

#1234 - Refactor diagnose agent

Pull Request - State: closed - Opened by samplise about 2 months ago - 1 comment

#1232 - add logging for writing error

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment

#1231 - Update dlrover event action

Pull Request - State: closed - Opened by samplise about 2 months ago - 1 comment

#1230 - Fix fsdp dcp saver

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago
Labels: bug

#1229 - Fix flash ckpt

Pull Request - State: closed - Opened by BalaBalaYi about 2 months ago - 1 comment
Labels: bug, enhancement, flash checkpoint

#1228 - Optimize diagnosis structure

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: dlrover-master

#1227 - 【WIP】Temp solution for socket conflict.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: bug

#1225 - Why model_optim_rng.pt is saved in a separate directory?

Issue - State: open - Opened by zhaoyang-star 2 months ago - 7 comments

#1224 - Optimize test for ascend NPU.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: enhancement

#1222 - easydl/elasticjob-controller:master image pull error

Issue - State: open - Opened by xywangbuaa 2 months ago - 1 comment

#1221 - transformers version?

Issue - State: closed - Opened by Alex-Ruan 2 months ago - 1 comment

#1220 - Optimize pending judgement: when all nodes pending

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: enhancement

#1219 - 【WIP】add pod diagnosis feature

Pull Request - State: open - Opened by xiaochaoren 2 months ago - 1 comment

#1218 - remove xpu-timer

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 3 comments

#1217 - Fix scaler async execution.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: bug

#1216 - Optimize logging in rdzf manager.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: enhancement

#1215 - scale down allreduce pytorch job won't complete and report error

Issue - State: open - Opened by cocodee 2 months ago - 1 comment

#1214 - Copy tensor to the shared memory without grad.

Pull Request - State: closed - Opened by workingloong 2 months ago - 1 comment

#1213 - Add signal timeout for 'stop_workers'

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 2 comments
Labels: enhancement

#1212 - Support action timeout processing.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: enhancement

#1211 - Fix user-agent issue.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: bug

#1210 - Resolve pending and insufficient nodes issue.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: enhancement, dlrover-master

#1209 - Keep conflict processing when env set

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago - 1 comment
Labels: enhancement

#1207 - Optimize node status from pod phase.

Pull Request - State: closed - Opened by BalaBalaYi 2 months ago
Labels: enhancement, dlrover-master

#1206 - Support optional debug log level.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement, dlrover-master

#1205 - Enhancement of hccl port config resolution.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1204 - Increase heartbeat timeout

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1203 - Expose important event

Pull Request - State: closed - Opened by samplise 3 months ago - 1 comment

#1202 - Add default owners.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment

#1201 - add more network check log

Pull Request - State: closed - Opened by alpha-baby 3 months ago - 2 comments

#1200 - Fix heartbeat when there is node relaunched.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: bug

#1198 - Add try except for getting dead node.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago
Labels: enhancement

#1197 - fix time type issue

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: bug

#1196 - Add failure reporting for async ckpt saver.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 2 comments
Labels: enhancement

#1194 - optimize heartbeat logging

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1193 - optimize elastic-run logging

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1192 - Fix log file bugs

Pull Request - State: closed - Opened by samplise 3 months ago
Labels: bug

#1191 - Optimize hccl port detection

Pull Request - State: closed - Opened by samplise 3 months ago - 1 comment

#1190 - Optimize failure node detection

Pull Request - State: closed - Opened by samplise 3 months ago - 1 comment

#1189 - Fix heart beat for concurrency.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 2 comments
Labels: bug

#1188 - Add std version output for agent.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1186 - fix exception when plan is none

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: bug

#1185 - Skip restart training process on failure nodes

Pull Request - State: closed - Opened by samplise 3 months ago - 1 comment

#1184 - Unify job manager's stop status field

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement