Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / intelligent-machine-learning/dlrover issues and pull requests

#1183 - Sync internal modification.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment

#1182 - Multi issue fixed.

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1181 - Improve training port conflict avoid

Pull Request - State: closed - Opened by samplise 3 months ago - 1 comment

#1179 - add logging

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment

#1178 - Remove the debug code to print variables.

Pull Request - State: closed - Opened by workingloong 3 months ago - 1 comment

#1177 - fix ShardCkptReplicaManager when replica = 0

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: bug

#1176 - update atorch to 062024

Pull Request - State: open - Opened by skydoorkai 3 months ago - 2 comments

#1174 - optimize job manager structure

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment

#1173 - Pod scaler enhancement: support concurrent creation

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 4 comments
Labels: enhancement

#1172 - How to use the elasticity and fault tolerance in a Volcano job.

Issue - State: open - Opened by workingloong 3 months ago - 1 comment

#1171 - Hccl port conflict detection

Pull Request - State: closed - Opened by samplise 3 months ago - 1 comment

#1170 - Update paper messages.

Pull Request - State: closed - Opened by youxingling 3 months ago - 1 comment

#1169 - optimize auto scaler

Pull Request - State: closed - Opened by BalaBalaYi 3 months ago - 1 comment
Labels: enhancement

#1168 - Add socket close v2

Pull Request - State: open - Opened by yangrudan 3 months ago - 3 comments

#1167 - The agent joins the rendezvous with the node id.

Pull Request - State: closed - Opened by workingloong 3 months ago - 1 comment

#1166 - Modify the controller's resource allocation settings.

Pull Request - State: open - Opened by Filtee 3 months ago - 1 comment

#1166 - Modify the controller's resource allocation settings.

Pull Request - State: closed - Opened by Filtee 3 months ago - 1 comment

#1165 - Fix request spelling error and add socket resource release

Pull Request - State: closed - Opened by yangrudan 4 months ago - 2 comments

#1164 - manifests: generate clientset/listers/informers into pkg/client package

Pull Request - State: closed - Opened by kidlj 4 months ago - 2 comments

#1163 - optimize svc operation

Pull Request - State: closed - Opened by BalaBalaYi 4 months ago - 1 comment
Labels: bug, enhancement

#1162 - torch version compatible

Pull Request - State: closed - Opened by Lzhang-hub 4 months ago - 1 comment

#1161 - add wait in service check

Pull Request - State: closed - Opened by BalaBalaYi 4 months ago - 1 comment
Labels: bug

#1160 - add log for master svc check

Pull Request - State: closed - Opened by BalaBalaYi 4 months ago - 1 comment
Labels: enhancement

#1159 - xpu timer python package

Issue - State: open - Opened by zxyyzx 4 months ago - 3 comments

#1158 - Cached sharding

Pull Request - State: closed - Opened by zhangbw17 4 months ago

#1157 - optimize rdzv logic during node relaunching

Pull Request - State: closed - Opened by BalaBalaYi 4 months ago - 1 comment
Labels: enhancement

#1156 - add specified ua for k8s client

Pull Request - State: closed - Opened by BalaBalaYi 4 months ago - 1 comment

#1155 - Update async-checkpoint.md

Pull Request - State: closed - Opened by cainiaogoroad 4 months ago - 1 comment

#1154 - Remove codes to get master pod.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1153 - Restore checkpoint replica from the shared memory of other nodes.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1152 - Polish the log of timeout to wait nodes.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1151 - fix the rdzv name to check node.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1150 - Fix the rdzv name to check node.

Pull Request - State: closed - Opened by workingloong 4 months ago

#1149 - Remove the codes to remove the checkpoint directory.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1148 - Set the nccl env to execute gpu test task.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1145 - The restarted node can acquire the full checkpoint in the peer node.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1144 - Error encountered when using flash checkpoint

Issue - State: open - Opened by chencjcj 4 months ago - 1 comment

#1143 - Enlarge master pod retrieving interval and waiting time.

Pull Request - State: closed - Opened by BalaBalaYi 4 months ago - 1 comment
Labels: enhancement

#1142 - Fix the tutorial to run the example of auto scaling the TF job.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1141 - Fix the example of deeprec to train the deepfm model with CRITEO dataset.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1140 - The nodes back up the checkpoint data in the shared memory.

Pull Request - State: closed - Opened by workingloong 4 months ago - 1 comment

#1139 - collect error content

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: enhancement

#1138 - straggler-detection

Issue - State: closed - Opened by alex337 5 months ago - 5 comments

#1137 - Design to backup checkpoint shards between nodes.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1136 - example failed: examples/tensorflow/criteo_deeprec/manual_job.yaml

Issue - State: open - Opened by jason-i-vv 5 months ago - 1 comment

#1135 - Incomplete save of ckpt files

Issue - State: open - Opened by husky23333 5 months ago - 3 comments

#1134 - List Go version 1.18 in the docs.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1133 - Refactor the test cases of SimpleStrategyGenerator

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1131 - optimize master retrieving

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: enhancement

#1130 - Support torch>=2.3.0

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1129 - Error llama2 demo with pytorch 2.3.0

Issue - State: closed - Opened by SwordFaith 5 months ago

#1128 - megatron-lm new api

Pull Request - State: closed - Opened by Lzhang-hub 5 months ago - 1 comment

#1127 - Tune up the sleep interval to query the new rdzv.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1126 - optimize MODIFIED event with deletion info

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: enhancement

#1125 - fix pod event converting invocation

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: bug

#1124 - add log for pod event watcher

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: enhancement

#1123 - possible typo in the example of [tf_elasticjob_on_k8s]

Issue - State: closed - Opened by lichadehehehe 5 months ago - 1 comment

#1122 - Ignore pod deleted event if pod failover is already done by k8s

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: enhancement, dlrover-master

#1120 - Remove unused logs of checking gpu.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1119 - Add Bayesian Optimization for DLRover Brain

Pull Request - State: closed - Opened by yzlnew 5 months ago - 2 comments

#1118 - Test allreduce performance after checking node health.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1117 - Polish the log of executing matmul ops.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1116 - Remove empty.proto in the elastic_training.proto

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1115 - make deploy IMG=easydl/elasticjob-controller:master

Issue - State: open - Opened by yangzhipeng1108 5 months ago - 1 comment

#1114 - Set join timeout value as timeout in rdzv params.

Issue - State: closed - Opened by workingloong 5 months ago - 1 comment

#1113 - OSError: [Errno 98] Address already in use

Issue - State: closed - Opened by chencjcj 5 months ago - 3 comments

#1112 - WIP: Diagnose training hang

Pull Request - State: open - Opened by samplise 5 months ago

#1111 - Refactor /examples/pytorch/NanoGPT/*_train.py

Pull Request - State: closed - Opened by Filtee 5 months ago - 1 comment

#1110 - What is the Go code in dlrover for, and how does it relate to the Python package?

Issue - State: closed - Opened by HongLouyemeng 5 months ago - 8 comments

#1108 - Refactor examples/pytorch/NanoGPT/

Pull Request - State: closed - Opened by Filtee 5 months ago - 3 comments

#1107 - Refactored example/pytorch/NanoGPT.

Pull Request - State: closed - Opened by Filtee 5 months ago

#1106 - Fix the bug to check whether a dict contains another dict.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1105 - fix logging

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: bug

#1104 - Fix unittest cases.

Pull Request - State: closed - Opened by workingloong 5 months ago

#1103 - Refactor the job collector to collect the count of failover.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1102 - Remove the exited node from the waiting nodes of rdzv.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1101 - optimize master pod retrieving

Pull Request - State: closed - Opened by BalaBalaYi 5 months ago - 1 comment
Labels: enhancement, dlrover-master

#1100 - Set the deletion strategy and timeout in the checkpointers.

Pull Request - State: closed - Opened by workingloong 5 months ago - 2 comments
Labels: flash checkpoint

#1099 - Save/load the non-params-related variables of dist optimizer.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1098 - atorch-v0.1.9

Pull Request - State: closed - Opened by adamantboy 5 months ago - 2 comments

#1097 - After automatic fault recovery, the loss oscillates abnormally after load (flash ckpt)

Issue - State: closed - Opened by suenphey 5 months ago - 1 comment

#1096 - Diagnosis manager implementation in the master

Pull Request - State: closed - Opened by samplise 5 months ago - 1 comment

#1095 - Add the news about Flash Ckpt supports HuggingFace Trainer.

Pull Request - State: closed - Opened by workingloong 5 months ago - 1 comment

#1094 - Set pull request target to run codecov with secret.

Pull Request - State: closed - Opened by workingloong 6 months ago

#1092 - The agent waits for async saving checkpoint before exiting.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1091 - Set the version to 0.3.6rc0.

Pull Request - State: closed - Opened by workingloong 6 months ago - 1 comment

#1090 - delete useless dockerfile

Pull Request - State: closed - Opened by cailun01 6 months ago - 1 comment