Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / intelligent-machine-learning/dlrover issues and pull requests
#1183 - Sync internal modification.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
#1182 - Multi issue fixed.
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1181 - Improve training port conflict avoid
Pull Request -
State: closed - Opened by samplise 3 months ago
- 1 comment
#1180 - Error encountered while using flash attention in TensorFlow
Issue -
State: open - Opened by monatis 3 months ago
#1179 - add logging
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
#1178 - Remove the debug code to print variables.
Pull Request -
State: closed - Opened by workingloong 3 months ago
- 1 comment
#1177 - fix ShardCkptReplicaManager when replica = 0
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: bug
#1176 - update atorch to 062024
Pull Request -
State: open - Opened by skydoorkai 3 months ago
- 2 comments
#1175 - Worker pod stuck in Pending state causing TimeoutError and incorrect handling by master
Issue -
State: open - Opened by TheAriaYang 3 months ago
- 1 comment
#1174 - optimize job manager structure
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
#1173 - Pod scaler enhancement: support concurrent creation
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 4 comments
Labels: enhancement
#1172 - How to use the elasticity and fault tolerance in a Volcano job.
Issue -
State: open - Opened by workingloong 3 months ago
- 1 comment
#1171 - Hccl port conflict detection
Pull Request -
State: closed - Opened by samplise 3 months ago
- 1 comment
#1170 - Update paper messages.
Pull Request -
State: closed - Opened by youxingling 3 months ago
- 1 comment
#1169 - optimize auto scaler
Pull Request -
State: closed - Opened by BalaBalaYi 3 months ago
- 1 comment
Labels: enhancement
#1168 - Add socket close v2
Pull Request -
State: open - Opened by yangrudan 3 months ago
- 3 comments
#1167 - The agent joins the rendezvous with the node id.
Pull Request -
State: closed - Opened by workingloong 3 months ago
- 1 comment
#1166 - Modify the controller's resource allocation settings.
Pull Request -
State: open - Opened by Filtee 3 months ago
- 1 comment
#1166 - Modify the controller's resource allocation settings.
Pull Request -
State: closed - Opened by Filtee 3 months ago
- 1 comment
#1165 - Fix request spelling error and add socket resource release
Pull Request -
State: closed - Opened by yangrudan 4 months ago
- 2 comments
#1164 - manifests: generate clientset/listers/informers into pkg/client package
Pull Request -
State: closed - Opened by kidlj 4 months ago
- 2 comments
#1163 - optimize svc operation
Pull Request -
State: closed - Opened by BalaBalaYi 4 months ago
- 1 comment
Labels: bug, enhancement
#1162 - torch version compatible
Pull Request -
State: closed - Opened by Lzhang-hub 4 months ago
- 1 comment
#1161 - add wait in service check
Pull Request -
State: closed - Opened by BalaBalaYi 4 months ago
- 1 comment
Labels: bug
#1160 - add log for master svc check
Pull Request -
State: closed - Opened by BalaBalaYi 4 months ago
- 1 comment
Labels: enhancement
#1159 - xpu timer python package
Issue -
State: open - Opened by zxyyzx 4 months ago
- 3 comments
#1158 - Cached sharding
Pull Request -
State: closed - Opened by zhangbw17 4 months ago
#1157 - optimize rdzv logic during node relaunching
Pull Request -
State: closed - Opened by BalaBalaYi 4 months ago
- 1 comment
Labels: enhancement
#1156 - add specified ua for k8s client
Pull Request -
State: closed - Opened by BalaBalaYi 4 months ago
- 1 comment
#1155 - Update async-checkpoint.md
Pull Request -
State: closed - Opened by cainiaogoroad 4 months ago
- 1 comment
#1154 - Remove codes to get master pod.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1153 - Restore checkpoint replica from the shared memory of other nodes.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1152 - Polish the log of timeout to wait nodes.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1151 - fix the rdzv name to check node.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1150 - Fix the rdzv name to check node.
Pull Request -
State: closed - Opened by workingloong 4 months ago
#1149 - Remove the codes to remove the checkpoint directory.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1148 - Set the nccl env to execute gpu test task.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1147 - Megatron-LM core_r0.6.0 TP=4 save ckpt raise RuntimeError: Fail to set metadata!
Issue -
State: closed - Opened by SwordFaith 4 months ago
- 1 comment
#1146 - megatron-lm flash-ckpt can not save ckpt to disk when use pipeline parallel
Issue -
State: open - Opened by Lzhang-hub 4 months ago
- 6 comments
#1145 - The restarted node can acquire the full checkpoint in the peer node.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1144 - Error encountered when using flash checkpoint
Issue -
State: open - Opened by chencjcj 4 months ago
- 1 comment
#1143 - Enlarge master pod retrieving interval and waiting time.
Pull Request -
State: closed - Opened by BalaBalaYi 4 months ago
- 1 comment
Labels: enhancement
#1142 - Fix the tutorial to run the example of auto scaling the TF job.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1141 - Fix the example of deeprec to train the deepfm model with CRITEO dataset.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1140 - The nodes back up the checkpoint data in the shared memory.
Pull Request -
State: closed - Opened by workingloong 4 months ago
- 1 comment
#1139 - collect error content
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: enhancement
#1138 - straggler-detection
Issue -
State: closed - Opened by alex337 5 months ago
- 5 comments
#1137 - Design to backup checkpoint shards between nodes.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1136 - example failed: examples/tensorflow/criteo_deeprec/manual_job.yaml
Issue -
State: open - Opened by jason-i-vv 5 months ago
- 1 comment
#1135 - Incomplete save of ckpt files
Issue -
State: open - Opened by husky23333 5 months ago
- 3 comments
#1134 - List Go version 1.18 in the docs.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1133 - Refactor the test cases of SimpleStrategyGenerator
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1132 - [observability] OTEL Trace/Event for training rendezvous, gpu check, flash checkpoint, etc.
Issue -
State: open - Opened by liyzcj 5 months ago
Labels: enhancement
#1131 - optimize master retrieving
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: enhancement
#1130 - Support torch>=2.3.0
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1129 - Error llama2 demo with pytorch 2.3.0
Issue -
State: closed - Opened by SwordFaith 5 months ago
#1128 - megatron-lm new api
Pull Request -
State: closed - Opened by Lzhang-hub 5 months ago
- 1 comment
#1127 - Tune up the sleep interval to query the new rdzv.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1126 - optimize MODIFIED event with deletion info
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: enhancement
#1125 - fix pod event converting invocation
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: bug
#1124 - add log for pod event watcher
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: enhancement
#1123 - possible typo in the example of [tf_elasticjob_on_k8s]
Issue -
State: closed - Opened by lichadehehehe 5 months ago
- 1 comment
#1122 - Ignore pod deleted event if pod failover is already done by k8s
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: enhancement, dlrover-master
#1121 - dlrover/blob/master/docs/tutorial/tf_elasticjob_on_k8s [tf_elasticjob_on_k8s example failed to start]
Issue -
State: closed - Opened by lichadehehehe 5 months ago
- 5 comments
#1120 - Remove unused logs of checking gpu.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1119 - Add Bayesian Optimization for DLRover Brain
Pull Request -
State: closed - Opened by yzlnew 5 months ago
- 2 comments
#1118 - Test allreduce performance after checking node health.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1117 - Polish the log of executing matmul ops.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1116 - Remove empty.proto in the elastic_training.proto
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1115 - make deploy IMG=easydl/elasticjob-controller:master
Issue -
State: open - Opened by yangzhipeng1108 5 months ago
- 1 comment
#1114 - Set join timeout value as timeout in rdzv params.
Issue -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1113 - OSError: [Errno 98] Address already in use
Issue -
State: closed - Opened by chencjcj 5 months ago
- 3 comments
#1112 - WIP: Diagnose training hang
Pull Request -
State: open - Opened by samplise 5 months ago
#1111 - Refactor /examples/pytorch/NanoGPT/*_train.py
Pull Request -
State: closed - Opened by Filtee 5 months ago
- 1 comment
#1110 - What is the Go code in dlrover for, and how does it relate to the Python package?
Issue -
State: closed - Opened by HongLouyemeng 5 months ago
- 8 comments
#1109 - kubectl -n dlrover apply -f examples/pytorch/nanogpt/elastic_job.yaml error
Issue -
State: closed - Opened by yangzhipeng1108 5 months ago
- 1 comment
#1108 - Refactor examples/pytorch/NanoGPT/
Pull Request -
State: closed - Opened by Filtee 5 months ago
- 3 comments
#1107 - Refactored example/pytorch/NanoGPT.
Pull Request -
State: closed - Opened by Filtee 5 months ago
#1106 - Fix the bug to check whether a dict contains another dict.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1105 - fix logging
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: bug
#1104 - Fix unittest cases.
Pull Request -
State: closed - Opened by workingloong 5 months ago
#1103 - Refactor the job collector to collect the count of failover.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1102 - Remove the exited node from the waiting nodes of rdzv.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1101 - optimize master pod retrieving
Pull Request -
State: closed - Opened by BalaBalaYi 5 months ago
- 1 comment
Labels: enhancement, dlrover-master
#1100 - Set the deletion strategy and timeout in the checkpointers.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 2 comments
Labels: flash checkpoint
#1099 - Save/load the non-params-related variables of dist optimizer.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1098 - atorch-v0.1.9
Pull Request -
State: closed - Opened by adamantboy 5 months ago
- 2 comments
#1097 - After automatic fault recovery, the loss oscillates abnormally after loading the flash ckpt
Issue -
State: closed - Opened by suenphey 5 months ago
- 1 comment
#1096 - Diagnosis manager implementation in the master
Pull Request -
State: closed - Opened by samplise 5 months ago
- 1 comment
#1095 - Add the news about Flash Ckpt supports HuggingFace Trainer.
Pull Request -
State: closed - Opened by workingloong 5 months ago
- 1 comment
#1094 - Set pull request target to run codecov with secret.
Pull Request -
State: closed - Opened by workingloong 6 months ago
#1093 - Fatal Python error: Segmentation fault when kill the training process.
Issue -
State: closed - Opened by workingloong 6 months ago
#1092 - The agent waits for async saving checkpoint before exiting.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1091 - Set the version to 0.3.6rc0.
Pull Request -
State: closed - Opened by workingloong 6 months ago
- 1 comment
#1090 - delete useless dockerfile
Pull Request -
State: closed - Opened by cailun01 6 months ago
- 1 comment