Ecosyste.ms: Issues
An open API service that provides issue and pull request metadata for open source projects.
GitHub / pytorch/elastic issues and pull requests
#170 - [examples/imagenet/main.py] Why doesn't elastic code contain gpu sync to compute performance, e.g. all_reduce
Issue -
State: open - Opened by JoohyungLee0106 over 2 years ago
#169 - RuntimeError: Expected all tensors to be on the same device, but found at least two devices
Issue -
State: closed - Opened by JoohyungLee0106 over 2 years ago
- 4 comments
#168 - Please add more torch elastic training examples
Issue -
State: open - Opened by wanziyu over 2 years ago
#167 - (torchelastic) update README to point elastic CRD users to TorchX
Pull Request -
State: closed - Opened by kiukchung almost 3 years ago
- 1 comment
Labels: fb-exported, cla signed
#166 - BASE_IMG upgrade for Dockerfile after PyTorch1.10
Pull Request -
State: closed - Opened by ghost almost 3 years ago
- 3 comments
Labels: cla signed
#165 - rendezvous: _matches_machine_hostname doesn't resolve hostnames fully
Issue -
State: open - Opened by d4l3k almost 3 years ago
- 2 comments
#164 - Kubernetes: ttlSecondsAfterFinished not working in ElasticJob spec
Issue -
State: open - Opened by jovan-absci about 3 years ago
#163 - (torchx/specs) Remove RunConfig in favor of using Dict[str, CfgVal] directly
Pull Request -
State: closed - Opened by kiukchung about 3 years ago
- 6 comments
Labels: fb-exported, cla signed
#162 - [feature request] Add CPU example
Issue -
State: open - Opened by gaocegege over 3 years ago
- 2 comments
#161 - Remove unconfigured submodule
Pull Request -
State: open - Opened by jonathan-conder-sm over 3 years ago
- 3 comments
Labels: cla signed
#160 - Is petctl also deprecated?
Issue -
State: open - Opened by vadimkantorov over 3 years ago
#159 - [feature request] petctl to support pulling script directory from github repo by commit or tag
Issue -
State: open - Opened by vadimkantorov over 3 years ago
#158 - submodule path docs/src/pytorch-sphinx-theme not in .gitmodules
Issue -
State: open - Opened by jonathan-conder-sm over 3 years ago
#157 - Kubernetes CustomResourceDefinition Moving out of Beta
Issue -
State: closed - Opened by 5had3z over 3 years ago
- 4 comments
#156 - [Blocked] feat(dockerfile): Use Torch 1.9 instead of nightly
Pull Request -
State: closed - Opened by gaocegege over 3 years ago
- 4 comments
Labels: cla signed
#155 - Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#60925)
Pull Request -
State: closed - Opened by aivanou over 3 years ago
- 2 comments
Labels: fb-exported, cla signed
#154 - update docs to point to pytorch 1.9 and torchx for torchelastic and tsm (respectively)
Pull Request -
State: closed - Opened by kiukchung over 3 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#153 - Update index.md
Pull Request -
State: closed - Opened by brianjo over 3 years ago
- 2 comments
Labels: cla signed
#152 - EtcdStore: AttributeError: can't set attribute
Issue -
State: open - Opened by vv-p over 3 years ago
- 1 comment
#151 - Cannot reuse --rdzv_id between different elastic launch ?
Issue -
State: open - Opened by PKUFlyingPig over 3 years ago
#150 - Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1)
Issue -
State: closed - Opened by assapin over 3 years ago
- 1 comment
#149 - add support for jetter to Role (base_image) for mast launches
Pull Request -
State: closed - Opened by kiukchung over 3 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#148 - Move torchelastic docs *.rst (#56811)
Pull Request -
State: closed - Opened by kiukchung almost 4 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#147 - Out of Data documentation
Issue -
State: closed - Opened by Godricly almost 4 years ago
- 4 comments
#146 - Improve the implementation of `RendezvousParameters` and add its unit tests. (#54807)
Pull Request -
State: closed - Opened by cbalioglu almost 4 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#145 - ModuleNotFoundError: No module named 'torch.distributed.elastic'
Issue -
State: closed - Opened by GwangsooHong almost 4 years ago
- 4 comments
#144 - Fix python required version > 3.6 bug
Pull Request -
State: open - Opened by jenhaoyang almost 4 years ago
- 8 comments
Labels: cla signed
#143 - Move torchelastic/events to torch/distributed/events
Pull Request -
State: closed - Opened by aivanou almost 4 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#142 - Support PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0
Issue -
State: closed - Opened by DavidSpek almost 4 years ago
- 7 comments
#141 - Move torchelastic/rendezvous to torch/distributed/rendezvous
Pull Request -
State: closed - Opened by kiukchung almost 4 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#140 - Torch Elastic - How to make sure all nodes are in the same AZ?
Issue -
State: closed - Opened by thecooltechguy about 4 years ago
- 2 comments
#139 - [*.py] Rename "Arguments:" to "Args:"
Pull Request -
State: closed - Opened by SamuelMarks about 4 years ago
- 3 comments
Labels: Merged, cla signed
#138 - Remove NCCL Blocking Wait from Imagenet Example
Pull Request -
State: closed - Opened by osalpekar about 4 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#137 - Minor fix in Issue Reporting Template
Pull Request -
State: closed - Opened by osalpekar about 4 years ago
- 3 comments
Labels: fb-exported, Merged, cla signed
#136 - Enable NCCL_ASYNC_ERROR_HANDLING in Torchelastic
Issue -
State: closed - Opened by osalpekar about 4 years ago
- 1 comment
#135 - Pytorch Lightning with TorchElastic - One worker doesn't start
Issue -
State: closed - Opened by tchaton about 4 years ago
- 3 comments
#134 - Elastic agent doesn't detect worker failures in NCCL
Issue -
State: closed - Opened by ruipeterpan about 4 years ago
- 4 comments
#133 - Enable NCCL_ASYNC_ERROR_HANDLING in torchelastic
Pull Request -
State: closed - Opened by osalpekar about 4 years ago
- 4 comments
Labels: fb-exported, Merged, cla signed
#132 - Fix circle CI breakage by depending on torch-1.8.0dev (nightly)
Pull Request -
State: closed - Opened by kiukchung about 4 years ago
- 2 comments
Labels: fb-exported, Merged, cla signed
#131 - unbind scheduler from session and make session apis take a scheduler backend
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 4 comments
Labels: fb-exported, Merged
#130 - How to programmatically determine if a training job has finished using `kubectl`?
Issue -
State: closed - Opened by darthsuogles over 4 years ago
- 2 comments
#129 - make torchelastic.distributed.launch args settable from env var with name PET_ARG
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#128 - Add env support for the training script argument
Issue -
State: closed - Opened by kuikuikuizzZ over 4 years ago
- 4 comments
#127 - Add tsm docs to the docs page, added --dry-run flag to doc_push.sh, fix a few docstring typos
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#126 - add ui url to the return value of session.status(app_id)
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#125 - add replica_id macro
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#124 - accept role as a command line argument
Pull Request -
State: closed - Opened by yifuwang over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#123 - implement ElasticRole, role args macro substitution
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#122 - move pytorch/elastic/test/** into pytorch/elastic/torchelastic/**/test/**
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 4 comments
Labels: fb-exported, Merged
#121 - test pyenv
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
#120 - Bump version to 0.2.1rc0, upgrade python to 3.8.4
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 6 comments
Labels: fb-exported
#119 - got rid of url based rdzv init, added fb rdzv registry with supported rdzv backends
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 8 comments
Labels: fb-exported, Merged
#118 - initial api skeleton
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 3 comments
Labels: fb-exported, Merged
#117 - [request] Do we have plan to merge Kubernetes part to kubeflow/pytorch-operator?
Issue -
State: open - Opened by gaocegege over 4 years ago
- 22 comments
#116 - Improve docs on writing training scripts compatible with scale down
Issue -
State: closed - Opened by kiukchung over 4 years ago
- 1 comment
#115 - ClassyVision distributed training hang after scaledown training Nodes
Issue -
State: closed - Opened by yqwang-ms over 4 years ago
- 6 comments
#114 - Wrapping elastic-job kubernetes in initcontainer
Issue -
State: closed - Opened by mckunkel over 4 years ago
- 4 comments
#113 - ValueError: host not found: Name or service not known with torchelastic
Issue -
State: closed - Opened by mckunkel over 4 years ago
- 2 comments
#112 - Trouble connecting PersistentVolume/Claim to ElasticJob
Issue -
State: closed - Opened by SeanNaren over 4 years ago
- 3 comments
#111 - Added command to create GCP GKE cluster
Pull Request -
State: closed - Opened by SeanNaren over 4 years ago
- 1 comment
Labels: Merged
#110 - make faulthandler enabling best effort
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#109 - More consistent format
Pull Request -
State: closed - Opened by vinamrabenara over 4 years ago
- 3 comments
#108 - enable fault handler to dump python traceback on error signals
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#106 - Move torchelastic.distributed.app to torch.distributed.fb.app
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#103 - expose run_id as TORCHELASTIC_RUN_ID env var for workers
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#102 - Pytorch Elastic with NCCL backend
Issue -
State: closed - Opened by vibhatha over 4 years ago
- 5 comments
#100 - fix circleci test breakage by not using UNSET_RPC_TIMEOUT constant
Pull Request -
State: closed - Opened by kiukchung over 4 years ago
- 2 comments
Labels: fb-exported, Merged
#99 - A pytorch-native ElasticDDP module proposal
Issue -
State: closed - Opened by umialpha almost 5 years ago
- 1 comment
#98 - User loss of work if the cluster change occurs in the middle of the epoch
Issue -
State: open - Opened by aivanou almost 5 years ago
- 1 comment
#93 - imagenet example - add logic to broadcast most recent checkpoint from max_rank
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#92 - update docs
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 1 comment
Labels: fb-exported
#89 - modify yaml to work with torchelastic-0.2.0rc, make clear eks vs vanilla k8 in README, modified pod.go to append args to launcher
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported, Merged
#87 - fix broken doc string in rdzv module and revamp the layout of the rendezvous.html page
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#86 - remove usage of deprecated use_env flag in new launch tests introduced in D20658603
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#85 - fix errors in dockerfile, remove use_env flag in launcher as it seems to be a legacy feature in torch.distributed.launch as well
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported, Merged
#84 - remove remaining v0.1.0 apis
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 4 comments
Labels: fb-exported, Merged
#83 - revamp imagenet example for v0.2.0, remove classy example (built into docker now), modify Dockerfiles for lib and example
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 5 comments
Labels: fb-exported, Merged
#82 - remove distributed utils and rainbow logging
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 4 comments
Labels: fb-exported, Merged
#81 - torchelastic: enable pyre
Pull Request -
State: closed - Opened by d4l3k almost 5 years ago
#80 - Delete classy elastic trainer and dependency to pet checkpoint api
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#79 - version bump + documentation update
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 9 comments
Labels: fb-exported, Merged
#75 - For Kubernetes provide sample using Nvidia GPU operator
Issue -
State: closed - Opened by chauhang almost 5 years ago
#73 - disable tests that are incompatible with tsan when in tsan build mode
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#72 - Made elastic_launch an operator
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported, Merged
#71 - add group rank to distinfo and env var, made elastic launcher return values from worker fn
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#69 - Initialize TorchElasticService when starting the agent
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported
#67 - Added profiling metrics to agent. Improved metrics api to make it easy to add metrics to the agent
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported, Merged
#65 - implemented torchelastic.distributed.launch for oss
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 2 comments
Labels: fb-exported, Merged
#62 - small formatting changes (that I missed in my previous diff)
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported, Merged
#51 - Disable timer_example tests for spawn and forkserver under tsan (in addition to asan)
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported
#50 - Use __slots__ over dataclass since it is not supported in python 3.6, add __init__.py to test/timer
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 3 comments
Labels: fb-exported, Merged
#46 - disable certain timer tests under tsan
Pull Request -
State: closed - Opened by kiukchung almost 5 years ago
- 1 comment
Labels: fb-exported
#43 - Implement countdown timer with a localhost implementation of timer client and server
Pull Request -
State: closed - Opened by kiukchung about 5 years ago
- 4 comments
Labels: fb-exported, Merged
#41 - add formatter_python.sh for external contributor, and setup circleci to run linter on builds
Pull Request -
State: closed - Opened by kiukchung about 5 years ago
- 5 comments
Labels: fb-exported, Merged
#35 - Temporarily depend on torch 1.5.0-nightly (required to use EtcdStore).
Pull Request -
State: closed - Opened by kiukchung about 5 years ago
- 5 comments
Labels: fb-exported, Merged
#34 - Use EtcdStore rather than TCPStore when using etcd_rdzv
Pull Request -
State: closed - Opened by kiukchung about 5 years ago
- 3 comments
Labels: fb-exported, Merged
#33 - Move get_rank() to torchelastic.distributed.utils
Pull Request -
State: closed - Opened by kiukchung about 5 years ago
- 2 comments
Labels: fb-exported, Merged
#25 - training hang when remove/add instances
Issue -
State: closed - Opened by dev777-create about 5 years ago
- 3 comments
#20 - version bump to 0.1.0rc2
Pull Request -
State: closed - Opened by kiukchung about 5 years ago
- 4 comments
Labels: fb-exported, Merged
#12 - [rdzv] Implement `etcd_rendezvous#load_extra_data()` timeout.
Issue -
State: closed - Opened by kiukchung about 5 years ago