Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / pytorch/elastic issues and pull requests

#168 - Please add more torch elastic training examples

Issue - State: open - Opened by wanziyu over 2 years ago

#167 - (torchelastic) update README to point elastic CRD users to TorchX

Pull Request - State: closed - Opened by kiukchung almost 3 years ago - 1 comment
Labels: fb-exported, cla signed

#166 - BASE_IMG upgrade for Dockerfile after PyTorch1.10

Pull Request - State: closed - Opened by ghost almost 3 years ago - 3 comments
Labels: cla signed

#165 - rendezvous: _matches_machine_hostname doesn't resolve hostnames fully

Issue - State: open - Opened by d4l3k almost 3 years ago - 2 comments

#163 - (torchx/specs) Remove RunConfig in favor of using Dict[str, CfgVal] directly

Pull Request - State: closed - Opened by kiukchung about 3 years ago - 6 comments
Labels: fb-exported, cla signed

#162 - [feature request] Add CPU example

Issue - State: open - Opened by gaocegege over 3 years ago - 2 comments

#161 - Remove unconfigured submodule

Pull Request - State: open - Opened by jonathan-conder-sm over 3 years ago - 3 comments
Labels: cla signed

#160 - Is petctl also deprecated?

Issue - State: open - Opened by vadimkantorov over 3 years ago

#157 - Kubernetes CustomResourceDefinition Moving out of Beta

Issue - State: closed - Opened by 5had3z over 3 years ago - 4 comments

#156 - [Blocked] feat(dockerfile): Use Torch 1.9 instead of nightly

Pull Request - State: closed - Opened by gaocegege over 3 years ago - 4 comments
Labels: cla signed

#155 - Various improvements to `torch.distributed.launch` and `torch.distributed.run` (#60925)

Pull Request - State: closed - Opened by aivanou over 3 years ago - 2 comments
Labels: fb-exported, cla signed

#154 - update docs to point to pytorch 1.9 and torchx for torchelastic and tsm (respectively)

Pull Request - State: closed - Opened by kiukchung over 3 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#153 - Update index.md

Pull Request - State: closed - Opened by brianjo over 3 years ago - 2 comments
Labels: cla signed

#152 - EtcdStore: AttributeError: can't set attribute

Issue - State: open - Opened by vv-p over 3 years ago - 1 comment

#150 - Imagenet example fails during accuracy calculation (v0.2.2 on 1.8.1)

Issue - State: closed - Opened by assapin over 3 years ago - 1 comment

#149 - add support for jetter to Role (base_image) for mast launches

Pull Request - State: closed - Opened by kiukchung over 3 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#148 - Move torchelastic docs *.rst (#56811)

Pull Request - State: closed - Opened by kiukchung almost 4 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#147 - Out of Data documentation

Issue - State: closed - Opened by Godricly almost 4 years ago - 4 comments

#146 - Improve the implementation of `RendezvousParameters` and add its unit tests. (#54807)

Pull Request - State: closed - Opened by cbalioglu almost 4 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#145 - ModuleNotFoundError: No module named 'torch.distributed.elastic'

Issue - State: closed - Opened by GwangsooHong almost 4 years ago - 4 comments

#144 - Fix python required version > 3.6 bug

Pull Request - State: open - Opened by jenhaoyang almost 4 years ago - 8 comments
Labels: cla signed

#143 - Move torchelastic/events to torch/distributed/events

Pull Request - State: closed - Opened by aivanou almost 4 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#142 - Support PyTorch 1.8, TorchVision 0.9.0 and TorchAudio 0.8.0

Issue - State: closed - Opened by DavidSpek almost 4 years ago - 7 comments

#141 - Move torchelastic/rendezvous to torch/distributed/rendezvous

Pull Request - State: closed - Opened by kiukchung almost 4 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#140 - Torch Elastic - How to make sure all nodes are in the same AZ?

Issue - State: closed - Opened by thecooltechguy about 4 years ago - 2 comments

#139 - [*.py] Rename "Arguments:" to "Args:"

Pull Request - State: closed - Opened by SamuelMarks about 4 years ago - 3 comments
Labels: Merged, cla signed

#138 - Remove NCCL Blocking Wait from Imagenet Example

Pull Request - State: closed - Opened by osalpekar about 4 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#137 - Minor fix in Issue Reporting Template

Pull Request - State: closed - Opened by osalpekar about 4 years ago - 3 comments
Labels: fb-exported, Merged, cla signed

#136 - Enable NCCL_ASYNC_ERROR_HANDLING in Torchelastic

Issue - State: closed - Opened by osalpekar about 4 years ago - 1 comment

#135 - Pytorch Lightning with TorchElastic - One worker doesn't start

Issue - State: closed - Opened by tchaton about 4 years ago - 3 comments

#134 - Elastic agent doesn't detect worker failures in NCCL

Issue - State: closed - Opened by ruipeterpan about 4 years ago - 4 comments

#133 - Enable NCCL_ASYNC_ERROR_HANDLING in torchelastic

Pull Request - State: closed - Opened by osalpekar about 4 years ago - 4 comments
Labels: fb-exported, Merged, cla signed

#132 - Fix circle CI breakage by depending on torch-1.8.0dev (nightly)

Pull Request - State: closed - Opened by kiukchung about 4 years ago - 2 comments
Labels: fb-exported, Merged, cla signed

#131 - unbind scheduler from session and make session apis take a scheduler backend

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 4 comments
Labels: fb-exported, Merged

#129 - make torchelastic.distributed.launch args settable from env var with name PET_ARG

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#128 - Add env support for the training script argument

Issue - State: closed - Opened by kuikuikuizzZ over 4 years ago - 4 comments

#127 - Add tsm docs to the docs page, added --dry-run flag to doc_push.sh, fix a few docstring typos

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#126 - add ui url to the return value of session.status(app_id)

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#125 - add replica_id macro

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#124 - accept role as a command line argument

Pull Request - State: closed - Opened by yifuwang over 4 years ago - 2 comments
Labels: fb-exported, Merged

#123 - implement ElasticRole, role args macro substitution

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#122 - move pytorch/elastic/test/** into pytorch/elastic/torchelastic/**/test/**

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 4 comments
Labels: fb-exported, Merged

#121 - test pyenv

Pull Request - State: closed - Opened by kiukchung over 4 years ago

#120 - Bump version to 0.2.1rc0, upgrade python to 3.8.4

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 6 comments
Labels: fb-exported

#119 - got rid of url based rdzv init, added fb rdzv registry with supported rdzv backends

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 8 comments
Labels: fb-exported, Merged

#118 - initial api skeleton

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 3 comments
Labels: fb-exported, Merged

#116 - Improve docs on writing training scripts compatible with scale down

Issue - State: closed - Opened by kiukchung over 4 years ago - 1 comment

#115 - ClassyVision distributed training hang after scaledown training Nodes

Issue - State: closed - Opened by yqwang-ms over 4 years ago - 6 comments

#114 - Wrapping elastic-job kubernetes in initcontainer

Issue - State: closed - Opened by mckunkel over 4 years ago - 4 comments

#113 - ValueError: host not found: Name or service not known with torchelastic

Issue - State: closed - Opened by mckunkel over 4 years ago - 2 comments

#112 - Trouble connecting PersistentVolume/Claim to ElasticJob

Issue - State: closed - Opened by SeanNaren over 4 years ago - 3 comments

#111 - Added command to create GCP GKE cluster

Pull Request - State: closed - Opened by SeanNaren over 4 years ago - 1 comment
Labels: Merged

#110 - make faulthandler enabling best effort

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#109 - More consistent format

Pull Request - State: closed - Opened by vinamrabenara over 4 years ago - 3 comments

#108 - enable fault handler to dump python traceback on error signals

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#106 - Move torchelastic.distributed.app to torch.distributed.fb.app

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#103 - expose run_id as TORCHELASTIC_RUN_ID env var for workers

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#102 - Pytorch Elastic with NCCL backend

Issue - State: closed - Opened by vibhatha over 4 years ago - 5 comments

#100 - fix circleci test breakage by not using UNSET_RPC_TIMEOUT constant

Pull Request - State: closed - Opened by kiukchung over 4 years ago - 2 comments
Labels: fb-exported, Merged

#99 - A pytorch-native ElasticDDP module proposal

Issue - State: closed - Opened by umialpha almost 5 years ago - 1 comment

#98 - User loss of work if the cluster change occurs in the middle of the epoch

Issue - State: open - Opened by aivanou almost 5 years ago - 1 comment

#93 - imagenet example - add logic to broadcast most recent checkpoint from max_rank

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#92 - update docs

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 1 comment
Labels: fb-exported

#89 - modify yaml to work with torchelastic-0.2.0rc, make clear eks vs vanilla k8 in README, modified pod.go to append args to launcher

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported, Merged

#87 - fix broken doc string in rdzv module and revamp the layout of the rendezvous.html page

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#86 - remove usage of deprecated use_env flag in new launch tests introduced in D20658603

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#85 - fix errors in dockerfile, remove use_env flag in launcher as it seems to be a legacy feature in torch.distributed.launch as well

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported, Merged

#84 - remove remaining v0.1.0 apis

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 4 comments
Labels: fb-exported, Merged

#83 - revamp imagenet example for v0.2.0, remove classy example (built into docker now), modify Dockerfiles for lib and example

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 5 comments
Labels: fb-exported, Merged

#82 - remove distributed utils and rainbow logging

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 4 comments
Labels: fb-exported, Merged

#81 - torchelastic: enable pyre

Pull Request - State: closed - Opened by d4l3k almost 5 years ago

#80 - Delete classy elastic trainer and dependency to pet checkpoint api

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#79 - version bump + documentation update

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 9 comments
Labels: fb-exported, Merged

#75 - For Kubernetes provide sample using Nvidia GPU operator

Issue - State: closed - Opened by chauhang almost 5 years ago

#73 - disable tests that are incompatible with tsan when in tsan build mode

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#72 - Made elastic_launch an operator

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported, Merged

#71 - add group rank to distinfo and env var, made elastic launcher return values from worker fn

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#69 - Initialize TorchElasticService when starting the agent

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported

#67 - Added profiling metrics to agent. Improved metrics api to make it easy to add metrics to the agent

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported, Merged

#65 - implemented torchelastic.distributed.launch for oss

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 2 comments
Labels: fb-exported, Merged

#62 - small formatting changes (that I missed in my previous diff)

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported, Merged

#51 - Disable timer_example tests for spawn and forkserver under tsan (in addition to asan)

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported

#50 - Use __slots__ over dataclass since it is not supported in python 3.6, add __init__.py to test/timer

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 3 comments
Labels: fb-exported, Merged

#46 - disable certain timer tests under tsan

Pull Request - State: closed - Opened by kiukchung almost 5 years ago - 1 comment
Labels: fb-exported

#43 - Implement countdown timer with a localhost implementation of timer client and server

Pull Request - State: closed - Opened by kiukchung about 5 years ago - 4 comments
Labels: fb-exported, Merged

#41 - add formatter_python.sh for external contributor, and setup circleci to run linter on builds

Pull Request - State: closed - Opened by kiukchung about 5 years ago - 5 comments
Labels: fb-exported, Merged

#35 - Temporarily depend on torch 1.5.0-nightly (required to use EtcdStore).

Pull Request - State: closed - Opened by kiukchung about 5 years ago - 5 comments
Labels: fb-exported, Merged

#34 - Use EtcdStore rather than TCPStore when using etcd_rdzv

Pull Request - State: closed - Opened by kiukchung about 5 years ago - 3 comments
Labels: fb-exported, Merged

#33 - Move get_rank() to torchelastic.distributed.utils

Pull Request - State: closed - Opened by kiukchung about 5 years ago - 2 comments
Labels: fb-exported, Merged

#25 - training hang when remove/add instances

Issue - State: closed - Opened by dev777-create about 5 years ago - 3 comments

#20 - version bump to 0.1.0rc2

Pull Request - State: closed - Opened by kiukchung about 5 years ago - 4 comments
Labels: fb-exported, Merged

#12 - [rdzv] Implement `etcd_rendezvous#load_extra_data()` timeout.

Issue - State: closed - Opened by kiukchung about 5 years ago