Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / pytorch/torchft issues and pull requests
#97 - process_group: support all PG APIs
Issue -
State: open - Opened by d4l3k 3 days ago
Labels: enhancement
#96 - Refactor local_sgd integration tests
Pull Request -
State: closed - Opened by H-Huang 3 days ago
Labels: CLA Signed
#95 - Participants APIs should check if quorum is started
Pull Request -
State: closed - Opened by fegin 3 days ago
Labels: CLA Signed
#94 - manager: expose participating_rank
Pull Request -
State: closed - Opened by d4l3k 4 days ago
Labels: CLA Signed
#93 - examples,docs: adjust ddp example timeout and docs
Pull Request -
State: closed - Opened by d4l3k 7 days ago
Labels: CLA Signed
#92 - Add DiLoCo
Pull Request -
State: closed - Opened by H-Huang 8 days ago
Labels: CLA Signed
#91 - ProcessGroupBabyNCCL: support multiple streams and use event on start
Pull Request -
State: closed - Opened by d4l3k 8 days ago
Labels: CLA Signed
#90 - CheckpointServer: start in disallowed state + tests
Pull Request -
State: closed - Opened by d4l3k 8 days ago
Labels: CLA Signed
#89 - ProcessGroupBaby: support full suite of PG tests
Pull Request -
State: closed - Opened by d4l3k 9 days ago
Labels: CLA Signed
#88 - process_group: fix docs with torch==2.6.0
Pull Request -
State: closed - Opened by d4l3k 9 days ago
Labels: CLA Signed
#87 - Change how TorchFT manages user_state_dict
Pull Request -
State: closed - Opened by fegin 9 days ago
Labels: CLA Signed
#86 - Fix ManagedDeviceMesh composability issues
Pull Request -
State: closed - Opened by fegin 9 days ago
Labels: CLA Signed
#85 - Improve OptimizerWrapper composability
Pull Request -
State: closed - Opened by fegin 9 days ago
Labels: CLA Signed
#84 - coordination: expose new low level torchft coordination API
Pull Request -
State: open - Opened by d4l3k 10 days ago
Labels: CLA Signed
#83 - process_group/ManagedProcessGroup: ensure quorum and PG is configured
Pull Request -
State: closed - Opened by d4l3k 10 days ago
Labels: CLA Signed
#82 - [WIP][RFC] Required changes for integration with TorchTitan
Pull Request -
State: open - Opened by fegin 11 days ago
Labels: CLA Signed
#81 - checkpointing: use CheckpointTransport abstraction
Pull Request -
State: closed - Opened by d4l3k 14 days ago
Labels: CLA Signed
#80 - rust: add open telemetry tracing
Pull Request -
State: open - Opened by d4l3k 15 days ago
Labels: CLA Signed
#79 - lighthouse/quorum: make it clear that quorum logs are for next quorum
Pull Request -
State: closed - Opened by d4l3k 15 days ago
Labels: CLA Signed
#78 - lib: fix Already borrowed
Pull Request -
State: closed - Opened by d4l3k 16 days ago
Labels: CLA Signed
#77 - [WIP] FSDP example
Pull Request -
State: open - Opened by mreso 16 days ago
Labels: CLA Signed
#76 - Add DiLoCo
Pull Request -
State: closed - Opened by H-Huang 17 days ago
- 3 comments
Labels: CLA Signed
#75 - use torchx for manual many replica (20+) tests
Pull Request -
State: closed - Opened by d4l3k 22 days ago
Labels: CLA Signed
#74 - Fix typo and use sampler in train_ddp.py
Pull Request -
State: closed - Opened by mreso 23 days ago
Labels: CLA Signed
#73 - overhaul timeouts for Lighthouse, Manager, checkpoint server
Pull Request -
State: closed - Opened by d4l3k 23 days ago
Labels: CLA Signed
#72 - Dont return quorum if requester isnt involved
Pull Request -
State: closed - Opened by Jackmin801 24 days ago
Labels: CLA Signed
#71 - lighthouse/quorum: avoid split brain and add shrink_only support
Pull Request -
State: closed - Opened by d4l3k 25 days ago
- 2 comments
Labels: CLA Signed
#70 - lighthouse, manager: remove room support
Pull Request -
State: closed - Opened by d4l3k 25 days ago
- 1 comment
Labels: CLA Signed
#69 - feat: fix security warnings in torchft
Pull Request -
State: closed - Opened by c-p-i-o 25 days ago
- 1 comment
Labels: CLA Signed
#68 - process_group: wait for futher_thread join before creating new one
Pull Request -
State: closed - Opened by dwancn 26 days ago
- 4 comments
Labels: CLA Signed
#67 - [manager] fix address when binding to 0
Pull Request -
State: closed - Opened by d4l3k 28 days ago
Labels: CLA Signed
#66 - Use bucketized model averaging for LocalSGD
Issue -
State: open - Opened by d4l3k 28 days ago
- 1 comment
Labels: enhancement, good first issue
#65 - Update documentation link
Pull Request -
State: closed - Opened by d4l3k 29 days ago
Labels: CLA Signed
#64 - [lighthouse] detect unhealthy participants via heartbeats
Pull Request -
State: closed - Opened by d4l3k 29 days ago
- 2 comments
Labels: CLA Signed
#63 - Use heartbeat to invalidate stale entries
Issue -
State: closed - Opened by Jackmin801 29 days ago
- 1 comment
#62 - Test manager join
Pull Request -
State: open - Opened by Jackmin801 about 1 month ago
Labels: CLA Signed
#61 - feat: expose lighthouse join timeout
Pull Request -
State: closed - Opened by Jackmin801 about 1 month ago
Labels: CLA Signed
#60 - process_group: add PG init timeouts + automatically assign manager port
Pull Request -
State: closed - Opened by d4l3k about 1 month ago
Labels: CLA Signed
#59 - manager: add per request timeouts
Pull Request -
State: closed - Opened by d4l3k about 1 month ago
- 1 comment
Labels: CLA Signed
#58 - Dataloader question upon restart
Issue -
State: open - Opened by cjolivier01 about 1 month ago
- 6 comments
Labels: enhancement, question, data
#57 - Fix typos and add a missing break
Pull Request -
State: closed - Opened by yeahdongcn about 1 month ago
Labels: CLA Signed
#56 - Introduce ManagedDeviceMesh to integrate DeviceMesh with TorchFT
Pull Request -
State: closed - Opened by fegin about 2 months ago
Labels: CLA Signed
#55 - manager_integ_tests: added LocalSGD integration test
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#54 - Use streaming transfers
Pull Request -
State: open - Opened by Krishn1412 about 2 months ago
- 4 comments
Labels: CLA Signed
#53 - chore: compile -> compile_protos
Pull Request -
State: closed - Opened by Jackmin801 about 2 months ago
- 1 comment
Labels: CLA Signed
#52 - ManagerClient.quorum should return a namedtuple, dataclass or object
Issue -
State: closed - Opened by Jackmin801 about 2 months ago
- 3 comments
#51 - add pypi badge to README
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#50 - .github: add nightly wheel builds
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#49 - [Chore] Couple of small chores
Pull Request -
State: closed - Opened by Jackmin801 about 2 months ago
Labels: CLA Signed
#48 - lighthouse, manager: support multiple quorum rooms
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#47 - local_sgd: initial version of fault tolerant LocalSGD
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#46 - manager: support multiple calls to start_quorum and conditional healing
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#45 - Add _test_pg helper
Pull Request -
State: closed - Opened by H-Huang about 2 months ago
Labels: CLA Signed
#44 - manager: rename step to start_step + small shutdown fix
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
- 1 comment
Labels: CLA Signed
#43 - FSDP + torchtitan support
Issue -
State: open - Opened by d4l3k about 2 months ago
- 1 comment
#42 - Update README with protobuf related installation instructions
Pull Request -
State: closed - Opened by H-Huang about 2 months ago
Labels: CLA Signed
#41 - support DDP bucket rebuilding
Issue -
State: open - Opened by d4l3k about 2 months ago
#40 - manager_integ_tests: added multi rank recovery
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#39 - LocalSGD / DiLoCo support
Issue -
State: open - Opened by d4l3k about 2 months ago
- 4 comments
Labels: enhancement
#38 - manager: add CPU timeouts on allreduce/future calls
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#37 - [dataloader] dataloading improvement tracking issue
Issue -
State: open - Opened by d4l3k about 2 months ago
- 2 comments
Labels: enhancement, data
#36 - [CheckpointServer] use streaming transfers
Issue -
State: open - Opened by d4l3k about 2 months ago
- 5 comments
Labels: enhancement, good first issue
#35 - [lighthouse] use heartbeat info to quickly drop down replicas
Issue -
State: closed - Opened by d4l3k about 2 months ago
- 1 comment
Labels: enhancement, lighthouse
#34 - Update README.md
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#33 - Update README.md logo for dark mode
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#32 - [tests] add generic size/type test for all ProcessGroups
Issue -
State: closed - Opened by d4l3k about 2 months ago
- 2 comments
Labels: enhancement, good first issue
#31 - Update README.md
Pull Request -
State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed
#30 - Address already in use
Issue -
State: closed - Opened by goddice about 2 months ago
- 4 comments
#29 - pyre strict
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#28 - manager_integ_tests: added recovery test
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#27 - manager_integ_tests: added Python integration test with lighthouse
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#26 - manager: added E2E tests and support getting lighthouse and manager a…
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#25 - manager: added E2E tests and support getting lighthouse and manager addresses
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#24 - manager: added FIXED_WITH_SPARES mode
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#23 - lintrunner: enable pyre
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#22 - lintrunner: added black,isort,rustfmt
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#21 - process_group: wrapper updates and ErrorSwallowingProcessGroup
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#20 - docs: add legal info + fix jinja2 security warning
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#19 - manager: expand API to include errors, participant information and numeric test
Pull Request -
State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed
#18 - docs: add sphinx documentation and add missing documentation
Pull Request -
State: closed - Opened by d4l3k 2 months ago
- 1 comment
Labels: CLA Signed
#17 - [WIP] A test case to show how to use DeviceMesh API to create the customized PG
Pull Request -
State: closed - Opened by fegin 2 months ago
Labels: CLA Signed
#16 - Specify the devices when registering the backend to avoid warnings
Pull Request -
State: closed - Opened by fegin 3 months ago
Labels: CLA Signed
#15 - Update README.md to include Rust installation
Pull Request -
State: closed - Opened by fegin 3 months ago
Labels: CLA Signed
#14 - process_group: register via public API
Pull Request -
State: closed - Opened by d4l3k 3 months ago
- 1 comment
Labels: CLA Signed
#13 - process_group: added registration to support DeviceMesh and functional_collectives
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#12 - [checkpointing] support ipv6
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#11 - train, manager, dashboard: show world size on dashboard, manual replica_id, convergence tweaks
Pull Request -
State: closed - Opened by d4l3k 3 months ago
- 1 comment
Labels: CLA Signed
#10 - dashboard: show quorum status, age, old replicas
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#9 - ci: use stable rust and gate on number of gpus
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#8 - lighthouse, manager: dashboard kill and heartbeat old ui
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#7 - lighthouse: add dashboard
Pull Request -
State: closed - Opened by d4l3k 3 months ago
- 1 comment
Labels: CLA Signed
#6 - lighthouse: add heartbeats
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#5 - train_ddp, process_group: fixes so CUDA works e2e
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#4 - ci: switch to amazon runners
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed
#3 - Testing versions
Pull Request -
State: closed - Opened by ZainRizvi 3 months ago
- 1 comment
Labels: CLA Signed
#2 - Sorry, didn't mean to merge directly to main
Pull Request -
State: closed - Opened by ZainRizvi 3 months ago
Labels: CLA Signed
#1 - ci: lint + unittest
Pull Request -
State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed