Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / pytorch/torchft issues and pull requests

#97 - process_group: support all PG APIs

Issue - State: open - Opened by d4l3k 3 days ago
Labels: enhancement

#96 - Refactor local_sgd integration tests

Pull Request - State: closed - Opened by H-Huang 3 days ago
Labels: CLA Signed

#95 - Participants APIs should check if quorum is started

Pull Request - State: closed - Opened by fegin 3 days ago
Labels: CLA Signed

#94 - manager: expose participating_rank

Pull Request - State: closed - Opened by d4l3k 4 days ago
Labels: CLA Signed

#93 - examples,docs: adjust ddp example timeout and docs

Pull Request - State: closed - Opened by d4l3k 7 days ago
Labels: CLA Signed

#92 - Add DiLoCo

Pull Request - State: closed - Opened by H-Huang 8 days ago
Labels: CLA Signed

#91 - ProcessGroupBabyNCCL: support multiple streams and use event on start

Pull Request - State: closed - Opened by d4l3k 8 days ago
Labels: CLA Signed

#90 - CheckpointServer: start in disallowed state + tests

Pull Request - State: closed - Opened by d4l3k 8 days ago
Labels: CLA Signed

#89 - ProcessGroupBaby: support full suite of PG tests

Pull Request - State: closed - Opened by d4l3k 9 days ago
Labels: CLA Signed

#88 - process_group: fix docs with torch==2.6.0

Pull Request - State: closed - Opened by d4l3k 9 days ago
Labels: CLA Signed

#87 - Change how TorchFT manages user_state_dict

Pull Request - State: closed - Opened by fegin 9 days ago
Labels: CLA Signed

#86 - Fix ManagedDeviceMesh composability issues

Pull Request - State: closed - Opened by fegin 9 days ago
Labels: CLA Signed

#85 - Improve OptimizerWrapper composability

Pull Request - State: closed - Opened by fegin 9 days ago
Labels: CLA Signed

#84 - coordination: expose new low level torchft coordination API

Pull Request - State: open - Opened by d4l3k 10 days ago
Labels: CLA Signed

#83 - process_group/ManagedProcessGroup: ensure quorum and PG is configured

Pull Request - State: closed - Opened by d4l3k 10 days ago
Labels: CLA Signed

#82 - [WIP][RFC] Required changes for integration with TorchTitan

Pull Request - State: open - Opened by fegin 11 days ago
Labels: CLA Signed

#81 - checkpointing: use CheckpointTransport abstraction

Pull Request - State: closed - Opened by d4l3k 14 days ago
Labels: CLA Signed

#80 - rust: add open telemetry tracing

Pull Request - State: open - Opened by d4l3k 15 days ago
Labels: CLA Signed

#79 - lighthouse/quorum: make it clear that quorum logs are for next quorum

Pull Request - State: closed - Opened by d4l3k 15 days ago
Labels: CLA Signed

#78 - lib: fix Already borrowed

Pull Request - State: closed - Opened by d4l3k 16 days ago
Labels: CLA Signed

#77 - [WIP] FSDP example

Pull Request - State: open - Opened by mreso 16 days ago
Labels: CLA Signed

#76 - Add DiLoCo

Pull Request - State: closed - Opened by H-Huang 17 days ago - 3 comments
Labels: CLA Signed

#75 - use torchx for manual many replica (20+) tests

Pull Request - State: closed - Opened by d4l3k 22 days ago
Labels: CLA Signed

#74 - Fix typo and use sampler in train_ddp.py

Pull Request - State: closed - Opened by mreso 23 days ago
Labels: CLA Signed

#73 - overhaul timeouts for Lighthouse, Manager, checkpoint server

Pull Request - State: closed - Opened by d4l3k 23 days ago
Labels: CLA Signed

#72 - Don't return quorum if requester isn't involved

Pull Request - State: closed - Opened by Jackmin801 24 days ago
Labels: CLA Signed

#71 - lighthouse/quorum: avoid split brain and add shrink_only support

Pull Request - State: closed - Opened by d4l3k 25 days ago - 2 comments
Labels: CLA Signed

#70 - lighthouse, manager: remove room support

Pull Request - State: closed - Opened by d4l3k 25 days ago - 1 comment
Labels: CLA Signed

#69 - feat: fix security warnings in torchft

Pull Request - State: closed - Opened by c-p-i-o 25 days ago - 1 comment
Labels: CLA Signed

#68 - process_group: wait for future_thread join before creating new one

Pull Request - State: closed - Opened by dwancn 26 days ago - 4 comments
Labels: CLA Signed

#67 - [manager] fix address when binding to 0

Pull Request - State: closed - Opened by d4l3k 28 days ago
Labels: CLA Signed

#66 - Use bucketized model averaging for LocalSGD

Issue - State: open - Opened by d4l3k 28 days ago - 1 comment
Labels: enhancement, good first issue

#65 - Update documentation link

Pull Request - State: closed - Opened by d4l3k 29 days ago
Labels: CLA Signed

#64 - [lighthouse] detect unhealthy participants via heartbeats

Pull Request - State: closed - Opened by d4l3k 29 days ago - 2 comments
Labels: CLA Signed

#63 - Use heartbeat to invalidate stale entries

Issue - State: closed - Opened by Jackmin801 29 days ago - 1 comment

#62 - Test manager join

Pull Request - State: open - Opened by Jackmin801 about 1 month ago
Labels: CLA Signed

#61 - feat: expose lighthouse join timeout

Pull Request - State: closed - Opened by Jackmin801 about 1 month ago
Labels: CLA Signed

#60 - process_group: add PG init timeouts + automatically assign manager port

Pull Request - State: closed - Opened by d4l3k about 1 month ago
Labels: CLA Signed

#59 - manager: add per request timeouts

Pull Request - State: closed - Opened by d4l3k about 1 month ago - 1 comment
Labels: CLA Signed

#58 - Dataloader question upon restart

Issue - State: open - Opened by cjolivier01 about 1 month ago - 6 comments
Labels: enhancement, question, data

#57 - Fix typos and add a missing break

Pull Request - State: closed - Opened by yeahdongcn about 1 month ago
Labels: CLA Signed

#56 - Introduce ManagedDeviceMesh to integrate DeviceMesh with TorchFT

Pull Request - State: closed - Opened by fegin about 2 months ago
Labels: CLA Signed

#55 - manager_integ_tests: added LocalSGD integration test

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#54 - Use streaming transfers

Pull Request - State: open - Opened by Krishn1412 about 2 months ago - 4 comments
Labels: CLA Signed

#53 - chore: compile -> compile_protos

Pull Request - State: closed - Opened by Jackmin801 about 2 months ago - 1 comment
Labels: CLA Signed

#52 - ManagerClient.quorum should return a namedtuple, dataclass or object

Issue - State: closed - Opened by Jackmin801 about 2 months ago - 3 comments

#51 - add pypi badge to README

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#50 - .github: add nightly wheel builds

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#49 - [Chore] Couple of small chores

Pull Request - State: closed - Opened by Jackmin801 about 2 months ago
Labels: CLA Signed

#48 - lighthouse, manager: support multiple quorum rooms

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#47 - local_sgd: initial version of fault tolerant LocalSGD

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#46 - manager: support multiple calls to start_quorum and conditional healing

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#45 - Add _test_pg helper

Pull Request - State: closed - Opened by H-Huang about 2 months ago
Labels: CLA Signed

#44 - manager: rename step to start_step + small shutdown fix

Pull Request - State: closed - Opened by d4l3k about 2 months ago - 1 comment
Labels: CLA Signed

#43 - FSDP + torchtitan support

Issue - State: open - Opened by d4l3k about 2 months ago - 1 comment

#42 - Update README with protobuf related installation instructions

Pull Request - State: closed - Opened by H-Huang about 2 months ago
Labels: CLA Signed

#41 - support DDP bucket rebuilding

Issue - State: open - Opened by d4l3k about 2 months ago

#40 - manager_integ_tests: added multi rank recovery

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#39 - LocalSGD / DiLoCo support

Issue - State: open - Opened by d4l3k about 2 months ago - 4 comments
Labels: enhancement

#38 - manager: add CPU timeouts on allreduce/future calls

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#37 - [dataloader] dataloading improvement tracking issue

Issue - State: open - Opened by d4l3k about 2 months ago - 2 comments
Labels: enhancement, data

#36 - [CheckpointServer] use streaming transfers

Issue - State: open - Opened by d4l3k about 2 months ago - 5 comments
Labels: enhancement, good first issue

#35 - [lighthouse] use heartbeat info to quickly drop down replicas

Issue - State: closed - Opened by d4l3k about 2 months ago - 1 comment
Labels: enhancement, lighthouse

#34 - Update README.md

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#33 - Update README.md logo for dark mode

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#32 - [tests] add generic size/type test for all ProcessGroups

Issue - State: closed - Opened by d4l3k about 2 months ago - 2 comments
Labels: enhancement, good first issue

#31 - Update README.md

Pull Request - State: closed - Opened by d4l3k about 2 months ago
Labels: CLA Signed

#30 - Address already in use

Issue - State: closed - Opened by goddice about 2 months ago - 4 comments

#29 - pyre strict

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#28 - manager_integ_tests: added recovery test

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#27 - manager_integ_tests: added Python integration test with lighthouse

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#26 - manager: added E2E tests and support getting lighthouse and manager a…

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#25 - manager: added E2E tests and support getting lighthouse and manager addresses

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#24 - manager: added FIXED_WITH_SPARES mode

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#23 - lintrunner: enable pyre

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#22 - lintrunner: added black,isort,rustfmt

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#21 - process_group: wrapper updates and ErrorSwallowingProcessGroup

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#20 - docs: add legal info + fix jinja2 security warning

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#19 - manager: expand API to include errors, participant information and numeric test

Pull Request - State: closed - Opened by d4l3k 2 months ago
Labels: CLA Signed

#18 - docs: add sphinx documentation and add missing documentation

Pull Request - State: closed - Opened by d4l3k 2 months ago - 1 comment
Labels: CLA Signed

#17 - [WIP] A test case to show how to use DeviceMesh API to create the customized PG

Pull Request - State: closed - Opened by fegin 2 months ago
Labels: CLA Signed

#16 - Specify the devices when registering the backend to avoid warnings

Pull Request - State: closed - Opened by fegin 3 months ago
Labels: CLA Signed

#15 - Update README.md to include Rust installation

Pull Request - State: closed - Opened by fegin 3 months ago
Labels: CLA Signed

#14 - process_group: register via public API

Pull Request - State: closed - Opened by d4l3k 3 months ago - 1 comment
Labels: CLA Signed

#13 - process_group: added registration to support DeviceMesh and functional_collectives

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#12 - [checkpointing] support ipv6

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#11 - train, manager, dashboard: show world size on dashboard, manual replica_id, convergence tweaks

Pull Request - State: closed - Opened by d4l3k 3 months ago - 1 comment
Labels: CLA Signed

#10 - dashboard: show quorum status, age, old replicas

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#9 - ci: use stable rust and gate on number of gpus

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#8 - lighthouse, manager: dashboard kill and heartbeat old ui

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#7 - lighthouse: add dashboard

Pull Request - State: closed - Opened by d4l3k 3 months ago - 1 comment
Labels: CLA Signed

#6 - lighthouse: add heartbeats

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#5 - train_ddp, process_group: fixes so CUDA works e2e

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#4 - ci: switch to amazon runners

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed

#3 - Testing versions

Pull Request - State: closed - Opened by ZainRizvi 3 months ago - 1 comment
Labels: CLA Signed

#2 - Sorry, didn't mean to merge directly to main

Pull Request - State: closed - Opened by ZainRizvi 3 months ago
Labels: CLA Signed

#1 - ci: lint + unittest

Pull Request - State: closed - Opened by d4l3k 3 months ago
Labels: CLA Signed