pytorch/torchtune issues and pull requests

#2380 - Model weights conversion failed

Issue - State: open - Opened by xeasonx 7 days ago

#2379 - parallelize_module() parallelism module fqn wildcard doesn't work for llama3.2

Issue - State: open - Opened by acisseJZhong 7 days ago

#2378 - Add tests and implementation for disabling dropout layers in models

Pull Request - State: open - Opened by Ankur-singh 7 days ago - 1 comment
Labels: CLA Signed

#2377 - Fix Qwen config

Pull Request - State: closed - Opened by acisseJZhong 7 days ago - 1 comment
Labels: CLA Signed

#2376 - fix: Moved dev deps from optional-dependencies to dependency-groups

Pull Request - State: open - Opened by bogdansalyp 7 days ago - 1 comment
Labels: CLA Signed

#2375 - pyproject.toml wrong dev deps organization

Issue - State: open - Opened by bogdansalyp 7 days ago - 1 comment

#2374 - update in docs mentions of "ft-" before ckpt name, since we removed it

Issue - State: open - Opened by felipemello1 7 days ago
Labels: community help wanted

#2373 - expose max-autotune in configs for better perf

Issue - State: open - Opened by felipemello1 7 days ago
Labels: community help wanted

#2372 - Disable `reshard_after_forward` for last transformer layer FSDP param group

Issue - State: open - Opened by SalmanMohammadi 8 days ago
Labels: community help wanted

#2371 - Adding new role "Tool" to support Llama 3.3 models

Issue - State: open - Opened by init27 8 days ago - 4 comments

#2370 - [WIP]: Get rid of optim_bwd checks via wrapper.

Pull Request - State: open - Opened by krammnic 9 days ago - 4 comments
Labels: CLA Signed

#2369 - Misleading import error message for torchao

Issue - State: open - Opened by bogdansalyp 9 days ago

#2368 - fix: torch and torchvision import check

Pull Request - State: open - Opened by bogdansalyp 9 days ago - 2 comments
Labels: CLA Signed

#2367 - feat: Added cfg.cudnn_deterministic_mode flag

Pull Request - State: open - Opened by bogdansalyp 9 days ago - 5 comments
Labels: CLA Signed

#2366 - Refactor load_image to return torch.Tensor instead of PIL.Image

Pull Request - State: open - Opened by Ankur-singh 10 days ago - 2 comments
Labels: CLA Signed

#2365 - Implements MLFlowLogger

Pull Request - State: open - Opened by nathan-az 10 days ago - 11 comments
Labels: CLA Signed

#2364 - Add torchdata Parallel Packer for faster startup

Pull Request - State: open - Opened by andrewkho 10 days ago - 1 comment
Labels: CLA Signed

#2363 - readme updates for full DPO distributed recipe

Pull Request - State: closed - Opened by ebsmothers 10 days ago - 1 comment
Labels: CLA Signed

#2362 - [Fix Test] Fix failed generation test by pining pytorch nightlies

Pull Request - State: closed - Opened by acisseJZhong 10 days ago - 1 comment
Labels: CLA Signed

#2361 - How to change the datasets in JSON format?

Issue - State: closed - Opened by kailashg26 10 days ago - 3 comments

#2360 - Resume from checkpoint broken with distributed optimizer-in-backward

Issue - State: open - Opened by ebsmothers 10 days ago

#2359 - Resume from checkpoint with distributed optimizer-in-backward repro

Pull Request - State: open - Opened by ebsmothers 10 days ago - 1 comment
Labels: CLA Signed

#2358 - Add mistral small

Pull Request - State: open - Opened by AndrewMead10 11 days ago - 1 comment
Labels: CLA Signed

#2357 - Add max-autotune try/except if flex attn breaks

Pull Request - State: closed - Opened by felipemello1 11 days ago - 3 comments
Labels: CLA Signed

#2356 - Generic classifier builder

Pull Request - State: open - Opened by SalmanMohammadi 11 days ago - 1 comment
Labels: CLA Signed

#2355 - [WIP] Support Continual Pretraining Multi Dataset using Streaming

Pull Request - State: open - Opened by mostafaelhoushi 11 days ago - 1 comment
Labels: CLA Signed

#2354 - Remove "ft-" prefix from checkpoint shards.

Pull Request - State: closed - Opened by EugenHotaj 11 days ago - 2 comments
Labels: CLA Signed

#2353 - Add a `disable_dropout` utility fn

Issue - State: open - Opened by SalmanMohammadi 12 days ago - 4 comments
Labels: good first issue, community help wanted, better engineering

#2352 - Incorrect Default Config File Paths for Llama 3.1 8B and Qwen 2.5 7B Models

Issue - State: open - Opened by MaxHastings 12 days ago - 1 comment

#2351 - Fix saving adapter weights after disabling DSD

Pull Request - State: closed - Opened by acisseJZhong 12 days ago - 2 comments
Labels: CLA Signed

#2350 - HF tokenizers: initial base tokenizer support

Pull Request - State: open - Opened by ebsmothers 12 days ago - 2 comments
Labels: CLA Signed

#2349 - Rework recipes section of README and simplify models ref

Pull Request - State: open - Opened by joecummings 12 days ago - 2 comments
Labels: CLA Signed

#2348 - Update README for multinode

Pull Request - State: closed - Opened by joecummings 12 days ago - 1 comment
Labels: CLA Signed

#2347 - Add multi node training to README

Pull Request - State: closed - Opened by joecummings 12 days ago - 1 comment
Labels: CLA Signed

#2346 - [Bug Fix]Disable DSD for saving ckpt

Pull Request - State: closed - Opened by acisseJZhong 12 days ago - 2 comments
Labels: CLA Signed

#2345 - "ft-" prefix for finetuned checkpoints

Issue - State: closed - Opened by EugenHotaj 12 days ago - 3 comments

#2344 - Discussion: Update dataloader to skip rows that dont require training

Issue - State: open - Opened by felipemello1 13 days ago - 4 comments
Labels: discussion, best practice, triage review

#2343 - Traj dpo

Pull Request - State: open - Opened by Vattikondadheeraj 13 days ago - 3 comments

#2342 - Update to proper EOS ids for Qwen2 and Qwen2.5

Pull Request - State: closed - Opened by joecummings 13 days ago - 3 comments
Labels: CLA Signed

#2341 - CEWithChunkedOutputLoss does not check division by zero

Issue - State: open - Opened by pocca2048 14 days ago - 6 comments
Labels: discussion, triaged

#2340 - Feature request: GRPO support

Issue - State: open - Opened by tikikun 14 days ago - 5 comments

#2339 - DistributedSampler has the same seed randomization

Issue - State: closed - Opened by bogdansalyp 14 days ago - 3 comments

#2338 - Seed: null isn't random

Issue - State: open - Opened by bogdansalyp 14 days ago - 1 comment
Labels: bug, triaged

#2337 - Qwen Tokenizer Excludes Last Assistant EOT Token

Issue - State: closed - Opened by roeetal 14 days ago - 2 comments
Labels: bug, triaged

#2336 - Update PT pin for modules/_export

Pull Request - State: closed - Opened by Jack-Khuu 14 days ago - 5 comments
Labels: CLA Signed, fb-exported

#2335 - Seed is not applied for DPO recipes

Issue - State: open - Opened by bogdansalyp 14 days ago - 3 comments
Labels: bug, triaged

#2334 - Apply gradient accumulation fix to DPO/PPO recipes

Issue - State: open - Opened by bogdansalyp 14 days ago - 1 comment

#2333 - Distributed DPO loss normalization by amount of tokens

Issue - State: open - Opened by bogdansalyp 14 days ago - 2 comments

#2332 - Loss shouldn't be averaged within one grad_acc step

Issue - State: closed - Opened by bogdansalyp 14 days ago - 1 comment

#2331 - added `tie_word_embeddings` to llama3_2 models

Pull Request - State: closed - Opened by jingzhaoou 15 days ago - 4 comments
Labels: CLA Signed

#2330 - TP + FSDP distributed training (full finetuning)

Pull Request - State: closed - Opened by acisseJZhong 15 days ago - 2 comments
Labels: CLA Signed

#2329 - Wandb charts show time (minutes), but I want seconds.

Issue - State: closed - Opened by kailashg26 15 days ago - 1 comment

#2328 - Add distributed inference for llama3.2 vision

Pull Request - State: open - Opened by acisseJZhong 16 days ago - 2 comments
Labels: CLA Signed

#2327 - try fix a bug for symbolic check

Pull Request - State: open - Opened by ywq880611 17 days ago - 3 comments
Labels: CLA Signed

#2326 - [Very WiP] R1-Style distributed GRPO

Pull Request - State: open - Opened by RedTachyon 17 days ago - 20 comments
Labels: CLA Signed

#2325 - FIRE Relative Positional Encodings

Issue - State: open - Opened by kaddu341 17 days ago - 3 comments

#2324 - Grpo & verifiable rewards dataset

Pull Request - State: closed - Opened by ianbarber 17 days ago - 3 comments
Labels: CLA Signed

#2323 - Reading TorchProfiler after run

Issue - State: open - Opened by fabiogeraci 18 days ago - 6 comments

#2322 - Use checkout@v4 / upload@v4 for docs build

Pull Request - State: closed - Opened by joecummings 18 days ago - 1 comment
Labels: CLA Signed

#2321 - Refactor validate missing for LoRA + deprecate param utility

Pull Request - State: open - Opened by RdoubleA 18 days ago - 2 comments
Labels: CLA Signed

#2320 - Classifiers (reward models) in torchtune

Issue - State: open - Opened by EugenHotaj 18 days ago - 4 comments

#2319 - How to run torchtune on AMD Instinct MI300X

Issue - State: open - Opened by kailashg26 19 days ago - 2 comments

#2318 - [WIP] 2D parallelism for training

Pull Request - State: closed - Opened by joecummings 19 days ago - 2 comments
Labels: CLA Signed

#2317 - fix state dict hook for early fusion models

Pull Request - State: closed - Opened by acisseJZhong 19 days ago - 1 comment
Labels: CLA Signed

#2316 - Call `get_world_size_and_rank` ONCE

Issue - State: open - Opened by joecummings 19 days ago

#2315 - Rename and document `cleanup_before_training`

Issue - State: open - Opened by joecummings 19 days ago

#2314 - Disable DSD and fix bitsandbytes test

Pull Request - State: closed - Opened by RdoubleA 19 days ago - 2 comments
Labels: CLA Signed

#2313 - Revert DSD to fix breakages

Pull Request - State: closed - Opened by ebsmothers 19 days ago - 1 comment
Labels: CLA Signed

#2312 - Investigate the optimal scenario in which to use ``torch_set_num_thread()``

Issue - State: open - Opened by joecummings 19 days ago

#2311 - Text-to-Image Dataset and Flux Transform

Pull Request - State: open - Opened by calvinpelletier 20 days ago - 1 comment
Labels: CLA Signed

#2310 - Unable to reproduce QAT results from Blog

Issue - State: open - Opened by AbhinavDutta 20 days ago - 9 comments

#2309 - [ez] Add output_dir field to a couple configs

Pull Request - State: closed - Opened by ebsmothers 20 days ago - 1 comment
Labels: CLA Signed

#2308 - [EZ] Only log deprecation warning on rank zero

Pull Request - State: closed - Opened by RdoubleA 20 days ago - 1 comment
Labels: CLA Signed

#2307 - Differing component implementation logic across recipes

Issue - State: open - Opened by EugenHotaj 20 days ago - 4 comments
Labels: bug, best practice, better engineering, triaged

#2306 - Support for Janus-Pro series of model

Issue - State: closed - Opened by Ankur-singh 20 days ago - 2 comments

#2305 - Update LoRA DPO distributed recipe

Issue - State: closed - Opened by SalmanMohammadi 20 days ago

#2304 - Fix stop tokens in PPO

Pull Request - State: closed - Opened by RedTachyon 21 days ago - 8 comments
Labels: CLA Signed

#2303 - Move from PIL to torchvision.io.decode_image

Issue - State: open - Opened by ebsmothers 21 days ago - 8 comments
Labels: best practice, community help wanted

#2302 - Flux Model

Pull Request - State: open - Opened by calvinpelletier 21 days ago - 1 comment
Labels: CLA Signed

#2301 - Multinode support in torchtune

Pull Request - State: closed - Opened by joecummings 21 days ago - 5 comments
Labels: CLA Signed

#2300 - Missing `<|begin_of_text|>` Token in `Llama3Tokenizer`

Issue - State: open - Opened by seungjun-green 22 days ago - 3 comments

#2299 - Step based checkpointing

Issue - State: closed - Opened by xTRam1 24 days ago - 1 comment
Labels: triage review

#2298 - [WIP] 'tune cat' command for pretty printing configuration files

Pull Request - State: closed - Opened by Ankur-singh 24 days ago - 7 comments
Labels: CLA Signed

#2297 - Training never starts - stuck after Loss is intialized

Issue - State: closed - Opened by datamancerai 25 days ago - 12 comments
Labels: discussion, triaged

#2296 - Tokens per second calculation

Issue - State: open - Opened by EugenHotaj 25 days ago - 8 comments
Labels: best practice, triage review

#2295 - Tune download command not found

Issue - State: closed - Opened by shaunakjoshi12 25 days ago - 3 comments

#2294 - How to checkpoint every N steps?

Issue - State: closed - Opened by tginart 26 days ago - 1 comment

#2293 - Remove deprecated components for 0.6.0

Pull Request - State: closed - Opened by RdoubleA 26 days ago - 1 comment
Labels: CLA Signed

#2292 - Custom DPO losses support

Pull Request - State: open - Opened by krammnic 26 days ago - 8 comments
Labels: CLA Signed

#2291 - Proper prefix handling in EarlyFusion sd hooks

Pull Request - State: closed - Opened by ebsmothers 26 days ago - 3 comments
Labels: CLA Signed

#2290 - Removing `SimPOLoss`

Pull Request - State: closed - Opened by SalmanMohammadi 27 days ago - 1 comment
Labels: CLA Signed

#2288 - Roadmap for distributed recipes using NPU as a backend

Issue - State: open - Opened by Nicorgi 27 days ago

#2287 - deepseek r1 support?

Issue - State: open - Opened by johnnynunez 27 days ago - 10 comments
Labels: enhancement, triage review

#2286 - Documentation for evaluation on a custom dataset for a custom task

Issue - State: open - Opened by karrtikiyer 28 days ago - 16 comments
Labels: bug, documentation, discussion, triage review

#2285 - Saving multiple checkpoints per epoch

Issue - State: open - Opened by EugenHotaj 28 days ago - 2 comments
Labels: enhancement, triaged

#2284 - Add masking strategies to message transforms

Pull Request - State: open - Opened by supreethmanyam 28 days ago - 3 comments
Labels: CLA Signed

#2283 - Inconsistent initialization of RoPE embedding across component builders

Issue - State: open - Opened by Ankur-singh 28 days ago
Labels: best practice, better engineering

#2282 - Update model builders

Pull Request - State: closed - Opened by Ankur-singh 29 days ago - 11 comments
Labels: CLA Signed

#2281 - [RFC] Proposal for `tune cat` Command

Issue - State: closed - Opened by Ankur-singh 29 days ago - 2 comments
Labels: rfc, discussion

#2280 - Roadmap for other parallelisms

Issue - State: open - Opened by rahul-sarvam 30 days ago - 6 comments
Labels: discussion, triaged
