Ecosyste.ms: Issues

An open API service that provides issue and pull request metadata for open source projects.

GitHub / Lightning-AI/pytorch-lightning issues and pull requests

#19842 - [App] Extend retry to 4xx except 400, 401, 403, 404

Pull Request - State: closed - Opened by lantiga 7 months ago - 1 comment
Labels: docs, app, pl

#19841 - Remove `numpy` dependencies in `src/lightning/pytorch`

Pull Request - State: closed - Opened by Peiffap 7 months ago - 2 comments
Labels: ci, community, pl

#19833 - Loading large models with fabric, FSDP and empty_init=True does not work

Issue - State: open - Opened by RuABraun 7 months ago - 1 comment
Labels: bug, needs triage

#19829 - How to incorporate vLLM in Lightning for LLM inference?

Issue - State: open - Opened by YuWang916 7 months ago - 3 comments
Labels: feature, needs triage

#19828 - TensorBoardLogger has the wrong epoch numbers much more than the fact

Issue - State: open - Opened by AlbireoBai 7 months ago - 2 comments
Labels: bug, needs triage, ver: 2.1.x

#19825 - [TPU] Fix test assertion error from artifacts

Pull Request - State: closed - Opened by awaelchli 7 months ago - 2 comments
Labels: ready, ci, tests, pl, run TPU

#19822 - Set `_choose_auto_accelerator` to `staticmethod`

Pull Request - State: closed - Opened by fedebotu 7 months ago - 2 comments
Labels: refactor, fabric, community, pl

#19820 - Add a warning when some of the modules are in eval mode before the training stage

Issue - State: closed - Opened by mszulc913 7 months ago - 3 comments
Labels: feature, discussion

#19819 - Fix resetting epoch loop restarting flag in LearningRateFinder

Pull Request - State: closed - Opened by clumsy 7 months ago - 4 comments
Labels: bug, tuner, community, pl

#19818 - Full validation after first microbatch when training after LearningRateFinder

Issue - State: closed - Opened by clumsy 7 months ago
Labels: bug, needs triage, ver: 2.2.x

#19817 - Multi-node Training with DDP stuck at "Initialize distributed..." on SLURM cluster

Issue - State: open - Opened by OswaldHe 7 months ago - 4 comments
Labels: bug, needs triage

#19814 - Fix moving keys to device in ResultCollection

Pull Request - State: closed - Opened by clumsy 7 months ago - 6 comments
Labels: bug, logging, community, pl

#19813 - Existing metric keys not moved to device after LearningRateFinder

Issue - State: open - Opened by clumsy 7 months ago
Labels: bug, tuner, logging, ver: 2.2.x

#19810 - Issue in Manual optimisation, during self.manual_backward call

Issue - State: closed - Opened by pranavrao-qure 7 months ago - 3 comments
Labels: question, precision: amp, ver: 2.0.x

#19808 - Fix `save_last` type annotation for ModelCheckpoint

Pull Request - State: closed - Opened by mariovas3 7 months ago - 2 comments
Labels: bug, callback: model checkpoint, community, pl

#19806 - Adding test for legacy checkpoint created with 2.2.5

Pull Request - State: closed - Opened by pl-ghost 7 months ago - 2 comments
Labels: checkpointing, tests, pl

#19805 - Update `LearningRateMonitor` docs and tests for `log_weight_decay`

Pull Request - State: closed - Opened by Peiffap 7 months ago - 2 comments
Labels: ready, community, pl

#19804 - Ignore parameters causing ValueError when dumping to YAML

Pull Request - State: closed - Opened by Callidior 7 months ago - 4 comments
Labels: bug, ready, community, pl

#19802 - FSDP Strategy checkpoint loading

Issue - State: closed - Opened by xin-w8023 7 months ago - 1 comment
Labels: duplicate, feature

#19799 - parsing issue with `save_last` parameter of `ModelCheckpoint`

Issue - State: closed - Opened by mariovas3 7 months ago
Labels: bug, callback: model checkpoint, ver: 2.2.x

#19794 - LOG issue

Issue - State: closed - Opened by jzhanghzau 7 months ago - 2 comments
Labels: question, progress bar: tqdm, ver: 2.1.x

#19792 - The packages such as libraries and models are not loading from files

Issue - State: closed - Opened by sinanLab 7 months ago - 1 comment
Labels: question

#19791 - Support wandb_logger.watch() when using LightningCLI

Issue - State: open - Opened by Boltzmachine 7 months ago - 2 comments
Labels: feature, needs triage

#19789 - Enable batch size finder for distributed strategies

Issue - State: open - Opened by clumsy 7 months ago - 2 comments
Labels: feature, tuner

#19783 - DDP strategy doesn't work for on_validation_epoch_end, always hang

Issue - State: closed - Opened by jzhanghzau 7 months ago - 4 comments
Labels: question, logging, ver: 2.1.x

#19773 - Support GAN based model training with deepspeed which need to setup fabric twice

Issue - State: open - Opened by npuichigo 7 months ago - 2 comments
Labels: feature, needs triage

#19772 - Sanitize object params before they get logged from argument-free classes

Issue - State: closed - Opened by V0XNIHILI 7 months ago
Labels: feature

#19771 - Sanitize argument-free object params before logging

Pull Request - State: closed - Opened by V0XNIHILI 7 months ago - 2 comments
Labels: logger, fabric, community, pl

#19768 - Script freezes when Trainer is instantiated

Issue - State: closed - Opened by PabloVD 7 months ago - 5 comments
Labels: question

#19766 - Does `DDPStrategy` support XLA?

Issue - State: closed - Opened by laserkelvin 7 months ago - 1 comment
Labels: question, strategy: ddp, ver: 2.1.x

#19764 - Resume from mid steps inside an epoch

Issue - State: open - Opened by xiaosuyu1997 7 months ago - 1 comment
Labels: feature, needs triage

#19761 - Apply the ignore of the `save_hyperparameters` function to args

Issue - State: open - Opened by doveppp 7 months ago - 2 comments
Labels: feature, help wanted, good first issue

#19754 - SaveConfigCallback.save_config is conflict with DDP

Issue - State: open - Opened by KeplerWang 7 months ago - 2 comments
Labels: bug, needs triage, ver: 2.1.x

#19753 - Unable to extend FSDPStrategy to HPU accelerator

Issue - State: closed - Opened by jyothisambolu 7 months ago - 8 comments
Labels: bug, needs triage

#19751 - Validation does not produce any output in PyTorch Lightning using my UNetTestModel

Issue - State: closed - Opened by lgy112112 7 months ago
Labels: bug, needs triage

#19743 - Log `TensorBoard` histograms

Issue - State: open - Opened by dominicgkerr 7 months ago - 3 comments
Labels: feature, needs triage

#19736 - `TensorBoardLogger` fails with remote FS (azure)

Issue - State: open - Opened by nkaenzig 8 months ago - 3 comments
Labels: bug, needs triage

#19733 - build(deps-dev): bump vite from 2.9.17 to 2.9.18 in /src/lightning/app/cli/react-ui-template/ui

Pull Request - State: closed - Opened by dependabot[bot] 8 months ago - 1 comment
Labels: app, dependencies, javascript

#19731 - Expected all tensors to be on the same device... using quantized model with deepspeed zero3

Issue - State: open - Opened by sanghyuk-choi 8 months ago - 5 comments
Labels: bug, needs triage

#19730 - ValueError: dictionary update sequence element #0 has length 1; 2 is required

Issue - State: closed - Opened by pau-altur 8 months ago - 3 comments
Labels: bug

#19714 - Mixing the order of `--config` and `fit` in LightningCLI can cause confusion

Issue - State: open - Opened by awaelchli 8 months ago - 7 comments
Labels: bug, feature, lightningcli, ver: 2.2.x

#19704 - Allowing FSDP strategy for hpu accelerator

Pull Request - State: closed - Opened by jyothisambolu 8 months ago - 1 comment
Labels: has conflicts, strategy: hpu (external), pl

#19703 - Enable dumping raw prof files in `AdvancedProfiler`

Pull Request - State: closed - Opened by clumsy 8 months ago - 8 comments
Labels: feature, profiler, community, pl

#19698 - Dump prof files from AdvancedProfiler

Issue - State: closed - Opened by clumsy 8 months ago - 1 comment
Labels: feature, help wanted, profiler

#19631 - openweb_trainer.py crashes after 6k iters

Issue - State: open - Opened by salykova 11 months ago - 4 comments
Labels: bug, callback: throughput

#19626 - FSDPStrategy error when automatic_optimization=False

Issue - State: open - Opened by carlosgjs 8 months ago - 4 comments
Labels: bug, needs triage

#19624 - IterableDataset with CORRECT length causes validation loop to be skipped

Issue - State: open - Opened by mattcleigh 8 months ago - 9 comments
Labels: question, data handling, ver: 2.2.x

#19620 - Log dict changed behavior: can't log train and validation metrics on the same plot

Issue - State: closed - Opened by mfoglio 8 months ago - 12 comments
Labels: question

#19617 - Resuming not correct when `max_steps` corresponds to the end of an epoch

Issue - State: open - Opened by awaelchli 8 months ago
Labels: bug, loops, ver: 2.2.x

#19612 - pytorch-ada

Issue - State: closed - Opened by moghadas76 8 months ago - 1 comment
Labels: question

#19609 - [WIP] Basic system check for troubleshooting multi-GPU issues

Pull Request - State: open - Opened by awaelchli 8 months ago
Labels: docs, ci, fabric, strategy: ddp, fun

#19604 - Deadlock when manually logging from on_train_epoch_end

Issue - State: open - Opened by idfah 8 months ago - 12 comments
Labels: bug, needs triage, ver: 2.1.x

#19598 - CUDA Error while training in DDP and using workers in dataloaders

Issue - State: open - Opened by JahooYoung 8 months ago - 11 comments
Labels: bug, fabric, torch.compile

#19596 - save_hyperparameter incorrectly infers parameters from superclass

Issue - State: open - Opened by klieret 8 months ago - 1 comment
Labels: bug

#19595 - Does `Trainer(devices=1)` use all CPUs?

Issue - State: closed - Opened by MaximilienLC 8 months ago - 7 comments
Labels: help wanted, good first issue, question, ver: 2.2.x

#19593 - Update reference to LitGPT example

Pull Request - State: closed - Opened by awaelchli 9 months ago - 1 comment
Labels: example, ready, docs, fabric

#19589 - [WAITING FOR PL CORE MAINTAINER OPINION] Bugfix/17958 multi optimizer step count behaviour

Pull Request - State: open - Opened by Anner-deJong 9 months ago - 3 comments
Labels: docs, pl

#19587 - ModelCheckpoint does not save any checkpoint

Issue - State: closed - Opened by pcwanan 9 months ago - 6 comments
Labels: bug, callback: model checkpoint, repro needed

#19583 - Added fix/workaround for validation issue after resumption

Pull Request - State: open - Opened by pimdh 9 months ago
Labels: bug, loops, community, pl

#19578 - Add ability for TQDMProgressBar to retain prior epoch training bars

Pull Request - State: closed - Opened by jojje 9 months ago - 1 comment
Labels: docs, progress bar: tqdm, community, pl

#19575 - EarlyStopping interfered by LearningRateFinder

Issue - State: open - Opened by zhf231298 9 months ago - 1 comment
Labels: bug, tuner, ver: 2.2.x

#19564 - Document `ddp_find_unused_parameters_true` in Fabric

Pull Request - State: closed - Opened by leng-yue 9 months ago - 2 comments
Labels: ready, docs, fabric, community

#19549 - Validation runs only for one iteration when restarting from checkpoint mid-epoch, wrongly reporting validation loss

Issue - State: open - Opened by pimdh 9 months ago - 3 comments
Labels: bug, help wanted, loops

#19546 - Support peft package for loading lora-adapted models

Pull Request - State: closed - Opened by moghadas76 9 months ago - 2 comments
Labels: app, pl, dependencies

#19544 - NCCL when trying to train on 2 nodes

Issue - State: open - Opened by waynemystir 9 months ago - 4 comments
Labels: bug, needs triage

#19540 - [WIP] Fail more nicely when an error occurs in `Fabric.rank_zero_first()`

Pull Request - State: closed - Opened by awaelchli 9 months ago
Labels: fabric

#19536 - WIP: Unscale the gradients before gradient clipping in manual optimization

Pull Request - State: open - Opened by awaelchli 9 months ago
Labels: bug, precision: amp, fun

#19522 - Switch to new package name lightning_data -> litdata

Pull Request - State: closed - Opened by tchaton 9 months ago - 1 comment
Labels: ready, data (external)

#19521 - Update tests for PyTorch 2.2.1

Pull Request - State: closed - Opened by awaelchli 9 months ago - 2 comments
Labels: ready, fabric, tests, pl

#19514 - Don't save pretrained submodules in checkpoint

Issue - State: closed - Opened by davidpicard 9 months ago - 3 comments
Labels: docs

#19510 - Add support for using the streaming dataloader in map or optimize for large scale inference

Pull Request - State: closed - Opened by tchaton 9 months ago - 1 comment
Labels: ready, data (external)

#19507 - Emulating multiple devices with a single GPU

Issue - State: closed - Opened by liecn 9 months ago - 4 comments
Labels: feature, question, ver: 2.0.x

#19504 - Flexible and easy to use HSDP setting

Pull Request - State: closed - Opened by Liyang90 9 months ago - 4 comments
Labels: feature, ready, fabric, strategy: fsdp, community, pl

#19502 - Allow flexible and easy to configure HSDP

Issue - State: closed - Opened by Liyang90 9 months ago - 8 comments
Labels: feature, discussion, strategy: fsdp

#19498 - Add Ascend NPU as a backend

Issue - State: closed - Opened by hipudding 9 months ago - 14 comments
Labels: feature, needs triage

#19494 - FSDP hybrid shard should checkpoint in a single node

Issue - State: open - Opened by carmocca 9 months ago - 3 comments
Labels: feature, checkpointing, strategy: fsdp

#19493 - Alternative mechanism to detect missing `Fabric.backward()` call

Pull Request - State: closed - Opened by awaelchli 9 months ago - 2 comments
Labels: bug, ready, fabric, fun

#19477 - ci: adding testing with M1 [1/2: without Fabric] [wip]

Pull Request - State: closed - Opened by Borda 9 months ago - 5 comments
Labels: ci, has conflicts

#19467 - "backward pass is invalid for module in evaluation mode" with deepspeed stage 3

Issue - State: open - Opened by olegsinavski 9 months ago - 6 comments
Labels: question

#19462 - FSDP checkpointing uses deprecated APIs with PyTorch 2.2

Issue - State: open - Opened by carmocca 9 months ago - 6 comments
Labels: bug, checkpointing, strategy: fsdp

#19460 - `batch_sampler.batch_size` is None with deepspeed and `DataLoader(batch_size=None)`

Issue - State: open - Opened by olegsinavski 9 months ago - 4 comments
Labels: bug, help wanted, strategy: deepspeed

#19450 - Minor correction release `2.2.0.post` [rebase & merge]

Pull Request - State: closed - Opened by Borda 9 months ago - 2 comments
Labels: ready, docs, ci, release, fabric, app, pl, dependencies, package, data (external)

#19443 - Enable support for Intel XPU devices (AKA Intel GPUs)

Pull Request - State: open - Opened by coreyjadams 9 months ago - 3 comments
Labels: fabric, pl, data (external)

#19435 - Tutorial on running Lightning App on Kubernetes

Issue - State: closed - Opened by svnv-svsv-jm 9 months ago
Labels: docs, app

#19427 - calling iter twice messes up dataloaders with queues

Issue - State: open - Opened by ben-da6 9 months ago - 4 comments
Labels: bug, data handling, loops, ver: 2.1.x

#19403 - Extra training step/global_step incrementation when resuming training from a checkpoint

Issue - State: open - Opened by gnikolenyi 10 months ago - 3 comments
Labels: bug, needs triage, ver: 2.1.x

#19354 - Support `DDP(static_graph=True)` and gradient accumulation

Issue - State: open - Opened by nousr 10 months ago - 3 comments
Labels: help wanted, strategy: ddp

#19331 - How to get gpu global rank in the dataloader model?

Issue - State: closed - Opened by FakeEnd 10 months ago - 2 comments
Labels: question

#19322 - stats logging in "on_train_epoch_end" ends up on wrong progress bar

Issue - State: closed - Opened by jojje 10 months ago - 5 comments
Labels: bug, logging, ver: 2.1.x

#19308 - [WIP]add npu support

Pull Request - State: closed - Opened by hipudding 10 months ago - 7 comments
Labels: pl

#19297 - feat(integrations): Improve checkpoint functionality of `WandbLogger`

Pull Request - State: open - Opened by ash0ts 10 months ago - 6 comments
Labels: feature, logger: wandb, pl, dependencies

#19271 - RuntimeError: mat1 and mat2 shapes cannot be multiplied

Issue - State: closed - Opened by anshkumar 10 months ago
Labels: bug, ver: 2.1.x, repro needed, precision: bnb

#19268 - Fix: automatically upload config file when using WandbLogger with PTL CLI

Pull Request - State: open - Opened by ayulockin 10 months ago - 4 comments
Labels: logger, has conflicts, lightningcli, community, pl

#19256 - NCCL timeout (or GPU OOMs) when using Wandb + configure_model with passing a factory + save_hyperparameters + large models

Issue - State: open - Opened by olegsinavski 10 months ago - 3 comments
Labels: bug, needs triage, ver: 2.1.x

#19235 - Support gradient clipping by norm with FSDP

Issue - State: open - Opened by awaelchli 11 months ago - 2 comments
Labels: feature, strategy: fsdp

#19177 - Reset trainer variable `should_stop` when `fit` is called

Pull Request - State: open - Opened by ryan597 11 months ago - 5 comments
Labels: community, pl

#19149 - Saving Only LORA Weights in PyTorch Lightning with HuggingFace PEFT

Issue - State: closed - Opened by NingJinzhong 11 months ago - 3 comments
Labels: question, checkpointing

#19138 - Wrong PositionalEncoding in the Transformer example

Issue - State: closed - Opened by Galaxy-Husky 11 months ago - 7 comments
Labels: bug, help wanted, good first issue, example, ver: 2.2.x