Ecosyste.ms: Issues
An open API service that provides issue and pull request metadata for open source projects.
GitHub / Lightning-AI/pytorch-lightning issues and pull requests
#19970 - Make checkpoint saving fully atomic
Issue -
State: closed - Opened by radomirgr 5 months ago
- 3 comments
Labels: feature, help wanted, checkpointing, ver: 2.2.x
#19969 - Fix minor typo in Trainer's documentation
Pull Request -
State: closed - Opened by SamuelLarkin 5 months ago
- 1 comment
Labels: docs, pl
#19968 - Update README.md
Pull Request -
State: closed - Opened by williamFalcon 5 months ago
- 1 comment
#19967 - docs: Bump HPU ref `1.6.0.rc0`
Pull Request -
State: closed - Opened by pl-ghost 5 months ago
- 2 comments
Labels: docs, pl
#19966 - min_epochs and EarlyStopping in conflict
Issue -
State: open - Opened by timlod 5 months ago
- 2 comments
Labels: bug, needs triage
#19965 - Loading saved config file fails because of InterpolationMode
Issue -
State: closed - Opened by iulialexandra 5 months ago
- 1 comment
Labels: bug, needs triage
#19964 - Documentation: writing custom samplers compatible with multi GPU training
Issue -
State: open - Opened by fteufel 5 months ago
Labels: help wanted, docs
#19963 - Nicer logging of list of dicts for hyper parameters
Pull Request -
State: open - Opened by vork 5 months ago
- 1 comment
Labels: fabric
#19962 - modified num_replicas=self.world_size
Pull Request -
State: open - Opened by arjunagarwal899 5 months ago
Labels: pl
#19961 - Returning num_replicas=world_size when using distributed sampler in ddp
Issue -
State: open - Opened by arjunagarwal899 5 months ago
- 3 comments
Labels: duplicate, feature, help wanted, distributed, strategy: ddp
#19959 - Removing numpy package from src/lightning
Pull Request -
State: closed - Opened by Bhavay-2001 5 months ago
- 3 comments
Labels: fabric
#19958 - Removing numpy package dependency in src/lightning package
Pull Request -
State: closed - Opened by Bhavay-2001 5 months ago
Labels: fabric
#19957 - Logging Hyperparameters for list of dicts
Issue -
State: open - Opened by vork 5 months ago
Labels: bug, needs triage, ver: 2.2.x
#19956 - ValueError: range() arg 3 must not be zero - Need to Identify the Root Cause
Issue -
State: closed - Opened by YuyaWake 5 months ago
- 1 comment
Labels: bug, needs triage, ver: 2.0.x
#19955 - Adam optimizer is slower after loading model from checkpoint
Issue -
State: closed - Opened by radomirgr 5 months ago
- 27 comments
Labels: bug, help wanted, optimization, performance
#19954 - Release 2.3.0
Pull Request -
State: closed - Opened by awaelchli 5 months ago
- 2 comments
Labels: ready, release, fabric, pl, package, data
#19953 - Make TensorBoardLogger default version creation ascii sortable
Issue -
State: open - Opened by jdenhof 5 months ago
Labels: feature, needs triage
#19952 - Class name displayed incorrectly
Issue -
State: open - Opened by HelixPiano 5 months ago
Labels: docs, needs triage
#19951 - docker image doesn't have `pytorch_lightning`
Issue -
State: closed - Opened by grisaitis 5 months ago
- 1 comment
#19950 - Autocast "cache_enabled=True" failing
Issue -
State: open - Opened by thomassajot 5 months ago
- 1 comment
Labels: bug, needs triage
#19949 - Use lr setter callback instead of `attr_name` in `LearningRateFinder` and `Tuner`
Issue -
State: open - Opened by arthurdjn 6 months ago
- 4 comments
Labels: feature, needs triage
#19948 - ci/docs: enable dispatch build without warning as errors
Pull Request -
State: closed - Opened by Borda 6 months ago
- 1 comment
Labels: ready, docs, ci
#19947 - Removing numpy requirement from all files in examples/pytorch/domain_templates
Pull Request -
State: closed - Opened by Bhavay-2001 6 months ago
- 6 comments
Labels: example, refactor, community, pl
#19946 - Fix strict loading from distributed checkpoints vs PyTorch nightly
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: bug, checkpointing, fabric, fun
#19945 - Avoid casting with `numpy()` in `multiprocessing.py`
Issue -
State: closed - Opened by Peiffap 6 months ago
- 1 comment
Labels: help wanted, refactor
#19943 - `make test` fails with `subprocess-exited-with-error`: `AssertionError: Could not find cmake executable!`
Issue -
State: open - Opened by Peiffap 6 months ago
Labels: bug, needs triage, ver: 2.2.x
#19942 - Replace usage of `grep -P` with `perl` in `run_standalone_tests.sh`
Pull Request -
State: closed - Opened by Peiffap 6 months ago
Labels: ready, ci, community
#19941 - fix(docs): fix broken link to ensure the docs can be built
Pull Request -
State: closed - Opened by yurijmikhalevich 6 months ago
- 1 comment
Labels: ready, docs, app
#19940 - Custom batch selection for logging
Issue -
State: open - Opened by bhosalems 6 months ago
- 3 comments
Labels: feature, needs triage
#19939 - Callback for logging forward, backward and update time
Issue -
State: open - Opened by joshim5 6 months ago
Labels: feature, needs triage
#19938 - `grep: Invalid option -- P` when running `./tests/run_standalone_tests.sh` on macOS
Issue -
State: closed - Opened by Peiffap 6 months ago
- 1 comment
Labels: bug, help wanted, tests, ver: 2.2.x
#19937 - Fix typos in CONTRIBUTING.md
Pull Request -
State: closed - Opened by Peiffap 6 months ago
Labels: ci
#19936 - FileNotFoundError: [Errno 2] No such file or directory tfevents file
Issue -
State: closed - Opened by bhosalems 6 months ago
- 1 comment
Labels: bug, ver: 1.8.x
#19935 - Simplify loading full checkpoint in ModelParallelStrategy
Pull Request -
State: closed - Opened by awaelchli 6 months ago
Labels: fabric
#19933 - is `lightning` and `pytorch_lightning` the same?
Issue -
State: closed - Opened by stephanielees 6 months ago
- 4 comments
Labels: bug, needs triage
#19932 - Continuing training with `ckpt_path="last"` and MLFLowLogger fails in distributed setting
Issue -
State: open - Opened by selflein 6 months ago
Labels: bug, logger: mlflow
#19931 - Destroy process group in atexit handler
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: ready, fabric, pl
#19929 - ModelCheckpoint does not work when using the monitor
Issue -
State: closed - Opened by QianhangFeng 6 months ago
- 1 comment
Labels: bug, callback: model checkpoint, repro needed, ver: 2.2.x
#19926 - Update FlopCounterMode usage in throughput.py
Pull Request -
State: closed - Opened by IvanYashchuk 6 months ago
- 2 comments
Labels: ready, fabric
#19925 - Update code owners file
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: ci
#19923 - Lightning Fabric: generic method to get the full state dict
Issue -
State: open - Opened by Xynonners 6 months ago
Labels: feature, needs triage
#19922 - Update code owners file
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: docs, ci, pl
#19921 - forward method missing required positional argument ‘masks’ in PyTorch Lightning
Issue -
State: closed - Opened by YuyaWake 6 months ago
- 6 comments
Labels: question
#19920 - The training process will stop unexpectedly
Issue -
State: open - Opened by 5huanghuai 6 months ago
- 1 comment
Labels: bug, needs triage, repro needed
#19919 - XLA FSDP strategy has undocumented requirement for using activation checkpointing
Issue -
State: open - Opened by ebreck 6 months ago
Labels: bug, needs triage
#19918 - Disable skipping training step in distributed training
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: ready, breaking change, loops, pl
#19917 - Update docstring for `self.log` about keys in distributed training
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: ready, docs, pl
#19916 - KeyboardInterrupt raises an exception which results in a zero exit code
Issue -
State: open - Opened by amarckal 6 months ago
Labels: bug, help wanted, environment: slurm, ver: 2.0.x, ver: 2.1.x, ver: 2.2.x
#19915 - Check if CometLogger experiment is alive
Pull Request -
State: closed - Opened by EtayLivne 6 months ago
- 1 comment
Labels: logger: comet, community, pl
#19913 - Add Studio badge to tensor parallel docs
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: ready, docs, fabric, pl
#19912 - LR_FIND() does not work in DDP anymore, RuntimeError: No backend type associated with device type cpu
Issue -
State: open - Opened by asusdisciple 6 months ago
Labels: bug, needs triage
#19911 - Include VertexAI cluster environment for Fabric
Pull Request -
State: open - Opened by miguelalba96 6 months ago
Labels: docs, fabric, pl
#19910 - element 0 of tensors does not require grad and does not have a grad_fn in "test_step" and "validation_step"
Issue -
State: closed - Opened by SongJgit 6 months ago
- 4 comments
Labels: working as intended
#19909 - "save_last" could not save a complete checkpoint
Issue -
State: open - Opened by kxgong 6 months ago
- 1 comment
Labels: bug, needs triage, ver: 1.9.x
#19907 - I think it's deadly necessary to add docs or tutorials for handling the case when We return multiple loaders in test_dataloaders() method? I think it
Issue -
State: open - Opened by onbigion13 6 months ago
Labels: docs, needs triage
#19906 - Add functionality to save nn.Modules supplied as arguments when initialising LightningModule
Issue -
State: closed - Opened by tom-hehir 6 months ago
Labels: feature, needs triage
#19905 - AttributeError: type object 'Trainer' has no attribute 'add_argparse_args'
Issue -
State: closed - Opened by Park-yebin 6 months ago
- 1 comment
Labels: question, working as intended, ver: 2.0.x, ver: 2.1.x
#19904 - Remove unknown `[metadata]` table from `pyproject.toml`
Pull Request -
State: closed - Opened by ringohoffman 6 months ago
- 1 comment
Labels: ready, community, package
#19903 - CUDA unknown error
Issue -
State: closed - Opened by aniketmaurya 6 months ago
- 1 comment
Labels: bug, needs triage
#19902 - Error for unsupported precision types with ModelParallelStrategy
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: feature, ready, fabric, pl
#19900 - Creating A Second Comet Logger Disables The First
Issue -
State: closed - Opened by EtayLivne 6 months ago
Labels: bug, needs triage, ver: 2.1.x
#19899 - (10/10) Support 2D Parallelism - Port Fabric docs to PL
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: ready, docs, fabric, pl
#19898 - Fabric: Incorrect `num_replicas` (ddp/fsdp) when number of GPUs on each node is different
Issue -
State: open - Opened by shaibagon 6 months ago
- 2 comments
Labels: bug, needs triage
#19897 - docs: prune unused `linkcode`
Pull Request -
State: closed - Opened by Borda 6 months ago
- 1 comment
Labels: ready, docs, fabric, app
#19896 - docs: fix link to CLIP
Pull Request -
State: closed - Opened by Borda 6 months ago
- 1 comment
Labels: ready, docs, app
#19895 - Error when fast_dev_run=True or num_sanity_val_steps=0 and using torchmetrics MetricTracker
Issue -
State: open - Opened by MoustHolmes 6 months ago
Labels: bug, needs triage, ver: 2.2.x
#19894 - Remove the requirement for FSDPStrategy subclasses to only support GPU
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: feature, ready, fabric
#19893 - Patch release 2.2.5
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 4 comments
Labels: bug, ready, docs, release, fabric, app (removed), pl, dependencies, package
#19892 - Dataloader on multi-gpu jobs only surpport to manipulate on local_rank=0, is there a way tom manipulate every device?
Issue -
State: open - Opened by renren7111 6 months ago
Labels: bug, needs triage, ver: 2.2.x
#19891 - MisconfigurationException: Do not set `gradient_accumulation_steps` in the DeepSpeed config
Issue -
State: open - Opened by mxkrn 6 months ago
Labels: bug, needs triage
#19890 - Is "Prepare a config file for the CLI" out of date?
Issue -
State: open - Opened by zengchang233 6 months ago
- 2 comments
Labels: bug, needs triage
#19889 - MLFlowLogger fails when logging hyperparameters as Trainer already does automatically
Issue -
State: open - Opened by CristoJV 6 months ago
Labels: bug, needs triage, ver: 2.1.x
#19888 - (9/n) Support 2D Parallelism - Remaining Checkpoint Logic
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: feature, ready, checkpointing, fabric, pl
#19887 - (8/n) Support 2D Parallelism - 2D Parallel Fabric Docs
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: ready, docs, fabric, pl
#19886 - Fix state dict loading in bitsandbytes plugin when checkpoint is already quantized
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: bug, ready, fabric, pl, precision: bnb
#19884 - (7/n) Support 2D Parallelism - TP Fabric Docs
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: ready, docs, fabric, pl
#19883 - Lightning stalls with 2 GPUs on 1 node with SLURM (and apptainer)
Issue -
State: open - Opened by sorenwacker 6 months ago
- 3 comments
Labels: bug, needs triage
#19882 - Enable loss-parallel in example
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: example, ready, fabric, pl
#19881 - WIP: Integrate Collective into strategies
Pull Request -
State: open - Opened by awaelchli 6 months ago
Labels: refactor, fabric
#19880 - can't fit with ddp_notebook on a Vertex AI Workbench instance (CUDA initialized)
Issue -
State: open - Opened by jasonbrancazio 6 months ago
Labels: bug, needs triage
#19879 - (6/n) Support 2D Parallelism - Trainer example
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 1 comment
Labels: example, ready, fabric, pl
#19878 - (5/n) Support 2D Parallelism in Lightning Trainer
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: feature, ready, fabric, pl
#19877 - Remove redundant code to set the device on the LightningModule
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: ready, refactor, code quality, pl
#19874 - Using the MLflow logger produces Inconsistent metric plots
Issue -
State: open - Opened by gboeer 6 months ago
- 2 comments
Labels: bug, needs triage
#19873 - AttributeError: module 'pytorch_lightning.callbacks' has no attribute 'ProgressBarBase'. Did you mean: 'ProgressBar'?
Issue -
State: closed - Opened by carusyte 6 months ago
- 1 comment
Labels: question, progress bar: tqdm
#19872 - (4/n) Support 2D Parallelism - Loading optimizer states correctly
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: feature, ready, checkpointing, fabric
#19870 - (3/n) Support 2D Parallelism - Efficient loading of full-state checkpoints
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 2 comments
Labels: ready, refactor, fabric, performance, pl
#19869 - Error loading a saved model to run inference (using ddp_notebook strategy)
Issue -
State: open - Opened by carlos-havier 6 months ago
- 1 comment
Labels: bug, needs triage, ver: 2.1.x
#19868 - Possible bug in recognizing `mps` accelerator even though PyTorch seems to register the `mps` device?
Issue -
State: open - Opened by adam2392 6 months ago
- 1 comment
Labels: bug, needs triage
#19867 - Added some more potentially robust ways to do learning rate tuning
Pull Request -
State: open - Opened by varchasgopalaswamy 6 months ago
Labels: pl
#19865 - Resume training, how to change learning scheduler?
Issue -
State: open - Opened by jzhanghzau 6 months ago
- 1 comment
Labels: bug, needs triage, ver: 2.2.x
#19862 - Add dog has an error: FileNotFoundError:
Issue -
State: closed - Opened by BaiYuanxi-dev 6 months ago
- 1 comment
Labels: question
#19858 - Dynamically link arguments in `LightningCLI`?
Issue -
State: closed - Opened by EthanMarx 6 months ago
- 2 comments
Labels: feature, lightningcli
#19857 - Update Lightning Cloud 0.5.69
Pull Request -
State: closed - Opened by tchaton 6 months ago
- 1 comment
Labels: ready, app, dependencies
#19856 - Reduce queue fetching
Pull Request -
State: closed - Opened by tchaton 6 months ago
- 1 comment
Labels: ready, app
#19854 - Is the Lightning App deprecated? (Lightning App doc is not found)
Issue -
State: closed - Opened by guyleaf 6 months ago
- 1 comment
Labels: bug, app
#19853 - RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [68]] is at version 3; expected version 2 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Issue -
State: open - Opened by ASAmbitious 6 months ago
- 1 comment
Labels: bug, needs triage, ver: 2.2.x
#19852 - (2/n) Support 2D Parallelism - Distributed Checkpoints
Pull Request -
State: closed - Opened by awaelchli 6 months ago
- 3 comments
Labels: feature, ready, checkpointing, fabric, pl
#19847 - Fix typo on `estimated_stepping_batches` property
Pull Request -
State: closed - Opened by afspies 7 months ago
- 2 comments
Labels: docs, community, pl
#19843 - docs: Bump HPU ref `1.5.0`
Pull Request -
State: closed - Opened by pl-ghost 7 months ago
- 1 comment
Labels: docs, pl