NVIDIA/TransformerEngine issues and pull requests

#1460 - Fix DAct input ordering of gradient input and activation input

Pull Request - State: open - Opened by jberchtold-nvidia about 18 hours ago - 1 comment

#1459 - How to set NVTE_FWD/BWD_LAYERNORM_SM_MARGIN?

Issue - State: open - Opened by cailun01 1 day ago - 1 comment

#1458 - There is an issue with building cpp test.

Issue - State: closed - Opened by soohyung-jang 2 days ago - 1 comment

#1457 - Fix MXFP8 normalization

Pull Request - State: closed - Opened by ptrendx 2 days ago
Labels: 2.0.0

#1456 - Parallel Cross Entropy using online softmax

Pull Request - State: open - Opened by sanandaraj5597 2 days ago
Labels: enhancement

#1455 - [PyTorch] Remove MXFP8 scale-inv padding in MXFP8 all-gather

Pull Request - State: closed - Opened by timmoon10 2 days ago
Labels: bug, 2.0.0

#1454 - [JAX] THD ring attention

Pull Request - State: open - Opened by zlsh80826 2 days ago - 2 comments

#1453 - FusedAdam optimizer doesn't have `set_to_none` keyword argument

Issue - State: open - Opened by MaciejBalaNV 2 days ago

#1452 - Support vectorized local reduction for p2p-based ReduceScatter overlap

Pull Request - State: open - Opened by erhoo82 3 days ago

#1451 - Incorrect parameter dtype initialisation for flax transformer-engine modules

Issue - State: open - Opened by liamclarkza 3 days ago

#1450 - [Core] Debug unaligned MXFP8 dequantize tests

Pull Request - State: closed - Opened by timmoon10 5 days ago - 1 comment
Labels: bug, testing, 2.0.0

#1449 - [common] Generalized MXFP8 gated kernels w.r.t. input tensor dimensions

Pull Request - State: closed - Opened by Oleg-Goncharov 6 days ago
Labels: bug, testing, 2.0.0

#1448 - Generalization of the FP8 dgated activations kernel

Pull Request - State: closed - Opened by ptrendx 6 days ago

#1447 - Add NVTX ranges to categorize execution

Pull Request - State: open - Opened by minitu 6 days ago - 1 comment

#1446 - Add the virtual destructor to the Quantizer class

Pull Request - State: closed - Opened by ptrendx 6 days ago

#1445 - [PyTorch] Rename and clean up MXFP8 recipe class

Pull Request - State: open - Opened by timmoon10 7 days ago - 1 comment

#1444 - [PyTorch] Debug NeMo distributed optimizer

Pull Request - State: closed - Opened by timmoon10 7 days ago - 1 comment
Labels: bug, 2.0.0

#1443 - Support `store_param_remainders` feature from Apex in TE Fused Adam

Pull Request - State: open - Opened by timmoon10 7 days ago - 1 comment
Labels: enhancement

#1442 - Rename block scaling recipe

Pull Request - State: closed - Opened by ksivaman 7 days ago - 2 comments

#1441 - [Pytorch] Nvidia-DLFramework-Inspect support

Pull Request - State: open - Opened by pggPL 7 days ago

#1440 - [PyTorch] Respect existing quantizer usages in functional linear API

Pull Request - State: closed - Opened by timmoon10 8 days ago
Labels: bug, 2.0.0

#1439 - Update neox to completed

Pull Request - State: closed - Opened by Quentin-Anthony 8 days ago

#1438 - Update FE from 1.10-rc to 1.10

Pull Request - State: closed - Opened by cyanguwa 8 days ago - 1 comment

#1437 - [common] Generalized MXFP8 fused kernels w.r.t. input tensor dimensions

Pull Request - State: closed - Opened by Oleg-Goncharov 8 days ago - 2 comments
Labels: enhancement, 2.0.0

#1436 - If no Windows support is planned for 5090 as it wasn't for 4090, pass along to corporate to mention this in advertisement

Issue - State: open - Opened by NeedsMoar 9 days ago - 4 comments

#1435 - [PyTorch] Reduce tensor dimensions in MXFP8 tests

Pull Request - State: closed - Opened by timmoon10 9 days ago - 2 comments
Labels: bug, testing, 2.0.0

#1433 - Add test for Lightning Thunder integration

Pull Request - State: open - Opened by timmoon10 9 days ago
Labels: testing

#1432 - Add path to disable cudnn norm for mxfp8

Pull Request - State: closed - Opened by ksivaman 9 days ago

#1431 - Pad MXFP8 scale inverses at the time of creation

Pull Request - State: closed - Opened by ksivaman 9 days ago
Labels: 2.0.0

#1430 - Introduce NVSHMEM based communication API for pytorch

Pull Request - State: open - Opened by gdengk 9 days ago

#1429 - HF Accelerate FP8 use more gpu memory then FP16 in training LLM

Issue - State: open - Opened by Liufeiran123 9 days ago - 1 comment

#1428 - Introduce NVSHMEM based communication API for pytorch

Pull Request - State: closed - Opened by gdengk 10 days ago

#1427 - [PyTorch/C++] Comm+GEMM overlap compatibility with QuantizedTensor

Pull Request - State: closed - Opened by denera 10 days ago - 1 comment
Labels: 2.0.0

#1426 - [PyTorch] Fix linter warnings

Pull Request - State: closed - Opened by timmoon10 10 days ago
Labels: bug, 2.0.0

#1425 - Adding remove_caches API to Float8Tensor class

Pull Request - State: open - Opened by youngeunkwon0405 10 days ago - 2 comments

#1424 - Adding remove_caches API for Float8Tensor

Pull Request - State: closed - Opened by youngeunkwon0405 10 days ago

#1423 - transformer_engine.pytorch.distributed.checkpoint function only works with TE modules, instead of all Callables

Issue - State: open - Opened by MaciejBalaNV 10 days ago

#1422 - FP8 execution requires 2D input matrices with height divisible by 8 and width divisible by 16

Issue - State: open - Opened by Liufeiran123 12 days ago - 1 comment

#1421 - Deadline or schedule new update supporting blackwell and fp4?

Issue - State: open - Opened by johnnynunez 13 days ago

#1420 - Problem when install transformers_engine with nvcc11.8 and nvcc12.0

Issue - State: open - Opened by chwenjun225 14 days ago - 3 comments

#1419 - Questions about accuracy alignment between BF16 and FP8

Issue - State: open - Opened by zigzagcai 16 days ago - 2 comments
Labels: question

#1418 - Initial Support Blackwell Build

Pull Request - State: open - Opened by johnnynunez 16 days ago

#1417 - Initial support blackwell build

Pull Request - State: closed - Opened by johnnynunez 16 days ago

#1416 - Performance Discrepancy in FP8 vs. BF16 Training with NanoGPT

Issue - State: open - Opened by wzzll123 17 days ago - 2 comments

#1415 - [PyTorch] cuBLAS workspace size fix for TP overlap unit test

Pull Request - State: open - Opened by denera 20 days ago - 1 comment
Labels: bug

#1414 - bug in flash attn backward with context parallel

Issue - State: closed - Opened by sallyjunjun 21 days ago - 1 comment

#1413 - Fix Linear Weight Initialization in the PaddlePaddle Implementation

Pull Request - State: open - Opened by GuoxiaWang 21 days ago

#1412 - [PyTorch] `te.Linear` FP8 DGRAD+RS output bugfix

Pull Request - State: closed - Opened by denera 22 days ago - 2 comments
Labels: bug

#1411 - Plans for block-wise FP8 quantization during training?

Issue - State: open - Opened by beccohov 22 days ago - 3 comments

#1410 - Make it an option to compile activation functions with fast math

Pull Request - State: closed - Opened by guyueh1 23 days ago - 1 comment

#1409 - Questions on DotProductAttention API Usage in Flash Attention thd Mode

Issue - State: open - Opened by pipSu 23 days ago - 1 comment

#1408 - Support `store_param_remainders` feature from Apex in TE Fused Adam

Pull Request - State: closed - Opened by sanandaraj5597 24 days ago - 6 comments

#1407 - Fused attention error while running Nvidia Cosmos

Issue - State: open - Opened by deepbeepmeep 24 days ago - 3 comments

#1406 - [JAX] Support segment_ids/pos as FA inputs

Pull Request - State: closed - Opened by zlsh80826 24 days ago - 5 comments

#1405 - [JAX] Consolidate the distributed fused attention test code

Pull Request - State: closed - Opened by mgoldfarb-nvidia 25 days ago - 6 comments

#1404 - Not compile in wsl2 pytorch wheels

Issue - State: closed - Opened by johnnynunez 25 days ago - 1 comment

#1403 - [PyTorch] Avoid `parameters` function in op backward pass

Pull Request - State: closed - Opened by timmoon10 27 days ago - 2 comments
Labels: bug

#1402 - Fix "refractor" typo in the PR template

Pull Request - State: closed - Opened by kit1980 27 days ago

#1401 - Use log1p(x) instead of log(1+x)

Pull Request - State: closed - Opened by kit1980 27 days ago - 5 comments

#1400 - Import fails when working from a TE directory

Issue - State: open - Opened by ksivaman 27 days ago
Labels: good first issue

#1399 - Installation stuck at 97%

Issue - State: open - Opened by lorenzbaraldi 27 days ago - 1 comment

#1398 - why close ag overlap when is_grad_enabled is False

Issue - State: open - Opened by sallyjunjun 28 days ago - 1 comment

#1397 - [PyTorch] Fix AttentionParams comparison logic

Pull Request - State: closed - Opened by cyanguwa 28 days ago - 1 comment

#1396 - Take token count quantization of fused attention into consideration for CP results correction

Pull Request - State: closed - Opened by xrennvidia 29 days ago - 1 comment

#1395 - [PyTorch] Fix fusible ops checkpoint

Pull Request - State: closed - Opened by ksivaman 29 days ago
Labels: bug

#1394 - [JAX] Test_multiprocessing_encoder with process spawn in bash

Pull Request - State: closed - Opened by phu0ngng 29 days ago - 1 comment

#1393 - [JAX] Correct fused attention output after each step of ring attention

Pull Request - State: closed - Opened by mgoldfarb-nvidia 30 days ago - 3 comments

#1392 - support new flash_attn_interface

Issue - State: open - Opened by rgtjf about 1 month ago - 2 comments

#1391 - FP8 GEMM Kernels

Issue - State: open - Opened by xiaoxiao26 about 1 month ago

#1390 - [JAX] Add THD + SWA unit tests

Pull Request - State: closed - Opened by zlsh80826 about 1 month ago - 1 comment

#1389 - Better cuBLAS handle management

Pull Request - State: open - Opened by ptrendx about 1 month ago - 6 comments

Pull Request - State: closed - Opened by ksivaman about 1 month ago

#1387 - clean CP implementation for flash attention and cuDNN 9.6

Pull Request - State: closed - Opened by xrennvidia about 1 month ago - 3 comments

#1386 - How about the grouplinear?

Issue - State: open - Opened by south-ocean about 1 month ago - 2 comments

#1385 - Update README.rst

Pull Request - State: open - Opened by sbhavani about 2 months ago

#1384 - _NoopCatFunc in transformer layer

Issue - State: open - Opened by robot-transformer about 2 months ago
Labels: bug

#1383 - thd qkv-format in transformer layer

Issue - State: open - Opened by robot-transformer about 2 months ago

#1382 - bug fix for using `return_layernorm_output=True`

Pull Request - State: closed - Opened by LiyuanLucasLiu about 2 months ago - 1 comment

#1381 - [PyTorch] Add caching for attention backend selection results

Pull Request - State: open - Opened by cyanguwa about 2 months ago

#1380 - Don't touch nor send messages to the root logger.

Pull Request - State: open - Opened by sagostinho-nvidia about 2 months ago

#1379 - AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Issue - State: open - Opened by carrot0117 about 2 months ago - 2 comments

#1378 - [common/PyTorch] Add cuDNN SWA (left, 0) + padding + bottom right causal

Pull Request - State: closed - Opened by cyanguwa about 2 months ago - 5 comments
Labels: 1.14.0

#1377 - ViT Support

Issue - State: open - Opened by cnut1648 about 2 months ago - 1 comment

#1376 - TypeError: initialize_ub() got an unexpected keyword argument 'tp_size'

Issue - State: closed - Opened by wccccp about 2 months ago - 3 comments

#1375 - [JAX] Bug Fix: Softmax FFIs with correct Encapsulates

Pull Request - State: closed - Opened by phu0ngng about 2 months ago - 1 comment

#1374 - [PyTorch] Add weights_only=False for torch.load

Pull Request - State: closed - Opened by cyanguwa about 2 months ago - 1 comment
Labels: 1.14.0

#1373 - [MoE][PyTorch] Add mask-based MoE permutation

Pull Request - State: closed - Opened by hxbai about 2 months ago - 2 comments

#1372 - Should cublasLtHandle_t be Destroyed?

Issue - State: open - Opened by shenzhenghai about 2 months ago - 2 comments

#1371 - Add user to CI

Pull Request - State: closed - Opened by ksivaman about 2 months ago

#1370 - [common] Add max_t support for KV in THD

Pull Request - State: closed - Opened by cyanguwa about 2 months ago - 1 comment
Labels: 1.14.0

#1369 - [common/PyTorch] Add FusedAttention support for SWA (left, right)

Pull Request - State: open - Opened by cyanguwa about 2 months ago - 1 comment

#1368 - How to use thd format qkv with cp + packed_seq_params

Issue - State: open - Opened by Wraythh about 2 months ago - 4 comments

#1366 - [JAX] Bug fix for distributed normalization

Pull Request - State: closed - Opened by phu0ngng about 2 months ago - 1 comment
Labels: 1.14.0

#1365 - TypeError: UbufP2PCommOverlap(): incompatible function arguments.

Issue - State: closed - Opened by sallyjunjun about 2 months ago - 5 comments

#1364 - [JAX] Use default factory for not sharing mutable default values

Pull Request - State: closed - Opened by zlsh80826 about 2 months ago - 2 comments

#1363 - The comm/gemm overlap example failed with "ran out of input".

Issue - State: closed - Opened by wujingyue about 2 months ago - 2 comments

#1362 - Fix an invalid reference in the doc

Pull Request - State: open - Opened by wujingyue about 2 months ago

#1362 - Fix an invalid reference in the doc

Pull Request - State: closed - Opened by wujingyue about 2 months ago - 1 comment
