NVIDIA/TransformerEngine issues and pull requests

#1412 - [PyTorch] `te.Linear` FP8 DGRAD+RS output bugfix

Pull Request - State: open - Opened by denera 2 days ago - 1 comment
Labels: bug

#1411 - Plans for block-wise FP8 quantization during training?

Issue - State: open - Opened by beccohov 3 days ago - 1 comment

#1410 - Make it an option to compile activation functions with fast math

Pull Request - State: closed - Opened by guyueh1 3 days ago - 1 comment

#1409 - Questions on DotProductAttention API Usage in Flash Attention thd Mode

Issue - State: open - Opened by pipSu 4 days ago

#1408 - Support `store_param_remainders` feature from Apex in TE Fused Adam

Pull Request - State: open - Opened by sanandaraj5597 4 days ago

#1407 - Fused attention error while running Nvidia Cosmos

Issue - State: open - Opened by deepbeepmeep 4 days ago

#1406 - [JAX] Support segment_ids/pos as FA inputs

Pull Request - State: open - Opened by zlsh80826 5 days ago - 2 comments

#1405 - [JAX] Consolidate the distributed fused attention test code

Pull Request - State: open - Opened by mgoldfarb-nvidia 5 days ago - 5 comments

#1404 - Not compile in wsl2 pytorch wheels

Issue - State: closed - Opened by johnnynunez 6 days ago - 1 comment

#1403 - [PyTorch] Avoid `parameters` function in op backward pass

Pull Request - State: open - Opened by timmoon10 7 days ago - 1 comment
Labels: bug

#1402 - Fix "refractor" typo in the PR template

Pull Request - State: closed - Opened by kit1980 7 days ago

#1401 - Use log1p(x) instead of log(1+x)

Pull Request - State: open - Opened by kit1980 7 days ago - 4 comments

#1400 - Import fails when working from a TE directory

Issue - State: open - Opened by ksivaman 7 days ago
Labels: good first issue

#1399 - Installation stuck at 97%

Issue - State: open - Opened by lorenzbaraldi 8 days ago - 1 comment

#1398 - why close ag overlap when is_grad_enabled is False

Issue - State: open - Opened by sallyjunjun 8 days ago - 1 comment

#1397 - [PyTorch] Fix AttentionParams comparison logic

Pull Request - State: open - Opened by cyanguwa 9 days ago - 1 comment

#1396 - Take token count quantization of fused attention into consideration for CP results correction

Pull Request - State: closed - Opened by xrennvidia 9 days ago - 1 comment

#1395 - [PyTorch] Fix fusible ops checkpoint

Pull Request - State: closed - Opened by ksivaman 9 days ago
Labels: bug

#1394 - [JAX] Test_multiprocessing_encoder with process spawn in bash

Pull Request - State: closed - Opened by phu0ngng 10 days ago - 1 comment

#1393 - [JAX] Correct fused attention output after each step of ring attention

Pull Request - State: closed - Opened by mgoldfarb-nvidia 10 days ago - 3 comments

#1392 - support new flash_attn_interface

Issue - State: open - Opened by rgtjf 11 days ago - 2 comments

#1391 - FP8 GEMM Kernels

Issue - State: open - Opened by xiaoxiao26 11 days ago

#1390 - [JAX] Add THD + SWA unit tests

Pull Request - State: closed - Opened by zlsh80826 12 days ago - 1 comment

#1389 - Better cuBLAS handle management

Pull Request - State: open - Opened by ptrendx 15 days ago - 6 comments

#1388 - Update copyright to include 2025

Pull Request - State: closed - Opened by ksivaman 16 days ago

#1387 - clean CP implementation for flash attention and cuDNN 9.6

Pull Request - State: closed - Opened by xrennvidia 19 days ago - 3 comments

#1386 - How about the grouplinear?

Issue - State: open - Opened by south-ocean 23 days ago - 2 comments

#1385 - Update README.rst

Pull Request - State: open - Opened by sbhavani 26 days ago

#1384 - _NoopCatFunc in transformer layer

Issue - State: open - Opened by robot-transformer 26 days ago
Labels: bug

#1383 - thd qkv-format in transformer layer

Issue - State: open - Opened by robot-transformer 26 days ago

#1382 - bug fix for using `return_layernorm_output=True`

Pull Request - State: closed - Opened by LiyuanLucasLiu 29 days ago - 1 comment

#1381 - [PyTorch] Add caching for attention backend selection results

Pull Request - State: open - Opened by cyanguwa 30 days ago

#1380 - Don't touch nor send messages to the root logger.

Pull Request - State: open - Opened by sagostinho-nvidia 30 days ago

#1379 - AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Issue - State: open - Opened by carrot0117 about 1 month ago - 2 comments

#1378 - [common/PyTorch] Add cuDNN SWA (left, 0) + padding + bottom right causal

Pull Request - State: closed - Opened by cyanguwa about 1 month ago - 5 comments
Labels: 1.14.0

#1377 - ViT Support

Issue - State: open - Opened by cnut1648 about 1 month ago - 1 comment

#1376 - TypeError: initialize_ub() got an unexpected keyword argument 'tp_size'

Issue - State: closed - Opened by wccccp about 1 month ago - 3 comments

#1375 - [JAX] Bug Fix: Softmax FFIs with correct Encapsulates

Pull Request - State: closed - Opened by phu0ngng about 1 month ago - 1 comment

#1374 - [PyTorch] Add weights_only=False for torch.load

Pull Request - State: closed - Opened by cyanguwa about 1 month ago - 1 comment
Labels: 1.14.0

#1373 - [MoE][PyTorch] Add mask-based MoE permutation

Pull Request - State: open - Opened by hxbai about 1 month ago

#1372 - Should cublasLtHandle_t be Destroyed?

Issue - State: open - Opened by shenzhenghai about 1 month ago - 2 comments

#1371 - Add user to CI

Pull Request - State: closed - Opened by ksivaman about 1 month ago

#1370 - [common] Add max_t support for KV in THD

Pull Request - State: closed - Opened by cyanguwa about 1 month ago - 1 comment
Labels: 1.14.0

#1369 - [common/PyTorch] Add FusedAttention support for SWA (left, right)

Pull Request - State: open - Opened by cyanguwa about 1 month ago - 1 comment

#1368 - How to use thd format qkv with cp + packed_seq_params

Issue - State: open - Opened by Wraythh about 1 month ago - 4 comments

#1366 - [JAX] Bug fix for distributed normalization

Pull Request - State: closed - Opened by phu0ngng about 1 month ago - 1 comment
Labels: 1.14.0

#1365 - TypeError: UbufP2PCommOverlap(): incompatible function arguments.

Issue - State: closed - Opened by sallyjunjun about 1 month ago - 5 comments

#1364 - [JAX] Use default factory for not sharing mutable default values

Pull Request - State: closed - Opened by zlsh80826 about 1 month ago - 2 comments

#1363 - The comm/gemm overlap example failed with "ran out of input".

Issue - State: closed - Opened by wujingyue about 1 month ago - 2 comments

#1362 - Fix an invalid reference in the doc

Pull Request - State: open - Opened by wujingyue about 1 month ago

#1362 - Fix an invalid reference in the doc

Pull Request - State: closed - Opened by wujingyue about 1 month ago - 1 comment

#1361 - [JAX] Bug Fix: WeightInit with field

Pull Request - State: closed - Opened by phu0ngng about 1 month ago - 1 comment

#1360 - [Bug] Failed to pass pytorch's numerical test on A800 SXM

Issue - State: closed - Opened by junjzhang about 1 month ago - 1 comment

#1359 - support float8 in flash-attn v3

Issue - State: open - Opened by Monekyzoon about 1 month ago

#1358 - Enabling FP8 all-gather for TE Float8Tensor when using Torch FSDP2

Pull Request - State: closed - Opened by youngeunkwon0405 about 1 month ago - 1 comment

#1357 - Disable FP8 in Mcore integration test on older GPUs

Pull Request - State: closed - Opened by timmoon10 about 1 month ago - 1 comment
Labels: bug, testing, 1.14.0

#1356 - [JAX] Move parallel encoder tests to L0 distributed test set.

Pull Request - State: closed - Opened by phu0ngng about 1 month ago - 1 comment

#1355 - Add paged attention support

Pull Request - State: open - Opened by cyanguwa about 2 months ago - 2 comments

#1354 - Fix attention mask type for Flash Attention + CP + THD

Pull Request - State: closed - Opened by xrennvidia about 2 months ago - 1 comment

#1353 - overlapping issue about backward of LayerNormLinear

Issue - State: closed - Opened by cos120 about 2 months ago - 5 comments

#1352 - [JAX] Fused attention unit tests fixes and refinements

Pull Request - State: closed - Opened by zlsh80826 about 2 months ago - 6 comments

#1351 - Can this project support jetson orin nx？

Issue - State: closed - Opened by zzk2021 about 2 months ago

#1350 - te.TransformerLayer fails on H100 with cudnn errors.

Issue - State: closed - Opened by wujingyue about 2 months ago - 2 comments

#1349 - Support more than 1 shape/attention_params for DotProductAttention decision cache

Issue - State: open - Opened by parthmannan about 2 months ago

#1348 - [Bug] attention_backend update throttle

Issue - State: closed - Opened by Jianbing-D about 2 months ago - 1 comment

#1347 - [JAX] Scale sequence length in CP tests to avoid tiny sizes.

Pull Request - State: closed - Opened by mgoldfarb-nvidia about 2 months ago - 2 comments

#1346 - [Draft] Introduce NVSHMEM based communication API for pytorch

Pull Request - State: open - Opened by gdengk about 2 months ago

#1345 - Fix cuda graph capture for grouped gemm

Pull Request - State: closed - Opened by xrennvidia about 2 months ago - 3 comments

#1344 - How to setup TP Overlap configs

Issue - State: open - Opened by TJ-Solergibert about 2 months ago - 1 comment

#1343 - [PyTorch] Adding TP overlap support for `te.Linear` with `parallel_mode="column"`

Pull Request - State: closed - Opened by denera about 2 months ago - 3 comments
Labels: enhancement, 1.14.0

#1342 - [Core] Add function to convert container to string

Pull Request - State: closed - Opened by timmoon10 about 2 months ago - 1 comment

#1341 - [PyTorch] Bugfix for wgrad bulk overlap conflict when dgrad overlap is reduce-scatter

Pull Request - State: open - Opened by denera 2 months ago - 2 comments
Labels: bug

#1340 - Update list of CI users

Pull Request - State: closed - Opened by timmoon10 2 months ago - 1 comment
Labels: testing

#1339 - [Common] Moved framework agnostic THD kernels to common.

Pull Request - State: closed - Opened by mgoldfarb-nvidia 2 months ago - 8 comments

#1338 - Debug nightly docs

Pull Request - State: closed - Opened by timmoon10 2 months ago - 1 comment
Labels: documentation, testing

#1337 - [C/JAX] Comm+GEMM Overlap API for TE/JAX

Pull Request - State: open - Opened by denera 2 months ago
Labels: enhancement, jax

#1336 - the max error of moe_permute/unpermute.grad could reach 3.6e+00

Issue - State: open - Opened by NiuMa-1234 2 months ago - 1 comment

#1335 - [PyTorch] Store module extra state in tensor

Pull Request - State: open - Opened by timmoon10 2 months ago - 1 comment
Labels: bug

#1335 - [PyTorch] Store module extra state in tensor

Pull Request - State: closed - Opened by timmoon10 2 months ago - 1 comment
Labels: bug, 1.14.0

#1334 - [PyTorch] Fix multiple calls to saved_tensors in CP attention

Pull Request - State: closed - Opened by ksivaman 2 months ago - 1 comment
Labels: bug

#1333 - Use `CMAKE_CURRENT_SOURCE_DIR` instead of `CMAKE_SOURCE_DIR`

Pull Request - State: closed - Opened by kmaehashi 2 months ago

#1332 - [TP comm overlap unit test]`CUDA Error: misaligned address` error when testing with recent cublas (or pytorch container)

Issue - State: open - Opened by erhoo82 2 months ago - 3 comments

#1332 - [TP comm overlap unit test]`CUDA Error: misaligned address` error when testing with recent cublas (or pytorch container)

Issue - State: open - Opened by erhoo82 2 months ago - 4 comments

#1331 - [JAX] WIP Added L0 Distributed Tests

Pull Request - State: open - Opened by phu0ngng 2 months ago

#1331 - [JAX] WIP Added L0 Distributed Tests

Pull Request - State: closed - Opened by phu0ngng 2 months ago

#1330 - [Dummy] Testing branch for #1326

Pull Request - State: closed - Opened by timmoon10 2 months ago
Labels: invalid

#1329 - [PyTorch] Integration test for Megatron-LM

Pull Request - State: closed - Opened by timmoon10 2 months ago - 2 comments
Labels: bug, 1.13.0

#1328 - [PyTorch] Fix GQA error message

Pull Request - State: closed - Opened by cyanguwa 2 months ago - 1 comment
Labels: 1.13.0
