NVIDIA/TransformerEngine issues and pull requests

#1151 - question about Model FLOPs Utilization

Issue - State: closed - Opened by jinz2014 2 months ago - 4 comments
Labels: question

#1150 - layer normalization after Linear

Issue - State: closed - Opened by ftgreat 3 months ago - 2 comments
Labels: question

#1149 - [PyTorch/C] Exposed Userbuffers configuration option to control comm and compute stream priorities

Pull Request - State: open - Opened by denera 3 months ago
Labels: enhancement

#1148 - Improvements for building wheels

Pull Request - State: closed - Opened by ksivaman 3 months ago - 4 comments
Labels: build, 1.10.0

#1147 - Unable to import transformer_engine.pytorch using TE v1.9.0

Issue - State: closed - Opened by snarayan21 3 months ago - 1 comment

#1146 - [PyTorch] Add contiguous check for `te_grouped_gemm`

Pull Request - State: closed - Opened by BeingGod 3 months ago - 2 comments

#1145 - [PyTorch] Remove `dtype` from args of permutation

Pull Request - State: closed - Opened by yaox12 3 months ago - 2 comments

#1144 - Does the FA3 commit of TE support bf16 or mixed precision?

Issue - State: open - Opened by Desperadoze 3 months ago

#1143 - [PyTorch] Avoid saving fp8_tensors in certain scenarios

Pull Request - State: open - Opened by cyanguwa 3 months ago

#1142 - [PyTorch] Userbuffers support in operation-based API

Pull Request - State: open - Opened by timmoon10 3 months ago - 4 comments

#1141 - [PyTorch] Fix FP8 logic related to FA2/FA3

Pull Request - State: closed - Opened by cyanguwa 3 months ago - 6 comments
Labels: 1.11

#1140 - Norms Refactor

Pull Request - State: open - Opened by phu0ngng 3 months ago - 1 comment

#1139 - Don't save fp8 q/k/v/out tensors when using bf16 bprop

Pull Request - State: open - Opened by guyueh1 3 months ago - 1 comment

#1138 - Fix param input order for cudagraph

Pull Request - State: open - Opened by yifeis-nv 3 months ago - 2 comments
Labels: bug

#1137 - [PyTorch] Remove some direct calls to PyTorch extensions in `Float8Tensor`

Pull Request - State: closed - Opened by timmoon10 3 months ago - 2 comments

#1136 - Hide non-necessary symbols from shared object

Pull Request - State: closed - Opened by ksivaman 3 months ago - 2 comments
Labels: bug, build, 1.10.0

#1135 - fp8_model_init doesn't work with DDP

Issue - State: open - Opened by MaciejBalaNV 3 months ago - 3 comments

#1134 - Fix QKV dtype in the bwd of FP8+CP

Pull Request - State: closed - Opened by xrennvidia 3 months ago - 5 comments
Labels: 1.10.0

#1133 - Bump cudnn-frontend version to 1.6.1

Pull Request - State: closed - Opened by ksivaman 3 months ago

#1132 - RMSNorm precision different from HF implementation

Issue - State: open - Opened by void-main 3 months ago - 5 comments

#1131 - Added offloading support FP8 attention

Pull Request - State: closed - Opened by sanandaraj5597 3 months ago - 2 comments

#1130 - don't put master_param to state if None

Pull Request - State: closed - Opened by akoumpa 3 months ago - 3 comments

#1129 - [PyTorch] Implement Fp8 padding and unpadding module

Pull Request - State: closed - Opened by BeingGod 3 months ago - 5 comments

#1128 - [PyTorch] Propagate fp8 scale-inverse modification to `GroupedLinear`

Pull Request - State: closed - Opened by yaox12 3 months ago - 8 comments

#1127 - [PyTorch] Proxy class for low-precision tensor

Pull Request - State: closed - Opened by timmoon10 3 months ago - 5 comments

#1126 - Let user limit number of architectures, to improve build time

Pull Request - State: closed - Opened by hXl3s 3 months ago - 1 comment

#1125 - Transformer Engine using FlashAttention V3

Issue - State: open - Opened by heavyrain-lzy 3 months ago - 1 comment

#1124 - Re-add framework specific required dependencies for source build

Pull Request - State: closed - Opened by ksivaman 3 months ago
Labels: bug, build, 1.10.0

#1123 - how to use TransformerEngine without flash attention

Issue - State: closed - Opened by ben-8878 3 months ago - 4 comments

#1121 - Add high_precision_init_val to model params when using fp8_model_init

Pull Request - State: open - Opened by kunlunl 3 months ago - 8 comments

#1120 - [PyTorch] make GroupedLinear inp support collection of torch.Tensor

Pull Request - State: closed - Opened by BeingGod 3 months ago - 7 comments

#1119 - TransformerEngine FP8 is slower & more memory intensive than FlashAttention FP16?

Issue - State: closed - Opened by darius-lam 3 months ago - 4 comments

#1117 - [PyTorch] Debug CUDA graph support with operation-based API

Pull Request - State: open - Opened by timmoon10 3 months ago - 5 comments
Labels: bug

#1116 - How to debug CUDNN_STATUS_EXECUTION_FAILED?

Issue - State: open - Opened by vedantroy 3 months ago - 7 comments

#1114 - Add FP8 support to CP implementation with KV P2P

Pull Request - State: closed - Opened by xrennvidia 3 months ago - 5 comments
Labels: 1.10.0

#1108 - Update cudnn-frontend to v1.6.1

Pull Request - State: closed - Opened by cyanguwa 3 months ago - 4 comments

#1107 - Jax example cleanup and replace pjit with jit.

Pull Request - State: closed - Opened by nouiz 3 months ago - 4 comments

#1106 - [JAX] Context Parallel Attention with All-Gather

Pull Request - State: closed - Opened by mgoldfarb-nvidia 3 months ago - 9 comments

#1100 - [PyTorch] FP8 MHA with RoPE and Miscellaneous Improvements

Pull Request - State: closed - Opened by yaox12 3 months ago - 13 comments

#1086 - [ERROR] in the last step of `pip install .`

Issue - State: closed - Opened by wplf 3 months ago - 5 comments

#1083 - Update FP8 scale-inverse in kernels with FP8 output

Pull Request - State: closed - Opened by timmoon10 3 months ago - 6 comments
Labels: performance

#1077 - stuck at building wheel

Issue - State: closed - Opened by neurosynapse 3 months ago - 4 comments

#1073 - [PyTorch] Add support for padding mask in `UnfusedDotProductAttention`

Pull Request - State: closed - Opened by cyanguwa 3 months ago - 7 comments
Labels: 1.10.0

#1071 - When will comm-gemm-overlap support multi nodes?

Issue - State: open - Opened by umiswing 3 months ago - 6 comments

#1070 - AttnFuncWithCP with seq_len==1 breaks

Issue - State: closed - Opened by MaciejBalaNV 3 months ago - 4 comments

#1067 - [C/PyTorch] Userbuffers and comm+GEMM overlap algorithms refactored and moved to TE/common

Pull Request - State: open - Opened by denera 3 months ago
Labels: enhancement

#1063 - [PyTorch] Debug checkpointing with operation-based API

Pull Request - State: open - Opened by timmoon10 4 months ago - 3 comments
Labels: bug

#1043 - Error pre-training BERT

Issue - State: open - Opened by fabiancpl 4 months ago - 1 comment

#1033 - [PyTorch] Normalization ops

Pull Request - State: open - Opened by timmoon10 4 months ago - 11 comments
Labels: enhancement

#1019 - Add support for flash-attn 3

Pull Request - State: closed - Opened by cyanguwa 4 months ago - 5 comments

#1014 - AttributeError: module 'transformer_engine' has no attribute 'pytorch'

Issue - State: open - Opened by Lzhang-hub 4 months ago - 4 comments

#1011 - Could not work, even using the official script

Issue - State: open - Opened by hellangleZ 4 months ago - 5 comments

#978 - Building wheel error during installation

Issue - State: closed - Opened by Drzhishi 4 months ago - 3 comments
Labels: bug, build

#972 - no boost in performance with Ada GPU

Issue - State: open - Opened by saurabh-kataria 5 months ago - 1 comment
Labels: performance

#965 - How to cast 16/32-bit to FP8?

Issue - State: closed - Opened by mxjmtxrm 5 months ago - 3 comments
Labels: question

#946 - [TE/JAX] Prototype for New XLA Custom Calls with FFI

Pull Request - State: closed - Opened by phu0ngng 5 months ago - 2 comments
Labels: enhancement, jax

#944 - Expose `rotary_base` as an arg instead of hardcoding

Pull Request - State: closed - Opened by sudhakarsingh27 5 months ago - 1 comment

#936 - [MoE][Common/PyTorch] Add permutation

Pull Request - State: closed - Opened by StudyingShao 5 months ago - 5 comments
Labels: enhancement

#930 - How to install with CuDNN 9.0+ ?

Issue - State: closed - Opened by tianyan01 5 months ago - 3 comments

#922 - How to use FP8 of TransformerEngine in inference

Issue - State: open - Opened by Godlovecui 5 months ago - 3 comments

#885 - [PyTorch] Add support for cuDNN FusedAttention + THD + CP

Pull Request - State: closed - Opened by xrennvidia 5 months ago - 19 comments

#856 - Cannot import and use transformer_engine after successful installation with No module named 'transformer_engine_extensions'

Issue - State: closed - Opened by sam-h-bean 6 months ago - 4 comments
Labels: bug, build

#762 - Could TransformerEngine work with Deepspeed Zero w/ offloading?

Issue - State: open - Opened by leiwen83 7 months ago - 1 comment
Labels: question

#700 - ERROR: Failed building wheel for transformer-engine

Issue - State: closed - Opened by ShabnamRA 8 months ago - 7 comments
Labels: build

#694 - main branch cannot compile due to incompatibility with the main branch of cudnn-frontend

Issue - State: closed - Opened by lucifer1004 9 months ago - 2 comments
Labels: build

#689 - Version constraint of `flash-attn` needs to be updated

Issue - State: closed - Opened by lucifer1004 9 months ago - 3 comments

#679 - [Feature Request] Grouped GEMM kernel

Issue - State: open - Opened by LiyuanLucasLiu 9 months ago - 1 comment
Labels: enhancement

#553 - installing error

Issue - State: closed - Opened by foreverpiano 11 months ago - 1 comment

#526 - Failed Installation

Issue - State: closed - Opened by sudy-super 12 months ago - 1 comment

#517 - [Common][PyTorch] Fused `apply_rotary_pos_emb`

Pull Request - State: closed - Opened by yaox12 almost 1 year ago - 10 comments

#516 - question for building wheel for transformer-engine

Issue - State: open - Opened by Mrzhang-dada about 1 year ago - 6 comments

#459 - Failed building wheel for transformer-engine

Issue - State: closed - Opened by RuslanSel about 1 year ago - 1 comment

#359 - Optimize flash-attention transposes

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 1 comment

#355 - Installation failed with cmake error

Issue - State: closed - Opened by RuiWang1998 over 1 year ago - 23 comments

#100 - Update PyTorch comm API

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 1 comment

#99 - Fix FlashAttention tests

Pull Request - State: closed - Opened by tcherckez-nvidia over 1 year ago - 12 comments

#98 - Adding JAX to README.rst

Pull Request - State: closed - Opened by mingxu1067 over 1 year ago - 2 comments

#97 - Catch FP8 modulo16 error before cublas and fp8 kernels

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 1 comment

#96 - [WIP] add cuDNN Flash Attention for FP8

Pull Request - State: closed - Opened by cyanguwa over 1 year ago

#95 - Add a temporary workaround to layernorm ONNX export

Pull Request - State: closed - Opened by nzmora-nvidia over 1 year ago - 6 comments

#94 - Add an option to serialize test i/o to file

Pull Request - State: closed - Opened by nzmora-nvidia over 1 year ago - 1 comment

#93 - Raise autocast usage error

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 4 comments

#92 - Move from Sphinx Autodoc to sphinx-autoapi

Pull Request - State: closed - Opened by ptrendx over 1 year ago - 1 comment

#91 - Fix the link to the documentation archives

Pull Request - State: closed - Opened by ptrendx over 1 year ago - 1 comment

#90 - deprecate qk layer scaling and fp32 softmax args

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 2 comments

#89 - Adding slice to fix failure with multi-devices.

Pull Request - State: closed - Opened by mingxu1067 over 1 year ago - 1 comment

#88 - Exporting MajorShardingType, ShardingType and LayerNorm for TE/JAX.

Pull Request - State: closed - Opened by mingxu1067 over 1 year ago - 1 comment

#87 - Adding documents to TE/JAX

Pull Request - State: closed - Opened by mingxu1067 over 1 year ago - 10 comments

#86 - Separate linting passes for PyTorch and JAX

Pull Request - State: closed - Opened by timmoon10 over 1 year ago - 2 comments
Labels: enhancement

#85 - Add TensorFlow module and extensions

Pull Request - State: closed - Opened by trevor-m over 1 year ago - 7 comments

#84 - Fix flash attention

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 5 comments

#83 - Fix unfused QKV params case; stack vs interleave option

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 2 comments

#82 - 3rd party acknowledgements

Pull Request - State: closed - Opened by ksivaman over 1 year ago

#81 - fix bug in non-FP8 nvfuser path

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 1 comment

#80 - Relax checks for flash-attn

Pull Request - State: closed - Opened by cyanguwa over 1 year ago - 4 comments

#79 - Remove redundant AR for SP case

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 4 comments

#78 - Move TE/PyTorch UT to tests/pytorch/

Pull Request - State: closed - Opened by jeng1220 over 1 year ago - 5 comments

#77 - Change version to 0.7.0dev

Pull Request - State: closed - Opened by ksivaman over 1 year ago

#76 - Add an option to serialize test i/o to file

Pull Request - State: closed - Opened by nzmora-nvidia over 1 year ago - 4 comments

#75 - Support arbitrary output dtypes in PyT GEMM functions

Pull Request - State: closed - Opened by timmoon10 over 1 year ago - 3 comments
Labels: enhancement
