NVIDIA/TransformerEngine issues and pull requests

#74 - Sequence-parallel amax reduction fix

Pull Request - State: closed - Opened by ksivaman over 1 year ago - 6 comments

#73 - New fp8_transpose_dbias kernel

Pull Request - State: closed - Opened by vasunvidia over 1 year ago - 2 comments

#72 - Gradient enablement bug fix

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#71 - Support simulating FP8 on older hardware

Issue - State: open - Opened by zplizzi almost 2 years ago - 1 comment
Labels: enhancement

#70 - Fix gradients when using AMP

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#69 - Installation errors on Ampere GPUs

Issue - State: open - Opened by realAsma almost 2 years ago - 3 comments
Labels: documentation

#68 - New transpose_dbias kernel

Pull Request - State: closed - Opened by vasunvidia almost 2 years ago - 1 comment

#67 - Zero-centered gamma support in LayerNorm (LayerNorm1p)

Pull Request - State: closed - Opened by ptrendx almost 2 years ago - 6 comments

#66 - QKV parameters unfused path fixes and optimization

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 8 comments

#65 - Bug fixes from PR 22

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 6 comments

#64 - remove d2d copies

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 3 comments

#63 - Address steady memory increase and bloated checkpoints

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#62 - flash-attn integration

Pull Request - State: closed - Opened by cyanguwa almost 2 years ago - 8 comments

#61 - Add docs for FP8 calibration

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#60 - Fix the integer overflow in fused softmax

Pull Request - State: closed - Opened by ptrendx almost 2 years ago - 2 comments

#59 - Numerics fix from #40

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 3 comments

#58 - Bug fixes from #40

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#57 - Add margin for LayerNorm kernel SM usage

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#56 - Remove intermediate dispatch functions

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#55 - Fix NVTX name for LN backward

Pull Request - State: closed - Opened by ksivaman almost 2 years ago

#54 - Add TE/JAX high-level modules, unittests and examples

Pull Request - State: closed - Opened by jeng1220 almost 2 years ago - 11 comments

#53 - add building workflow for TE/Jax

Pull Request - State: closed - Opened by jeng1220 almost 2 years ago - 18 comments

#52 - Indexing fix for bug in virtual interleaved pipelining configs

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 5 comments

#51 - Move calculation of scale inverse to framework

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 5 comments

#50 - Add NVTX to TE modules

Pull Request - State: closed - Opened by ptrendx almost 2 years ago - 4 comments

#49 - Enforce boolean attention mask type

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#48 - Update copyright year

Pull Request - State: closed - Opened by ptrendx almost 2 years ago - 2 comments

#47 - Add GeGLU and the corresponding gradient kernels

Pull Request - State: closed - Opened by zlsh80826 almost 2 years ago - 4 comments

#46 - Reduce unit tests time

Pull Request - State: closed - Opened by zlsh80826 almost 2 years ago - 3 comments

#45 - Add RMSNorm

Pull Request - State: closed - Opened by zlsh80826 almost 2 years ago - 4 comments

#44 - Docs: remove build warnings and add FP8 caching note

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#43 - Fix in MHA cross attention path

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 6 comments

#42 - Fix LayerNorm API param names

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#41 - Add ONNX export support for TE modules

Pull Request - State: closed - Opened by asfiyab-nvidia almost 2 years ago - 13 comments

#40 - Schetlur/fp8 calibration

Pull Request - State: closed - Opened by schetlur-nv almost 2 years ago - 6 comments

#39 - Standardize formatting

Pull Request - State: closed - Opened by ksivaman almost 2 years ago

#38 - Ensure contiguous inputs

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#37 - Softmax docstrings and type fixes

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#36 - Link performance optimization tutorial to docs

Pull Request - State: closed - Opened by ptrendx almost 2 years ago

#35 - cleanup pylintrc

Pull Request - State: closed - Opened by ksivaman almost 2 years ago

#34 - Fix illegal memory access in general layer norm backward kernel

Pull Request - State: closed - Opened by timmoon10 almost 2 years ago

#33 - Move the amax/scale/scale_inv into the TE Tensor struct.

Pull Request - State: closed - Opened by ptrendx almost 2 years ago - 6 comments

#32 - Don't update FP8 weights during validation/inference

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 2 comments

#31 - Full activation recompute checkpointing bug fix

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#30 - Framework agnostic softmax kernels

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 3 comments

#29 - Fixes #26

Pull Request - State: closed - Opened by ksivaman almost 2 years ago - 1 comment

#28 - Fix the out-of-bounds access in the C+T+dbias kernel

Pull Request - State: closed - Opened by ptrendx about 2 years ago - 3 comments

#27 - Update README.md

Pull Request - State: closed - Opened by nzmora-nvidia about 2 years ago

#26 - The fc2.bias of LayerNormMLP is not used

Issue - State: closed - Opened by wkcn about 2 years ago

#25 - Incorrect parameter in landing page example.

Issue - State: closed - Opened by jomayeri about 2 years ago - 1 comment

#24 - Fix bugs for full activation recompute in FP8

Pull Request - State: closed - Opened by ksivaman about 2 years ago - 7 comments

#23 - [DO NOT MERGE UPSTREAM]

Pull Request - State: closed - Opened by mjsML about 2 years ago - 3 comments

#22 - Increase number of FP8 tensors per GEMM

Pull Request - State: closed - Opened by vasunvidia about 2 years ago - 7 comments

#21 - Conditional wgrad support

Pull Request - State: closed - Opened by schetlur-nv about 2 years ago - 3 comments

#20 - Documentation for advanced performance optimizations

Pull Request - State: closed - Opened by timmoon10 about 2 years ago - 5 comments

#19 - Add pylint to Lint action

Pull Request - State: closed - Opened by ptrendx about 2 years ago - 2 comments

#18 - Multi-tensor cast-transpose

Pull Request - State: closed - Opened by timmoon10 about 2 years ago - 5 comments

#17 - Please consider supporting Windows

Issue - State: closed - Opened by C43H66N12O12S2 about 2 years ago - 4 comments

#16 - Test

Pull Request - State: closed - Opened by cyanguwa about 2 years ago

#15 - It doesn't support the latest RTX 40-series card

Issue - State: closed - Opened by hxssgaa about 2 years ago - 30 comments

#14 - Add link to the documentation archives in the docs

Pull Request - State: closed - Opened by ptrendx about 2 years ago

#13 - Test build as GitHub action

Pull Request - State: closed - Opened by ptrendx about 2 years ago

#12 - Test Blossom CI

Pull Request - State: closed - Opened by ptrendx about 2 years ago - 26 comments

#11 - Make amax reduction optional

Pull Request - State: closed - Opened by ksivaman about 2 years ago - 2 comments

#10 - Add C++ lint as GitHub action

Pull Request - State: closed - Opened by ptrendx about 2 years ago - 1 comment

#9 - Add Blossom CI yml

Pull Request - State: closed - Opened by ptrendx about 2 years ago - 1 comment

#8 - Remove fp8_out from the LN API

Pull Request - State: closed - Opened by ptrendx about 2 years ago - 2 comments

#7 - Remove pytest-runner from setup requirements

Pull Request - State: closed - Opened by ksivaman about 2 years ago

#6 - Fix docs for default FP8 format in recipe

Pull Request - State: closed - Opened by ksivaman about 2 years ago

#5 - Efficient Multi-Head Attention (EMHA) support

Pull Request - State: closed - Opened by ksivaman about 2 years ago - 2 comments

#4 - Bug fix for distributed TE case

Pull Request - State: closed - Opened by ksivaman about 2 years ago

#3 - Add checks for tensor parallel use case to ensure all-reduce is called only when necessary

Pull Request - State: closed - Opened by ksivaman about 2 years ago

#2 - fp8_autocast bug fix when switching from non-fp8 execution

Pull Request - State: closed - Opened by ksivaman about 2 years ago

#1 - Added the link to the User Guide

Pull Request - State: closed - Opened by ptrendx about 2 years ago
