Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / Dao-AILab/flash-attention issues and pull requests
#1471 - Support 576 Head dim for MLA
Issue - State: open - Opened by sAviOr287 3 days ago
#1470 - Getting Error While Extracting
Issue - State: open - Opened by emirardagn 3 days ago
#1469 - [How-to] How to get Flash-Attention under Windows 11 with CUDA
Issue - State: open - Opened by mytait 7 days ago - 6 comments
#1468 - fa3: include bert_padding utilities
Pull Request - State: closed - Opened by tmm1 10 days ago - 1 comment
#1467 - FA3 package is missing padding utilities
Issue - State: open - Opened by tmm1 10 days ago
#1466 - What is `seqused_q` and `seqused_k`?
Issue - State: open - Opened by cassanof 10 days ago
#1465 - FA3 KV Cache is slower than FA2 KV Cache
Issue - State: open - Opened by DD-DuDa 11 days ago - 3 comments
#1464 - Add support for Cuda 12.8 and B200 GPUs
Issue - State: open - Opened by ofirkris 12 days ago
#1463 - Update Cuda Blackwell
Pull Request - State: open - Opened by johnnynunez 13 days ago
#1462 - fused dense lib warning
Issue - State: open - Opened by YuyueminAustin 13 days ago
#1461 - BUG? get the wrong value when logit_scale is 0
Issue - State: open - Opened by shunshen93 13 days ago - 1 comment
#1460 - [Build] Update version of setuptools used to generate core package
Pull Request - State: closed - Opened by tmm1 14 days ago
#1459 - Conflict When Installing flash-attn 2.7.3 and 3.0.0b1 Together
Issue - State: open - Opened by quanta42 14 days ago
#1458 - Using Flash Attention 2.5.7 after upgrading CUTLASS to 3.5 causes a compilation error
Issue - State: open - Opened by ccccjunkang 14 days ago - 1 comment
#1456 - dropout_layer_norm
Issue - State: closed - Opened by ADiko1997 16 days ago - 1 comment
#1455 - [BugFix] Fix a wrong reference to seqlen_k variable in the fwd_splitkv kernel
Pull Request - State: open - Opened by muoshuosha 16 days ago - 1 comment
#1454 - Usage of .item() in unpad_input()
Issue - State: closed - Opened by qwertyforce 16 days ago - 2 comments
#1453 - Main branch compilation on nvcc 12.6
Issue - State: open - Opened by roded2 16 days ago - 2 comments
#1452 - v2.7.3 build failed in NGC pytorch:24.12-py3
Issue - State: open - Opened by xuchunmei000 16 days ago - 4 comments
#1451 - FA3 consecutive failing tests after first failure
Issue - State: open - Opened by benjamin-kroeger 17 days ago
#1450 - BUG? static_assert(!(!Mma1_is_RS && !IntraWGOverlap), "Mma1 must be RS if IntraWGOverlap is enabled");
Issue - State: closed - Opened by ziyuhuang123 17 days ago - 1 comment
#1449 - [QST] masking steps in flash decoding
Issue - State: open - Opened by aws-jiadingg 20 days ago - 1 comment
#1448 - Clarification on MMA0 Results Handling in the Latest Code
Issue - State: open - Opened by ziyuhuang123 21 days ago - 1 comment
#1447 - subprocess.CalledProcessError: Command '['path/to/cuda-11.7/bin/nvcc', '-V']' returned non-zero exit status 255
Issue - State: open - Opened by ChosenOne-xx 22 days ago
#1446 - Support ROCM builds from source distribution, and improve error handling
Pull Request - State: closed - Opened by mgorny 22 days ago - 1 comment
#1445 - Is the output of FlashAttention completely identical to that of vanilla attention?
Issue - State: closed - Opened by sunsmarterjie 23 days ago - 1 comment
#1444 - Wheel names and version inconsistency
Issue - State: open - Opened by sfc-gh-mhazy 23 days ago - 2 comments
#1443 - Setup failure in the latest build
Issue - State: closed - Opened by complexfilter 24 days ago - 2 comments
#1442 - Replace c10::optional with std::optional in flash_attn
Pull Request - State: closed - Opened by houseroad 24 days ago - 1 comment
#1441 - Error when importing dropout_layer_norm
Issue - State: open - Opened by anfortas337 24 days ago - 1 comment
#1440 - Running flash_attn/flash_attn_triton_amd/bench.py with sequence length > 4096 causes RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
Issue - State: open - Opened by jiqimaoke 25 days ago - 5 comments
#1439 - IncompatibleTypeErrorImpl('invalid operands of type pointer<int64> and triton.language.int32')
Issue - State: open - Opened by wuyouliaoxi 26 days ago
#1438 - FA3 forward performance regression on H200
Issue - State: open - Opened by complexfilter 27 days ago - 7 comments
#1437 - Change version to 2.7.3
Pull Request - State: closed - Opened by ksivaman 27 days ago
#1436 - Blackwell support
Pull Request - State: closed - Opened by ksivaman 27 days ago - 1 comment
#1435 - FA3 does not work with torch.compile
Issue - State: open - Opened by nighting0le01 27 days ago
#1434 - GFX1100
Issue - State: closed - Opened by johnnynunez 27 days ago
#1433 - Expose `zero_tensors` arg in varlen functions
Pull Request - State: closed - Opened by ksivaman 28 days ago - 1 comment
#1432 - FA3 regression on H100 80GB?
Issue - State: open - Opened by bastianhagedorn 28 days ago - 8 comments
#1431 - [AMD ROCm] Support variable length of page attention
Pull Request - State: closed - Opened by rocking5566 28 days ago
#1430 - Fix calls to `torch.is_grad_enabled()`
Pull Request - State: closed - Opened by ksivaman 29 days ago
#1429 - [flash attn v2] Why V uses no-swizzle layout for registers?
Issue - State: open - Opened by phantaurus 29 days ago - 1 comment
#1428 - version `GLIBCXX_3.4.29' not found
Issue - State: open - Opened by zhanghanxing2022 29 days ago
#1427 - Generalize cuda version checks for A100 and above
Pull Request - State: closed - Opened by ksivaman 30 days ago
#1426 - [Delete]
Issue - State: closed - Opened by rebemika-amzn 30 days ago
#1425 - Remove unused 224 cu kernels
Pull Request - State: closed - Opened by drisspg about 1 month ago
#1424 - UnboundLocalError: cannot access local variable 'out' where it is not associated with a value
Issue - State: closed - Opened by CicelyCafe about 1 month ago - 1 comment
#1423 - ERROR: No matching distribution found for flash-attn==2.6.3+cu123torch2.4cxx11abifalse
Issue - State: open - Opened by carolynsoo about 1 month ago - 1 comment
#1422 - Unable to install flash_attn on H100 with CUDA 12.5
Issue - State: open - Opened by ghadiaravi13 about 1 month ago
#1421 - Unable to install `flash-attn` even if I first install `torch` alone
Issue - State: closed - Opened by ytxmobile98 about 1 month ago - 5 comments
#1420 - Is there a plan to support flash_attn_varlen_backward with fp8
Issue - State: open - Opened by gaodaheng about 1 month ago - 1 comment
#1419 - Add a macro for namespace
Pull Request - State: closed - Opened by drisspg about 1 month ago
#1418 - Encounter some problems when building wheel
Issue - State: open - Opened by ZarkPanda about 1 month ago
#1417 - `flash_attn_with_kvcache` discrepancy slicing kv_cache / cache_seqlens
Issue - State: open - Opened by jeromeku about 1 month ago
#1416 - [CK_TILE] FAv3 bwd bugfix
Pull Request - State: closed - Opened by poyenc about 1 month ago
#1415 - RuntimeError: Error compiling objects for extension
Issue - State: open - Opened by ProgramerSalar about 1 month ago - 2 comments
#1414 - looking for a test to verify cache correctness in `flash_attn_with_kvcache`
Issue - State: open - Opened by chakpongchung about 1 month ago - 2 comments
#1413 - Performance Impact of Using Three Warps per Group (WG) in FA3 Compared to Two WGs
Issue - State: open - Opened by ziyuhuang123 about 1 month ago - 1 comment
#1412 - UnboundLocalError: local variable 'out' referenced before assignment
Issue - State: open - Opened by chuangzhidan about 1 month ago - 3 comments
#1411 - Can't install it
Issue - State: open - Opened by TherrenceF about 1 month ago - 1 comment
#1410 - Impact of Register Spills on FA3 Kernel Performance
Issue - State: open - Opened by ziyuhuang123 about 1 month ago - 1 comment
#1409 - FA 2.4.2 is failing unit tests on A6000 and A5880
Issue - State: open - Opened by BoxiangW about 1 month ago - 5 comments
#1408 - Why Did FA3 Change SmemLayoutAtomO Definition in the New Version?
Issue - State: closed - Opened by ziyuhuang123 about 2 months ago
#1407 - Why Does FA3 Use Registers Instead of Directly Accessing SMEM with WGMMA on SM90?
Issue - State: open - Opened by ziyuhuang123 about 2 months ago - 1 comment
#1406 - fix bug when is_grad is false
Pull Request - State: closed - Opened by woaixiaoxiao about 2 months ago
#1405 - Add missing tests/__init__.py
Pull Request - State: open - Opened by BioGeek about 2 months ago
#1404 - 4 Failing `test_flash_attn_output_fp8` tests on H100
Issue - State: open - Opened by BioGeek about 2 months ago - 3 comments
#1403 - Does bar.sync Emit Semaphores Alongside bar.arrive?
Issue - State: closed - Opened by ziyuhuang123 about 2 months ago - 1 comment
#1402 - is flash_attn_with_kvcache() supposed to work for seqlen > 1?
Issue - State: closed - Opened by vince62s about 2 months ago - 1 comment
#1401 - Understanding sync and arrive in FA3 Store Function
Issue - State: open - Opened by ziyuhuang123 about 2 months ago
#1400 - Understanding the Role of arrive in NamedBarrier Synchronization
Issue - State: open - Opened by ziyuhuang123 about 2 months ago - 1 comment
#1399 - Fix incorrect torch dtype
Pull Request - State: closed - Opened by kevmo314 about 2 months ago
#1398 - The execution order between GEMM0 of the next iteration and GEMM1 of the current iteration in the pingpong scheduling pipeline for overlapping GEMMs and softmax between warpgroups
Issue - State: open - Opened by tengdecheng about 2 months ago