Ecosyste.ms: Issues
An open API service providing issue and pull request metadata for open source projects.
GitHub / Dao-AILab/flash-attention issues and pull requests
#1420 - Is there a plan to support flash_attn_varlen_backward with fp8
Issue -
State: open - Opened by gaodaheng about 1 month ago
- 1 comment
#1419 - Add a macro for namespace
Pull Request -
State: closed - Opened by drisspg about 1 month ago
#1418 - Encounter some problems when building wheel
Issue -
State: open - Opened by ZarkPanda about 1 month ago
#1417 - `flash_attn_with_kvcache` discrepancy slicing kv_cache / cache_seqlens
Issue -
State: open - Opened by jeromeku about 1 month ago
#1416 - [CK_TILE] FAv3 bwd bugfix
Pull Request -
State: closed - Opened by poyenc about 2 months ago
#1415 - RuntimeError: Error compiling objects for extension
Issue -
State: open - Opened by ProgramerSalar about 2 months ago
- 2 comments
#1414 - looking for a test to verify cache correctness in `flash_attn_with_kvcache`
Issue -
State: open - Opened by chakpongchung about 2 months ago
- 2 comments
#1413 - Performance Impact of Using Three Warps per Group (WG) in FA3 Compared to Two WGs
Issue -
State: closed - Opened by ziyuhuang123 about 2 months ago
- 1 comment
#1412 - UnboundLocalError: local variable 'out' referenced before assignment
Issue -
State: closed - Opened by chuangzhidan about 2 months ago
- 6 comments
#1411 - Can't intall it
Issue -
State: open - Opened by TherrenceF about 2 months ago
- 1 comment
#1410 - Impact of Register Spills on FA3 Kernel Performance
Issue -
State: closed - Opened by ziyuhuang123 about 2 months ago
- 1 comment
#1409 - FA 2.4.2 is falling unitest on A6000 and A5880
Issue -
State: open - Opened by BoxiangW about 2 months ago
- 5 comments
#1408 - Why Did FA3 Change SmemLayoutAtomO Definition in the New Version?
Issue -
State: closed - Opened by ziyuhuang123 about 2 months ago
#1407 - Why Does FA3 Use Registers Instead of Directly Accessing SMEM with WGMMA on SM90?
Issue -
State: closed - Opened by ziyuhuang123 about 2 months ago
- 1 comment
#1406 - fix bug when is_grad is false
Pull Request -
State: closed - Opened by woaixiaoxiao about 2 months ago
#1405 - Add missing tests/__init__.py
Pull Request -
State: open - Opened by BioGeek about 2 months ago
#1404 - 4 Failing `test_flash_attn_output_fp8` tests on H100
Issue -
State: open - Opened by BioGeek about 2 months ago
- 3 comments
#1403 - Does bar.sync Emit Semaphores Alongside bar.arrive?
Issue -
State: closed - Opened by ziyuhuang123 about 2 months ago
- 1 comment
#1402 - is flash_attn_with_kvcache() supposed to work for seqlen > 1 ?
Issue -
State: closed - Opened by vince62s about 2 months ago
- 1 comment
#1401 - Understanding sync and arrive in FA3 Store Function
Issue -
State: open - Opened by ziyuhuang123 about 2 months ago
#1400 - Understanding the Role of arrive in NamedBarrier Synchronization
Issue -
State: open - Opened by ziyuhuang123 about 2 months ago
- 1 comment
#1399 - Fix incorrect torch dtype
Pull Request -
State: closed - Opened by kevmo314 about 2 months ago
#1398 - The execution order between GEMM0 of the next iteration and GEMM1 of the current iteration in Pingpong scheduling pipeline for overlapping gemms and softmax between warpgroups
Issue -
State: open - Opened by tengdecheng about 2 months ago
#1397 - check torch.is_grad_enabled before calling customer flash atten ops
Pull Request -
State: closed - Opened by XiaobingSuper about 2 months ago
- 5 comments
#1396 - Why Doesn't FlashAttention3 Allow KV and O to Share Memory Space?
Issue -
State: open - Opened by ziyuhuang123 about 2 months ago
- 1 comment
#1395 - g2s K tensor when handling padding in the seq_k, clear it rather than keeping the default SMEM values.
Issue -
State: open - Opened by NVIDIA-JerryChen about 2 months ago
#1394 - Create PEP 517 build metadata
Pull Request -
State: closed - Opened by frostming about 2 months ago
- 1 comment
#1393 - Add hipBLAS/cuBLAS distinction in benchmark_gemm.py
Pull Request -
State: closed - Opened by garrettbyrd about 2 months ago
#1392 - fix a bug (issue #1390) caused by typo
Pull Request -
State: closed - Opened by liguohao96 about 2 months ago
- 1 comment
#1391 - Large loss of accuracy between flashattention and native
Issue -
State: open - Opened by fanfanaaaa about 2 months ago
- 3 comments
#1390 - a small typo and fix
Issue -
State: open - Opened by liguohao96 about 2 months ago
- 3 comments
#1389 - Why does NamedBarrier in epilogue use NumMmaThreads(256) + NumThreadsPerWarp(32)?
Issue -
State: open - Opened by ziyuhuang123 about 2 months ago
- 2 comments
#1388 - Windows 11 Installation Error
Issue -
State: open - Opened by 404-xianjin about 2 months ago
#1387 - FA-3 installation errors
Issue -
State: closed - Opened by asahni04 about 2 months ago
- 1 comment
#1386 - is fwd_kvcache compatible with torch.compile in 2.7.2post1 ?
Issue -
State: open - Opened by vince62s about 2 months ago
- 6 comments
#1385 - How to get actual col idx
Issue -
State: open - Opened by wenkechen 2 months ago
#1384 - Support dedicated compile[For Research]
Pull Request -
State: open - Opened by AllenDou 2 months ago
#1383 - don't save inputs/outputs buffer of FlashAttenFunc to reduce memory usage for inference mode
Pull Request -
State: closed - Opened by XiaobingSuper 2 months ago
- 3 comments
#1382 - Fix deprecation warnings
Pull Request -
State: open - Opened by rongou 2 months ago
#1381 - [ROCm] benchmark_flash_attention.py failing with Memory Access Fault
Issue -
State: open - Opened by nikhil-tensorwave 2 months ago
- 3 comments
#1380 - Validate that `git` is available and `CUDA_HOME` is set in `setup.py`
Pull Request -
State: closed - Opened by davidmezzetti 2 months ago
#1379 - Possible to install with just `torch` installed?
Issue -
State: closed - Opened by davidmezzetti 2 months ago
- 6 comments
#1378 - seq_lens variable used in the attention kernel
Issue -
State: closed - Opened by chakpongchung 2 months ago
- 1 comment
#1377 - Flash attention 3 does not use Dropout_p?
Issue -
State: open - Opened by nighting0le01 2 months ago
- 6 comments
#1376 - Accuracy Drop with Flash-Attention Reimplementation in Encoder-Decoder Architecture (ViT)
Issue -
State: closed - Opened by ImaGonEs 2 months ago
- 2 comments
#1375 - FA3 for cuda12.3 but torch only releases cuda 12.4 version
Issue -
State: closed - Opened by wplf 2 months ago
- 2 comments
#1374 - Headdim==96 in FA3
Issue -
State: closed - Opened by wplf 2 months ago
- 2 comments
#1373 - Can wgmma.async and barrier.arrive Ensure GEMM Completion Before Moving Forward?
Issue -
State: closed - Opened by ziyuhuang123 2 months ago
- 2 comments
#1372 - Why we have a third barrier::QueryEmpty arrive?
Issue -
State: open - Opened by ziyuhuang123 2 months ago
- 1 comment
#1371 - Question About Initial sync Behavior Without Prior arrive in Warpgroup Scheduling
Issue -
State: closed - Opened by ziyuhuang123 2 months ago
- 2 comments
#1370 - Question about warp_scheduler_barrier_arrive in FA3 and cutlass::arch::NamedBarrier::arrive Usage
Issue -
State: closed - Opened by ziyuhuang123 2 months ago
- 2 comments
#1369 - GLT
Issue -
State: open - Opened by deepgandu 2 months ago
#1368 - The byzantine copy of Tensor O
Issue -
State: closed - Opened by phantaurus 2 months ago
- 4 comments
#1367 - Issue Installing cuDNN Python Module via pip install cudnn
Issue -
State: open - Opened by ziyuhuang123 2 months ago
#1366 - Sliding Window (Local Attention) possibly incorrect on newest branch
Issue -
State: open - Opened by kilianhaefeli 2 months ago
- 1 comment
#1365 - Change {q,k,v}_descale to be per-batch-element
Pull Request -
State: closed - Opened by ericauld 2 months ago
#1364 - Is there any way to compile the codes with nvcc debug flag(-G)?
Issue -
State: open - Opened by Dev-Jahn 2 months ago
- 6 comments
#1363 - flash_bwd_kernel.h: add maybe_unused annotation to suppress compile warnings
Pull Request -
State: closed - Opened by acgessler 2 months ago
#1362 - Triton Issues for Rotary flash_attn.layers.rotary.apply_rotary_emb_qkv_
Issue -
State: open - Opened by albertotono 2 months ago
#1361 - Fix FA3 Varlen Performance regression
Pull Request -
State: closed - Opened by kadeng 2 months ago
#1360 - Need `tests/__init__.py` for `hopper/test_flash_attn.py`
Issue -
State: open - Opened by hancheolcho 3 months ago
- 2 comments
#1359 - Output Discrepancy Between FlashAttention and PyTorch Attention
Issue -
State: closed - Opened by pengzhangzhi 3 months ago
- 2 comments
#1358 - Add support for qk dim different from v dim in PR #1166
Issue -
State: closed - Opened by YTianZHU 3 months ago
#1357 - How to get attention score? "return_attn_probs=True" is not work.
Issue -
State: closed - Opened by UnableToUseGit 3 months ago
- 3 comments
#1356 - How to assign ROCm architecture during pip installing
Issue -
State: open - Opened by deeptimhe 3 months ago
#1355 - Does flash-attn support FP8 inference on L40-48G?
Issue -
State: open - Opened by LinJianping 3 months ago
#1354 - Flashdecoding with appendKV might incorrect
Issue -
State: open - Opened by DD-DuDa 3 months ago
#1353 - Added a Benchmark for Rotary and Improved Rotary Performance
Pull Request -
State: closed - Opened by alexkranias-amd 3 months ago
- 1 comment
#1352 - FP8 test failure on the latest 'decode' branch
Issue -
State: closed - Opened by cscyuge 3 months ago
- 1 comment
#1351 - Unable to cast Python instance of type <class 'torch._subclasses.fake_tensor.FakeTensor'> to C++ type
Issue -
State: open - Opened by zwhe99 3 months ago
#1350 - How could I use a query to calculate the attention with multiple k-v
Issue -
State: open - Opened by DongyuXu77 3 months ago
- 1 comment
#1349 - Question of the equation in Flash Attention 2 Paper
Issue -
State: open - Opened by jeffrey-sunh1 3 months ago
- 5 comments
#1348 - Issue with installing flash attention ` import flash_attn_2_cuda as flash_attn_cuda`
Issue -
State: open - Opened by hahmad2008 3 months ago
- 6 comments
#1347 - breaking change for head size non divisble by 8
Issue -
State: closed - Opened by felix-red-panda 3 months ago
- 1 comment
#1346 - RuntimeError: Error compiling objects for extension
Issue -
State: closed - Opened by beyondguo 3 months ago
- 5 comments
#1345 - [Q] why flash attention MFU is over 100% in A800
Issue -
State: closed - Opened by wonderisland 3 months ago
#1344 - [Bug] Potential hazard in epilogue when kUseVarSeqLen=true
Issue -
State: closed - Opened by QiZhangNV 3 months ago
- 2 comments
#1343 - FA3 Failed to initialize the TMA descriptor
Issue -
State: open - Opened by li-yi-dong 3 months ago
#1342 - Assistance on implementing Flash Attention 2 for Turing
Issue -
State: open - Opened by samuelzxu 3 months ago
#1341 - [Bug]: Perf slump after updating flash-attn 2.7.0 (with torch.compile using)
Issue -
State: open - Opened by Mnb66 3 months ago
- 4 comments
#1340 - Building a wheel for torch 2.5.0-2.5.1 with Python 3.10 and CUDA 12.4 on Windows has failed.
Issue -
State: open - Opened by lldacing 3 months ago
- 2 comments