fasterdecoding/snapkv issues and pull requests

#28 - Bug on Qwen2-VL

Issue - State: open - Opened by LiJunscs 23 days ago

#27 - The first generation token output sees the whole cache key and value

Issue - State: open - Opened by PengWenChen 27 days ago - 3 comments

#26 - Llama-3 Implementation

Issue - State: closed - Opened by kunlun531 2 months ago

#25 - why not use the last token for kv cache compression

Issue - State: open - Opened by Arist12 2 months ago

#24 - Question: is key_state_compressed used for inference?

Issue - State: open - Opened by jq-wei 2 months ago - 1 comment

#23 - What happens to the total KV length > max-compacity length during response generation?

Issue - State: open - Opened by PengWenChen 3 months ago - 1 comment

#22 - Group Query Attention

Issue - State: open - Opened by SimJeg 4 months ago - 4 comments

#21 - Question on H2O experiment reproduction

Issue - State: open - Opened by CUHKSZzxy 6 months ago

#20 - Closed issue

Issue - State: closed - Opened by JulietLJY 7 months ago

#19 - Could you provide the code for visualization the Hit Rate?

Issue - State: open - Opened by Dominic789654 7 months ago

#18 - Can snapkv compress kv in case different user questions are posed towards the same context?

Issue - State: open - Opened by namespace-Pt 7 months ago - 1 comment

#17 - observation window size and consistency between layers

Issue - State: closed - Opened by Cooperx521 8 months ago - 1 comment

#16 - Question on GQA implementation

Issue - State: open - Opened by cyLi-Tiger 8 months ago - 1 comment

#15 - Can I use the SnapKV without the flash-attention ?

Issue - State: closed - Opened by pengshuang 8 months ago - 1 comment

#14 - What prompt was used in Needle in a Haystack test?

Issue - State: closed - Opened by 66RING 8 months ago - 1 comment

#13 - expanded_attn_mask = causal_4d_mask.masked_fill(expanded_attn_mask.bool(), torch.finfo(dtype).min) RuntimeError: The size of tensor a (3509) must match the size of tensor b (7017) at non-singleton dimension 3

Issue - State: closed - Opened by seeyourcell 8 months ago - 5 comments

#12 - Can't not run longbench!

Issue - State: open - Opened by HarryWu99 8 months ago - 3 comments

#11 - why only decode do compress?

Issue - State: open - Opened by CSEEduanyu 9 months ago

#10 - Only kv is compressed. Is the size of Q and K inconsistent when attention is calculated?

Issue - State: closed - Opened by CSEEduanyu 9 months ago - 1 comment

#9 - It seems that snapkv need to be able to do "prefill" at least once before the prompt can be compressed.

Issue - State: closed - Opened by 66RING 9 months ago - 1 comment

#8 - Observation

Pull Request - State: closed - Opened by leeyeehoo 9 months ago

#7 - yl: remove unnessecary

Pull Request - State: closed - Opened by leeyeehoo 9 months ago

#6 - yl: fix a bug

Pull Request - State: closed - Opened by leeyeehoo 9 months ago

#5 - yl: fix typo

Pull Request - State: closed - Opened by leeyeehoo 9 months ago

#4 - Grouped query attention implementation

Issue - State: closed - Opened by guozhiyu 9 months ago - 1 comment

#3 - maybe a bug in `update_kv` function

Issue - State: open - Opened by HarryWu99 9 months ago - 1 comment

#2 - The effect of Clustering via Pooling may be greater？

Issue - State: open - Opened by HarryWu99 9 months ago - 1 comment

#1 - Questions on paper and code [prompting for mistral, positional index, minor errors & questions in paper]

Issue - State: open - Opened by MarsJacobs 9 months ago - 8 comments
