deftruth/awesome-llm-inference issues and pull requests

#117 - 🔥🔥[Mooncake] Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving

Pull Request - State: closed - Opened by DefTruth 18 days ago

#116 - 🔥🔥[DeServe] DESERVE: TOWARDS AFFORDABLE OFFLINE LLM INFERENCE VIA DECENTRALIZATION

Pull Request - State: closed - Opened by DefTruth 18 days ago

#115 - 🔥🔥[KVDirect] KVDirect: Distributed Disaggregated LLM Inference

Pull Request - State: closed - Opened by DefTruth 18 days ago

#114 - 🔥🔥[DistServe] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving

Pull Request - State: closed - Opened by DefTruth 18 days ago

#113 - [feat] add deepseek-r1

Pull Request - State: closed - Opened by shaoyuyoung 26 days ago

#112 - add `MiniMax-01` in Trending LLM/VLM Topics and Long Context Attention

Pull Request - State: closed - Opened by shaoyuyoung about 1 month ago - 2 comments

#111 - 🔥🔥[FFPA] FFPA: Yet another Faster Flash Prefill Attention with O(1) SRAM complexity for headdim > 256, ~1.5x faster than SDPA EA(@DefTruth)

Pull Request - State: closed - Opened by DefTruth about 1 month ago

#110 - 🔥🔥[SP: TokenRing] TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication

Pull Request - State: closed - Opened by DefTruth about 1 month ago

#109 - 🔥🔥🔥[DeepSeek-V3] DeepSeek-V3 Technical Report

Pull Request - State: closed - Opened by DefTruth about 2 months ago

#108 - 🔥🔥[HADACORE] HADACORE: TENSOR CORE ACCELERATED HADAMARD TRANSFORM KERNEL

Pull Request - State: closed - Opened by DefTruth about 2 months ago

#107 - 🔥[DynamicKV] DynamicKV: Task-Aware Adaptive KV Cache Compression for Long Context LLMs

Pull Request - State: closed - Opened by DefTruth about 2 months ago

#106 - 🔥🔥[NITRO] NITRO: LLM INFERENCE ON INTEL® LAPTOP NPUS

Pull Request - State: closed - Opened by DefTruth about 2 months ago

#105 - 🔥🔥[TurboAttention] TURBOATTENTION: EFFICIENT ATTENTION APPROXIMATION FOR HIGH THROUGHPUTS LLMS

Pull Request - State: closed - Opened by DefTruth about 2 months ago

#104 - 🔥[BatchLLM] BatchLLM: Optimizing Large Batched LLM Inference with Global Prefix Sharing and Throughput-oriented Token Batching

Pull Request - State: closed - Opened by DefTruth 2 months ago

#103 - 🔥[ClusterKV] ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression

Pull Request - State: closed - Opened by DefTruth 2 months ago

#102 - 🔥[KV Cache Recomputation] Efficient LLM Inference with I/O-Aware Partial KV Cache Recomputation

Pull Request - State: closed - Opened by DefTruth 3 months ago

#101 - 🔥[Star-Attention: 11x~ speedup] Star Attention: Efficient LLM Inference over Long Sequences

Pull Request - State: closed - Opened by DefTruth 3 months ago

#100 - 🔥[SparseInfer] SparseInfer: Training-free Prediction of Activation Sparsity for Fast LLM Inference

Pull Request - State: closed - Opened by DefTruth 3 months ago

#99 - 🔥[Squeezed Attention] SQUEEZED ATTENTION: Accelerating Long Context Length LLM Inference(@UC Berkeley)

Pull Request - State: closed - Opened by DefTruth 3 months ago

#98 - 🔥[SageAttention-2] SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration(@thu-ml)

Pull Request - State: closed - Opened by DefTruth 3 months ago

#97 - 🔥[SageAttention] SAGEATTENTION: ACCURATE 8-BIT ATTENTION FOR PLUG-AND-PLAY INFERENCE ACCELERATION(@thu-ml)

Pull Request - State: closed - Opened by DefTruth 3 months ago

#96 - add vAttention code link

Pull Request - State: closed - Opened by KevinZeng08 3 months ago

#95 - Add code link to BPT

Pull Request - State: closed - Opened by DefTruth 3 months ago

#94 - 🔥🔥[TP: Comm Compression] Communication Compression for Tensor Parallel LLM Inference

Pull Request - State: closed - Opened by DefTruth 3 months ago

#93 - 🔥🔥[SP: BPT] Blockwise Parallel Transformer for Large Context Models

Pull Request - State: closed - Opened by DefTruth 3 months ago

#92 - Add DP/TP/SP/CP papers with codes

Pull Request - State: closed - Opened by DefTruth 3 months ago

#91 - 🔥[BitNet] BitNet a4.8: 4-bit Activations for 1-bit LLMs

Pull Request - State: closed - Opened by DefTruth 3 months ago

#90 - 🔥[Tensor Product] Acceleration of Tensor-Product Operations with Tensor Cores

Pull Request - State: closed - Opened by DefTruth 4 months ago

#89 - 🔥[Fast Best-of-N] Fast Best-of-N Decoding via Speculative Rejection

Pull Request - State: closed - Opened by DefTruth 4 months ago

#88 - 🔥[FastAttention] FastAttention: Extend FlashAttention2 to NPUs and Low-resource GPUs for Efficient Inference

Pull Request - State: closed - Opened by DefTruth 4 months ago

#87 - Efficient Hybrid Inference for LLMs: Reward-Based Token Modelling with Selective Cloud Assistance

Pull Request - State: closed - Opened by aharshms 4 months ago

#86 - Add paper AdaKV

Pull Request - State: closed - Opened by FFY0 4 months ago - 1 comment

#85 - early exit of LLM inference

Pull Request - State: closed - Opened by boyi-liu 4 months ago - 1 comment

#84 - 🔥[PARALLELSPEC] PARALLELSPEC: PARALLEL DRAFTER FOR EFFICIENT SPECULATIVE DECODING

Pull Request - State: closed - Opened by DefTruth 4 months ago

#83 - [LLM Inference] LARGE LANGUAGE MODEL INFERENCE ACCELERATION: A COMPREHENSIVE HARDWARE PERSPECTIVE

Pull Request - State: closed - Opened by DefTruth 4 months ago

#82 - Large Language Model Performance Benchmarking on Mobile Platforms: A Thorough Evaluation

Pull Request - State: closed - Opened by DefTruth 4 months ago

#81 - 🔥[LORC] Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy

Pull Request - State: closed - Opened by DefTruth 4 months ago

#80 - [From Author] Link CacheGen and CacheBlend to LMCache

Pull Request - State: closed - Opened by KuntaiDu 5 months ago

#79 - Bump up to v2.6

Pull Request - State: closed - Opened by DefTruth 5 months ago

#78 - 🔥[LayerKV] Optimizing Large Language Model Serving with Layer-wise KV Cache Management

Pull Request - State: closed - Opened by DefTruth 5 months ago

#77 - 🔥[KV-COMPRESS] PAGED KV-CACHE COMPRESSION WITH VARIABLE COMPRESSION RATES PER ATTENTION HEAD

Pull Request - State: closed - Opened by DefTruth 5 months ago

#76 - 🔥🔥[Tensor Cores] Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores

Pull Request - State: closed - Opened by DefTruth 5 months ago

#75 - 🔥[AlignedKV] AlignedKV: Reducing Memory Access of KV-Cache with Precision-Aligned Quantization

Pull Request - State: closed - Opened by DefTruth 5 months ago

#74 - 🔥🔥[HiFloat8] Ascend HiFloat8 Format for Deep Learning

Pull Request - State: closed - Opened by DefTruth 5 months ago

#73 - [Low-bit] A Survey of Low-bit Large Language Models: Basics, Systems, and Algorithms

Pull Request - State: closed - Opened by DefTruth 5 months ago

#72 - 🔥🔥[INT-FLASHATTENTION] INT-FLASHATTENTION: ENABLING FLASH ATTENTION FOR INT8 QUANTIZATION

Pull Request - State: closed - Opened by DefTruth 5 months ago

#71 - fix typo

Pull Request - State: closed - Opened by DefTruth 5 months ago

#70 - 🔥[VPTQ] VPTQ: EXTREME LOW-BIT VECTOR POST-TRAINING QUANTIZATION FOR LARGE LANGUAGE MODELS

Pull Request - State: closed - Opened by DefTruth 5 months ago

#69 - Bump up to v2.5

Pull Request - State: closed - Opened by DefTruth 5 months ago

#68 - 🔥🔥[CRITIPREFILL] CRITIPREFILL: A SEGMENT-WISE CRITICALITY-BASED APPROACH FOR PREFILLING ACCELERATION IN LLMS

Pull Request - State: closed - Opened by DefTruth 5 months ago

#67 - move RetrievalAttention -> long context

Pull Request - State: closed - Opened by DefTruth 5 months ago

#66 - Update codebase of paper "parallel speculative decoding with adaptive draft length"

Pull Request - State: closed - Opened by smart-lty 5 months ago - 1 comment

#65 - 🔥[InstInfer] InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference

Pull Request - State: closed - Opened by DefTruth 5 months ago

#64 - Bump up to v2.4

Pull Request - State: closed - Opened by DefTruth 5 months ago

#63 - 🔥[Inf-MLLM] Inf-MLLM: Efficient Streaming Inference of Multimodal Large Language Models on a Single GPU

Pull Request - State: closed - Opened by DefTruth 5 months ago

#62 - 🔥[RetrievalAttention] Accelerating Long-Context LLM Inference via Vector Retrieval

Pull Request - State: closed - Opened by DefTruth 5 months ago

#61 - Bump up to v2.3

Pull Request - State: closed - Opened by DefTruth 5 months ago