Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / ggerganov/llama.cpp issues and pull requests
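The listing below can also be retrieved programmatically. Here is a minimal sketch in Python; the endpoint path and response field names follow the service's documented REST pattern but are assumptions here, so consult https://issues.ecosyste.ms/docs for the authoritative API:

    # Sketch: fetch recent issue/PR metadata for ggerganov/llama.cpp from the
    # ecosyste.ms issues API. The endpoint path and the "number"/"state"/"title"
    # field names are assumed, not confirmed -- see https://issues.ecosyste.ms/docs.
    import json
    import urllib.parse
    import urllib.request

    BASE = "https://issues.ecosyste.ms/api/v1"
    repo = urllib.parse.quote("ggerganov/llama.cpp", safe="")  # encode the "/"
    url = f"{BASE}/hosts/GitHub/repositories/{repo}/issues?per_page=50"

    with urllib.request.urlopen(url) as resp:
        issues = json.load(resp)  # assumed: a JSON array of issue objects

    for item in issues:
        print(item.get("number"), item.get("state"), item.get("title"))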
#10091 - Bug: Cannot run larger than VRAM models with `GGML_CUDA_ENABLE_UNIFIED_MEMORY`
Issue - State: open - Opened by vlovich 10 days ago
Labels: bug-unconfirmed, high severity
#10090 - Feature Request: Meta releases Layer Skip, an end-to-end solution for accelerating LLMs
Issue - State: open - Opened by mirek190 10 days ago
Labels: enhancement
#10089 - Bug: SwiftUI example does not work on simulator.
Issue - State: open - Opened by guinmoon 10 days ago
Labels: bug-unconfirmed, low severity
#10083 - Bug: Floating Point Exceptions turned off by default, hiding fpExceptions
Issue - State: open - Opened by borisweberdev 10 days ago
Labels: bug-unconfirmed, medium severity
#10080 - Bug: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED
Issue - State: open - Opened by morgen52 11 days ago - 4 comments
Labels: bug-unconfirmed, high severity
#10078 - Bug: llama-server not logging to file
Issue - State: open - Opened by PyroGenesis 11 days ago - 4 comments
Labels: bug-unconfirmed, medium severity
#10077 - Bug: convert_hf_to_gguf.py: error: argument --outtype: invalid choice: 'q4_k_m' (choose from 'f32', 'f16', 'bf16', 'q8_0', 'tq1_0', 'tq2_0', 'auto')
Issue - State: open - Opened by awesomecoolraj 11 days ago - 1 comment
Labels: bug-unconfirmed, high severity
#10075 - Feature Request: Implement « Why Does the Effective Context Length of LLMs Fall Short? »
Issue - State: open - Opened by arthurwolf 11 days ago
Labels: enhancement
#10074 - Bug: llama-service can only generate garbled text after a request with invalid tokens.
Issue - State: closed - Opened by morgen52 11 days ago - 4 comments
Labels: bug-unconfirmed, critical severity
#10072 - Bug: Wrong slots management when receiving multiple concurrent requests.
Issue - State: open - Opened by morgen52 11 days ago - 4 comments
Labels: bug-unconfirmed, high severity
#10071 - llama : remove Tail-Free sampling
Pull Request - State: closed - Opened by ggerganov 12 days ago - 6 comments
Labels: script, testing, examples, python, server
#10070 - Bug: Setting the `np` configs leads to garbled generated tokens.
Issue - State: open - Opened by morgen52 12 days ago - 7 comments
Labels: bug-unconfirmed, high severity
#10069 - Implement ggml_v_expf() with a fast approximation on AVX/AVX2/AVX512
Pull Request - State: closed - Opened by J-Montgomery 12 days ago - 5 comments
Labels: ggml
#10065 - convert : more detailed convert lora usage docs
Pull Request - State: open - Opened by richdougherty 13 days ago
Labels: python
#10064 - readme : more lora detail in main example readme
Pull Request - State: open - Opened by richdougherty 13 days ago
Labels: examples
#10063 - nix: update flake.lock
Pull Request - State: closed - Opened by ggerganov 13 days ago
Labels: nix
#10062 - fix deepseek deseret regex
Pull Request - State: closed - Opened by dhiltgen 13 days ago - 2 comments
#10059 - llama : rename missed batch params/vars to ubatch
Pull Request - State: open - Opened by danbev 13 days ago
#10058 - Bug: when Vulkan is enabled, the error log message "vulkan-shaders-gen: command not found" occurred
Issue - State: open - Opened by shuzf 13 days ago
Labels: bug-unconfirmed, critical severity
#10057 - sampling : add adaptive temperature sampler
Pull Request - State: closed - Opened by m18coppola 14 days ago - 5 comments
Labels: testing, examples, server
#10056 - Bug: Server /v1/chat/completions API response's model info is wrong
Issue - State: open - Opened by RifeWang 14 days ago - 1 comment
Labels: bug-unconfirmed, medium severity
#10055 - add FP8 support to gguf/llama:
Pull Request - State: open - Opened by Djip007 14 days ago - 6 comments
Labels: build, script, testing, examples, ggml
#10054 - Bug: Server with multiple slots: Long input + short output -> extreme generation token/s slowdown
Issue - State: closed - Opened by us58 14 days ago - 3 comments
Labels: bug-unconfirmed, medium severity
#10053 - Llama server - Update doc for slot states
Pull Request - State: open - Opened by PyroGenesis 14 days ago - 3 comments
Labels: examples, server
#10052 - Bug: Can't run even the smallest model due to CUDA OOM without FlashAttention
Issue - State: closed - Opened by vlovich 14 days ago - 1 comment
Labels: bug-unconfirmed, medium severity
#10051 - Bug: No LM Runtime found for model format 'gguf'
Issue - State: closed - Opened by SculptorGoldenMoon 14 days ago - 1 comment
Labels: bug-unconfirmed, low severity
#10050 - Bug: 15 GiB of CPU RAM permanently leaked on each llama-cli invocation
Issue - State: closed - Opened by vlovich 14 days ago - 2 comments
Labels: bug-unconfirmed, high severity
#10049 - Bug: llama-export-lora converts all non-F32 values to F16
Issue - State: open - Opened by NWalker1208 14 days ago
Labels: bug-unconfirmed, medium severity
#10048 - sampling: add K-Shift sampler
Pull Request - State: open - Opened by MaggotHATE 14 days ago - 5 comments
Labels: testing, examples, server
#10047 - Bug: Certain RPC Servers cause major slowdown to Host machine
Issue - State: open - Opened by GoudaCouda 14 days ago - 2 comments
Labels: bug-unconfirmed, medium severity
#10045 - kompute: add backend registry / device interfaces and Q4_K shader
Pull Request - State: open - Opened by slp 14 days ago - 1 comment
Labels: ggml, Kompute
#10044 - Fix logging from llama-llava-cli
Pull Request - State: open - Opened by Googulator 14 days ago - 3 comments
Labels: examples
#10042 - musa: workaround for Guilty Lockup in cleaning src0 in #10032
Pull Request - State: closed - Opened by yeahdongcn 14 days ago - 2 comments
#10041 - [SYCL] pass SYCL CI
Pull Request - State: open - Opened by airMeng 14 days ago - 9 comments
Labels: testing, ggml, SYCL
#10039 - Execute multiple compute graphs in parallel
Issue - State: open - Opened by paomiannanjue 15 days ago
#10037 - Bug: Vulkan backend freezes during its execution
Issue - State: open - Opened by GrainyTV 15 days ago - 9 comments
Labels: bug-unconfirmed, medium severity
#10035 - Feature Request: Support Aya
Issue - State: open - Opened by maziyarpanahi 15 days ago
Labels: enhancement
#10034 - Make Kompute error verbose about unsupported types
Pull Request - State: closed - Opened by ericcurtin 15 days ago
#10033 - metal : support permuted matrix multiplications
Pull Request - State: closed - Opened by ggerganov 15 days ago
#10032 - CUDA: fix insufficient buffer clearing for MMQ
Pull Request - State: closed - Opened by JohannesGaessler 15 days ago
Labels: Review Complexity : Low
#10031 - Bug: issue in CUDA flash attention
Issue - State: closed - Opened by agray3 15 days ago - 7 comments
Labels: bug-unconfirmed, medium severity
#10030 - server : check that the prompt fits in the slot's context
Pull Request - State: closed - Opened by ggerganov 15 days ago
Labels: examples, python, server
#10029 - ggml : Implementations for Q4_0_8_8 quantization based functions - RISC-V vector version
Pull Request - State: open - Opened by xctan 16 days ago - 1 comment
Labels: ggml
#10028 - Feature Request: Support for DeciLMForCausalLM
Issue - State: open - Opened by ymcki 16 days ago - 2 comments
Labels: enhancement
#10027 - Feature Request: Support tools and tool_choice parameter in OpenAI compatible service
Issue - State: open - Opened by ChanceFlow 16 days ago - 1 comment
Labels: enhancement
#10026 - llama : refactor model loader with backend registry
Pull Request - State: open - Opened by slaren 16 days ago - 17 comments
Labels: script, Nvidia GPU, Vulkan, examples, python, devops, ggml, SYCL, Kompute
#10023 - server : refactor slot input data, move tokenizer to HTTP thread
Pull Request - State: closed - Opened by ngxson 16 days ago - 5 comments
Labels: examples, python, server
#10022 - llama: string_split fix
Pull Request - State: closed - Opened by Xarbirus 16 days ago - 3 comments
Labels: examples, server
#10021 - CUDA: fix MMQ for non-contiguous src0, add tests
Pull Request - State: closed - Opened by JohannesGaessler 16 days ago
Labels: testing, Nvidia GPU, Review Complexity : Medium, ggml
#10020 - Feature Request: Support OmniGen (based on phi-3 mini)
Issue - State: open - Opened by Manni1000 16 days ago
Labels: enhancement
#10019 - Server - Sampling bug fix
Pull Request - State: closed - Opened by wwoodsTM 16 days ago
Labels: examples, server
#10018 - server : don't overfill the batch during infill
Pull Request - State: closed - Opened by ggerganov 16 days ago
Labels: examples, server
#10016 - sync : ggml
Pull Request - State: closed - Opened by ggerganov 16 days ago
Labels: testing, Nvidia GPU, ggml
#10015 - llama : switch KQ multiplication to use F32 precision by default
Pull Request - State: closed - Opened by ggerganov 16 days ago
#10013 - llama : Add IBM granite template
Pull Request - State: closed - Opened by arch-btw 16 days ago - 4 comments
Labels: testing
#10011 - Bug: K cache without FA goes NaN on Llama 3.1.
Issue - State: closed - Opened by Nexesenex 16 days ago - 22 comments
Labels: bug-unconfirmed, high severity
#10010 - Extend sgemm.cpp support for Q5_0 models
Pull Request - State: closed - Opened by Srihari-mcw 17 days ago - 2 comments
#10009 - Bug: Why does llama-cli choose a GPU with lower performance?
Issue - State: open - Opened by badog-sing 17 days ago - 5 comments
Labels: bug-unconfirmed, Apple Metal, medium severity
#10005 - llama : enable FA by default and disable it per-layer
Issue - State: open - Opened by ggerganov 17 days ago - 18 comments
Labels: enhancement
#10004 - llama : rename batch.logits to batch.output
Pull Request - State: open - Opened by danbev 17 days ago - 1 comment
Labels: breaking change, android, examples, server
#10002 - Bug: No text response when "--log-disable" is set
Issue - State: open - Opened by jenskastensson 17 days ago - 6 comments
Labels: bug-unconfirmed, high severity
#9995 - llama.vim : add classic vim support
Pull Request - State: closed - Opened by m18coppola 18 days ago - 4 comments
Labels: examples
#9991 - Bug: Unexpected output from Granite 3.0 MoE 1b when all layers on NVIDIA GPU
Issue - State: closed - Opened by gabe-l-hart 18 days ago - 12 comments
Labels: bug-unconfirmed, medium severity
#9988 - Bug: Memory Leak in llama-server after exit
Issue - State: open - Opened by edwin0cheng 18 days ago - 15 comments
Labels: bug-unconfirmed, medium severity
#9978 - Bug: llama-server crash with `--embeddings`
Issue - State: closed - Opened by mokeyish 18 days ago - 13 comments
Labels: bug, critical severity
#9964 - llama.cpp Windows/ROCm builds are broken? Using shared GPU memory instead of dedicated.
Issue - State: open - Opened by SteelPh0enix 19 days ago - 3 comments
#9961 - Feature Request: Convert .devops container images to be RHEL-based UBI images rather than Ubuntu based
Issue - State: open - Opened by ericcurtin 19 days ago - 1 comment
Labels: enhancement
#9953 - Implementations for Q4_0_8_8 quantization based functions - RISC-V vector version
Pull Request - State: closed - Opened by xctan 20 days ago
Labels: ggml
#9944 - Bug: Cannot build with C++ > 20
Issue - State: open - Opened by bdashore3 21 days ago - 2 comments
Labels: bug-unconfirmed, high severity
#9943 - ggml:metal Add POOL2D op and fix IM2COL in Metal backend for running MobileVLM_V2.
Pull Request - State: closed - Opened by junhee-yoo 21 days ago - 1 comment
Labels: testing
#9942 - Add llama_cpp_canister to the README
Pull Request - State: open - Opened by icppWorld 21 days ago
#9939 - [SYCL]fix mul_mat_vec_q error
Pull Request - State: open - Opened by NeoZhangJianyu 21 days ago - 1 comment
Labels: SYCL
#9938 - server: handle n_predict==2 error
Pull Request - State: open - Opened by kylo5aby 21 days ago
Labels: examples, server
#9937 - Bug: Can't build LLAMA_CURL=ON to embed curl on windows x64 build.
Issue - State: open - Opened by AnthonyEmertec 21 days ago
Labels: bug-unconfirmed, high severity
#9935 - loader: use a map to find tensor by name from tensor weight
Pull Request - State: open - Opened by kylo5aby 21 days ago - 1 comment
#9934 - Bug: Got meaningless output when set -j {}.
Issue - State: open - Opened by morgen52 22 days ago - 8 comments
Labels: bug-unconfirmed, high severity
#9933 - Bug: Unexpected output length (Only one token response!) when set configs "-n -2 -c 256" for llama-server
Issue - State: open - Opened by morgen52 22 days ago - 1 comment
Labels: bug, good first issue, low severity
#9932 - Bug: Error when offloading falcon mamba layers on GPU
Issue - State: open - Opened by vineel96 22 days ago - 4 comments
Labels: bug-unconfirmed, low severity
#9931 - LLamaCausalLM add support for tokenizer.json
Pull Request - State: open - Opened by robbiemu 22 days ago
Labels: python
#9930 - ggml : fix possible buffer use after free in sched reserve
Pull Request - State: open - Opened by slaren 22 days ago
Labels: ggml
#9929 - server : add n_indent parameter for line indentation requirement
Pull Request - State: closed - Opened by ggerganov 22 days ago
Labels: examples, server
#9928 - Bug: Occasional crashes when a connection has been interrupted before completion of computation
Issue - State: open - Opened by sliedes 22 days ago - 5 comments
Labels: bug-unconfirmed, high severity
#9927 - Bug: WARNING: The BPE pre-tokenizer was not recognized!
Issue - State: open - Opened by smileyboy2019 22 days ago
Labels: bug-unconfirmed, medium severity
#9925 - Bug: invalid argument: --memory-f32
Issue - State: open - Opened by iperov 22 days ago - 5 comments
Labels: bug-unconfirmed, medium severity
#9924 - llama : infill sampling handle very long tokens
Pull Request - State: closed - Opened by ggerganov 22 days ago
#9922 - sample: maintain token count in penalty sampler context
Pull Request - State: open - Opened by kylo5aby 22 days ago - 1 comment
#9921 - backend cpu: add online flow for aarch64 Q4_0 GEMV/GEMM kernels
Pull Request - State: open - Opened by chaxu01 22 days ago - 3 comments
Labels: examples, ggml
#9918 - Add SwiftLlama to the Bindings list
Pull Request - State: closed - Opened by ShenghaiWang 23 days ago
#9917 - fix: allocating CPU buffer with size `0`
Pull Request - State: closed - Opened by giladgd 23 days ago
Labels: ggml
#9916 - consolidated.safetensors
Pull Request - State: open - Opened by CrispStrobe 23 days ago
Labels: python
#9914 - Feature Request: Support for Ministral-8B-Instruct-2410
Issue - State: open - Opened by arch-btw 23 days ago - 11 comments
Labels: enhancement
#9913 - Bug: Failing to build using cmake on tag b3912
Issue - State: open - Opened by Martin-HZK 23 days ago - 3 comments
Labels: bug-unconfirmed, medium severity