Ecosyste.ms: Issues
An open API service providing issue and pull request metadata for open source projects.
GitHub / qwopqwop200/GPTQ-for-LLaMa issues and pull requests
#285 - Syntax changed in triton.testing.do_bench() causing error when running llama_inference.py
Issue - State: open - Opened by prasanna 11 months ago - 1 comment
#101 - OPT Models Fail to Pack During Quantization
Issue - State: closed - Opened by official-elinas over 1 year ago - 3 comments
#100 - TypeError: Offload_LlamaModel.forward() got an unexpected keyword argument 'position_ids' after today's commits
Issue - State: closed - Opened by ye7iaserag over 1 year ago - 3 comments
#99 - Port left-padding fix to Offload_LlamaModel
Pull Request - State: closed - Opened by arctic-marmoset over 1 year ago
#98 - AMD Installation broke about 1 week ago
Issue - State: closed - Opened by YellowRoseCx over 1 year ago - 12 comments
#97 - KeyError on conversion attempt.
Issue - State: closed - Opened by GamerUntouch over 1 year ago - 4 comments
#96 - Much slower speed with the latest updates
Issue - State: closed - Opened by oobabooga over 1 year ago - 13 comments
#95 - Head is currently broken
Issue - State: closed - Opened by Qubitium over 1 year ago - 5 comments
#94 - Use multiple gpu for quantization?
Issue - State: closed - Opened by sgsdxzy over 1 year ago - 1 comment
#93 - llama_inference_offload.py hardcoded to cuda:0
Issue - State: closed - Opened by generic-username0718 over 1 year ago - 1 comment
#92 - Name format mixup - LLaMA has a capital A
Issue - State: closed - Opened by mcmonkey4eva over 1 year ago - 1 comment
#91 - 8-bit quantization succeeds with great benchmarks, but inference produces garbage past 128 tokens
Issue - State: closed - Opened by QM60 over 1 year ago - 2 comments
#90 - Make sure to save the quantized model before bench/eval
Pull Request - State: closed - Opened by Qubitium over 1 year ago - 1 comment
#89 - Support for corrected Llama
Issue - State: closed - Opened by gante over 1 year ago - 3 comments
#88 - Error while running python setup_cuda.py install
Issue - State: closed - Opened by mahxds over 1 year ago - 14 comments
#87 - Implement fallback to PyTorch matmul on large input sizes
Pull Request - State: closed - Opened by MasterTaffer over 1 year ago - 3 comments
#86 - high VRAM Memory Allocation while evaluation
Issue - State: closed - Opened by zaziki23 over 1 year ago - 3 comments
#85 - Windows 11 / WSL2 / Ubuntu / cuda 11.7 / RTX 3070 - IndexError: list index out of range (+PTX)
Issue - State: closed - Opened by glarsson over 1 year ago - 1 comment
#84 - Is CUDA_VISIBLE_DEVICES necessary?
Issue - State: closed - Opened by neuhaus over 1 year ago - 1 comment
#83 - TypeError: expected string or bytes-like object
Issue - State: closed - Opened by ghost over 1 year ago - 1 comment
#82 - 4-bit is 10x slower compared to fp16 LLaMa
Issue - State: closed - Opened by fpgaminer over 1 year ago - 27 comments
#81 - Make installable with pip
Pull Request - State: closed - Opened by sterlind over 1 year ago - 1 comment
#80 - args is not defined after adding faster_kernel
Issue - State: closed - Opened by ye7iaserag over 1 year ago - 2 comments
#79 - Run with docker
Pull Request - State: closed - Opened by JamesDConley over 1 year ago - 1 comment
#78 - I can not reproduce 7b 6.09 Wiki2 PPL.
Issue - State: closed - Opened by USBhost over 1 year ago - 14 comments
#77 - Inference with 4bit is slow than fp32
Issue - State: closed - Opened by heya5 over 1 year ago - 2 comments
#76 - New: ~8% faster llama inference.
Pull Request - State: closed - Opened by aljungberg over 1 year ago - 1 comment
#75 - GPTQ Collaboration?
Issue - State: closed - Opened by dalistarh over 1 year ago - 4 comments
#74 - Installing cuda, cannot find ninja + cannot find file.
Issue - State: closed - Opened by jonplumb42 over 1 year ago - 1 comment
#73 - Inference using CPU
Issue - State: closed - Opened by lodorg over 1 year ago - 3 comments
#72 - error on amd gpu when starting setup_cuda
Issue - State: closed - Opened by maxime-fleury over 1 year ago - 1 comment
#71 - Error allocating RAM
Issue - State: closed - Opened by PeterDaGrape over 1 year ago - 3 comments
#70 - Running on CPU
Issue - State: closed - Opened by mayaeary over 1 year ago - 5 comments
#69 - TypeError: load_quant() missing 1 required positional argument: 'groupsize'
Issue - State: closed - Opened by matbee-eth over 1 year ago - 7 comments
#68 - The detected CUDA version (12.0) mismatches the version that was used to compile PyTorch (11.7)
Issue - State: closed - Opened by ThatCoffeeGuy over 1 year ago - 9 comments
#67 - adding ipynb fle for building on colab
Pull Request - State: closed - Opened by guccialex over 1 year ago - 3 comments
#66 - Is compute time expected to go up linearly with batch size?
Issue - State: closed - Opened by zphang over 1 year ago - 1 comment
#65 - Issue loading tokenizer when using local models
Issue - State: closed - Opened by iamlemec over 1 year ago - 1 comment
#64 - Having trouble using saved models
Issue - State: closed - Opened by dnhkng over 1 year ago - 6 comments
#63 - Extraneous data point
Issue - State: closed - Opened by philipturner over 1 year ago - 3 comments
#62 - opt.py python SyntaxError?
Issue - State: closed - Opened by alexl83 over 1 year ago - 3 comments
#61 - Issues with cuda setup
Issue - State: closed - Opened by IridiumMaster over 1 year ago - 1 comment
#60 - potential Mistakes in the test data selection for perplexity evaluation
Issue - State: closed - Opened by Green-Sky over 1 year ago - 2 comments
#59 - Error when installing cuda kernel
Issue - State: closed - Opened by plhosk over 1 year ago - 5 comments
#58 - Add support for devices with compute capability < 6.0
Pull Request - State: closed - Opened by tobbez over 1 year ago - 2 comments
#57 - How to fine-tune the 4-bit model?
Issue - State: closed - Opened by zsun227 over 1 year ago - 10 comments
#56 - Quantising on multiple GPU?
Issue - State: closed - Opened by dnhkng over 1 year ago - 1 comment
#55 - GPTQ+flexgen, is it possible?
Issue - State: closed - Opened by ye7iaserag over 1 year ago - 5 comments
#54 - Revert "Use the main transformers library, rename LLaMA to Llama"
Pull Request - State: closed - Opened by qwopqwop200 over 1 year ago - 4 comments
#53 - Revert "Use the main transformers library, rename LLaMA to Llama"
Pull Request - State: closed - Opened by qwopqwop200 over 1 year ago
#52 - Use the main transformers library, rename LLaMA to Llama
Pull Request - State: closed - Opened by oobabooga over 1 year ago
#51 - What would be required to quantize 65B model to 2-bit?
Issue - State: closed - Opened by Alcyon6 over 1 year ago - 2 comments
#50 - Will loras work with this?
Issue - State: closed - Opened by fblissjr over 1 year ago - 3 comments
#49 - Add alternative installation
Pull Request - State: closed - Opened by musabgultekin over 1 year ago - 1 comment
#48 - llama_inference RuntimeError: Internal: src/sentencepiece_processor.cc
Issue - State: closed - Opened by youkpan over 1 year ago - 1 comment
#47 - Problem with setup_cuda.py install
Issue - State: closed - Opened by farrael004 over 1 year ago - 14 comments
#46 - Quantizing GALACTICA?
Issue - State: closed - Opened by oobabooga over 1 year ago - 13 comments
#45 - [Request] Mixed Precission Quantization
Issue - State: closed - Opened by elephantpanda over 1 year ago - 7 comments
#44 - Nvcc fatal : Unsupported gpu architecture 'compute_86'
Issue - State: closed - Opened by DamonianoStudios over 1 year ago - 6 comments
#43 - RuntimeError: Tensors must have same number of dimensions: got 3 and 4
Issue - State: closed - Opened by enn-nafnlaus over 1 year ago - 4 comments
#42 - GPTQ C++ Implementation Question
Issue - State: closed - Opened by MarkSchmidty over 1 year ago - 1 comment
#41 - Bad results for WinoGrande - more testers searched
Issue - State: closed - Opened by DanielWe2 over 1 year ago - 1 comment
#40 - Script to execute Winogrande test
Pull Request - State: closed - Opened by DanielWe2 over 1 year ago - 2 comments
#39 - Bad performance of OPT models
Issue - State: closed - Opened by Zerogoki00 over 1 year ago - 2 comments
#38 - cuda extension problem
Issue - State: closed - Opened by WuNein over 1 year ago - 6 comments
#37 - Support other models?
Issue - State: closed - Opened by Ph0rk0z over 1 year ago - 2 comments
#36 - probability tensor contains either `inf`, `nan` or element < 0
Issue - State: closed - Opened by Minami-su over 1 year ago - 1 comment
#35 - How to convert the official ckp to fit your repo
Issue - State: closed - Opened by merlinarer over 1 year ago - 1 comment
#34 - Lonnnnnnnnng context load time before generation
Issue - State: closed - Opened by generic-username0718 over 1 year ago - 7 comments
#33 - The current installed version of g++ is greater than the maximum required version by CUDA
Issue - State: closed - Opened by frandmb over 1 year ago - 7 comments
#32 - converting local hf model with llama.py
Issue - State: closed - Opened by alexl83 over 1 year ago - 3 comments
#31 - PosixPath object has no attribute endswith Win11 WSL2
Issue - State: closed - Opened by rossbishop over 1 year ago - 5 comments
#30 - 4-bit llama gets progressively slower with each text generation
Issue - State: closed - Opened by 1aienthusiast over 1 year ago - 11 comments
#29 - 4-bit llama gets progressively slower with each generation
Issue - State: closed - Opened by 1aienthusiast over 1 year ago
#27 - Quantization produces non-deterministic weights
Issue - State: closed - Opened by MarkSchmidty over 1 year ago - 3 comments
#26 - Windows build fails with unresolved symbols
Issue - State: closed - Opened by powderluv over 1 year ago - 1 comment