EleutherAI/lm-evaluation-harness issues and pull requests

#1842 - How to use Zeno

Issue - State: open - Opened by DavidAdamczyk about 2 months ago - 1 comment

#1841 - Inconsistent evaluation results with Chat Template

Issue - State: open - Opened by shiweijiezero about 2 months ago - 3 comments

#1840 - Evaluate encoder-decoder-models

Issue - State: closed - Opened by Bachstelze about 2 months ago - 1 comment

#1839 - AssertionError: aggregation named 'mean' conflicts with existing registered aggregation!

Issue - State: open - Opened by hunter2009pf about 2 months ago

#1838 - Fix links in README guiding to another branch

Pull Request - State: closed - Opened by LSinev about 2 months ago - 1 comment

#1837 - Bug: wrong `until` default value for chat based model

Issue - State: open - Opened by YilunZhou about 2 months ago - 2 comments

#1836 - sha256 for datasets or samples

Issue - State: closed - Opened by artemorloff about 2 months ago - 1 comment

#1835 - MPS backend out of memory evaluating fine-tuned Mixtral-8x7B-Instruct-v0.1 on a machine with 100+ GB

Issue - State: closed - Opened by chimezie about 2 months ago - 2 comments

#1834 - Afrimmlu

Pull Request - State: closed - Opened by IsraelAbebe about 2 months ago - 1 comment

#1833 - Evaluation results of llama2 with lm-evaluation-harness using wikitext-2

Issue - State: open - Opened by l2002924700 about 2 months ago - 1 comment

#1832 - Adding LLaVa support

Pull Request - State: open - Opened by ashvinnihalani about 2 months ago - 5 comments

#1831 - Using Language Models as Evaluators

Issue - State: open - Opened by lintangsutawika about 2 months ago - 5 comments
Labels: feature request

#1830 - Errors when loading exact_match.py

Issue - State: open - Opened by twxin about 2 months ago - 2 comments

#1829 - eval gsm8k from local dataset folder with the bug info "ValueError: BuilderConfig 'main' not found."

Issue - State: open - Opened by Jp-17 about 2 months ago

#1828 - Fix: support PEFT/LoRA with added tokens

Pull Request - State: closed - Opened by mapmeld about 2 months ago

#1827 - Add More Tests

Issue - State: open - Opened by haileyschoelkopf about 2 months ago
Labels: feature request

#1826 - I get this error whenever I try to run an eval: ImportError: cannot import name 'HfApi' from 'huggingface_hub'

Issue - State: closed - Opened by menhguin about 2 months ago - 15 comments

#1825 - Afrixnli Updates

Pull Request - State: closed - Opened by JessicaOjo about 2 months ago - 1 comment

#1824 - Avoid slow testing due to network issues.

Issue - State: open - Opened by pixeli99 about 2 months ago - 2 comments

#1823 - Getting error on lm-evaluation for merged models deployed on HF

Issue - State: closed - Opened by tolgakurtuluss about 2 months ago - 3 comments

#1822 - The input format for XNLI seems wired?

Issue - State: open - Opened by SefaZeng about 2 months ago - 2 comments

#1821 - TypeError: 'NoneType' object is not iterable when using cache and loglikelihood_rolling

Issue - State: open - Opened by mdocekal about 2 months ago

#1820 - SyntaxError when import lm_eval

Issue - State: closed - Opened by mxjmtxrm about 2 months ago - 3 comments

#1819 - Error when evaluating math.

Issue - State: closed - Opened by SefaZeng about 2 months ago - 2 comments

#1818 - when MMLU eval, num_few_shot=5, more GPU overhead

Issue - State: closed - Opened by chunniunai220ml about 2 months ago - 1 comment

#1817 - Task description newline characters removed by Jinja templating, affecting generated requests and performance

Issue - State: open - Opened by ma0li about 2 months ago - 1 comment

#1816 - Multi-round evaluation for chat models

Issue - State: open - Opened by YilunZhou about 2 months ago - 1 comment

#1815 - Financial PhraseBank (FPB) Eval Metric

Pull Request - State: open - Opened by bcicc about 2 months ago

#1814 - Multi Label Classification

Issue - State: open - Opened by IsraelAbebe about 2 months ago

#1813 - how to evaluate on boolq? incorrect results

Issue - State: closed - Opened by sidhantls about 2 months ago - 4 comments

#1812 - Support Mamba based models for evaluation tasks

Issue - State: closed - Opened by NamburiSrinath about 2 months ago - 1 comment

#1811 - Out-Of-Memory Error for same batch size but different dataset

Issue - State: closed - Opened by richardzhuang0412 about 2 months ago - 7 comments

#1810 - Fix cost_estimate.py

Pull Request - State: open - Opened by xksteven about 2 months ago - 5 comments

#1809 - how to run all the bigbench tasks at once?

Issue - State: open - Opened by kbmlcoding about 2 months ago

#1808 - Gemini 1.5/Ultra support

Issue - State: open - Opened by notrichardren about 2 months ago

#1807 - interface doc update

Pull Request - State: closed - Opened by KonradSzafer about 2 months ago - 2 comments

#1806 - Update flag `--hf_hub_log_args` in interface documentation

Pull Request - State: closed - Opened by sepiatone about 2 months ago

#1805 - Is caching large evaluation dataset like MMLU supported?

Issue - State: closed - Opened by richardzhuang0412 about 2 months ago - 1 comment

#1804 - How to evaluate a large model like llama-65B?

Issue - State: closed - Opened by fayuge about 2 months ago - 2 comments

#1803 - Copal task

Pull Request - State: closed - Opened by Erland366 about 2 months ago - 3 comments

#1802 - Hugging Face: Open LLM Leaderboard: how do I reproduce results for details_gpt2 repository

Issue - State: closed - Opened by CoconutJJ about 2 months ago - 2 comments

#1801 - Exclude all current tasks

Issue - State: closed - Opened by YilunZhou about 2 months ago - 2 comments
Labels: feature request

#1800 - Fix `--gen_kwargs` and VLLM (`temperature` not respected)

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago
Labels: bug

#1799 - llama3 baseline reproduction problem

Issue - State: closed - Opened by fmm170 about 2 months ago - 5 comments
Labels: asking questions

#1798 - link to the example output on the hub

Pull Request - State: closed - Opened by KonradSzafer about 2 months ago

#1797 - Add NPU support for huggingface.py

Issue - State: closed - Opened by jiaqiw09 about 2 months ago - 2 comments

#1796 - Make `scripts.write_out` error out when no splits match

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago

#1795 - Math or minerva_math not generating any samples via scripts.write_out

Issue - State: closed - Opened by xksteven about 2 months ago - 1 comment

#1794 - Vllm get tokenizer

Pull Request - State: open - Opened by AguirreNicolas about 2 months ago - 1 comment

#1793 - Re-add Hendrycks MATH (no sympy checking, no Minerva hardcoded prompt) variant

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago - 1 comment

#1792 - Update `--tasks list` option in interface documentation

Pull Request - State: closed - Opened by sepiatone about 2 months ago - 1 comment

#1791 - Logging Updates (Alphabetize table printouts, fix eval tracker bug) (#1774)

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago
Labels: bug, feature request

#1790 - Fix `batch_size=auto` for HF Seq2Seq models (#1765)

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago
Labels: bug

#1789 - Fix for bootstrap_iters = 0 case (#1715)

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago - 1 comment
Labels: bug

#1788 - Support loading slices of a split from a dataset

Issue - State: open - Opened by alexrs about 2 months ago

#1787 - add NPU support for huggingface.py

Pull Request - State: closed - Opened by jiaqiw09 about 2 months ago - 4 comments

#1786 - Add Ascend NPU for huggingface.py

Pull Request - State: closed - Opened by jiaqiw09 about 2 months ago - 1 comment

#1785 - limit fix

Pull Request - State: closed - Opened by KonradSzafer about 2 months ago - 2 comments

#1784 - Fix bug in setting until kwarg in openai completions

Pull Request - State: closed - Opened by ciaranby about 2 months ago

#1783 - openai.InternalServerError: the model generated invalid Unicode output

Issue - State: open - Opened by djstrong about 2 months ago

#1782 - Error when limit is not specified (possibly issue with requirements?)

Issue - State: closed - Opened by hammoudhasan about 2 months ago - 2 comments

#1781 - Data preprocess is slow for mmlu

Issue - State: closed - Opened by ThisisBillhe about 2 months ago - 1 comment
Labels: asking questions

#1780 - fix limit bug when limit is None

Pull Request - State: closed - Opened by djstrong about 2 months ago - 3 comments

#1779 - remove echo parameter in OpenAI completions API

Pull Request - State: closed - Opened by djstrong about 2 months ago - 1 comment

#1778 - error in eval-tracker : 'Namespace' object has no attribute 'push_results_to_hub'

Issue - State: closed - Opened by abgoswam about 2 months ago - 1 comment

#1777 - eval tracker args fix

Pull Request - State: closed - Opened by KonradSzafer about 2 months ago

#1776 - Fix README: change`----hf_hub_log_args` to `--hf_hub_log_args`

Pull Request - State: closed - Opened by MuhammadBinUsman03 about 2 months ago - 1 comment

#1775 - Fix Caching Tests ; Remove `pretrained=gpt2` default

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago - 1 comment

#1774 - Sorting task output alphabetically

Issue - State: closed - Opened by ad8e about 2 months ago - 2 comments

#1773 - Adding some tasks

Pull Request - State: closed - Opened by clefourrier about 2 months ago - 1 comment

#1772 - How to filter to see only generate_until: lm-eval --tasks list

Issue - State: open - Opened by chigkim about 2 months ago

#1771 - Same results - different models

Issue - State: closed - Opened by aleksoren about 2 months ago - 4 comments

#1770 - Support OpenAI's Batch API

Issue - State: open - Opened by djstrong about 2 months ago - 1 comment

#1769 - remove duplicated `num_fewshot: 0`

Pull Request - State: closed - Opened by chujiezheng 2 months ago - 1 comment

#1768 - IndexError: list index out of range when running benchmark on gguf model

Issue - State: open - Opened by fherrmannsdoerfer 2 months ago - 2 comments

#1767 - Cannot have both a group list and task list

Issue - State: open - Opened by steven-basart 2 months ago - 5 comments
Labels: bug, asking questions

#1766 - evaluation tracker implementation

Pull Request - State: closed - Opened by KonradSzafer 2 months ago - 11 comments

#1765 - Seq2Seq Models with Batch Size `auto`

Issue - State: closed - Opened by KurtMica 2 months ago

#1764 - New commits for final PR + Edit to lm-eval-overview Notebook

Pull Request - State: closed - Opened by marilevay 2 months ago - 2 comments

#1763 - Include inference time in results

Pull Request - State: closed - Opened by giorgossideris 2 months ago - 4 comments

#1762 - Bug in yaml parsing

Issue - State: open - Opened by jordane95 2 months ago

#1761 - Does this support the model to use generate functions to eval not likelihood?

Issue - State: open - Opened by Juhywcy 2 months ago

#1760 - Fix m_arc choices

Pull Request - State: closed - Opened by jordane95 2 months ago - 2 comments

#1759 - Output constrained support

Issue - State: open - Opened by Mihaiii 2 months ago

#1758 - Pile 10k new task

Pull Request - State: closed - Opened by mukobi 2 months ago - 3 comments

#1757 - HellaSwag with UnicodeDecodeError

Issue - State: open - Opened by Hua-rookie 2 months ago - 13 comments

#1756 - vllm lora support

Pull Request - State: closed - Opened by bcicc 2 months ago - 1 comment

#1755 - No inference time is returned in results

Issue - State: closed - Opened by giorgossideris 2 months ago - 3 comments

#1754 - New Task Request: LegalBench

Issue - State: open - Opened by haileyschoelkopf 2 months ago - 2 comments
Labels: help wanted, feature request, good first issue

#1753 - Create task `dharma2` - a small (300 qs) & wide (many topics) dataset

Pull Request - State: open - Opened by UmerHA 2 months ago - 6 comments

#1752 - Pytorch profiling Error In Megatron-DeepSpeed/tasks/eval_harness/evaluate.py

Issue - State: closed - Opened by jrt-20 2 months ago - 1 comment

#1751 - Accuracy gap between single GPU and multiple GPUs

Issue - State: open - Opened by HsuWanTing 2 months ago - 3 comments

#1750 - Add filter registry decorator

Pull Request - State: closed - Opened by lozhn 2 months ago - 2 comments

#1749 - Fix Parameter Propagation for Tasks that have `include`

Pull Request - State: closed - Opened by lintangsutawika 2 months ago

#1748 - Add tasks for performance on long context lengths

Issue - State: open - Opened by nairbv 2 months ago - 1 comment
Labels: feature request

#1747 - [Feature Request] Metrics that require knowledge of input.

Issue - State: open - Opened by ciaranby 2 months ago

#1746 - [Feature Request] pre-built Docker image support

Issue - State: open - Opened by zsaladin 2 months ago - 3 comments
Labels: help wanted, feature request
