EleutherAI/lm-evaluation-harness issues and pull requests

#2246 - Implemented Polyglotoxicityprompts

Pull Request - State: closed - Opened by jjbuschhoff about 1 month ago - 1 comment

#2245 - fix group args of mmlu and mmlu_pro

Pull Request - State: closed - Opened by eyuansu62 about 1 month ago - 1 comment

#2244 - Fix typos in multiple places

Pull Request - State: closed - Opened by LSinev about 1 month ago - 1 comment

#2243 - Multimodal prototyping

Pull Request - State: closed - Opened by lintangsutawika about 1 month ago

#2242 - OpenAICompletionsAPI does not implement the chat_template function, causing a TypeError when evaluating.

Issue - State: closed - Opened by yonatano about 1 month ago - 1 comment
Labels: bug

#2241 - fix mmlu_pro typo

Pull Request - State: closed - Opened by baberabb about 1 month ago

#2240 - IndexError: index out of bounds for multiple tasks of mmlu_pro

Issue - State: closed - Opened by lxning about 1 month ago

#2239 - Fix logging when resizing embedding layer in peft mode

Pull Request - State: closed - Opened by WPoelman about 1 month ago - 2 comments

#2238 - fix the regex string in mmlu_pro template

Pull Request - State: closed - Opened by lxning about 1 month ago - 1 comment

#2237 - mmlu_pro regex in template does not work

Issue - State: closed - Opened by lxning about 1 month ago - 7 comments

#2236 - Created new task for testing Llama on Asdiv

Pull Request - State: closed - Opened by Cameron7195 about 1 month ago - 2 comments

#2235 - default chat template method fix

Pull Request - State: closed - Opened by KonradSzafer about 1 month ago - 6 comments

#2234 - Cannot load local `mmlu` dataset

Issue - State: closed - Opened by AIR-hl about 1 month ago - 2 comments

#2233 - TODOs for Implementing LLM-as-a-Judge in Eval-Harness (Work in Progress)

Issue - State: open - Opened by SeungoneKim about 1 month ago

#2232 - Add Open Arabic LLM Leaderboard Benchmarks (Full and Light Version)

Pull Request - State: closed - Opened by Malikeh97 about 1 month ago - 9 comments

#2231 - apply_chat_template got 'str' object is not callable

Issue - State: closed - Opened by lxning about 1 month ago - 15 comments

#2230 - Update mmmu

Pull Request - State: closed - Opened by lintangsutawika about 1 month ago

#2229 - Update CODEOWNERS

Pull Request - State: closed - Opened by haileyschoelkopf about 1 month ago - 1 comment

#2228 - Lingoly README update

Pull Request - State: closed - Opened by am-bean about 1 month ago - 1 comment

#2227 - Fix Zeno Visualizer

Pull Request - State: closed - Opened by namtranase about 1 month ago - 2 comments

#2226 - Can I incorporate my own benchmark as a new task within lm-evaluation-harness?

Issue - State: closed - Opened by lioooncoder about 1 month ago - 2 comments

#2225 - Why no results for closed-sourced models?

Issue - State: closed - Opened by mrconter1 about 1 month ago - 1 comment

#2224 - Fix the zeno_visualizer.py to work with Phi3 Model

Issue - State: closed - Opened by namtranase about 1 month ago - 1 comment

#2223 - add dolomite-engine support for lm-eval-harness

Pull Request - State: closed - Opened by mayank31398 about 1 month ago - 3 comments

#2222 - trust_remote_code error in simple evaluate for hellaswag

Issue - State: closed - Opened by sidhantls about 1 month ago - 1 comment

#2221 - test_version_stable.py does not exist?

Issue - State: open - Opened by dorsa-zeinali about 1 month ago - 1 comment

#2220 - Classification into categories with generate-until and metrics

Issue - State: closed - Opened by DavidAdamczyk about 1 month ago - 1 comment
Labels: asking questions

#2219 - fix the leaderboard doc to reflect the tasks

Pull Request - State: closed - Opened by NathanHB about 1 month ago

#2218 - Update IFEval dataset to official one

Pull Request - State: closed - Opened by lewtun about 1 month ago - 1 comment

#2217 - Add GPTQModel support for inferencing GPTQ models

Pull Request - State: open - Opened by Qubitium about 1 month ago - 1 comment

#2216 - Update yaml to adapt to belebele dataset changes

Pull Request - State: closed - Opened by Uminosachi about 1 month ago - 1 comment

#2215 - Created a new task for gsm8k which corresponds to the Llama cot settings…

Pull Request - State: closed - Opened by Cameron7195 about 1 month ago - 3 comments

#2214 - Extremely SLOW, even slower than training

Issue - State: closed - Opened by AaronZLT about 2 months ago - 2 comments
Labels: asking questions

#2213 - Invalid response for loglikelihood for GGUF model

Issue - State: open - Opened by jaslatendresse about 2 months ago - 1 comment

#2212 - MATH is_equiv() function can not parser some ground truth

Issue - State: open - Opened by wukaixingxp about 2 months ago

#2211 - The results in the new and old versions differ from one another.

Issue - State: closed - Opened by ShahadSZ about 2 months ago - 3 comments

#2210 - nltk pickle

Issue - State: closed - Opened by bezir about 2 months ago - 1 comment

#2209 - add option for custom aggregation

Pull Request - State: open - Opened by lintangsutawika about 2 months ago - 2 comments

#2208 - Add KoCommonGEN v2 benchmark

Pull Request - State: open - Opened by metterian about 2 months ago - 2 comments

#2207 - CoverBench

Pull Request - State: open - Opened by ysjprojects about 2 months ago - 1 comment

#2206 - Update README.md

Pull Request - State: closed - Opened by ysjprojects about 2 months ago

#2205 - Error of `continuation_logprobs_dicts` is `None` when running with `vllm` on multi-choice tasks

Issue - State: closed - Opened by tongyx361 about 2 months ago - 1 comment

#2204 - Missing Tasks in Leaderboard

Issue - State: closed - Opened by xiaoyuxin1002 about 2 months ago - 2 comments

#2203 - Logging

Pull Request - State: open - Opened by lintangsutawika about 2 months ago

#2202 - [rank1]: huggingface_hub.utils._errors.HfHubHTTPError: 429 Client Error: Too Many Requests for url:

Issue - State: open - Opened by kmehant about 2 months ago
Labels: bug

#2201 - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb5 in position 1: invalid start byte

Issue - State: open - Opened by YcChou about 2 months ago - 3 comments

#2200 - Question about IFEval on LeaderBoard

Issue - State: closed - Opened by matouk98 about 2 months ago - 1 comment
Labels: asking questions

#2199 - AttributeError: module 'torch' has no attribute 'uint16'

Issue - State: closed - Opened by Shinning-Zhou about 2 months ago - 3 comments

#2198 - New task: Lingoly

Pull Request - State: closed - Opened by am-bean about 2 months ago - 8 comments

#2197 - AttributeError: module 'lm_eval.tasks' has no attribute 'ALL_TASKS'

Issue - State: closed - Opened by Shinning-Zhou about 2 months ago - 2 comments

#2196 - mmlu_pro fewshot_config

Issue - State: closed - Opened by liewziqin about 2 months ago - 1 comment

#2195 - Feature request: `4.4.0` Pypi release with `leaderboard`

Issue - State: closed - Opened by younesbelkada about 2 months ago - 2 comments

#2194 - It's hard to look at the code of this repository. Garbage.

Issue - State: closed - Opened by Runningwater2357 about 2 months ago - 1 comment

#2193 - How to evaluate very large models (>= 70b) ?

Issue - State: closed - Opened by eldarkurtic about 2 months ago - 2 comments

#2192 - CoQA evaluation

Issue - State: open - Opened by liewziqin about 2 months ago

#2191 - gsm_plus minor fix

Pull Request - State: closed - Opened by ysjprojects about 2 months ago

#2190 - Adding MoralChoice as a benchmark

Issue - State: open - Opened by notrichardren about 2 months ago

#2189 - Adding RMS Calibration Error as a Metric

Issue - State: open - Opened by notrichardren about 2 months ago

#2188 - Adding AdvGLUE as an evaluation

Issue - State: open - Opened by notrichardren about 2 months ago

#2187 - Fix `loglikelihood_rolling` caching ( #1821 )

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago

#2186 - Small README tweaks

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago

#2185 - OOM for gpqa on A100

Issue - State: closed - Opened by ZhichaoWang970201 about 2 months ago - 5 comments

#2184 - Fix `revision` kwarg dtype in edge-cases

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago

#2183 - Better Arg Typechecking / Validation

Issue - State: open - Opened by haileyschoelkopf about 2 months ago - 2 comments
Labels: help wanted, feature request, good first issue

#2182 - Issue with `state-spaces/transformerpp-2.7b` when generating

Issue - State: closed - Opened by jhuang265 about 2 months ago - 3 comments

#2181 - IrokoBench: Fix incorrect group assignments

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago

#2180 - Add long context evaluation benchmarks such as LongBench and LEval.

Issue - State: open - Opened by txchen-USTC about 2 months ago - 2 comments
Labels: help wanted, feature request

#2179 - Support for multi-turn conversation benchmark

Issue - State: closed - Opened by eyuansu62 about 2 months ago - 2 comments

#2178 - TypeError: argument 'ids': 'NoneType' object cannot be converted to 'Sequence'

Issue - State: open - Opened by Maryam142 about 2 months ago - 3 comments

#2177 - when executes the OPT 6.7B model evaluation, the problem TypeError: 'NoneType' object is not iterable occur

Issue - State: open - Opened by yanchenmochen about 2 months ago - 10 comments

#2176 - add roadmap document

Pull Request - State: open - Opened by lintangsutawika about 2 months ago - 4 comments

#2175 - call model instead of forward

Issue - State: closed - Opened by ad8e about 2 months ago - 5 comments

#2174 - [hotfix] API: messages were created twice

Pull Request - State: closed - Opened by baberabb about 2 months ago

#2173 - Test BBH using local dataset

Issue - State: closed - Opened by shuoYan97 about 2 months ago - 4 comments

#2172 - Merge our most recent Multimodal commits

Pull Request - State: closed - Opened by haileyschoelkopf about 2 months ago

#2171 - local-completion: Passing Messages as List to Prompt Instead of Required String Format

Issue - State: open - Opened by ankush13r about 2 months ago - 1 comment

#2170 - How to calculate the "token_perplexity"

Issue - State: open - Opened by nongfang55 about 2 months ago - 1 comment

#2169 - fix typo.

Pull Request - State: closed - Opened by kargaranamir about 2 months ago

#2168 - add okapi machine translated notice.

Pull Request - State: closed - Opened by kargaranamir about 2 months ago - 1 comment

#2167 - [BUG] Huggingface and Neuron runs fail if model revision is an integer

Issue - State: closed - Opened by christyler3030 about 2 months ago - 1 comment

#2166 - Can't use OpenAI Chat Completion API call with simplest tasks

Issue - State: open - Opened by Some-random about 2 months ago - 4 comments

#2165 - flan_held_in task is broken starting with commit #1741

Issue - State: open - Opened by m-resta 2 months ago

#2164 - After executing tmmluplus, there are no group scores displayed, only the scores for each individual task are shown.

Issue - State: closed - Opened by zhuangyuan123 2 months ago - 5 comments

#2163 - mgsm task not working

Issue - State: open - Opened by adiprasad 2 months ago

#2162 - update evaluations

Pull Request - State: closed - Opened by ouhenio 2 months ago - 1 comment

#2161 - Premature `num_fewshot` check with `fewshot_as_multiturn`

Issue - State: open - Opened by baberabb 2 months ago - 1 comment

#2160 - Better tests for API models

Issue - State: open - Opened by baberabb 2 months ago

#2159 - GPT2 eval in lambada_openai, acc only 0.325

Issue - State: open - Opened by KeyKy 2 months ago - 1 comment

#2158 - Running multiple evaluations on the same task

Issue - State: open - Opened by MihaiMasala 2 months ago - 3 comments

#2157 - Add new benchmark: Spanish bench

Pull Request - State: open - Opened by zxcvuser 2 months ago - 3 comments

#2156 - Add new benchmark: Portuguese bench

Pull Request - State: open - Opened by zxcvuser 2 months ago - 1 comment

#2155 - Add new benchmark: Galician bench

Pull Request - State: open - Opened by zxcvuser 2 months ago - 1 comment

#2154 - Add new benchmark: Catalan bench

Pull Request - State: open - Opened by zxcvuser 2 months ago - 1 comment

#2153 - Add new benchmark: Basque bench

Pull Request - State: open - Opened by zxcvuser 2 months ago - 2 comments

#2152 - Bug about the output information.

Issue - State: open - Opened by eyuansu62 2 months ago

#2151 - When using the tmmluplus dataset, I'm encountering a TypeError: __init__() got an unexpected keyword argument 'group_alias'.

Issue - State: closed - Opened by zhuangyuan123 2 months ago - 1 comment

#2150 - Errors when trying to evaluate local T-MAC gguf model

Issue - State: open - Opened by prsmendonca 2 months ago

#2149 - [Bugfix] add temperature=0 to logprobs and seed args to API models

Pull Request - State: closed - Opened by baberabb 2 months ago - 2 comments

#2148 - Random results when parallelize=True

Issue - State: closed - Opened by Ofir408 2 months ago - 2 comments

#2147 - Bug in the `visualize-wandb.ipynb` example

Issue - State: open - Opened by tanaymeh 2 months ago