huggingface/tokenizers issues and pull requests

#1426 - Profile-Guided Optimization (PGO) benchmark results

Issue - State: closed - Opened by zamazan4ik 9 months ago - 4 comments
Labels: Stale

#1425 - "make bench" command does not download all required resources

Issue - State: closed - Opened by zamazan4ik 9 months ago

#1424 - Decoding Issue for Latin Characters in `added_tokens`

Issue - State: closed - Opened by 44670 9 months ago - 2 comments
Labels: Stale

#1423 - Possible bug in case of prepending chars in a pretokenizer

Issue - State: closed - Opened by ivankrylatskoe 9 months ago - 9 comments
Labels: Stale

#1422 - loading `added_tokens.json`

Issue - State: closed - Opened by kczimm 9 months ago - 3 comments

#1421 - Memory Leak in encode_batch Function

Issue - State: closed - Opened by Atakey 9 months ago - 5 comments
Labels: Stale

#1420 - Add quick doc to byte_level.rs

Pull Request - State: closed - Opened by steventrouble 9 months ago - 1 comment

#1419 - add option to skip special tokens

Pull Request - State: closed - Opened by ArthurZucker 10 months ago - 5 comments

#1418 - Unsupported platform for tokenizers

Issue - State: closed - Opened by KolbySisk 10 months ago - 2 comments
Labels: Stale

#1417 - Questions re: Tokenizer pipeline composability

Issue - State: closed - Opened by ahgraber 10 months ago - 2 comments

#1416 - ModuleNotFoundError: No module named 'tokenizers.tokenizers'

Issue - State: closed - Opened by supreetkt 10 months ago - 7 comments
Labels: Stale

#1415 - Support PyArrow arrays as tokenizer input

Issue - State: closed - Opened by mariosasko 10 months ago - 12 comments
Labels: enhancement, good first issue, Stale

#1414 - Faster HF dataset iteration in docs

Pull Request - State: closed - Opened by mariosasko 10 months ago - 1 comment

#1413 - Efficient Replace normalizer

Pull Request - State: closed - Opened by rlrs 10 months ago - 8 comments

#1412 - Performance of tokenizer for CLIP text model

Issue - State: closed - Opened by michael-p 10 months ago - 2 comments
Labels: Stale

#1410 - How to create Tokenizer.json?

Issue - State: closed - Opened by kenaii 10 months ago - 2 comments
Labels: Stale

#1409 - Tokenizer not saving/loading correctly after adding tokens, then training

Issue - State: closed - Opened by dinhanhx 10 months ago - 8 comments
Labels: Stale

#1408 - Special tokens will be split when there is no space before them

Issue - State: closed - Opened by leizhao1234 10 months ago - 1 comment

#1407 - How to add byte_fallback tokens?

Issue - State: open - Opened by dinhanhx 10 months ago - 5 comments
Labels: bytefallback, Feature Request

#1406 - Release Candidate

Pull Request - State: closed - Opened by ArthurZucker 10 months ago - 2 comments

#1405 - Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast

Issue - State: closed - Opened by deklesen 10 months ago - 7 comments
Labels: Stale

#1403 - Use NodeJs: Cannot find module 'tokenizers-darwin-arm64'

Issue - State: closed - Opened by guotingchao 10 months ago - 8 comments
Labels: Stale

#1402 - Installation error with pip install tokenizers==0.12.1 – Compatibility issue with Python 3.6.15 and Rust 1.72.0

Issue - State: closed - Opened by AhmetTasdemir 10 months ago - 11 comments
Labels: Stale

#1401 - Demonstrating Sentence Truncation in Tokenization

Issue - State: closed - Opened by AliHaiderAhmad001 10 months ago - 3 comments
Labels: Stale

#1400 - Another Implementation (faster and more effecient) of BPE Training Algorithm

Issue - State: closed - Opened by Yikai-Liao 10 months ago - 39 comments
Labels: Stale

#1399 - A whitespace character not displaying at a specific position

Issue - State: closed - Opened by scissorstail 10 months ago - 2 comments

#1398 - Rust tokenizer fails!

Issue - State: closed - Opened by arunpatro 10 months ago - 2 comments
Labels: Stale

#1397 - Integration with google/oss-fuzz for continuous fuzzing

Issue - State: closed - Opened by silvergasp 10 months ago - 1 comment
Labels: Stale

#1396 - fuzz: Add a BPE training fuzzer

Pull Request - State: closed - Opened by silvergasp 10 months ago - 1 comment
Labels: Stale

#1395 - train_new_from_iterator fails in non-space separated languages

Issue - State: closed - Opened by frotaur 11 months ago - 5 comments
Labels: Stale

#1394 - Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence.

Pull Request - State: closed - Opened by junrae6454 11 months ago - 1 comment

#1393 - unable to install on python 3.12 via pip

Issue - State: closed - Opened by binary-husky 11 months ago - 10 comments

#1392 - added_tokens with bytemap charaters in ByteLevel could not be decoded correctly

Issue - State: closed - Opened by DOGEwbx 11 months ago - 9 comments
Labels: bug

#1391 - How to split special token in encode?

Issue - State: closed - Opened by leizhao1234 11 months ago - 5 comments

#1390 - udpate to version = "0.15.1-dev0"

Pull Request - State: closed - Opened by ArthurZucker 11 months ago - 1 comment

#1389 - apply_chat_template() with tokenize=False returns incorrect string

Issue - State: closed - Opened by Gnurro 11 months ago - 2 comments
Labels: Stale

#1388 - Release Candidate

Pull Request - State: closed - Opened by ArthurZucker 11 months ago - 1 comment

#1387 - is there a javascript version for tokenizers

Issue - State: closed - Opened by Zwe1 11 months ago - 2 comments
Labels: Stale

#1386 - pyo3: update to 0.20

Pull Request - State: closed - Opened by mikelui 11 months ago - 6 comments

#1385 - Allow `huggingface_hub<1.0`

Pull Request - State: closed - Opened by Wauplin 11 months ago - 6 comments

#1384 - Error: Cannot find module 'tokenizers-linux-x64-musl'

Issue - State: closed - Opened by Madnex 11 months ago - 6 comments
Labels: Stale

#1383 - Allow hf_hub 0.18

Pull Request - State: closed - Opened by mariosasko 11 months ago - 4 comments

#1382 - Fix truncation length assertion

Pull Request - State: closed - Opened by boyleconnor 11 months ago - 3 comments
Labels: Stale

#1381 - Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method

Pull Request - State: closed - Opened by epwalsh 11 months ago - 2 comments

#1380 - Add tokens not impacted by training

Issue - State: closed - Opened by StellaAthena 11 months ago - 6 comments
Labels: Stale

#1379 - Add C++ bindings by mlc-ai to README

Pull Request - State: closed - Opened by ShukantPal 11 months ago
Labels: Stale

#1378 - Rename modeled `token_to_id`

Pull Request - State: closed - Opened by chris-ha458 11 months ago - 2 comments
Labels: Stale

#1377 - Allow tokenizers to use huggingface_hub 0.18.0

Pull Request - State: closed - Opened by clefourrier 11 months ago - 1 comment

#1376 - RobertaTokenizer : tokenizer.decode and tokenizer.tokenize do not generate the same output

Issue - State: closed - Opened by BettyFabre 11 months ago - 4 comments
Labels: Stale

#1375 - Question: what is the add_special_tokens parameter of Tokenizer::encode?

Issue - State: closed - Opened by EricLBuehler 11 months ago - 4 comments

#1374 - add_tokens has no effect in llama fast tokenizer

Issue - State: closed - Opened by tiandiweizun 11 months ago - 1 comment

#1373 - Can not load tokoenizer from_pretrained through http_proxy since 0.14.0

Issue - State: closed - Opened by jtsai-quid 11 months ago - 7 comments
Labels: Stale

#1372 - end_of_word_suffix = "</w>" no work??

Issue - State: open - Opened by longday1102 11 months ago - 2 comments

#1371 - fix: remove useless token

Pull Request - State: closed - Opened by rtrompier 12 months ago - 1 comment

#1370 - Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node

Pull Request - State: closed - Opened by dependabot[bot] 12 months ago
Labels: dependencies, javascript

#1369 - `BPE` tokenization model does not respect custom `RegEx` via `Split` pre-tokenizer

Issue - State: closed - Opened by hogru 12 months ago - 8 comments
Labels: Stale

#1368 - How can we ignore special tokens when encoding text

Issue - State: closed - Opened by DOGEwbx 12 months ago - 8 comments
Labels: Stale

#1367 - Fix doc links in readme

Pull Request - State: closed - Opened by Pierrci 12 months ago - 1 comment

#1366 - Warnings for added tokens not present in the vocab

Issue - State: closed - Opened by jneuff 12 months ago - 7 comments
Labels: Stale

#1365 - cannot install with yarn & missing module in npm

Issue - State: closed - Opened by MaelAbgrall 12 months ago - 6 comments
Labels: Stale

#1364 - Wrapping Tokenizer leads to version error

Issue - State: closed - Opened by shivanraptor 12 months ago - 3 comments

#1363 - Difference between slow and fast GPT2 tokenizers

Issue - State: closed - Opened by goerch 12 months ago - 9 comments

#1362 - When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces?

Issue - State: closed - Opened by enze5088 12 months ago - 4 comments

#1361 - Exception: Custom Normalizer cannot be serialized

Issue - State: closed - Opened by shivanraptor 12 months ago - 1 comment

#1360 - Errors "Using sep_token, but it is not set yet." loading tokenizer trained from scratch

Issue - State: closed - Opened by velocityCavalry 12 months ago - 3 comments
Labels: Stale

#1359 - bc3ec39d breaks the compilation (as noted in #1355)

Issue - State: closed - Opened by baptisterajaut 12 months ago - 13 comments

#1358 - Different behaviour of BPE encoder after update to 0.14.1

Issue - State: closed - Opened by DOGEwbx 12 months ago - 14 comments

#1357 - [`pre_tokenizers`] Fix sentencepiece based Metaspace

Pull Request - State: closed - Opened by ArthurZucker 12 months ago - 3 comments

#1356 - fix a clerical error in the comment

Pull Request - State: closed - Opened by tiandiweizun 12 months ago - 1 comment

#1355 - Preparing release.

Pull Request - State: closed - Opened by Narsil 12 months ago - 1 comment

#1354 - Different `encode` behavior between Python and Rust

Issue - State: closed - Opened by clarkmcc 12 months ago - 12 comments
Labels: Stale

#1353 - Fixing the progressbar.

Pull Request - State: closed - Opened by Narsil 12 months ago - 1 comment

#1352 - Cannot install transformers version 4.11.1

Issue - State: closed - Opened by ra-MANUJ-an 12 months ago - 1 comment

#1351 - Support for apply_chat_template

Issue - State: closed - Opened by xfalcox 12 months ago - 2 comments

#1350 - show_progress=True parameter of trainers.WordPieceTrainer does nothing & the Trainer does not support GPU

Issue - State: closed - Opened by shivanraptor 12 months ago - 4 comments

#1349 - Rust panic when tokenizing after adding vocabulary that contains zero-width chars

Issue - State: closed - Opened by ali-tny about 1 year ago - 6 comments
Labels: Stale

#1348 - ByteLevelBPE training error after adding normalizers.Replace

Issue - State: closed - Opened by Byshev333 about 1 year ago - 3 comments
Labels: Stale

#1347 - Mitigate prompt injection attacks by supporting "safe" encoding (encoding without special tokens)

Issue - State: closed - Opened by bilelomrani1 about 1 year ago - 11 comments

#1346 - ByteLevelBPETokenizer: training duration of different vocab sizes

Issue - State: closed - Opened by Byshev333 about 1 year ago - 2 comments
Labels: Stale

#1345 - train_new_from_iterator consumes large amount of ram

Issue - State: closed - Opened by RichardErkhov about 1 year ago - 10 comments
Labels: Stale

#1344 - Let's allow hf_hub < 1.0

Pull Request - State: closed - Opened by ArthurZucker about 1 year ago - 2 comments

#1343 - CodeLlamaTokenizerFast encodes eos_token into separate tokens in multiprocessing mode

Issue - State: closed - Opened by UniverseFly about 1 year ago - 5 comments

#1342 - tokenizers.processors is not optional

Issue - State: closed - Opened by david-waterworth about 1 year ago - 6 comments

#1341 - Added ability to inspect a 'Sequence' pre-tokenizer.

Pull Request - State: closed - Opened by eaplatanios about 1 year ago - 13 comments

#1340 - How are special chars like \u0120 (Ġ) being handled in tokenizer?

Issue - State: closed - Opened by ekagra-ranjan about 1 year ago - 2 comments

#1339 - update package version for dev

Pull Request - State: closed - Opened by ArthurZucker about 1 year ago - 1 comment

#1338 - Release candidate

Pull Request - State: closed - Opened by ArthurZucker about 1 year ago - 1 comment

#1337 - Does train_new_from_iterator messes up the old_tokenizer ids?

Issue - State: closed - Opened by palash04 about 1 year ago - 2 comments
Labels: Stale

#1336 - Encoding emoji doesn't match with UTF-8 representation

Issue - State: closed - Opened by Norawit29 about 1 year ago - 4 comments
Labels: Stale

#1335 - Update added tokens

Pull Request - State: closed - Opened by ArthurZucker about 1 year ago - 1 comment

#1334 - `AddedTokens` loophole

Issue - State: closed - Opened by ArthurZucker about 1 year ago - 1 comment

#1333 - Updating the docs with the new command.

Pull Request - State: closed - Opened by Narsil about 1 year ago - 1 comment

#1332 - Source Installation Issue in Python Bindings

Issue - State: closed - Opened by AnugunjNaman about 1 year ago - 2 comments
Labels: Stale

#1331 - Move to maturing mimicking move for `safetensors`. + Rewritten node bindings.

Pull Request - State: closed - Opened by Narsil about 1 year ago - 2 comments

#1330 - Python 38 arm

Pull Request - State: closed - Opened by Narsil about 1 year ago - 1 comment

#1329 - Reduce number of different revisions by 1

Pull Request - State: closed - Opened by Narsil about 1 year ago - 1 comment

#1328 - Re-using scritpts from safetensors.

Pull Request - State: closed - Opened by Narsil about 1 year ago - 1 comment

#1327 - SentencePieceBPETokenizer from_spm function

Issue - State: closed - Opened by ykoyfman about 1 year ago - 2 comments

#1326 - `stride` assertion check no longer catches all invalid truncation parameters

Issue - State: closed - Opened by boyleconnor about 1 year ago - 2 comments
Labels: Stale
