huggingface/tokenizers issues and pull requests

#1169 - `add_tokens` results in duplicate IDs

Issue - State: closed - Opened by davidgilbertson over 1 year ago - 5 comments

#1164 - GPT-2 tokeniser's decoder is incorrect and doesn't roundtrip

Issue - State: closed - Opened by hauntsaninja over 1 year ago - 6 comments
Labels: bug, Stale

#1162 - How does tokenizers tokenize non-english characters into tokens?

Issue - State: closed - Opened by dongrixinyu almost 2 years ago - 9 comments

#1160 - Adding words to vocabulary and training for a new domain

Issue - State: closed - Opened by PeterM18 almost 2 years ago - 6 comments
Labels: Stale

#1159 - text is listed as optional in api but tokenizer throws error when missing

Issue - State: closed - Opened by epinnock almost 2 years ago - 2 comments
Labels: Stale

#1157 - Use LTO for release and benchmark builds

Pull Request - State: closed - Opened by csko almost 2 years ago - 7 comments

#1146 - How to port Google/SentencePiece tokenizer to HF Tokenizer

Issue - State: closed - Opened by SupreethRao99 almost 2 years ago - 8 comments

#1144 - Allow to build without onig or fancy-regex

Pull Request - State: closed - Opened by llogiq almost 2 years ago - 5 comments
Labels: Stale

#1142 - Bump dirs from 3.0 to 4.0

Pull Request - State: closed - Opened by hvaara almost 2 years ago - 6 comments

#1141 - Support for incremental decoding

Issue - State: closed - Opened by njhill almost 2 years ago - 16 comments

#1138 - How to porting subword-nmt to tokenizers?

Issue - State: closed - Opened by wannaphong almost 2 years ago - 4 comments
Labels: Stale

#1134 - TEST Please Ignore

Pull Request - State: closed - Opened by hvaara almost 2 years ago - 1 comment
Labels: Stale

#1133 - Fix broken links in docs

Pull Request - State: closed - Opened by hvaara almost 2 years ago - 2 comments

#1131 - Ignore Cargo.lock for subfolders

Pull Request - State: closed - Opened by hvaara almost 2 years ago - 6 comments

#1129 - Bump derive_builder from 0.9 to 0.12

Pull Request - State: closed - Opened by hvaara almost 2 years ago - 3 comments

#1128 - Can't import any modules

Issue - State: closed - Opened by kronkinatorix almost 2 years ago - 2 comments
Labels: Stale

#1118 - How to decode with the existing tokenizer

Issue - State: closed - Opened by ZhiYuanZeng almost 2 years ago - 5 comments
Labels: Stale

#1117 - C++ binding

Issue - State: closed - Opened by jiaqianjing almost 2 years ago - 2 comments
Labels: Stale

#1114 - Is there any support for 'google/tapas-mini-finetuned-wtq' tokenizer?

Issue - State: closed - Opened by memetrusidovski almost 2 years ago - 5 comments
Labels: Stale

#1113 - OpenSSL internal error when importing tokenizers module

Issue - State: closed - Opened by wai25 almost 2 years ago - 7 comments

#1112 - Adding treat_whitespace_as_suffix as a new feature to sentencepiece?

Issue - State: closed - Opened by Smu-Tan almost 2 years ago - 3 comments
Labels: Stale, Feature Request

#1110 - NormalizedString `append` method failed after calling `clear`

Issue - State: closed - Opened by yanyongyu about 2 years ago - 7 comments
Labels: Stale

#1109 - How can I keep the initial input vocab and incremental add the new tokens during re-training a tokenizer?

Issue - State: closed - Opened by henryxiao1997 about 2 years ago - 2 comments
Labels: Stale

#1104 - tokenizers 0.13.2 does not compile when default features are turned off

Issue - State: closed - Opened by jneuff about 2 years ago - 5 comments
Labels: Stale

#1103 - Infinite training ?

Issue - State: closed - Opened by astariul about 2 years ago - 8 comments

#1101 - Update pr docs actions

Pull Request - State: closed - Opened by mishig25 about 2 years ago - 1 comment

#1100 - Difference between PreTrainedTokenizerFast Python and Node SentencePieceBPETokenizer

Issue - State: closed - Opened by loretoparisi about 2 years ago - 11 comments
Labels: Stale

#1095 - Add spaces_between_special_tokens and cleanup tokenization spaces

Pull Request - State: closed - Opened by ArthurZucker about 2 years ago - 6 comments

#1094 - Module not found: Error: Can't resolve '../bin-package' in path

Issue - State: closed - Opened by zbloss about 2 years ago - 6 comments
Labels: Stale

#1093 - How to implement customized tokenizers in Rust.

Issue - State: closed - Opened by 5c4lar about 2 years ago - 6 comments
Labels: Stale

#1092 - Can not get package to build with Python 3.11 on a minimal linux environment

Issue - State: closed - Opened by ZetiMente about 2 years ago - 4 comments
Labels: Stale

#1091 - How to prevent joined words in output of text generation?

Issue - State: closed - Opened by auwsom about 2 years ago - 5 comments
Labels: Stale

#1089 - [MINOR] Weird text artifacts in documentation

Issue - State: closed - Opened by cakiki about 2 years ago - 2 comments
Labels: Stale

#1088 - fails - Building wheel for tokenizers (pyproject.toml)

Issue - State: closed - Opened by FurkanGozukara about 2 years ago - 8 comments
Labels: Stale

#1087 - tokenizers 0.10.3 build fails

Issue - State: closed - Opened by ashari4 about 2 years ago - 5 comments
Labels: Stale

#1086 - WordPiece Pair Score Calculation and Reproducibility

Issue - State: closed - Opened by gdeleva about 2 years ago - 4 comments
Labels: Stale

#1084 - BERT regex replace normalizers

Issue - State: closed - Opened by galtay-tempus about 2 years ago - 4 comments
Labels: Stale

#1081 - [WIP] Unigram trainer seems odd, ignoring some suffixes entirely

Pull Request - State: closed - Opened by Narsil about 2 years ago - 1 comment
Labels: Stale

#1077 - BERT-based Tokenizer will Skip Some Unicode Tokens without warning

Issue - State: closed - Opened by lolipopshock about 2 years ago - 10 comments
Labels: Stale

#1071 - Vulnerabilities for openssl 1.0.1e

Issue - State: closed - Opened by wqh17101 about 2 years ago - 6 comments
Labels: Stale

#1070 - would huggingface like publish cpp binding for tokenizers?

Issue - State: closed - Opened by mullerhai about 2 years ago - 4 comments
Labels: Stale

#1065 - How does Byte-pair Encoding handle equally frequent pairs?

Issue - State: closed - Opened by nikolay-klimenko about 2 years ago - 2 comments
Labels: Stale

#1061 - Fast WordPiece tokenizer implementation

Issue - State: closed - Opened by catleeball about 2 years ago - 10 comments
Labels: Stale

#1057 - Replace doesn't match all characters in [[:punct:]]

Issue - State: closed - Opened by david-waterworth about 2 years ago - 6 comments
Labels: Stale

#1050 - ERROR: Failed building wheel for tokenizers

Issue - State: closed - Opened by outdoorblake about 2 years ago - 66 comments
Labels: bug

#1046 - Custom truncation logic is really hard

Issue - State: open - Opened by dirkgr about 2 years ago - 10 comments
Labels: Stale

#1045 - Docs say you can pass token ids to `.encode()`, but it throws an exception when you do

Issue - State: open - Opened by dirkgr about 2 years ago - 12 comments
Labels: Stale

#1044 - Set `add_prefix_space = False` for existing pre-trained tokenizers

Issue - State: open - Opened by cyk1337 about 2 years ago - 8 comments
Labels: Stale

#1042 - Unigram finalization

Issue - State: open - Opened by david-waterworth over 2 years ago - 2 comments
Labels: Stale

#1040 - Incorrect offsets after replace with special token

Issue - State: open - Opened by david-waterworth over 2 years ago - 2 comments
Labels: Stale

#1039 - Introducing special tokens via `tokenizers.normalizers.Replace`

Issue - State: closed - Opened by david-waterworth over 2 years ago - 3 comments
Labels: Stale

#1033 - tokenizer.save_vocabulary()

Issue - State: closed - Opened by kkavyashankar0009 over 2 years ago - 8 comments
Labels: Stale

#1031 - Missing documentation for `BertWordPieceTokenizer`

Issue - State: closed - Opened by BlueskyFR over 2 years ago - 2 comments
Labels: Stale

#1028 - How to preserve original dataset fields when tokenizing with overflow?

Issue - State: closed - Opened by srobertjames over 2 years ago - 1 comment
Labels: Stale

#1027 - Problem adding token with a specific replace normalizer

Issue - State: closed - Opened by sadra-barikbin over 2 years ago - 14 comments
Labels: Stale

#1026 - prebuilt darwin arm64 wheels

Issue - State: closed - Opened by tekumara over 2 years ago - 5 comments

#1025 - Can't convert <tokenizers.trainers.WordPieceTrainer object at 0x173caa2b0> to Sequence

Issue - State: closed - Opened by Eleo22 over 2 years ago - 2 comments
Labels: Stale

#1021 - Add wasm32 emscripten target support for python binding

Pull Request - State: closed - Opened by messense over 2 years ago
Labels: Stale

#1020 - OPT Tokenizers have wrong "special_tokens_map"

Issue - State: closed - Opened by chengxuz over 2 years ago - 2 comments
Labels: Stale

#1018 - XLM-Roberta offset mapping is off by one in case of whitespace-subwords

Issue - State: closed - Opened by robvanderg over 2 years ago - 4 comments
Labels: Stale

#1017 - Unable to get Camel case tokens after tokenization in huggingface

Issue - State: closed - Opened by pjhamb over 2 years ago - 2 comments
Labels: Stale

#1015 - Training time way too long

Issue - State: closed - Opened by jxuanli over 2 years ago - 1 comment
Labels: Stale

#1012 - AttributeError: 'BertTokenizer' object has no attribute 'tokens_trie'

Issue - State: closed - Opened by sbrvrm99-zz over 2 years ago - 2 comments
Labels: Stale

#1011 - Difference in behavior between fast tokenizers and normal tokenizers regarding unicode characters in strings

Issue - State: closed - Opened by avi-jain over 2 years ago - 4 comments
Labels: Stale

#1004 - `limit_alphabet=1000` is unreasonable in some languages

Issue - State: closed - Opened by kaisugi over 2 years ago - 4 comments
Labels: Stale

#1001 - Add UNK token to a Unigram tokenizer created by giving vocabulary

Issue - State: closed - Opened by marcmk6 over 2 years ago - 3 comments
Labels: Stale

#998 - Pretrained BertWordPieceTokenizer loads with different parameters

Issue - State: closed - Opened by ulyanaisaeva over 2 years ago - 3 comments
Labels: Stale

#995 - Tokenizer VIsualizer

Issue - State: closed - Opened by ToluClassics over 2 years ago - 3 comments
Labels: Stale

#992 - Use sentencepiece's protobuf module instead of the local protobuf file

Pull Request - State: closed - Opened by tma15 over 2 years ago
Labels: Stale

#991 - [Docs] Clarify how Tokenizer.pad_to_multiple_of is useful in allowing use of GPU tensor cores

Issue - State: closed - Opened by ldorigo over 2 years ago - 4 comments
Labels: Stale

#990 - Problems in building tokenizer with a not space-separated language

Issue - State: closed - Opened by seiichiinoue over 2 years ago - 7 comments
Labels: Stale

#984 - NPM install failing with 403 response

Issue - State: closed - Opened by JulioAlbinatiCortez over 2 years ago - 15 comments
Labels: Stale

#976 - Parallelize unigram trainer

Pull Request - State: closed - Opened by mishig25 over 2 years ago - 13 comments

#972 - "GLIBC_2.29 not found" on nodejs binding

Issue - State: open - Opened by remagpie over 2 years ago - 3 comments
Labels: Stale

#968 - Support for `pad_encodings` in the Python API

Issue - State: open - Opened by LoicGrobol over 2 years ago - 8 comments
Labels: Stale

#964 - Print warnings to stderr, not stdout

Issue - State: closed - Opened by NickCrews over 2 years ago - 7 comments

#958 - Update outdated dependencies and feature-gate CLI

Pull Request - State: closed - Opened by MarcusGrass over 2 years ago - 3 comments

#948 - char_to_token is broken when is_split_into_words is set to True

Issue - State: open - Opened by zorikg over 2 years ago - 4 comments
Labels: Stale

#946 - 0.11.5 and 0.11.6 packages not compatible with manylinux2010

Issue - State: open - Opened by vgod-dbx over 2 years ago - 6 comments
Labels: Stale

#935 - Support `wasm`

Issue - State: closed - Opened by Narsil over 2 years ago - 10 comments
Labels: Stale

#933 - [Optional] Adding parallelization to `Unigram` trainer.

Issue - State: closed - Opened by Narsil over 2 years ago

#930 - Enabling simpler flow of information from the `Tokenizer` to the `trainer`.

Issue - State: closed - Opened by Narsil over 2 years ago - 2 comments
Labels: bug, enhancement, Stale, Feature Request

#929 - Implement the Byte->char hack of SPM within BPE

Issue - State: closed - Opened by Narsil over 2 years ago - 10 comments

#926 - Addition of CONTRIBUTING.md to Repository

Issue - State: closed - Opened by beneyal over 2 years ago - 6 comments
Labels: Stale

#921 - Attempt to make unigram faster 2.

Pull Request - State: closed - Opened by thomasw21 over 2 years ago - 1 comment
Labels: Stale

#920 - Attempt to make Unigram trainer parallel.

Pull Request - State: closed - Opened by Narsil over 2 years ago - 2 comments

#914 - Loading of Tokenizer is really slow when there are lots of additional tokens

Issue - State: closed - Opened by PiercarloSlavazza over 2 years ago - 9 comments
