Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / huggingface/tokenizers issues and pull requests
#1169 - `add_tokens` results in duplicate IDs
Issue -
State: closed - Opened by davidgilbertson over 1 year ago
- 5 comments
#1164 - GPT-2 tokeniser's decoder is incorrect and doesn't roundtrip
Issue -
State: closed - Opened by hauntsaninja over 1 year ago
- 6 comments
Labels: bug, Stale
#1162 - How does tokenizers tokenize non-english characters into tokens?
Issue -
State: closed - Opened by dongrixinyu almost 2 years ago
- 9 comments
#1160 - Adding words to vocabulary and training for a new domain
Issue -
State: closed - Opened by PeterM18 almost 2 years ago
- 6 comments
Labels: Stale
#1159 - text is listed as optional in api but tokenizer throws error when missing
Issue -
State: closed - Opened by epinnock almost 2 years ago
- 2 comments
Labels: Stale
#1157 - Use LTO for release and benchmark builds
Pull Request -
State: closed - Opened by csko almost 2 years ago
- 7 comments
#1146 - How to port Google/SentencePiece tokenizer to HF Tokenizer
Issue -
State: closed - Opened by SupreethRao99 almost 2 years ago
- 8 comments
#1144 - Allow to build without onig or fancy-regex
Pull Request -
State: closed - Opened by llogiq almost 2 years ago
- 5 comments
Labels: Stale
#1142 - Bump dirs from 3.0 to 4.0
Pull Request -
State: closed - Opened by hvaara almost 2 years ago
- 6 comments
#1141 - Support for incremental decoding
Issue -
State: closed - Opened by njhill almost 2 years ago
- 16 comments
#1138 - How to porting subword-nmt to tokenizers?
Issue -
State: closed - Opened by wannaphong almost 2 years ago
- 4 comments
Labels: Stale
#1134 - TEST Please Ignore
Pull Request -
State: open - Opened by hvaara almost 2 years ago
- 1 comment
Labels: Stale
#1134 - TEST Please Ignore
Pull Request -
State: closed - Opened by hvaara almost 2 years ago
- 1 comment
Labels: Stale
#1133 - Fix broken links in docs
Pull Request -
State: closed - Opened by hvaara almost 2 years ago
- 2 comments
#1131 - Ignore Cargo.lock for subfolders
Pull Request -
State: closed - Opened by hvaara almost 2 years ago
- 6 comments
#1129 - Bump derive_builder from 0.9 to 0.12
Pull Request -
State: closed - Opened by hvaara almost 2 years ago
- 3 comments
#1128 - Can't import any modules
Issue -
State: closed - Opened by kronkinatorix almost 2 years ago
- 2 comments
Labels: Stale
#1118 - How to decode with the existing tokenizer
Issue -
State: open - Opened by ZhiYuanZeng almost 2 years ago
- 5 comments
Labels: Stale
#1118 - How to decode with the existing tokenizer
Issue -
State: closed - Opened by ZhiYuanZeng almost 2 years ago
- 5 comments
Labels: Stale
#1117 - C++ binding
Issue -
State: closed - Opened by jiaqianjing almost 2 years ago
- 2 comments
Labels: Stale
#1117 - C++ binding
Issue -
State: open - Opened by jiaqianjing almost 2 years ago
- 2 comments
Labels: Stale
#1114 - Is there any support for 'google/tapas-mini-finetuned-wtq' tokenizer?
Issue -
State: open - Opened by memetrusidovski almost 2 years ago
- 5 comments
Labels: Stale
#1114 - Is there any support for 'google/tapas-mini-finetuned-wtq' tokenizer?
Issue -
State: closed - Opened by memetrusidovski almost 2 years ago
- 5 comments
Labels: Stale
#1113 - OpenSSL internal error when importing tokenizers module
Issue -
State: closed - Opened by wai25 almost 2 years ago
- 7 comments
#1112 - Adding treat_whitespace_as_suffix as a new feature to sentencepiece?
Issue -
State: closed - Opened by Smu-Tan almost 2 years ago
- 3 comments
Labels: Stale, Feature Request
#1112 - Adding treat_whitespace_as_suffix as a new feature to sentencepiece?
Issue -
State: open - Opened by Smu-Tan almost 2 years ago
- 2 comments
Labels: Feature Request
#1110 - NormalizedString `append` method failed after calling `clear`
Issue -
State: open - Opened by yanyongyu about 2 years ago
- 7 comments
Labels: Stale
#1110 - NormalizedString `append` method failed after calling `clear`
Issue -
State: closed - Opened by yanyongyu about 2 years ago
- 7 comments
Labels: Stale
#1109 - How can I keep the initial input vocab and incremental add the new tokens during re-training a tokenizer?
Issue -
State: closed - Opened by henryxiao1997 about 2 years ago
- 2 comments
Labels: Stale
#1104 - tokenizers 0.13.2 does not compile when default features are turned off
Issue -
State: closed - Opened by jneuff about 2 years ago
- 5 comments
Labels: Stale
#1103 - Infinite training ?
Issue -
State: closed - Opened by astariul about 2 years ago
- 8 comments
#1101 - Update pr docs actions
Pull Request -
State: closed - Opened by mishig25 about 2 years ago
- 1 comment
#1100 - Difference between PreTrainedTokenizerFast Python and Node SentencePieceBPETokenizer
Issue -
State: closed - Opened by loretoparisi about 2 years ago
- 11 comments
Labels: Stale
#1095 - Add spaces_between_special_tokens and cleanup tokenization spaces
Pull Request -
State: closed - Opened by ArthurZucker about 2 years ago
- 6 comments
#1094 - Module not found: Error: Can't resolve '../bin-package' in path
Issue -
State: closed - Opened by zbloss about 2 years ago
- 6 comments
Labels: Stale
#1093 - How to implement customized tokenizers in Rust.
Issue -
State: closed - Opened by 5c4lar about 2 years ago
- 6 comments
Labels: Stale
#1092 - Can not get package to build with Python 3.11 on a minimal linux environment
Issue -
State: closed - Opened by ZetiMente about 2 years ago
- 4 comments
Labels: Stale
#1091 - How to prevent joined words in output of text generation?
Issue -
State: closed - Opened by auwsom about 2 years ago
- 5 comments
Labels: Stale
#1089 - [MINOR] Weird text artifacts in documentation
Issue -
State: closed - Opened by cakiki about 2 years ago
- 2 comments
Labels: Stale
#1088 - fails - Building wheel for tokenizers (pyproject.toml)
Issue -
State: closed - Opened by FurkanGozukara about 2 years ago
- 8 comments
Labels: Stale
#1087 - tokenizers 0.10.3 build fails
Issue -
State: closed - Opened by ashari4 about 2 years ago
- 5 comments
Labels: Stale
#1086 - WordPiece Pair Score Calculation and Reproducibility
Issue -
State: closed - Opened by gdeleva about 2 years ago
- 4 comments
Labels: Stale
#1084 - BERT regex replace normalizers
Issue -
State: closed - Opened by galtay-tempus about 2 years ago
- 4 comments
Labels: Stale
#1081 - [WIP] Unigram trainer seems odd, ignoring some suffixes entirely
Pull Request -
State: closed - Opened by Narsil about 2 years ago
- 1 comment
Labels: Stale
#1077 - BERT-based Tokenizer will Skip Some Unicode Tokens without warning
Issue -
State: closed - Opened by lolipopshock about 2 years ago
- 10 comments
Labels: Stale
#1071 - Vulnerabilities for openssl 1.0.1e
Issue -
State: closed - Opened by wqh17101 about 2 years ago
- 6 comments
Labels: Stale
#1070 - would huggingface like publish cpp binding for tokenizers?
Issue -
State: closed - Opened by mullerhai about 2 years ago
- 4 comments
Labels: Stale
#1065 - How does Byte-pair Encoding handle equally frequent pairs?
Issue -
State: closed - Opened by nikolay-klimenko about 2 years ago
- 2 comments
Labels: Stale
#1061 - Fast WordPiece tokenizer implementation
Issue -
State: closed - Opened by catleeball about 2 years ago
- 10 comments
Labels: Stale
#1057 - Replace doesn't match all characters in [[:punct:]]
Issue -
State: closed - Opened by david-waterworth about 2 years ago
- 6 comments
Labels: Stale
#1050 - ERROR: Failed building wheel for tokenizers
Issue -
State: closed - Opened by outdoorblake about 2 years ago
- 66 comments
Labels: bug
#1046 - Custom truncation logic is really hard
Issue -
State: open - Opened by dirkgr about 2 years ago
- 10 comments
Labels: Stale
#1045 - Docs say you can pass token ids to `.encode()`, but it throws an exception when you do
Issue -
State: open - Opened by dirkgr about 2 years ago
- 12 comments
Labels: Stale
#1044 - Set `add_prefix_space = False` for existing pre-trained tokenizers
Issue -
State: open - Opened by cyk1337 about 2 years ago
- 8 comments
Labels: Stale
#1042 - Unigram finalization
Issue -
State: open - Opened by david-waterworth over 2 years ago
- 2 comments
Labels: Stale
#1040 - Incorrect offsets after replace with special token
Issue -
State: open - Opened by david-waterworth over 2 years ago
- 2 comments
Labels: Stale
#1039 - Introducing special tokens via `tokenizers.normalizers.Replace`
Issue -
State: closed - Opened by david-waterworth over 2 years ago
- 3 comments
Labels: Stale
#1033 - tokenizer.save_vocabulary()
Issue -
State: closed - Opened by kkavyashankar0009 over 2 years ago
- 8 comments
Labels: Stale
#1031 - Missing documentation for `BertWordPieceTokenizer`
Issue -
State: closed - Opened by BlueskyFR over 2 years ago
- 2 comments
Labels: Stale
#1028 - How to preserve original dataset fields when tokenizing with overflow?
Issue -
State: closed - Opened by srobertjames over 2 years ago
- 1 comment
Labels: Stale
#1027 - Problem adding token with a specific replace normalizer
Issue -
State: closed - Opened by sadra-barikbin over 2 years ago
- 14 comments
Labels: Stale
#1026 - prebuilt darwin arm64 wheels
Issue -
State: closed - Opened by tekumara over 2 years ago
- 5 comments
#1025 - Can't convert <tokenizers.trainers.WordPieceTrainer object at 0x173caa2b0> to Sequence
Issue -
State: closed - Opened by Eleo22 over 2 years ago
- 2 comments
Labels: Stale
#1021 - Add wasm32 emscripten target support for python binding
Pull Request -
State: closed - Opened by messense over 2 years ago
Labels: Stale
#1020 - OPT Tokenizers have wrong "special_tokens_map"
Issue -
State: closed - Opened by chengxuz over 2 years ago
- 2 comments
Labels: Stale
#1018 - XLM-Roberta offset mapping is off by one in case of whitespace-subwords
Issue -
State: closed - Opened by robvanderg over 2 years ago
- 4 comments
Labels: Stale
#1017 - Unable to get Camel case tokens after tokenization in huggingface
Issue -
State: closed - Opened by pjhamb over 2 years ago
- 2 comments
Labels: Stale
#1015 - Training time way too long
Issue -
State: closed - Opened by jxuanli over 2 years ago
- 1 comment
Labels: Stale
#1012 - AttributeError: 'BertTokenizer' object has no attribute 'tokens_trie'
Issue -
State: closed - Opened by sbrvrm99-zz over 2 years ago
- 2 comments
Labels: Stale
#1011 - Difference in behavior between fast tokenizers and normal tokenizers regarding unicode characters in strings
Issue -
State: closed - Opened by avi-jain over 2 years ago
- 4 comments
Labels: Stale
#1004 - `limit_alphabet=1000` is unreasonable in some languages
Issue -
State: closed - Opened by kaisugi over 2 years ago
- 4 comments
Labels: Stale
#1001 - Add UNK token to a Unigram tokenizer created by giving vocabulary
Issue -
State: closed - Opened by marcmk6 over 2 years ago
- 3 comments
Labels: Stale
#998 - Pretrained BertWordPieceTokenizer loads with different parameters
Issue -
State: closed - Opened by ulyanaisaeva over 2 years ago
- 3 comments
Labels: Stale
#995 - Tokenizer VIsualizer
Issue -
State: closed - Opened by ToluClassics over 2 years ago
- 3 comments
Labels: Stale
#992 - Use sentencepiece's protobuf module instead of the local protobuf file
Pull Request -
State: closed - Opened by tma15 over 2 years ago
Labels: Stale
#991 - [Docs] Clarify how Tokenizer.pad_to_multiple_of is useful in allowing use of GPU tensor cores
Issue -
State: closed - Opened by ldorigo over 2 years ago
- 4 comments
Labels: Stale
#990 - Problems in building tokenizer with a not space-separated language
Issue -
State: closed - Opened by seiichiinoue over 2 years ago
- 7 comments
Labels: Stale
#984 - NPM install failing with 403 response
Issue -
State: closed - Opened by JulioAlbinatiCortez over 2 years ago
- 15 comments
Labels: Stale
#976 - Parallelize unigram trainer
Pull Request -
State: closed - Opened by mishig25 over 2 years ago
- 13 comments
#972 - "GLIBC_2.29 not found" on nodejs binding
Issue -
State: open - Opened by remagpie over 2 years ago
- 3 comments
Labels: Stale
#968 - Support for `pad_encodings` in the Python API
Issue -
State: open - Opened by LoicGrobol over 2 years ago
- 8 comments
Labels: Stale
#964 - Print warnings to stderr, not stdout
Issue -
State: closed - Opened by NickCrews over 2 years ago
- 7 comments
#958 - Update outdated dependencies and feature-gate CLI
Pull Request -
State: closed - Opened by MarcusGrass over 2 years ago
- 3 comments
#948 - char_to_token is broken when is_split_into_words is set to True
Issue -
State: open - Opened by zorikg over 2 years ago
- 4 comments
Labels: Stale
#946 - 0.11.5 and 0.11.6 packages not compatible with manylinux2010
Issue -
State: open - Opened by vgod-dbx over 2 years ago
- 6 comments
Labels: Stale
#935 - Support `wasm`
Issue -
State: closed - Opened by Narsil over 2 years ago
- 10 comments
Labels: Stale
#933 - [Optional] Adding parallelization to `Unigram` trainer.
Issue -
State: closed - Opened by Narsil over 2 years ago
#930 - Enabling simpler flow of information from the `Tokenizer` to the `trainer`.
Issue -
State: closed - Opened by Narsil over 2 years ago
- 2 comments
Labels: bug, enhancement, Stale, Feature Request
#929 - Implement the Byte->char hack of SPM within BPE
Issue -
State: closed - Opened by Narsil over 2 years ago
- 10 comments
#926 - Addition of CONTRIBUTING.md to Repository
Issue -
State: closed - Opened by beneyal over 2 years ago
- 6 comments
Labels: Stale
#921 - Attempt to make unigram faster 2.
Pull Request -
State: closed - Opened by thomasw21 over 2 years ago
- 1 comment
Labels: Stale
#920 - Attempt to make Unigram trainer parallel.
Pull Request -
State: closed - Opened by Narsil over 2 years ago
- 2 comments
#914 - Loading of Tokenizer is really slow when there are lots of additional tokens
Issue -
State: closed - Opened by PiercarloSlavazza over 2 years ago
- 9 comments