Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / huggingface/tokenizers issues and pull requests
#1426 - Profile-Guided Optimization (PGO) benchmark results
Issue -
State: closed - Opened by zamazan4ik 9 months ago
- 4 comments
Labels: Stale
#1425 - "make bench" command does not download all required resources
Issue -
State: closed - Opened by zamazan4ik 9 months ago
#1424 - Decoding Issue for Latin Characters in `added_tokens`
Issue -
State: closed - Opened by 44670 9 months ago
- 2 comments
Labels: Stale
#1423 - Possible bug in case of prepending chars in a pretokenizer
Issue -
State: closed - Opened by ivankrylatskoe 9 months ago
- 9 comments
Labels: Stale
#1422 - loading `added_tokens.json`
Issue -
State: closed - Opened by kczimm 9 months ago
- 3 comments
#1421 - Memory Leak in encode_batch Function
Issue -
State: closed - Opened by Atakey 9 months ago
- 5 comments
Labels: Stale
#1420 - Add quick doc to byte_level.rs
Pull Request -
State: closed - Opened by steventrouble 9 months ago
- 1 comment
#1419 - add option to skip special tokens
Pull Request -
State: closed - Opened by ArthurZucker 10 months ago
- 5 comments
#1418 - Unsupported platform for tokenizers
Issue -
State: closed - Opened by KolbySisk 10 months ago
- 2 comments
Labels: Stale
#1417 - Questions re: Tokenizer pipeline composability
Issue -
State: closed - Opened by ahgraber 10 months ago
- 2 comments
#1416 - ModuleNotFoundError: No module named 'tokenizers.tokenizers'
Issue -
State: closed - Opened by supreetkt 10 months ago
- 7 comments
Labels: Stale
#1415 - Support PyArrow arrays as tokenizer input
Issue -
State: closed - Opened by mariosasko 10 months ago
- 12 comments
Labels: enhancement, good first issue, Stale
#1414 - Faster HF dataset iteration in docs
Pull Request -
State: closed - Opened by mariosasko 10 months ago
- 1 comment
#1413 - Efficient Replace normalizer
Pull Request -
State: closed - Opened by rlrs 10 months ago
- 8 comments
#1412 - Performance of tokenizer for CLIP text model
Issue -
State: closed - Opened by michael-p 10 months ago
- 2 comments
Labels: Stale
#1410 - How to create Tokenizer.json?
Issue -
State: closed - Opened by kenaii 10 months ago
- 2 comments
Labels: Stale
#1409 - Tokenizer **not saving/loading** correctly after adding tokens, then training
Issue -
State: closed - Opened by dinhanhx 10 months ago
- 8 comments
Labels: Stale
#1408 - Special tokens will be split when there is no space before them
Issue -
State: closed - Opened by leizhao1234 10 months ago
- 1 comment
#1407 - How to add byte_fallback tokens?
Issue -
State: open - Opened by dinhanhx 10 months ago
- 5 comments
Labels: bytefallback, Feature Request
#1406 - Release Candidate
Pull Request -
State: closed - Opened by ArthurZucker 10 months ago
- 2 comments
#1405 - Tokenization is super slow when using XGLMTokenizer or XGLMTokenizerFast
Issue -
State: closed - Opened by deklesen 10 months ago
- 7 comments
Labels: Stale
#1403 - Use NodeJs: Cannot find module 'tokenizers-darwin-arm64'
Issue -
State: closed - Opened by guotingchao 10 months ago
- 8 comments
Labels: Stale
#1402 - Installation error with pip install tokenizers==0.12.1 – Compatibility issue with Python 3.6.15 and Rust 1.72.0
Issue -
State: closed - Opened by AhmetTasdemir 10 months ago
- 11 comments
Labels: Stale
#1401 - Demonstrating Sentence Truncation in Tokenization
Issue -
State: closed - Opened by AliHaiderAhmad001 10 months ago
- 3 comments
Labels: Stale
#1400 - Another Implementation (faster and more effecient) of BPE Training Algorithm
Issue -
State: closed - Opened by Yikai-Liao 10 months ago
- 39 comments
Labels: Stale
#1399 - A whitespace character not displaying at a specific position
Issue -
State: closed - Opened by scissorstail 10 months ago
- 2 comments
#1398 - Rust tokenizer fails!
Issue -
State: closed - Opened by arunpatro 10 months ago
- 2 comments
Labels: Stale
#1397 - Integration with google/oss-fuzz for continuous fuzzing
Issue -
State: closed - Opened by silvergasp 10 months ago
- 1 comment
Labels: Stale
#1396 - fuzz: Add a BPE training fuzzer
Pull Request -
State: closed - Opened by silvergasp 10 months ago
- 1 comment
Labels: Stale
#1395 - train_new_from_iterator fails in non-space separated languages
Issue -
State: closed - Opened by frotaur 11 months ago
- 5 comments
Labels: Stale
#1394 - Fix: fixing the inconsistency in byte-level tokenization when using pre_tokenizer.sequence.
Pull Request -
State: closed - Opened by junrae6454 11 months ago
- 1 comment
#1393 - unable to install on python 3.12 via pip
Issue -
State: closed - Opened by binary-husky 11 months ago
- 10 comments
#1392 - added_tokens with bytemap charaters in ByteLevel could not be decoded correctly
Issue -
State: closed - Opened by DOGEwbx 11 months ago
- 9 comments
Labels: bug
#1391 - How to split special token in encode?
Issue -
State: closed - Opened by leizhao1234 11 months ago
- 5 comments
#1390 - udpate to version = "0.15.1-dev0"
Pull Request -
State: closed - Opened by ArthurZucker 11 months ago
- 1 comment
#1389 - apply_chat_template() with tokenize=False returns incorrect string
Issue -
State: closed - Opened by Gnurro 11 months ago
- 2 comments
Labels: Stale
#1388 - Release Candidate
Pull Request -
State: closed - Opened by ArthurZucker 11 months ago
- 1 comment
#1387 - is there a javascript version for tokenizers
Issue -
State: closed - Opened by Zwe1 11 months ago
- 2 comments
Labels: Stale
#1386 - pyo3: update to 0.20
Pull Request -
State: closed - Opened by mikelui 11 months ago
- 6 comments
#1385 - Allow `huggingface_hub<1.0`
Pull Request -
State: closed - Opened by Wauplin 11 months ago
- 6 comments
#1384 - Error: Cannot find module 'tokenizers-linux-x64-musl'
Issue -
State: closed - Opened by Madnex 11 months ago
- 6 comments
Labels: Stale
#1383 - Allow hf_hub 0.18
Pull Request -
State: closed - Opened by mariosasko 11 months ago
- 4 comments
#1382 - Fix truncation length assertion
Pull Request -
State: closed - Opened by boyleconnor 11 months ago
- 3 comments
Labels: Stale
#1381 - Derive `Clone` on `Tokenizer`, add `Encoding.into_tokens()` method
Pull Request -
State: closed - Opened by epwalsh 11 months ago
- 2 comments
#1380 - Add tokens not impacted by training
Issue -
State: closed - Opened by StellaAthena 11 months ago
- 6 comments
Labels: Stale
#1379 - Add C++ bindings by mlc-ai to README
Pull Request -
State: closed - Opened by ShukantPal 11 months ago
Labels: Stale
#1378 - Rename modeled `token_to_id`
Pull Request -
State: closed - Opened by chris-ha458 11 months ago
- 2 comments
Labels: Stale
#1377 - Allow tokenizers to use huggingface_hub 0.18.0
Pull Request -
State: closed - Opened by clefourrier 11 months ago
- 1 comment
#1376 - RobertaTokenizer : tokenizer.decode and tokenizer.tokenize do not generate the same output
Issue -
State: closed - Opened by BettyFabre 11 months ago
- 4 comments
Labels: Stale
#1375 - Question: what is the add_special_tokens parameter of Tokenizer::encode?
Issue -
State: closed - Opened by EricLBuehler 11 months ago
- 4 comments
#1374 - add_tokens has no effect in llama fast tokenizer
Issue -
State: closed - Opened by tiandiweizun 11 months ago
- 1 comment
#1373 - Can not load tokoenizer from_pretrained through http_proxy since 0.14.0
Issue -
State: closed - Opened by jtsai-quid 11 months ago
- 7 comments
Labels: Stale
#1372 - end_of_word_suffix = "</w>" no work??
Issue -
State: open - Opened by longday1102 11 months ago
- 2 comments
#1371 - fix: remove useless token
Pull Request -
State: closed - Opened by rtrompier 12 months ago
- 1 comment
#1370 - Bump @babel/traverse from 7.22.11 to 7.23.2 in /bindings/node
Pull Request -
State: closed - Opened by dependabot[bot] 12 months ago
Labels: dependencies, javascript
#1369 - `BPE` tokenization model does not respect custom `RegEx` via `Split` pre-tokenizer
Issue -
State: closed - Opened by hogru 12 months ago
- 8 comments
Labels: Stale
#1368 - How can we ignore special tokens when encoding text
Issue -
State: closed - Opened by DOGEwbx 12 months ago
- 8 comments
Labels: Stale
#1367 - Fix doc links in readme
Pull Request -
State: closed - Opened by Pierrci 12 months ago
- 1 comment
#1366 - Warnings for added tokens not present in the vocab
Issue -
State: closed - Opened by jneuff 12 months ago
- 7 comments
Labels: Stale
#1365 - cannot install with yarn & missing module in npm
Issue -
State: closed - Opened by MaelAbgrall 12 months ago
- 6 comments
Labels: Stale
#1364 - Wrapping Tokenizer leads to version error
Issue -
State: closed - Opened by shivanraptor 12 months ago
- 3 comments
#1363 - Difference between slow and fast GPT2 tokenizers
Issue -
State: closed - Opened by goerch 12 months ago
- 9 comments
#1362 - When decoding an English sentence with the 'add_prefix_space' parameter set to 'False,' how can I add spaces?
Issue -
State: closed - Opened by enze5088 12 months ago
- 4 comments
#1361 - Exception: Custom Normalizer cannot be serialized
Issue -
State: closed - Opened by shivanraptor 12 months ago
- 1 comment
#1360 - Errors "Using sep_token, but it is not set yet." loading tokenizer trained from scratch
Issue -
State: closed - Opened by velocityCavalry 12 months ago
- 3 comments
Labels: Stale
#1359 - bc3ec39d breaks the compilation (as noted in #1355)
Issue -
State: closed - Opened by baptisterajaut 12 months ago
- 13 comments
#1358 - Different behaviour of BPE encoder after update to 0.14.1
Issue -
State: closed - Opened by DOGEwbx 12 months ago
- 14 comments
#1357 - [`pre_tokenizers`] Fix sentencepiece based Metaspace
Pull Request -
State: closed - Opened by ArthurZucker 12 months ago
- 3 comments
#1356 - fix a clerical error in the comment
Pull Request -
State: closed - Opened by tiandiweizun 12 months ago
- 1 comment
#1355 - Preparing release.
Pull Request -
State: closed - Opened by Narsil 12 months ago
- 1 comment
#1354 - Different `encode` behavior between Python and Rust
Issue -
State: closed - Opened by clarkmcc 12 months ago
- 12 comments
Labels: Stale
#1353 - Fixing the progressbar.
Pull Request -
State: closed - Opened by Narsil 12 months ago
- 1 comment
#1352 - Cannot install transformers version 4.11.1
Issue -
State: closed - Opened by ra-MANUJ-an 12 months ago
- 1 comment
#1351 - Support for apply_chat_template
Issue -
State: closed - Opened by xfalcox 12 months ago
- 2 comments
#1350 - show_progress=True parameter of trainers.WordPieceTrainer does nothing & the Trainer does not support GPU
Issue -
State: closed - Opened by shivanraptor 12 months ago
- 4 comments
#1349 - Rust panic when tokenizing after adding vocabulary that contains zero-width chars
Issue -
State: closed - Opened by ali-tny about 1 year ago
- 6 comments
Labels: Stale
#1348 - ByteLevelBPE training error after adding normalizers.Replace
Issue -
State: closed - Opened by Byshev333 about 1 year ago
- 3 comments
Labels: Stale
#1347 - Mitigate prompt injection attacks by supporting "safe" encoding (encoding without special tokens)
Issue -
State: closed - Opened by bilelomrani1 about 1 year ago
- 11 comments
#1346 - ByteLevelBPETokenizer: training duration of different vocab sizes
Issue -
State: closed - Opened by Byshev333 about 1 year ago
- 2 comments
Labels: Stale
#1345 - train_new_from_iterator consumes large amount of ram
Issue -
State: closed - Opened by RichardErkhov about 1 year ago
- 10 comments
Labels: Stale
#1344 - Let's allow hf_hub < 1.0
Pull Request -
State: closed - Opened by ArthurZucker about 1 year ago
- 2 comments
#1343 - CodeLlamaTokenizerFast encodes eos_token into separate tokens in multiprocessing mode
Issue -
State: closed - Opened by UniverseFly about 1 year ago
- 5 comments
#1342 - tokenizers.processors is not optional
Issue -
State: closed - Opened by david-waterworth about 1 year ago
- 6 comments
#1341 - Added ability to inspect a 'Sequence' pre-tokenizer.
Pull Request -
State: closed - Opened by eaplatanios about 1 year ago
- 13 comments
#1340 - How are special chars like \u0120 (Ġ) being handled in tokenizer?
Issue -
State: closed - Opened by ekagra-ranjan about 1 year ago
- 2 comments
#1339 - update package version for dev
Pull Request -
State: closed - Opened by ArthurZucker about 1 year ago
- 1 comment
#1338 - Release candidate
Pull Request -
State: closed - Opened by ArthurZucker about 1 year ago
- 1 comment
#1337 - Does train_new_from_iterator messes up the old_tokenizer ids?
Issue -
State: closed - Opened by palash04 about 1 year ago
- 2 comments
Labels: Stale
#1336 - Encoding emoji doesn't match with UTF-8 representation
Issue -
State: closed - Opened by Norawit29 about 1 year ago
- 4 comments
Labels: Stale
#1335 - Update added tokens
Pull Request -
State: closed - Opened by ArthurZucker about 1 year ago
- 1 comment
#1334 - `AddedTokens` loophole
Issue -
State: closed - Opened by ArthurZucker about 1 year ago
- 1 comment
#1333 - Updating the docs with the new command.
Pull Request -
State: closed - Opened by Narsil about 1 year ago
- 1 comment
#1332 - Source Installation Issue in Python Bindings
Issue -
State: closed - Opened by AnugunjNaman about 1 year ago
- 2 comments
Labels: Stale
#1331 - Move to maturing mimicking move for `safetensors`. + Rewritten node bindings.
Pull Request -
State: closed - Opened by Narsil about 1 year ago
- 2 comments
#1330 - Python 38 arm
Pull Request -
State: closed - Opened by Narsil about 1 year ago
- 1 comment
#1329 - Reduce number of different revisions by 1
Pull Request -
State: closed - Opened by Narsil about 1 year ago
- 1 comment
#1328 - Re-using scritpts from safetensors.
Pull Request -
State: closed - Opened by Narsil about 1 year ago
- 1 comment
#1327 - SentencePieceBPETokenizer from_spm function
Issue -
State: closed - Opened by ykoyfman about 1 year ago
- 2 comments
#1326 - `stride` assertion check no longer catches all invalid truncation parameters
Issue -
State: closed - Opened by boyleconnor about 1 year ago
- 2 comments
Labels: Stale