Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / huggingface/tokenizers issues and pull requests
#1565 - documentation of the `pattern` parameter in `pre_tokenizers.Split` is incorrect
Issue -
State: closed - Opened by craigschmidt 3 months ago
- 1 comment
Labels: documentation
#1564 - Decode regression
Issue -
State: open - Opened by daulet 3 months ago
- 7 comments
Labels: performance, decoding
#1564 - Decode regression
Issue -
State: open - Opened by daulet 3 months ago
- 10 comments
Labels: performance, decoding
#1563 - Unable to Set `use_regex=False` in BPE decoder & post_processor?
Issue -
State: closed - Opened by jchwenger 3 months ago
- 4 comments
#1562 - Unable to Load Custom GPT2 Tokenizer - " data did not match any variant of untagged enum ModelWrapper at line 1 column 3193814" Error
Issue -
State: open - Opened by maghwa 3 months ago
- 7 comments
#1561 - tokenizers
Pull Request -
State: closed - Opened by Oleh8978 3 months ago
#1560 - Fast encode
Pull Request -
State: closed - Opened by ArthurZucker 3 months ago
- 1 comment
#1559 - Progress bar doesn't show in log file.
Issue -
State: open - Opened by amssljc 3 months ago
- 5 comments
Labels: Stale
#1558 - Bump braces from 3.0.2 to 3.0.3 in /tokenizers/examples/unstable_wasm/www
Pull Request -
State: closed - Opened by dependabot[bot] 3 months ago
- 2 comments
Labels: dependencies, javascript, Stale
#1557 - Bump ws from 8.8.1 to 8.17.1 in /tokenizers/examples/unstable_wasm/www
Pull Request -
State: closed - Opened by dependabot[bot] 3 months ago
- 2 comments
Labels: dependencies, javascript, Stale
#1556 - `Encoding` object stub doesn't include `__len__`
Issue -
State: open - Opened by thearchitector 3 months ago
#1556 - `Encoding` object stub doesn't include `__len__`
Issue -
State: open - Opened by thearchitector 3 months ago
- 4 comments
#1555 - Add bytelevel normalizer to fix decode when adding tokens to BPE
Pull Request -
State: closed - Opened by ArthurZucker 3 months ago
- 2 comments
#1555 - Fix decode
Pull Request -
State: open - Opened by ArthurZucker 3 months ago
- 1 comment
#1554 - make sure we don't warn on empty tokens
Pull Request -
State: closed - Opened by ArthurZucker 3 months ago
- 2 comments
#1554 - make sure we don't warn on empty tokens
Pull Request -
State: open - Opened by ArthurZucker 3 months ago
- 1 comment
#1553 - Llama-3 offset-mapping needs fixing
Issue -
State: open - Opened by davidb-cerebras 4 months ago
- 11 comments
#1552 - [Bug?] Modifying normalizer for pretrained tokenizers don't consistently work
Issue -
State: closed - Opened by alvations 4 months ago
- 3 comments
Labels: Stale
#1552 - Modifying normalizer for pretrained tokenizers don't consistently work
Issue -
State: open - Opened by alvations 4 months ago
#1551 - feat(ci): add trufflehog secrets detection
Pull Request -
State: closed - Opened by McPatate 4 months ago
- 1 comment
#1550 - Enable `dropout = 0.0` as an equivalent to `none` in BPE
Pull Request -
State: closed - Opened by mcognetta 4 months ago
- 6 comments
#1550 - Enable `dropout = 0.0` as an equivalent to `none` in BPE
Pull Request -
State: open - Opened by mcognetta 4 months ago
#1549 - How to use `TokenizerBuilder`?
Issue -
State: closed - Opened by polarathene 4 months ago
- 4 comments
Labels: Stale
#1548 - Fixing for clippy 1.78
Pull Request -
State: closed - Opened by Narsil 4 months ago
- 1 comment
#1547 - Switch from `cached_download` to `hf_hub_download` in tests
Pull Request -
State: closed - Opened by Wauplin 4 months ago
- 2 comments
#1546 - "Solution" to memory hogging in train_new_from_iterator with a hack
Issue -
State: open - Opened by morphpiece 4 months ago
- 4 comments
#1546 - "Solution" to memory hogging in train_new_from_iterator with a hack
Issue -
State: open - Opened by morphpiece 4 months ago
- 7 comments
#1545 - How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer?
Issue -
State: closed - Opened by LuoKaiGSW 4 months ago
- 7 comments
Labels: Stale
#1545 - How can I get the mapping relationship between byte values and Unicode characters of the fast tokenizer?
Issue -
State: open - Opened by LuoKaiGSW 4 months ago
- 5 comments
#1544 - [BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.)
Issue -
State: closed - Opened by MilkClouds 4 months ago
- 7 comments
#1544 - [BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.)
Issue -
State: open - Opened by MilkClouds 4 months ago
- 6 comments
#1544 - [BUG] Fast tokenizer does not deal with AddedTokens properly(no problem in Transformers python tokenizer impl.)
Issue -
State: open - Opened by MilkClouds 4 months ago
- 2 comments
#1543 - llama3 tokenizer doesn't round trip
Issue -
State: open - Opened by josharian 4 months ago
- 3 comments
#1543 - llama3 tokenizer doesn't round trip
Issue -
State: open - Opened by josharian 4 months ago
- 4 comments
Labels: Stale
#1542 - Add display capabilities to tokenizers objects
Pull Request -
State: closed - Opened by ArthurZucker 4 months ago
- 5 comments
#1541 - Deserializing BPE tokenizer failure
Issue -
State: closed - Opened by mcognetta 4 months ago
- 4 comments
#1540 - Adding pretty print of tokenizer
Pull Request -
State: closed - Opened by haixuanTao 4 months ago
- 2 comments
#1539 - Memory leak for large strings
Issue -
State: open - Opened by noamgai21 4 months ago
- 14 comments
#1538 - "from_pretrained" read wrong config file. not "tokenizer_config.json", but "config.json"
Issue -
State: open - Opened by daehuikim 4 months ago
#1537 - Training HuggingFace tokenizer - ignore_merges
Issue -
State: closed - Opened by ykoyfman 4 months ago
- 2 comments
Labels: Stale, Feature Request, planned
#1536 - [BUG]Might be a bug in Unigram Trainer
Issue -
State: open - Opened by Codesticker 4 months ago
- 1 comment
Labels: Stale
#1535 - feat: add support for pyarrow arrays as input
Pull Request -
State: closed - Opened by notjedi 5 months ago
- 10 comments
Labels: Stale
#1534 - How to allow the merging of consecutive newline tokens \n when training a byte-level bpe tokenizer?
Issue -
State: open - Opened by liuslnlp 5 months ago
- 5 comments
Labels: Stale
#1533 - Make `onig` crate non-optional
Pull Request -
State: closed - Opened by nathaniel-daniel 5 months ago
- 1 comment
Labels: Stale
#1532 - Make `USED_PARALLELISM` atomic
Pull Request -
State: closed - Opened by nathaniel-daniel 5 months ago
- 3 comments
#1531 - How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification
Issue -
State: closed - Opened by insookim43 5 months ago
- 1 comment
Labels: Stale
#1530 - Converting `tokenizers` tokenizers into `tiktoken` tokenizers
Issue -
State: closed - Opened by umarbutler 5 months ago
- 5 comments
Labels: Stale
#1529 - Bug with `CodeQwen1.5`: `data did not match any variant of untagged enum PyPreTokenizerTypeWrapper`
Issue -
State: closed - Opened by QwertyJack 5 months ago
- 1 comment
#1528 - Strange warnings with tokenizer for some models
Issue -
State: closed - Opened by EricLBuehler 5 months ago
- 5 comments
#1527 - Special token handling breaks idempotency of sentencepiece due to extra spaces
Issue -
State: open - Opened by cat-state 5 months ago
- 5 comments
Labels: Stale
#1526 - Link to download the training text in `docs/source/quicktour.rst` is broken
Issue -
State: closed - Opened by 14jdelap 5 months ago
- 6 comments
Labels: Stale
#1525 - How to write custom Wordpiece class?
Issue -
State: closed - Opened by xinyinan9527 5 months ago
- 3 comments
Labels: Stale
#1524 - Convert huggingface tokenizer into sentencepiece format
Issue -
State: closed - Opened by RRaphaell 5 months ago
- 3 comments
Labels: Stale
#1523 - ❓Get stats (e.g. counts) about the merged pairs
Issue -
State: closed - Opened by pietrolesci 5 months ago
- 3 comments
Labels: Stale
#1522 - Error: Cannot find module 'tokenizers/bindings/tokenizer'
Issue -
State: closed - Opened by meichangsu1 5 months ago
- 1 comment
Labels: Stale
#1522 - Error: Cannot find module 'tokenizers/bindings/tokenizer'
Issue -
State: open - Opened by meichangsu1 5 months ago
#1521 - remove enforcement of non special when adding tokens
Pull Request -
State: closed - Opened by ArthurZucker 5 months ago
- 2 comments
#1520 - Why are 'unknown' tokens randomly added to my tokenized input?
Issue -
State: closed - Opened by tshmak 5 months ago
- 2 comments
#1519 - Why the tokenizer is slower than tiktoken?
Issue -
State: open - Opened by BigBinnie 5 months ago
- 8 comments
#1519 - Why the tokenizer is slower than tiktoken?
Issue -
State: open - Opened by BigBinnie 5 months ago
- 5 comments
#1518 - Loading `tokenizer.model` with Rust API
Issue -
State: open - Opened by EricLBuehler 5 months ago
- 10 comments
#1518 - Loading `tokenizer.model` with Rust API
Issue -
State: open - Opened by EricLBuehler 5 months ago
- 7 comments
#1518 - Loading `tokenizer.model` with Rust API
Issue -
State: open - Opened by EricLBuehler 5 months ago
- 5 comments
#1518 - Loading `tokenizer.model` with Rust API
Issue -
State: closed - Opened by EricLBuehler 5 months ago
- 11 comments
Labels: Stale
#1517 - Llama3 tokenizer with Incorrect offset_mapping
Issue -
State: open - Opened by justin-shao 5 months ago
- 2 comments
Labels: Stale
#1517 - Llama3 tokenizer with Incorrect offset_mapping
Issue -
State: closed - Opened by justin-shao 5 months ago
- 3 comments
Labels: Stale
#1517 - Llama3 tokenizer with Incorrect offset_mapping
Issue -
State: open - Opened by justin-shao 5 months ago
#1516 - Tokens Removed from Trained Custom BPE Tokenizer
Issue -
State: closed - Opened by rteehas 5 months ago
#1515 - UnigramTrainer: byte_fallback is false.
Issue -
State: open - Opened by Moddus 5 months ago
- 4 comments
Labels: Feature Request, training
#1515 - UnigramTrainer: byte_fallback is false.
Issue -
State: open - Opened by Moddus 5 months ago
- 3 comments
Labels: Feature Request, training
#1514 - BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased
Issue -
State: closed - Opened by Abhinay1997 5 months ago
- 3 comments
Labels: Stale
#1514 - BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased
Issue -
State: open - Opened by Abhinay1997 5 months ago
- 2 comments
Labels: Stale
#1514 - BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased
Issue -
State: open - Opened by Abhinay1997 5 months ago
- 1 comment
#1513 - [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder
Pull Request -
State: closed - Opened by Narsil 5 months ago
- 6 comments
#1513 - [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder
Pull Request -
State: open - Opened by Narsil 5 months ago
- 2 comments
#1512 - Breaking changes in v0.19.1 for tiktoken/llama3
Issue -
State: closed - Opened by sanderland 5 months ago
- 7 comments
Labels: Stale
#1511 - Fix "dictionnary" typo
Pull Request -
State: open - Opened by nprisbrey 5 months ago
#1511 - Fix "dictionnary" typo
Pull Request -
State: closed - Opened by nprisbrey 5 months ago
- 3 comments
#1510 - change conditional compilation for regex libraries
Pull Request -
State: closed - Opened by semaraugusto 5 months ago
- 1 comment
Labels: Stale
#1510 - change conditional compilation for regex libraries
Pull Request -
State: open - Opened by semaraugusto 5 months ago
#1509 - Cross-compilation fails for custom target
Issue -
State: closed - Opened by semaraugusto 5 months ago
- 1 comment
Labels: Stale
#1509 - Cross-compilation fails for custom target
Issue -
State: closed - Opened by semaraugusto 5 months ago
- 3 comments
Labels: Stale
#1508 - Add `.editorconfig` and `rustfmt.toml` for Consistent Code Formatting
Pull Request -
State: closed - Opened by tal7aouy 5 months ago
- 1 comment
Labels: Stale
#1508 - Add `.editorconfig` and `rustfmt.toml` for Consistent Code Formatting
Pull Request -
State: open - Opened by tal7aouy 5 months ago
#1507 - Treatment of hyphenated words
Issue -
State: closed - Opened by rattle99 5 months ago
- 7 comments
Labels: Stale
#1507 - Treatment of hyphenated words
Issue -
State: closed - Opened by rattle99 5 months ago
- 2 comments
Labels: Stale