Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / huggingface/tokenizers issues and pull requests

#1565 - documentation of the `pattern` parameter in `pre_tokenizers.Split` is incorrect

Issue - State: closed - Opened by craigschmidt 3 months ago - 1 comment
Labels: documentation

#1564 - Decode regression

Issue - State: open - Opened by daulet 3 months ago - 7 comments
Labels: performance, decoding

#1564 - Decode regression

Issue - State: open - Opened by daulet 3 months ago - 10 comments
Labels: performance, decoding

#1563 - Unable to Set `use_regex=False` in BPE decoder & post_processor?

Issue - State: closed - Opened by jchwenger 3 months ago - 4 comments

#1561 - tokenizers

Pull Request - State: closed - Opened by Oleh8978 3 months ago

#1560 - Fast encode

Pull Request - State: closed - Opened by ArthurZucker 3 months ago - 1 comment

#1559 - Progress bar doesn't show in log file.

Issue - State: open - Opened by amssljc 3 months ago - 5 comments
Labels: Stale

#1558 - Bump braces from 3.0.2 to 3.0.3 in /tokenizers/examples/unstable_wasm/www

Pull Request - State: closed - Opened by dependabot[bot] 3 months ago - 2 comments
Labels: dependencies, javascript, Stale

#1557 - Bump ws from 8.8.1 to 8.17.1 in /tokenizers/examples/unstable_wasm/www

Pull Request - State: closed - Opened by dependabot[bot] 3 months ago - 2 comments
Labels: dependencies, javascript, Stale

#1556 - `Encoding` object stub doesn't include `__len__`

Issue - State: open - Opened by thearchitector 3 months ago - 4 comments

#1555 - Add bytelevel normalizer to fix decode when adding tokens to BPE

Pull Request - State: closed - Opened by ArthurZucker 3 months ago - 2 comments

#1555 - Fix decode

Pull Request - State: open - Opened by ArthurZucker 3 months ago - 1 comment

#1554 - make sure we don't warn on empty tokens

Pull Request - State: closed - Opened by ArthurZucker 3 months ago - 2 comments

#1554 - make sure we don't warn on empty tokens

Pull Request - State: open - Opened by ArthurZucker 3 months ago - 1 comment

#1553 - Llama-3 offset-mapping needs fixing

Issue - State: open - Opened by davidb-cerebras 4 months ago - 11 comments

#1552 - [Bug?] Modifying normalizer for pretrained tokenizers don't consistently work

Issue - State: closed - Opened by alvations 4 months ago - 3 comments
Labels: Stale

#1551 - feat(ci): add trufflehog secrets detection

Pull Request - State: closed - Opened by McPatate 4 months ago - 1 comment

#1550 - Enable `dropout = 0.0` as an equivalent to `none` in BPE

Pull Request - State: closed - Opened by mcognetta 4 months ago - 6 comments

#1550 - Enable `dropout = 0.0` as an equivalent to `none` in BPE

Pull Request - State: open - Opened by mcognetta 4 months ago

#1549 - How to use `TokenizerBuilder`?

Issue - State: closed - Opened by polarathene 4 months ago - 4 comments
Labels: Stale

#1548 - Fixing for clippy 1.78

Pull Request - State: closed - Opened by Narsil 4 months ago - 1 comment

#1547 - Switch from `cached_download` to `hf_hub_download` in tests

Pull Request - State: closed - Opened by Wauplin 4 months ago - 2 comments

#1546 - "Solution" to memory hogging in train_new_from_iterator with a hack

Issue - State: open - Opened by morphpiece 4 months ago - 4 comments

#1546 - "Solution" to memory hogging in train_new_from_iterator with a hack

Issue - State: open - Opened by morphpiece 4 months ago - 7 comments

#1543 - llama3 tokenizer doesn't round trip

Issue - State: open - Opened by josharian 4 months ago - 3 comments

#1543 - llama3 tokenizer doesn't round trip

Issue - State: open - Opened by josharian 4 months ago - 4 comments
Labels: Stale

#1542 - Add display capabilities to tokenizers objects

Pull Request - State: closed - Opened by ArthurZucker 4 months ago - 5 comments

#1541 - Deserializing BPE tokenizer failure

Issue - State: closed - Opened by mcognetta 4 months ago - 4 comments

#1540 - Adding pretty print of tokenizer

Pull Request - State: closed - Opened by haixuanTao 4 months ago - 2 comments

#1539 - Memory leak for large strings

Issue - State: open - Opened by noamgai21 4 months ago - 14 comments

#1537 - Training HuggingFace tokenizer - ignore_merges

Issue - State: closed - Opened by ykoyfman 4 months ago - 2 comments
Labels: Stale, Feature Request, planned

#1536 - [BUG]Might be a bug in Unigram Trainer

Issue - State: open - Opened by Codesticker 4 months ago - 1 comment
Labels: Stale

#1535 - feat: add support for pyarrow arrays as input

Pull Request - State: closed - Opened by notjedi 5 months ago - 10 comments
Labels: Stale

#1533 - Make `onig` crate non-optional

Pull Request - State: closed - Opened by nathaniel-daniel 5 months ago - 1 comment
Labels: Stale

#1532 - Make `USED_PARALLELISM` atomic

Pull Request - State: closed - Opened by nathaniel-daniel 5 months ago - 3 comments

#1531 - How to Batch-Encode Paired Input Sentences with Tokenizers: Seeking Clarification

Issue - State: closed - Opened by insookim43 5 months ago - 1 comment
Labels: Stale

#1530 - Converting `tokenizers` tokenizers into `tiktoken` tokenizers

Issue - State: closed - Opened by umarbutler 5 months ago - 5 comments
Labels: Stale

#1528 - Strange warnings with tokenizer for some models

Issue - State: closed - Opened by EricLBuehler 5 months ago - 5 comments

#1527 - Special token handling breaks idempotency of sentencepiece due to extra spaces

Issue - State: open - Opened by cat-state 5 months ago - 5 comments
Labels: Stale

#1526 - Link to download the training text in `docs/source/quicktour.rst` is broken

Issue - State: closed - Opened by 14jdelap 5 months ago - 6 comments
Labels: Stale

#1525 - How to write custom Wordpiece class?

Issue - State: closed - Opened by xinyinan9527 5 months ago - 3 comments
Labels: Stale

#1524 - Convert huggingface tokenizer into sentencepiece format

Issue - State: closed - Opened by RRaphaell 5 months ago - 3 comments
Labels: Stale

#1523 - ❓Get stats (e.g. counts) about the merged pairs

Issue - State: closed - Opened by pietrolesci 5 months ago - 3 comments
Labels: Stale

#1522 - Error: Cannot find module 'tokenizers/bindings/tokenizer'

Issue - State: closed - Opened by meichangsu1 5 months ago - 1 comment
Labels: Stale

#1521 - remove enforcement of non special when adding tokens

Pull Request - State: closed - Opened by ArthurZucker 5 months ago - 2 comments

#1520 - Why are 'unknown' tokens randomly added to my tokenized input?

Issue - State: closed - Opened by tshmak 5 months ago - 2 comments

#1519 - Why the tokenizer is slower than tiktoken?

Issue - State: open - Opened by BigBinnie 5 months ago - 8 comments

#1519 - Why the tokenizer is slower than tiktoken?

Issue - State: open - Opened by BigBinnie 5 months ago - 5 comments

#1518 - Loading `tokenizer.model` with Rust API

Issue - State: open - Opened by EricLBuehler 5 months ago - 10 comments

#1518 - Loading `tokenizer.model` with Rust API

Issue - State: open - Opened by EricLBuehler 5 months ago - 7 comments

#1518 - Loading `tokenizer.model` with Rust API

Issue - State: open - Opened by EricLBuehler 5 months ago - 5 comments

#1518 - Loading `tokenizer.model` with Rust API

Issue - State: closed - Opened by EricLBuehler 5 months ago - 11 comments
Labels: Stale

#1517 - Llama3 tokenizer with Incorrect offset_mapping

Issue - State: open - Opened by justin-shao 5 months ago - 2 comments
Labels: Stale

#1517 - Llama3 tokenizer with Incorrect offset_mapping

Issue - State: closed - Opened by justin-shao 5 months ago - 3 comments
Labels: Stale

#1517 - Llama3 tokenizer with Incorrect offset_mapping

Issue - State: open - Opened by justin-shao 5 months ago

#1516 - Tokens Removed from Trained Custom BPE Tokenizer

Issue - State: closed - Opened by rteehas 5 months ago

#1515 - UnigramTrainer: byte_fallback is false.

Issue - State: open - Opened by Moddus 5 months ago - 4 comments
Labels: Feature Request, training

#1515 - UnigramTrainer: byte_fallback is false.

Issue - State: open - Opened by Moddus 5 months ago - 3 comments
Labels: Feature Request, training

#1514 - BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased

Issue - State: closed - Opened by Abhinay1997 5 months ago - 3 comments
Labels: Stale

#1514 - BPE Trainer doesn't respect the `vocab_size` parameter when dataset size is increased

Issue - State: open - Opened by Abhinay1997 5 months ago - 2 comments
Labels: Stale

#1513 - [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder

Pull Request - State: closed - Opened by Narsil 5 months ago - 6 comments

#1513 - [BREAKING CHANGE] Ignore added_tokens (both special and not) in the decoder

Pull Request - State: open - Opened by Narsil 5 months ago - 2 comments

#1512 - Breaking changes in v0.19.1 for tiktoken/llama3

Issue - State: closed - Opened by sanderland 5 months ago - 7 comments
Labels: Stale

#1511 - Fix "dictionnary" typo

Pull Request - State: open - Opened by nprisbrey 5 months ago

#1511 - Fix "dictionnary" typo

Pull Request - State: closed - Opened by nprisbrey 5 months ago - 3 comments

#1510 - change conditional compilation for regex libraries

Pull Request - State: closed - Opened by semaraugusto 5 months ago - 1 comment
Labels: Stale

#1510 - change conditional compilation for regex libraries

Pull Request - State: open - Opened by semaraugusto 5 months ago

#1509 - Cross-compilation fails for custom target

Issue - State: closed - Opened by semaraugusto 5 months ago - 1 comment
Labels: Stale

#1509 - Cross-compilation fails for custom target

Issue - State: closed - Opened by semaraugusto 5 months ago - 3 comments
Labels: Stale

#1508 - Add `.editorconfig` and `rustfmt.toml` for Consistent Code Formatting

Pull Request - State: closed - Opened by tal7aouy 5 months ago - 1 comment
Labels: Stale

#1507 - Treatment of hyphenated words

Issue - State: closed - Opened by rattle99 5 months ago - 7 comments
Labels: Stale

#1507 - Treatment of hyphenated words

Issue - State: closed - Opened by rattle99 5 months ago - 2 comments
Labels: Stale
