Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / huggingface/tokenizers issues and pull requests

#1638 - Fix off-by-one error in tokenizer::normalizer::Range::len

Pull Request - State: open - Opened by rlanday about 20 hours ago

#1636 - NormalizedString.clear() broken?

Issue - State: open - Opened by lkurlandski 5 days ago - 1 comment
Labels: bug

#1634 - Cannot inject custom PreTokenizer into Tokenizer

Issue - State: open - Opened by Old-Shatterhand 6 days ago - 6 comments

#1633 - README.md contains non-functional code

Issue - State: open - Opened by ahenkes1 11 days ago - 2 comments

#1632 - style: simplify string formatting for readability

Pull Request - State: open - Opened by hamirmahal 12 days ago

#1631 - Bump send and express in /tokenizers/examples/unstable_wasm/www

Pull Request - State: open - Opened by dependabot[bot] 13 days ago
Labels: dependencies, javascript

#1630 - Bump serve-static and express in /tokenizers/examples/unstable_wasm/www

Pull Request - State: open - Opened by dependabot[bot] 13 days ago
Labels: dependencies, javascript

#1629 - Bump body-parser and express in /tokenizers/examples/unstable_wasm/www

Pull Request - State: open - Opened by dependabot[bot] 14 days ago
Labels: dependencies, javascript

#1628 - Access utf-8 byte sequence for each token

Issue - State: open - Opened by DanielHesslow 21 days ago - 2 comments

#1627 - Rust: How to handle models with `precompiled_charsmap = null`

Issue - State: open - Opened by kallebysantos 26 days ago - 1 comment

#1626 - Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows

Pull Request - State: open - Opened by dependabot[bot] 26 days ago - 1 comment
Labels: dependencies, github_actions

#1623 - STATUS_ENTRYPOINT_NOT_FOUND

Issue - State: open - Opened by impurity-dev 28 days ago - 1 comment

#1622 - Bump webpack from 5.76.0 to 5.94.0 in /tokenizers/examples/unstable_wasm/www

Pull Request - State: open - Opened by dependabot[bot] about 1 month ago - 1 comment
Labels: dependencies, javascript

#1621 - Arg name correction: auth_token -> token

Pull Request - State: open - Opened by rravenel about 1 month ago - 3 comments

#1620 - PreTrainedTokenizerFast `char_to_token` `token_to_char` not working as expected

Issue - State: open - Opened by yonigottesman about 2 months ago - 5 comments
Labels: bug

#1619 - ModuleNotFoundError: No module named 'tokenizers.tokenizers'

Issue - State: open - Opened by jpferraro1 about 1 month ago - 6 comments

#1617 - 🚨 breaking: Fix training with special tokens

Pull Request - State: open - Opened by ArthurZucker about 1 month ago - 2 comments

#1616 - BPE trainer ignoring special tokens.

Issue - State: open - Opened by henrycharlesworth about 1 month ago - 3 comments

#1615 - .NET bindings

Issue - State: open - Opened by sappho192 about 2 months ago

#1614 - Can I use SentencePieceBPETokenizer to replace google/sentencepiece?

Issue - State: closed - Opened by npuichigo about 2 months ago - 6 comments

#1613 - Space after unnormalized token is added when `use_fast=True` for Llama tokenizer

Issue - State: open - Opened by Butanium about 2 months ago - 10 comments

#1612 - `RefMutContainer` is unsound

Issue - State: open - Opened by CheaterCodes about 2 months ago - 3 comments

#1611 - [test-infra] Enable Codecov for tokenizers

Issue - State: open - Opened by hvaara about 2 months ago

#1610 - fix benchmark file link

Pull Request - State: closed - Opened by 152334H about 2 months ago - 1 comment

#1609 - Token ID Out of Range & Indexing Assertion Errors During Training

Issue - State: closed - Opened by haseebrj17 about 2 months ago - 4 comments

#1608 - Update README.md

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1607 - Fix CI

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1606 - Candidate release

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1605 - Fast regex

Pull Request - State: open - Opened by ArthurZucker about 2 months ago - 1 comment

#1604 - Tests + Deserialization improvement for normalizers.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1603 - add deserialize for pre tokenizers

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1602 - Fix strip python type

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1601 - Support for Golang now or support a cli for other languages?

Issue - State: open - Opened by xuxiaoxia96 about 2 months ago - 2 comments

#1600 - Add test normalizers

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1599 - Improve decoder deserialization

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1598 - Adding a few tests for decoder deserialization.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1597 - Add-legacy-tests

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1595 - Better serialization error

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1594 - Adding some serialization testing around the wrapper.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1593 - Fixing release CI strict (taken from safetensors).

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1592 - Better serialization and deserialization error

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1591 - Fix doc about split

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 1 comment

#1590 - Support `None` to reset pre_tokenizers and normalizers, and index sequences

Pull Request - State: closed - Opened by ArthurZucker about 2 months ago - 2 comments

#1589 - Recursive ellipsis for serde_pyo3

Pull Request - State: closed - Opened by EricLBuehler about 2 months ago - 2 comments

#1588 - Using serde (serde_pyo3) to get __str__ and __repr__ easily.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1587 - Perf improvement 16% by removing offsets.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1586 - Enable fancy regex

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1585 - Tiny improvement

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1584 - Fixing benchmark2.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 2 comments

#1583 - Fixing the benchmark.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1582 - Add benchmark vs tiktoken

Pull Request - State: closed - Opened by Narsil 2 months ago - 1 comment

#1580 - Fix clippy + feature test management.

Pull Request - State: closed - Opened by Narsil 2 months ago - 1 comment

#1579 - Risk of global variable memory leaks when calling train_from_iterator

Issue - State: open - Opened by Yikai-Liao 2 months ago - 1 comment
Labels: Stale

#1578 - return pytorch tensors like in transformers?

Issue - State: closed - Opened by PaulLerner 2 months ago - 5 comments

#1577 - `train_from_iterator` out of memory on WMT14 `de` dataset

Issue - State: closed - Opened by Kami-chanw 2 months ago - 2 comments

#1576 - Issue with `SentencePieceUnigramTokenizer` Handling Unknown Tokens

Issue - State: open - Opened by Munikumar09 2 months ago - 1 comment

#1575 - apply_chat_template api usage consult

Issue - State: open - Opened by FanZhang91 2 months ago

#1574 - Use pyo3 smd v0.21

Pull Request - State: closed - Opened by EricLBuehler 2 months ago - 1 comment

#1573 - Truncation performs slowly. Tokenizer firstly encodes long sequence and then truncates it.

Issue - State: open - Opened by galtimur 2 months ago - 2 comments
Labels: Feature Request

#1572 - BPE Split pretokenization rule is not reflected in the vocabulary

Issue - State: closed - Opened by meliksahturker 2 months ago - 2 comments

#1571 - Bump spm_precompiled to 0.1.3

Pull Request - State: closed - Opened by MikeIvanichev 3 months ago - 4 comments

#1570 - [Feature] support Assign token to update the content of a token

Pull Request - State: open - Opened by ArthurZucker 3 months ago - 1 comment

#1569 - Revert "[BREAKING CHANGE] Ignore added_tokens (both special and not) …

Pull Request - State: closed - Opened by ArthurZucker 3 months ago - 1 comment

#1568 - [Fix metaspace prepending scheme] ⛓️‍💥⛓️‍💥

Pull Request - State: open - Opened by ArthurZucker 3 months ago - 1 comment

#1567 - Tokenizer.from_bytes() not available in python bindings

Issue - State: open - Opened by RamvigneshPasupathy 3 months ago - 2 comments
Labels: Feature Request

#1566 - Custom fast PreTokenizer, ported via PyO3 to Python

Issue - State: open - Opened by vandrw 3 months ago - 2 comments