An open API service for providing issue and pull request metadata for open source projects.

GitHub / huggingface/tokenizers issues and pull requests

#1824 - Practical limits of BPE tokenizer training

Issue - State: closed - Opened by dvruette 25 days ago - 5 comments

#1822 - Faster Whitespace PreTokenizer (Drop-in Replacement)

Pull Request - State: open - Opened by 8ria 26 days ago

#1818 - Clippy fixes.

Pull Request - State: closed - Opened by Narsil 29 days ago - 1 comment

#1816 - tokenizers cannot be compiled successfully on riscv machine

Issue - State: closed - Opened by zengqingfu1442 about 1 month ago - 5 comments

#1815 - Needs minor version release

Issue - State: open - Opened by sftse about 1 month ago - 2 comments

#1814 - Create vendor.yml

Pull Request - State: closed - Opened by bloodwass about 1 month ago - 1 comment

#1813 - Is it possible to return the count of truncated tokens?

Issue - State: open - Opened by leejuyuu about 1 month ago

#1810 - Building Rust numpy v0.23 dependency fails on Mac OS Sequoia

Issue - State: closed - Opened by lordfeck about 1 month ago - 4 comments

#1809 - Add 3.13t CI using pytest-run-parallel

Pull Request - State: open - Opened by ngoldbaum about 1 month ago

#1808 - Fix typo in README

Pull Request - State: open - Opened by aisk about 1 month ago

#1807 - error: casting &T to &mut T is undefined behavior #1485

Issue - State: open - Opened by Fshrink about 1 month ago

#1806 - Track lockfile

Pull Request - State: open - Opened by sftse about 1 month ago

#1805 - Fix: Refactor BpeTrainer to use chunked pair counting

Pull Request - State: closed - Opened by MeryylleA about 1 month ago

#1804 - Adding multiprocessing for sentencepiece_extractor

Pull Request - State: open - Opened by AamodThakur about 1 month ago

#1803 - Proposal for Optimizing transformers.BertTokenizerFast

Issue - State: open - Opened by springkim 4 months ago - 4 comments
Labels: Feature Request

#1802 - Hotfixing the stub.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 2 comments

#1801 - Upgrading dependencies.

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1800 - Adding throughput to benches to have a more consistent measure across

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1799 - Consolidated optimization ahash dary compact str

Pull Request - State: closed - Opened by Narsil about 2 months ago - 1 comment

#1798 - Fix features blending into a paragraph

Pull Request - State: closed - Opened by bionicles about 2 months ago

#1797 - py03 async bindings for encode/decode in rust

Issue - State: open - Opened by michaelfeil about 2 months ago - 1 comment

#1796 - Bump brace-expansion from 1.1.11 to 1.1.12 in /bindings/node

Pull Request - State: closed - Opened by dependabot[bot] about 2 months ago - 1 comment
Labels: dependencies, javascript

#1795 - Bugs encountered during training

Issue - State: open - Opened by gitterlover about 2 months ago

#1794 - `WordPieceTrainer.train_from_iterator` is not deterministic

Issue - State: open - Opened by Tialo about 2 months ago

#1793 - `end_of_word_suffix` is ignored

Issue - State: open - Opened by Tialo about 2 months ago - 1 comment

#1792 - Bump webpack-dev-server from 4.10.0 to 5.2.1 in /tokenizers/examples/unstable_wasm/www

Pull Request - State: closed - Opened by dependabot[bot] about 2 months ago - 1 comment
Labels: dependencies, javascript

#1791 - SyntaxWarning: invalid escape sequence '\w'

Issue - State: open - Opened by wyattscarpenter about 2 months ago

#1790 - Regression in THUDM/chatglm3-6b at least starting from **tokenizers-0.21.1**

Issue - State: open - Opened by pavel-esir about 2 months ago - 1 comment

#1789 - Expose `Encoding` attributes via the buffer protocol interface

Pull Request - State: open - Opened by mariosasko about 2 months ago - 3 comments

#1788 - add group capture to replace

Pull Request - State: open - Opened by cboseak about 2 months ago

#1787 - Hi, we need java /scala tokenizers bindings

Issue - State: open - Opened by mullerhai 2 months ago - 2 comments

#1785 - [docs] Whitespace

Pull Request - State: closed - Opened by stevhliu 2 months ago - 6 comments

#1783 - Add Truncate pre-tokenizer

Pull Request - State: open - Opened by ArthurZucker 2 months ago

#1782 - Add benchmark for deserializing large added vocab + optimizations

Pull Request - State: open - Opened by ArthurZucker 2 months ago - 1 comment

#1781 - clippy

Pull Request - State: closed - Opened by ArthurZucker 2 months ago - 1 comment

#1780 - Update decode stream api

Pull Request - State: open - Opened by ArthurZucker 2 months ago - 1 comment

#1778 - Fix ApiBuilder in from_pretrained to use env variables

Pull Request - State: closed - Opened by vdebergue 2 months ago - 5 comments

#1777 - Suggestion to Clarify WordPiece Documentation

Issue - State: open - Opened by pietrolesci 3 months ago - 1 comment

#1776 - Tokenizers fail to build due to Rust dependency

Issue - State: closed - Opened by AmmarkoV 3 months ago - 2 comments

#1773 - Property 'pre_tokenizer' cannot be set

Issue - State: open - Opened by Masquito 3 months ago - 2 comments

#1772 - Fix no-onig no-wasm builds

Pull Request - State: closed - Opened by 414owen 3 months ago - 3 comments

#1771 - Upgrade onig, to get it compiling with GCC 15

Pull Request - State: closed - Opened by 414owen 3 months ago - 5 comments

#1770 - Fix typos in strings and comments

Pull Request - State: closed - Opened by co63oc 3 months ago

#1769 - Return_offsets_mapping when decoding

Issue - State: closed - Opened by Boltzmachine 3 months ago - 1 comment

#1768 - How to debug tokenizers with python?

Issue - State: open - Opened by JinJieGan 3 months ago - 1 comment

#1767 - Support free-threaded Python and ship 3.13t wheels

Issue - State: closed - Opened by ngoldbaum 3 months ago - 9 comments

#1766 - Fix type notation of merges in BPE Python binding

Pull Request - State: closed - Opened by Coqueue 3 months ago

#1764 - Update __init__.pyi: fix 525: SyntaxWarning: invalid escape sequence '\w'

Pull Request - State: closed - Opened by wyattscarpenter 4 months ago - 2 comments

#1763 - Make unigram cache optional

Pull Request - State: open - Opened by wangrunji0408 4 months ago

#1762 - Bump http-proxy-middleware from 2.0.6 to 2.0.9 in /tokenizers/examples/unstable_wasm/www

Pull Request - State: closed - Opened by dependabot[bot] 4 months ago - 1 comment
Labels: dependencies, javascript

#1761 - Trying to load slow tokenizer in Rust

Issue - State: closed - Opened by mayocream 4 months ago - 1 comment

#1760 - normalizers.Replace able to support regex group capture

Issue - State: open - Opened by nrv 4 months ago

#1759 - AttributeError: module 'decoders' has no attribute 'DecodeStream'

Issue - State: open - Opened by scattw 4 months ago - 5 comments

#1757 - Tokenizer encode and decode get different token ids and text

Issue - State: closed - Opened by liho00 4 months ago - 1 comment

#1756 - Itertools upgrade

Pull Request - State: closed - Opened by sftse 4 months ago - 3 comments

#1755 - Implement Append normalizer

Pull Request - State: open - Opened by austinleedavis 4 months ago

#1754 - Ability to get `Ġ` (encoded space) with `Tokenizer.decode`

Issue - State: closed - Opened by jamesbraza 4 months ago - 3 comments

#1753 - Pre-tokenizers that support multi-word/non-whitespace BPE in single pass

Pull Request - State: open - Opened by mjbommar 4 months ago - 2 comments

#1752 - Switch to FXHash

Pull Request - State: open - Opened by MeetThePatel 5 months ago - 1 comment

#1751 - Proposal: Add Golang Bindings for tokenizers

Issue - State: open - Opened by Nav31 5 months ago - 3 comments
Labels: Feature Request

#1749 - Tokenizers fails to build in Python3.13t (NoGIL build)

Issue - State: closed - Opened by vinayakdsci 5 months ago - 2 comments

#1748 - tokenizer memory allocation of 1179664 bytes failed

Issue - State: closed - Opened by TonFard 5 months ago

#1747 - Fix data path in test_continuing_prefix_trainer_mismatch

Pull Request - State: closed - Opened by GaetanLepage 5 months ago - 1 comment

#1746 - Update the release builds following 0.21.1.

Pull Request - State: closed - Opened by Narsil 5 months ago - 1 comment

#1745 - Git v0.21.1 rc0

Pull Request - State: closed - Opened by Narsil 5 months ago - 1 comment

#1744 - Git v0.21.1

Pull Request - State: closed - Opened by Narsil 5 months ago - 1 comment

#1743 - Request for latest version release (for rustls Support)

Issue - State: closed - Opened by Femure 5 months ago

#1741 - Cannot import name '__version__' from 'tokenizers.tokenizers'

Issue - State: open - Opened by ArthurAardvark 5 months ago - 2 comments

#1740 - creating custom tokenizer models in python

Issue - State: closed - Opened by ctruexcytiva 5 months ago - 2 comments

#1739 - replace lazy_static with stabilized std::sync::LazyLock in 1.80

Pull Request - State: closed - Opened by sftse 5 months ago - 2 comments

#1738 - ERROR occurs when running "tokenizer._tokenizer.model.clear_cache()"

Issue - State: open - Opened by nixonjin 5 months ago - 2 comments

#1737 - Use ApiBuilder::from_env() in from_pretrained function

Pull Request - State: closed - Opened by BenLocal 5 months ago - 1 comment

#1736 - Consider cutting a release

Issue - State: closed - Opened by torymur 5 months ago - 1 comment

#1735 - running apply_chat_template is VERY slow

Issue - State: closed - Opened by AaronZLT 5 months ago

#1734 - Python 3.13t

Issue - State: closed - Opened by btakita 5 months ago - 1 comment

#1733 - Add FxHash and ShortStringOptimization.

Pull Request - State: open - Opened by MeetThePatel 6 months ago - 5 comments

#1732 - Add rustls-tls feature

Pull Request - State: closed - Opened by torymur 6 months ago - 1 comment

#1730 - Slow compile times

Issue - State: open - Opened by 414owen 6 months ago

#1729 - Building without `onig` feature fails

Issue - State: closed - Opened by 414owen 6 months ago - 5 comments

#1727 - Is there a way to remap token IDs?

Issue - State: closed - Opened by cptspacemanspiff 6 months ago - 1 comment

#1726 - Thread safe?

Issue - State: open - Opened by drupol 6 months ago - 6 comments

#1725 - Add `with_sequence` for decode stream

Pull Request - State: open - Opened by ArthurZucker 6 months ago - 2 comments