GitHub / huggingface/tokenizers issues and pull requests
#1825 - Proposal: Replace regex in `whitespace.rs` with manual code for speed improvements
Issue -
State: open - Opened by 8ria 21 days ago
#1824 - Practical limits of BPE tokenizer training
Issue -
State: closed - Opened by dvruette 25 days ago
- 5 comments
#1823 - how can i only initiate one instance of tokenizer but use it in multi-process?
Issue -
State: open - Opened by wangguanggg 26 days ago
- 2 comments
#1822 - Faster Whitespace PreTokenizer (Drop-in Replacement)
Pull Request -
State: open - Opened by 8ria 26 days ago
#1821 - Proposal: Faster `Whitespace` PreTokenizer Implementation (10–30% Speedup)
Issue -
State: open - Opened by 8ria 26 days ago
- 2 comments
#1820 - Cannot download test data: 'make test' and direct links fail with "Repository not found" / 404
Issue -
State: open - Opened by 8ria 26 days ago
#1819 - Bug with `add_prefix_space` parameter for ByteLevel post-processor
Issue -
State: open - Opened by megatron6000 28 days ago
#1818 - Clippy fixes.
Pull Request -
State: closed - Opened by Narsil 29 days ago
- 1 comment
#1817 - Inconsistent encoding of unknown tokens when they have special tokens as prefixes in the WordLevel Tokenizer
Issue -
State: open - Opened by diego-andres-ardila 29 days ago
#1816 - tokenizers cannot be compiled successfully on riscv machine
Issue -
State: closed - Opened by zengqingfu1442 about 1 month ago
- 5 comments
#1815 - Needs minor version release
Issue -
State: open - Opened by sftse about 1 month ago
- 2 comments
#1814 - Create vendor.yml
Pull Request -
State: closed - Opened by bloodwass about 1 month ago
- 1 comment
#1813 - Is it possible to return the count of truncated tokens?
Issue -
State: open - Opened by leejuyuu about 1 month ago
#1812 - Is it possible for this library to provide WebAssembly support?
Issue -
State: closed - Opened by phosae about 1 month ago
#1811 - Avoiding '�' in decoded output when decoding token by token (currently testing with GPT-2)?
Issue -
State: closed - Opened by jchwenger about 1 month ago
- 7 comments
#1810 - Building Rust numpy v0.23 dependency fails on Mac OS Sequoia
Issue -
State: closed - Opened by lordfeck about 1 month ago
- 4 comments
#1809 - Add 3.13t CI using pytest-run-parallel
Pull Request -
State: open - Opened by ngoldbaum about 1 month ago
#1808 - Fix typo in README
Pull Request -
State: open - Opened by aisk about 1 month ago
#1807 - error: casting &T to &mut T is undefined behavior #1485
Issue -
State: open - Opened by Fshrink about 1 month ago
#1806 - Track lockfile
Pull Request -
State: open - Opened by sftse about 1 month ago
#1805 - Fix: Refactor BpeTrainer to use chunked pair counting
Pull Request -
State: closed - Opened by MeryylleA about 1 month ago
#1804 - Adding multiprocessing for sentencepiece_extractor
Pull Request -
State: open - Opened by AamodThakur about 1 month ago
#1803 - Proposal for Optimizing transformers.BertTokenizerFast
Issue -
State: open - Opened by springkim 4 months ago
- 4 comments
Labels: Feature Request
#1802 - Hotfixing the stub.
Pull Request -
State: closed - Opened by Narsil about 2 months ago
- 2 comments
#1801 - Upgrading dependencies.
Pull Request -
State: closed - Opened by Narsil about 2 months ago
- 1 comment
#1800 - Adding throughput to benches to have a more consistent measure across
Pull Request -
State: closed - Opened by Narsil about 2 months ago
- 1 comment
#1799 - Consolidated optimization ahash dary compact str
Pull Request -
State: closed - Opened by Narsil about 2 months ago
- 1 comment
#1798 - Fix features blending into a paragraph
Pull Request -
State: closed - Opened by bionicles about 2 months ago
#1797 - pyo3 async bindings for encode/decode in rust
Issue -
State: open - Opened by michaelfeil about 2 months ago
- 1 comment
#1796 - Bump brace-expansion from 1.1.11 to 1.1.12 in /bindings/node
Pull Request -
State: closed - Opened by dependabot[bot] about 2 months ago
- 1 comment
Labels: dependencies, javascript
#1795 - Bugs encountered during training
Issue -
State: open - Opened by gitterlover about 2 months ago
#1794 - `WordPieceTrainer.train_from_iterator` is not deterministic
Issue -
State: open - Opened by Tialo about 2 months ago
#1793 - `end_of_word_suffix` is ignored
Issue -
State: open - Opened by Tialo about 2 months ago
- 1 comment
#1792 - Bump webpack-dev-server from 4.10.0 to 5.2.1 in /tokenizers/examples/unstable_wasm/www
Pull Request -
State: closed - Opened by dependabot[bot] about 2 months ago
- 1 comment
Labels: dependencies, javascript
#1791 - SyntaxWarning: invalid escape sequence '\w'
Issue -
State: open - Opened by wyattscarpenter about 2 months ago
#1790 - Regression in THUDM/chatglm3-6b at least starting from **tokenizers-0.21.1**
Issue -
State: open - Opened by pavel-esir about 2 months ago
- 1 comment
#1789 - Expose `Encoding` attributes via the buffer protocol interface
Pull Request -
State: open - Opened by mariosasko about 2 months ago
- 3 comments
#1788 - add group capture to replace
Pull Request -
State: open - Opened by cboseak about 2 months ago
#1787 - Hi, we need Java/Scala tokenizers bindings
Issue -
State: open - Opened by mullerhai 2 months ago
- 2 comments
#1786 - Behavior differences between non-fast and fast versions of the same tokenizer
Issue -
State: closed - Opened by jiayuanmark 2 months ago
- 1 comment
#1785 - [docs] Whitespace
Pull Request -
State: closed - Opened by stevhliu 2 months ago
- 6 comments
#1784 - Add cp313t tests, mark extension modules as compatible, and ship free-threaded wheels
Issue -
State: open - Opened by ngoldbaum 2 months ago
- 2 comments
#1783 - Add Truncate pre-tokenizer
Pull Request -
State: open - Opened by ArthurZucker 2 months ago
#1782 - Add benchmark for deserializing large added vocab + optimizations
Pull Request -
State: open - Opened by ArthurZucker 2 months ago
- 1 comment
#1781 - clippy
Pull Request -
State: closed - Opened by ArthurZucker 2 months ago
- 1 comment
#1780 - Update decode stream api
Pull Request -
State: open - Opened by ArthurZucker 2 months ago
- 1 comment
#1779 - Possible (minor) inconsistency in `unk_token` argument between WordPiece and UnigramLM
Issue -
State: open - Opened by pietrolesci 2 months ago
- 1 comment
#1778 - Fix ApiBuilder in from_pretrained to use env variables
Pull Request -
State: closed - Opened by vdebergue 2 months ago
- 5 comments
#1777 - Suggestion to Clarify WordPiece Documentation
Issue -
State: open - Opened by pietrolesci 3 months ago
- 1 comment
#1776 - Tokenizers fail to build due to Rust dependency
Issue -
State: closed - Opened by AmmarkoV 3 months ago
- 2 comments
#1775 - DebertaV2TokenizerFast and XLMRobertaTokenizerFast has overlapping offsets/CharSpans which leads to char_to_token() pointing to unexpected token
Issue -
State: open - Opened by ligz08 3 months ago
#1774 - Update pyo3 and rust-numpy depends for no-gil/free-threading compat
Pull Request -
State: open - Opened by Qubitium 3 months ago
#1773 - Property 'pre_tokenizer' cannot be set
Issue -
State: open - Opened by Masquito 3 months ago
- 2 comments
#1772 - Fix no-onig no-wasm builds
Pull Request -
State: closed - Opened by 414owen 3 months ago
- 3 comments
#1771 - Upgrade onig, to get it compiling with GCC 15
Pull Request -
State: closed - Opened by 414owen 3 months ago
- 5 comments
#1770 - Fix typos in strings and comments
Pull Request -
State: closed - Opened by co63oc 3 months ago
#1769 - Return_offsets_mapping when decoding
Issue -
State: closed - Opened by Boltzmachine 3 months ago
- 1 comment
#1768 - How to debug tokenizers with python?
Issue -
State: open - Opened by JinJieGan 3 months ago
- 1 comment
#1767 - Support free-threaded Python and ship 3.13t wheels
Issue -
State: closed - Opened by ngoldbaum 3 months ago
- 9 comments
#1766 - Fix type notation of merges in BPE Python binding
Pull Request -
State: closed - Opened by Coqueue 3 months ago
#1765 - Patching version 0.10.3 to fix invalid_reference_casting for legacy projects
Issue -
State: open - Opened by KTRosenberg 3 months ago
- 1 comment
#1764 - Update __init__.pyi: fix 525: SyntaxWarning: invalid escape sequence '\w'
Pull Request -
State: closed - Opened by wyattscarpenter 4 months ago
- 2 comments
#1763 - Make unigram cache optional
Pull Request -
State: open - Opened by wangrunji0408 4 months ago
#1762 - Bump http-proxy-middleware from 2.0.6 to 2.0.9 in /tokenizers/examples/unstable_wasm/www
Pull Request -
State: closed - Opened by dependabot[bot] 4 months ago
- 1 comment
Labels: dependencies, javascript
#1761 - Trying to load slow tokenizer in Rust
Issue -
State: closed - Opened by mayocream 4 months ago
- 1 comment
#1760 - normalizers.Replace able to support regex group capture
Issue -
State: open - Opened by nrv 4 months ago
#1759 - AttributeError: module 'decoders' has no attribute 'DecodeStream'
Issue -
State: open - Opened by scattw 4 months ago
- 5 comments
#1758 - Implement `from_bytes` and `read_bytes` Methods in WordPiece Tokenizer for WebAssembly Compatibility
Pull Request -
State: open - Opened by sondalex 4 months ago
- 1 comment
#1757 - Tokenizer encode and decode get different token ids and text
Issue -
State: closed - Opened by liho00 4 months ago
- 1 comment
#1756 - Itertools upgrade
Pull Request -
State: closed - Opened by sftse 4 months ago
- 3 comments
#1755 - Implement Append normalizer
Pull Request -
State: open - Opened by austinleedavis 4 months ago
#1754 - Ability to get `Ġ` (encoded space) with `Tokenizer.decode`
Issue -
State: closed - Opened by jamesbraza 4 months ago
- 3 comments
#1753 - Pre-tokenizers that support multi-word/non-whitespace BPE in single pass
Pull Request -
State: open - Opened by mjbommar 4 months ago
- 2 comments
#1752 - Switch to FXHash
Pull Request -
State: open - Opened by MeetThePatel 5 months ago
- 1 comment
#1751 - Proposal: Add Golang Bindings for tokenizers
Issue -
State: open - Opened by Nav31 5 months ago
- 3 comments
Labels: Feature Request
#1750 - Update dependency versions to fix NoGIL Python package install
Pull Request -
State: open - Opened by vinayakdsci 5 months ago
#1749 - Tokenizers fails to build in Python3.13t (NoGIL build)
Issue -
State: closed - Opened by vinayakdsci 5 months ago
- 2 comments
#1748 - tokenizer memory allocation of 1179664 bytes failed
Issue -
State: closed - Opened by TonFard 5 months ago
#1747 - Fix data path in test_continuing_prefix_trainer_mismatch
Pull Request -
State: closed - Opened by GaetanLepage 5 months ago
- 1 comment
#1746 - Update the release builds following 0.21.1.
Pull Request -
State: closed - Opened by Narsil 5 months ago
- 1 comment
#1745 - Git v0.21.1 rc0
Pull Request -
State: closed - Opened by Narsil 5 months ago
- 1 comment
#1744 - Git v0.21.1
Pull Request -
State: closed - Opened by Narsil 5 months ago
- 1 comment
#1743 - Request for latest version release (for rustls Support)
Issue -
State: closed - Opened by Femure 5 months ago
#1742 - Cannot tokenize byte sequences that are not valid UTF-8 due to design flaw
Issue -
State: closed - Opened by sharpobject 5 months ago
- 7 comments
#1741 - Cannot import name '__version__' from 'tokenizers.tokenizers'
Issue -
State: open - Opened by ArthurAardvark 5 months ago
- 2 comments
#1740 - creating custom tokenizer models in python
Issue -
State: closed - Opened by ctruexcytiva 5 months ago
- 2 comments
#1739 - replace lazy_static with stabilized std::sync::LazyLock in 1.80
Pull Request -
State: closed - Opened by sftse 5 months ago
- 2 comments
#1738 - ERROR occurs when running "tokenizer._tokenizer.model.clear_cache()"
Issue -
State: open - Opened by nixonjin 5 months ago
- 2 comments
#1737 - Use ApiBuilder::from_env() in from_pretrained function
Pull Request -
State: closed - Opened by BenLocal 5 months ago
- 1 comment
#1736 - Consider cutting a release
Issue -
State: closed - Opened by torymur 5 months ago
- 1 comment
#1735 - running apply_chat_template is VERY slow
Issue -
State: closed - Opened by AaronZLT 5 months ago
#1734 - Python 3.13t
Issue -
State: closed - Opened by btakita 5 months ago
- 1 comment
#1733 - Add FxHash and ShortStringOptimization.
Pull Request -
State: open - Opened by MeetThePatel 6 months ago
- 5 comments
#1732 - Add rustls-tls feature
Pull Request -
State: closed - Opened by torymur 6 months ago
- 1 comment
#1731 - Fails to build from source with GCC 15 due to mismatched function declarations
Issue -
State: open - Opened by glaubitz 6 months ago
- 5 comments
#1730 - Slow compile times
Issue -
State: open - Opened by 414owen 6 months ago
#1729 - Building without `onig` feature fails
Issue -
State: closed - Opened by 414owen 6 months ago
- 5 comments
#1727 - Is there a way to remap token IDs?
Issue -
State: closed - Opened by cptspacemanspiff 6 months ago
- 1 comment
#1726 - Thread safe?
Issue -
State: open - Opened by drupol 6 months ago
- 6 comments
#1725 - Add `with_sequence` for decode stream
Pull Request -
State: open - Opened by ArthurZucker 6 months ago
- 2 comments