Ecosyste.ms: Issues

An open API service providing issue and pull request metadata for open source projects.

GitHub / huggingface/tokenizers issues and pull requests

#911 - Tokenizers for Node 16?

Issue - State: closed - Opened by etiennelunetta almost 3 years ago - 8 comments
Labels: Stale

#909 - Merges cannot handle tokens containing spaces.

Pull Request - State: closed - Opened by Narsil almost 3 years ago - 17 comments

#905 - tokenizers.models.BPE loses whitespace with GPT-2 pretrained vocab & merges

Issue - State: closed - Opened by umbra-scientia almost 3 years ago - 2 comments
Labels: Stale

#902 - How to Suppress "Using bos_token, but it is not set yet..." in HuggingFace T5 Tokenizer

Issue - State: closed - Opened by xsys-technology almost 3 years ago - 5 comments
Labels: Stale

#900 - get_vocab_size() is different from len(get_vocab())

Issue - State: closed - Opened by anatoly-khomenko almost 3 years ago - 2 comments
Labels: Stale

#899 - Error: ThreadPoolBuildError

Issue - State: closed - Opened by HarshkumarP almost 3 years ago - 4 comments

#898 - Missing Version info in build from source installation instruction

Issue - State: closed - Opened by leo-du almost 3 years ago - 6 comments
Labels: Stale

#892 - Wrong alignments after calling `NormalizedString.replace()`

Issue - State: closed - Opened by t-yamamura almost 3 years ago - 11 comments
Labels: Stale

#888 - PanicException For Result::unwarp()

Issue - State: closed - Opened by Namco0816 almost 3 years ago - 5 comments
Labels: Stale

#886 - vocab_size issue with Whitespace pre_tokenizer

Issue - State: closed - Opened by ctoraman almost 3 years ago - 10 comments
Labels: Stale

#879 - [TBD] add a feature to continue training a tokenizer

Issue - State: closed - Opened by SaulLu almost 3 years ago - 3 comments
Labels: Stale

#876 - pyo3_runtime.PanicException: Missing additional token

Issue - State: closed - Opened by chinoll almost 3 years ago - 5 comments
Labels: Stale

#875 - Count number of tokens toeknizer might produce without really tokenizing?

Issue - State: closed - Opened by xrkk almost 3 years ago - 4 comments
Labels: Stale

#874 - compile error when installing versions 0.9.2 or 0.8.1.rc2

Issue - State: closed - Opened by bth5032 almost 3 years ago - 9 comments
Labels: Stale

#873 - Add a `Sequence` to the processors

Issue - State: closed - Opened by SaulLu almost 3 years ago - 10 comments
Labels: enhancement, Stale

#855 - Regex Capture Group?

Issue - State: closed - Opened by SamuelLarkin almost 3 years ago - 4 comments
Labels: Stale, Feature Request

#849 - Sampling alternative tokenizations

Issue - State: closed - Opened by david-waterworth almost 3 years ago - 8 comments

#842 - JVM - Add bindings with Java API

Pull Request - State: closed - Opened by nguyenvietyen almost 3 years ago - 2 comments
Labels: Stale

#839 - Fine-tune a BPE tokenize by only adding merge rules

Issue - State: closed - Opened by cccntu almost 3 years ago - 7 comments
Labels: Stale

#836 - tokenizers.decoders.WordPiece contracts "do not" to "don't" in decoding

Issue - State: closed - Opened by dhdaines almost 3 years ago - 4 comments
Labels: Stale

#834 - About reproduction of Byt5

Issue - State: closed - Opened by fengyuanyu1 almost 3 years ago - 1 comment
Labels: Stale

#828 - Character Based Model

Issue - State: closed - Opened by david-waterworth about 3 years ago - 1 comment
Labels: Stale

#826 - After extra tokens are added, decoded results has no whitespace between tokens

Issue - State: closed - Opened by YiweiJiang2015 about 3 years ago - 13 comments
Labels: Stale

#821 - Tokenizer throwing PanicException

Issue - State: closed - Opened by tanmaylaud about 3 years ago - 15 comments
Labels: Stale

#817 - Issue with space tokens + BPE tokenizer

Issue - State: closed - Opened by sdtblck about 3 years ago - 4 comments
Labels: Stale

#816 - unexpected behaviour byte char level tokenizer

Issue - State: closed - Opened by glample about 3 years ago - 5 comments
Labels: Stale

#815 - Pre-trainined German tokenizers for BPE or Subword embeddings?

Issue - State: closed - Opened by yuvaraj91 about 3 years ago - 2 comments
Labels: Stale

#814 - Tokenizer training sticks for long times.

Issue - State: closed - Opened by lancelotzty about 3 years ago - 26 comments
Labels: Stale

#811 - Include return type annotation in `Encoding` class properties?

Issue - State: closed - Opened by willfrey about 3 years ago - 2 comments
Labels: Stale

#809 - No processors.Sequence

Issue - State: closed - Opened by felix-schneider about 3 years ago - 2 comments
Labels: Stale

#808 - Inconsistencies between documentation and API

Issue - State: closed - Opened by felix-schneider about 3 years ago - 4 comments
Labels: Stale

#805 - Unable to build due to compilation errors in a dependency

Issue - State: closed - Opened by Michael-F-Bryan about 3 years ago - 1 comment
Labels: Stale

#804 - Support tokenizing more than two sequences

Issue - State: closed - Opened by cemilcengiz about 3 years ago - 12 comments
Labels: Stale

#802 - tokenizer decode not joining words with continuing_subword_prefix

Issue - State: closed - Opened by wei-ann-Github about 3 years ago - 2 comments
Labels: Stale

#800 - Adding doc_stride while preprocessing the data for Question Answering

Issue - State: closed - Opened by IamSparky about 3 years ago - 1 comment
Labels: Stale

#796 - unigram.json to transformers bert tokenizer

Issue - State: closed - Opened by sooftware about 3 years ago - 1 comment
Labels: Stale

#791 - ```word_to_tokens``` is incorrect with passing in sentence with certain punctuation.

Issue - State: closed - Opened by lorr1 about 3 years ago - 1 comment
Labels: Stale

#790 - "return_special_tokens_mask" does not mask new tokens when added via "add_special_tokens"

Issue - State: closed - Opened by vgoklani about 3 years ago - 5 comments
Labels: Stale

#786 - BERT tokenizer split words with no "##" prefix after adding vocab

Issue - State: closed - Opened by yqxie-inst about 3 years ago - 1 comment
Labels: Stale

#785 - Tokenizer not adding "BOS", "EOS" tokens

Issue - State: closed - Opened by brand17 about 3 years ago - 2 comments
Labels: Stale

#784 - Python bindings tests fail on macos, python 3.8+

Issue - State: closed - Opened by risicle about 3 years ago - 1 comment
Labels: Stale

#783 - Byte-Level BPE: Splitting a sequence of token ids on spaces

Issue - State: closed - Opened by quasimik about 3 years ago - 1 comment
Labels: Stale

#782 - ValueError: ZIP does not support timestamps before 1980 while installing

Issue - State: closed - Opened by leo848 about 3 years ago - 1 comment
Labels: Stale

#781 - Do I need all the corpura to train the tokenizer?

Issue - State: closed - Opened by allanj about 3 years ago - 3 comments
Labels: Stale

#779 - [Feature request] Add direction parameter for truncation

Issue - State: closed - Opened by NielsRogge over 3 years ago - 2 comments
Labels: Stale

#777 - PretrainedTokenizerFast from Tokenizer Does Not Keep The Same Properties?

Issue - State: closed - Opened by FeryET over 3 years ago - 4 comments
Labels: Stale

#772 - Discussion about the behavior of the `add_tokens` method

Issue - State: open - Opened by SaulLu over 3 years ago - 1 comment
Labels: Stale

#760 - I can't train a tokenizer from pre-tokenized word counts in Python

Issue - State: closed - Opened by bskaggs over 3 years ago - 7 comments
Labels: Stale

#757 - How to use the pre training model offline?

Issue - State: open - Opened by dongteng over 3 years ago - 2 comments
Labels: Stale

#754 - How to set Unicode codes as smallest byte representation in BPETrainer?

Issue - State: open - Opened by fake-warrior8 over 3 years ago - 3 comments
Labels: Stale

#753 - Unclear parameter in Custom PreTokenizer

Issue - State: open - Opened by thomwolf over 3 years ago - 2 comments
Labels: Stale

#748 - UnigramTrainer forgets the model unk_id

Issue - State: open - Opened by sgugger over 3 years ago - 2 comments
Labels: Stale

#747 - Questions on modifying a vocabulary vs. training a LM from scratch

Issue - State: closed - Opened by brijow over 3 years ago - 7 comments

#745 - ByteLevelBPETokenizer does not merge as much as it could

Issue - State: open - Opened by JellePiepenbrock over 3 years ago - 2 comments
Labels: Stale

#744 - Transformers CLI converting TF model to pytorch, error saving pytorch model

Issue - State: open - Opened by anandmg101 over 3 years ago - 1 comment
Labels: Stale

#741 - How to use the latest pretrained tokenizer?

Issue - State: closed - Opened by LXXiaogege over 3 years ago - 1 comment
Labels: Stale

#739 - DistilBert tokenizer different tokenization

Issue - State: closed - Opened by suggestedUsername over 3 years ago - 1 comment
Labels: Stale

#736 - normalizers.Replace followed by normalizers.BertNormalizer raise PanicException

Issue - State: open - Opened by bab2min over 3 years ago - 1 comment
Labels: Stale

#733 - Error with tokenizing a pd dataframe object

Issue - State: closed - Opened by ShivanshuPurohit over 3 years ago - 1 comment
Labels: Stale

#730 - Subword regularization

Issue - State: open - Opened by nikitakit over 3 years ago - 2 comments
Labels: Stale

#729 - AttributeError: type object 'tokenizers.models.Unigram' has no attribute 'from_file'

Issue - State: open - Opened by nikitakit over 3 years ago - 2 comments
Labels: Stale

#725 - Incorrect offsets for special tokens?

Issue - State: closed - Opened by david-waterworth over 3 years ago - 2 comments
Labels: Stale

#724 - Weird whitespace token while encoding using tokenizer

Issue - State: closed - Opened by PiotrNawrot over 3 years ago - 1 comment
Labels: Stale