Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / huggingface/tokenizers issues and pull requests
#914 - Loading of Tokenizer is really slow when there are lots of additional tokens
Issue -
State: closed - Opened by PiercarloSlavazza over 2 years ago
- 9 comments
#911 - Tokenizers for Node 16?
Issue -
State: closed - Opened by etiennelunetta almost 3 years ago
- 8 comments
Labels: Stale
#909 - Merges cannot handle tokens containing spaces.
Pull Request -
State: closed - Opened by Narsil almost 3 years ago
- 17 comments
#905 - tokenizers.models.BPE loses whitespace with GPT-2 pretrained vocab & merges
Issue -
State: closed - Opened by umbra-scientia almost 3 years ago
- 2 comments
Labels: Stale
#903 - Some questions about building a tokenizer from scratch: vocab size can't decide actual vocab size and token order unstable.
Issue -
State: closed - Opened by catqaq almost 3 years ago
- 12 comments
Labels: Stale
#902 - How to Suppress "Using bos_token, but it is not set yet..." in HuggingFace T5 Tokenizer
Issue -
State: closed - Opened by xsys-technology almost 3 years ago
- 5 comments
Labels: Stale
#900 - get_vocab_size() is different from len(get_vocab())
Issue -
State: closed - Opened by anatoly-khomenko almost 3 years ago
- 2 comments
Labels: Stale
#899 - Error: ThreadPoolBuildError
Issue -
State: closed - Opened by HarshkumarP almost 3 years ago
- 4 comments
#898 - Missing Version info in build from source installation instruction
Issue -
State: closed - Opened by leo-du almost 3 years ago
- 6 comments
Labels: Stale
#892 - Wrong alignments after calling `NormalizedString.replace()`
Issue -
State: closed - Opened by t-yamamura almost 3 years ago
- 11 comments
Labels: Stale
#888 - PanicException For Result::unwarp()
Issue -
State: closed - Opened by Namco0816 almost 3 years ago
- 5 comments
Labels: Stale
#886 - vocab_size issue with Whitespace pre_tokenizer
Issue -
State: closed - Opened by ctoraman almost 3 years ago
- 10 comments
Labels: Stale
#880 - Inference widget errors on tokenizer load with "data did not match any variant of untagged enum PyPreTokenizerTypeWrapper"
Issue -
State: closed - Opened by yhavinga almost 3 years ago
- 5 comments
#879 - [TBD] add a feature to continue training a tokenizer
Issue -
State: closed - Opened by SaulLu almost 3 years ago
- 3 comments
Labels: Stale
#876 - pyo3_runtime.PanicException: Missing additional token
Issue -
State: closed - Opened by chinoll almost 3 years ago
- 5 comments
Labels: Stale
#875 - Count number of tokens toeknizer might produce without really tokenizing?
Issue -
State: closed - Opened by xrkk almost 3 years ago
- 4 comments
Labels: Stale
#874 - compile error when installing versions 0.9.2 or 0.8.1.rc2
Issue -
State: closed - Opened by bth5032 almost 3 years ago
- 9 comments
Labels: Stale
#873 - Add a `Sequence` to the processors
Issue -
State: closed - Opened by SaulLu almost 3 years ago
- 10 comments
Labels: enhancement, Stale
#855 - Regex Capture Group?
Issue -
State: closed - Opened by SamuelLarkin almost 3 years ago
- 4 comments
Labels: Stale, Feature Request
#852 - Question about offsets returned for text starting with a space when using `add_prefix_space=True` and `trim_offsets=True`
Issue -
State: closed - Opened by SaulLu almost 3 years ago
- 4 comments
Labels: Stale
#849 - Sampling alternative tokenizations
Issue -
State: closed - Opened by david-waterworth almost 3 years ago
- 8 comments
#842 - JVM - Add bindings with Java API
Pull Request -
State: closed - Opened by nguyenvietyen almost 3 years ago
- 2 comments
Labels: Stale
#839 - Fine-tune a BPE tokenize by only adding merge rules
Issue -
State: closed - Opened by cccntu almost 3 years ago
- 7 comments
Labels: Stale
#836 - tokenizers.decoders.WordPiece contracts "do not" to "don't" in decoding
Issue -
State: closed - Opened by dhdaines almost 3 years ago
- 4 comments
Labels: Stale
#834 - About reproduction of Byt5
Issue -
State: closed - Opened by fengyuanyu1 almost 3 years ago
- 1 comment
Labels: Stale
#828 - Character Based Model
Issue -
State: closed - Opened by david-waterworth about 3 years ago
- 1 comment
Labels: Stale
#826 - After extra tokens are added, decoded results has no whitespace between tokens
Issue -
State: closed - Opened by YiweiJiang2015 about 3 years ago
- 13 comments
Labels: Stale
#821 - Tokenizer throwing PanicException
Issue -
State: closed - Opened by tanmaylaud about 3 years ago
- 15 comments
Labels: Stale
#817 - Issue with space tokens + BPE tokenizer
Issue -
State: closed - Opened by sdtblck about 3 years ago
- 4 comments
Labels: Stale
#816 - unexpected behaviour byte char level tokenizer
Issue -
State: closed - Opened by glample about 3 years ago
- 5 comments
Labels: Stale
#815 - Pre-trainined German tokenizers for BPE or Subword embeddings?
Issue -
State: closed - Opened by yuvaraj91 about 3 years ago
- 2 comments
Labels: Stale
#814 - Tokenizer training sticks for long times.
Issue -
State: closed - Opened by lancelotzty about 3 years ago
- 26 comments
Labels: Stale
#811 - Include return type annotation in `Encoding` class properties?
Issue -
State: closed - Opened by willfrey about 3 years ago
- 2 comments
Labels: Stale
#809 - No processors.Sequence
Issue -
State: closed - Opened by felix-schneider about 3 years ago
- 2 comments
Labels: Stale
#808 - Inconsistencies between documentation and API
Issue -
State: closed - Opened by felix-schneider about 3 years ago
- 4 comments
Labels: Stale
#805 - Unable to build due to compilation errors in a dependency
Issue -
State: closed - Opened by Michael-F-Bryan about 3 years ago
- 1 comment
Labels: Stale
#804 - Support tokenizing more than two sequences
Issue -
State: closed - Opened by cemilcengiz about 3 years ago
- 12 comments
Labels: Stale
#802 - tokenizer decode not joining words with continuing_subword_prefix
Issue -
State: closed - Opened by wei-ann-Github about 3 years ago
- 2 comments
Labels: Stale
#800 - Adding doc_stride while preprocessing the data for Question Answering
Issue -
State: closed - Opened by IamSparky about 3 years ago
- 1 comment
Labels: Stale
#796 - unigram.json to transformers bert tokenizer
Issue -
State: closed - Opened by sooftware about 3 years ago
- 1 comment
Labels: Stale
#791 - ```word_to_tokens``` is incorrect with passing in sentence with certain punctuation.
Issue -
State: closed - Opened by lorr1 about 3 years ago
- 1 comment
Labels: Stale
#790 - "return_special_tokens_mask" does not mask new tokens when added via "add_special_tokens"
Issue -
State: closed - Opened by vgoklani about 3 years ago
- 5 comments
Labels: Stale
#786 - BERT tokenizer split words with no "##" prefix after adding vocab
Issue -
State: closed - Opened by yqxie-inst about 3 years ago
- 1 comment
Labels: Stale
#785 - Tokenizer not adding "BOS", "EOS" tokens
Issue -
State: closed - Opened by brand17 about 3 years ago
- 2 comments
Labels: Stale
#784 - Python bindings tests fail on macos, python 3.8+
Issue -
State: closed - Opened by risicle about 3 years ago
- 1 comment
Labels: Stale
#783 - Byte-Level BPE: Splitting a sequence of token ids on spaces
Issue -
State: closed - Opened by quasimik about 3 years ago
- 1 comment
Labels: Stale
#782 - ValueError: ZIP does not support timestamps before 1980 while installing
Issue -
State: closed - Opened by leo848 about 3 years ago
- 1 comment
Labels: Stale
#781 - Do I need all the corpura to train the tokenizer?
Issue -
State: closed - Opened by allanj about 3 years ago
- 3 comments
Labels: Stale
#779 - [Feature request] Add direction parameter for truncation
Issue -
State: closed - Opened by NielsRogge over 3 years ago
- 2 comments
Labels: Stale
#777 - PretrainedTokenizerFast from Tokenizer Does Not Keep The Same Properties?
Issue -
State: closed - Opened by FeryET over 3 years ago
- 4 comments
Labels: Stale
#772 - Discussion about the behavior of the `add_tokens` method
Issue -
State: open - Opened by SaulLu over 3 years ago
- 1 comment
Labels: Stale
#760 - I can't train a tokenizer from pre-tokenized word counts in Python
Issue -
State: closed - Opened by bskaggs over 3 years ago
- 7 comments
Labels: Stale
#757 - How to use the pre training model offline?
Issue -
State: open - Opened by dongteng over 3 years ago
- 2 comments
Labels: Stale
#754 - How to set Unicode codes as smallest byte representation in BPETrainer?
Issue -
State: open - Opened by fake-warrior8 over 3 years ago
- 3 comments
Labels: Stale
#753 - Unclear parameter in Custom PreTokenizer
Issue -
State: open - Opened by thomwolf over 3 years ago
- 2 comments
Labels: Stale
#748 - UnigramTrainer forgets the model unk_id
Issue -
State: open - Opened by sgugger over 3 years ago
- 2 comments
Labels: Stale
#747 - Questions on modifying a vocabulary vs. training a LM from scratch
Issue -
State: closed - Opened by brijow over 3 years ago
- 7 comments
#745 - ByteLevelBPETokenizer does not merge as much as it could
Issue -
State: open - Opened by JellePiepenbrock over 3 years ago
- 2 comments
Labels: Stale
#744 - Transformers CLI converting TF model to pytorch, error saving pytorch model
Issue -
State: open - Opened by anandmg101 over 3 years ago
- 1 comment
Labels: Stale
#741 - How to use the latest pretrained tokenizer?
Issue -
State: closed - Opened by LXXiaogege over 3 years ago
- 1 comment
Labels: Stale
#739 - DistilBert tokenizer different tokenization
Issue -
State: closed - Opened by suggestedUsername over 3 years ago
- 1 comment
Labels: Stale
#736 - normalizers.Replace followed by normalizers.BertNormalizer raise PanicException
Issue -
State: open - Opened by bab2min over 3 years ago
- 1 comment
Labels: Stale
#733 - Error with tokenizing a pd dataframe object
Issue -
State: closed - Opened by ShivanshuPurohit over 3 years ago
- 1 comment
Labels: Stale
#730 - Subword regularization
Issue -
State: open - Opened by nikitakit over 3 years ago
- 2 comments
Labels: Stale
#729 - AttributeError: type object 'tokenizers.models.Unigram' has no attribute 'from_file'
Issue -
State: open - Opened by nikitakit over 3 years ago
- 2 comments
Labels: Stale
#725 - Incorrect offsets for special tokens?
Issue -
State: closed - Opened by david-waterworth over 3 years ago
- 2 comments
Labels: Stale
#724 - Weird whitespace token while encoding using tokenizer
Issue -
State: closed - Opened by PiotrNawrot over 3 years ago
- 1 comment
Labels: Stale