Ecosyste.ms: Issues
An open API service that provides issue and pull request metadata for open source projects.
GitHub / google/sentencepiece issues and pull requests
#900 - Question - pad token on pretrained tokenizer
Issue -
State: closed - Opened by stellanhaglund over 1 year ago
- 1 comment
#899 - Tokenizing/sampling differently during different training epochs
Issue -
State: closed - Opened by lsy641 over 1 year ago
- 3 comments
#898 - Bad prompt continuation with trailing space
Issue -
State: closed - Opened by xefoci7612 over 1 year ago
#897 - error: subprocess-exited-with-error × Building wheel for sentencepiece (pyproject.toml) did not run successfully.
Issue -
State: closed - Opened by dv000-7 over 1 year ago
- 1 comment
#896 - Zero width non-joiner incorrectly mapped to space
Issue -
State: closed - Opened by angelaslin over 1 year ago
- 1 comment
#895 - How to check the tokenization method of a given `spm.model` ?
Issue -
State: closed - Opened by cyk1337 over 1 year ago
- 1 comment
Labels: duplicate
#894 - Right bigram index too large for uint16
Issue -
State: closed - Opened by Numeri over 1 year ago
- 3 comments
#893 - Suppress Warnings via Python API?
Issue -
State: closed - Opened by dgrahn over 1 year ago
- 4 comments
Labels: Will fix in next release
#892 - universal build release for c++ library
Issue -
State: closed - Opened by gianmarcohutter over 1 year ago
- 4 comments
Labels: feature request, Will fix in next release
#891 - Adding continuous `'<0x0A><0x0A>'` tokens to encode continuous `\n\n`
Issue -
State: closed - Opened by cyk1337 over 1 year ago
- 2 comments
#890 - The vocabulary order in BPE
Issue -
State: closed - Opened by cyk1337 over 1 year ago
- 1 comment
#889 - Generate `sentencepiece_model_pb2.py` with newer protobuf
Issue -
State: closed - Opened by ydshieh over 1 year ago
- 2 comments
#888 - set add_dummy_prefix according to the identified language
Issue -
State: closed - Opened by Life-0-1 over 1 year ago
- 2 comments
#887 - Very slow training on AMD CPU
Issue -
State: closed - Opened by kjhanjee over 1 year ago
- 6 comments
#886 - Can options from SentencePieceTrainer training be extracted from the tokenizer.model file?
Issue -
State: closed - Opened by DinhLuan14 over 1 year ago
- 2 comments
#885 - vocab size is smaller than required_chars
Issue -
State: closed - Opened by jiangix-paper over 1 year ago
- 1 comment
#884 - How does word regularization calculate the sampling score returned by the function "sample_encode_and_score"?
Issue -
State: closed - Opened by lsy641 over 1 year ago
- 1 comment
#883 - spm_train_main.cc(171) error
Issue -
State: closed - Opened by glide-the over 1 year ago
- 1 comment
#882 - RuntimeError Internal: src/sentencepiece_processor.cc(1101)
Issue -
State: closed - Opened by Feng-Jay over 1 year ago
- 1 comment
#881 - Duplicate tokens in BPE vocabulary
Issue -
State: open - Opened by astanic over 1 year ago
- 1 comment
#877 - There is always a "▁" before the first piece after tokenization, why?
Issue -
State: closed - Opened by hxs91 over 1 year ago
- 3 comments
Labels: duplicate
#876 - Patches carried by conda-forge for packaging sentencepiece
Issue -
State: open - Opened by h-vetinari over 1 year ago
- 3 comments
#875 - Use with LASER3
Issue -
State: closed - Opened by thyripian over 1 year ago
- 2 comments
#874 - Please add a Python interface for normalization (like `spm_normalize` in CLI)
Issue -
State: closed - Opened by avidale over 1 year ago
- 1 comment
Labels: feature request, Will fix in next release
#873 - Respecify the normalization rule for trained sentenpieces models
Issue -
State: closed - Opened by geolvr over 1 year ago
- 1 comment
#872 - why spm_encode keeps characters not in the vocab?
Issue -
State: closed - Opened by Smu-Tan over 1 year ago
- 1 comment
Labels: duplicate
#871 - How do I make a spiece.model for vanilla GPT2 BPE Tokenizer?
Issue -
State: closed - Opened by redthing1 over 1 year ago
- 1 comment
#870 - Fix overlinking with protobuf
Pull Request -
State: closed - Opened by ryandesign almost 2 years ago
#869 - Using SPM_USE_EXTERNAL_ABSL
Issue -
State: closed - Opened by ryandesign almost 2 years ago
- 2 comments
Labels: Will fix in next release
#868 - Update bundled protobuf-lite?
Issue -
State: closed - Opened by ryandesign almost 2 years ago
- 2 comments
#867 - Fix nasty bug in BPE position encoding
Pull Request -
State: closed - Opened by vmarkovtsev almost 2 years ago
#866 - Tokens Chunking to respect Language Word Boundaries
Issue -
State: open - Opened by loretoparisi almost 2 years ago
#865 - Python from source on armv7l raises ' undefined symbol: __atomic_fetch_add_8 '
Issue -
State: open - Opened by FrancescoScandiffio almost 2 years ago
- 2 comments
#864 - tokens taker
Issue -
State: closed - Opened by Hopom almost 2 years ago
- 1 comment
#863 - How about `▁` own escape?
Issue -
State: closed - Opened by fseasy almost 2 years ago
- 2 comments
#862 - segment fault
Issue -
State: closed - Opened by wlike almost 2 years ago
- 2 comments
#861 - RuntimeError occurs when T5Tokenizer is executed on big-endian platform
Issue -
State: closed - Opened by kiszk almost 2 years ago
- 3 comments
Labels: bug
#860 - Build failure at 0.1.98 or 0.1.99 on big-endian platform
Issue -
State: closed - Opened by kiszk almost 2 years ago
- 2 comments
#859 - Program terminated with an unrecoverable error.
Issue -
State: closed - Opened by RuslanSel almost 2 years ago
- 2 comments
#858 - Issue with XLM-RoBERTa tokenizer
Issue -
State: closed - Opened by ozanarmagan almost 2 years ago
- 1 comment
#857 - not able to use more than 128 threads
Issue -
State: closed - Opened by GradientGuru almost 2 years ago
- 1 comment
#856 - does it support bpe of clip model?
Issue -
State: closed - Opened by susht3 almost 2 years ago
- 1 comment
#855 - Code for tf-sentencepiece
Issue -
State: closed - Opened by lxomb almost 2 years ago
- 1 comment
#854 - Potential bug: sequential unk tokens are combined in word mode
Issue -
State: closed - Opened by smesham almost 2 years ago
- 2 comments
#853 - encode underline
Issue -
State: closed - Opened by yangsp5 almost 2 years ago
- 2 comments
#852 - Extending the byte_fallback option
Issue -
State: closed - Opened by chris-ha458 almost 2 years ago
- 1 comment
#851 - NaN unigram model score error with sentencepiece 0.1.98
Issue -
State: closed - Opened by lucaslingle almost 2 years ago
- 3 comments
Labels: bug
#850 - Error when loading LlamaTokenizer
Issue -
State: closed - Opened by creamiracle almost 2 years ago
- 1 comment
#849 - How to get the score of a token
Issue -
State: closed - Opened by WangJW424 almost 2 years ago
#847 - failing to install FastChat - missing sentencepiece_processor.h
Issue -
State: closed - Opened by sirnails almost 2 years ago
- 1 comment
Labels: duplicate
#846 - pip install failed on windows x64 with python 3.9
Issue -
State: closed - Opened by liqingjun123 almost 2 years ago
- 2 comments
#845 - Update sentencepiece_python_module_example.ipynb
Pull Request -
State: closed - Opened by chris-ha458 almost 2 years ago
- 2 comments
#844 - improve stable version documentation
Issue -
State: closed - Opened by grumpyp almost 2 years ago
- 1 comment
#843 - Potential bug in Unigram model
Issue -
State: closed - Opened by tsinggggg almost 2 years ago
- 2 comments
#842 - Only support 64bit?
Issue -
State: open - Opened by logicvv almost 2 years ago
- 2 comments
#841 - How to customize `input_format` except txt, tsv
Issue -
State: closed - Opened by cyk1337 almost 2 years ago
- 1 comment
#840 - [Question] Why include characters that are in the set of fallback bytes?
Issue -
State: closed - Opened by ZJaume almost 2 years ago
- 4 comments
#839 - Letters are split
Issue -
State: closed - Opened by wizardk almost 2 years ago
- 3 comments
#838 - Question about encode sentence pieces?
Issue -
State: closed - Opened by dsj96 almost 2 years ago
- 1 comment
#837 - Removed replacing of /MD with /MT for MSVC
Pull Request -
State: closed - Opened by ilya-lavrenov almost 2 years ago
#836 - Simulate Erros of common mistakes
Issue -
State: closed - Opened by darrkj2049 almost 2 years ago
- 1 comment
#835 - PyPI source distribution do not contain full sources
Issue -
State: closed - Opened by pkubik almost 2 years ago
- 4 comments
Labels: enhancement
#833 - spm_encode splits into letters when --vocabulary is provided
Issue -
State: closed - Opened by drunkinlove almost 2 years ago
- 5 comments
#832 - disregard
Issue -
State: closed - Opened by Technetium1 almost 2 years ago
#831 - undefined reference to `__atomic_fetch_add_8' while build in Raspberry Pi
Issue -
State: closed - Opened by yanghl12138 almost 2 years ago
- 1 comment
#830 - Is it possibile to extend a trained BPE model's merge operations?
Issue -
State: closed - Opened by pluiez almost 2 years ago
- 1 comment
#829 - Windows debug build fails to load anything, with “file not found” status code
Issue -
State: closed - Opened by Const-me almost 2 years ago
- 2 comments
Labels: bug
#828 - How can I transfer a SentencePieceProcessor object into a transformers AutoTokenizer?
Issue -
State: closed - Opened by nameless0704 almost 2 years ago
- 1 comment
#827 - Error building wheel
Issue -
State: closed - Opened by eobrien2002 almost 2 years ago
- 7 comments
Labels: enhancement
#826 - can sentencepiece python wrapper training with generator rather than files?
Issue -
State: closed - Opened by Maxlinn almost 2 years ago
- 1 comment
#825 - Usage of MSVC static runtime by default
Issue -
State: closed - Opened by ilya-lavrenov almost 2 years ago
- 4 comments
#824 - spm.SentencePieceTrainer.train stuck in Jupyter Notebook
Issue -
State: closed - Opened by shivanraptor almost 2 years ago
- 2 comments
#823 - Add interface for C# calling
Pull Request -
State: closed - Opened by zhongkaifu almost 2 years ago
- 7 comments
#822 - Can the score of token be updated manually? Can two vocabulary been merged?
Issue -
State: closed - Opened by obangw almost 2 years ago
- 2 comments
#821 - sentencepiece v0.1.95 pip install failure on HuggingFace Docker Space
Issue -
State: closed - Opened by anammari almost 2 years ago
- 2 comments
#820 - Fix setup-python version not detected
Pull Request -
State: closed - Opened by juliusfrost about 2 years ago
- 1 comment
#819 - Add Python 3.11 builds
Pull Request -
State: closed - Opened by juliusfrost about 2 years ago
- 2 comments
#818 - Error with ProtoBuf code Python
Issue -
State: closed - Opened by ahmedlone127 about 2 years ago
- 4 comments
#817 - Add interface for C# calling
Pull Request -
State: closed - Opened by zhongkaifu about 2 years ago
- 1 comment
#816 - Cannot encode special tokens in input
Issue -
State: closed - Opened by pks about 2 years ago
- 1 comment
#815 - sentencepiece.dll problem in the API
Issue -
State: closed - Opened by piedralaves about 2 years ago
- 2 comments
#814 - Error while installing from source
Issue -
State: closed - Opened by singhakr about 2 years ago
- 2 comments
#813 - Segfault using sentencepiece with tf-nightly
Issue -
State: closed - Opened by mattdangerw about 2 years ago
- 7 comments
Labels: bug
#812 - Undo unk hack
Pull Request -
State: closed - Opened by pks about 2 years ago
- 1 comment
#811 - duplicate tokens in user_defined_symbols param cause RuntimeError
Issue -
State: closed - Opened by poedator about 2 years ago
- 2 comments
Labels: enhancement
#810 - Support for python3.11
Issue -
State: closed - Opened by kuutsav about 2 years ago
- 25 comments
Labels: enhancement
#809 - Can I know the loss of training with BPE and subword regularization?
Issue -
State: closed - Opened by lsy641 about 2 years ago
- 1 comment
#808 - Update README.md
Pull Request -
State: closed - Opened by jacek-michalak about 2 years ago
#807 - Exporting tokenizer to .spm
Issue -
State: closed - Opened by SupreethRao99 about 2 years ago
- 1 comment
#806 - Output folder for spm_train
Issue -
State: closed - Opened by AlexUmnov about 2 years ago
- 4 comments
#805 - sp.encode has results but sp.sample_encode_and_score doesn't
Issue -
State: closed - Opened by lsy641 about 2 years ago
#804 - Training a BPE model w/ "identity" normalization rule doesn't add "\n" to the vocab
Issue -
State: closed - Opened by pks about 2 years ago
- 5 comments
#803 - Not able to install sentencepiece on s390x machine
Issue -
State: closed - Opened by swagaths1 about 2 years ago
- 2 comments
Labels: bug
#802 - Is it allowed to rearrange index/id of each vocabulary?
Issue -
State: closed - Opened by lsy641 about 2 years ago
- 1 comment
#801 - tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type
Issue -
State: open - Opened by lintangsutawika about 2 years ago
Labels: bug
#800 - low encoding speed in Python library
Issue -
State: closed - Opened by tomsbergmanis about 2 years ago
- 1 comment
#799 - fix the path in add_new_vocab.ipynb
Pull Request -
State: closed - Opened by kyoto7250 about 2 years ago