google/sentencepiece issues and pull requests

#900 - Question - pad token on pretrained tokenizer

Issue - State: closed - Opened by stellanhaglund over 1 year ago - 1 comment

#899 - Tokenizing/sampling differently during different training epochs

Issue - State: closed - Opened by lsy641 over 1 year ago - 3 comments

#898 - Bad prompt continuation with trailing space

Issue - State: closed - Opened by xefoci7612 over 1 year ago

#897 - error: subprocess-exited-with-error × Building wheel for sentencepiece (pyproject.toml) did not run successfully.

Issue - State: closed - Opened by dv000-7 over 1 year ago - 1 comment

#896 - Zero width non-joiner incorrectly mapped to space

Issue - State: closed - Opened by angelaslin over 1 year ago - 1 comment

#895 - How to check the tokenization method of a given `spm.model` ?

Issue - State: closed - Opened by cyk1337 over 1 year ago - 1 comment
Labels: duplicate

#894 - Right bigram index too large for uint16

Issue - State: closed - Opened by Numeri over 1 year ago - 3 comments

#893 - Suppress Warnings via Python API?

Issue - State: closed - Opened by dgrahn over 1 year ago - 4 comments
Labels: Will fix in next release

#892 - universal build release for c++ library

Issue - State: closed - Opened by gianmarcohutter over 1 year ago - 4 comments
Labels: feature request, Will fix in next release

#891 - Adding continuous `'<0x0A><0x0A>'` tokens to encode continuous `\n\n`

Issue - State: closed - Opened by cyk1337 over 1 year ago - 2 comments

#890 - The vocabulary order in BPE

Issue - State: closed - Opened by cyk1337 over 1 year ago - 1 comment

#889 - Generate `sentencepiece_model_pb2.py` with newer protobuf

Issue - State: closed - Opened by ydshieh over 1 year ago - 2 comments

#888 - set add_dummy_prefix according to the identified language

Issue - State: closed - Opened by Life-0-1 over 1 year ago - 2 comments

#887 - Very slow training on AMD CPU

Issue - State: closed - Opened by kjhanjee over 1 year ago - 6 comments

#886 - Can options from SentencePieceTrainer training be extracted from the tokenizer.model file?

Issue - State: closed - Opened by DinhLuan14 over 1 year ago - 2 comments

#885 - vocab size is smaller than required_chars

Issue - State: closed - Opened by jiangix-paper over 1 year ago - 1 comment

#884 - How does word regularization calculate the sampling score returned by the function "sample_encode_and_score"?

Issue - State: closed - Opened by lsy641 over 1 year ago - 1 comment

#883 - spm_train_main.cc(171) error

Issue - State: closed - Opened by glide-the over 1 year ago - 1 comment

#882 - RuntimeError Internal: src/sentencepiece_processor.cc(1101)

Issue - State: closed - Opened by Feng-Jay over 1 year ago - 1 comment

#881 - Duplicate tokens in BPE vocabulary

Issue - State: open - Opened by astanic over 1 year ago - 1 comment

#880 - v5

Pull Request - State: closed - Opened by Fovty over 1 year ago - 1 comment

#879 - V3

Pull Request - State: closed - Opened by Fovty over 1 year ago - 1 comment

#878 - v3

Pull Request - State: closed - Opened by Fovty over 1 year ago - 1 comment

#877 - There is always a "▁" before the first piece after tokenization, why?

Issue - State: closed - Opened by hxs91 over 1 year ago - 3 comments
Labels: duplicate

#876 - Patches carried by conda-forge for packaging sentencepiece

Issue - State: open - Opened by h-vetinari over 1 year ago - 3 comments

#875 - Use with LASER3

Issue - State: closed - Opened by thyripian over 1 year ago - 2 comments

#874 - Please add a Python interface for normalization (like `spm_normalize` in CLI)

Issue - State: closed - Opened by avidale over 1 year ago - 1 comment
Labels: feature request, Will fix in next release

#873 - Respecify the normalization rule for trained sentencepiece models

Issue - State: closed - Opened by geolvr over 1 year ago - 1 comment

#872 - why spm_encode keeps characters not in the vocab?

Issue - State: closed - Opened by Smu-Tan over 1 year ago - 1 comment
Labels: duplicate

#871 - How do I make a spiece.model for vanilla GPT2 BPE Tokenizer?

Issue - State: closed - Opened by redthing1 over 1 year ago - 1 comment

#870 - Fix overlinking with protobuf

Pull Request - State: closed - Opened by ryandesign almost 2 years ago

#869 - Using SPM_USE_EXTERNAL_ABSL

Issue - State: closed - Opened by ryandesign almost 2 years ago - 2 comments
Labels: Will fix in next release

#868 - Update bundled protobuf-lite?

Issue - State: closed - Opened by ryandesign almost 2 years ago - 2 comments

#867 - Fix nasty bug in BPE position encoding

Pull Request - State: closed - Opened by vmarkovtsev almost 2 years ago

#866 - Tokens Chunking to respect Language Word Boundaries

Issue - State: open - Opened by loretoparisi almost 2 years ago

#865 - Python from source on armv7l raises ' undefined symbol: __atomic_fetch_add_8 '

Issue - State: open - Opened by FrancescoScandiffio almost 2 years ago - 2 comments

#864 - tokens taker

Issue - State: closed - Opened by Hopom almost 2 years ago - 1 comment

#863 - How about `▁` own escape?

Issue - State: closed - Opened by fseasy almost 2 years ago - 2 comments

#862 - segment fault

Issue - State: closed - Opened by wlike almost 2 years ago - 2 comments

#861 - RuntimeError occurs when T5Tokenizer is executed on big-endian platform

Issue - State: closed - Opened by kiszk almost 2 years ago - 3 comments
Labels: bug

#860 - Build failure at 0.19.8 or 0.19.9 on big-endian platform

Issue - State: closed - Opened by kiszk almost 2 years ago - 2 comments

#859 - Program terminated with an unrecoverable error.

Issue - State: closed - Opened by RuslanSel almost 2 years ago - 2 comments

#858 - Issue with XLM-RoBERTa tokenizer

Issue - State: closed - Opened by ozanarmagan almost 2 years ago - 1 comment

#857 - not able to use more than 128 threads

Issue - State: closed - Opened by GradientGuru almost 2 years ago - 1 comment

#856 - does it support bpe of clip model?

Issue - State: closed - Opened by susht3 almost 2 years ago - 1 comment

#855 - Code for tf-sentencepiece

Issue - State: closed - Opened by lxomb almost 2 years ago - 1 comment

#854 - Potential bug: sequential unk tokens are combined in word mode

Issue - State: closed - Opened by smesham almost 2 years ago - 2 comments

#853 - encode underline

Issue - State: closed - Opened by yangsp5 almost 2 years ago - 2 comments

#852 - Extending the byte_fallback option

Issue - State: closed - Opened by chris-ha458 almost 2 years ago - 1 comment

#851 - NaN unigram model score error with sentencepiece 0.1.98

Issue - State: closed - Opened by lucaslingle almost 2 years ago - 3 comments
Labels: bug

#850 - Error when loading LlamaTokenizer

Issue - State: closed - Opened by creamiracle almost 2 years ago - 1 comment

#849 - How to get the score of a token

Issue - State: closed - Opened by WangJW424 almost 2 years ago

#847 - failing to install FastChat - missing sentencepiece_processor.h

Issue - State: closed - Opened by sirnails almost 2 years ago - 1 comment
Labels: duplicate

#846 - pip install failed on windows x64 with python 3.9

Issue - State: closed - Opened by liqingjun123 almost 2 years ago - 2 comments

#845 - Update sentencepiece_python_module_example.ipynb

Pull Request - State: closed - Opened by chris-ha458 almost 2 years ago - 2 comments

#844 - improve stable version documentation

Issue - State: closed - Opened by grumpyp almost 2 years ago - 1 comment

#843 - Potential bug in Unigram model

Issue - State: closed - Opened by tsinggggg almost 2 years ago - 2 comments

#842 - Only support 64bit?

Issue - State: open - Opened by logicvv almost 2 years ago - 2 comments

#841 - How to customize `input_format` except txt, tsv

Issue - State: closed - Opened by cyk1337 almost 2 years ago - 1 comment

#840 - [Question] Why include characters that are in the set of fallback bytes?

Issue - State: closed - Opened by ZJaume almost 2 years ago - 4 comments

#839 - Letters are split

Issue - State: closed - Opened by wizardk almost 2 years ago - 3 comments

#838 - Question about encode sentence pieces?

Issue - State: closed - Opened by dsj96 almost 2 years ago - 1 comment

#837 - Removed replacing of /MD with /MT for MSVC

Pull Request - State: closed - Opened by ilya-lavrenov almost 2 years ago

#836 - Simulate Errors of common mistakes

Issue - State: closed - Opened by darrkj2049 almost 2 years ago - 1 comment

#835 - PyPI source distribution do not contain full sources

Issue - State: closed - Opened by pkubik almost 2 years ago - 4 comments
Labels: enhancement

#833 - spm_encode splits into letters when --vocabulary is provided

Issue - State: closed - Opened by drunkinlove almost 2 years ago - 5 comments

#832 - disregard

Issue - State: closed - Opened by Technetium1 almost 2 years ago

#831 - undefined reference to `__atomic_fetch_add_8' while build in Raspberry Pi

Issue - State: closed - Opened by yanghl12138 almost 2 years ago - 1 comment

#830 - Is it possible to extend a trained BPE model's merge operations?

Issue - State: closed - Opened by pluiez almost 2 years ago - 1 comment

#829 - Windows debug build fails to load anything, with “file not found” status code

Issue - State: closed - Opened by Const-me almost 2 years ago - 2 comments
Labels: bug

#828 - How can I transfer a SentencePieceProcessor object into a transformers AutoTokenizer?

Issue - State: closed - Opened by nameless0704 almost 2 years ago - 1 comment

#827 - Error building wheel

Issue - State: closed - Opened by eobrien2002 almost 2 years ago - 7 comments
Labels: enhancement

#826 - can sentencepiece python wrapper training with generator rather than files?

Issue - State: closed - Opened by Maxlinn almost 2 years ago - 1 comment

#825 - Usage of MSVC static runtime by default

Issue - State: closed - Opened by ilya-lavrenov almost 2 years ago - 4 comments

#824 - spm.SentencePieceTrainer.train stuck in Jupyter Notebook

Issue - State: closed - Opened by shivanraptor almost 2 years ago - 2 comments

#823 - Add interface for C# calling

Pull Request - State: closed - Opened by zhongkaifu almost 2 years ago - 7 comments

#822 - Can the score of token be updated manually? Can two vocabulary been merged?

Issue - State: closed - Opened by obangw almost 2 years ago - 2 comments

#821 - sentencepiece v0.1.95 pip install failure on HuggingFace Docker Space

Issue - State: closed - Opened by anammari almost 2 years ago - 2 comments

#820 - Fix setup-python version not detected

Pull Request - State: closed - Opened by juliusfrost about 2 years ago - 1 comment

#819 - Add Python 3.11 builds

Pull Request - State: closed - Opened by juliusfrost about 2 years ago - 2 comments

#818 - Error with ProtoBuf code Python

Issue - State: closed - Opened by ahmedlone127 about 2 years ago - 4 comments

#817 - Add interface for C# calling

Pull Request - State: closed - Opened by zhongkaifu about 2 years ago - 1 comment

#816 - Cannot encode special tokens in input

Issue - State: closed - Opened by pks about 2 years ago - 1 comment

#815 - sentencepiece.dll problem in the API

Issue - State: closed - Opened by piedralaves about 2 years ago - 2 comments

#814 - Error while installing from source

Issue - State: closed - Opened by singhakr about 2 years ago - 2 comments

#813 - Segfault using sentencepiece with tf-nightly

Issue - State: closed - Opened by mattdangerw about 2 years ago - 7 comments
Labels: bug

#812 - Undo unk hack

Pull Request - State: closed - Opened by pks about 2 years ago - 1 comment

#811 - duplicate tokens in user_defined_symbols param cause RuntimeError

Issue - State: closed - Opened by poedator about 2 years ago - 2 comments
Labels: enhancement

#810 - Support for python3.11

Issue - State: closed - Opened by kuutsav about 2 years ago - 25 comments
Labels: enhancement

#809 - Can I know the loss of training with BPE and subword regularization?

Issue - State: closed - Opened by lsy641 about 2 years ago - 1 comment

#808 - Update README.md

Pull Request - State: closed - Opened by jacek-michalak about 2 years ago

#807 - Exporting tokenizer to .spm

Issue - State: closed - Opened by SupreethRao99 about 2 years ago - 1 comment

#806 - Output folder for spm_train

Issue - State: closed - Opened by AlexUmnov about 2 years ago - 4 comments

#805 - sp.encode has results but sp.sample_encode_and_score doesn't

Issue - State: closed - Opened by lsy641 about 2 years ago

#804 - Training a BPE model w/ "identity" normalization rule doesn't add "\n" to the vocab

Issue - State: closed - Opened by pks about 2 years ago - 5 comments

#803 - Not able to install sentencepiece on s390x machine

Issue - State: closed - Opened by swagaths1 about 2 years ago - 2 comments
Labels: bug

#802 - Is it allowed to rearrange index/id of each vocabulary?

Issue - State: closed - Opened by lsy641 about 2 years ago - 1 comment

#801 - tokens listed in user_defined_symbols tokenized as unknowns when using the "word" model_type

Issue - State: open - Opened by lintangsutawika about 2 years ago
Labels: bug

#800 - low encoding speed in Python library

Issue - State: closed - Opened by tomsbergmanis about 2 years ago - 1 comment

#799 - fix the path in add_new_vocab.ipynb

Pull Request - State: closed - Opened by kyoto7250 about 2 years ago
