Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / google-research/deduplicate-text-datasets issues and pull requests
#52 - Bump tensorflow from 2.9.0 to 2.12.1
Pull Request -
State: open - Opened by dependabot[bot] about 2 months ago
Labels: dependencies, python
#51 - called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
Issue -
State: open - Opened by bingkunyao 3 months ago
- 1 comment
#50 - Format rust code via `cargo fmt`
Pull Request -
State: open - Opened by nathan-barry 4 months ago
- 1 comment
#49 - Distributed running
Issue -
State: open - Opened by jordane95 4 months ago
#48 - Does finish_dedup_wiki40b.py have something wrong?
Issue -
State: open - Opened by mathCrazyy 4 months ago
#47 - Can this tool process Chinese?
Issue -
State: open - Opened by mathCrazyy 4 months ago
- 1 comment
#46 - Fix issue #45: out of range bug
Pull Request -
State: closed - Opened by WWWonderer 4 months ago
- 5 comments
#45 - [Bug] Out of range error when counting occurrences on a custom suffix array
Issue -
State: closed - Opened by WWWonderer 4 months ago
- 1 comment
#44 - Question: Upper Bound
Issue -
State: closed - Opened by bezir 5 months ago
- 1 comment
#43 - Count_occurrence does not work with tokenizer?
Issue -
State: closed - Opened by WWWonderer 5 months ago
- 2 comments
#42 - Could a pure Python version be provided? I believe many researchers do not have permission to install gcc on their servers
Issue -
State: closed - Opened by gongye19 5 months ago
- 1 comment
#41 - question about wstring_equal function
Issue -
State: closed - Opened by WWWonderer 6 months ago
- 2 comments
#40 - where the data is?
Issue -
State: closed - Opened by jianshu93 6 months ago
- 1 comment
#39 - When I use the tokenizer, I obtain many patterns that span across the data, which is quite strange.
Issue -
State: open - Opened by gawei1995 7 months ago
- 4 comments
#38 - customized dataset deduplication
Issue -
State: closed - Opened by zengyangjie 7 months ago
- 1 comment
#37 - Bump tensorflow from 2.9.0 to 2.11.1
Pull Request -
State: closed - Opened by dependabot[bot] 7 months ago
- 1 comment
Labels: dependencies, python
#36 - [Question] An error with the same repo guideline
Issue -
State: open - Opened by manel-hikk 8 months ago
- 5 comments
#35 - remove_ex in finish_dedup_wiki40b
Issue -
State: closed - Opened by wead-hsu 10 months ago
- 1 comment
#34 - Incomplete Sentences
Issue -
State: closed - Opened by MiladMolazadeh 10 months ago
- 1 comment
#33 - Adjust TensorFlow version to fix cuDNN, cuFFT, cuBLAS errors.
Issue -
State: closed - Opened by longxudou 10 months ago
- 4 comments
#32 - Retain one instance per duplicate
Issue -
State: closed - Opened by RobinQrtz 11 months ago
- 2 comments
#31 - what is the input dataset format for custom dataset?
Issue -
State: closed - Opened by mittsommer 12 months ago
- 1 comment
#30 - cargo build error. Could you upload cargo.lock file?
Issue -
State: open - Opened by jinzhuer about 1 year ago
- 2 comments
#29 - How to restore the result data after deduplication (remove invisible characters)
Issue -
State: closed - Opened by greenriver777 about 1 year ago
- 1 comment
#28 - [Paper Question] Why use w-shingles over k-shingles?
Issue -
State: closed - Opened by micimize over 1 year ago
- 1 comment
#27 - Implementation of NearDup (approximate match)
Issue -
State: closed - Opened by Yaoming95 over 1 year ago
- 1 comment
#26 - Simple test
Issue -
State: closed - Opened by KeremTurgutlu over 1 year ago
#25 - [Proposal] Use `release` build, not `debug` builds
Pull Request -
State: closed - Opened by Narsil over 1 year ago
- 3 comments
#24 - Off-by-1 error in `collect`?
Issue -
State: open - Opened by ola13 almost 2 years ago
#23 - question about deduplication cluster size
Issue -
State: closed - Opened by everks almost 2 years ago
- 2 comments
#22 - make cmd_merge use multiple threads again
Pull Request -
State: closed - Opened by TristanThrush almost 2 years ago
- 1 comment
#21 - how to deduplicate huggingface datasets
Issue -
State: closed - Opened by StephennFernandes about 2 years ago
- 7 comments
#20 - Accessing the duplicates and their counts
Issue -
State: closed - Opened by yanaiela about 2 years ago
- 13 comments
#19 - Fix to issue #17 limits cmd_merge to be single-threaded
Issue -
State: closed - Opened by kleinj about 2 years ago
- 3 comments
#18 - RAM crash when use collect method
Issue -
State: open - Opened by acul3 about 2 years ago
- 2 comments
#17 - one bug when I use
Issue -
State: closed - Opened by flyingwaters over 2 years ago
- 2 comments
#16 - Should newline char be removed
Issue -
State: closed - Opened by cperiz over 2 years ago
- 1 comment
#15 - Unexpected behavior with ending symbols
Issue -
State: closed - Opened by mitya52 over 2 years ago
- 2 comments
#14 - "failed to fill whole buffer" errors
Issue -
State: closed - Opened by mitya52 over 2 years ago
- 2 comments
#13 - Fix multiprocessing bug in Windows/Mac OS X
Pull Request -
State: closed - Opened by alistairewj over 2 years ago
- 1 comment
#12 - Error when running the code
Issue -
State: closed - Opened by MatthewCYM over 2 years ago
- 15 comments
#11 - Error with table size not being divisible by text size
Issue -
State: closed - Opened by jinyongyoo over 2 years ago
- 7 comments
#10 - Dev v1
Pull Request -
State: closed - Opened by carlini over 2 years ago
- 1 comment
#9 - false positives
Issue -
State: closed - Opened by ChenghaoMou over 2 years ago
- 1 comment
#8 - Can the tool run on plain text files?
Issue -
State: closed - Opened by m-resta almost 3 years ago
- 20 comments
#7 - How to dedup substring in one dataset?
Issue -
State: closed - Opened by lan2016286 almost 3 years ago
- 9 comments
#6 - Adding possibility to load an HF-dataset
Pull Request -
State: closed - Opened by TevenLeScao almost 3 years ago
- 5 comments
#5 - Error on self deduplication
Issue -
State: closed - Opened by zijwang about 3 years ago
- 10 comments
#4 - Why not use Simhash?
Issue -
State: closed - Opened by Ethan-yt about 3 years ago
- 3 comments
#3 - How to dedup between two datasets?
Issue -
State: closed - Opened by mralexis1 about 3 years ago
- 7 comments
#2 - Add CSVs to replicate approximate deduped datasets
Pull Request -
State: closed - Opened by daphnei about 3 years ago
#1 - Some changes to README formatting, language, and typos.
Pull Request -
State: closed - Opened by daphnei about 3 years ago