Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / google-research/deduplicate-text-datasets issues and pull requests

#52 - Bump tensorflow from 2.9.0 to 2.12.1

Pull Request - State: open - Opened by dependabot[bot] about 2 months ago
Labels: dependencies, python

#50 - Format rust code via `cargo fmt`

Pull Request - State: open - Opened by nathan-barry 4 months ago - 1 comment

#49 - Distributed running

Issue - State: open - Opened by jordane95 4 months ago

#48 - does finish_dedup_wiki40b.py has some wrong?

Issue - State: open - Opened by mathCrazyy 4 months ago

#47 - does this tool can process Chinese?

Issue - State: open - Opened by mathCrazyy 4 months ago - 1 comment

#46 - Fix issue #45: out of range bug

Pull Request - State: closed - Opened by WWWonderer 4 months ago - 5 comments

#44 - Question: Upper Bound

Issue - State: closed - Opened by bezir 5 months ago - 1 comment

#43 - Count_occurrence does not work with tokenizer?

Issue - State: closed - Opened by WWWonderer 5 months ago - 2 comments

#41 - question about wstring_equal function

Issue - State: closed - Opened by WWWonderer 6 months ago - 2 comments

#40 - where the data is?

Issue - State: closed - Opened by jianshu93 6 months ago - 1 comment

#38 - customized dataset deduplication

Issue - State: closed - Opened by zengyangjie 7 months ago - 1 comment

#37 - Bump tensorflow from 2.9.0 to 2.11.1

Pull Request - State: closed - Opened by dependabot[bot] 7 months ago - 1 comment
Labels: dependencies, python

#36 - [Question] An error with the same repo guideline

Issue - State: open - Opened by manel-hikk 8 months ago - 5 comments

#35 - remove_ex in finish_dedup_wiki40b

Issue - State: closed - Opened by wead-hsu 10 months ago - 1 comment

#34 - Incomplete Sentences

Issue - State: closed - Opened by MiladMolazadeh 10 months ago - 1 comment

#33 - Adjust TensorFlow version to fix cuDNN, cuFFT, cuBLAS errors.

Issue - State: closed - Opened by longxudou 10 months ago - 4 comments

#32 - Retain one instance per duplicate

Issue - State: closed - Opened by RobinQrtz 11 months ago - 2 comments

#31 - what is the input dataset format for custom dataset?

Issue - State: closed - Opened by mittsommer 12 months ago - 1 comment

#30 - cargo build error. Could you upload cargo.lock file?

Issue - State: open - Opened by jinzhuer about 1 year ago - 2 comments

#28 - [Paper Question] Why use w-shingles over k-shingles?

Issue - State: closed - Opened by micimize over 1 year ago - 1 comment

#27 - Implementation of NearDup (approximate match)

Issue - State: closed - Opened by Yaoming95 over 1 year ago - 1 comment

#26 - Simple test

Issue - State: closed - Opened by KeremTurgutlu over 1 year ago

#25 - [Proposal] Use `release` build, not `debug` builds

Pull Request - State: closed - Opened by Narsil over 1 year ago - 3 comments

#24 - Off-by-1 error in `collect`?

Issue - State: open - Opened by ola13 almost 2 years ago

#23 - question about deduplication cluster size

Issue - State: closed - Opened by everks almost 2 years ago - 2 comments

#22 - make cmd_merge use multiple threads again

Pull Request - State: closed - Opened by TristanThrush almost 2 years ago - 1 comment

#21 - how to deduplicate huggingface datasets

Issue - State: closed - Opened by StephennFernandes about 2 years ago - 7 comments

#20 - Accessing the duplicates and their counts

Issue - State: closed - Opened by yanaiela about 2 years ago - 13 comments

#19 - Fix to issue #17 limits cmd_merge to be single-threaded

Issue - State: closed - Opened by kleinj about 2 years ago - 3 comments

#18 - RAM crash when use collect method

Issue - State: open - Opened by acul3 about 2 years ago - 2 comments

#17 - one bug when I use

Issue - State: closed - Opened by flyingwaters over 2 years ago - 2 comments

#16 - Should newline char be removed

Issue - State: closed - Opened by cperiz over 2 years ago - 1 comment

#15 - Unexpected behavior with ending symbols

Issue - State: closed - Opened by mitya52 over 2 years ago - 2 comments

#14 - "failed to fill whole buffer" errors

Issue - State: closed - Opened by mitya52 over 2 years ago - 2 comments

#13 - Fix multiprocessing bug in Windows/Mac OS X

Pull Request - State: closed - Opened by alistairewj over 2 years ago - 1 comment

#12 - Error when running the code

Issue - State: closed - Opened by MatthewCYM over 2 years ago - 15 comments

#11 - Error with table size not being divisible by text size

Issue - State: closed - Opened by jinyongyoo over 2 years ago - 7 comments

#10 - Dev v1

Pull Request - State: closed - Opened by carlini over 2 years ago - 1 comment

#9 - false positives

Issue - State: closed - Opened by ChenghaoMou over 2 years ago - 1 comment

#8 - Can the tool run on plain text files?

Issue - State: closed - Opened by m-resta almost 3 years ago - 20 comments

#7 - How to dedup substring in one dataset?

Issue - State: closed - Opened by lan2016286 almost 3 years ago - 9 comments

#6 - Adding possibility to load an HF-dataset

Pull Request - State: closed - Opened by TevenLeScao almost 3 years ago - 5 comments

#5 - Error on self deduplication

Issue - State: closed - Opened by zijwang about 3 years ago - 10 comments

#4 - Why not use Simhash?

Issue - State: closed - Opened by Ethan-yt about 3 years ago - 3 comments

#3 - How to dedup between two datasets?

Issue - State: closed - Opened by mralexis1 about 3 years ago - 7 comments

#2 - Add CSVs to replicate approximate deduped datasets

Pull Request - State: closed - Opened by daphnei about 3 years ago

#1 - Some changes to README formatting, language, and typos.

Pull Request - State: closed - Opened by daphnei about 3 years ago