Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / eleutherai/dps issues and pull requests

#82 - k

Issue - State: open - Opened by NguyenNhoTrung 5 months ago

#81 - dedup_job java.lang.UnsatisfiedLinkError

Issue - State: open - Opened by syedhasnainrazashah 8 months ago - 3 comments

#80 - Fix #79

Pull Request - State: closed - Opened by ohwi over 1 year ago

#79 - Bug in the function `remove_repeated_text`

Issue - State: closed - Opened by ohwi over 1 year ago

#78 - documentation fixes for the dataframe workflow

Pull Request - State: closed - Opened by paulovn over 1 year ago

#77 - Clean up JA pipeline

Pull Request - State: closed - Opened by polm-stability over 1 year ago - 2 comments

#76 - DataFrame processing pipeline

Pull Request - State: closed - Opened by paulovn over 1 year ago - 4 comments

#75 - fix carriage return removed

Pull Request - State: open - Opened by jason9693 over 1 year ago

#74 - [ja] `.filter` is used instead of `.map` for non-filter methods

Issue - State: open - Opened by mrorii over 1 year ago - 1 comment

#73 - fix: return list[str] from word_tokenize instead of str

Pull Request - State: closed - Opened by mrorii over 1 year ago - 1 comment

#72 - dev japanese

Pull Request - State: closed - Opened by fujiki-1emon over 1 year ago - 1 comment

#71 - Japanese development branch

Pull Request - State: closed - Opened by fujiki-1emon over 1 year ago - 1 comment

#70 - Update README.md

Pull Request - State: closed - Opened by chris-ha458 over 1 year ago

#69 - [WIP] preprocessing vietnamese language

Pull Request - State: open - Opened by wookee3 over 1 year ago - 1 comment

#68 - compatible for v2

Pull Request - State: closed - Opened by jason9693 over 1 year ago - 1 comment

#67 - add pre-processing for Chinese

Pull Request - State: closed - Opened by Kaeun-Lee over 1 year ago - 1 comment

#66 - Work In Progress (thai)

Pull Request - State: closed - Opened by skytmddus27 over 1 year ago - 2 comments

#65 - Chinese dedup memory error

Issue - State: open - Opened by hyeinhyun over 1 year ago - 1 comment

#64 - [WIP] [#62] improve MinHashLSH-based deduplication for Japanese

Pull Request - State: closed - Opened by fujiki-1emon over 1 year ago - 1 comment

#63 - [WIP] [#62] add refactored method for Japanese MinHashLSH-based near-deduplication

Pull Request - State: closed - Opened by fujiki-1emon over 1 year ago - 1 comment

#61 - improve the japanese pre-processing (namely `japanese_job`)

Pull Request - State: closed - Opened by fujiki-1emon over 1 year ago

#60 - [WIP] improve the japanese pre-processing (namely `japanese_job`)

Pull Request - State: closed - Opened by fujiki-1emon over 1 year ago

#59 - [#52] freq char filter: flip comparison and ratio->cnt

Pull Request - State: closed - Opened by skjang54 over 1 year ago

#58 - add indonesia and malaysia preprocessed

Pull Request - State: closed - Opened by acul3 over 1 year ago

#57 - Refactor RDD process to Dataframe process

Issue - State: open - Opened by Taekyoon over 1 year ago
Labels: enhancement

#56 - Need to add ignore null or empty text during korean text process

Issue - State: open - Opened by Taekyoon over 1 year ago
Labels: bug

#55 - [#54] Improve korean preprocessing algorithm

Pull Request - State: closed - Opened by hyunwoongko over 1 year ago

#54 - Improve Korean preprocessing algorithm

Issue - State: closed - Opened by hyunwoongko over 1 year ago

#53 - [#52] add japanese_frequent_char_existence_filter

Pull Request - State: closed - Opened by skjang54 over 1 year ago - 1 comment

#52 - Japanese pre-processing - remove text with low rate of Japanese stopwords

Issue - State: closed - Opened by fujiki-1emon over 1 year ago - 4 comments

#51 - [ja] spam word filter

Issue - State: open - Opened by fujiki-1emon over 1 year ago

#50 - [ja] reduce emoticon

Issue - State: closed - Opened by fujiki-1emon over 1 year ago - 1 comment

#49 - [ja] replace Japanese PII

Issue - State: open - Opened by fujiki-1emon over 1 year ago

#48 - First commit for Romance filtering

Pull Request - State: closed - Opened by josemlopez over 1 year ago - 3 comments

#47 - small fix to requirements.txt

Pull Request - State: closed - Opened by fujiki-1emon almost 2 years ago - 1 comment

#46 - add pre-processing for Japanese

Pull Request - State: closed - Opened by fujiki-1emon almost 2 years ago - 3 comments

#45 - Add many features to korean

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#44 - Add BR to html processing

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#43 - Add html and url processing

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#42 - modify dedup_job

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#41 - modify readme

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#40 - rename sample-jsonl to sample-job

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#39 - modify readme

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#38 - Fetch from master (modify readme)

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#37 - Dev to master

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago

#36 - Feature/#33 Add email, url, spam detection

Pull Request - State: closed - Opened by hyunwoongko almost 2 years ago - 4 comments

#35 - [WIP] Add minhash dedup job

Pull Request - State: closed - Opened by Taekyoon almost 2 years ago - 3 comments

#34 - Implement minhash dedup module

Issue - State: closed - Opened by Taekyoon almost 2 years ago
Labels: enhancement

#33 - Task consideration

Issue - State: closed - Opened by hyunwoongko about 2 years ago - 3 comments

#32 - Replace html2text from Beautifulsoup

Issue - State: closed - Opened by Taekyoon about 2 years ago - 1 comment

#31 - WIP: add Deduplication for Japanese text datasets

Pull Request - State: closed - Opened by fujiki-1emon about 2 years ago - 1 comment

#30 - WIP: Deduplication for Japanese text datasets

Pull Request - State: closed - Opened by fujiki-1emon about 2 years ago

#29 - Add massive text filter logics

Pull Request - State: closed - Opened by Taekyoon about 2 years ago

#28 - Add pre-processing for Japanese texts

Issue - State: closed - Opened by fujiki-1emon about 2 years ago

#27 - Remove `soynlp` library

Issue - State: closed - Opened by Taekyoon over 2 years ago
Labels: enhancement

#26 - Change logics for sample jsonl

Pull Request - State: closed - Opened by Taekyoon over 2 years ago

#25 - Add scripts to run hadoop cluster

Pull Request - State: closed - Opened by Taekyoon over 2 years ago

#24 - [WIP] Feature/#23

Pull Request - State: closed - Opened by Ronalmoo over 2 years ago - 1 comment

#23 - Update additional preprocess function

Issue - State: closed - Opened by Ronalmoo over 2 years ago - 1 comment

#22 - Add normalize `?,:"!` in common preprocess job

Issue - State: closed - Opened by Taekyoon over 2 years ago

#21 - Feature/#17

Pull Request - State: closed - Opened by Kaeun-Lee over 2 years ago - 3 comments

#20 - Add scripts to run hadoop cluster

Issue - State: open - Opened by Taekyoon over 2 years ago
Labels: enhancement

#19 - Add function for processing empty string

Pull Request - State: closed - Opened by Ronalmoo over 2 years ago

#18 - Add function for processing empty string

Issue - State: closed - Opened by Ronalmoo over 2 years ago

#17 - Add huggingface tokenizers for data length statistics

Issue - State: closed - Opened by Kaeun-Lee over 2 years ago

#16 - Add job to separate train and validate data

Issue - State: closed - Opened by Taekyoon over 2 years ago
Labels: add job

#15 - Feature/#13

Pull Request - State: closed - Opened by donggrii over 2 years ago - 7 comments

#14 - Feature/#13

Pull Request - State: closed - Opened by donggrii over 2 years ago

#13 - Add statistics by data category

Issue - State: closed - Opened by donggrii over 2 years ago
Labels: add tool

#12 - [python] feat: Add build news paper data job

Pull Request - State: closed - Opened by Taekyoon over 2 years ago
Labels: add job

#11 - Feature/#4

Pull Request - State: closed - Opened by jayseok-park over 2 years ago - 1 comment

#10 - Feature/#1

Pull Request - State: closed - Opened by Ronalmoo over 2 years ago - 1 comment

#9 - Add build news paper dataset as long text data form

Issue - State: open - Opened by Taekyoon over 2 years ago
Labels: add job

#8 - Add Toxic text labeler

Issue - State: closed - Opened by Taekyoon over 2 years ago
Labels: add job

#7 - Add Text length Stats for datasets

Issue - State: closed - Opened by Taekyoon over 2 years ago
Labels: add job

#6 - [etc] feat: Add development guides

Pull Request - State: closed - Opened by Taekyoon over 2 years ago
Labels: documentation

#5 - [etc] feat: Add requirements-dev.txt

Pull Request - State: closed - Opened by Taekyoon over 2 years ago
Labels: enhancement

#4 - MassiveText Quality Filtering

Issue - State: closed - Opened by jayseok-park over 2 years ago - 3 comments
Labels: add job

#3 - Add guides to run dps jobs

Issue - State: closed - Opened by Taekyoon over 2 years ago
Labels: enhancement

#2 - Add requirements-dev.txt

Issue - State: closed - Opened by Taekyoon over 2 years ago
Labels: enhancement

#1 - Add general text refinement job

Issue - State: closed - Opened by Taekyoon over 2 years ago - 2 comments
Labels: add job