Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / bigscience-workshop/data_tooling issues and pull requests

#87 - Add GitHub Action to add issue to project

Pull Request - State: closed - Opened by albertvillanova almost 3 years ago

#86 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] almost 3 years ago

#85 - Add Alternate bad words & stop words list

Issue - State: closed - Opened by cccntu almost 3 years ago - 3 comments

#84 - Add ESTER datasets

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 1 comment
Labels: data catalog

#83 - pii-manager v. 0.2.0

Pull Request - State: closed - Opened by paulovn almost 3 years ago

#82 - Fix pre-commit issues

Pull Request - State: closed - Opened by olinguyen almost 3 years ago

#81 - added pii-package

Pull Request - State: closed - Opened by paulovn almost 3 years ago - 3 comments

#80 - Run pre-commit on pii_processing

Pull Request - State: closed - Opened by olinguyen about 3 years ago

#79 - moving pii_processing repo into sub directory

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#78 - Update deduplication scripts

Pull Request - State: closed - Opened by ChenghaoMou about 3 years ago

#77 - Releasable Dataset to Only Include Metadata

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 2 comments
Labels: metadata

#76 - Normalize and use sentence piece tokenizer

Pull Request - State: closed - Opened by edugp about 3 years ago - 1 comment

#75 - Create license-compliant version of the Pile: FreeLaw

Issue - State: open - Opened by albertvillanova about 3 years ago
Labels: data catalog

#74 - Create the license-compliant version of the Pile: PubMed Central

Issue - State: closed - Opened by albertvillanova about 3 years ago - 3 comments
Labels: wontfix, data catalog, language modeling script

#73 - Match data between OSCAR v1 (in the registry repo) and OSCAR v2

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 13 comments

#72 - Pass dataset object to filtering class

Pull Request - State: closed - Opened by olinguyen about 3 years ago

#71 - create deduped registry dataset for low resource language for Yoruba and Basque

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 5 comments

#70 - Add argument to load the Oscar dataset with registry info

Pull Request - State: closed - Opened by olinguyen about 3 years ago

#69 - Add deduplication script

Pull Request - State: closed - Opened by ChenghaoMou about 3 years ago - 2 comments

#68 - Check languages data covered in the current filtering pipeline

Issue - State: closed - Opened by ggdupont about 3 years ago

#67 - Update basque badwords list.

Pull Request - State: closed - Opened by asoroa about 3 years ago - 1 comment

#66 - add parameters of filtering for vietnamese

Pull Request - State: closed - Opened by heraclex12 about 3 years ago - 4 comments

#65 - Create license-compliant version of the Pile

Issue - State: open - Opened by albertvillanova about 3 years ago - 2 comments
Labels: data catalog, language modeling script

#64 - adding govt id regex - WIP

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago - 8 comments

#63 - Add documentation to setup and run the data filtering pipeline

Issue - State: closed - Opened by olinguyen about 3 years ago - 1 comment
Labels: good first issue, tooling

#62 - Determine optimal parameters for data filtering per language

Issue - State: closed - Opened by olinguyen about 3 years ago - 2 comments
Labels: tooling

#61 - Create dataset Norwegian Colossal Corpus

Issue - State: closed - Opened by albertvillanova about 3 years ago - 9 comments
Labels: wontfix, data catalog, language modeling script

#60 - Create dataset from GALILEO Open Learning Materials

Issue - State: closed - Opened by albertvillanova about 3 years ago - 1 comment
Labels: duplicate, data catalog

#59 - Create dataset from Bloom Library

Issue - State: closed - Opened by albertvillanova about 3 years ago - 1 comment
Labels: duplicate, data catalog

#58 - Create dataset from Book Dash

Issue - State: closed - Opened by albertvillanova about 3 years ago - 5 comments
Labels: duplicate, data catalog, need data sourcing feedback

#57 - Create dataset from African Minds

Issue - State: closed - Opened by albertvillanova about 3 years ago - 2 comments
Labels: duplicate, data catalog

#56 - Create dataset from SciELO Books

Issue - State: closed - Opened by albertvillanova about 3 years ago - 10 comments
Labels: data catalog, need data sourcing feedback, language modeling script

#55 - Create dataset from Project Gutenberg

Issue - State: closed - Opened by albertvillanova about 3 years ago - 12 comments
Labels: data catalog, language modeling script

#54 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] about 3 years ago

#53 - Load dataset in streaming mode for visualization

Pull Request - State: closed - Opened by Luvata about 3 years ago - 7 comments

#52 - Regarding Spanish bad word list (badwords.py)

Issue - State: closed - Opened by asoroa about 3 years ago - 2 comments

#51 - Add language specific special_char filters

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 1 comment

#50 - Create sanity tests for all language pipelines

Issue - State: closed - Opened by ggdupont about 3 years ago - 2 comments
Labels: good first issue, tooling

#49 - fix: ensure consistent & efficient tokenization

Pull Request - State: closed - Opened by ggdupont about 3 years ago - 6 comments

#48 - detect or remove eval dataset contamination

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 3 comments
Labels: evaluation, corpus

#47 - remove eval dataset contamination

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 1 comment

#46 - remove strip_chars

Issue - State: closed - Opened by huu4ontocord about 3 years ago
Labels: filter, tokenizer

#45 - Update stopword cutoff for better filtering

Pull Request - State: closed - Opened by olinguyen about 3 years ago - 1 comment

#44 - Save dataset for offline analysis of filtering

Pull Request - State: closed - Opened by olinguyen about 3 years ago - 2 comments

#43 - using different models under bertin

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 1 comment
Labels: corpus

#42 - Add poetry packages and Makefile

Pull Request - State: closed - Opened by olinguyen about 3 years ago

#41 - Added parameters for "fr"

Pull Request - State: closed - Opened by clancyoftheoverflow about 3 years ago - 1 comment

#40 - [WIP] Refactor for contribution

Pull Request - State: closed - Opened by Luvata about 3 years ago - 7 comments
Labels: tooling

#39 - Add perplexity visualization tool

Pull Request - State: closed - Opened by edugp about 3 years ago - 15 comments

#38 - Make pipeline runnable

Pull Request - State: closed - Opened by olinguyen about 3 years ago

#37 - Add kenlm language ids and model download script

Pull Request - State: closed - Opened by olinguyen about 3 years ago

#36 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] about 3 years ago

#35 - Stop word filtering

Pull Request - State: closed - Opened by clancyoftheoverflow about 3 years ago - 4 comments

#34 - Added parameters for "fr"

Pull Request - State: closed - Opened by clancyoftheoverflow about 3 years ago - 4 comments

#33 - Run pre-commit-hooks and enable CI

Pull Request - State: closed - Opened by Skylion007 about 3 years ago

#32 - Remove unused imports

Pull Request - State: closed - Opened by Skylion007 about 3 years ago

#31 - Create perplexity scoring script for JZ

Issue - State: closed - Opened by huu4ontocord about 3 years ago

#30 - CLI for oscar_sample_filter.py and make generic of datasets

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 8 comments
Labels: tooling

#29 - Create distributed data filtering on JZ

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 2 comments
Labels: tooling

#28 - Move Description From Data Filtering Spec Document to data_tooling/ac_dc/README.md

Issue - State: closed - Opened by huu4ontocord about 3 years ago
Labels: documentation

#27 - Improved Dedup

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 5 comments
Labels: corpus

#26 - Incorporate community filtering into pipeline with github auto-checking

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 2 comments
Labels: good first issue, tooling

#25 - integrate register information to the pipeline

Issue - State: closed - Opened by mavela about 3 years ago - 18 comments
Labels: tooling, metadata

#24 - (perplexity sampling) Add script to get perplexity for other languages in Oscar/mc4

Issue - State: closed - Opened by cccntu about 3 years ago - 10 comments
Labels: corpus

#23 - Add better stopword filters

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 2 comments
Labels: tooling, corpus

#22 - Add better stopword filters

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 1 comment

#21 - Add mojibake filter

Issue - State: closed - Opened by huu4ontocord about 3 years ago
Labels: good first issue, tooling

#20 - Review ac_dc/oscar_sample_filter.py for correctness

Issue - State: closed - Opened by huu4ontocord about 3 years ago - 5 comments

#19 - adding bertin oscar sampling code

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#18 - adding bertin oscar code

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#17 - Add bertin oscar

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#16 - Remove bertin

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#15 - adding bertin code

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#14 - adding ontology code

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#13 - adding gender lists

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#12 - fix NER parsing

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#11 - fixing pii code

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#10 - adding disease and processor pipeline

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#9 - adding minhash

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#8 - Cleanup fixes

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#7 - adding __init__.py

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#6 - Adding to requirements.txt

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#5 - Adding more pii

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#4 - adding pii

Pull Request - State: closed - Opened by huu4ontocord about 3 years ago

#3 - Adding more bug fixes to datastore

Pull Request - State: closed - Opened by huu4ontocord over 3 years ago

#2 - adding datastore

Pull Request - State: closed - Opened by huu4ontocord over 3 years ago

#1 - Initial proposal for metadata guidelines

Pull Request - State: closed - Opened by ggdupont over 3 years ago
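
Each entry in the listing above follows the same layout: a title line ("#N - title"), a status line (type, state, author, age, optional comment count), and an optional "Labels:" line. The following is a minimal sketch of how that text could be turned into structured records, assuming exactly the plain-text layout shown on this page. The function name parse_entries and the dictionary field names are hypothetical and are not part of the Ecosyste.ms API.

```python
import re

# Title line, e.g. "#84 - Add ESTER datasets".
TITLE_RE = re.compile(r"^#(\d+) - (.+)$")

# Status line, e.g.
# "Issue - State: closed - Opened by cccntu almost 3 years ago - 3 comments".
# The trailing comment count is optional.
STATUS_RE = re.compile(
    r"^(Issue|Pull Request) - State: (\w+) - Opened by (\S+) (.+? ago)"
    r"(?: - (\d+) comments?)?$"
)


def parse_entries(text):
    """Parse the listing text into a list of dicts (hypothetical field names)."""
    entries = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        title = TITLE_RE.match(line)
        if title:
            # A new entry starts at every "#N - title" line.
            current = {
                "number": int(title.group(1)),
                "title": title.group(2),
                "labels": [],
                "comments": 0,
            }
            entries.append(current)
            continue
        if current is None:
            # Skip page header lines that appear before the first entry.
            continue
        status = STATUS_RE.match(line)
        if status:
            current["kind"] = status.group(1)
            current["state"] = status.group(2)
            current["author"] = status.group(3)
            current["opened"] = status.group(4)
            current["comments"] = int(status.group(5) or 0)
            continue
        if line.startswith("Labels:"):
            labels = line[len("Labels:"):].split(",")
            current["labels"] = [label.strip() for label in labels]
    return entries


sample = """\
#87 - Add GitHub Action to add issue to project

Pull Request - State: closed - Opened by albertvillanova almost 3 years ago

#84 - Add ESTER datasets

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 1 comment
Labels: data catalog
"""

entries = parse_entries(sample)
open_issues = [e for e in entries if e["kind"] == "Issue" and e["state"] == "open"]
```

Entries without a comment count are recorded with 0 comments, and entries without a "Labels:" line get an empty label list, so the records can be filtered uniformly (for example by state or by label).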