Ecosyste.ms: Issues
An open API service that provides issue and pull request metadata for open source projects.
GitHub / bigscience-workshop/data_tooling issues and pull requests
#87 - Add GitHub Action to add issue to project
Pull Request -
State: closed - Opened by albertvillanova almost 3 years ago
#86 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] almost 3 years ago
#85 - Add Alternate bad words & stop words list
Issue -
State: closed - Opened by cccntu almost 3 years ago
- 3 comments
#84 - Add ESTER datasets
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 1 comment
Labels: data catalog
#83 - pii-manager v. 0.2.0
Pull Request -
State: closed - Opened by paulovn almost 3 years ago
#82 - Fix pre-commit issues
Pull Request -
State: closed - Opened by olinguyen almost 3 years ago
#81 - added pii-package
Pull Request -
State: closed - Opened by paulovn almost 3 years ago
- 3 comments
#80 - Run pre-commit on pii_processing
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
#79 - moving pii_processing repo into sub directory
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#78 - Update deduplication scripts
Pull Request -
State: closed - Opened by ChenghaoMou about 3 years ago
#77 - Releasable Dataset to Only Include Metadata
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 2 comments
Labels: metadata
#76 - Normalize and use sentence piece tokenizer
Pull Request -
State: closed - Opened by edugp about 3 years ago
- 1 comment
#75 - Create license-compliant version of the Pile: FreeLaw
Issue -
State: open - Opened by albertvillanova about 3 years ago
Labels: data catalog
#74 - Create the license-compliant version of the Pile: PubMed Central
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 3 comments
Labels: wontfix, data catalog, language modeling script
#73 - Match data between OSCAR v1 (in the registry repo) and OSCAR v2
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 13 comments
#72 - Pass dataset object to filtering class
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
#71 - create deduped registry dataset for low resource language for Yoruba and Basque
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 5 comments
#70 - Add argument to load the Oscar dataset with registry info
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
#69 - Add deduplication script
Pull Request -
State: closed - Opened by ChenghaoMou about 3 years ago
- 2 comments
#68 - Check languages data covered in the current filtering pipeline
Issue -
State: closed - Opened by ggdupont about 3 years ago
#67 - Update basque badwords list.
Pull Request -
State: closed - Opened by asoroa about 3 years ago
- 1 comment
#66 - add parameters of filtering for vietnamese
Pull Request -
State: closed - Opened by heraclex12 about 3 years ago
- 4 comments
#65 - Create license-compliant version of the Pile
Issue -
State: open - Opened by albertvillanova about 3 years ago
- 2 comments
Labels: data catalog, language modeling script
#64 - adding govt id regex - WIP
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
- 8 comments
#63 - Add documentation to setup and run the data filtering pipeline
Issue -
State: closed - Opened by olinguyen about 3 years ago
- 1 comment
Labels: good first issue, tooling
#62 - Determine optimal parameters for data filtering per language
Issue -
State: closed - Opened by olinguyen about 3 years ago
- 2 comments
Labels: tooling
#61 - Create dataset Norwegian Colossal Corpus
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 9 comments
Labels: wontfix, data catalog, language modeling script
#60 - Create dataset from GALILEO Open Learning Materials
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 1 comment
Labels: duplicate, data catalog
#59 - Create dataset from Bloom Library
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 1 comment
Labels: duplicate, data catalog
#58 - Create dataset from Book Dash
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 5 comments
Labels: duplicate, data catalog, need data sourcing feedback
#57 - Create dataset from African Minds
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 2 comments
Labels: duplicate, data catalog
#56 - Create dataset from SciELO Books
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 10 comments
Labels: data catalog, need data sourcing feedback, language modeling script
#55 - Create dataset from Project Gutenberg
Issue -
State: closed - Opened by albertvillanova about 3 years ago
- 12 comments
Labels: data catalog, language modeling script
#54 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] about 3 years ago
#53 - Load dataset in streaming mode for visualization
Pull Request -
State: closed - Opened by Luvata about 3 years ago
- 7 comments
#52 - Regarding Spanish bad word list (badwords.py)
Issue -
State: closed - Opened by asoroa about 3 years ago
- 2 comments
#51 - Add language specific special_char filters
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 1 comment
#50 - Create sanity tests for all language pipelines
Issue -
State: closed - Opened by ggdupont about 3 years ago
- 2 comments
Labels: good first issue, tooling
#49 - fix: ensure consistent & efficient tokenization
Pull Request -
State: closed - Opened by ggdupont about 3 years ago
- 6 comments
#48 - detect or remove eval dataset contamination
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 3 comments
Labels: evaluation, corpus
#47 - remove eval dataset contamination
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 1 comment
#46 - remove strip_chars
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
Labels: filter, tokenizer
#45 - Update stopword cutoff for better filtering
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
- 1 comment
#44 - Save dataset for offline analysis of filtering
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
- 2 comments
#43 - using different models under bertin
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 1 comment
Labels: corpus
#42 - Add poetry packages and Makefile
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
#41 - Added parameters for "fr"
Pull Request -
State: closed - Opened by clancyoftheoverflow about 3 years ago
- 1 comment
#40 - [WIP] Refactor for contribution
Pull Request -
State: closed - Opened by Luvata about 3 years ago
- 7 comments
Labels: tooling
#39 - Add perplexity visualization tool
Pull Request -
State: closed - Opened by edugp about 3 years ago
- 15 comments
#38 - Make pipeline runnable
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
#37 - Add kenlm language ids and model download script
Pull Request -
State: closed - Opened by olinguyen about 3 years ago
#36 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] about 3 years ago
#35 - Stop word filtering
Pull Request -
State: closed - Opened by clancyoftheoverflow about 3 years ago
- 4 comments
#34 - Added parameters for "fr"
Pull Request -
State: closed - Opened by clancyoftheoverflow about 3 years ago
- 4 comments
#33 - Run pre-commit-hooks and enable CI
Pull Request -
State: closed - Opened by Skylion007 about 3 years ago
#32 - Remove unused imports
Pull Request -
State: closed - Opened by Skylion007 about 3 years ago
#31 - Create perplexity scoring script for JZ
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
#30 - CLI for oscar_sample_filter.py and make generic of datasets
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 8 comments
Labels: tooling
#29 - Create distributed data filtering on JZ
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 2 comments
Labels: tooling
#28 - Move Description From Data Filtering Spec Document to data_tooling/ac_dc/README.md
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
Labels: documentation
#27 - Improved Dedup
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 5 comments
Labels: corpus
#26 - Incorporate community filtering into pipeline with github auto-checking
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 2 comments
Labels: good first issue, tooling
#25 - integrate register information to the pipeline
Issue -
State: closed - Opened by mavela about 3 years ago
- 18 comments
Labels: tooling, metadata
#24 - (perplexity sampling) Add script to get perplexity for other languages in Oscar/mc4
Issue -
State: closed - Opened by cccntu about 3 years ago
- 10 comments
Labels: corpus
#23 - Add better stopword filters
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 2 comments
Labels: tooling, corpus
#22 - Add better stopword filters
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 1 comment
#21 - Add mojibake filter
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
Labels: good first issue, tooling
#20 - Review ac_dc/oscar_sample_filter.py for correctness
Issue -
State: closed - Opened by huu4ontocord about 3 years ago
- 5 comments
#19 - adding bertin oscar sampling code
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#18 - adding bertin oscar code
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#17 - Add bertin oscar
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#16 - Remove bertin
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#15 - adding bertin code
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#14 - adding ontology code
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#13 - adding gender lists
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#12 - fix NER parsing
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#11 - fixing pii code
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#10 - adding disease and processor pipeline
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#9 - adding minhash
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#8 - Cleanup fixes
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#7 - adding __init__.py
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#6 - Adding to requirements.txt
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#5 - Adding more pii
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#4 - adding pii
Pull Request -
State: closed - Opened by huu4ontocord about 3 years ago
#3 - Adding more bug fixes to datastore
Pull Request -
State: closed - Opened by huu4ontocord over 3 years ago
#2 - adding datastore
Pull Request -
State: closed - Opened by huu4ontocord over 3 years ago
#1 - Initial proposal for metadata guidelines
Pull Request -
State: closed - Opened by ggdupont over 3 years ago