Ecosyste.ms: Issues
An open API service that provides issue and pull request metadata for open source projects.
GitHub / bigscience-workshop/data_tooling issues and pull requests
#419 - Update train_all.sh
Pull Request -
State: open - Opened by chris-ha458 over 1 year ago
#418 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: open - Opened by pre-commit-ci[bot] over 2 years ago
#417 - docstring not valid anymore
Pull Request -
State: closed - Opened by HugoLaurencon over 2 years ago
#416 - Reason for not applying remove_non_prining_characters normalization
Issue -
State: open - Opened by JoeyOhman over 2 years ago
- 1 comment
#415 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] over 2 years ago
#414 - Citing this resource
Issue -
State: open - Opened by yuvalkirstain over 2 years ago
- 4 comments
#413 - add html lang detector
Pull Request -
State: closed - Opened by SaulLu over 2 years ago
#412 - Improved Deduplication
Pull Request -
State: closed - Opened by ChenghaoMou over 2 years ago
- 2 comments
#411 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] over 2 years ago
#410 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] over 2 years ago
#409 - Dedup exact lines training tokenizer dataset
Pull Request -
State: closed - Opened by SaulLu over 2 years ago
#408 - Apply cleaning: remove deduplicated exact lines
Pull Request -
State: closed - Opened by SaulLu over 2 years ago
- 1 comment
#407 - Add exact document deduplicate scripts
Pull Request -
State: closed - Opened by SaulLu over 2 years ago
#406 - Updated Anonymization
Pull Request -
State: closed - Opened by ianyu93 over 2 years ago
#405 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] almost 3 years ago
#404 - Updated apply_regex_anonymization
Pull Request -
State: closed - Opened by ianyu93 almost 3 years ago
#403 - Flagged words in Bengali
Pull Request -
State: closed - Opened by manandey almost 3 years ago
- 1 comment
#402 - Updated ac_dc
Pull Request -
State: closed - Opened by ianyu93 almost 3 years ago
- 1 comment
#401 - Update flagged_words.py
Pull Request -
State: closed - Opened by majauhar almost 3 years ago
- 1 comment
#400 - Replacing pii_processing with muliwai
Pull Request -
State: closed - Opened by huu4ontocord almost 3 years ago
#399 - Update spanish flagged words
Pull Request -
State: closed - Opened by edugp almost 3 years ago
- 1 comment
#398 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] almost 3 years ago
#397 - Pseudo crawl 2
Pull Request -
State: closed - Opened by thomasw21 almost 3 years ago
#396 - Update Spanish flag words
Pull Request -
State: closed - Opened by omarespejel almost 3 years ago
- 1 comment
#395 - Final update for Indonesian list of flagged words
Pull Request -
State: closed - Opened by jtboing almost 3 years ago
- 1 comment
#394 - Update flagged_words.py for Catalan
Pull Request -
State: closed - Opened by onadegibert almost 3 years ago
- 3 comments
#393 - Update flagged_words.py for Portuguese
Pull Request -
State: closed - Opened by ruinunca almost 3 years ago
- 1 comment
#392 - further updates to flagged_words.py for indonesian
Pull Request -
State: closed - Opened by jtboing almost 3 years ago
- 1 comment
#391 - Remove some characters from Chinese bad word list
Pull Request -
State: closed - Opened by JetRunner almost 3 years ago
- 4 comments
#390 - Update flagged_words.py for Indonesian
Pull Request -
State: closed - Opened by afaji almost 3 years ago
- 5 comments
#389 - organize repo for language annotation
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
#388 - add skeleton of slurms files for seeds batch 2
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
- 1 comment
#387 - move dataset folder seeds batch 1
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
#386 - Refacto of the pseudo_crawl folder
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
#385 - [WIP] add slurm and python files to extend the pseudo crawl dataset with the seeds of batch 2
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
- 1 comment
#384 - Reorganization folder to host the batch 2 of seeds to retrive for the pseudo crawl dataset
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
#383 - Pseudo crawl dataset creation
Pull Request -
State: closed - Opened by thomasw21 almost 3 years ago
#382 - add files to compute basic stats on pseudo crawl dataset
Pull Request -
State: open - Opened by SaulLu almost 3 years ago
- 3 comments
#381 - [pre-commit.ci] pre-commit autoupdate
Pull Request -
State: closed - Opened by pre-commit-ci[bot] almost 3 years ago
#380 - Updated flagged_words.py
Pull Request -
State: closed - Opened by abumafrim almost 3 years ago
- 1 comment
#379 - Update flagged words for vietnamese
Pull Request -
State: closed - Opened by Luvata almost 3 years ago
- 1 comment
#378 - Create license-compliant version of the Pile: EuroParl
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 1 comment
Labels: data catalog, language modeling script
#377 - add push to hub slurm script
Pull Request -
State: closed - Opened by SaulLu almost 3 years ago
#376 - Create license-compliant version of the Pile: Stack Exchange
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 1 comment
Labels: data catalog, language modeling script
#375 - Empty Basque flagged words.
Pull Request -
State: closed - Opened by asoroa almost 3 years ago
- 3 comments
#374 - Update flagged_words.py
Pull Request -
State: closed - Opened by majauhar almost 3 years ago
- 5 comments
#373 - Update flagged_words.py
Pull Request -
State: closed - Opened by majauhar almost 3 years ago
- 1 comment
#372 - Checked existing words + Added more words
Pull Request -
State: closed - Opened by hbenyamina almost 3 years ago
- 2 comments
#371 - Adding initial set of flagged word in Tamil
Pull Request -
State: closed - Opened by reshinthadithyan almost 3 years ago
- 1 comment
#369 - Update flagged_words.py
Pull Request -
State: closed - Opened by sashavor almost 3 years ago
- 2 comments
#351 - Create dataset indonesian_news_articles_2017
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#313 - Add code to train KenLM models on Wikipedia and OSCAR
Pull Request -
State: closed - Opened by edugp almost 3 years ago
#310 - Create license-compliant version of the Pile: Enron Emails
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 2 comments
Labels: data catalog, language modeling script
#301 - Create license-compliant version of the Pile: subsets
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog
#297 - Create license-compliant version of the Pile: USPTO
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 1 comment
Labels: data catalog, language modeling script
#294 - Create dataset Shamela
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 13 comments
Labels: duplicate, wontfix, data catalog, need data sourcing feedback
#292 - Create dataset OSAC
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#291 - Create dataset Habibi
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, language modeling script
#289 - Create dataset LABR
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 3 comments
Labels: data catalog, language modeling script
#288 - Create dataset MultiUN v2
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 7 comments
Labels: data catalog, language modeling script
#285 - Create dataset SANAD
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, language modeling script
#284 - Create dataset QADI Arabic
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 5 comments
Labels: data catalog
#280 - Create dataset arabic billion words
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 3 comments
Labels: data catalog, language modeling script
#272 - Topic classification for AC_DC
Issue -
State: closed - Opened by huu4ontocord almost 3 years ago
Labels: tooling, metadata
#266 - Single Sign On From BigScience/Umbrellabird to Data Hosts Trusted Data Environment
Issue -
State: closed - Opened by huu4ontocord almost 3 years ago
Labels: tooling
#259 - Create dataset vietnamese_poetry_from_fsoft_ai_lab
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#254 - Create dataset vietnamese_MT_EV_VLSP2020
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, language modeling script
#253 - Create dataset unsupervised_cross_lingual_representation_learning_at_scale
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#232 - Create dataset british_library_hertiage_made_digital_newspapers
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 16 comments
Labels: help wanted, data catalog, data format
#230 - Create dataset galileo_open_learning_materials
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, data format, language modeling script
#228 - Create dataset hindi_wikipedia_articles
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#226 - Create dataset viquiquad__an_extractive_qa_dataset_from_catalan_wikipedia
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#225 - Create dataset hal_archives_ouvertes
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 3 comments
Labels: data catalog
#221 - Create dataset bengali_question_answering_dataset
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 5 comments
Labels: data catalog, language modeling script
#216 - Create dataset indonesian_news_corpus
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 2 comments
Labels: data catalog, language modeling script
#212 - Create dataset bangla_sentiment_classification_datasets
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#202 - Create dataset multilingual_knowledge_questions_answers
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 5 comments
Labels: data catalog, language modeling script
#198 - Create dataset bloom_library
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, need custodian permission, language modeling script
#189 - Create dataset indo4b
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#181 - Create dataset UIT-VSMEC
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 4 comments
Labels: data catalog, language modeling script
#169 - Create dataset wit_ted_talks
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 7 comments
Labels: data catalog, language modeling script
#158 - Create dataset cna_taiwan
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, need custodian permission
#156 - Create dataset arxiv
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 20 comments
Labels: help wanted, data catalog, data format, need data sourcing feedback, language modeling script
#152 - Create dataset african_union_website
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 1 comment
Labels: data catalog
#134 - Create dataset tokikom
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 3 comments
Labels: data catalog, need custodian permission
#131 - Create dataset a_million_news_headlines_abc_australia
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 6 comments
Labels: data catalog, language modeling script
#127 - Create dataset s2orc_the_semantic_scholar_open_research_corpus
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 10 comments
Labels: data catalog, language modeling script
#103 - Create dataset 100_days_of_covid_19_in_the_australian_twittersphere
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 8 comments
Labels: data catalog, need data sourcing feedback
#102 - Create dataset spotify_podcast_dataset
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 2 comments
Labels: data catalog
#100 - Create dataset el_colombiano
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog, need custodian permission
#98 - Create dataset aaj_tak
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog, need custodian permission
#97 - Create dataset ndltd_taiwan
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 5 comments
Labels: wontfix, data catalog, need custodian permission, need data sourcing feedback
#96 - Create dataset gaceta_parlamentaria_mexico
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 2 comments
Labels: data catalog
#95 - Create dataset kumparan_com
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog
#93 - Create dataset the_times_of_india
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog
#92 - Create dataset african_story_book_emakhuwa
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog
#91 - Create dataset wikpiedia_in_chinese
Issue -
State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog
#90 - Create dataset samanantar_indic_parallel_corpora
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 2 comments
Labels: data catalog, language modeling script
#89 - Create dataset newspaper_in_basque
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 2 comments
Labels: data catalog
#88 - Collect data from Data Catalog
Issue -
State: open - Opened by albertvillanova almost 3 years ago
- 3 comments
Labels: data catalog
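Each entry in the listing above follows the same plain-text layout: a "#number - title" line, a kind line ("Issue -" or "Pull Request -"), a "State: ..." line, and optional comment-count and "Labels: ..." lines. The following is a minimal sketch of reading that layout into records, assuming the text is available exactly as shown on this page. The names Entry and parse_listing are hypothetical and are not part of the Ecosyste.ms API; the sketch works on the page text only and makes no network calls.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Entry:
    # Hypothetical record type for one listing entry (not an Ecosyste.ms type).
    number: int
    title: str
    kind: str = ""
    state: str = ""
    author: str = ""
    age: str = ""
    comments: int = 0
    labels: list = field(default_factory=list)


HEADER = re.compile(r"^#(\d+) - (.*)$")
STATE = re.compile(r"^State: (\w+) - Opened by (\S+) (.+ ago)$")
COMMENTS = re.compile(r"^- (\d+) comments?$")


def parse_listing(text):
    """Parse the plain-text listing layout into a list of Entry records."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER.match(line)
        if m:
            # A "#number - title" line starts a new entry.
            entries.append(Entry(int(m.group(1)), m.group(2)))
            continue
        if not entries:
            continue  # skip page header lines before the first entry
        cur = entries[-1]
        if line in ("Pull Request -", "Issue -"):
            cur.kind = line[:-2]  # drop the trailing " -"
        elif (m := STATE.match(line)):
            cur.state, cur.author, cur.age = m.groups()
        elif (m := COMMENTS.match(line)):
            cur.comments = int(m.group(1))
        elif line.startswith("Labels: "):
            cur.labels = [s.strip() for s in line[len("Labels: "):].split(",")]
    return entries


# Three entries copied from the listing above.
SAMPLE = """\
#416 - Reason for not applying remove_non_prining_characters normalization
Issue -
State: open - Opened by JoeyOhman over 2 years ago
- 1 comment
#378 - Create license-compliant version of the Pile: EuroParl
Issue -
State: closed - Opened by albertvillanova almost 3 years ago
- 1 comment
Labels: data catalog, language modeling script
#419 - Update train_all.sh
Pull Request -
State: open - Opened by chris-ha458 over 1 year ago
"""

entries = parse_listing(SAMPLE)
open_items = [e.number for e in entries if e.state == "open"]
print(open_items)  # [416, 419]
```

Entries without a comment line keep a count of 0, and entries without a "Labels:" line keep an empty list, which matches how the listing omits those lines.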