bigscience-workshop/data_tooling issues and pull requests

#419 - Update train_all.sh

Pull Request - State: open - Opened by chris-ha458 over 1 year ago

#418 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: open - Opened by pre-commit-ci[bot] over 2 years ago

#417 - docstring not valid anymore

Pull Request - State: closed - Opened by HugoLaurencon over 2 years ago

#416 - Reason for not applying remove_non_prining_characters normalization

Issue - State: open - Opened by JoeyOhman over 2 years ago - 1 comment

#415 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] over 2 years ago

#414 - Citing this resource

Issue - State: open - Opened by yuvalkirstain over 2 years ago - 4 comments

#413 - add html lang detector

Pull Request - State: closed - Opened by SaulLu over 2 years ago

#412 - Improved Deduplication

Pull Request - State: closed - Opened by ChenghaoMou over 2 years ago - 2 comments

#411 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] over 2 years ago

#410 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] over 2 years ago

#409 - Dedup exact lines training tokenizer dataset

Pull Request - State: closed - Opened by SaulLu over 2 years ago

#408 - Apply cleaning: remove deduplicated exact lines

Pull Request - State: closed - Opened by SaulLu over 2 years ago - 1 comment

#407 - Add exact document deduplicate scripts

Pull Request - State: closed - Opened by SaulLu over 2 years ago

#406 - Updated Anonymization

Pull Request - State: closed - Opened by ianyu93 over 2 years ago

#405 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] almost 3 years ago

#404 - Updated apply_regex_anonymization

Pull Request - State: closed - Opened by ianyu93 almost 3 years ago

#403 - Flagged words in Bengali

Pull Request - State: closed - Opened by manandey almost 3 years ago - 1 comment

#402 - Updated ac_dc

Pull Request - State: closed - Opened by ianyu93 almost 3 years ago - 1 comment

#401 - Update flagged_words.py

Pull Request - State: closed - Opened by majauhar almost 3 years ago - 1 comment

#400 - Replacing pii_processing with muliwai

Pull Request - State: closed - Opened by huu4ontocord almost 3 years ago

#399 - Update spanish flagged words

Pull Request - State: closed - Opened by edugp almost 3 years ago - 1 comment

#398 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] almost 3 years ago

#397 - Pseudo crawl 2

Pull Request - State: closed - Opened by thomasw21 almost 3 years ago

#396 - Update Spanish flag words

Pull Request - State: closed - Opened by omarespejel almost 3 years ago - 1 comment

#395 - Final update for Indonesian list of flagged words

Pull Request - State: closed - Opened by jtboing almost 3 years ago - 1 comment

#394 - Update flagged_words.py for Catalan

Pull Request - State: closed - Opened by onadegibert almost 3 years ago - 3 comments

#393 - Update flagged_words.py for Portuguese

Pull Request - State: closed - Opened by ruinunca almost 3 years ago - 1 comment

#392 - further updates to flagged_words.py for indonesian

Pull Request - State: closed - Opened by jtboing almost 3 years ago - 1 comment

#391 - Remove some characters from Chinese bad word list

Pull Request - State: closed - Opened by JetRunner almost 3 years ago - 4 comments

#390 - Update flagged_words.py for Indonesian

Pull Request - State: closed - Opened by afaji almost 3 years ago - 5 comments

#389 - organize repo for language annotation

Pull Request - State: closed - Opened by SaulLu almost 3 years ago

#388 - add skeleton of slurms files for seeds batch 2

Pull Request - State: closed - Opened by SaulLu almost 3 years ago - 1 comment

#387 - move dataset folder seeds batch 1

Pull Request - State: closed - Opened by SaulLu almost 3 years ago

#386 - Refacto of the pseudo_crawl folder

Pull Request - State: closed - Opened by SaulLu almost 3 years ago

#385 - [WIP] add slurm and python files to extend the pseudo crawl dataset with the seeds of batch 2

Pull Request - State: closed - Opened by SaulLu almost 3 years ago - 1 comment

#384 - Reorganization folder to host the batch 2 of seeds to retrieve for the pseudo crawl dataset

Pull Request - State: closed - Opened by SaulLu almost 3 years ago

#383 - Pseudo crawl dataset creation

Pull Request - State: closed - Opened by thomasw21 almost 3 years ago

#382 - add files to compute basic stats on pseudo crawl dataset

Pull Request - State: open - Opened by SaulLu almost 3 years ago - 3 comments

#381 - [pre-commit.ci] pre-commit autoupdate

Pull Request - State: closed - Opened by pre-commit-ci[bot] almost 3 years ago

#380 - Updated flagged_words.py

Pull Request - State: closed - Opened by abumafrim almost 3 years ago - 1 comment

#379 - Update flagged words for vietnamese

Pull Request - State: closed - Opened by Luvata almost 3 years ago - 1 comment

#378 - Create license-compliant version of the Pile: EuroParl

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 1 comment
Labels: data catalog, language modeling script

#377 - add push to hub slurm script

Pull Request - State: closed - Opened by SaulLu almost 3 years ago

#376 - Create license-compliant version of the Pile: Stack Exchange

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 1 comment
Labels: data catalog, language modeling script

#375 - Empty Basque flagged words.

Pull Request - State: closed - Opened by asoroa almost 3 years ago - 3 comments

#374 - Update flagged_words.py

Pull Request - State: closed - Opened by majauhar almost 3 years ago - 5 comments

#373 - Update flagged_words.py

Pull Request - State: closed - Opened by majauhar almost 3 years ago - 1 comment

#372 - Checked existing words + Added more words

Pull Request - State: closed - Opened by hbenyamina almost 3 years ago - 2 comments

#371 - Adding initial set of flagged word in Tamil

Pull Request - State: closed - Opened by reshinthadithyan almost 3 years ago - 1 comment

#369 - Update flagged_words.py

Pull Request - State: closed - Opened by sashavor almost 3 years ago - 2 comments

#351 - Create dataset indonesian_news_articles_2017

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#313 - Add code to train KenLM models on Wikipedia and OSCAR

Pull Request - State: closed - Opened by edugp almost 3 years ago

#310 - Create license-compliant version of the Pile: Enron Emails

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 2 comments
Labels: data catalog, language modeling script

#301 - Create license-compliant version of the Pile: subsets

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog

#297 - Create license-compliant version of the Pile: USPTO

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 1 comment
Labels: data catalog, language modeling script

#294 - Create dataset Shamela

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 13 comments
Labels: duplicate, wontfix, data catalog, need data sourcing feedback

#292 - Create dataset OSAC

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#291 - Create dataset Habibi

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, language modeling script

#289 - Create dataset LABR

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 3 comments
Labels: data catalog, language modeling script

#288 - Create dataset MultiUN v2

Issue - State: open - Opened by albertvillanova almost 3 years ago - 7 comments
Labels: data catalog, language modeling script

#285 - Create dataset SANAD

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, language modeling script

#284 - Create dataset QADI Arabic

Issue - State: open - Opened by albertvillanova almost 3 years ago - 5 comments
Labels: data catalog

#280 - Create dataset arabic billion words

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 3 comments
Labels: data catalog, language modeling script

#272 - Topic classification for AC_DC

Issue - State: closed - Opened by huu4ontocord almost 3 years ago
Labels: tooling, metadata

#266 - Single Sign On From BigScience/Umbrellabird to Data Hosts Trusted Data Environment

Issue - State: closed - Opened by huu4ontocord almost 3 years ago
Labels: tooling

#259 - Create dataset vietnamese_poetry_from_fsoft_ai_lab

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#254 - Create dataset vietnamese_MT_EV_VLSP2020

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, language modeling script

#253 - Create dataset unsupervised_cross_lingual_representation_learning_at_scale

Issue - State: open - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#232 - Create dataset british_library_hertiage_made_digital_newspapers

Issue - State: open - Opened by albertvillanova almost 3 years ago - 16 comments
Labels: help wanted, data catalog, data format

#230 - Create dataset galileo_open_learning_materials

Issue - State: open - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, data format, language modeling script

#228 - Create dataset hindi_wikipedia_articles

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#226 - Create dataset viquiquad__an_extractive_qa_dataset_from_catalan_wikipedia

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#225 - Create dataset hal_archives_ouvertes

Issue - State: open - Opened by albertvillanova almost 3 years ago - 3 comments
Labels: data catalog

#221 - Create dataset bengali_question_answering_dataset

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 5 comments
Labels: data catalog, language modeling script

#216 - Create dataset indonesian_news_corpus

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 2 comments
Labels: data catalog, language modeling script

#212 - Create dataset bangla_sentiment_classification_datasets

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#202 - Create dataset multilingual_knowledge_questions_answers

Issue - State: open - Opened by albertvillanova almost 3 years ago - 5 comments
Labels: data catalog, language modeling script

#198 - Create dataset bloom_library

Issue - State: open - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, need custodian permission, language modeling script

#189 - Create dataset indo4b

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#181 - Create dataset UIT-VSMEC

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 4 comments
Labels: data catalog, language modeling script

#169 - Create dataset wit_ted_talks

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 7 comments
Labels: data catalog, language modeling script

#158 - Create dataset cna_taiwan

Issue - State: open - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, need custodian permission

#156 - Create dataset arxiv

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 20 comments
Labels: help wanted, data catalog, data format, need data sourcing feedback, language modeling script

#152 - Create dataset african_union_website

Issue - State: open - Opened by albertvillanova almost 3 years ago - 1 comment
Labels: data catalog

#134 - Create dataset tokikom

Issue - State: open - Opened by albertvillanova almost 3 years ago - 3 comments
Labels: data catalog, need custodian permission

#131 - Create dataset a_million_news_headlines_abc_australia

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 6 comments
Labels: data catalog, language modeling script

#127 - Create dataset s2orc_the_semantic_scholar_open_research_corpus

Issue - State: open - Opened by albertvillanova almost 3 years ago - 10 comments
Labels: data catalog, language modeling script

#103 - Create dataset 100_days_of_covid_19_in_the_australian_twittersphere

Issue - State: open - Opened by albertvillanova almost 3 years ago - 8 comments
Labels: data catalog, need data sourcing feedback

#102 - Create dataset spotify_podcast_dataset

Issue - State: open - Opened by albertvillanova almost 3 years ago - 2 comments
Labels: data catalog

#100 - Create dataset el_colombiano

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog, need custodian permission

#98 - Create dataset aaj_tak

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog, need custodian permission

#97 - Create dataset ndltd_taiwan

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 5 comments
Labels: wontfix, data catalog, need custodian permission, need data sourcing feedback

#96 - Create dataset gaceta_parlamentaria_mexico

Issue - State: open - Opened by albertvillanova almost 3 years ago - 2 comments
Labels: data catalog

#95 - Create dataset kumparan_com

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog

#93 - Create dataset the_times_of_india

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog

#92 - Create dataset african_story_book_emakhuwa

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog

#91 - Create dataset wikpiedia_in_chinese

Issue - State: open - Opened by albertvillanova almost 3 years ago
Labels: data catalog

#90 - Create dataset samanantar_indic_parallel_corpora

Issue - State: closed - Opened by albertvillanova almost 3 years ago - 2 comments
Labels: data catalog, language modeling script

#89 - Create dataset newspaper_in_basque

Issue - State: open - Opened by albertvillanova almost 3 years ago - 2 comments
Labels: data catalog

#88 - Collect data from Data Catalog

Issue - State: open - Opened by albertvillanova almost 3 years ago - 3 comments
Labels: data catalog
