mlfoundations/dclm issues and pull requests

#104 - Dataset for training 1B-5x model in Table 1

Issue - State: open - Opened by LeoXinhaoLee about 2 months ago

#103 - denied access while copying shards from aws s3 bucket

Issue - State: open - Opened by emirkaan5 2 months ago - 3 comments

#102 - DCLM-RW subsets and other documentation additions

Pull Request - State: closed - Opened by jeffreywpli 2 months ago

#101 - Searching DCLM-baseline

Issue - State: open - Opened by chtmp223 2 months ago

#100 - Could you please release the 8.2B token for the 400M-1x setting?

Issue - State: closed - Opened by xszheng2020 3 months ago - 1 comment

#99 - bloom filter occupied 90 % of memory on server with 836Gb available ram

Issue - State: open - Opened by ethany21 3 months ago

#98 - Update README.md

Pull Request - State: closed - Opened by revbucket 3 months ago

#97 - Cannot train

Issue - State: closed - Opened by camilobrownpinilla 3 months ago - 1 comment

#96 - Improved Documentation, Rust Tokenize Shuffle

Pull Request - State: closed - Opened by afang-story 3 months ago

#95 - Build my own model and use DCLM-1B training script and dataset.

Issue - State: open - Opened by windbar778 3 months ago

#94 - more details to the documentation of data preprocessing

Pull Request - State: closed - Opened by Mivg 3 months ago

#93 - Can bff read file formats other than jsonl?

Issue - State: closed - Opened by ethany21 4 months ago - 2 comments

#92 - Fix typos

Pull Request - State: closed - Opened by Muennighoff 4 months ago

#91 - Understanding the pool sizes at each scale and the DCLM baseline

Issue - State: closed - Opened by ameyagodbole 4 months ago - 4 comments

#90 - How can I calculate expected-ngram-count?

Issue - State: closed - Opened by ethany21 4 months ago - 1 comment

#89 - add eval heavy results for MATES in the 1B-1x setting

Pull Request - State: closed - Opened by yuzc19 4 months ago - 3 comments

#88 - tokenization memory usage

Issue - State: open - Opened by brian-ham 4 months ago - 1 comment

#87 - How to download pools for smaller scale tracks

Issue - State: closed - Opened by arnavmdas 4 months ago - 1 comment

#86 - Using Evaluation Prompts to Inform Data Selection

Issue - State: closed - Opened by arnavmdas 4 months ago - 1 comment

#85 - Reproducing experiments in the paper

Issue - State: closed - Opened by normster 4 months ago - 2 comments

#84 - Update README.md -- naive-both -> old-both

Pull Request - State: closed - Opened by revbucket 5 months ago

#83 - bff deduplication removes >90% data with NaiveBoth remove type

Issue - State: closed - Opened by XirenZhou 5 months ago - 2 comments

#82 - Documentation updates

Pull Request - State: closed - Opened by GeorgiosSmyrnis 5 months ago - 1 comment

#81 - tokenize and shuffle test

Pull Request - State: closed - Opened by dhgottesman 5 months ago

#80 - Cannot Interpret result of bff deduplication

Issue - State: closed - Opened by ethany21 5 months ago - 2 comments

#79 - Unable to ray up (part 2)

Issue - State: closed - Opened by tonychenxyz 5 months ago - 4 comments

#78 - fasttext cannot be found

Issue - State: closed - Opened by tonychenxyz 5 months ago - 7 comments

#77 - Availability of DCLM data mixes used in figure 3

Issue - State: open - Opened by IanMagnusson 5 months ago - 4 comments

#76 - How do we just download the data necessary to enter competition?

Issue - State: closed - Opened by davidbrandfonbrener 5 months ago - 2 comments

#75 - The dataset for training fastText OH-2.5 +ELI5 text classifier

Issue - State: open - Opened by yqy2001 5 months ago - 3 comments

#74 - Training data of model-based filtering

Issue - State: open - Opened by Yu-Shi 5 months ago - 3 comments

#73 - added missing train_fasttext_classifier.py file

Pull Request - State: closed - Opened by Mivg 5 months ago

#72 - Missing train_fasttext_classifier.py

Issue - State: closed - Opened by yuzc19 6 months ago - 2 comments

#71 - deduplication removes 98% of my data

Issue - State: open - Opened by Yu-Shi 6 months ago - 2 comments

#70 - Need multi-node training script example

Issue - State: closed - Opened by LeoXinhaoLee 6 months ago - 2 comments

#69 - Unable to ray up

Issue - State: closed - Opened by tonychenxyz 6 months ago - 8 comments

#68 - What is the pretrain scripts?

Issue - State: open - Opened by mathfinder 6 months ago - 12 comments

#67 - buffer write is so slow

Issue - State: closed - Opened by Yu-Shi 6 months ago - 1 comment

#66 - TypeError: Couldn't cast array of type

Issue - State: open - Opened by shizhediao 6 months ago - 2 comments

#65 - Training on data with a fixed order

Issue - State: closed - Opened by Yu-Shi 6 months ago - 2 comments

#64 - CommonCrawl WARC files for building mlfoundations/dclm-pool-400m-1x

Issue - State: closed - Opened by Pab1x 6 months ago - 2 comments

#63 - Missing "default_dataset_yaml" for tokenization

Issue - State: closed - Opened by chenweize1998 6 months ago - 2 comments

#62 - Training crashes after some steps

Issue - State: closed - Opened by Yu-Shi 6 months ago - 8 comments

#61 - About the `--num_checkpoints` argument in pretraining

Issue - State: closed - Opened by Yu-Shi 6 months ago - 2 comments

#60 - Add instructions for setup.py

Pull Request - State: closed - Opened by GeorgiosSmyrnis 6 months ago

#59 - Any plans to release pools after refinedweb heuristic filtering + dedup?

Issue - State: open - Opened by CodeCreator 6 months ago - 9 comments

#58 - Add architecture study CSVs

Pull Request - State: closed - Opened by GeorgiosSmyrnis 6 months ago

#57 - Update README.md

Pull Request - State: closed - Opened by RyanMarten 6 months ago

#56 - README and eval updates

Pull Request - State: closed - Opened by Mivg 6 months ago

#55 - Missing baselines/mappers/banlists/refinedweb_banned_domains_curated.txt

Issue - State: closed - Opened by yuzc19 6 months ago - 2 comments

#54 - Dedup methods

Issue - State: open - Opened by ch-shin 6 months ago - 1 comment

#53 - Does a Higher `fasttext_oh_eli5_vs_rw_v2_prob` Indicate Better Data Quality?

Issue - State: closed - Opened by huyiwen 6 months ago - 1 comment

#52 - Local data processing (non-AWS)

Issue - State: closed - Opened by ryoungj 6 months ago - 1 comment

#51 - Added instructions for finetuning

Pull Request - State: closed - Opened by GeorgiosSmyrnis 6 months ago - 1 comment

#50 - Release of Trained Models on DCLM-Baseline

Issue - State: closed - Opened by m1k2zoo 7 months ago - 1 comment

#49 - The lack of the experimental environment introduction in the paper

Issue - State: closed - Opened by flyflypeng 7 months ago - 4 comments

#48 - Training variance

Issue - State: closed - Opened by ttccxx 7 months ago - 6 comments

#47 - Example command using ray_processing/process.py

Issue - State: closed - Opened by tonychenxyz 7 months ago - 4 comments

#46 - What is the pearson correlation in lighteval scores between 1B/400M model and 7B model?

Issue - State: closed - Opened by ZefanW 7 months ago - 1 comment

#45 - Is the model architecture of DCLM different from LLaMA?

Issue - State: closed - Opened by czczup 7 months ago - 2 comments

#44 - Instruction on DCLM-Baseline reproduction and Filtering track

Issue - State: closed - Opened by chenweize1998 7 months ago - 5 comments

#43 - migrate bff deduplication documentation (v1)

Pull Request - State: closed - Opened by jeffreywpli 7 months ago

#42 - missing fasttext_filter yaml

Pull Request - State: closed - Opened by jeffreywpli 7 months ago

#41 - Missing training model configs

Issue - State: closed - Opened by ch-shin 7 months ago - 1 comment

#40 - Missing FastText Config File

Issue - State: closed - Opened by purefall 7 months ago - 1 comment

#39 - ported fasttext code and fixes

Pull Request - State: closed - Opened by Mivg 7 months ago

#38 - sanitized s3 paths

Pull Request - State: closed - Opened by Mivg 7 months ago

#37 - Getting path issue when trying to load language model

Issue - State: closed - Opened by humzaiqbal 7 months ago - 2 comments

#36 - Feature/readme updates

Pull Request - State: closed - Opened by Mivg 7 months ago

#35 - added training configs

Pull Request - State: closed - Opened by Mivg 7 months ago

#34 - How to train and fine-tuning model

Issue - State: closed - Opened by Jackjiayou 7 months ago - 1 comment

#33 - BFF code？

Issue - State: closed - Opened by luludus 7 months ago - 1 comment

#32 - Add fasttext documentation + Minor processor edits to handle fasttext inference

Pull Request - State: closed - Opened by jeffreywpli 7 months ago

#31 - Missing files or bugs in evaluation code?

Issue - State: closed - Opened by ch-shin 7 months ago - 6 comments

#30 - Any web demo?

Issue - State: closed - Opened by MontaEllis 7 months ago - 2 comments

#29 - botocore.exceptions.NoCredentialsError: Unable to locate credentials

Issue - State: closed - Opened by pearl-rabbit 7 months ago - 1 comment

#28 - point curated banlist download to HF instead of s3

Pull Request - State: closed - Opened by jeffreywpli 7 months ago

#27 - Missing scale configs?

Issue - State: closed - Opened by ch-shin 7 months ago - 1 comment

#26 - Update README.md to add Amber and Crystal from llm360

Pull Request - State: open - Opened by TianhuaTao 7 months ago - 1 comment

#25 - Skip hyperparams for nonexisting config dirs.

Pull Request - State: closed - Opened by dwadden 7 months ago

#24 - Ray Actor dies during tokenization process

Issue - State: closed - Opened by humzaiqbal 7 months ago - 8 comments

#23 - Update c4.yaml.

Pull Request - State: closed - Opened by GeorgiosSmyrnis 7 months ago

#22 - Unable to run `eval/eval_openlm_ckpt.py`

Issue - State: closed - Opened by dwadden 7 months ago - 10 comments

#21 - Would you share the 0.28T token dataset for achieve highest scores in 7B-2x experiment?

Issue - State: closed - Opened by xinghuang2050 7 months ago - 1 comment

#20 - ArrowConversionError when running tokenization

Issue - State: closed - Opened by humzaiqbal 7 months ago - 12 comments

#19 - Tokenization file missing

Issue - State: closed - Opened by humzaiqbal 8 months ago - 2 comments

#18 - Data download script

Issue - State: closed - Opened by ch-shin 8 months ago - 2 comments

#17 - Update README and add missing file.

Pull Request - State: closed - Opened by GeorgiosSmyrnis 8 months ago

#16 - Accessing S3 bucket dcnlp-west

Issue - State: closed - Opened by humzaiqbal 8 months ago - 4 comments

#15 - Update import of GLOBAL_FUNCTIONS

Pull Request - State: closed - Opened by humzaiqbal 8 months ago - 2 comments

#14 - add eval heavy results for LLM360/CrystalChat and LLM360/CrystalCoder

Pull Request - State: closed - Opened by TianhuaTao 8 months ago

#13 - add eval heavy results for LLM360/CrystalChat and LLM360/CrystalCoder

Pull Request - State: closed - Opened by TianhuaTao 8 months ago - 3 comments

#12 - Delete LICENSE.txt

Pull Request - State: closed - Opened by Vaishaal 8 months ago

#11 - Causal Transformer for Perplexity

Issue - State: closed - Opened by akshayg08 8 months ago - 3 comments

#10 - How to find CORE, MMLU, EXTENDED values in the eval json?

Issue - State: closed - Opened by pavel-denisov-fraunhofer 8 months ago - 2 comments

#9 - Which data file correspond to table 4 fasttext?

Issue - State: closed - Opened by SHUMKASHUN 8 months ago - 7 comments

#8 - Request to DCLM-Pool

Issue - State: closed - Opened by SAI990323 8 months ago - 3 comments

#7 - Duplicated licenses

Issue - State: closed - Opened by JorgeCepeda 8 months ago - 1 comment

#6 - Update README.md

Pull Request - State: closed - Opened by Vaishaal 8 months ago
