Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / mlfoundations/dclm issues and pull requests
#104 - Dataset for training 1B-5x model in Table 1
Issue -
State: open - Opened by LeoXinhaoLee about 2 months ago
#103 - denied access while copying shards from aws s3 bucket
Issue -
State: open - Opened by emirkaan5 2 months ago
- 3 comments
#102 - DCLM-RW subsets and other documentation additions
Pull Request -
State: closed - Opened by jeffreywpli 2 months ago
#101 - Searching DCLM-baseline
Issue -
State: open - Opened by chtmp223 2 months ago
#100 - Could you please release the 8.2B token for the 400M-1x setting?
Issue -
State: closed - Opened by xszheng2020 3 months ago
- 1 comment
#99 - bloom filter occupied 90 % of memory on server with 836Gb available ram
Issue -
State: open - Opened by ethany21 3 months ago
#98 - Update README.md
Pull Request -
State: closed - Opened by revbucket 3 months ago
#97 - Cannot train
Issue -
State: closed - Opened by camilobrownpinilla 3 months ago
- 1 comment
#96 - Improved Documentation, Rust Tokenize Shuffle
Pull Request -
State: closed - Opened by afang-story 3 months ago
#95 - Build my own model and use DCLM-1B training script and dataset.
Issue -
State: open - Opened by windbar778 3 months ago
#94 - more details to the documentation of data preprocessing
Pull Request -
State: closed - Opened by Mivg 3 months ago
#93 - Can bff read file formats other than jsonl?
Issue -
State: closed - Opened by ethany21 4 months ago
- 2 comments
#92 - Fix typos
Pull Request -
State: closed - Opened by Muennighoff 4 months ago
#91 - Understanding the pool sizes at each scale and the DCLM baseline
Issue -
State: closed - Opened by ameyagodbole 4 months ago
- 4 comments
#90 - How can I calculate expected-ngram-count?
Issue -
State: closed - Opened by ethany21 4 months ago
- 1 comment
#89 - add eval heavy results for MATES in the 1B-1x setting
Pull Request -
State: closed - Opened by yuzc19 4 months ago
- 3 comments
#88 - tokenization memory usage
Issue -
State: open - Opened by brian-ham 4 months ago
- 1 comment
#87 - How to download pools for smaller scale tracks
Issue -
State: closed - Opened by arnavmdas 4 months ago
- 1 comment
#86 - Using Evaluation Prompts to Inform Data Selection
Issue -
State: closed - Opened by arnavmdas 4 months ago
- 1 comment
#85 - Reproducing experiments in the paper
Issue -
State: closed - Opened by normster 4 months ago
- 2 comments
#84 - Update README.md -- naive-both -> old-both
Pull Request -
State: closed - Opened by revbucket 5 months ago
#83 - bff deduplication removes >90% data with NaiveBoth remove type
Issue -
State: closed - Opened by XirenZhou 5 months ago
- 2 comments
#82 - Documentation updates
Pull Request -
State: closed - Opened by GeorgiosSmyrnis 5 months ago
- 1 comment
#81 - tokenize and shuffle test
Pull Request -
State: closed - Opened by dhgottesman 5 months ago
#80 - Cannot Interpret result of bff deduplication
Issue -
State: closed - Opened by ethany21 5 months ago
- 2 comments
#79 - Unable to ray up (part 2)
Issue -
State: closed - Opened by tonychenxyz 5 months ago
- 4 comments
#78 - fasttext cannot be found
Issue -
State: closed - Opened by tonychenxyz 5 months ago
- 7 comments
#77 - Availability of DCLM data mixes used in figure 3
Issue -
State: open - Opened by IanMagnusson 5 months ago
- 4 comments
#76 - How do we just download the data necessary to enter competition?
Issue -
State: closed - Opened by davidbrandfonbrener 5 months ago
- 2 comments
#75 - The dataset for training fastText OH-2.5 +ELI5 text classifier
Issue -
State: open - Opened by yqy2001 5 months ago
- 3 comments
#74 - Training data of model-based filtering
Issue -
State: open - Opened by Yu-Shi 5 months ago
- 3 comments
#73 - added missing train_fasttext_classifier.py file
Pull Request -
State: closed - Opened by Mivg 5 months ago
#72 - Missing train_fasttext_classifier.py
Issue -
State: closed - Opened by yuzc19 6 months ago
- 2 comments
#71 - deduplication removes 98% of my data
Issue -
State: open - Opened by Yu-Shi 6 months ago
- 2 comments
#70 - Need multi-node training script example
Issue -
State: closed - Opened by LeoXinhaoLee 6 months ago
- 2 comments
#69 - Unable to ray up
Issue -
State: closed - Opened by tonychenxyz 6 months ago
- 8 comments
#68 - What is the pretrain scripts?
Issue -
State: open - Opened by mathfinder 6 months ago
- 12 comments
#67 - buffer write is so slow
Issue -
State: closed - Opened by Yu-Shi 6 months ago
- 1 comment
#66 - TypeError: Couldn't cast array of type
Issue -
State: open - Opened by shizhediao 6 months ago
- 2 comments
#65 - Training on data with a fixed order
Issue -
State: closed - Opened by Yu-Shi 6 months ago
- 2 comments
#64 - CommonCrawl WARC files for building mlfoundations/dclm-pool-400m-1x
Issue -
State: closed - Opened by Pab1x 6 months ago
- 2 comments
#63 - Missing "default_dataset_yaml" for tokenization
Issue -
State: closed - Opened by chenweize1998 6 months ago
- 2 comments
#62 - Training crashes after some steps
Issue -
State: closed - Opened by Yu-Shi 6 months ago
- 8 comments
#61 - About the `--num_checkpoints` argument in pretraining
Issue -
State: closed - Opened by Yu-Shi 6 months ago
- 2 comments
#60 - Add instructions for setup.py
Pull Request -
State: closed - Opened by GeorgiosSmyrnis 6 months ago
#59 - Any plans to release pools after refinedweb heuristic filtering + dedup?
Issue -
State: open - Opened by CodeCreator 6 months ago
- 9 comments
#58 - Add architecture study CSVs
Pull Request -
State: closed - Opened by GeorgiosSmyrnis 6 months ago
#57 - Update README.md
Pull Request -
State: closed - Opened by RyanMarten 6 months ago
#56 - README and eval updates
Pull Request -
State: closed - Opened by Mivg 6 months ago
#55 - Missing baselines/mappers/banlists/refinedweb_banned_domains_curated.txt
Issue -
State: closed - Opened by yuzc19 6 months ago
- 2 comments
#54 - Dedup methods
Issue -
State: open - Opened by ch-shin 6 months ago
- 1 comment
#53 - Does a Higher `fasttext_oh_eli5_vs_rw_v2_prob` Indicate Better Data Quality?
Issue -
State: closed - Opened by huyiwen 6 months ago
- 1 comment
#52 - Local data processing (non-AWS)
Issue -
State: closed - Opened by ryoungj 6 months ago
- 1 comment
#51 - Added instructions for finetuning
Pull Request -
State: closed - Opened by GeorgiosSmyrnis 6 months ago
- 1 comment
#50 - Release of Trained Models on DCLM-Baseline
Issue -
State: closed - Opened by m1k2zoo 7 months ago
- 1 comment
#49 - The lack of the experimental environment introduction in the paper
Issue -
State: closed - Opened by flyflypeng 7 months ago
- 4 comments
#48 - Training variance
Issue -
State: closed - Opened by ttccxx 7 months ago
- 6 comments
#47 - Example command using ray_processing/process.py
Issue -
State: closed - Opened by tonychenxyz 7 months ago
- 4 comments
#46 - What is the pearson correlation in lighteval scores between 1B/400M model and 7B model?
Issue -
State: closed - Opened by ZefanW 7 months ago
- 1 comment
#45 - Is the model architecture of DCLM different from LLaMA?
Issue -
State: closed - Opened by czczup 7 months ago
- 2 comments
#44 - Instruction on DCLM-Baseline reproduction and Filtering track
Issue -
State: closed - Opened by chenweize1998 7 months ago
- 5 comments
#43 - migrate bff deduplication documentation (v1)
Pull Request -
State: closed - Opened by jeffreywpli 7 months ago
#42 - missing fasttext_filter yaml
Pull Request -
State: closed - Opened by jeffreywpli 7 months ago
#41 - Missing training model configs
Issue -
State: closed - Opened by ch-shin 7 months ago
- 1 comment
#40 - Missing FastText Config File
Issue -
State: closed - Opened by purefall 7 months ago
- 1 comment
#39 - ported fasttext code and fixes
Pull Request -
State: closed - Opened by Mivg 7 months ago
#38 - sanitized s3 paths
Pull Request -
State: closed - Opened by Mivg 7 months ago
#37 - Getting path issue when trying to load language model
Issue -
State: closed - Opened by humzaiqbal 7 months ago
- 2 comments
#36 - Feature/readme updates
Pull Request -
State: closed - Opened by Mivg 7 months ago
#35 - added training configs
Pull Request -
State: closed - Opened by Mivg 7 months ago
#34 - How to train and fine-tuning model
Issue -
State: closed - Opened by Jackjiayou 7 months ago
- 1 comment
#33 - BFF code?
Issue -
State: closed - Opened by luludus 7 months ago
- 1 comment
#32 - Add fasttext documentation + Minor processor edits to handle fasttext inference
Pull Request -
State: closed - Opened by jeffreywpli 7 months ago
#31 - Missing files or bugs in evaluation code?
Issue -
State: closed - Opened by ch-shin 7 months ago
- 6 comments
#30 - Any web demo?
Issue -
State: closed - Opened by MontaEllis 7 months ago
- 2 comments
#29 - botocore.exceptions.NoCredentialsError: Unable to locate credentials
Issue -
State: closed - Opened by pearl-rabbit 7 months ago
- 1 comment
#28 - point curated banlist download to HF instead of s3
Pull Request -
State: closed - Opened by jeffreywpli 7 months ago
#27 - Missing scale configs?
Issue -
State: closed - Opened by ch-shin 7 months ago
- 1 comment
#26 - Update README.md to add Amber and Crystal from llm360
Pull Request -
State: open - Opened by TianhuaTao 7 months ago
- 1 comment
#25 - Skip hyperparams for nonexisting config dirs.
Pull Request -
State: closed - Opened by dwadden 7 months ago
#24 - Ray Actor dies during tokenization process
Issue -
State: closed - Opened by humzaiqbal 7 months ago
- 8 comments
#23 - Update c4.yaml.
Pull Request -
State: closed - Opened by GeorgiosSmyrnis 7 months ago
#22 - Unable to run `eval/eval_openlm_ckpt.py`
Issue -
State: closed - Opened by dwadden 7 months ago
- 10 comments
#21 - Would you share the 0.28T token dataset for achieve highest scores in 7B-2x experiment?
Issue -
State: closed - Opened by xinghuang2050 7 months ago
- 1 comment
#20 - ArrowConversionError when running tokenization
Issue -
State: closed - Opened by humzaiqbal 7 months ago
- 12 comments
#19 - Tokenization file missing
Issue -
State: closed - Opened by humzaiqbal 8 months ago
- 2 comments
#18 - Data download script
Issue -
State: closed - Opened by ch-shin 8 months ago
- 2 comments
#17 - Update README and add missing file.
Pull Request -
State: closed - Opened by GeorgiosSmyrnis 8 months ago
#16 - Accessing S3 bucket dcnlp-west
Issue -
State: closed - Opened by humzaiqbal 8 months ago
- 4 comments
#15 - Update import of GLOBAL_FUNCTIONS
Pull Request -
State: closed - Opened by humzaiqbal 8 months ago
- 2 comments
#14 - add eval heavy results for LLM360/CrystalChat and LLM360/CrystalCoder
Pull Request -
State: closed - Opened by TianhuaTao 8 months ago
#13 - add eval heavy results for LLM360/CrystalChat and LLM360/CrystalCoder
Pull Request -
State: closed - Opened by TianhuaTao 8 months ago
- 3 comments
#12 - Delete LICENSE.txt
Pull Request -
State: closed - Opened by Vaishaal 8 months ago
#11 - Causal Transformer for Perplexity
Issue -
State: closed - Opened by akshayg08 8 months ago
- 3 comments
#10 - How to find CORE, MMLU, EXTENDED values in the eval json?
Issue -
State: closed - Opened by pavel-denisov-fraunhofer 8 months ago
- 2 comments
#9 - Which data file correspond to table 4 fasttext?
Issue -
State: closed - Opened by SHUMKASHUN 8 months ago
- 7 comments
#8 - Request to DCLM-Pool
Issue -
State: closed - Opened by SAI990323 8 months ago
- 3 comments
#7 - Duplicated licenses
Issue -
State: closed - Opened by JorgeCepeda 8 months ago
- 1 comment
#6 - Update README.md
Pull Request -
State: closed - Opened by Vaishaal 8 months ago