Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / allenai/dolma issues and pull requests
#226 - Question about running the mixer
Issue -
State: closed - Opened by mrqorib 5 days ago
- 1 comment
#225 - Generation of own dataset with Dolma Tokenizer CLI
Issue -
State: open - Opened by WenJett 6 days ago
- 5 comments
#224 - Fixed `ignore_existing` flag not working as expected.
Pull Request -
State: open - Opened by soldni about 1 month ago
- 1 comment
#223 - New language ID
Pull Request -
State: open - Opened by soldni about 1 month ago
#222 - Update reference to Phishing.Database.
Pull Request -
State: open - Opened by phishing-database-bot about 2 months ago
#221 - Typo in optional dependency (`lingua`) check
Pull Request -
State: closed - Opened by soldni 2 months ago
#220 - Is this a bug? `if not LANGDETECT_AVAILABLE:` in LinguaTagger
Issue -
State: closed - Opened by Nevermetyou65 2 months ago
- 1 comment
#219 - Adding support for Classifiers and Search tools
Pull Request -
State: open - Opened by soldni 3 months ago
#218 - dedupe_paragraphs didn't working
Issue -
State: closed - Opened by wannaphong 3 months ago
- 2 comments
#217 - PII didn't working
Issue -
State: closed - Opened by wannaphong 3 months ago
- 3 comments
#216 - Fixed issues and improved documentation in getting-started.md
Pull Request -
State: closed - Opened by aman-17 4 months ago
#215 - Mixer validator
Pull Request -
State: closed - Opened by mariia-iureva 4 months ago
- 4 comments
#214 - DCLM Style Deduplications
Pull Request -
State: open - Opened by revbucket 4 months ago
#213 - 'File partition' option and 'document' directory specification
Pull Request -
State: closed - Opened by Whattabatt 4 months ago
#212 - Mattj/requirements
Pull Request -
State: open - Opened by revbucket 4 months ago
#211 - Add requirements.txt?
Issue -
State: closed - Opened by revbucket 4 months ago
- 1 comment
#210 - DNM: Patch FT Tagger
Pull Request -
State: open - Opened by undfined 4 months ago
#209 - FastText training data
Issue -
State: open - Opened by msaebi1993 4 months ago
#208 - Also pin maturin in action
Pull Request -
State: closed - Opened by undfined 4 months ago
#207 - Pin maturin in CI
Pull Request -
State: closed - Opened by undfined 4 months ago
#206 - Bump version with postfix for PyPI
Pull Request -
State: closed - Opened by undfined 4 months ago
#205 - Adds dclm fasttext classifier
Pull Request -
State: closed - Opened by undfined 4 months ago
#204 - Undfined/runner v3
Pull Request -
State: closed - Opened by undfined 4 months ago
#203 - Revert upload/download to v3 for now
Pull Request -
State: closed - Opened by undfined 4 months ago
#202 - Dependabot fail, match upload/download action versions
Pull Request -
State: closed - Opened by undfined 4 months ago
#201 - after pip install dolma, use the dolma--help, return zipfile.BadZipFile: File is not a zip file
Issue -
State: closed - Opened by XiaozhuLove 4 months ago
- 3 comments
#200 - Polymorphic span replacement
Pull Request -
State: closed - Opened by undfined 4 months ago
#199 - Mixer progress tracking
Issue -
State: open - Opened by aetting 5 months ago
#198 - Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows in the github_actions group across 1 directory
Pull Request -
State: closed - Opened by dependabot[bot] 5 months ago
Labels: dependencies, github_actions
#197 - Fix bug in length filtering for deduping
Pull Request -
State: closed - Opened by soldni 5 months ago
#196 - [Json fooramt error in line 133] Update getting-started.md
Pull Request -
State: closed - Opened by yushengsu-thu 5 months ago
#195 - Use Numpy v1.x instead of 2.x
Pull Request -
State: closed - Opened by soldni 5 months ago
#194 - [Issue: `dolma.core.errors.DolmaFatalError` in Step 1: Run Taggers]
Issue -
State: closed - Opened by yushengsu-thu 5 months ago
- 2 comments
#193 - Update getting-started.md
Pull Request -
State: closed - Opened by yushengsu-thu 5 months ago
- 1 comment
#192 - [Dolma Tutorial (https://allenai.github.io/docs): 707NotFound]
Issue -
State: open - Opened by yushengsu-thu 5 months ago
- 2 comments
#191 - Changing Formatter
Pull Request -
State: closed - Opened by soldni 5 months ago
#190 - Allow specifying different bins for visualization and computation.
Pull Request -
State: open - Opened by soldni 5 months ago
#189 - Added tokenizers for length
Pull Request -
State: closed - Opened by soldni 5 months ago
#188 - Changed entrypoint to increase IntelliJ compatibility
Pull Request -
State: closed - Opened by soldni 5 months ago
#187 - Bump nltk from 3.8.1 to 3.9 in the pip group across 1 directory
Pull Request -
State: closed - Opened by dependabot[bot] 5 months ago
Labels: dependencies, python
#186 - Count Bytes and Docs
Pull Request -
State: closed - Opened by soldni 6 months ago
#185 - Fix local installation on MacOS
Pull Request -
State: closed - Opened by epwalsh 6 months ago
#184 - Fix Tests to pass with new mixer behavior
Pull Request -
State: closed - Opened by soldni 6 months ago
#183 - Always use inferred extension
Pull Request -
State: closed - Opened by undfined 6 months ago
#182 - Bump to 1.0.7
Pull Request -
State: closed - Opened by undfined 6 months ago
#181 - V2 of Gopher tagger
Pull Request -
State: closed - Opened by soldni 6 months ago
#180 - Cherry pick zstd compressor
Pull Request -
State: closed - Opened by undfined 6 months ago
#179 - Version bump for new release (1.0.4)
Pull Request -
State: closed - Opened by soldni 6 months ago
#178 - Bump openssl from 0.10.64 to 0.10.66 in the cargo group
Pull Request -
State: closed - Opened by dependabot[bot] 6 months ago
Labels: dependencies
#177 - Is there explicitly instruction-following data in the version of Dolma used to train Olmo v1?
Issue -
State: open - Opened by john-hewitt 7 months ago
- 3 comments
#176 - added option for tokenizer to split on special tokens
Pull Request -
State: closed - Opened by soldni 7 months ago
#175 - Issue with ring tokenizer
Issue -
State: closed - Opened by davidbrandfonbrener 7 months ago
- 1 comment
#174 - Deduplication / Decontamination
Issue -
State: closed - Opened by chschroeder 7 months ago
- 3 comments
#173 - Need help in customizing python/dolma/taggers/c4.py
Issue -
State: open - Opened by mihara-bot 8 months ago
#172 - Need clarification of Gopher in Step 2
Issue -
State: open - Opened by mihara-bot 8 months ago
#171 - Better Filters Error Handling
Pull Request -
State: closed - Opened by soldni 8 months ago
#170 - Adds ZST support in Deduper and Mixer
Pull Request -
State: closed - Opened by soldni 8 months ago
#169 - Workaround to fix memory leak in HuggingFace tokenizer
Pull Request -
State: closed - Opened by soldni 8 months ago
- 1 comment
#168 - Need help on accessing the raw reddit data
Issue -
State: closed - Opened by Jianxin-MNM 8 months ago
- 1 comment
#167 - PyPI release
Issue -
State: open - Opened by baberabb 8 months ago
#166 - dedupe.documents.attribute_name does not work
Issue -
State: open - Opened by mathCrazyy 8 months ago
- 1 comment
#165 - New Progress Bar, Backoff, Batching
Pull Request -
State: open - Opened by soldni 8 months ago
#164 - New Progress Bar code
Pull Request -
State: closed - Opened by soldni 8 months ago
#163 - Adding Quality Classifier from Dolma 1.7
Pull Request -
State: closed - Opened by soldni 9 months ago
#162 - Clarification Needed on "C4 NoPunc" in Data Processing
Issue -
State: closed - Opened by codefly13 9 months ago
- 1 comment
#161 - Adding partition logic
Pull Request -
State: closed - Opened by Whattabatt 9 months ago
#160 - Warc Backoff
Pull Request -
State: open - Opened by soldni 9 months ago
#159 - [EXPERIMENT ONLY, NOT FOR MERGING] Exporting First 200 Text
Pull Request -
State: closed - Opened by power10dan 9 months ago
- 1 comment
#158 - Need help for installing dolma
Issue -
State: closed - Opened by mihara-bot 9 months ago
- 3 comments
#157 - Duplicate ids in Dolma v1.7
Issue -
State: open - Opened by Vedaad-Shakib 9 months ago
#156 - Reducing hash calls
Pull Request -
State: closed - Opened by Whattabatt 9 months ago
#155 - Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory
Pull Request -
State: closed - Opened by dependabot[bot] 9 months ago
Labels: dependencies
#154 - Fixing dtype option not being correctly propagated
Pull Request -
State: closed - Opened by soldni 9 months ago
#153 - Add support for parsing WARC
Pull Request -
State: closed - Opened by soldni 9 months ago
#152 - dtype option is not working as expected
Issue -
State: closed - Opened by tokenizer-decode 9 months ago
- 1 comment
#151 - Inquiry about Web Pipeline Availability
Issue -
State: open - Opened by codefly13 9 months ago
- 2 comments
#150 - Running paragraph level deduplication on c4
Issue -
State: open - Opened by andrewhojel 10 months ago
- 2 comments
#149 - Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory
Pull Request -
State: closed - Opened by dependabot[bot] 10 months ago
Labels: dependencies
#148 - fix divide by 0 in gopher tagger
Pull Request -
State: closed - Opened by peterbjorgensen 10 months ago
#147 - Bump s3 client lib and parameterize region in s3 tests + devcontainer
Pull Request -
State: closed - Opened by undfined 10 months ago
#146 - Bump h2 from 0.3.24 to 0.3.26 in the cargo group across 1 directory
Pull Request -
State: closed - Opened by dependabot[bot] 10 months ago
- 1 comment
Labels: dependencies
#145 - Add extra tests for multi-byte unicode spans in deduper.
Pull Request -
State: closed - Opened by soldni 10 months ago
#144 - Optionally add total/sum to output of analyzer
Pull Request -
State: closed - Opened by soldni 10 months ago
#143 - S3 mixer doesn't start
Issue -
State: open - Opened by marcopasqua 10 months ago
- 2 comments
#142 - Data out of bounds when using ‘dolma tokens --dtype uint32’
Issue -
State: open - Opened by Jackwaterveg 10 months ago
- 1 comment
#141 - Add an option to improve tokenization shuffling
Pull Request -
State: closed - Opened by soldni 10 months ago
#140 - Fix local shuffling failure
Pull Request -
State: closed - Opened by soldni 11 months ago
#139 - Possible bug in `local_shuffle`?
Issue -
State: closed - Opened by hwijeen 11 months ago
- 2 comments
#138 - Some race condition in url taggers
Issue -
State: open - Opened by peterbjorgensen 11 months ago
- 1 comment
#137 - use precompiled regex when loading url blocklists
Pull Request -
State: closed - Opened by peterbjorgensen 11 months ago
#136 - Is there a way to intergratge Dolma toolkit to Spark?
Issue -
State: open - Opened by DangoWang 11 months ago
- 1 comment
#135 - Improves tool to compute statistics; adds deduplication options.
Pull Request -
State: closed - Opened by soldni 11 months ago
#134 - A Question about the meaning of dolma_v1.6_cc_en
Issue -
State: closed - Opened by aleien95 11 months ago
- 1 comment
#133 - Added JQ syntax for replacements + added minimum score.
Pull Request -
State: closed - Opened by soldni 11 months ago
#132 - Bump the cargo group group with 1 update
Pull Request -
State: closed - Opened by dependabot[bot] 11 months ago
Labels: dependencies
#131 - Added Support for JQ syntax in include/exclude mixer config
Pull Request -
State: closed - Opened by soldni 11 months ago
#130 - Support providing streams into mixer via CLI
Issue -
State: open - Opened by soldni 11 months ago
Labels: enhancement
#129 - Tagger modules import (fix for #128)
Pull Request -
State: closed - Opened by soldni 11 months ago
#128 - tagger_modules do not work in current git version
Issue -
State: closed - Opened by peterbjorgensen 12 months ago
- 1 comment
#127 - Can I use the dolma toolkit to process my own datasets?
Issue -
State: closed - Opened by Tendo33 12 months ago
- 1 comment