Ecosyste.ms: Issues

An open API service that provides issue and pull request metadata for open source projects.

GitHub / allenai/dolma issues and pull requests

#226 - Question about running the mixer

Issue - State: closed - Opened by mrqorib 5 days ago - 1 comment

#225 - Generation of own dataset with Dolma Tokenizer CLI

Issue - State: open - Opened by WenJett 6 days ago - 5 comments

#224 - Fixed `ignore_existing` flag not working as expected.

Pull Request - State: open - Opened by soldni about 1 month ago - 1 comment

#223 - New language ID

Pull Request - State: open - Opened by soldni about 1 month ago

#222 - Update reference to Phishing.Database.

Pull Request - State: open - Opened by phishing-database-bot about 2 months ago

#221 - Typo in optional dependency (`lingua`) check

Pull Request - State: closed - Opened by soldni 2 months ago

#220 - Is this a bug? `if not LANGDETECT_AVAILABLE:` in LinguaTagger

Issue - State: closed - Opened by Nevermetyou65 2 months ago - 1 comment

#219 - Adding support for Classifiers and Search tools

Pull Request - State: open - Opened by soldni 3 months ago

#218 - dedupe_paragraphs didn't working

Issue - State: closed - Opened by wannaphong 3 months ago - 2 comments

#217 - PII didn't working

Issue - State: closed - Opened by wannaphong 3 months ago - 3 comments

#216 - Fixed issues and improved documentation in getting-started.md

Pull Request - State: closed - Opened by aman-17 4 months ago

#215 - Mixer validator

Pull Request - State: closed - Opened by mariia-iureva 4 months ago - 4 comments

#214 - DCLM Style Deduplications

Pull Request - State: open - Opened by revbucket 4 months ago

#213 - 'File partition' option and 'document' directory specification

Pull Request - State: closed - Opened by Whattabatt 4 months ago

#212 - Mattj/requirements

Pull Request - State: open - Opened by revbucket 4 months ago

#211 - Add requirements.txt?

Issue - State: closed - Opened by revbucket 4 months ago - 1 comment

#210 - DNM: Patch FT Tagger

Pull Request - State: open - Opened by undfined 4 months ago

#209 - FastText training data

Issue - State: open - Opened by msaebi1993 4 months ago

#208 - Also pin maturin in action

Pull Request - State: closed - Opened by undfined 4 months ago

#207 - Pin maturin in CI

Pull Request - State: closed - Opened by undfined 4 months ago

#206 - Bump version with postfix for PyPI

Pull Request - State: closed - Opened by undfined 4 months ago

#205 - Adds dclm fasttext classifier

Pull Request - State: closed - Opened by undfined 4 months ago

#204 - Undfined/runner v3

Pull Request - State: closed - Opened by undfined 4 months ago

#203 - Revert upload/download to v3 for now

Pull Request - State: closed - Opened by undfined 4 months ago

#202 - Dependabot fail, match upload/download action versions

Pull Request - State: closed - Opened by undfined 4 months ago

#200 - Polymorphic span replacement

Pull Request - State: closed - Opened by undfined 4 months ago

#199 - Mixer progress tracking

Issue - State: open - Opened by aetting 5 months ago

#197 - Fix bug in length filtering for deduping

Pull Request - State: closed - Opened by soldni 5 months ago

#196 - [Json fooramt error in line 133] Update getting-started.md

Pull Request - State: closed - Opened by yushengsu-thu 5 months ago

#195 - Use Numpy v1.x instead of 2.x

Pull Request - State: closed - Opened by soldni 5 months ago

#194 - [Issue: `dolma.core.errors.DolmaFatalError` in Step 1: Run Taggers]

Issue - State: closed - Opened by yushengsu-thu 5 months ago - 2 comments

#193 - Update getting-started.md

Pull Request - State: closed - Opened by yushengsu-thu 5 months ago - 1 comment

#192 - [Dolma Tutorial (https://allenai.github.io/docs): 707NotFound]

Issue - State: open - Opened by yushengsu-thu 5 months ago - 2 comments

#191 - Changing Formatter

Pull Request - State: closed - Opened by soldni 5 months ago

#189 - Added tokenizers for length

Pull Request - State: closed - Opened by soldni 5 months ago

#188 - Changed entrypoint to increase IntelliJ compatibility

Pull Request - State: closed - Opened by soldni 5 months ago

#187 - Bump nltk from 3.8.1 to 3.9 in the pip group across 1 directory

Pull Request - State: closed - Opened by dependabot[bot] 5 months ago
Labels: dependencies, python

#186 - Count Bytes and Docs

Pull Request - State: closed - Opened by soldni 6 months ago

#185 - Fix local installation on MacOS

Pull Request - State: closed - Opened by epwalsh 6 months ago

#184 - Fix Tests to pass with new mixer behavior

Pull Request - State: closed - Opened by soldni 6 months ago

#183 - Always use inferred extension

Pull Request - State: closed - Opened by undfined 6 months ago

#182 - Bump to 1.0.7

Pull Request - State: closed - Opened by undfined 6 months ago

#181 - V2 of Gopher tagger

Pull Request - State: closed - Opened by soldni 6 months ago

#180 - Cherry pick zstd compressor

Pull Request - State: closed - Opened by undfined 6 months ago

#179 - Version bump for new release (1.0.4)

Pull Request - State: closed - Opened by soldni 6 months ago

#178 - Bump openssl from 0.10.64 to 0.10.66 in the cargo group

Pull Request - State: closed - Opened by dependabot[bot] 6 months ago
Labels: dependencies

#176 - added option for tokenizer to split on special tokens

Pull Request - State: closed - Opened by soldni 7 months ago

#175 - Issue with ring tokenizer

Issue - State: closed - Opened by davidbrandfonbrener 7 months ago - 1 comment

#174 - Deduplication / Decontamination

Issue - State: closed - Opened by chschroeder 7 months ago - 3 comments

#172 - Need clarification of Gopher in Step 2

Issue - State: open - Opened by mihara-bot 8 months ago

#171 - Better Filters Error Handling

Pull Request - State: closed - Opened by soldni 8 months ago

#170 - Adds ZST support in Deduper and Mixer

Pull Request - State: closed - Opened by soldni 8 months ago

#169 - Workaround to fix memory leak in HuggingFace tokenizer

Pull Request - State: closed - Opened by soldni 8 months ago - 1 comment

#168 - Need help on accessing the raw reddit data

Issue - State: closed - Opened by Jianxin-MNM 8 months ago - 1 comment

#167 - PyPI release

Issue - State: open - Opened by baberabb 8 months ago

#166 - dedupe.documents.attribute_name does not work

Issue - State: open - Opened by mathCrazyy 8 months ago - 1 comment

#165 - New Progress Bar, Backoff, Batching

Pull Request - State: open - Opened by soldni 8 months ago

#164 - New Progress Bar code

Pull Request - State: closed - Opened by soldni 8 months ago

#163 - Adding Quality Classifier from Dolma 1.7

Pull Request - State: closed - Opened by soldni 9 months ago

#162 - Clarification Needed on "C4 NoPunc" in Data Processing

Issue - State: closed - Opened by codefly13 9 months ago - 1 comment

#161 - Adding partition logic

Pull Request - State: closed - Opened by Whattabatt 9 months ago

#160 - Warc Backoff

Pull Request - State: open - Opened by soldni 9 months ago

#159 - [EXPERIMENT ONLY, NOT FOR MERGING] Exporting First 200 Text

Pull Request - State: closed - Opened by power10dan 9 months ago - 1 comment

#158 - Need help for installing dolma

Issue - State: closed - Opened by mihara-bot 9 months ago - 3 comments

#157 - Duplicate ids in Dolma v1.7

Issue - State: open - Opened by Vedaad-Shakib 9 months ago

#156 - Reducing hash calls

Pull Request - State: closed - Opened by Whattabatt 9 months ago

#155 - Bump rustls from 0.21.11 to 0.21.12 in the cargo group across 1 directory

Pull Request - State: closed - Opened by dependabot[bot] 9 months ago
Labels: dependencies

#154 - Fixing dtype option not being correctly propagated

Pull Request - State: closed - Opened by soldni 9 months ago

#153 - Add support for parsing WARC

Pull Request - State: closed - Opened by soldni 9 months ago

#152 - dtype option is not working as expected

Issue - State: closed - Opened by tokenizer-decode 9 months ago - 1 comment

#151 - Inquiry about Web Pipeline Availability

Issue - State: open - Opened by codefly13 9 months ago - 2 comments

#150 - Running paragraph level deduplication on c4

Issue - State: open - Opened by andrewhojel 10 months ago - 2 comments

#149 - Bump rustls from 0.21.10 to 0.21.11 in the cargo group across 1 directory

Pull Request - State: closed - Opened by dependabot[bot] 10 months ago
Labels: dependencies

#148 - fix divide by 0 in gopher tagger

Pull Request - State: closed - Opened by peterbjorgensen 10 months ago

#146 - Bump h2 from 0.3.24 to 0.3.26 in the cargo group across 1 directory

Pull Request - State: closed - Opened by dependabot[bot] 10 months ago - 1 comment
Labels: dependencies

#145 - Add extra tests for multi-byte unicode spans in deduper.

Pull Request - State: closed - Opened by soldni 10 months ago

#144 - Optionally add total/sum to output of analyzer

Pull Request - State: closed - Opened by soldni 10 months ago

#143 - S3 mixer doesn't start

Issue - State: open - Opened by marcopasqua 10 months ago - 2 comments

#142 - Data out of bounds when using ‘dolma tokens --dtype uint32’

Issue - State: open - Opened by Jackwaterveg 10 months ago - 1 comment

#141 - Add an option to improve tokenization shuffling

Pull Request - State: closed - Opened by soldni 10 months ago

#140 - Fix local shuffling failure

Pull Request - State: closed - Opened by soldni 11 months ago

#139 - Possible bug in `local_shuffle`?

Issue - State: closed - Opened by hwijeen 11 months ago - 2 comments

#138 - Some race condition in url taggers

Issue - State: open - Opened by peterbjorgensen 11 months ago - 1 comment

#137 - use precompiled regex when loading url blocklists

Pull Request - State: closed - Opened by peterbjorgensen 11 months ago

#136 - Is there a way to intergratge Dolma toolkit to Spark?

Issue - State: open - Opened by DangoWang 11 months ago - 1 comment

#135 - Improves tool to compute statistics; adds deduplication options.

Pull Request - State: closed - Opened by soldni 11 months ago

#134 - A Question about the meaning of dolma_v1.6_cc_en

Issue - State: closed - Opened by aleien95 11 months ago - 1 comment

#133 - Added JQ syntax for replacements + added minimum score.

Pull Request - State: closed - Opened by soldni 11 months ago

#132 - Bump the cargo group group with 1 update

Pull Request - State: closed - Opened by dependabot[bot] 11 months ago
Labels: dependencies

#131 - Added Support for JQ syntax in include/exclude mixer config

Pull Request - State: closed - Opened by soldni 11 months ago

#130 - Support providing streams into mixer via CLI

Issue - State: open - Opened by soldni 11 months ago
Labels: enhancement

#129 - Tagger modules import (fix for #128)

Pull Request - State: closed - Opened by soldni 11 months ago

#128 - tagger_modules do not work in current git version

Issue - State: closed - Opened by peterbjorgensen 12 months ago - 1 comment

#127 - Can I use the dolma toolkit to process my own datasets?

Issue - State: closed - Opened by Tendo33 12 months ago - 1 comment