Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / huggingface/datatrove issues and pull requests

#100 - Adds `depends=` to LocalPipelineExecutor

Pull Request - State: closed - Opened by guipenedo 7 months ago - 1 comment

#99 - Support for arbitrary fasttext models

Pull Request - State: closed - Opened by guipenedo 7 months ago

#98 - Decoupled reading logic from DedupReader

Pull Request - State: closed - Opened by guipenedo 7 months ago - 1 comment

#97 - Job depends in local executor

Issue - State: closed - Opened by jordane95 7 months ago - 3 comments
Labels: enhancement

#96 - Add adapter in DedupReader

Pull Request - State: closed - Opened by jordane95 7 months ago - 2 comments

#95 - Fix compression type

Pull Request - State: closed - Opened by jordane95 7 months ago - 1 comment

#94 - Adds language option for nltk

Pull Request - State: closed - Opened by guipenedo 7 months ago

#93 - Tokenization for Non English data

Issue - State: closed - Opened by Manel-Hik 7 months ago - 1 comment
Labels: question

#92 - bugfix stats file not being saved to s3

Pull Request - State: closed - Opened by guipenedo 7 months ago

#91 - [`Docs`] Fix typos

Pull Request - State: closed - Opened by tolgacangoz 7 months ago - 2 comments

#90 - Adding doc strings + adding a faster tokenized doc merger

Pull Request - State: closed - Opened by thomwolf 7 months ago

#89 - Fix url stats

Pull Request - State: closed - Opened by thomwolf 7 months ago

#88 - Efficiency: np.fromiter instead of np.array

Pull Request - State: closed - Opened by giorgioangel 7 months ago - 1 comment

#87 - Sbatch arguments treated as filepath

Issue - State: closed - Opened by Anacheron51 8 months ago - 1 comment

#86 - Changed fsx default filepath for logging output to user's home

Pull Request - State: closed - Opened by Anacheron51 8 months ago - 3 comments

#85 - Adds multi node parallelism to local executor

Pull Request - State: closed - Opened by guipenedo 8 months ago

#84 - Add prefix to slurm logs

Pull Request - State: closed - Opened by loubnabnl 8 months ago - 2 comments

#83 - Adds writer adapter

Pull Request - State: closed - Opened by guipenedo 8 months ago - 1 comment

#82 - Improve parallelization in MinhashDedupBuckets

Pull Request - State: closed - Opened by guipenedo 8 months ago

#81 - Requeue job automatically when specific signals are caught

Pull Request - State: closed - Opened by guipenedo 8 months ago

#80 - Multi-node parallelism

Issue - State: closed - Opened by jordane95 8 months ago - 5 comments
Labels: enhancement

#79 - Added upload_block_size parameter

Pull Request - State: closed - Opened by guipenedo 8 months ago

#78 - catch codec error in jsonlreader

Pull Request - State: closed - Opened by guipenedo 8 months ago

#77 - Writer adapter

Issue - State: closed - Opened by jordane95 8 months ago - 4 comments

#76 - Error when running exact_substrings

Issue - State: closed - Opened by jordane95 8 months ago - 2 comments
Labels: bug

#75 - Timeout warning/error

Issue - State: closed - Opened by dittops 8 months ago - 2 comments
Labels: bug

#74 - In-file parallelism

Issue - State: open - Opened by jordane95 8 months ago - 1 comment
Labels: enhancement

#73 - Update exact_substrings.py

Pull Request - State: closed - Opened by jordane95 8 months ago - 11 comments

#72 - Tokenization in Minhash deduplication

Issue - State: closed - Opened by jordane95 8 months ago - 1 comment
Labels: question

#71 - Zero Division Error Fix

Pull Request - State: closed - Opened by NicholasLindner 8 months ago - 8 comments

#70 - Spark support

Issue - State: open - Opened by jordane95 8 months ago - 3 comments
Labels: enhancement

#69 - Implementation of line-wise corrections

Issue - State: open - Opened by jordane95 8 months ago - 1 comment
Labels: enhancement

#68 - Bump fsspec version

Pull Request - State: closed - Opened by 0xh3x 8 months ago - 1 comment

#66 - Division by zero

Issue - State: closed - Opened by IlievskiV 8 months ago - 8 comments
Labels: bug, good first issue

#65 - Improved documentation

Pull Request - State: closed - Opened by guipenedo 8 months ago

#64 - Wrong `docstring` in `UnigramLogProbFilter`

Issue - State: closed - Opened by IlievskiV 8 months ago - 1 comment
Labels: documentation

#63 - added tokenize_from_hf_to_s3

Pull Request - State: closed - Opened by guipenedo 8 months ago

#62 - Support Ray as executor

Issue - State: open - Opened by c21 8 months ago - 2 comments
Labels: enhancement

#61 - Add goose3 extractor

Pull Request - State: closed - Opened by fakerybakery 8 months ago - 1 comment

#60 - Add Goose3 extractor

Pull Request - State: closed - Opened by fakerybakery 8 months ago

#59 - build: replace setup.py with pyproject.toml

Pull Request - State: closed - Opened by baggiponte 8 months ago - 4 comments

#58 - Supporting Apache Beam

Issue - State: open - Opened by sayakpaul 8 months ago
Labels: enhancement

#57 - Expanding beyond text data

Issue - State: closed - Opened by loganhart02 8 months ago - 4 comments
Labels: enhancement

#56 - Use re search instead of findall

Pull Request - State: closed - Opened by fierzdev 8 months ago - 1 comment

#55 - License

Issue - State: closed - Opened by fakerybakery 8 months ago - 9 comments

#54 - replace setup.py with pyproject.toml

Issue - State: closed - Opened by baggiponte 8 months ago - 2 comments
Labels: enhancement

#53 - Clean up DataFolder init arguments

Pull Request - State: closed - Opened by guipenedo 8 months ago

#52 - Documentation

Pull Request - State: closed - Opened by guipenedo 8 months ago

#51 - Adds HuggingFaceReader to read data directly from HF datasets

Pull Request - State: closed - Opened by guipenedo 8 months ago

#50 - Make some dependencies optional

Pull Request - State: closed - Opened by mariosasko 8 months ago - 1 comment

#49 - Switch all IO to fsspec

Pull Request - State: closed - Opened by guipenedo 9 months ago - 3 comments

#48 - Cache downloads in `huggingface_hub`'s assets cache

Pull Request - State: closed - Opened by mariosasko 9 months ago - 1 comment

#47 - Better `Document` attribute names

Pull Request - State: closed - Opened by mariosasko 9 months ago - 1 comment

#46 - Support Python 3.11/3.12

Pull Request - State: closed - Opened by mariosasko 9 months ago

#45 - Add IPC/Feather readers

Pull Request - State: closed - Opened by mariosasko 9 months ago

#44 - Batched tokenization and c4 paragraph filters

Pull Request - State: closed - Opened by guipenedo 9 months ago

#43 - Fix minimal supported Python version

Pull Request - State: closed - Opened by mariosasko 9 months ago - 1 comment

#42 - Misc improvements

Pull Request - State: closed - Opened by mariosasko 9 months ago - 3 comments

#41 - Simpler IO interface

Issue - State: closed - Opened by mariosasko 9 months ago

#40 - Optimize `ParquetReader`

Pull Request - State: closed - Opened by mariosasko 9 months ago - 3 comments

#39 - Support Python 3.8

Pull Request - State: closed - Opened by mariosasko 9 months ago - 3 comments

#38 - recursive was not taken into account in fsspec

Pull Request - State: closed - Opened by thomwolf 10 months ago - 2 comments

#37 - Should have been merged as well in the labelling tool

Pull Request - State: closed - Opened by thomwolf 10 months ago

#36 - Simple pipeline to store all length in stats

Pull Request - State: closed - Opened by thomwolf 10 months ago - 2 comments

#35 - adding very simple labelling to the inspector

Pull Request - State: closed - Opened by thomwolf 10 months ago

#34 - Stats and IO refactor

Pull Request - State: closed - Opened by guipenedo 11 months ago - 1 comment

#33 - Thom updates

Pull Request - State: closed - Opened by thomwolf 11 months ago

#32 - QoL tweaks

Pull Request - State: closed - Opened by thomwolf 11 months ago

#31 - Bloom filter

Pull Request - State: closed - Opened by alexchapeaux 12 months ago

#30 - :bug: Fix bug

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#29 - Fix bug ex substrings

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#28 - Sen dedup fix formatting

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#27 - Improved slurm support

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#26 - Gopher repetition all duplicates bug

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#25 - Fix bug in empty file stage 2 SD

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#24 - Automatic lang

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#23 - fixes for the gopher filters

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#22 - :bug: Fix bug in repetition filter

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#21 - Deduplication

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#20 - Exactsubstrings

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#19 - :white_check_mark: Add filter tests

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#18 - fixes for stats on slurm

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#17 - Adds MinHash dedup

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#16 - [Tests] Fix readability

Pull Request - State: closed - Opened by anton-l about 1 year ago - 2 comments

#15 - Exactsubstrings

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#14 - Filters

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#13 - Make slurm seamlessly handle multi stage configs

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#12 - Deduplication

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago

#11 - Stats

Pull Request - State: closed - Opened by alexchapeaux about 1 year ago - 1 comment

#10 - Added slurm executor & IO refactor

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#9 - Added DocumentTokenizerMerger

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#8 - [Maintenance] Add style checks and tests

Pull Request - State: closed - Opened by anton-l about 1 year ago

#7 - :construction: Add extractors

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#6 - Adds statistics and emojis

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#5 - Added tokenization

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#4 - :sparkles: ExclusionWriter for filters

Pull Request - State: closed - Opened by guipenedo about 1 year ago

#3 - Add filters | Switch from multiprocessing to multiprocess

Pull Request - State: closed - Opened by alexchapeaux over 1 year ago

#2 - :sparkles: adds WarcReader

Pull Request - State: closed - Opened by guipenedo over 1 year ago

#1 - Readers, writers and S3 support

Pull Request - State: closed - Opened by guipenedo over 1 year ago