Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / huggingface/datatrove issues and pull requests
#100 - Adds `depends=` to LocalPipelineExecutor
Pull Request -
State: closed - Opened by guipenedo 7 months ago
- 1 comment
#99 - Support for arbitrary fasttext models
Pull Request -
State: closed - Opened by guipenedo 7 months ago
#98 - Decoupled reading logic from DedupReader
Pull Request -
State: closed - Opened by guipenedo 7 months ago
- 1 comment
#97 - Job depends in local executor
Issue -
State: closed - Opened by jordane95 7 months ago
- 3 comments
Labels: enhancement
#96 - Add adapter in DedupReader
Pull Request -
State: closed - Opened by jordane95 7 months ago
- 2 comments
#95 - Fix compression type
Pull Request -
State: closed - Opened by jordane95 7 months ago
- 1 comment
#94 - Adds language option for nltk
Pull Request -
State: closed - Opened by guipenedo 7 months ago
#93 - Tokenization for Non English data
Issue -
State: closed - Opened by Manel-Hik 7 months ago
- 1 comment
Labels: question
#92 - bugfix stats file not being saved to s3
Pull Request -
State: closed - Opened by guipenedo 7 months ago
#91 - [`Docs`] Fix typos
Pull Request -
State: closed - Opened by tolgacangoz 7 months ago
- 2 comments
#90 - Adding doc strings + adding a faster tokenized doc merger
Pull Request -
State: closed - Opened by thomwolf 7 months ago
#89 - Fix url stats
Pull Request -
State: closed - Opened by thomwolf 7 months ago
#88 - Efficiency: np.fromiter instead of np.array
Pull Request -
State: closed - Opened by giorgioangel 7 months ago
- 1 comment
#87 - Sbatch arguments treated as filepath
Issue -
State: closed - Opened by Anacheron51 8 months ago
- 1 comment
#86 - Changed fsx default filepath for logging output to user's home
Pull Request -
State: closed - Opened by Anacheron51 8 months ago
- 3 comments
#85 - Adds multi node parallelism to local executor
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#84 - Add prefix to slurm logs
Pull Request -
State: closed - Opened by loubnabnl 8 months ago
- 2 comments
#83 - Adds writer adapter
Pull Request -
State: closed - Opened by guipenedo 8 months ago
- 1 comment
#82 - Improve parallelization in MinhashDedupBuckets
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#81 - Requeue job automatically when specific signals are caught
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#80 - Multi-node parallelism
Issue -
State: closed - Opened by jordane95 8 months ago
- 5 comments
Labels: enhancement
#79 - Added upload_block_size parameter
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#78 - catch codec error in jsonlreader
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#77 - Writer adapter
Issue -
State: closed - Opened by jordane95 8 months ago
- 4 comments
#76 - Error when running exact_substrings
Issue -
State: closed - Opened by jordane95 8 months ago
- 2 comments
Labels: bug
#75 - Timeout warning/error
Issue -
State: closed - Opened by dittops 8 months ago
- 2 comments
Labels: bug
#74 - In-file parallelism
Issue -
State: open - Opened by jordane95 8 months ago
- 1 comment
Labels: enhancement
#73 - Update exact_substrings.py
Pull Request -
State: closed - Opened by jordane95 8 months ago
- 11 comments
#72 - Tokenization in Minhash deduplication
Issue -
State: closed - Opened by jordane95 8 months ago
- 1 comment
Labels: question
#71 - Zero Division Error Fix
Pull Request -
State: closed - Opened by NicholasLindner 8 months ago
- 8 comments
#70 - Spark support
Issue -
State: open - Opened by jordane95 8 months ago
- 3 comments
Labels: enhancement
#69 - Implementation of line-wise corrections
Issue -
State: open - Opened by jordane95 8 months ago
- 1 comment
Labels: enhancement
#68 - Bump fsspec version
Pull Request -
State: closed - Opened by 0xh3x 8 months ago
- 1 comment
#67 - TypeError: fsspec.spec.AbstractFileSystem.find() got multiple values for keyword argument 'maxdepth'
Issue -
State: closed - Opened by 0xh3x 8 months ago
#66 - Division by zero
Issue -
State: closed - Opened by IlievskiV 8 months ago
- 8 comments
Labels: bug, good first issue
#65 - Improved documentation
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#64 - Wrong `docstring` in `UnigramLogProbFilter`
Issue -
State: closed - Opened by IlievskiV 8 months ago
- 1 comment
Labels: documentation
#63 - added tokenize_from_hf_to_s3
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#62 - Support Ray as executor
Issue -
State: open - Opened by c21 8 months ago
- 2 comments
Labels: enhancement
#61 - Add goose3 extractor
Pull Request -
State: closed - Opened by fakerybakery 8 months ago
- 1 comment
#60 - Add Goose3 extractor
Pull Request -
State: closed - Opened by fakerybakery 8 months ago
#59 - build: replace setup.py with pyproject.toml
Pull Request -
State: closed - Opened by baggiponte 8 months ago
- 4 comments
#58 - Supporting Apache Beam
Issue -
State: open - Opened by sayakpaul 8 months ago
Labels: enhancement
#57 - Expanding beyond text data
Issue -
State: closed - Opened by loganhart02 8 months ago
- 4 comments
Labels: enhancement
#56 - Use re search instead of findall
Pull Request -
State: closed - Opened by fierzdev 8 months ago
- 1 comment
#55 - License
Issue -
State: closed - Opened by fakerybakery 8 months ago
- 9 comments
#54 - replace setup.py with pyproject.toml
Issue -
State: closed - Opened by baggiponte 8 months ago
- 2 comments
Labels: enhancement
#53 - Clean up DataFolder init arguments
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#52 - Documentation
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#51 - Adds HuggingFaceReader to read data directly from HF datasets
Pull Request -
State: closed - Opened by guipenedo 8 months ago
#50 - Make some dependencies optional
Pull Request -
State: closed - Opened by mariosasko 8 months ago
- 1 comment
#49 - Switch all IO to fsspec
Pull Request -
State: closed - Opened by guipenedo 9 months ago
- 3 comments
#48 - Cache downloads in `huggingface_hub`'s assets cache
Pull Request -
State: closed - Opened by mariosasko 9 months ago
- 1 comment
#47 - Better `Document` attribute names
Pull Request -
State: closed - Opened by mariosasko 9 months ago
- 1 comment
#46 - Support Python 3.11/3.12
Pull Request -
State: closed - Opened by mariosasko 9 months ago
#45 - Add IPC/Feather readers
Pull Request -
State: closed - Opened by mariosasko 9 months ago
#44 - Batched tokenization and c4 paragraph filters
Pull Request -
State: closed - Opened by guipenedo 9 months ago
#43 - Fix minimal supported Python version
Pull Request -
State: closed - Opened by mariosasko 9 months ago
- 1 comment
#42 - Misc improvements
Pull Request -
State: closed - Opened by mariosasko 9 months ago
- 3 comments
#41 - Simpler IO interface
Issue -
State: closed - Opened by mariosasko 9 months ago
#40 - Optimize `ParquetReader`
Pull Request -
State: closed - Opened by mariosasko 9 months ago
- 3 comments
#39 - Support Python 3.8
Pull Request -
State: closed - Opened by mariosasko 9 months ago
- 3 comments
#38 - recursive was not taken into account in fsspec
Pull Request -
State: closed - Opened by thomwolf 10 months ago
- 2 comments
#37 - Should have been merged as well in the labelling tool
Pull Request -
State: closed - Opened by thomwolf 10 months ago
#36 - Simple pipeline to store all length in stats
Pull Request -
State: closed - Opened by thomwolf 10 months ago
- 2 comments
#35 - adding very simple labelling to the inspector
Pull Request -
State: closed - Opened by thomwolf 10 months ago
#34 - Stats and IO refactor
Pull Request -
State: closed - Opened by guipenedo 11 months ago
- 1 comment
#33 - Thom updates
Pull Request -
State: closed - Opened by thomwolf 11 months ago
#32 - QoL tweaks
Pull Request -
State: closed - Opened by thomwolf 11 months ago
#31 - Bloom filter
Pull Request -
State: closed - Opened by alexchapeaux 12 months ago
#30 - :bug: Fix bug
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#29 - Fix bug ex substrings
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#28 - Sen dedup fix formatting
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#27 - Improved slurm support
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#26 - Gopher repetition all duplicates bug
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#25 - Fix bug in empty file stage 2 SD
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#24 - Automatic lang
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#23 - fixes for the gopher filters
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#22 - :bug: Fix bug in repetition filter
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#21 - Deduplication
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#20 - Exactsubstrings
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#19 - :white_check_mark: Add filter tests
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#18 - fixes for stats on slurm
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#17 - Adds MinHash dedup
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#16 - [Tests] Fix readability
Pull Request -
State: closed - Opened by anton-l about 1 year ago
- 2 comments
#15 - Exactsubstrings
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#14 - Filters
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#13 - Make slurm seamlessly handle multi stage configs
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#12 - Deduplication
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
#11 - Stats
Pull Request -
State: closed - Opened by alexchapeaux about 1 year ago
- 1 comment
#10 - Added slurm executor & IO refactor
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#9 - Added DocumentTokenizerMerger
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#8 - [Maintenance] Add style checks and tests
Pull Request -
State: closed - Opened by anton-l about 1 year ago
#7 - :construction: Add extractors
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#6 - Adds statistics and emojis
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#5 - Added tokenization
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#4 - :sparkles: ExclusionWriter for filters
Pull Request -
State: closed - Opened by guipenedo about 1 year ago
#3 - Add filters | Switch from multiprocessing to multiprocess
Pull Request -
State: closed - Opened by alexchapeaux over 1 year ago
#2 - :sparkles: adds WarcReader
Pull Request -
State: closed - Opened by guipenedo over 1 year ago
#1 - Readers, writers and S3 support
Pull Request -
State: closed - Opened by guipenedo over 1 year ago