uber/petastorm issues and pull requests

#806 - Petastorm not working due to PyArrow version hell

Issue - State: open - Opened by kiranzo about 1 month ago - 2 comments

#805 - Petastorm break with pyarrow 13.0 or newer. Stable version of pyarrow is at 16.0 now.

Issue - State: open - Opened by LauritsDixen 5 months ago - 2 comments

#804 - Petastorm hangs forever in DataBricks

Issue - State: open - Opened by juzzmac 7 months ago - 1 comment

#803 - ParquetDataset has an invalid parameter validate_schema

Issue - State: open - Opened by ayushkarnawat 8 months ago - 1 comment

#802 - chore: Update badge pipeline

Pull Request - State: closed - Opened by Juandavi1 11 months ago - 1 comment

#801 - make_reader fails for example

Issue - State: closed - Opened by phK3 12 months ago - 1 comment

#800 - FutureWarning: 'ParquetDataset.partitions' attribute is deprecated as of pyarrow 5.0.0 and will be removed in a future version.

Issue - State: open - Opened by ton11111 12 months ago - 1 comment

#799 - make_torch_dataloader using TransformSpec applies transformation on entire dataframe (not lazy loading)

Issue - State: closed - Opened by davegabe about 1 year ago - 2 comments

#798 - Bug in ConcurrentVentilator._ventilate() when randomize_item_order=True and random seed is fixed

Issue - State: open - Opened by JonasRauch about 1 year ago

#797 - Issue with loading nested array type from spark DF to torch

Issue - State: open - Opened by sardinois over 1 year ago

#796 - Add a ThreadPool which respects the order of Parquet dataset pieces.

Pull Request - State: open - Opened by wbeardall over 1 year ago - 3 comments

#795 - String as input in petastorm dataloaders

Issue - State: open - Opened by freud14-tm over 1 year ago - 3 comments

#793 - Seeing worse model performance from using petastorm vs normal pytorch dataloader

Issue - State: open - Opened by AKhazane over 1 year ago - 1 comment

#792 - Add missing field_name in ValueError

Pull Request - State: open - Opened by chasleslr over 1 year ago - 3 comments

#791 - [Test] Run CI against pyspark 3.4

Pull Request - State: open - Opened by WeichenXu123 over 1 year ago - 3 comments

#790 - TypeError: init() missing 2 required positional arguments: 'instance' and 'token'

Issue - State: open - Opened by devVipin01 over 1 year ago

#789 - AttributeError: 'bool' object has no attribute 'map' while using Predicate

Issue - State: open - Opened by xizhenke over 1 year ago

#788 - How to transform the string data to numerical when using make_batch_reader?

Issue - State: open - Opened by xizhenke over 1 year ago

#787 - Make `make_spark_converter` supports creating converter from a saved dataframe path

Pull Request - State: closed - Opened by WeichenXu123 almost 2 years ago - 7 comments

#786 - make_batch_reader Documentation out of date? seed?

Issue - State: open - Opened by Data-drone almost 2 years ago

#785 - Petastorm sharding and setting batch sizes

Issue - State: open - Opened by Data-drone almost 2 years ago

#784 - Prediction issue using Keras and TransformSpec with PySpark

Issue - State: closed - Opened by sdaza almost 2 years ago

#783 - Support results_queue_size parameter in make_batch_reader api

Pull Request - State: closed - Opened by s-udhaya almost 2 years ago - 8 comments

#782 - when hdfs-site.xml file has xi:include tag, the function cann't get hadoop_configuration info

Issue - State: open - Opened by lytk01 almost 2 years ago

#781 - How to pass pin_memory argument when using make_torch_dataloader

Issue - State: closed - Opened by s-udhaya almost 2 years ago - 2 comments

#780 - Customized dataset

Issue - State: closed - Opened by JiajianLu almost 2 years ago - 1 comment

#779 - Random seed doesn't seem to work well

Issue - State: open - Opened by kisel4363 about 2 years ago - 2 comments

#778 - Update CI to use latest versions of pyarrow and numpy. Drop pyarrow 4 test config.

Pull Request - State: open - Opened by selitvin about 2 years ago - 2 comments

#777 - Remove ``LocalDiskArrowTableCache`` and use latest pickle protocol for local caching

Pull Request - State: closed - Opened by selitvin about 2 years ago - 3 comments

#776 - using SHAP with petastorm dataset

Issue - State: open - Opened by sdaza about 2 years ago - 1 comment

#775 - Future Warning importing SparkDatasetConverter.

Issue - State: closed - Opened by kisel4363 about 2 years ago - 2 comments

#774 - Dynamic shape of lables.

Issue - State: open - Opened by ohindialign about 2 years ago - 3 comments

#773 - in_set predicate raises error unhashable type: 'Series'

Issue - State: open - Opened by Joachim-Sh about 2 years ago

#772 - Add a collate_lists_fn

Pull Request - State: open - Opened by selitvin about 2 years ago - 1 comment

#771 - Update pytorch mnist example with up-to-date make_reader() interface

Pull Request - State: closed - Opened by chongxiaoc about 2 years ago - 1 comment

#770 - weighted_sampling_reader

Issue - State: open - Opened by weidezhang about 2 years ago - 3 comments

#769 - make_spark_converter RuntimeError: Vector columns are only supported in pyspark>=3.0

Issue - State: open - Opened by Alxe1 about 2 years ago - 4 comments

#768 - null cache

Issue - State: open - Opened by weidezhang about 2 years ago - 4 comments

#767 - Reader: enable shuffling inside every row group

Pull Request - State: closed - Opened by chongxiaoc about 2 years ago - 2 comments

#766 - upgrade readthedocs to use Py3.7

Pull Request - State: closed - Opened by chongxiaoc about 2 years ago - 1 comment

#765 - make_batch_reader loses dtype with list-of-strings columns, causing Tensorflow error when lists contain a None value

Issue - State: open - Opened by arhan-gunel about 2 years ago

#764 - Will petastorm Dataloader support prefetch like PyTorch Multiprocessing Dataloader?

Issue - State: closed - Opened by MARD1NO about 2 years ago - 1 comment

#763 - PyTorch Batched Non-shuffle Buffer Large Memory Consumption

Issue - State: closed - Opened by chongxiaoc about 2 years ago - 1 comment
Labels: enhancement

#762 - PyTorch: improve memory-efficiency in batched non-shuffle buffer

Pull Request - State: closed - Opened by chongxiaoc about 2 years ago - 3 comments

#761 - dynamic padding via `collate_fn`

Issue - State: open - Opened by Jomonsugi about 2 years ago - 11 comments

#760 - Newer pyarrow versions?

Issue - State: closed - Opened by winding-lines about 2 years ago - 1 comment

#759 - Can we input a custom collate function as an input variable when creating the dataloader ?

Issue - State: open - Opened by shamanez about 2 years ago

#758 - Validate_schema keyword not supported yet

Issue - State: open - Opened by kisel4363 about 2 years ago - 7 comments

#757 - Replace process_iter by pid_exists

Pull Request - State: closed - Opened by MostafaFarahani over 2 years ago - 3 comments

#756 - Performance on large amounts of data

Issue - State: open - Opened by jaycunningham-8451 over 2 years ago - 1 comment

#755 - training from different sources

Issue - State: open - Opened by weidezhang over 2 years ago - 6 comments

#754 - Wrapper for Arrow Datasets & Dataset Pieces

Pull Request - State: open - Opened by aperiodic over 2 years ago - 2 comments

#753 - Update README.rst

Pull Request - State: open - Opened by FeU-aKlos over 2 years ago - 1 comment

#752 - Add Python3.10 to CI docker image

Pull Request - State: open - Opened by selitvin over 2 years ago - 2 comments

#751 - Upgrade CI to use latest packages of tf,pyarrow,numpy in 'latest' CI configuration

Pull Request - State: closed - Opened by selitvin over 2 years ago - 2 comments

#750 - Fix type of the a batch returned by make_batch_reader when TransformSpec's function returns column with all values being None

Pull Request - State: open - Opened by selitvin over 2 years ago - 1 comment

#749 - Do not land: Benchmark size of a parquet file with png files

Pull Request - State: closed - Opened by selitvin over 2 years ago

#748 - Enable batch fetching in parallel

Pull Request - State: open - Opened by jarandaf over 2 years ago - 4 comments

#747 - How to reduce parquet size

Issue - State: open - Opened by journey-wang over 2 years ago - 1 comment

#746 - Import ABC from collections.abc for Python 3.10 compatibility

Pull Request - State: closed - Opened by tirkarthi over 2 years ago - 2 comments

#745 - Test using shared_seed with pytorch converter

Pull Request - State: closed - Opened by selitvin over 2 years ago - 1 comment

#744 - Use of transform_spec in make_batch_reader leads to tensorflow error when column is missing values

Issue - State: open - Opened by oby1 over 2 years ago - 3 comments

#743 - tensorflow pyspark

Issue - State: closed - Opened by malinphy over 2 years ago - 4 comments

#742 - make_batch_reader called by make_torch_loader "got an unexpected keyword argument 'shard_seed'"

Issue - State: closed - Opened by quocdat32461997 over 2 years ago - 2 comments

#741 - `RestrictedUnpickler` is Bypassable

Issue - State: open - Opened by splitline over 2 years ago

#740 - On BatchedDataLoader performance

Issue - State: closed - Opened by jarandaf over 2 years ago - 8 comments

#739 - Speeding up loading data from spark

Issue - State: open - Opened by jmpanfil over 2 years ago - 3 comments

#738 - Ambiguous workflow while using Spark

Issue - State: open - Opened by smartFunX over 2 years ago - 3 comments

#737 - Use highest available pickle protocol when serializing

Pull Request - State: closed - Opened by rbetz over 2 years ago - 9 comments

#736 - Parquet column/modular encryption support for Petastorm

Issue - State: open - Opened by RobindeGrootNL over 2 years ago - 8 comments

#735 - reuse dataset materialized by SparkDatasetConverter

Issue - State: closed - Opened by Riser01 over 2 years ago - 1 comment

#734 - how to use a single dataset to train multiple input model in tensorflow keras useing pentastorm

Issue - State: closed - Opened by Riser01 over 2 years ago

#733 - Tensorflow pentastrom , training stuck

Issue - State: closed - Opened by Riser01 over 2 years ago - 6 comments

#732 - Get rid of RuntimeWarning when using process pool

Pull Request - State: closed - Opened by selitvin over 2 years ago - 1 comment

#731 - Support passing multiple url files to make_reader function.

Pull Request - State: closed - Opened by selitvin over 2 years ago - 3 comments

#730 - Allow more than two namenodes in hdfs configuration file.

Pull Request - State: closed - Opened by selitvin over 2 years ago - 1 comment

#729 - Varying number of examples passed by DataLoader to Pytorch Lightning network

Issue - State: open - Opened by trelium almost 3 years ago - 2 comments

#728 - PyDictReaderWorker does not support multiple paths datset_paths

Issue - State: closed - Opened by zhangzhenyu13 almost 3 years ago - 2 comments

#727 - Large metadata file: Can't load dataset after using Petastorm row_group_indexer

Issue - State: open - Opened by marjanAlbouye almost 3 years ago - 1 comment

#726 - How to stop petastorm dataloaders at end of epoch

Issue - State: open - Opened by jiwidi almost 3 years ago - 3 comments

#725 - Error when using make_spark_converter

Issue - State: closed - Opened by jiwidi almost 3 years ago

#724 - got error AssertionError: Must supply a list of namenodes, but HDFS only supports up to 2 namenode URLs when calling the materialize_dataset() in example

Issue - State: closed - Opened by Ereebay almost 3 years ago - 3 comments

#723 - Use assertEqual instead of assertEquals for Python 3.11 compatibility.

Pull Request - State: closed - Opened by tirkarthi almost 3 years ago - 2 comments

#722 - fix typo "suffling" -> "shuffling"

Pull Request - State: closed - Opened by noxthot almost 3 years ago - 5 comments

#721 - not able to disable shuffling using : make_torch_dataloader

Issue - State: open - Opened by Warra07 about 3 years ago - 2 comments

#720 - make_reader() is taking forever

Issue - State: open - Opened by GraceHLiu about 3 years ago - 14 comments

#719 - Any update on shard imbalance issue for parquet dataset?

Issue - State: closed - Opened by PHILO-HE about 3 years ago - 6 comments

#718 - Use make_batch_reader for petastorm parquet dataset

Issue - State: closed - Opened by PHILO-HE about 3 years ago - 2 comments

#717 - Added fsspec support for _default_delete_dir_handler

Pull Request - State: closed - Opened by manjuransari-zz about 3 years ago - 1 comment

#716 - _default_delete_dir_handler throws error when using default handler

Issue - State: closed - Opened by manjuransari-zz about 3 years ago - 8 comments

#714 - No option to pass storage_options in materialize_dataset()

Issue - State: open - Opened by manjuransari-zz about 3 years ago

#711 - Use spark_test_ctx fixture instead of constructing spark manually

Pull Request - State: open - Opened by selitvin about 3 years ago - 1 comment

#702 - Remove very old pickle compatibility code modifying old atg package names

Pull Request - State: open - Opened by selitvin about 3 years ago - 2 comments

#699 - Access a specific row in the dataframe

Issue - State: open - Opened by 2006pmach about 3 years ago - 4 comments

#690 - Support for parquet files with nested structures

Issue - State: open - Opened by mossadhelali over 3 years ago - 21 comments

#663 - fix get_dataset_path() in fs_utils.py

Pull Request - State: open - Opened by dongpohezui over 3 years ago - 4 comments

#656 - Remove Unischema getattr implementation

Pull Request - State: open - Opened by v01dXYZ over 3 years ago - 2 comments

#641 - Pytorch: add AsyncBatchedDataloader

Pull Request - State: closed - Opened by chongxiaoc over 3 years ago - 7 comments
