Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / aws/sagemaker-training-toolkit issues and pull requests
#217 - fix: typo in the run unit tests command
Pull Request -
State: closed - Opened by bhaoz about 2 months ago
#216 - fix: run unit tests in sequence order for release process as well to prevent coverage conflicting issues
Pull Request -
State: closed - Opened by bhaoz about 2 months ago
#215 - chore: removing unnecessary logging information
Pull Request -
State: closed - Opened by bhaoz about 2 months ago
#214 - feature: Add support for py39 and py310
Pull Request -
State: closed - Opened by prtsh 4 months ago
- 1 comment
#213 - Validate smddprun() fails with file not found error on AL2023
Issue -
State: closed - Opened by jimmyrigby94 6 months ago
- 1 comment
#212 - build test
Pull Request -
State: open - Opened by emeraldbay 6 months ago
#211 - feature: add python module entrypoint type, add python module support…
Pull Request -
State: open - Opened by clumsy 7 months ago
- 1 comment
#210 - feature: add python module entrypoint type, add python module support…
Pull Request -
State: closed - Opened by clumsy 7 months ago
- 1 comment
#209 - feature: add python module entrypoint type, add python module support…
Pull Request -
State: closed - Opened by clumsy 7 months ago
- 1 comment
#208 - Add TFlops calculator and stuck job monitor
Pull Request -
State: open - Opened by emeraldbay 8 months ago
#207 - Get region with ENV var
Issue -
State: open - Opened by austinmw 8 months ago
#206 - Invalid dash-separated options for description-file
Issue -
State: open - Opened by wickeat 10 months ago
#205 - feature: add python module entrypoint type, add python module support to torch_distributed
Pull Request -
State: closed - Opened by clumsy 10 months ago
- 5 comments
#204 - Training Job "Successful" despite failing due to 100% disk usage
Issue -
State: open - Opened by david-waterworth 11 months ago
#203 - change: update the boto deps to use latest boto
Pull Request -
State: closed - Opened by mufaddal-rohawala 11 months ago
#202 - change: bypass DNS check for studio local exec
Pull Request -
State: closed - Opened by mufaddal-rohawala 12 months ago
#201 - fix: toolkit build failure
Pull Request -
State: closed - Opened by emeraldbay 12 months ago
- 1 comment
#200 - ModuleNotFoundError: Sagemaker only copies entry_point file to /opt/ml/code/ instead of the holy-cloned source code
Issue -
State: open - Opened by celsofranssa 12 months ago
#199 - test
Pull Request -
State: closed - Opened by emeraldbay 12 months ago
#198 - fix: use smddprun only if it is installed
Pull Request -
State: closed - Opened by ruhanprasad 12 months ago
- 1 comment
#197 - fix: Remove Python 3.7 to fix the CI
Pull Request -
State: closed - Opened by emeraldbay 12 months ago
#196 - fix: Add NCCL_PROTO=simple environment variable to handle the out-of-order…
Pull Request -
State: closed - Opened by ruhanprasad 12 months ago
#195 - fix: Test CI
Pull Request -
State: closed - Opened by emeraldbay about 1 year ago
#194 - fix: SMDDP does not support P5 instances with SMP
Pull Request -
State: closed - Opened by apoorvtintin about 1 year ago
- 1 comment
#193 - Issue when training in local mode with huggingface training container
Issue -
State: open - Opened by ojturner about 1 year ago
#192 - fix: SMDDP does not support P5 instances with SMP
Pull Request -
State: closed - Opened by apoorvtintin about 1 year ago
- 2 comments
#191 - P5 instance support
Issue -
State: open - Opened by haozhx23 about 1 year ago
#190 - feat: Initial change for Sagemaker provided health check
Pull Request -
State: closed - Opened by emeraldbay about 1 year ago
#189 - feat: support codeartifact for installing requirements.txt packages
Pull Request -
State: closed - Opened by humanzz about 1 year ago
- 2 comments
#188 - dummy commit to test CI/CD
Pull Request -
State: closed - Opened by emeraldbay about 1 year ago
#187 - feat: support codeartifact for installing requirements.txt packages
Pull Request -
State: closed - Opened by humanzz about 1 year ago
- 5 comments
#186 - Adding sys.path to PYTHONPATH breaks virtual environments
Issue -
State: open - Opened by pdveenstra over 1 year ago
#185 - Add SM dataparallel exception class in mpi distribution
Pull Request -
State: closed - Opened by stu1130 over 1 year ago
- 1 comment
#184 - Deepspeed Launcher
Issue -
State: open - Opened by anupam-dewan over 1 year ago
#183 - Added supported for neuron_parallel_compile for trn1 (trainium)
Pull Request -
State: closed - Opened by VijayNiles over 1 year ago
- 1 comment
#182 - Add NCCL_ALGO env var for modelparallel jobs
Pull Request -
State: closed - Opened by yongyanrao over 1 year ago
- 2 comments
#180 - unpin sagemaker version as the credential issue fixed
Pull Request -
State: closed - Opened by yl-to over 1 year ago
#179 - Testing PR for SageMaker version
Pull Request -
State: closed - Opened by yl-to over 1 year ago
#178 - fix: increase worker waiting time for ORTE proc
Pull Request -
State: closed - Opened by yl-to over 1 year ago
- 1 comment
#177 - change: upagrade protobuf version for tensorflow 2.12
Pull Request -
State: closed - Opened by yl-to over 1 year ago
#176 - fix: Revert SMDDP collectives feature from smdataparallel runner
Pull Request -
State: closed - Opened by vishwakaria over 1 year ago
#175 - Fix: to fix SMTrainingCompilerConfigurationError handling in process.py
Pull Request -
State: closed - Opened by vinayburugu over 1 year ago
- 8 comments
#174 - Publish wheels to PyPI
Issue -
State: open - Opened by hajapy over 1 year ago
#173 - Passing SIGTERM to entrypoint to be able to handle SPOT failures gracefully in user-code
Issue -
State: open - Opened by croth1 over 1 year ago
#172 - fix: SMTrainingCompilerConfigurationError takes no keyword argument
Pull Request -
State: closed - Opened by ShiboXing over 1 year ago
#171 - Fix: Add SMTrainingCompilerConfigurationError to the list of registered exception classes.
Pull Request -
State: closed - Opened by vinayburugu over 1 year ago
- 4 comments
#170 - change: update libraries for SMDDP collectives validation
Pull Request -
State: closed - Opened by vishwakaria over 1 year ago
#169 - Upgrade protobuf to prevent conflicts with smdebugger.
Pull Request -
State: closed - Opened by josephevans over 1 year ago
#168 - Feature: To modify pytorch_xla configuration errors to SMTrainingCompilerConfigurationError
Pull Request -
State: closed - Opened by vinayburugu over 1 year ago
#167 - Support CodeArtifact repositories for installing Python packages
Issue -
State: closed - Opened by humanzz almost 2 years ago
#166 - Stack based error attribution for errors arising from compiler code
Pull Request -
State: closed - Opened by vinayburugu almost 2 years ago
- 15 comments
#165 - Add support for p4de instances, update when FI_EFA_USE_DEVICE_RDMA flag is set to only p4d{e} instances.
Pull Request -
State: closed - Opened by josephevans almost 2 years ago
- 2 comments
#164 - Remove magic strings for attributes like instance type
Issue -
State: open - Opened by vishwakaria almost 2 years ago
#163 - Fix: To add script to build tensorflow container for integration tests
Pull Request -
State: closed - Opened by vinayburugu almost 2 years ago
- 2 comments
#162 - feature: add support for SMDDP collectives to smdataparallel runner
Pull Request -
State: closed - Opened by vishwakaria almost 2 years ago
- 8 comments
#161 - Python 3.6 unsupported [bug/question]
Issue -
State: open - Opened by adamwrobel-ext-gd almost 2 years ago
- 1 comment
#160 - Feature: Stack trace based failure attribution for SageMaker Training Compiler
Pull Request -
State: closed - Opened by vinayburugu almost 2 years ago
- 6 comments
#159 - add general exception to filter
Pull Request -
State: closed - Opened by roywei almost 2 years ago
- 4 comments
#158 - Mpi mode sets all nodes to the same SM_CURRENT_HOST
Issue -
State: open - Opened by verdimrc almost 2 years ago
#157 - Feature: Register tensorflow and xla exception classes to sagemaker-t…
Pull Request -
State: closed - Opened by vinayburugu almost 2 years ago
- 10 comments
#156 - Improve coverage and fix collections DeprecationWarning
Pull Request -
State: closed - Opened by satishpasumarthi almost 2 years ago
- 3 comments
#155 - CVE-2007-4559 Patch
Pull Request -
State: open - Opened by TrellixVulnTeam almost 2 years ago
#154 - feature: Add torch_distributed support for Trainium instances in SageMaker
Pull Request -
State: closed - Opened by satishpasumarthi almost 2 years ago
- 12 comments
#153 - feature: Add neuron cores support
Pull Request -
State: closed - Opened by satishpasumarthi almost 2 years ago
- 1 comment
#152 - Feature: Add Neuron core support
Pull Request -
State: closed - Opened by satishpasumarthi almost 2 years ago
- 1 comment
#151 - feature: Register tensor flow and xla exception classes with sagemaker-training-toolkit
Pull Request -
State: closed - Opened by vinayburugu almost 2 years ago
- 62 comments
#150 - add tensor flow exception classes to the list of exception_classes…
Pull Request -
State: closed - Opened by vinayburugu almost 2 years ago
#149 - change: integrate upcoming dataparallel change to modelparallel
Pull Request -
State: closed - Opened by yongyanrao about 2 years ago
- 3 comments
#148 - Avoid deprecated import via collections.abc.Mapping
Pull Request -
State: closed - Opened by lorenzwalthert about 2 years ago
- 5 comments
#147 - Fix: Args for worker nodes in smdataparallel jobs
Pull Request -
State: closed - Opened by satishpasumarthi about 2 years ago
- 1 comment
#146 - Add debugger exception to error classes
Pull Request -
State: closed - Opened by yl-to about 2 years ago
- 20 comments
#145 - fix: Improve worker nodes waiting mechanism in MPI jobs
Pull Request -
State: closed - Opened by satishpasumarthi about 2 years ago
- 15 comments
#144 - fix: Enable PT XLA distributed training on homogeneous clusters
Pull Request -
State: closed - Opened by Lokiiiiii about 2 years ago
- 2 comments
#143 - Fix: adding EFA specific setup to distributed training runner for PT-XLA
Pull Request -
State: closed - Opened by Lokiiiiii about 2 years ago
- 1 comment
#142 - change: update num_processes_per_host for smdataparallel runner
Pull Request -
State: closed - Opened by vishwakaria about 2 years ago
- 1 comment
#141 - fix: Removed version hardcoding for sagemaker test dependency
Pull Request -
State: closed - Opened by jleeleee about 2 years ago
- 1 comment
#140 - relax exception type
Pull Request -
State: closed - Opened by roywei about 2 years ago
- 8 comments
#139 - change: update distribution_instance_group for pytorch ddp
Pull Request -
State: closed - Opened by vishwakaria about 2 years ago
- 2 comments
#138 - Specify flake8 config file explicitly
Pull Request -
State: closed - Opened by nish21 about 2 years ago
- 4 comments
#137 - Feature: Create a new distribution mechanism for PT-XLA
Pull Request -
State: closed - Opened by Lokiiiiii about 2 years ago
- 44 comments
#136 - fix: handle utf-8 decoding exceptions while processing std streams
Pull Request -
State: closed - Opened by vishwakaria about 2 years ago
- 1 comment
#135 - feature: Heterogeneous cluster changes
Pull Request -
State: closed - Opened by satishpasumarthi about 2 years ago
- 1 comment
#134 - update: protobuf version to overlap with TF requirements
Pull Request -
State: closed - Opened by nish21 over 2 years ago
- 1 comment
#133 - SM library telemetry improvement
Pull Request -
State: closed - Opened by roywei over 2 years ago
- 2 comments
#132 - Version 4.1.4 fails to install because of the missing protobuf dependency
Issue -
State: closed - Opened by szafranek over 2 years ago
- 2 comments
#131 - Fix none exception class issue for mpi
Pull Request -
State: closed - Opened by haohanchen-yagao over 2 years ago
- 5 comments
#130 - Feature: Adding new parameter for TF Multi Worker Mirrored Strategy
Pull Request -
State: closed - Opened by Lokiiiiii over 2 years ago
- 4 comments
#129 - No support for Python 3.10
Issue -
State: open - Opened by peter-wimsey over 2 years ago
- 8 comments
#128 - Hyperparameters not shell escaped
Issue -
State: open - Opened by bstriner over 2 years ago
#127 - Shlex quote
Pull Request -
State: open - Opened by bstriner over 2 years ago
- 2 comments
#126 - Pass SIGTERM to training subprocess
Pull Request -
State: open - Opened by bstriner over 2 years ago
- 9 comments
#125 - Pass SIGTERM to training script to stop training
Issue -
State: open - Opened by bstriner over 2 years ago
#124 - fix: fix flaky issue with incorrect rc being given
Pull Request -
State: closed - Opened by matherit over 2 years ago
- 2 comments
#123 - Use framework provided error class and stack trace as error message
Pull Request -
State: closed - Opened by roywei over 2 years ago
- 17 comments
#122 - fix: missing args when shell script is used
Pull Request -
State: closed - Opened by satishpasumarthi over 2 years ago
- 6 comments
#121 - add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22
Pull Request -
State: closed - Opened by ydaiming over 2 years ago
- 15 comments
#120 - feature: Add Native Pytorch DDP Support
Pull Request -
State: closed - Opened by satishpasumarthi over 2 years ago
- 10 comments
#119 - Arguments not always accessible when using bash script for training job
Issue -
State: closed - Opened by marcelgwerder over 2 years ago
- 3 comments
#118 - Enable custom failure logging
Pull Request -
State: closed - Opened by satishpasumarthi over 2 years ago
- 5 comments
Labels: priority: high
#117 - Should to_cmd_args pass complex types through json.dumps instead of str?
Issue -
State: closed - Opened by croth1 over 2 years ago
- 1 comment