GitHub / aws/sagemaker-training-toolkit issues and pull requests
#234 - Loosen or update protobuf pinned version
Issue -
State: open - Opened by wickeat 2 months ago
#233 - setuptools 78.0.1 incompatibility
Issue -
State: open - Opened by aristidesz 4 months ago
#232 - feature: Add Code Owners file
Pull Request -
State: closed - Opened by 992X 6 months ago
#231 - fix: account for possible race condition when creating /opt/ml/code
Pull Request -
State: closed - Opened by benieric 6 months ago
#230 - Create devcontainer.json
Pull Request -
State: open - Opened by technewbie12 8 months ago
- 1 comment
#229 - fix: resolve failing unit test
Pull Request -
State: closed - Opened by jessicazhu3 8 months ago
#228 - fix: avoid parsing stderr as JSON
Pull Request -
State: closed - Opened by danielsnider 8 months ago
- 1 comment
#227 - Fix unknown argument: '-export-dynamic' on macOS
Pull Request -
State: open - Opened by jponf 8 months ago
- 1 comment
#226 - fix: temporarily hardcode neuron cores for trn2
Pull Request -
State: closed - Opened by jessicazhu3 8 months ago
#225 - Build failure on MacOS
Issue -
State: open - Opened by DRKolev-code 8 months ago
- 1 comment
#224 - [TEST]
Pull Request -
State: closed - Opened by SecurityResearcher-yoda 11 months ago
#223 - Fix: Preserve hyperparameter order when invoking training jobs
Pull Request -
State: open - Opened by vsimkus 11 months ago
#222 - Silent Failure if custom image puts something into /opt/ml/code
Issue -
State: open - Opened by njbrake 11 months ago
- 1 comment
#221 - SageMaker training toolkit reorders hyperparameters
Issue -
State: open - Opened by vsimkus 11 months ago
#220 - feature: Add p5 as a supported NCCL instance
Pull Request -
State: closed - Opened by andjsmi 11 months ago
#219 - Add 'ml.p5.48xlarge' as a supported instance for SM_EFA_NCCL_INSTANCES.
Issue -
State: open - Opened by andjsmi 11 months ago
#218 - Extend documentation regarding distributed training for own Docker containers.
Issue -
State: open - Opened by marseller 11 months ago
- 1 comment
#217 - fix: typo in the run unit tests command
Pull Request -
State: closed - Opened by bhaoz 12 months ago
#216 - fix: run unit tests in sequence order for release process as well to prevent coverage conflicting issues
Pull Request -
State: closed - Opened by bhaoz 12 months ago
#215 - chore: removing unnecessary logging information
Pull Request -
State: closed - Opened by bhaoz 12 months ago
#214 - feature: Add support for py39 and py310
Pull Request -
State: closed - Opened by prtsh about 1 year ago
- 1 comment
#213 - Validate smddprun() fails with file not found error on AL2023
Issue -
State: closed - Opened by jimmyrigby94 over 1 year ago
- 1 comment
#212 - build test
Pull Request -
State: open - Opened by emeraldbay over 1 year ago
#211 - feature: add python module entrypoint type, add python module support…
Pull Request -
State: open - Opened by clumsy over 1 year ago
- 1 comment
#210 - feature: add python module entrypoint type, add python module support…
Pull Request -
State: closed - Opened by clumsy over 1 year ago
- 1 comment
#209 - feature: add python module entrypoint type, add python module support…
Pull Request -
State: closed - Opened by clumsy over 1 year ago
- 1 comment
#208 - Add TFlops calculator and stuck job monitor
Pull Request -
State: open - Opened by emeraldbay over 1 year ago
#207 - Get region with ENV var
Issue -
State: open - Opened by austinmw over 1 year ago
#206 - Invalid dash-separated options for description-file
Issue -
State: open - Opened by wickeat over 1 year ago
#205 - feature: add python module entrypoint type, add python module support to torch_distributed
Pull Request -
State: closed - Opened by clumsy over 1 year ago
- 5 comments
#204 - Training Job "Successful" despite failing due to 100% disk usage
Issue -
State: open - Opened by david-waterworth over 1 year ago
#203 - change: update the boto deps to use latest boto
Pull Request -
State: closed - Opened by mufaddal-rohawala over 1 year ago
#202 - change: bypass DNS check for studio local exec
Pull Request -
State: closed - Opened by mufaddal-rohawala almost 2 years ago
#201 - fix: toolkit build failure
Pull Request -
State: closed - Opened by emeraldbay almost 2 years ago
- 1 comment
#200 - ModuleNotFoundError: Sagemaker only copies entry_point file to /opt/ml/code/ instead of the holy-cloned source code
Issue -
State: open - Opened by celsofranssa almost 2 years ago
#199 - test
Pull Request -
State: closed - Opened by emeraldbay almost 2 years ago
#198 - fix: use smddprun only if it is installed
Pull Request -
State: closed - Opened by ruhanprasad almost 2 years ago
- 1 comment
#197 - fix: Remove Python 3.7 to fix the CI
Pull Request -
State: closed - Opened by emeraldbay almost 2 years ago
#196 - fix: Add NCCL_PROTO=simple environment variable to handle the out-of-order…
Pull Request -
State: closed - Opened by ruhanprasad almost 2 years ago
#195 - fix: Test CI
Pull Request -
State: closed - Opened by emeraldbay almost 2 years ago
#194 - fix: SMDDP does not support P5 instances with SMP
Pull Request -
State: closed - Opened by apoorvtintin almost 2 years ago
- 1 comment
#193 - Issue when training in local mode with huggingface training container
Issue -
State: open - Opened by ojturner almost 2 years ago
- 1 comment
#192 - fix: SMDDP does not support P5 instances with SMP
Pull Request -
State: closed - Opened by apoorvtintin almost 2 years ago
- 2 comments
#191 - P5 instance support
Issue -
State: open - Opened by haozhx23 almost 2 years ago
#190 - feat: Initial change for Sagemaker provided health check
Pull Request -
State: closed - Opened by emeraldbay almost 2 years ago
#189 - feat: support codeartifact for installing requirements.txt packages
Pull Request -
State: closed - Opened by humanzz almost 2 years ago
- 2 comments
#188 - dummy commit to test CI/CD
Pull Request -
State: closed - Opened by emeraldbay almost 2 years ago
#187 - feat: support codeartifact for installing requirements.txt packages
Pull Request -
State: closed - Opened by humanzz about 2 years ago
- 5 comments
#186 - Adding sys.path to PYTHONPATH breaks virtual environments
Issue -
State: open - Opened by pdveenstra about 2 years ago
#185 - Add SM dataparallel exception class in mpi distribution
Pull Request -
State: closed - Opened by stu1130 about 2 years ago
- 1 comment
#184 - Deepspeed Launcher
Issue -
State: open - Opened by anupam-dewan about 2 years ago
#183 - Added support for neuron_parallel_compile for trn1 (trainium)
Pull Request -
State: closed - Opened by VijayNiles about 2 years ago
- 1 comment
#182 - Add NCCL_ALGO env var for modelparallel jobs
Pull Request -
State: closed - Opened by yongyanrao over 2 years ago
- 2 comments
#180 - unpin sagemaker version as the credential issue fixed
Pull Request -
State: closed - Opened by yl-to over 2 years ago
#179 - Testing PR for SageMaker version
Pull Request -
State: closed - Opened by yl-to over 2 years ago
#178 - fix: increase worker waiting time for ORTE proc
Pull Request -
State: closed - Opened by yl-to over 2 years ago
- 1 comment
#177 - change: upgrade protobuf version for tensorflow 2.12
Pull Request -
State: closed - Opened by yl-to over 2 years ago
#176 - fix: Revert SMDDP collectives feature from smdataparallel runner
Pull Request -
State: closed - Opened by vishwakaria over 2 years ago
#175 - Fix: to fix SMTrainingCompilerConfigurationError handling in process.py
Pull Request -
State: closed - Opened by vinayburugu over 2 years ago
- 8 comments
#174 - Publish wheels to PyPI
Issue -
State: open - Opened by hajapy over 2 years ago
#173 - Passing SIGTERM to entrypoint to be able to handle SPOT failures gracefully in user-code
Issue -
State: open - Opened by croth1 over 2 years ago
#172 - fix: SMTrainingCompilerConfigurationError takes no keyword argument
Pull Request -
State: closed - Opened by ShiboXing over 2 years ago
#171 - Fix: Add SMTrainingCompilerConfigurationError to the list of registered exception classes.
Pull Request -
State: closed - Opened by vinayburugu over 2 years ago
- 4 comments
#170 - change: update libraries for SMDDP collectives validation
Pull Request -
State: closed - Opened by vishwakaria over 2 years ago
#169 - Upgrade protobuf to prevent conflicts with smdebugger.
Pull Request -
State: closed - Opened by josephevans over 2 years ago
#168 - Feature: To modify pytorch_xla configuration errors to SMTrainingCompilerConfigurationError
Pull Request -
State: closed - Opened by vinayburugu over 2 years ago
#167 - Support CodeArtifact repositories for installing Python packages
Issue -
State: closed - Opened by humanzz over 2 years ago
#166 - Stack based error attribution for errors arising from compiler code
Pull Request -
State: closed - Opened by vinayburugu over 2 years ago
- 15 comments
#165 - Add support for p4de instances, update when FI_EFA_USE_DEVICE_RDMA flag is set to only p4d{e} instances.
Pull Request -
State: closed - Opened by josephevans over 2 years ago
- 2 comments
#164 - Remove magic strings for attributes like instance type
Issue -
State: open - Opened by vishwakaria over 2 years ago
#163 - Fix: To add script to build tensorflow container for integration tests
Pull Request -
State: closed - Opened by vinayburugu over 2 years ago
- 2 comments
#162 - feature: add support for SMDDP collectives to smdataparallel runner
Pull Request -
State: closed - Opened by vishwakaria over 2 years ago
- 8 comments
#161 - Python 3.6 unsupported [bug/question]
Issue -
State: open - Opened by adamwrobel-ext-gd over 2 years ago
- 1 comment
#160 - Feature: Stack trace based failure attribution for SageMaker Training Compiler
Pull Request -
State: closed - Opened by vinayburugu over 2 years ago
- 6 comments
#159 - add general exception to filter
Pull Request -
State: closed - Opened by roywei over 2 years ago
- 4 comments
#158 - Mpi mode sets all nodes to the same SM_CURRENT_HOST
Issue -
State: open - Opened by verdimrc over 2 years ago
#157 - Feature: Register tensorflow and xla exception classes to sagemaker-t…
Pull Request -
State: closed - Opened by vinayburugu almost 3 years ago
- 10 comments
#156 - Improve coverage and fix collections DeprecationWarning
Pull Request -
State: closed - Opened by satishpasumarthi almost 3 years ago
- 3 comments
#155 - CVE-2007-4559 Patch
Pull Request -
State: open - Opened by TrellixVulnTeam almost 3 years ago
#154 - feature: Add torch_distributed support for Trainium instances in SageMaker
Pull Request -
State: closed - Opened by satishpasumarthi almost 3 years ago
- 12 comments
#153 - feature: Add neuron cores support
Pull Request -
State: closed - Opened by satishpasumarthi almost 3 years ago
- 1 comment
#152 - Feature: Add Neuron core support
Pull Request -
State: closed - Opened by satishpasumarthi almost 3 years ago
- 1 comment
#151 - feature: Register tensor flow and xla exception classes with sagemaker-training-toolkit
Pull Request -
State: closed - Opened by vinayburugu almost 3 years ago
- 62 comments
#150 - add tensor flow exception classes to the list of exception_classes…
Pull Request -
State: closed - Opened by vinayburugu almost 3 years ago
#149 - change: integrate upcoming dataparallel change to modelparallel
Pull Request -
State: closed - Opened by yongyanrao almost 3 years ago
- 3 comments
#148 - Avoid deprecated import via collections.abc.Mapping
Pull Request -
State: closed - Opened by lorenzwalthert almost 3 years ago
- 5 comments
#147 - Fix: Args for worker nodes in smdataparallel jobs
Pull Request -
State: closed - Opened by satishpasumarthi almost 3 years ago
- 1 comment
#146 - Add debugger exception to error classes
Pull Request -
State: closed - Opened by yl-to almost 3 years ago
- 20 comments
#145 - fix: Improve worker nodes waiting mechanism in MPI jobs
Pull Request -
State: closed - Opened by satishpasumarthi almost 3 years ago
- 15 comments
#144 - fix: Enable PT XLA distributed training on homogeneous clusters
Pull Request -
State: closed - Opened by Lokiiiiii almost 3 years ago
- 2 comments
#143 - Fix: adding EFA specific setup to distributed training runner for PT-XLA
Pull Request -
State: closed - Opened by Lokiiiiii almost 3 years ago
- 1 comment
#142 - change: update num_processes_per_host for smdataparallel runner
Pull Request -
State: closed - Opened by vishwakaria almost 3 years ago
- 1 comment
#141 - fix: Removed version hardcoding for sagemaker test dependency
Pull Request -
State: closed - Opened by jleeleee almost 3 years ago
- 1 comment
#140 - relax exception type
Pull Request -
State: closed - Opened by roywei almost 3 years ago
- 8 comments
#139 - change: update distribution_instance_group for pytorch ddp
Pull Request -
State: closed - Opened by vishwakaria almost 3 years ago
- 2 comments
#138 - Specify flake8 config file explicitly
Pull Request -
State: closed - Opened by nish21 almost 3 years ago
- 4 comments
#137 - Feature: Create a new distribution mechanism for PT-XLA
Pull Request -
State: closed - Opened by Lokiiiiii almost 3 years ago
- 44 comments
#136 - fix: handle utf-8 decoding exceptions while processing std streams
Pull Request -
State: closed - Opened by vishwakaria about 3 years ago
- 1 comment
#135 - feature: Heterogeneous cluster changes
Pull Request -
State: closed - Opened by satishpasumarthi about 3 years ago
- 1 comment
#134 - update: protobuf version to overlap with TF requirements
Pull Request -
State: closed - Opened by nish21 about 3 years ago
- 1 comment