An open API service that provides issue and pull request metadata for open source projects.

GitHub / aws/sagemaker-training-toolkit issues and pull requests

#234 - Loosen or update protobuf pinned version

Issue - State: open - Opened by wickeat 2 months ago

#233 - setuptools 78.0.1 incompatibility

Issue - State: open - Opened by aristidesz 4 months ago

#232 - feature: Add Code Owners file

Pull Request - State: closed - Opened by 992X 6 months ago

#230 - Create devcontainer.json

Pull Request - State: open - Opened by technewbie12 8 months ago - 1 comment

#229 - fix: resolve failing unit test

Pull Request - State: closed - Opened by jessicazhu3 8 months ago

#228 - fix: avoid parsing stderr as JSON

Pull Request - State: closed - Opened by danielsnider 8 months ago - 1 comment

#227 - Fix unknown argument: '-export-dynamic' on macOS

Pull Request - State: open - Opened by jponf 8 months ago - 1 comment

#226 - fix: temporarily hardcode neuron cores for trn2

Pull Request - State: closed - Opened by jessicazhu3 8 months ago

#225 - Build failure on MacOS

Issue - State: open - Opened by DRKolev-code 8 months ago - 1 comment

#224 - [TEST]

Pull Request - State: closed - Opened by SecurityResearcher-yoda 11 months ago

#223 - Fix: Preserve hyperparameter order when invoking training jobs

Pull Request - State: open - Opened by vsimkus 11 months ago

#222 - Silent Failure if custom image puts something into /opt/ml/code

Issue - State: open - Opened by njbrake 11 months ago - 1 comment

#221 - SageMaker training toolkit reorders hyperparameters

Issue - State: open - Opened by vsimkus 11 months ago

#220 - feature: Add p5 as a supported NCCL instance

Pull Request - State: closed - Opened by andjsmi 11 months ago

#217 - fix: typo in the run unit tests command

Pull Request - State: closed - Opened by bhaoz 12 months ago

#215 - chore: removing unnecessary logging information

Pull Request - State: closed - Opened by bhaoz 12 months ago

#214 - feature: Add support for py39 and py310

Pull Request - State: closed - Opened by prtsh about 1 year ago - 1 comment

#213 - Validate smddprun() fails with file not found error on AL2023

Issue - State: closed - Opened by jimmyrigby94 over 1 year ago - 1 comment

#212 - build test

Pull Request - State: open - Opened by emeraldbay over 1 year ago

#211 - feature: add python module entrypoint type, add python module support…

Pull Request - State: open - Opened by clumsy over 1 year ago - 1 comment

#210 - feature: add python module entrypoint type, add python module support…

Pull Request - State: closed - Opened by clumsy over 1 year ago - 1 comment

#209 - feature: add python module entrypoint type, add python module support…

Pull Request - State: closed - Opened by clumsy over 1 year ago - 1 comment

#208 - Add TFlops calculator and stuck job monitor

Pull Request - State: open - Opened by emeraldbay over 1 year ago

#207 - Get region with ENV var

Issue - State: open - Opened by austinmw over 1 year ago

#206 - Invalid dash-separated options for description-file

Issue - State: open - Opened by wickeat over 1 year ago

#205 - feature: add python module entrypoint type, add python module support to torch_distributed

Pull Request - State: closed - Opened by clumsy over 1 year ago - 5 comments

#203 - change: update the boto deps to use latest boto

Pull Request - State: closed - Opened by mufaddal-rohawala over 1 year ago

#202 - change: bypass DNS check for studio local exec

Pull Request - State: closed - Opened by mufaddal-rohawala almost 2 years ago

#201 - fix: toolkit build failure

Pull Request - State: closed - Opened by emeraldbay almost 2 years ago - 1 comment

#199 - test

Pull Request - State: closed - Opened by emeraldbay almost 2 years ago

#198 - fix: use smddprun only if it is installed

Pull Request - State: closed - Opened by ruhanprasad almost 2 years ago - 1 comment

#197 - fix: Remove Python 3.7 to fix the CI

Pull Request - State: closed - Opened by emeraldbay almost 2 years ago

#195 - fix: Test CI

Pull Request - State: closed - Opened by emeraldbay almost 2 years ago

#194 - fix: SMDDP does not support P5 instances with SMP

Pull Request - State: closed - Opened by apoorvtintin almost 2 years ago - 1 comment

#193 - Issue when training in local mode with huggingface training container

Issue - State: open - Opened by ojturner almost 2 years ago - 1 comment

#192 - fix: SMDDP does not support P5 instances with SMP

Pull Request - State: closed - Opened by apoorvtintin almost 2 years ago - 2 comments

#191 - P5 instance support

Issue - State: open - Opened by haozhx23 almost 2 years ago

#190 - feat: Initial change for Sagemaker provided health check

Pull Request - State: closed - Opened by emeraldbay almost 2 years ago

#189 - feat: support codeartifact for installing requirements.txt packages

Pull Request - State: closed - Opened by humanzz almost 2 years ago - 2 comments

#188 - dummy commit to test CI/CD

Pull Request - State: closed - Opened by emeraldbay almost 2 years ago

#187 - feat: support codeartifact for installing requirements.txt packages

Pull Request - State: closed - Opened by humanzz about 2 years ago - 5 comments

#186 - Adding sys.path to PYTHONPATH breaks virtual environments

Issue - State: open - Opened by pdveenstra about 2 years ago

#185 - Add SM dataparallel exception class in mpi distribution

Pull Request - State: closed - Opened by stu1130 about 2 years ago - 1 comment

#184 - Deepspeed Launcher

Issue - State: open - Opened by anupam-dewan about 2 years ago

#183 - Added support for neuron_parallel_compile for trn1 (trainium)

Pull Request - State: closed - Opened by VijayNiles about 2 years ago - 1 comment

#182 - Add NCCL_ALGO env var for modelparallel jobs

Pull Request - State: closed - Opened by yongyanrao over 2 years ago - 2 comments

#180 - unpin sagemaker version as the credential issue fixed

Pull Request - State: closed - Opened by yl-to over 2 years ago

#179 - Testing PR for SageMaker version

Pull Request - State: closed - Opened by yl-to over 2 years ago

#178 - fix: increase worker waiting time for ORTE proc

Pull Request - State: closed - Opened by yl-to over 2 years ago - 1 comment

#177 - change: upgrade protobuf version for tensorflow 2.12

Pull Request - State: closed - Opened by yl-to over 2 years ago

#176 - fix: Revert SMDDP collectives feature from smdataparallel runner

Pull Request - State: closed - Opened by vishwakaria over 2 years ago

#175 - Fix: to fix SMTrainingCompilerConfigurationError handling in process.py

Pull Request - State: closed - Opened by vinayburugu over 2 years ago - 8 comments

#174 - Publish wheels to PyPI

Issue - State: open - Opened by hajapy over 2 years ago

#172 - fix: SMTrainingCompilerConfigurationError takes no keyword argument

Pull Request - State: closed - Opened by ShiboXing over 2 years ago

#170 - change: update libraries for SMDDP collectives validation

Pull Request - State: closed - Opened by vishwakaria over 2 years ago

#169 - Upgrade protobuf to prevent conflicts with smdebugger.

Pull Request - State: closed - Opened by josephevans over 2 years ago

#166 - Stack based error attribution for errors arising from compiler code

Pull Request - State: closed - Opened by vinayburugu over 2 years ago - 15 comments

#164 - Remove magic strings for attributes like instance type

Issue - State: open - Opened by vishwakaria over 2 years ago

#163 - Fix: To add script to build tensorflow container for integration tests

Pull Request - State: closed - Opened by vinayburugu over 2 years ago - 2 comments

#162 - feature: add support for SMDDP collectives to smdataparallel runner

Pull Request - State: closed - Opened by vishwakaria over 2 years ago - 8 comments

#161 - Python 3.6 unsupported [bug/question]

Issue - State: open - Opened by adamwrobel-ext-gd over 2 years ago - 1 comment

#160 - Feature: Stack trace based failure attribution for SageMaker Training Compiler

Pull Request - State: closed - Opened by vinayburugu over 2 years ago - 6 comments

#159 - add general exception to filter

Pull Request - State: closed - Opened by roywei over 2 years ago - 4 comments

#158 - Mpi mode sets all nodes to the same SM_CURRENT_HOST

Issue - State: open - Opened by verdimrc over 2 years ago

#157 - Feature: Register tensorflow and xla exception classes to sagemaker-t…

Pull Request - State: closed - Opened by vinayburugu almost 3 years ago - 10 comments

#156 - Improve coverage and fix collections DeprecationWarning

Pull Request - State: closed - Opened by satishpasumarthi almost 3 years ago - 3 comments

#155 - CVE-2007-4559 Patch

Pull Request - State: open - Opened by TrellixVulnTeam almost 3 years ago

#154 - feature: Add torch_distributed support for Trainium instances in SageMaker

Pull Request - State: closed - Opened by satishpasumarthi almost 3 years ago - 12 comments

#153 - feature: Add neuron cores support

Pull Request - State: closed - Opened by satishpasumarthi almost 3 years ago - 1 comment

#152 - Feature: Add Neuron core support

Pull Request - State: closed - Opened by satishpasumarthi almost 3 years ago - 1 comment

#151 - feature: Register tensor flow and xla exception classes with sagemaker-training-toolkit

Pull Request - State: closed - Opened by vinayburugu almost 3 years ago - 62 comments

#150 - add tensor flow exception classes to the list of exception_classes…

Pull Request - State: closed - Opened by vinayburugu almost 3 years ago

#149 - change: integrate upcoming dataparallel change to modelparallel

Pull Request - State: closed - Opened by yongyanrao almost 3 years ago - 3 comments

#148 - Avoid deprecated import via collections.abc.Mapping

Pull Request - State: closed - Opened by lorenzwalthert almost 3 years ago - 5 comments

#147 - Fix: Args for worker nodes in smdataparallel jobs

Pull Request - State: closed - Opened by satishpasumarthi almost 3 years ago - 1 comment

#146 - Add debugger exception to error classes

Pull Request - State: closed - Opened by yl-to almost 3 years ago - 20 comments

#145 - fix: Improve worker nodes waiting mechanism in MPI jobs

Pull Request - State: closed - Opened by satishpasumarthi almost 3 years ago - 15 comments

#144 - fix: Enable PT XLA distributed training on homogeneous clusters

Pull Request - State: closed - Opened by Lokiiiiii almost 3 years ago - 2 comments

#143 - Fix: adding EFA specific setup to distributed training runner for PT-XLA

Pull Request - State: closed - Opened by Lokiiiiii almost 3 years ago - 1 comment

#142 - change: update num_processes_per_host for smdataparallel runner

Pull Request - State: closed - Opened by vishwakaria almost 3 years ago - 1 comment

#141 - fix: Removed version hardcoding for sagemaker test dependency

Pull Request - State: closed - Opened by jleeleee almost 3 years ago - 1 comment

#140 - relax exception type

Pull Request - State: closed - Opened by roywei almost 3 years ago - 8 comments

#139 - change: update distribution_instance_group for pytorch ddp

Pull Request - State: closed - Opened by vishwakaria almost 3 years ago - 2 comments

#138 - Specify flake8 config file explicitly

Pull Request - State: closed - Opened by nish21 almost 3 years ago - 4 comments

#137 - Feature: Create a new distribution mechanism for PT-XLA

Pull Request - State: closed - Opened by Lokiiiiii almost 3 years ago - 44 comments

#136 - fix: handle utf-8 decoding exceptions while processing std streams

Pull Request - State: closed - Opened by vishwakaria about 3 years ago - 1 comment

#135 - feature: Heterogeneous cluster changes

Pull Request - State: closed - Opened by satishpasumarthi about 3 years ago - 1 comment

#134 - update: protobuf version to overlap with TF requirements

Pull Request - State: closed - Opened by nish21 about 3 years ago - 1 comment