Ecosyste.ms: Issues

An open API service providing issue and pull request metadata for open source projects.

GitHub / aws/sagemaker-training-toolkit issues and pull requests

#217 - fix: typo in the run unit tests command

Pull Request - State: closed - Opened by bhaoz about 2 months ago

#215 - chore: removing unnecessary logging information

Pull Request - State: closed - Opened by bhaoz about 2 months ago

#214 - feature: Add support for py39 and py310

Pull Request - State: closed - Opened by prtsh 4 months ago - 1 comment

#213 - Validate smddprun() fails with file not found error on AL2023

Issue - State: closed - Opened by jimmyrigby94 6 months ago - 1 comment

#212 - build test

Pull Request - State: open - Opened by emeraldbay 6 months ago

#211 - feature: add python module entrypoint type, add python module support…

Pull Request - State: open - Opened by clumsy 7 months ago - 1 comment

#210 - feature: add python module entrypoint type, add python module support…

Pull Request - State: closed - Opened by clumsy 7 months ago - 1 comment

#209 - feature: add python module entrypoint type, add python module support…

Pull Request - State: closed - Opened by clumsy 7 months ago - 1 comment

#208 - Add TFlops calculator and stuck job monitor

Pull Request - State: open - Opened by emeraldbay 8 months ago

#207 - Get region with ENV var

Issue - State: open - Opened by austinmw 8 months ago

#206 - Invalid dash-separated options for description-file

Issue - State: open - Opened by wickeat 10 months ago

#203 - change: update the boto deps to use latest boto

Pull Request - State: closed - Opened by mufaddal-rohawala 11 months ago

#202 - change: bypass DNS check for studio local exec

Pull Request - State: closed - Opened by mufaddal-rohawala 12 months ago

#201 - fix: toolkit build failure

Pull Request - State: closed - Opened by emeraldbay 12 months ago - 1 comment

#199 - test

Pull Request - State: closed - Opened by emeraldbay 12 months ago

#198 - fix: use smddprun only if it is installed

Pull Request - State: closed - Opened by ruhanprasad 12 months ago - 1 comment

#197 - fix: Remove Python 3.7 to fix the CI

Pull Request - State: closed - Opened by emeraldbay 12 months ago

#195 - fix: Test CI

Pull Request - State: closed - Opened by emeraldbay about 1 year ago

#194 - fix: SMDDP does not support P5 instances with SMP

Pull Request - State: closed - Opened by apoorvtintin about 1 year ago - 1 comment

#192 - fix: SMDDP does not support P5 instances with SMP

Pull Request - State: closed - Opened by apoorvtintin about 1 year ago - 2 comments

#191 - P5 instance support

Issue - State: open - Opened by haozhx23 about 1 year ago

#190 - feat: Initial change for Sagemaker provided health check

Pull Request - State: closed - Opened by emeraldbay about 1 year ago

#189 - feat: support codeartifact for installing requirements.txt packages

Pull Request - State: closed - Opened by humanzz about 1 year ago - 2 comments

#188 - dummy commit to test CI/CD

Pull Request - State: closed - Opened by emeraldbay about 1 year ago

#187 - feat: support codeartifact for installing requirements.txt packages

Pull Request - State: closed - Opened by humanzz about 1 year ago - 5 comments

#185 - Add SM dataparallel exception class in mpi distribution

Pull Request - State: closed - Opened by stu1130 over 1 year ago - 1 comment

#184 - Deepspeed Launcher

Issue - State: open - Opened by anupam-dewan over 1 year ago

#183 - Added supported for neuron_parallel_compile for trn1 (trainium)

Pull Request - State: closed - Opened by VijayNiles over 1 year ago - 1 comment

#182 - Add NCCL_ALGO env var for modelparallel jobs

Pull Request - State: closed - Opened by yongyanrao over 1 year ago - 2 comments

#180 - unpin sagemaker version as the credential issue fixed

Pull Request - State: closed - Opened by yl-to over 1 year ago

#179 - Testing PR for SageMaker version

Pull Request - State: closed - Opened by yl-to over 1 year ago

#178 - fix: increase worker waiting time for ORTE proc

Pull Request - State: closed - Opened by yl-to over 1 year ago - 1 comment

#177 - change: upagrade protobuf version for tensorflow 2.12

Pull Request - State: closed - Opened by yl-to over 1 year ago

#176 - fix: Revert SMDDP collectives feature from smdataparallel runner

Pull Request - State: closed - Opened by vishwakaria over 1 year ago

#175 - Fix: to fix SMTrainingCompilerConfigurationError handling in process.py

Pull Request - State: closed - Opened by vinayburugu over 1 year ago - 8 comments

#174 - Publish wheels to PyPI

Issue - State: open - Opened by hajapy over 1 year ago

#172 - fix: SMTrainingCompilerConfigurationError takes no keyword argument

Pull Request - State: closed - Opened by ShiboXing over 1 year ago

#170 - change: update libraries for SMDDP collectives validation

Pull Request - State: closed - Opened by vishwakaria over 1 year ago

#169 - Upgrade protobuf to prevent conflicts with smdebugger.

Pull Request - State: closed - Opened by josephevans over 1 year ago

#167 - Support CodeArtifact repositories for installing Python packages

Issue - State: closed - Opened by humanzz almost 2 years ago

#166 - Stack based error attribution for errors arising from compiler code

Pull Request - State: closed - Opened by vinayburugu almost 2 years ago - 15 comments

#164 - Remove magic strings for attributes like instance type

Issue - State: open - Opened by vishwakaria almost 2 years ago

#163 - Fix: To add script to build tensorflow container for integration tests

Pull Request - State: closed - Opened by vinayburugu almost 2 years ago - 2 comments

#162 - feature: add support for SMDDP collectives to smdataparallel runner

Pull Request - State: closed - Opened by vishwakaria almost 2 years ago - 8 comments

#161 - Python 3.6 unsupported [bug/question]

Issue - State: open - Opened by adamwrobel-ext-gd almost 2 years ago - 1 comment

#160 - Feature: Stack trace based failure attribution for SageMaker Training Compiler

Pull Request - State: closed - Opened by vinayburugu almost 2 years ago - 6 comments

#159 - add general exception to filter

Pull Request - State: closed - Opened by roywei almost 2 years ago - 4 comments

#158 - Mpi mode sets all nodes to the same SM_CURRENT_HOST

Issue - State: open - Opened by verdimrc almost 2 years ago

#157 - Feature: Register tensorflow and xla exception classes to sagemaker-t…

Pull Request - State: closed - Opened by vinayburugu almost 2 years ago - 10 comments

#156 - Improve coverage and fix collections DeprecationWarning

Pull Request - State: closed - Opened by satishpasumarthi almost 2 years ago - 3 comments

#155 - CVE-2007-4559 Patch

Pull Request - State: open - Opened by TrellixVulnTeam almost 2 years ago

#154 - feature: Add torch_distributed support for Trainium instances in SageMaker

Pull Request - State: closed - Opened by satishpasumarthi almost 2 years ago - 12 comments

#153 - feature: Add neuron cores support

Pull Request - State: closed - Opened by satishpasumarthi almost 2 years ago - 1 comment

#152 - Feature: Add Neuron core support

Pull Request - State: closed - Opened by satishpasumarthi almost 2 years ago - 1 comment

#151 - feature: Register tensor flow and xla exception classes with sagemaker-training-toolkit

Pull Request - State: closed - Opened by vinayburugu almost 2 years ago - 62 comments

#150 - add tensor flow exception classes to the list of exception_classes…

Pull Request - State: closed - Opened by vinayburugu almost 2 years ago

#149 - change: integrate upcoming dataparallel change to modelparallel

Pull Request - State: closed - Opened by yongyanrao about 2 years ago - 3 comments

#148 - Avoid deprecated import via collections.abc.Mapping

Pull Request - State: closed - Opened by lorenzwalthert about 2 years ago - 5 comments

#147 - Fix: Args for worker nodes in smdataparallel jobs

Pull Request - State: closed - Opened by satishpasumarthi about 2 years ago - 1 comment

#146 - Add debugger exception to error classes

Pull Request - State: closed - Opened by yl-to about 2 years ago - 20 comments

#145 - fix: Improve worker nodes waiting mechanism in MPI jobs

Pull Request - State: closed - Opened by satishpasumarthi about 2 years ago - 15 comments

#144 - fix: Enable PT XLA distributed training on homogeneous clusters

Pull Request - State: closed - Opened by Lokiiiiii about 2 years ago - 2 comments

#143 - Fix: adding EFA specific setup to distributed training runner for PT-XLA

Pull Request - State: closed - Opened by Lokiiiiii about 2 years ago - 1 comment

#142 - change: update num_processes_per_host for smdataparallel runner

Pull Request - State: closed - Opened by vishwakaria about 2 years ago - 1 comment

#141 - fix: Removed version hardcoding for sagemaker test dependency

Pull Request - State: closed - Opened by jleeleee about 2 years ago - 1 comment

#140 - relax exception type

Pull Request - State: closed - Opened by roywei about 2 years ago - 8 comments

#139 - change: update distribution_instance_group for pytorch ddp

Pull Request - State: closed - Opened by vishwakaria about 2 years ago - 2 comments

#138 - Specify flake8 config file explicitly

Pull Request - State: closed - Opened by nish21 about 2 years ago - 4 comments

#137 - Feature: Create a new distribution mechanism for PT-XLA

Pull Request - State: closed - Opened by Lokiiiiii about 2 years ago - 44 comments

#136 - fix: handle utf-8 decoding exceptions while processing std streams

Pull Request - State: closed - Opened by vishwakaria about 2 years ago - 1 comment

#135 - feature: Heterogeneous cluster changes

Pull Request - State: closed - Opened by satishpasumarthi about 2 years ago - 1 comment

#134 - update: protobuf version to overlap with TF requirements

Pull Request - State: closed - Opened by nish21 over 2 years ago - 1 comment

#133 - SM library telemetry improvement

Pull Request - State: closed - Opened by roywei over 2 years ago - 2 comments

#132 - Version 4.1.4 fails to install because of the missing protobuf dependency

Issue - State: closed - Opened by szafranek over 2 years ago - 2 comments

#131 - Fix none exception class issue for mpi

Pull Request - State: closed - Opened by haohanchen-yagao over 2 years ago - 5 comments

#130 - Feature: Adding new parameter for TF Multi Worker Mirrored Strategy

Pull Request - State: closed - Opened by Lokiiiiii over 2 years ago - 4 comments

#129 - No support for Python 3.10

Issue - State: open - Opened by peter-wimsey over 2 years ago - 8 comments

#128 - Hyperparameters not shell escaped

Issue - State: open - Opened by bstriner over 2 years ago

#127 - Shlex quote

Pull Request - State: open - Opened by bstriner over 2 years ago - 2 comments

#126 - Pass SIGTERM to training subprocess

Pull Request - State: open - Opened by bstriner over 2 years ago - 9 comments

#125 - Pass SIGTERM to training script to stop training

Issue - State: open - Opened by bstriner over 2 years ago

#124 - fix: fix flaky issue with incorrect rc being given

Pull Request - State: closed - Opened by matherit over 2 years ago - 2 comments

#123 - Use framework provided error class and stack trace as error message

Pull Request - State: closed - Opened by roywei over 2 years ago - 17 comments

#122 - fix: missing args when shell script is used

Pull Request - State: closed - Opened by satishpasumarthi over 2 years ago - 6 comments

#121 - add back FI_EFA_USE_DEVICE_RDMA=1 flag, revert 2936f22

Pull Request - State: closed - Opened by ydaiming over 2 years ago - 15 comments

#120 - feature: Add Native Pytorch DDP Support

Pull Request - State: closed - Opened by satishpasumarthi over 2 years ago - 10 comments

#119 - Arguments not always accessible when using bash script for training job

Issue - State: closed - Opened by marcelgwerder over 2 years ago - 3 comments

#118 - Enable custom failure logging

Pull Request - State: closed - Opened by satishpasumarthi over 2 years ago - 5 comments
Labels: priority: high

#117 - Should to_cmd_args pass complex types through json.dumps instead of str?

Issue - State: closed - Opened by croth1 over 2 years ago - 1 comment