Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / kubeflow/pytorch-operator issues and pull requests

#365 - unable to build image for ppc64le

Issue - State: open - Opened by gajanankulkarni-18 about 3 years ago

#364 - PytorchJob DDP training will stop if I delete a worker pod

Issue - State: open - Opened by Shuai-Xie about 3 years ago - 2 comments

#362 - Multi-gpu in a single pod

Issue - State: open - Opened by wallarug about 3 years ago - 2 comments

#361 - Add notice before archiving

Pull Request - State: closed - Opened by terrytangyuan about 3 years ago - 2 comments
Labels: size/XS, approved, lgtm

#360 - service label mismatches selector, which result in inconsistency

Issue - State: open - Opened by konnase about 3 years ago - 3 comments
Labels: kind/bug

#359 - The training hangs after reloading one of master/worker pods

Issue - State: open - Opened by dmitsf over 3 years ago - 5 comments
Labels: kind/question, area/engprod

#358 - Can not use volcano for Gang Scheduling

Issue - State: closed - Opened by bug-developer021 over 3 years ago

#357 - support set volcano queue name

Pull Request - State: open - Opened by qiankunli over 3 years ago - 2 comments
Labels: size/M, needs-ok-to-test

#356 - Can I freeze pytorchjob training pods and migrate them to other nodes?

Issue - State: open - Opened by Shuai-Xie over 3 years ago - 9 comments

#355 - Pytorch version may have an effect on the training reproduction

Issue - State: open - Opened by Shuai-Xie over 3 years ago - 4 comments

#354 - Different DDP training results of PytorchJob and Bare Metal

Issue - State: open - Opened by Shuai-Xie over 3 years ago - 6 comments

#353 - Can I use hostNetwork to run PytorchJob like on bare metal

Issue - State: closed - Opened by Shuai-Xie over 3 years ago - 3 comments

#352 - Can PytorchJob skip or cancel the init container?

Issue - State: open - Opened by SeibertronSS over 3 years ago - 2 comments

#351 - volcano change the PodGroup CRD APIGroup to volcano.sh

Issue - State: open - Opened by qiankunli over 3 years ago - 1 comment

#350 - How to use DDP in pytorch operator?

Issue - State: closed - Opened by SeibertronSS over 3 years ago - 3 comments

#349 - why worker need initContainer in pytorch-operator?

Issue - State: closed - Opened by zqz-net over 3 years ago - 2 comments

#348 - container "pytorch" is waiting to start: PodInitializing

Issue - State: open - Opened by gogogwwb over 3 years ago - 20 comments
Labels: kind/bug

#347 - Upgrade to v1 CRDs

Issue - State: open - Opened by mcristina422 over 3 years ago - 1 comment

#346 - [feat] Support PyTorch 1.9

Issue - State: open - Opened by gaocegege over 3 years ago - 3 comments

#345 - Fix: Change PTL to release version

Pull Request - State: closed - Opened by jagadeeshi2i over 3 years ago - 2 comments
Labels: size/XS, approved, lgtm, ok-to-test

#343 - Update the versions of common, tfjob and some other modules

Pull Request - State: closed - Opened by paipaoso over 3 years ago - 7 comments
Labels: size/XXL, needs-ok-to-test

#342 - What is the difference between master and worker?

Issue - State: closed - Opened by SeibertronSS over 3 years ago - 6 comments

#341 - Fix 'Invalid Pointer' error when PytorchJob is deleted

Pull Request - State: closed - Opened by alembiewski over 3 years ago - 4 comments
Labels: size/XS, approved, lgtm, needs-ok-to-test

#340 - feel confused about world_size

Issue - State: closed - Opened by ldd91 over 3 years ago

#339 - `init-pytorch` init container image configurable

Issue - State: closed - Opened by apatil4 over 3 years ago - 4 comments

#338 - Add job namespace to `pytorch_operator_jobs_*` counters

Pull Request - State: closed - Opened by alembiewski over 3 years ago - 4 comments
Labels: approved, size/M, lgtm, ok-to-test

#337 - Bert example with Pytorch Lightning

Pull Request - State: closed - Opened by jagadeeshi2i over 3 years ago - 2 comments
Labels: approved, lgtm, size/XL

#336 - Adding example config file

Pull Request - State: closed - Opened by johnugeorge over 3 years ago - 4 comments
Labels: approved, lgtm, size/S

#335 - Worker template should be configurable.

Issue - State: open - Opened by MartinForReal over 3 years ago - 1 comment

#334 - PyTorch Lightning Example.

Issue - State: closed - Opened by tchaton over 3 years ago

#333 - 'host not found' error occurs during PyTorch distributed learning

Issue - State: open - Opened by JGoo1 almost 4 years ago - 1 comment
Labels: kind/feature

#332 - NCCL "Connection Refused" for Worker Pods

Issue - State: open - Opened by twolffpiggott almost 4 years ago - 1 comment

#331 - whether multi-gpu-per-pod setup be supported in PytorchJob

Issue - State: open - Opened by tingweiwu almost 4 years ago - 1 comment

#330 - can I use PyTorchJobClient inside a pod of the cluster?

Issue - State: open - Opened by omlomloml almost 4 years ago - 1 comment

#328 - is there a simpler way to install pytorch-operator

Issue - State: closed - Opened by tingweiwu almost 4 years ago - 2 comments

#327 - Change mnist example to use FashionMNIST

Pull Request - State: closed - Opened by Jeffwan almost 4 years ago - 2 comments
Labels: size/XS, approved, lgtm

#326 - Temporarily disable mnist test case

Pull Request - State: closed - Opened by Jeffwan almost 4 years ago - 3 comments
Labels: approved, lgtm, size/S

#325 - Mnist dataset server is down

Issue - State: open - Opened by Jeffwan almost 4 years ago - 5 comments

#324 - [DO NOT MERGE] Change to test CI

Pull Request - State: closed - Opened by yanniszark almost 4 years ago - 4 comments
Labels: size/XS

#323 - pytorch-operator: Consolidate manifests

Pull Request - State: closed - Opened by yanniszark almost 4 years ago - 7 comments
Labels: approved, lgtm, size/L

#322 - pytorch-operator: Consolidate manifests

Issue - State: closed - Opened by yanniszark almost 4 years ago - 1 comment

#321 - Operator has invalid memory address error on specific pytorchjob spec

Issue - State: open - Opened by ca-scribner almost 4 years ago - 1 comment

#320 - PyTorch Operator: Move manifests development upstream

Pull Request - State: closed - Opened by yanniszark almost 4 years ago - 4 comments
Labels: approved, lgtm, size/L

#319 - Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator

Issue - State: open - Opened by asahalyft almost 4 years ago - 4 comments
Labels: kind/bug

#318 - PyTorch Operator: Move manifests development upstream

Issue - State: closed - Opened by yanniszark almost 4 years ago

#317 - Is python sdk still being maintained?

Issue - State: open - Opened by ca-scribner almost 4 years ago - 7 comments

#316 - Migrate to new test-infra

Pull Request - State: closed - Opened by PatrickXYS about 4 years ago - 36 comments
Labels: approved, size/M, lgtm

#315 - add dependabot config script

Pull Request - State: open - Opened by DavidSpek about 4 years ago - 4 comments
Labels: size/L, do-not-merge/hold

#314 - Please create v1.2-branch

Issue - State: closed - Opened by SatwikBhandiwad about 4 years ago - 3 comments

#313 - dist.init_process_group stuck

Issue - State: open - Opened by ravenj73 about 4 years ago - 9 comments

#312 - kubeflow pipelines sdk, distributed multi-node training with autoscaling

Issue - State: closed - Opened by rami3e about 4 years ago - 4 comments

#310 - can I use gpus on specific node to train

Issue - State: closed - Opened by lwj1980s over 4 years ago - 5 comments
Labels: kind/question, area/front-end, question

#309 - Add @andreyvelich to approvers

Pull Request - State: closed - Opened by andreyvelich over 4 years ago - 2 comments
Labels: size/XS, approved, lgtm

#308 - Reuse Common Scripts for Creating / Deleting EKS clusters

Pull Request - State: closed - Opened by PatrickXYS over 4 years ago - 6 comments
Labels: approved, size/M, lgtm

#307 - Do not trigger presubmit jobs for simple changes

Issue - State: open - Opened by Jeffwan over 4 years ago - 1 comment
Labels: area/engprod, kind/feature

#306 - Add Jeffwan@ to OWNERS

Pull Request - State: closed - Opened by Jeffwan over 4 years ago - 11 comments
Labels: size/XS, approved, lgtm

#305 - Move PyTorch Operator e2e tests to AWS Prow

Pull Request - State: closed - Opened by Jeffwan over 4 years ago - 35 comments
Labels: approved, lgtm, size/L

#304 - how can I run a pytorch job with all my Gpu resources

Issue - State: closed - Opened by lwj1980s over 4 years ago - 4 comments
Labels: kind/question

#303 - Add test friendly manifests

Pull Request - State: closed - Opened by Jeffwan over 4 years ago - 6 comments
Labels: size/S

#302 - Make manifest test friendly

Issue - State: closed - Opened by Jeffwan over 4 years ago - 2 comments
Labels: area/engprod, kind/feature

#301 - Support manifest on Kubernetes 1.16+

Pull Request - State: closed - Opened by Jeffwan over 4 years ago - 6 comments
Labels: size/XS

#300 - Updated the image name format for the gcr.io.

Pull Request - State: open - Opened by wuchen03 over 4 years ago - 11 comments
Labels: size/S, ok-to-test

#299 - Activate Travis in PR check

Issue - State: open - Opened by andreyvelich over 4 years ago - 2 comments
Labels: priority/p1, kind/feature

#298 - Change cluster version to 1.16 for e2e test

Pull Request - State: closed - Opened by andreyvelich over 4 years ago - 2 comments
Labels: size/XS

#297 - Test webhook

Pull Request - State: closed - Opened by Jeffwan over 4 years ago - 1 comment
Labels: size/XS

#296 - Support Torch Elastic in pytorch operator

Issue - State: open - Opened by Jeffwan over 4 years ago - 2 comments
Labels: kind/feature

#295 - update pytorch-operator deployment manifests file

Pull Request - State: closed - Opened by myonlyzzy over 4 years ago - 15 comments
Labels: size/XS, approved, lgtm, ok-to-test

#294 - pytorch-operator pod CheckCRDExist failed

Issue - State: closed - Opened by myonlyzzy over 4 years ago - 3 comments
Labels: kind/bug

#293 - Fix Unit Tests

Pull Request - State: closed - Opened by andreyvelich over 4 years ago - 24 comments
Labels: approved, size/M, lgtm

#292 - [bug] Unit test is broken

Issue - State: open - Opened by gaocegege over 4 years ago - 4 comments
Labels: priority/p0, area/engprod, kind/bug

#291 - './pytorch_job_sendrecv.yaml' missing in pytorch-operator/examples/smoke-dist

Issue - State: closed - Opened by Lyken17 over 4 years ago - 6 comments
Labels: area/front-end, kind/bug

#290 - Update README.md

Pull Request - State: closed - Opened by pingsutw over 4 years ago - 2 comments
Labels: size/XS, approved, lgtm

#289 - Update CRD link

Pull Request - State: closed - Opened by pingsutw over 4 years ago - 1 comment
Labels: size/XS, approved, lgtm

#288 - support cleanPodPolicy is Running, same as tf operator

Pull Request - State: closed - Opened by jiaqianjing over 4 years ago - 10 comments
Labels: approved, size/M, lgtm, ok-to-test

#287 - how to create a local non-distributed training

Issue - State: closed - Opened by houz42 over 4 years ago - 7 comments
Labels: kind/question, kind/feature

#286 - chore: Update OWNERS

Pull Request - State: closed - Opened by gaocegege over 4 years ago - 2 comments
Labels: size/XS, approved, lgtm

#285 - Adds notes and example annotation for pytorch job

Pull Request - State: closed - Opened by shawnzhu over 4 years ago - 3 comments
Labels: approved, lgtm, size/S

#284 - PyTorchJob CRD definition link is broken

Issue - State: closed - Opened by sakaia over 4 years ago - 2 comments
Labels: area/docs, kind/bug

#283 - Do we need pod name and namespace in manifests?

Issue - State: open - Opened by gaocegege over 4 years ago - 2 comments
Labels: kind/question, area/operator

#282 - Migrate code implementation to kubeflow/common fashion

Issue - State: open - Opened by Jeffwan over 4 years ago - 3 comments
Labels: kind/feature

#281 - Where are the pytorch-crd and pytorch-operator YAML files?

Issue - State: closed - Opened by g-karthik over 4 years ago - 8 comments
Labels: kind/question

#277 - fix Dockerfile-mpi download miniconda.sh

Pull Request - State: closed - Opened by jiaqianjing over 4 years ago - 9 comments
Labels: size/XS, approved, lgtm, ok-to-test

#275 - OCI Runtime error for init-pytorch on AKS

Issue - State: closed - Opened by wangdian over 4 years ago - 3 comments
Labels: kind/bug

#274 - Update openapi-gen to not rely on vendor

Pull Request - State: closed - Opened by Jeffwan over 4 years ago - 7 comments
Labels: approved, lgtm, size/L

#271 - Distributed mnist is unexpectedly slow

Issue - State: open - Opened by panchul over 4 years ago - 7 comments
Labels: kind/bug

#262 - pin kubernetes client version to work around a bug

Pull Request - State: closed - Opened by jinchihe almost 5 years ago - 5 comments
Labels: approved, lgtm, size/S

#259 - Kubernetes 1.6 support

Issue - State: closed - Opened by posix4e almost 5 years ago - 5 comments
Labels: area/front-end, area/operator, kind/feature

#258 - PyTorchJob worker pods crashloops in non-default namespace

Issue - State: open - Opened by jobvarkey almost 5 years ago - 7 comments
Labels: kind/bug

#257 - Updated the GPU compatible Docker building process with the Kubeflow…

Pull Request - State: open - Opened by MATRIX4284 about 5 years ago - 8 comments
Labels: size/XS, needs-ok-to-test

#254 - Link to CRD definition is broken

Issue - State: closed - Opened by sakaia about 5 years ago - 4 comments
Labels: area/front-end, kind/bug

#243 - v1beta folder has been renamed to v1 so needs the path too

Pull Request - State: open - Opened by MATRIX4284 about 5 years ago - 7 comments
Labels: size/XS, approved, lgtm, ok-to-test

#237 - GCP preemptible instances

Issue - State: open - Opened by Nintorac about 5 years ago - 4 comments

#219 - Right way to use pytorch-operator for multi-node multi-gpu setup

Issue - State: open - Opened by lainisourgod over 5 years ago - 13 comments
Labels: kind/question, area/engprod, priority/p2

#209 - use the priority of kube-batch

Pull Request - State: open - Opened by YesterdayxD over 5 years ago - 11 comments
Labels: size/M, ok-to-test

#190 - Integration into kubeflow pipeline

Issue - State: open - Opened by miguelvr over 5 years ago - 7 comments
Labels: kind/question, area/engprod, priority/p2

#128 - Distribution across multi-gpu nodes

Issue - State: closed - Opened by SeanNaren about 6 years ago - 7 comments
Labels: kind/question