Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / kubeflow/pytorch-operator issues and pull requests
#365 - unable to build image for ppc64le
Issue -
State: open - Opened by gajanankulkarni-18 about 3 years ago
#364 - PytorchJob DDP training will stop if I delete a worker pod
Issue -
State: open - Opened by Shuai-Xie about 3 years ago
- 2 comments
#363 - run https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed
Issue -
State: open - Opened by seansxl about 3 years ago
- 1 comment
#362 - Multi-gpu in a single pod
Issue -
State: open - Opened by wallarug about 3 years ago
- 2 comments
#361 - Add notice before archiving
Pull Request -
State: closed - Opened by terrytangyuan about 3 years ago
- 2 comments
Labels: size/XS, approved, lgtm
#360 - service label mismatches selector, which result in inconsistency
Issue -
State: open - Opened by konnase about 3 years ago
- 3 comments
Labels: kind/bug
#359 - The training hangs after reloading one of master/worker pods
Issue -
State: open - Opened by dmitsf over 3 years ago
- 5 comments
Labels: kind/question, area/engprod
#358 - Can not use volcano for Gang Scheduling
Issue -
State: closed - Opened by bug-developer021 over 3 years ago
#357 - support set volcano queue name
Pull Request -
State: open - Opened by qiankunli over 3 years ago
- 2 comments
Labels: size/M, needs-ok-to-test
#356 - Can I freeze pytorchjob training pods and migrate them to other nodes?
Issue -
State: open - Opened by Shuai-Xie over 3 years ago
- 9 comments
#355 - Pytorch version may have an effect on the training reproduction
Issue -
State: open - Opened by Shuai-Xie over 3 years ago
- 4 comments
#354 - Different DDP training results of PytorchJob and Bare Metal
Issue -
State: open - Opened by Shuai-Xie over 3 years ago
- 6 comments
#353 - Can I use hostNetwork to run PytorchJob like on bare metal
Issue -
State: closed - Opened by Shuai-Xie over 3 years ago
- 3 comments
#352 - Can PytorchJob skip or cancel the init container?
Issue -
State: open - Opened by SeibertronSS over 3 years ago
- 2 comments
#351 - volcano change the PodGroup CRD APIGroup to volcano.sh
Issue -
State: open - Opened by qiankunli over 3 years ago
- 1 comment
#350 - How to use DDP in pytorch operator?
Issue -
State: closed - Opened by SeibertronSS over 3 years ago
- 3 comments
#349 - why worker need initContainer in pytorch-operator?
Issue -
State: closed - Opened by zqz-net over 3 years ago
- 2 comments
#348 - container "pytorch" is waiting to start: PodInitializing
Issue -
State: open - Opened by gogogwwb over 3 years ago
- 20 comments
Labels: kind/bug
#347 - Upgrade to v1 CRDs
Issue -
State: open - Opened by mcristina422 over 3 years ago
- 1 comment
#346 - [feat] Support PyTorch 1.9
Issue -
State: open - Opened by gaocegege over 3 years ago
- 3 comments
#345 - Fix: Change PTL to release version
Pull Request -
State: closed - Opened by jagadeeshi2i over 3 years ago
- 2 comments
Labels: size/XS, approved, lgtm, ok-to-test
#344 - PytorchJob replicas has different node affinity behaviors compared with Deployment
Issue -
State: open - Opened by Shuai-Xie over 3 years ago
- 4 comments
#343 - Update the versions of common, tfjob and some other modules
Pull Request -
State: closed - Opened by paipaoso over 3 years ago
- 7 comments
Labels: size/XXL, needs-ok-to-test
#342 - What is the difference between master and worker?
Issue -
State: closed - Opened by SeibertronSS over 3 years ago
- 6 comments
#341 - Fix 'Invalid Pointer' error when PytorchJob is deleted
Pull Request -
State: closed - Opened by alembiewski over 3 years ago
- 4 comments
Labels: size/XS, approved, lgtm, needs-ok-to-test
#340 - feel confused about world_size
Issue -
State: closed - Opened by ldd91 over 3 years ago
#339 - `init-pytorch` init container image configurable
Issue -
State: closed - Opened by apatil4 over 3 years ago
- 4 comments
#338 - Add job namespace to `pytorch_operator_jobs_*` counters
Pull Request -
State: closed - Opened by alembiewski over 3 years ago
- 4 comments
Labels: approved, size/M, lgtm, ok-to-test
#337 - Bert example with Pytorch Lightning
Pull Request -
State: closed - Opened by jagadeeshi2i over 3 years ago
- 2 comments
Labels: approved, lgtm, size/XL
#336 - Adding example config file
Pull Request -
State: closed - Opened by johnugeorge over 3 years ago
- 4 comments
Labels: approved, lgtm, size/S
#335 - Worker template should be configurable.
Issue -
State: open - Opened by MartinForReal over 3 years ago
- 1 comment
#334 - PyTorch Lightning Example.
Issue -
State: closed - Opened by tchaton over 3 years ago
#333 - 'host not found' error occurs during PyTorch distributed learning
Issue -
State: open - Opened by JGoo1 almost 4 years ago
- 1 comment
Labels: kind/feature
#332 - NCCL "Connection Refused" for Worker Pods
Issue -
State: open - Opened by twolffpiggott almost 4 years ago
- 1 comment
#331 - whether multi-gpu-per-pod setup be supported in PytorchJob
Issue -
State: open - Opened by tingweiwu almost 4 years ago
- 1 comment
#330 - can I use PyTorchJobClient inside a pod of the cluster?
Issue -
State: open - Opened by omlomloml almost 4 years ago
- 1 comment
#329 - worker get connection timed out error in user namespace with sidecar.istio.io/inject=false
Issue -
State: closed - Opened by tingweiwu almost 4 years ago
- 1 comment
#328 - is there a simpler way to install pytorch-operator
Issue -
State: closed - Opened by tingweiwu almost 4 years ago
- 2 comments
#327 - Change mnist example to use FashionMNIST
Pull Request -
State: closed - Opened by Jeffwan almost 4 years ago
- 2 comments
Labels: size/XS, approved, lgtm
#326 - Temporarily disable mnist test case
Pull Request -
State: closed - Opened by Jeffwan almost 4 years ago
- 3 comments
Labels: approved, lgtm, size/S
#325 - Mnist dataset server is down
Issue -
State: open - Opened by Jeffwan almost 4 years ago
- 5 comments
#324 - [DO NOT MERGE] Change to test CI
Pull Request -
State: closed - Opened by yanniszark almost 4 years ago
- 4 comments
Labels: size/XS
#323 - pytorch-operator: Consolidate manifests
Pull Request -
State: closed - Opened by yanniszark almost 4 years ago
- 7 comments
Labels: approved, lgtm, size/L
#322 - pytorch-operator: Consolidate manifests
Issue -
State: closed - Opened by yanniszark almost 4 years ago
- 1 comment
#321 - Operator has invalid memory address error on specific pytorchjob spec
Issue -
State: open - Opened by ca-scribner almost 4 years ago
- 1 comment
#320 - PyTorch Operator: Move manifests development upstream
Pull Request -
State: closed - Opened by yanniszark almost 4 years ago
- 4 comments
Labels: approved, lgtm, size/L
#319 - Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator
Issue -
State: open - Opened by asahalyft almost 4 years ago
- 4 comments
Labels: kind/bug
#318 - PyTorch Operator: Move manifests development upstream
Issue -
State: closed - Opened by yanniszark almost 4 years ago
#317 - Is python sdk still being maintained?
Issue -
State: open - Opened by ca-scribner almost 4 years ago
- 7 comments
#316 - Migrate to new test-infra
Pull Request -
State: closed - Opened by PatrickXYS about 4 years ago
- 36 comments
Labels: approved, size/M, lgtm
#315 - add dependabot config script
Pull Request -
State: open - Opened by DavidSpek about 4 years ago
- 4 comments
Labels: size/L, do-not-merge/hold
#314 - Please create v1.2-branch
Issue -
State: closed - Opened by SatwikBhandiwad about 4 years ago
- 3 comments
#313 - dist.init_process_group stuck
Issue -
State: open - Opened by ravenj73 about 4 years ago
- 9 comments
#312 - kubeflow pipelines sdk, distributed multi-node training with autoscaling
Issue -
State: closed - Opened by rami3e about 4 years ago
- 4 comments
#311 - Does pytorch-opterator just simplified the use of nn.parallel.DistributedDataParallel on multi nodes of multi gpu?
Issue -
State: closed - Opened by lwj1980s about 4 years ago
- 2 comments
#310 - can I use gpus on specific node to train
Issue -
State: closed - Opened by lwj1980s over 4 years ago
- 5 comments
Labels: kind/question, area/front-end, question
#309 - Add @andreyvelich to approvers
Pull Request -
State: closed - Opened by andreyvelich over 4 years ago
- 2 comments
Labels: size/XS, approved, lgtm
#308 - Reuse Common Scripts for Creating / Deleting EKS clusters
Pull Request -
State: closed - Opened by PatrickXYS over 4 years ago
- 6 comments
Labels: approved, size/M, lgtm
#307 - Do not trigger presubmit jobs for simple changes
Issue -
State: open - Opened by Jeffwan over 4 years ago
- 1 comment
Labels: area/engprod, kind/feature
#306 - Add Jeffwan@ to OWNERS
Pull Request -
State: closed - Opened by Jeffwan over 4 years ago
- 11 comments
Labels: size/XS, approved, lgtm
#305 - Move PyTorch Operator e2e tests to AWS Prow
Pull Request -
State: closed - Opened by Jeffwan over 4 years ago
- 35 comments
Labels: approved, lgtm, size/L
#304 - how can I run a pytorch job with all my Gpu resources
Issue -
State: closed - Opened by lwj1980s over 4 years ago
- 4 comments
Labels: kind/question
#303 - Add test friendly manifests
Pull Request -
State: closed - Opened by Jeffwan over 4 years ago
- 6 comments
Labels: size/S
#302 - Make manifest test friendly
Issue -
State: closed - Opened by Jeffwan over 4 years ago
- 2 comments
Labels: area/engprod, kind/feature
#301 - Support manifest on Kubernetes 1.16+
Pull Request -
State: closed - Opened by Jeffwan over 4 years ago
- 6 comments
Labels: size/XS
#300 - Updated the image name format for the gcr.io.
Pull Request -
State: open - Opened by wuchen03 over 4 years ago
- 11 comments
Labels: size/S, ok-to-test
#299 - Activate Travis in PR check
Issue -
State: open - Opened by andreyvelich over 4 years ago
- 2 comments
Labels: priority/p1, kind/feature
#298 - Change cluster version to 1.16 for e2e test
Pull Request -
State: closed - Opened by andreyvelich over 4 years ago
- 2 comments
Labels: size/XS
#297 - Test webhook
Pull Request -
State: closed - Opened by Jeffwan over 4 years ago
- 1 comment
Labels: size/XS
#296 - Support Torch Elastic in pytorch operator
Issue -
State: open - Opened by Jeffwan over 4 years ago
- 2 comments
Labels: kind/feature
#295 - update pytorch-operator deployment manifests file
Pull Request -
State: closed - Opened by myonlyzzy over 4 years ago
- 15 comments
Labels: size/XS, approved, lgtm, ok-to-test
#294 - pytorch-operator pod CheckCRDExist failed
Issue -
State: closed - Opened by myonlyzzy over 4 years ago
- 3 comments
Labels: kind/bug
#293 - Fix Unit Tests
Pull Request -
State: closed - Opened by andreyvelich over 4 years ago
- 24 comments
Labels: approved, size/M, lgtm
#292 - [bug] Unit test is broken
Issue -
State: open - Opened by gaocegege over 4 years ago
- 4 comments
Labels: priority/p0, area/engprod, kind/bug
#291 - './pytorch_job_sendrecv.yaml' missing in pytorch-operator/examples/smoke-dist
Issue -
State: closed - Opened by Lyken17 over 4 years ago
- 6 comments
Labels: area/front-end, kind/bug
#290 - Update README.md
Pull Request -
State: closed - Opened by pingsutw over 4 years ago
- 2 comments
Labels: size/XS, approved, lgtm
#289 - Update CRD link
Pull Request -
State: closed - Opened by pingsutw over 4 years ago
- 1 comment
Labels: size/XS, approved, lgtm
#288 - support cleanPodPolicy is Running, same as tf operator
Pull Request -
State: closed - Opened by jiaqianjing over 4 years ago
- 10 comments
Labels: approved, size/M, lgtm, ok-to-test
#287 - how to create a local non-distributed training
Issue -
State: closed - Opened by houz42 over 4 years ago
- 7 comments
Labels: kind/question, kind/feature
#286 - chore: Update OWNERS
Pull Request -
State: closed - Opened by gaocegege over 4 years ago
- 2 comments
Labels: size/XS, approved, lgtm
#285 - Adds notes and example annotation for pytorch job
Pull Request -
State: closed - Opened by shawnzhu over 4 years ago
- 3 comments
Labels: approved, lgtm, size/S
#284 - PyTorchJob CRD definition link is broken
Issue -
State: closed - Opened by sakaia over 4 years ago
- 2 comments
Labels: area/docs, kind/bug
#283 - Do we need pod name and namespace in manifests?
Issue -
State: open - Opened by gaocegege over 4 years ago
- 2 comments
Labels: kind/question, area/operator
#282 - Migrate code implementation to kubeflow/common fashion
Issue -
State: open - Opened by Jeffwan over 4 years ago
- 3 comments
Labels: kind/feature
#281 - Where are the pytorch-crd and pytorch-operator YAML files?
Issue -
State: closed - Opened by g-karthik over 4 years ago
- 8 comments
Labels: kind/question
#277 - fix Dockerfile-mpi download miniconda.sh
Pull Request -
State: closed - Opened by jiaqianjing over 4 years ago
- 9 comments
Labels: size/XS, approved, lgtm, ok-to-test
#275 - OCI Runtime error for init-pytorch on AKS
Issue -
State: closed - Opened by wangdian over 4 years ago
- 3 comments
Labels: kind/bug
#274 - Update openapi-gen to not rely on vendor
Pull Request -
State: closed - Opened by Jeffwan over 4 years ago
- 7 comments
Labels: approved, lgtm, size/L
#271 - Distributed mnist is unexpectedly slow
Issue -
State: open - Opened by panchul over 4 years ago
- 7 comments
Labels: kind/bug
#262 - pin kubernetes client version to work around a bug
Pull Request -
State: closed - Opened by jinchihe almost 5 years ago
- 5 comments
Labels: approved, lgtm, size/S
#259 - Kubernetes 1.6 support
Issue -
State: closed - Opened by posix4e almost 5 years ago
- 5 comments
Labels: area/front-end, area/operator, kind/feature
#258 - PyTorchJob worker pods crashloops in non-default namespace
Issue -
State: open - Opened by jobvarkey almost 5 years ago
- 7 comments
Labels: kind/bug
#257 - Updated the GPU compatible Docker building process with the Kubeflow…
Pull Request -
State: open - Opened by MATRIX4284 about 5 years ago
- 8 comments
Labels: size/XS, needs-ok-to-test
#254 - Link to CRD definition is broken
Issue -
State: closed - Opened by sakaia about 5 years ago
- 4 comments
Labels: area/front-end, kind/bug
#243 - v1beta folder has been renamed to v1 so needs the path too
Pull Request -
State: open - Opened by MATRIX4284 about 5 years ago
- 7 comments
Labels: size/XS, approved, lgtm, ok-to-test
#237 - GCP preemptible instances
Issue -
State: open - Opened by Nintorac about 5 years ago
- 4 comments
#219 - Right way to use pytorch-operator for multi-node multi-gpu setup
Issue -
State: open - Opened by lainisourgod over 5 years ago
- 13 comments
Labels: kind/question, area/engprod, priority/p2
#209 - use the priority of kube-batch
Pull Request -
State: open - Opened by YesterdayxD over 5 years ago
- 11 comments
Labels: size/M, ok-to-test
#190 - Integration into kubeflow pipeline
Issue -
State: open - Opened by miguelvr over 5 years ago
- 7 comments
Labels: kind/question, area/engprod, priority/p2
#128 - Distribution across multi-gpu nodes
Issue -
State: closed - Opened by SeanNaren about 6 years ago
- 7 comments
Labels: kind/question