GitHub / kubeflow/mpi-operator issues and pull requests
#700 - New fix kustomize5 warnings
Pull Request -
State: closed - Opened by vikas-saxena02 3 months ago
- 4 comments
Labels: approved, lgtm, size/M
#699 - Bump golang.org/x/net from 0.36.0 to 0.38.0
Pull Request -
State: closed - Opened by dependabot[bot] 4 months ago
- 1 comment
Labels: approved, lgtm, size/M, dependencies, go
#698 - [feature] pull image from ghcr in manifest
Pull Request -
State: open - Opened by mahdikhashan 4 months ago
- 1 comment
Labels: size/S
#697 - Use cncf-hosted gha runners
Pull Request -
State: open - Opened by jeefy 4 months ago
- 2 comments
Labels: size/XS
#696 - remove zw0610 from reviewer
Pull Request -
State: closed - Opened by zw0610 4 months ago
- 5 comments
Labels: approved, lgtm, size/XS
#695 - Upgrade Go version to 1.24
Pull Request -
State: closed - Opened by tenzen-y 4 months ago
- 2 comments
Labels: lgtm, size/S
#694 - Bump golang.org/x/crypto from 0.31.0 to 0.35.0
Pull Request -
State: closed - Opened by dependabot[bot] 4 months ago
- 1 comment
Labels: approved, lgtm, size/M, dependencies, go
#693 - Remove alculquicondor from OWNERS
Pull Request -
State: closed - Opened by alculquicondor 4 months ago
- 6 comments
Labels: approved, size/XS
#692 - Trust the Intel OneAPI PGP key until it satisfies new APT PGP requirements
Pull Request -
State: closed - Opened by tenzen-y 4 months ago
- 2 comments
Labels: approved, lgtm, size/XS
#691 - Intel OneAPI PGP key does not support new APT requirements
Issue -
State: open - Opened by tenzen-y 4 months ago
#690 - Fix missing ReplicaIndexLabel when using RunLauncherAsWorker
Pull Request -
State: closed - Opened by GonzaloSaez 4 months ago
- 6 comments
Labels: approved, lgtm, size/S
#689 - [feature]: migrate docker image push to ghcr
Pull Request -
State: open - Opened by mahdikhashan 4 months ago
- 10 comments
Labels: size/S
#688 - Fix kustomize5 warnings
Pull Request -
State: closed - Opened by vikas-saxena02 4 months ago
- 13 comments
Labels: size/M, do-not-merge/hold
#687 - launcher pod spec changes not applied when suspend and resume MPIJob
Issue -
State: open - Opened by dbdydgur2244 4 months ago
- 1 comment
#686 - Perform Image building in parallel in CI
Pull Request -
State: closed - Opened by tenzen-y 5 months ago
- 2 comments
Labels: approved, size/S
#685 - Upgrade Debian version to trixie for OpenMPI v5.0
Pull Request -
State: closed - Opened by tenzen-y 5 months ago
- 1 comment
Labels: approved, lgtm, size/S
#684 - Upload container images to Github Container Registry
Issue -
State: open - Opened by tenzen-y 5 months ago
- 4 comments
Labels: kind/feature
#683 - Bump golang.org/x/net from 0.28.0 to 0.36.0
Pull Request -
State: closed - Opened by dependabot[bot] 5 months ago
- 3 comments
Labels: approved, lgtm, size/XS, dependencies, go
#682 - bug(MPI Training) : Scheduling Policy doc bug for MPIJob
Issue -
State: closed - Opened by ttakahashi21 5 months ago
- 9 comments
Labels: kind/bug
#681 - increase `intel-oneapi-mpi-devel` version to 2021.14
Pull Request -
State: open - Opened by mahdikhashan 6 months ago
- 1 comment
Labels: do-not-merge/work-in-progress, size/XS
#680 - Insufficient Slots for MPIJob with 2 Worker Pods and 2 Gaudi Cards Each
Issue -
State: open - Opened by gera-aldama 6 months ago
- 3 comments
#679 - chore: update k8s to v1.32
Pull Request -
State: open - Opened by dongjiang1989 6 months ago
- 5 comments
Labels: approved, lgtm, size/XXL, do-not-merge/hold
#678 - Upgrade Intel MPI version to 2021.14
Issue -
State: open - Opened by tenzen-y 7 months ago
- 6 comments
Labels: help wanted, kind/bug
#677 - Bump golang.org/x/net from 0.28.0 to 0.33.0
Pull Request -
State: closed - Opened by dependabot[bot] 7 months ago
- 4 comments
Labels: approved, lgtm, size/XS, dependencies
#676 - Fix E2E Intel MPI integ tests
Pull Request -
State: closed - Opened by GonzaloSaez 7 months ago
- 2 comments
Labels: approved, lgtm, size/M
#675 - Failed IntelMPI E2E tests
Issue -
State: closed - Opened by tenzen-y 7 months ago
- 10 comments
Labels: kind/bug
#674 - Expose job controller's workqueue rate limiting configs
Pull Request -
State: closed - Opened by roteme-runai 7 months ago
- 8 comments
Labels: approved, lgtm, size/M
#673 - chore: bump golang.org/x/crypto from v0.26.0 to v0.31.0
Pull Request -
State: closed - Opened by cmontemuino 8 months ago
- 8 comments
Labels: approved, size/M
#672 - CVE-2024-45337 in golang.org/x/crypto package
Issue -
State: closed - Opened by cmontemuino 8 months ago
- 1 comment
#671 - DO NOT MERGE: E2E CI CHECK
Pull Request -
State: closed - Opened by tenzen-y 9 months ago
- 4 comments
Labels: size/L, do-not-merge/hold
#670 - Do not create the launcher job if the job starts suspended
Pull Request -
State: open - Opened by GonzaloSaez 9 months ago
- 3 comments
Labels: size/M
#669 - Fix crash in podgroup when runLauncherAsWorker is true
Pull Request -
State: closed - Opened by GonzaloSaez 9 months ago
- 9 comments
Labels: approved, lgtm, size/L
#668 - Update image tag with release-0.6
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 4 comments
Labels: approved, lgtm, size/XS
#667 - Reuse the core kubernetes API reason for the BackoffLimitExceeded
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 2 comments
Labels: approved, lgtm, size/S
#666 - Fix the 'printf: non-constant format string in call to fmt.Errorf (govet)' lint errors
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 2 comments
Labels: approved, lgtm, size/M
#665 - Prepare v0.6.0 release
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 6 comments
Labels: approved, lgtm, size/S
#664 - Bump to k8s 1.31
Pull Request -
State: closed - Opened by ArangoGutierrez 10 months ago
- 11 comments
Labels: approved, lgtm, size/XXL
#663 - Obviously specify the supported platforms in Makefile
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 2 comments
Labels: approved, lgtm, size/S
#662 - Error Building Custom MPI Image Following Documentation
Issue -
State: open - Opened by luancaarvalho 10 months ago
- 1 comment
#661 - Introduce debian bookworm
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 3 comments
Labels: approved, lgtm, size/S
#660 - Add support for linux/ppc64le for MPICH
Issue -
State: open - Opened by tenzen-y 10 months ago
Labels: kind/feature
#659 - Upgrade volcano version to v1.10.0
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 3 comments
Labels: approved, lgtm, size/XS
#658 - Issue connecting to nodes that are not within the same cluster
Issue -
State: open - Opened by yxusnapchat 10 months ago
- 2 comments
#657 - Upgrade the k8s dependency versions to 1.30
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 3 comments
Labels: approved, lgtm, size/XXL
#656 - Adjust the comment for managedBy
Pull Request -
State: closed - Opened by mszadkow 10 months ago
- 2 comments
Labels: approved, lgtm, size/XS
#655 - Bump K8s to 1.31
Pull Request -
State: closed - Opened by ArangoGutierrez 10 months ago
- 2 comments
Labels: size/XXL
#654 - Release v0.6.0 requirements
Issue -
State: closed - Opened by tenzen-y 10 months ago
- 7 comments
#653 - Upgrade the scheduler-plugins to v0.29.8
Pull Request -
State: closed - Opened by tenzen-y 10 months ago
- 4 comments
Labels: approved, lgtm, size/M
#652 - Next release date with updated k8s libraries for 1.31
Issue -
State: closed - Opened by klueska 10 months ago
- 9 comments
#651 - I set backoffLimit: 3 and restartPolicy: Never but MPIJOB does not create a new pod
Issue -
State: closed - Opened by gyupup 10 months ago
- 6 comments
#650 - Introduce ManagedBy field in RunPolicy
Pull Request -
State: closed - Opened by mszadkow 10 months ago
- 4 comments
Labels: approved, lgtm, size/L, ok-to-test
#649 - How the file at tensorflow-benchmarks.yaml can run an MPI job ?
Issue -
State: closed - Opened by luancaarvalho 11 months ago
- 4 comments
#648 - What scale can mpi-operator support?
Issue -
State: open - Opened by yxzhao6 11 months ago
- 3 comments
#647 - Worker pods not cleaned up upon `MPIJobEvicted` event
Issue -
State: open - Opened by shaowei-su 12 months ago
#646 - Add support for the managedBy field
Issue -
State: closed - Opened by mimowo 12 months ago
- 6 comments
#645 - Question: Is the network traffic of AllReduce(like, ML gradients) encrypted between workers?
Issue -
State: closed - Opened by jsyqrt about 1 year ago
- 10 comments
#644 - ttlSecondsAfterFinished for MPIJob, not only launcher
Issue -
State: open - Opened by hy00nc about 1 year ago
- 6 comments
#643 - "cleanPodPolicy: All" does not clean up launcher pod
Issue -
State: open - Opened by hy00nc about 1 year ago
- 1 comment
#642 - Connection reset
Issue -
State: closed - Opened by bbenshab about 1 year ago
- 4 comments
#641 - how could mpijob of mpi operator worker get the hostname of launcher
Issue -
State: closed - Opened by Oneal65 about 1 year ago
- 2 comments
#640 - fix #639 provide NCCL tests example
Pull Request -
State: open - Opened by samos123 over 1 year ago
- 4 comments
Labels: do-not-merge/work-in-progress, size/L
#639 - NCCL tests example
Issue -
State: open - Opened by samos123 over 1 year ago
- 1 comment
#638 - Update image tag with 0.5
Pull Request -
State: closed - Opened by tenzen-y over 1 year ago
- 2 comments
Labels: approved, lgtm, size/XS
#637 - Upgrade golang and controller-gen
Pull Request -
State: closed - Opened by tenzen-y over 1 year ago
- 2 comments
Labels: approved, lgtm, size/XXL
#636 - Upgrade golang and controller-gen
Pull Request -
State: closed - Opened by alculquicondor over 1 year ago
- 9 comments
Labels: size/XXL
#635 - Replace original pointer methods with ptr libs
Pull Request -
State: closed - Opened by tenzen-y over 1 year ago
- 6 comments
Labels: approved, lgtm, size/L
#634 - Introduce resource multiplication
Pull Request -
State: closed - Opened by tenzen-y over 1 year ago
- 4 comments
Labels: approved, lgtm, size/S
#633 - Upgrade K8s dependencies to v1.29
Pull Request -
State: closed - Opened by tenzen-y over 1 year ago
- 12 comments
Labels: approved, lgtm, size/XXL
#632 - Promote @tenzen-y to approver
Pull Request -
State: closed - Opened by terrytangyuan over 1 year ago
- 2 comments
Labels: approved, size/XS
#631 - Prepare for release 0.5.0
Pull Request -
State: closed - Opened by alculquicondor over 1 year ago
- 5 comments
Labels: approved, lgtm, size/S
#630 - Remove unnecessary RBAC rule for mpijobs-admin***
Pull Request -
State: open - Opened by vishvajit79 over 1 year ago
- 2 comments
Labels: size/XS
#629 - Bump google.golang.org/protobuf from 1.31.0 to 1.33.0
Pull Request -
State: closed - Opened by dependabot[bot] over 1 year ago
- 2 comments
Labels: approved, lgtm, size/XS, dependencies
#628 - Fix: no overwrite when run launcher as worker
Pull Request -
State: closed - Opened by kuizhiqing over 1 year ago
- 1 comment
Labels: approved, lgtm, size/L
#627 - Deprecated pointer, use ptr instead
Pull Request -
State: closed - Opened by kuizhiqing over 1 year ago
- 2 comments
Labels: approved, lgtm, size/L
#626 - make namespace parsing and informers pluggable
Pull Request -
State: open - Opened by emsixteeen over 1 year ago
- 9 comments
Labels: size/L
#625 - removing klog.Fatalf in favor of a shutdown request
Pull Request -
State: closed - Opened by emsixteeen over 1 year ago
- 6 comments
Labels: size/XS
#624 - adding Mac .DS_Store to gitignore
Pull Request -
State: closed - Opened by emsixteeen over 1 year ago
- 1 comment
Labels: approved, lgtm, size/XS
#623 - update auto gen file year to verify generate
Pull Request -
State: closed - Opened by kuizhiqing over 1 year ago
- 2 comments
Labels: approved, lgtm, size/M
#622 - Fix: add ns filter to podLister
Pull Request -
State: closed - Opened by kuizhiqing over 1 year ago
- 3 comments
Labels: approved, lgtm, size/XS
#621 - Wrong host info in discover_hosts.sh
Issue -
State: closed - Opened by kuizhiqing over 1 year ago
#620 - Running in a subset of namespaces
Issue -
State: open - Opened by emsixteeen over 1 year ago
- 8 comments
#619 - Fails mpi-operator early if access to list or watch objects is denied
Pull Request -
State: closed - Opened by emsixteeen over 1 year ago
- 8 comments
Labels: approved, lgtm, size/S
#618 - adding timeout for cache sync
Pull Request -
State: closed - Opened by emsixteeen over 1 year ago
- 14 comments
Labels: size/S
#617 - fix the condition
Pull Request -
State: open - Opened by wang-mask over 1 year ago
- 12 comments
Labels: size/XS
#616 - change1 mv to cp
Pull Request -
State: closed - Opened by wang-mask over 1 year ago
- 3 comments
Labels: approved, lgtm, size/XS
#615 - The operator still creates the launcher when launcherCreationPolicy is "WaitForWorkersReady" and suspend is "true"
Issue -
State: open - Opened by wang-mask over 1 year ago
#614 - "make generate" command run failed
Issue -
State: closed - Opened by wang-mask over 1 year ago
#613 - Replace the plain pod workers with Indexed Job
Issue -
State: open - Opened by tenzen-y over 1 year ago
- 4 comments
#612 - run worker process in launcher pod
Pull Request -
State: closed - Opened by kuizhiqing over 1 year ago
- 31 comments
Labels: approved, lgtm, size/L
#611 - Work with DeepSpeed for large scale training
Issue -
State: open - Opened by kuizhiqing over 1 year ago
- 28 comments
#610 - add deepspeed example
Pull Request -
State: open - Opened by kuizhiqing over 1 year ago
- 5 comments
Labels: do-not-merge/work-in-progress, size/M
#609 - Bump golang.org/x/crypto from 0.14.0 to 0.17.0
Pull Request -
State: closed - Opened by dependabot[bot] over 1 year ago
- 2 comments
Labels: approved, lgtm, size/S, dependencies
#607 - the object has been modified; please apply your changes to the latest version and try again
Issue -
State: open - Opened by gl-001 over 1 year ago
- 8 comments
#606 - fix bug about status absence when worker pod spec is invalid
Pull Request -
State: open - Opened by congpeiqing over 1 year ago
- 1 comment
Labels: size/S
#605 - which is the latest mpi job definition between mpi-operator and training operator
Issue -
State: closed - Opened by sxwl-donggang over 1 year ago
- 4 comments
#604 - Can't get mpijob status when pod template is invalid
Issue -
State: open - Opened by congpeiqing over 1 year ago
- 9 comments
#603 - Bumping opentelemetry libraries
Pull Request -
State: closed - Opened by tenzen-y over 1 year ago
- 2 comments
Labels: approved, lgtm, size/L
#602 - Bump go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc from 0.35.0 to 0.46.0
Pull Request -
State: closed - Opened by dependabot[bot] over 1 year ago
- 4 comments
Labels: size/M, dependencies
#601 - Fix invalid link for horovod cpu-only example Dockerfile
Pull Request -
State: closed - Opened by lianghao208 over 1 year ago
- 2 comments
Labels: approved, lgtm, size/XS
#600 - Fix invalid link for horovod cpu-only example
Pull Request -
State: closed - Opened by lianghao208 over 1 year ago
- 1 comment
Labels: approved, lgtm, size/XS