Ecosyste.ms: Issues

An open API service that provides issue and pull request metadata for open source projects.

GitHub / guidebooks/store issues and pull requests

#791 - fix: remove Items field

Pull Request - State: closed - Opened by Sara-KS 12 months ago

#790 - feat: Update to mcad v1.34.1 support and torchx 0.6.0

Pull Request - State: closed - Opened by Sara-KS 12 months ago

#789 - fix: more EOF protection fixes

Pull Request - State: closed - Opened by starpit over 1 year ago

#788 - Update pvc.yaml - add diskfree parameter

Pull Request - State: closed - Opened by ykoyfman over 1 year ago

#785 - fix: increase max log requests for app logs

Pull Request - State: closed - Opened by starpit over 1 year ago

#784 - fix: ray head wait-for-workers initContainer should retry if wait fails

Pull Request - State: closed - Opened by starpit over 1 year ago

#782 - fix: custodian logs container fails due to unescaped $ in $TAIL

Pull Request - State: closed - Opened by starpit over 1 year ago

#781 - fix: cache ray/torchx helm chart

Pull Request - State: closed - Opened by starpit over 1 year ago

#780 - fix: improve torchx support for running multiple gpus per pod

Pull Request - State: closed - Opened by starpit over 1 year ago

#779 - feat: add some NCCL tweaks

Pull Request - State: closed - Opened by starpit over 1 year ago

#778 - fix: syntax error in multinic for torchx

Pull Request - State: closed - Opened by starpit over 1 year ago

#777 - feat: add multinic support

Pull Request - State: closed - Opened by starpit over 1 year ago

#776 - fix: ray wait for workers initContainer not needed with 0 workers

Pull Request - State: closed - Opened by starpit over 1 year ago

#775 - fix: use initContainer to wait for ray workers

Pull Request - State: closed - Opened by starpit over 1 year ago

#774 - fix: increase ray gcs rpc timeout to 30s

Pull Request - State: closed - Opened by starpit over 1 year ago

#773 - fix: more EOF resiliency fixes for ray and torchx

Pull Request - State: closed - Opened by starpit over 1 year ago

#772 - fix: increase torchx log streaming resilience to network disconnects

Pull Request - State: closed - Opened by starpit over 1 year ago

#771 - fix: wait for ray workers prior to server-side job submit

Pull Request - State: closed - Opened by starpit over 1 year ago

#770 - fix: restore helm delete and increase resilience to network disconnects

Pull Request - State: closed - Opened by starpit over 1 year ago

#769 - fix: avoid helm delete in custodian for now

Pull Request - State: closed - Opened by starpit over 1 year ago

#768 - Revert "fix: avoid use of all-containers in ray log streamer"

Pull Request - State: closed - Opened by starpit over 1 year ago

#767 - fix: all-containers fix should async app logs and sync on ray head logs

Pull Request - State: closed - Opened by starpit over 1 year ago

#766 - Revert "fix: avoid use of all-containers in ray log streamer"

Pull Request - State: closed - Opened by starpit over 1 year ago

#765 - fix: avoid use of all-containers in ray log streamer

Pull Request - State: closed - Opened by starpit over 1 year ago

#764 - fix: increase memory for runtime-env custodian pod

Pull Request - State: closed - Opened by starpit over 1 year ago

#763 - fix: increase memory for ray head logs container

Pull Request - State: closed - Opened by starpit over 1 year ago

#762 - fix: torchx volume mount paths have extra quotes

Pull Request - State: closed - Opened by starpit over 1 year ago

#761 - fix: remove reliance on wget in ray head container

Pull Request - State: closed - Opened by starpit over 1 year ago

#760 - fix: improve custodian memory requests for larger jobs

Pull Request - State: closed - Opened by starpit over 1 year ago

#759 - fix: ignore __pycache__ when bundling up workdir

Pull Request - State: closed - Opened by starpit over 1 year ago

#758 - fix: improve support for pytorch lightning's fsspec[s3] support

Pull Request - State: closed - Opened by starpit over 1 year ago

#757 - fix: do not create gpu custodian container for non-gpu runs

Pull Request - State: closed - Opened by starpit over 1 year ago

#756 - fix: lower memory requests for some of the custodian pods

Pull Request - State: closed - Opened by starpit over 1 year ago

#755 - chore: move custodian to ml/codeflare/custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#754 - fix: add worker-status to custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#753 - fix: add runtime-env-setup to custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#752 - chore: remove old untested 'in-cluster' log aggregator

Pull Request - State: closed - Opened by starpit over 1 year ago

#751 - fix: eliminate newlines from base64

Pull Request - State: closed - Opened by starpit over 1 year ago

#750 - feat: add gpu utilization pod to custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#749 - feat: add memory utilization pod to custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#748 - feat: add cpu utilization pod to custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#747 - fix: use multi-line yaml to improve formatting of logs args

Pull Request - State: closed - Opened by starpit over 1 year ago

#746 - fix: lower custodian logs container 100m/128Mi -> 50m/32Mi

Pull Request - State: closed - Opened by starpit over 1 year ago

#745 - fix: clean up custodian command, and rename container 'logs'

Pull Request - State: closed - Opened by starpit over 1 year ago

#744 - fix: torchx cluster name may end with a dash

Pull Request - State: closed - Opened by starpit over 1 year ago

#743 - fix: owner label default needs to be quoted

Pull Request - State: closed - Opened by starpit over 1 year ago

#742 - fix: add app.kubernetes.io/owner label to pods

Pull Request - State: closed - Opened by starpit over 1 year ago

#741 - fix: add 'app.kubernetes.io/managed-by: codeflare' label to custodian

Pull Request - State: closed - Opened by starpit over 1 year ago

#740 - feat: improve custodian support for torchx, use smaller base image

Pull Request - State: closed - Opened by starpit over 1 year ago

#739 - fix: logs custodian should pull from kubectl logs, not ray job logs

Pull Request - State: closed - Opened by starpit over 1 year ago

#738 - fix: logs custodian has errors with tee'ing to file

Pull Request - State: closed - Opened by starpit over 1 year ago

#737 - feat: rename self-destruct to logs; and increase ttl timeout on its job

Pull Request - State: closed - Opened by starpit over 1 year ago

#736 - fix: final Succeeded message not shown in ray jobs

Pull Request - State: closed - Opened by starpit over 1 year ago

#735 - fix: further improvements to ray log streaming

Pull Request - State: closed - Opened by starpit over 1 year ago

#734 - fix: ray logs not smooth

Pull Request - State: closed - Opened by starpit over 1 year ago

#733 - feat: avoid websocat in ml/ray/run/logs

Pull Request - State: closed - Opened by starpit over 1 year ago

#732 - fix: websocat ray log streaming can be simplified

Pull Request - State: closed - Opened by starpit over 1 year ago

#731 - fix: decrease epochs from 5 to 2 for getting started ray example

Pull Request - State: closed - Opened by starpit over 1 year ago

#730 - fix: ray labels were using /name should use /instance

Pull Request - State: closed - Opened by starpit over 1 year ago

#729 - fix: vmstat data lacks pod/ prefix on pod name

Pull Request - State: closed - Opened by starpit over 1 year ago

#728 - fix: ray jobs emit job env.json only after job is running

Pull Request - State: closed - Opened by starpit over 1 year ago

#727 - fix: improve messaging of torchx wait-till-running

Pull Request - State: closed - Opened by starpit over 1 year ago

#726 - fix: pod-memory stream lacked pod/ prefix for hostname

Pull Request - State: closed - Opened by starpit over 1 year ago

#724 - fix: torchx env isn't written out till the job is already running

Pull Request - State: closed - Opened by starpit over 1 year ago

#723 - fix: capture job env vars for torchx runs

Pull Request - State: closed - Opened by starpit over 1 year ago

#722 - fix: torchx captured logs may not include Succeeded/Failed events

Pull Request - State: closed - Opened by starpit over 1 year ago

#721 - fix: syntax error in code block in torchx status poller

Pull Request - State: closed - Opened by starpit over 1 year ago

#720 - fix: torchx exit handlers were not right

Pull Request - State: closed - Opened by starpit over 1 year ago

#719 - fix: small refinements to torchx logs

Pull Request - State: closed - Opened by starpit over 1 year ago

#718 - fix: remove leftover 'set -x' from debugging

Pull Request - State: closed - Opened by starpit over 1 year ago

#717 - fix: torchx job status file needs to use tee -a to append

Pull Request - State: closed - Opened by starpit over 1 year ago

#716 - fix: improved event handling for torchx exit

Pull Request - State: closed - Opened by starpit over 1 year ago

#715 - fix: improve torchx status events to show Job status

Pull Request - State: closed - Opened by starpit over 1 year ago

#714 - fix: torchx jobs lacked kube event stream

Pull Request - State: closed - Opened by starpit over 1 year ago

#713 - fix: torchx script logic fails if python prefix is not python3

Pull Request - State: closed - Opened by starpit over 1 year ago

#712 - fix: clean up content and coloring of helm install output

Pull Request - State: closed - Opened by starpit over 1 year ago

#711 - fix: torchx cli install fails on zsh

Pull Request - State: closed - Opened by starpit over 1 year ago

#710 - fix: sed RE error can occur in torchx log streamer

Pull Request - State: closed - Opened by starpit over 1 year ago

#709 - fix: pass through guidebook env vars to torchx

Pull Request - State: closed - Opened by starpit over 1 year ago

#708 - fix: ml/torchx/run may fail for users with long user names

Pull Request - State: closed - Opened by starpit over 1 year ago

#707 - fix: torchx log streamer would fail if lines contained control chars

Pull Request - State: closed - Opened by starpit over 1 year ago

#706 - fix: update to official torchx 0.5.0 release

Pull Request - State: closed - Opened by starpit over 1 year ago

#705 - fix: don't fail if we can't hack uid-range

Pull Request - State: closed - Opened by starpit over 1 year ago

#704 - fix: in CI, don't try to use ssh git cloning for workdir

Pull Request - State: closed - Opened by starpit over 1 year ago

#703 - feat: add support for workdir being a github https:// url

Pull Request - State: closed - Opened by starpit over 1 year ago

#702 - fix: ml/torchx/run fails if main python file is not 'main.py'

Pull Request - State: closed - Opened by starpit over 1 year ago

#701 - fix: another fix for relative workdir

Pull Request - State: closed - Opened by starpit over 1 year ago

#700 - fix: further improvements to helm install with relative workdir

Pull Request - State: closed - Opened by starpit over 1 year ago

#698 - fix: force vmstat timestamps to use UTC timezone

Pull Request - State: closed - Opened by starpit over 1 year ago

#697 - fix: capture env.json in log aggregation

Pull Request - State: closed - Opened by starpit over 1 year ago

#695 - fix: gpu stream displays temps with % unit

Pull Request - State: closed - Opened by starpit over 1 year ago

#693 - fix: kubectl linux-arm64 installs arm32 binary

Pull Request - State: closed - Opened by starpit over 1 year ago

#692 - fix: bump to madwizard@8 to adopt shell.stdin convention

Pull Request - State: closed - Opened by starpit over 1 year ago