Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / bytedance/byteps issues and pull requests
#447 - install failed
Issue -
State: open - Opened by themoonstone 7 months ago
#446 - Supported CUDA and PyTorch versions
Issue -
State: open - Opened by themoonstone 7 months ago
#445 - support pytorch 2.1.x
Pull Request -
State: closed - Opened by rainj-me about 1 year ago
#444 - Is there any benchmark comparison with Megatron-LM ?
Issue -
State: open - Opened by sequoiar over 1 year ago
#443 - segmentation fault while launching the worker
Issue -
State: open - Opened by xuexiaxie over 1 year ago
- 1 comment
#442 - How does the tensorflow scheduler plugin used in the tf_benchmark_cnn.py
Issue -
State: open - Opened by sxqqslf over 1 year ago
- 1 comment
#441 - Mistakes of Workload calculation
Issue -
State: open - Opened by fly-dragon211 over 1 year ago
- 5 comments
#440 - Installation problem
Issue -
State: open - Opened by QingQingR about 2 years ago
#439 - Supported environment
Issue -
State: closed - Opened by QingQingR about 2 years ago
#438 - broadcast and is_initialized api are not supported with pytorch.
Issue -
State: open - Opened by HangJie720 over 2 years ago
#437 - support for fault tolerance and straggler mitigation
Issue -
State: open - Opened by youshaox over 2 years ago
#436 - Communication failure in MXNet with BytePS
Issue -
State: closed - Opened by qingyangDuan over 2 years ago
- 3 comments
#435 - 3rdparty: update pslite to fix shm name
Pull Request -
State: closed - Opened by ymjiang over 2 years ago
#434 - update shm naming scheme
Pull Request -
State: open - Opened by pleasantrabbit over 2 years ago
#433 - Installation error
Issue -
State: open - Opened by llplay over 2 years ago
- 1 comment
#432 - torch: update ddp
Pull Request -
State: open - Opened by pleasantrabbit over 2 years ago
#431 - Release BytePS docker image support for TF2
Issue -
State: open - Opened by shaowei-su over 2 years ago
#430 - Running multiple workers on a single GPU machine
Issue -
State: open - Opened by hamidralmasi almost 3 years ago
#429 - launcher: join workers as they exit
Pull Request -
State: closed - Opened by pleasantrabbit almost 3 years ago
#428 - Successfully installed BytePS but cannot import byteps.torch or byteps.tensorflow
Issue -
State: closed - Opened by hamidralmasi almost 3 years ago
- 2 comments
#427 - benchmark with cross barrier error
Issue -
State: open - Opened by panpanli521 almost 3 years ago
#426 - Are there plans to support CPU-only training? Our workers also run on CPU machines
Issue -
State: open - Opened by starkeisntein almost 3 years ago
- 2 comments
#425 - When will sparse models be supported?
Issue -
State: open - Opened by starkeisntein almost 3 years ago
#424 - ps-lite: disable ucx error handling by default
Pull Request -
State: closed - Opened by pleasantrabbit almost 3 years ago
#423 - ps-lite: update ps-lite
Pull Request -
State: closed - Opened by pleasantrabbit almost 3 years ago
#422 - Is it right to do allreduce immediately for non-zero ranks in bytescheduler?
Issue -
State: closed - Opened by sywang0111 almost 3 years ago
- 2 comments
#421 - server: exit log improvement
Pull Request -
State: closed - Opened by ymjiang almost 3 years ago
#420 - torch: fix compression when using apex.amp
Pull Request -
State: closed - Opened by pleasantrabbit almost 3 years ago
#419 - Stuck in the bps.init().
Issue -
State: closed - Opened by Fangjin98 almost 3 years ago
- 7 comments
#418 - The byteps in K8S Pod doesn't have DMLC_WORKER_ID configured.
Issue -
State: open - Opened by jackjinj almost 3 years ago
#417 - How to use gradient accumulate in BytePS torch DDP?
Issue -
State: open - Opened by wuyujiji about 3 years ago
- 5 comments
Labels: enhancement
#416 - tensorflow: fix bug in broadcast_variables
Pull Request -
State: closed - Opened by pleasantrabbit about 3 years ago
#415 - build: update ucx tarball download logic
Pull Request -
State: closed - Opened by pleasantrabbit about 3 years ago
#414 - common: add better support for huge tensors
Pull Request -
State: closed - Opened by ymjiang about 3 years ago
#413 - packaging: download tarballs when running sdist
Pull Request -
State: closed - Opened by pleasantrabbit about 3 years ago
#412 - server: improve thread safety
Pull Request -
State: closed - Opened by ymjiang about 3 years ago
#411 - Training process occurs nan at the first ten batch.
Issue -
State: open - Opened by powermano over 3 years ago
- 2 comments
#410 - pr 363
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#409 - Update ps lite
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
- 2 comments
#408 - Did BytePS Support multiple NICs now?
Issue -
State: open - Opened by wuyujiji over 3 years ago
- 13 comments
#407 - update doc for core affinity envs
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#406 - update core binding policy
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#405 - docker file for bytescheduler does not work
Issue -
State: closed - Opened by zarzen over 3 years ago
- 7 comments
#404 - Does TensorFlow1x support asycn-training?
Issue -
State: open - Opened by jiahuiyang over 3 years ago
- 2 comments
#403 - subprocess.CalledProcessError returned non-zero exit status 132
Issue -
State: closed - Opened by powermano over 3 years ago
- 2 comments
#402 - TensorFlow 2.4+ compatibility
Pull Request -
State: closed - Opened by oliverhu over 3 years ago
#401 - TensorFlow 2.5 compatibility
Issue -
State: open - Opened by oliverhu over 3 years ago
- 1 comment
#400 - the share memory optimization of RDMA in single machine
Issue -
State: open - Opened by wuyujiji over 3 years ago
- 3 comments
#399 - fix bool env, disable avx512
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#398 - Giving the error munmap_chunk(): invalid pointer in BytePS when DMLC_NUM_WORKER changed from 1 to 2
Issue -
State: open - Opened by udaykiran009 over 3 years ago
- 1 comment
#397 - Distributed training with RDMA errors
Issue -
State: closed - Opened by wuyujiji over 3 years ago
- 16 comments
#396 - Not convergence
Issue -
State: open - Opened by Jon-drugstore over 3 years ago
#395 - gradient compression updates
Pull Request -
State: open - Opened by jasperzhong over 3 years ago
#394 - RDMA_CM_EVENT_ADDR_ERROR raised when running distributed training with PyTorch
Issue -
State: open - Opened by anj-s over 3 years ago
#393 - [Question] Why is byteps compiled in debug mode?
Issue -
State: closed - Opened by showerage over 3 years ago
#392 - Does BytePS support multiple network interface?
Issue -
State: closed - Opened by wuyujiji over 3 years ago
- 4 comments
#391 - Failed to train benchmark on AWS EC2 p3dn.24xlarge instance with RDMA
Issue -
State: open - Opened by YouhuiBai over 3 years ago
- 17 comments
#390 - fix missing import 'warnings'
Pull Request -
State: closed - Opened by VincentLeeMax over 3 years ago
- 1 comment
#389 - fix missing import 'warnings'
Pull Request -
State: closed - Opened by VincentLeeMax over 3 years ago
#388 - How does MXNet implement synchronous training?
Issue -
State: open - Opened by showerage over 3 years ago
- 2 comments
#387 - add SyncBatchNorm
Pull Request -
State: open - Opened by pleasantrabbit over 3 years ago
- 1 comment
#386 - undefined symbol: cudaSetupArgument
Issue -
State: open - Opened by harryhan618 over 3 years ago
#385 - tf: skip bcast if there's only one worker
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#384 - Use BYTEPS_CUDA_HOME instead of /usr/local/cuda
Pull Request -
State: open - Opened by anj-s over 3 years ago
#383 - Unable to install Pytorch plugin when running python setup.py install
Issue -
State: closed - Opened by anj-s over 3 years ago
- 4 comments
#382 - Is model parallelism supported for PyTorch?
Issue -
State: open - Opened by liaopeiyuan over 3 years ago
- 1 comment
#381 - Bytescheduler global barrier in Tensorflow and Pytorch
Issue -
State: open - Opened by offthewall123 over 3 years ago
- 1 comment
#380 - Unable to run training on a single node due to " Check failed: r == ncclSuccess NCCL error: unhandled cuda error"
Issue -
State: closed - Opened by anj-s over 3 years ago
- 4 comments
#379 - example: fix import for python3.8
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#378 - tf: fix case in register gradient
Pull Request -
State: closed - Opened by pleasantrabbit over 3 years ago
#377 - RDMA_CM_EVENT_ADDR_ERROR
Issue -
State: open - Opened by Ruinhuang over 3 years ago
- 2 comments
#376 - import issue in example/pytorch/mnist-distributed.py
Issue -
State: closed - Opened by hengruo over 3 years ago
- 1 comment
#375 - Do byteps running NCCL all-reduce in co-locate mode?
Issue -
State: closed - Opened by Ruinhuang over 3 years ago
#374 - Did byteps using NCCL all-reduce with co-locate mode?
Issue -
State: open - Opened by Ruinhuang over 3 years ago
- 1 comment
#373 - A segmentation fault occurs when compressor is used.
Issue -
State: open - Opened by showerage over 3 years ago
- 3 comments
#372 - RDMA: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
Issue -
State: closed - Opened by Ruinhuang over 3 years ago
- 1 comment
#371 - unsupported van type: 1 Error when launch RDMA
Issue -
State: closed - Opened by Ruinhuang over 3 years ago
- 4 comments
#370 - how to reduce the overhead of bytescheduler?
Issue -
State: closed - Opened by gbxu over 3 years ago
- 7 comments
Labels: bytescheduler
#369 - Check failed: mr happens when RDMA enabled
Issue -
State: open - Opened by yma11 over 3 years ago
- 3 comments
#368 - How byteps find the gpu topology?
Issue -
State: closed - Opened by Ruinhuang over 3 years ago
- 9 comments
#367 - is BytePS already including Bytedance Scheduler? Or we need to use them separately?
Issue -
State: open - Opened by nishantagrawalgit almost 4 years ago
- 7 comments
#366 - add no_sync for DDP
Pull Request -
State: closed - Opened by gongwei-130 almost 4 years ago
#365 - Performance regression with multi-node running
Issue -
State: open - Opened by MichaelHsu170 almost 4 years ago
- 14 comments
#364 - torch.autograd.profiler.profile() keyword argument
Pull Request -
State: closed - Opened by dbonner almost 4 years ago
#363 - broadcast_optimizer_state for pytorch needs to be able to handle NoneType params
Pull Request -
State: closed - Opened by dbonner almost 4 years ago
- 7 comments
#362 - broadcast_optimizer_state in pytorch needs to be able to handle NoneType params
Issue -
State: closed - Opened by dbonner almost 4 years ago
- 1 comment
#361 - It's stuck here
Issue -
State: open - Opened by qingfengmingyue almost 4 years ago
- 1 comment
#360 - 2worker more slow than 1 worker
Issue -
State: open - Opened by qingfengmingyue almost 4 years ago
- 3 comments
#359 - Fix Asynchronous Training Bug
Pull Request -
State: open - Opened by jasperzhong almost 4 years ago
- 2 comments
#358 - torch: fix hang after int tensor push_pull
Pull Request -
State: closed - Opened by pleasantrabbit almost 4 years ago
#357 - Turning on async (BYTEPS_ENABLE_ASYNC) crashes the bps server
Issue -
State: open - Opened by ruipeterpan almost 4 years ago
- 25 comments
#356 - [Question] Does replacing torch.distributed.all_reduce with BytePS impact the training curve?
Issue -
State: closed - Opened by ruipeterpan almost 4 years ago
- 8 comments
Labels: good first issue, bps.torch.ddp
#349 - the question about byteps's timeline
Issue -
State: open - Opened by wuyujiji almost 4 years ago
- 20 comments
#348 - How to run communication scheduling with BytePS
Issue -
State: open - Opened by Rivendile almost 4 years ago
- 12 comments
#321 - Error: OS call failed or operation not supported on this OS
Issue -
State: closed - Opened by wuyifan18 about 4 years ago
- 5 comments
#295 - Check failed: mr ibv_reg_mr failed: Cannot allocate memory
Issue -
State: closed - Opened by ChenYuHo about 4 years ago
- 3 comments
#269 - pull is not overlapped with computation
Issue -
State: closed - Opened by YuejiYang over 4 years ago
- 6 comments
#268 - [do not review] Run server under gdb
Pull Request -
State: closed - Opened by pleasantrabbit over 4 years ago
- 2 comments
#266 - [question] When to Use BYTEPS_REDUCE_ROOTS
Issue -
State: closed - Opened by gaocegege over 4 years ago
- 5 comments
#225 - gradient compression support
Pull Request -
State: closed - Opened by jasperzhong over 4 years ago
- 37 comments