bytedance/byteps issues and pull requests

#447 - install failed

Issue - State: open - Opened by themoonstone 7 months ago

#446 - Supported CUDA and PyTorch versions

Issue - State: open - Opened by themoonstone 7 months ago

#445 - support pytorch 2.1.x

Pull Request - State: closed - Opened by rainj-me about 1 year ago

#444 - Is there any benchmark comparison with Megatron-LM ?

Issue - State: open - Opened by sequoiar over 1 year ago

#443 - segmentation fault while launching the worker

Issue - State: open - Opened by xuexiaxie over 1 year ago - 1 comment

#442 - How does the tensorflow scheduler plugin used in the tf_benchmark_cnn.py

Issue - State: open - Opened by sxqqslf over 1 year ago - 1 comment

#441 - Mistakes of Workload calculation

Issue - State: open - Opened by fly-dragon211 over 1 year ago - 5 comments

#440 - Installation problem

Issue - State: open - Opened by QingQingR about 2 years ago

#439 - Supported environment

Issue - State: closed - Opened by QingQingR about 2 years ago

#438 - broadcast and is_initialized api are not supported with pytorch.

Issue - State: open - Opened by HangJie720 over 2 years ago

#437 - support for fault tolerance and straggler mitigation

Issue - State: open - Opened by youshaox over 2 years ago

#436 - Communication failure in MXNet with BytePS

Issue - State: closed - Opened by qingyangDuan over 2 years ago - 3 comments

#435 - 3rdparty: update pslite to fix shm name

Pull Request - State: closed - Opened by ymjiang over 2 years ago

#434 - update shm naming scheme

Pull Request - State: open - Opened by pleasantrabbit over 2 years ago

#433 - Installation error

Issue - State: open - Opened by llplay over 2 years ago - 1 comment

#432 - torch: update ddp

Pull Request - State: open - Opened by pleasantrabbit over 2 years ago

#431 - Release BytePS docker image support for TF2

Issue - State: open - Opened by shaowei-su over 2 years ago

#430 - Running multiple workers on a single GPU machine

Issue - State: open - Opened by hamidralmasi almost 3 years ago

#429 - launcher: join workers as they exit

Pull Request - State: closed - Opened by pleasantrabbit almost 3 years ago

#428 - Successfully installed BytePS but cannot import byteps.torch or byteps.tensorflow

Issue - State: closed - Opened by hamidralmasi almost 3 years ago - 2 comments

#427 - benchmark with cross barrier error

Issue - State: open - Opened by panpanli521 almost 3 years ago

#426 - Is there a plan to support CPU-only? Our workers also run on CPU machines

Issue - State: open - Opened by starkeisntein almost 3 years ago - 2 comments

#425 - When will sparse models be supported?

Issue - State: open - Opened by starkeisntein almost 3 years ago

#424 - ps-lite: disable ucx error handling by default

Pull Request - State: closed - Opened by pleasantrabbit almost 3 years ago

#423 - ps-lite: update ps-lite

Pull Request - State: closed - Opened by pleasantrabbit almost 3 years ago

#422 - Is it right to do allreduce immediately for non-zero ranks in bytescheduler?

Issue - State: closed - Opened by sywang0111 almost 3 years ago - 2 comments

#421 - server: exit log improvement

Pull Request - State: closed - Opened by ymjiang almost 3 years ago

#420 - torch: fix compression when using apex.amp

Pull Request - State: closed - Opened by pleasantrabbit almost 3 years ago

#419 - Stuck in the bps.init().

Issue - State: closed - Opened by Fangjin98 almost 3 years ago - 7 comments

#418 - The byteps in K8S Pod doesn't have DMLC_WORKER_ID configured.

Issue - State: open - Opened by jackjinj almost 3 years ago

#417 - How to use gradient accumulate in BytePS torch DDP?

Issue - State: open - Opened by wuyujiji about 3 years ago - 5 comments
Labels: enhancement

#416 - tensorflow: fix bug in broadcast_variables

Pull Request - State: closed - Opened by pleasantrabbit about 3 years ago

#415 - build: update ucx tarball download logic

Pull Request - State: closed - Opened by pleasantrabbit about 3 years ago

#414 - common: add better support for huge tensors

Pull Request - State: closed - Opened by ymjiang about 3 years ago

#413 - packaging: download tarballs when running sdist

Pull Request - State: closed - Opened by pleasantrabbit about 3 years ago

#412 - server: improve thread safety

Pull Request - State: closed - Opened by ymjiang about 3 years ago

#411 - Training process occurs nan at the first ten batch.

Issue - State: open - Opened by powermano over 3 years ago - 2 comments

#410 - pr 363

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#409 - Update ps lite

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago - 2 comments

#408 - Did BytePS Support multiple NICs now?

Issue - State: open - Opened by wuyujiji over 3 years ago - 13 comments

#407 - update doc for core affinity envs

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#406 - update core binding policy

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#405 - docker file for bytescheduler does not work

Issue - State: closed - Opened by zarzen over 3 years ago - 7 comments

#404 - Does TensorFlow1x support async-training?

Issue - State: open - Opened by jiahuiyang over 3 years ago - 2 comments

#403 - subprocess.CalledProcessError returned non-zero exit status 132

Issue - State: closed - Opened by powermano over 3 years ago - 2 comments

#402 - TensorFlow 2.4+ compatibility

Pull Request - State: closed - Opened by oliverhu over 3 years ago

#401 - TensorFlow 2.5 compatibility

Issue - State: open - Opened by oliverhu over 3 years ago - 1 comment

#400 - the share memory optimization of RDMA in single machine

Issue - State: open - Opened by wuyujiji over 3 years ago - 3 comments

#399 - fix bool env, disable avx512

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#398 - Giving the error munmap_chunk(): invalid pointer in BytePS when DMLC_NUM_WORKER changed from 1 to 2

Issue - State: open - Opened by udaykiran009 over 3 years ago - 1 comment

#397 - Distributed training with RDMA errors

Issue - State: closed - Opened by wuyujiji over 3 years ago - 16 comments

#396 - Not convergence

Issue - State: open - Opened by Jon-drugstore over 3 years ago

#395 - gradient compression updates

Pull Request - State: open - Opened by jasperzhong over 3 years ago

#394 - RDMA_CM_EVENT_ADDR_ERROR raised when running distributed training with PyTorch

Issue - State: open - Opened by anj-s over 3 years ago

#393 - [Question] Why is byteps compiled in debug mode?

Issue - State: closed - Opened by showerage over 3 years ago

#392 - Does BytePS support multiple network interface?

Issue - State: closed - Opened by wuyujiji over 3 years ago - 4 comments

#391 - Failed to train benchmark on AWS EC2 p3dn.24xlarge instance with RDMA

Issue - State: open - Opened by YouhuiBai over 3 years ago - 17 comments

#390 - fix missing import 'warnings'

Pull Request - State: closed - Opened by VincentLeeMax over 3 years ago - 1 comment

#389 - fix missing import 'warnings'

Pull Request - State: closed - Opened by VincentLeeMax over 3 years ago

#388 - How does MXNet implement synchronous training?

Issue - State: open - Opened by showerage over 3 years ago - 2 comments

#387 - add SyncBatchNorm

Pull Request - State: open - Opened by pleasantrabbit over 3 years ago - 1 comment

#386 - undefined symbol: cudaSetupArgument

Issue - State: open - Opened by harryhan618 over 3 years ago

#385 - tf: skip bcast if there's only one worker

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#384 - Use BYTEPS_CUDA_HOME instead of /usr/local/cuda

Pull Request - State: open - Opened by anj-s over 3 years ago

#383 - Unable to install Pytorch plugin when running python setup.py install

Issue - State: closed - Opened by anj-s over 3 years ago - 4 comments

#382 - Is model parallelism supported for PyTorch?

Issue - State: open - Opened by liaopeiyuan over 3 years ago - 1 comment

#381 - Bytescheduler global barrier in Tensorflow and Pytorch

Issue - State: open - Opened by offthewall123 over 3 years ago - 1 comment

#380 - Unable to run training on a single node due to " Check failed: r == ncclSuccess NCCL error: unhandled cuda error"

Issue - State: closed - Opened by anj-s over 3 years ago - 4 comments

#379 - example: fix import for python3.8

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#378 - tf: fix case in register gradient

Pull Request - State: closed - Opened by pleasantrabbit over 3 years ago

#377 - RDMA_CM_EVENT_ADDR_ERROR

Issue - State: open - Opened by Ruinhuang over 3 years ago - 2 comments

#376 - import issue in example/pytorch/mnist-distributed.py

Issue - State: closed - Opened by hengruo over 3 years ago - 1 comment

#375 - Do byteps running NCCL all-reduce in co-locate mode?

Issue - State: closed - Opened by Ruinhuang over 3 years ago

#374 - Did byteps using NCCL all-reduce with co-locate mode?

Issue - State: open - Opened by Ruinhuang over 3 years ago - 1 comment

#373 - A segmentation fault occurs when compressor is used.

Issue - State: open - Opened by showerage over 3 years ago - 3 comments

#372 - RDMA: Check failed: mr ibv_reg_mr failed: Cannot allocate memory

Issue - State: closed - Opened by Ruinhuang over 3 years ago - 1 comment

#371 - unsupported van type: 1 Error when launch RDMA

Issue - State: closed - Opened by Ruinhuang over 3 years ago - 4 comments

#370 - how to reduce the overhead of bytescheduler?

Issue - State: closed - Opened by gbxu over 3 years ago - 7 comments
Labels: bytescheduler

#369 - Check failed: mr happens when RDMA enabled

Issue - State: open - Opened by yma11 over 3 years ago - 3 comments

#368 - How byteps find the gpu topology?

Issue - State: closed - Opened by Ruinhuang over 3 years ago - 9 comments

#367 - is BytePS already including Bytedance Scheduler? Or we need to use them separately?

Issue - State: open - Opened by nishantagrawalgit almost 4 years ago - 7 comments

#366 - add no_sync for DDP

Pull Request - State: closed - Opened by gongwei-130 almost 4 years ago

#365 - Performance regression with multi-node running

Issue - State: open - Opened by MichaelHsu170 almost 4 years ago - 14 comments

#364 - torch.autograd.profiler.profile() keyword argument

Pull Request - State: closed - Opened by dbonner almost 4 years ago

#363 - broadcast_optimizer_state for pytorch needs to be able to handle NoneType params

Pull Request - State: closed - Opened by dbonner almost 4 years ago - 7 comments

#362 - broadcast_optimizer_state in pytorch needs to be able to handle NoneType params

Issue - State: closed - Opened by dbonner almost 4 years ago - 1 comment

#361 - It's stuck here

Issue - State: open - Opened by qingfengmingyue almost 4 years ago - 1 comment

#360 - 2worker more slow than 1 worker

Issue - State: open - Opened by qingfengmingyue almost 4 years ago - 3 comments

#359 - Fix Asynchronous Training Bug

Pull Request - State: open - Opened by jasperzhong almost 4 years ago - 2 comments

#358 - torch: fix hang after int tensor push_pull

Pull Request - State: closed - Opened by pleasantrabbit almost 4 years ago

#357 - Turning on async (BYTEPS_ENABLE_ASYNC) crashes the bps server

Issue - State: open - Opened by ruipeterpan almost 4 years ago - 25 comments

#356 - [Question] Does replacing torch.distributed.all_reduce with BytePS impact the training curve?

Issue - State: closed - Opened by ruipeterpan almost 4 years ago - 8 comments
Labels: good first issue, bps.torch.ddp

#349 - the question about byteps's timeline

Issue - State: open - Opened by wuyujiji almost 4 years ago - 20 comments

#348 - How to run communication scheduling with BytePS

Issue - State: open - Opened by Rivendile almost 4 years ago - 12 comments

#321 - Error: OS call failed or operation not supported on this OS

Issue - State: closed - Opened by wuyifan18 about 4 years ago - 5 comments

#295 - Check failed: mr ibv_reg_mr failed: Cannot allocate memory

Issue - State: closed - Opened by ChenYuHo about 4 years ago - 3 comments

#269 - pull is not overlapped with computation

Issue - State: closed - Opened by YuejiYang over 4 years ago - 6 comments

#268 - [do not review] Run server under gdb

Pull Request - State: closed - Opened by pleasantrabbit over 4 years ago - 2 comments

#266 - [question] When to Use BYTEPS_REDUCE_ROOTS

Issue - State: closed - Opened by gaocegege over 4 years ago - 5 comments

#225 - gradient compression support

Pull Request - State: closed - Opened by jasperzhong over 4 years ago - 37 comments
