Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / NVIDIA/nccl-tests issues and pull requests
#287 - NCCL all_reduce_perf errors with 5090s
Issue -
State: open - Opened by RCS1 1 day ago
- 9 comments
#286 - Why is only part of the NICs used during allreduce, and why are inter-node connections not on the same rail even when I explicitly specify using all NICs?
Issue -
State: closed - Opened by FortPercent 10 days ago
#285 - counfused about the calculation of reducescatter algbw
Issue -
State: closed - Opened by feixue1024 10 days ago
- 2 comments
#284 - nccl-tests with 2 gpu nodes times out
Issue -
State: closed - Opened by anoop-agzen 22 days ago
- 8 comments
#283 - NCCL Test Multi-node Bus Bandwidth Tuning issue
Issue -
State: open - Opened by LXLei about 1 month ago
#282 - Running all_reduce in H200 caused CUDA failure
Issue -
State: closed - Opened by joydchh about 1 month ago
- 1 comment
#281 - nccl-tests cannot perform multi-machine interconnection through RDMA in the docker container.
Issue -
State: open - Opened by Eevan-zq about 1 month ago
- 6 comments
#280 - use multi gpu test failure
Issue -
State: open - Opened by 1556900941lizerui about 1 month ago
- 5 comments
#279 - Is it suitable to calculate busBw in AllReduceGetBw for all of the algos?
Issue -
State: open - Opened by shanleo2024 about 1 month ago
#278 - mpirun all_reduce_perf hang with multi-device test
Issue -
State: open - Opened by kubepopeye 2 months ago
- 2 comments
#277 - Why are eight network interface cards (NICs) used instead of four in a two-node, 16-GPU test setup with A100 GPUs?
Issue -
State: open - Opened by lmhahatest 2 months ago
- 2 comments
#276 - [H200: All_reduce] Random Unhandled Cuda Error
Issue -
State: closed - Opened by vitduck 2 months ago
- 2 comments
#275 - how can I solve the problem:Test NCCL failure common.cu:1012 'internal error - please report this issue to the NCCL developers / ' .. dell03 pid 543317: Test failure common.cu:891
Issue -
State: closed - Opened by mstJuly 2 months ago
- 2 comments
#274 - What's the difference in allreduce for factor between the NCCL and NCCL-test?
Issue -
State: open - Opened by networkResearcher 2 months ago
- 2 comments
#273 - The special topology causes the NCCL test to fail
Issue -
State: open - Opened by zh0ngtian 3 months ago
- 7 comments
#272 - [H200] NCCL's All-reduce Performance Exceed 4th Gen NVLINK Spec.
Issue -
State: closed - Opened by vitduck 3 months ago
- 2 comments
#271 - build: update fallback gencodes
Pull Request -
State: closed - Opened by aws-nslick 3 months ago
- 3 comments
#270 - Why is the NCCL test rate of my 4090x8 card so low?
Issue -
State: open - Opened by gogoman010310 3 months ago
#269 - Why is the NCCL test rate of my 4090x8 card so low?
Issue -
State: closed - Opened by gogoman010310 3 months ago
#268 - P2P performance with nccl-tests vs nvbandwidth
Issue -
State: open - Opened by goelayu 3 months ago
- 1 comment
#267 - how overall throughout calculate about all2all
Issue -
State: open - Opened by ltm920716 3 months ago
- 5 comments
#266 - question for NCCL write data size
Issue -
State: closed - Opened by gabbychen 3 months ago
- 4 comments
#265 - nccl-tests result is only a half of the bandwidth on bond nics
Issue -
State: closed - Opened by 913871734 3 months ago
#264 - Test CUDA failure common.cu:941 'system not yet initialized'
Issue -
State: open - Opened by vijayaramaraju-kalidindi 4 months ago
- 6 comments
#263 - Test CUDA failure common.cu:941 'invalid device ordinal' when test two nodes with nvhpc
Issue -
State: open - Opened by heya5 4 months ago
- 3 comments
#262 - How to get the latency and the package of NCCL
Issue -
State: closed - Opened by gabbychen 4 months ago
- 4 comments
#261 - Difference between in_place and out_of_place
Issue -
State: open - Opened by 17113325 4 months ago
- 5 comments
#260 - How to run nccl test in vm without nvswitch passthroughed?
Issue -
State: open - Opened by joydchh 4 months ago
- 2 comments
#259 - Future-proof ncclstringtotype
Pull Request -
State: closed - Opened by kiskra-nvidia 4 months ago
#258 - Why the effective B/W for each NVlink is 20GB/s instead of 25GB/s
Issue -
State: closed - Opened by gabbychen 4 months ago
- 2 comments
#257 - nccl-tests did not perform as expected
Issue -
State: open - Opened by yalbaba 4 months ago
- 3 comments
#256 - NCCL topology on the VM of H200
Issue -
State: open - Opened by wangjiafu0310 4 months ago
- 7 comments
#255 - nccl-tests hangs when using HPCX
Issue -
State: closed - Opened by ycm0k 5 months ago
- 3 comments
#254 - Multiple MPI ranks using same GPU when conducting multi-node test
Issue -
State: closed - Opened by ycm0k 5 months ago
- 1 comment
#253 - question about pingpong example
Issue -
State: closed - Opened by jinz2014 5 months ago
- 5 comments
#252 - Test NCCL failure common with network error.
Issue -
State: closed - Opened by ismailguzel 5 months ago
- 11 comments
#251 - BW test on V100 4 GPUS is not matched with InfiniBand EDR (Connect-X4)
Issue -
State: open - Opened by javak87 5 months ago
- 1 comment
#250 - Enable P2P on pcie in a nvlink machine
Issue -
State: open - Opened by cll24 6 months ago
- 1 comment
#249 - Getting Avg bus bandwidth = 0 when running all_reduce_perf in nccl-tests in my EC2 G5.8x large
Issue -
State: closed - Opened by rajeshvenkata 6 months ago
- 2 comments
#248 - Running in kubernetes pods Error
Issue -
State: closed - Opened by drikster80 6 months ago
- 2 comments
#247 - NCCL all-reduce test failure due to TL_SHM ERROR, This case was happened on containers on same server.
Issue -
State: closed - Opened by thsmfe001 6 months ago
- 2 comments
#246 - NCCL_Algo=Tree
Issue -
State: open - Opened by afattaholman 7 months ago
- 1 comment
#245 - What does dma_buf do when gpuDirectRdma is disabled ?
Issue -
State: open - Opened by Pavani-Panakanti 7 months ago
- 1 comment
#244 - Test NCCL Hang
Issue -
State: closed - Opened by sdonoso 7 months ago
- 2 comments
#243 - Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework
Pull Request -
State: open - Opened by hexinw 7 months ago
#242 - 2 Node Nccl Test don’t work for A100
Issue -
State: closed - Opened by jeffreyyjp 7 months ago
- 4 comments
#241 - AllReduce Bus Bandwidth decreases with larger network latency
Issue -
State: open - Opened by chenzhu99 7 months ago
#240 - doc: add all2all factor
Pull Request -
State: closed - Opened by OrenLeung 7 months ago
- 1 comment
#239 - fix: nvls all reduce correction factor
Pull Request -
State: open - Opened by OrenLeung 7 months ago
- 4 comments
#238 - all_reduce algo factor for NVLink SHARP In network reductions
Issue -
State: open - Opened by OrenLeung 7 months ago
#237 - how to calculate the tree based allreduce ib bw?
Issue -
State: open - Opened by echobinarybytes 7 months ago
#236 - 2 Node Nccl Test don’t work
Issue -
State: open - Opened by SdEnd 7 months ago
- 7 comments
#235 - How do we comprehend the factor between algBw and busBw?
Issue -
State: open - Opened by lianghao208 7 months ago
- 5 comments
#234 - What's multi-allreduce ?
Issue -
State: open - Opened by ProHuper 7 months ago
- 1 comment
#233 - all_reduce_perf core dumped on 4 L20
Issue -
State: closed - Opened by songh11 7 months ago
- 23 comments
#232 - NCCL Tree allreduce test cannot reach the theoretical bus bandwidth on 2 nodes with 4 nics
Issue -
State: closed - Opened by ProHuper 7 months ago
#231 - Test NCCL failure common.cu:997 'internal error
Issue -
State: closed - Opened by sdonoso 8 months ago
- 9 comments
#230 - what is cu:990 error? how to solve this problem?
Issue -
State: open - Opened by MAKER-park 8 months ago
- 5 comments
#229 - 2 Nodes nccl-test with mpi hangs
Issue -
State: closed - Opened by sdonoso 8 months ago
- 1 comment
#228 - has nvswitch, but uses 0 nvls channels
Issue -
State: closed - Opened by MiyazonoKaori 8 months ago
- 3 comments
#227 - Test fail caused by ibvwrap.c:160 NCCL WARN Call to ibv_modify_qp failed with error Connection timed out.
Issue -
State: closed - Opened by thsmfe001 8 months ago
- 2 comments
#226 - improve parsing of stepbytes (increment size) argument
Pull Request -
State: closed - Opened by StefanoSalsano 8 months ago
- 1 comment
#225 - stepbytes (increment size) argument does not support 1M notation
Issue -
State: open - Opened by StefanoSalsano 8 months ago
- 1 comment
#224 - alltoall_perf: each rank is only sending to half of the other ranks
Issue -
State: closed - Opened by russilwvong 8 months ago
- 14 comments
#223 - mpirun all_reduce_perf hang with multi-device test
Issue -
State: open - Opened by 913871734 8 months ago
- 1 comment
#222 - NCCL WARN Cannot use cuda/gdr transports as part of specified UCX_TLS
Issue -
State: open - Opened by liuxingbo12138 9 months ago
- 5 comments
#221 - how to support One Device per Process?
Issue -
State: closed - Opened by jiangxiaobin96 9 months ago
- 4 comments
#220 - 1 GiB headroom might be too small
Issue -
State: open - Opened by Namnamseo 9 months ago
#219 - Test NCCL failure common.cu:959 'internal error - please report this issue to the NCCL developers / '
Issue -
State: open - Opened by Assassin187 9 months ago
- 9 comments
#218 - Rank Assignment Issue under four containers on two different servers.
Issue -
State: closed - Opened by thsmfe001 9 months ago
- 8 comments
#217 - all_reduce_perf hangs; using a single GPU on a 4GPU machine
Issue -
State: closed - Opened by isaacgerg 9 months ago
- 21 comments
#216 - NCCL initialization hangs with 4 GPUs, but works with 2 GPUs
Issue -
State: open - Opened by mickaelseznec 9 months ago
- 4 comments
#215 - NCCL_ALGO on multi-node and multi-GPU
Issue -
State: open - Opened by MajidSalimi 9 months ago
- 3 comments
#214 - SendRecv Time
Issue -
State: open - Opened by osayamenja 10 months ago
- 6 comments
#213 - Nccl test seems run seperately on multi nodes
Issue -
State: closed - Opened by jianh619 10 months ago
- 6 comments
#212 - H100 all reduce performance is poor
Issue -
State: open - Opened by liminn 10 months ago
- 13 comments
#211 - undefined reference nccl*
Issue -
State: closed - Opened by gongyguo 10 months ago
- 1 comment
#210 - Differences problems in performance data of HGX A800 single server N GPUs nccl testing
Issue -
State: open - Opened by cloveryyg 10 months ago
#209 - The network bandwidth in the alltoall_perf test failed to meet expectations
Issue -
State: open - Opened by fj1425fj 10 months ago
- 4 comments
#208 - Test NCCL failure common.cu:954 'unhandled cuda error
Issue -
State: closed - Opened by YingYellow 10 months ago
- 1 comment
#207 - make failed, error -- unsupported GNU version! gcc versions later than 11 are not supported!
Issue -
State: closed - Opened by jxh314 10 months ago
#206 - misc/ibvwrap.cc:278 NCCL WARN Call to ibv_reg_mr_iova2 failed with error Cannot allocate memory
Issue -
State: closed - Opened by jxh314 10 months ago
- 2 comments
#205 - cputime
Issue -
State: open - Opened by tks2004 10 months ago