Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / NVIDIA/nccl-tests issues and pull requests

#287 - NCCL all_reduce_perf errors with 5090s

Issue - State: open - Opened by RCS1 1 day ago - 9 comments

#285 - counfused about the calculation of reducescatter algbw

Issue - State: closed - Opened by feixue1024 10 days ago - 2 comments

#284 - nccl-tests with 2 gpu nodes times out

Issue - State: closed - Opened by anoop-agzen 22 days ago - 8 comments

#283 - NCCL Test Multi-node Bus Bandwidth Tuning issue

Issue - State: open - Opened by LXLei about 1 month ago

#282 - Running all_reduce in H200 caused CUDA failure

Issue - State: closed - Opened by joydchh about 1 month ago - 1 comment

#280 - use multi gpu test failure

Issue - State: open - Opened by 1556900941lizerui about 1 month ago - 5 comments

#278 - mpirun all_reduce_perf hang with multi-device test

Issue - State: open - Opened by kubepopeye 2 months ago - 2 comments

#276 - [H200: All_reduce] Random Unhandled Cuda Error

Issue - State: closed - Opened by vitduck 2 months ago - 2 comments

#273 - The special topology causes the NCCL test to fail

Issue - State: open - Opened by zh0ngtian 3 months ago - 7 comments

#272 - [H200] NCCL's All-reduce Performance Exceed 4th Gen NVLINK Spec.

Issue - State: closed - Opened by vitduck 3 months ago - 2 comments

#271 - build: update fallback gencodes

Pull Request - State: closed - Opened by aws-nslick 3 months ago - 3 comments

#268 - P2P performance with nccl-tests vs nvbandwidth

Issue - State: open - Opened by goelayu 3 months ago - 1 comment

#267 - how overall throughout calculate about all2all

Issue - State: open - Opened by ltm920716 3 months ago - 5 comments

#266 - question for NCCL write data size

Issue - State: closed - Opened by gabbychen 3 months ago - 4 comments

#262 - How to get the latency and the package of NCCL

Issue - State: closed - Opened by gabbychen 4 months ago - 4 comments

#261 - Difference between in_place and out_of_place

Issue - State: open - Opened by 17113325 4 months ago - 5 comments

#260 - How to run nccl test in vm without nvswitch passthroughed?

Issue - State: open - Opened by joydchh 4 months ago - 2 comments

#259 - Future-proof ncclstringtotype

Pull Request - State: closed - Opened by kiskra-nvidia 4 months ago

#258 - Why the effective B/W for each NVlink is 20GB/s instead of 25GB/s

Issue - State: closed - Opened by gabbychen 4 months ago - 2 comments

#257 - nccl-tests did not perform as expected

Issue - State: open - Opened by yalbaba 4 months ago - 3 comments

#256 - NCCL topology on the VM of H200

Issue - State: open - Opened by wangjiafu0310 4 months ago - 7 comments

#255 - nccl-tests hangs when using HPCX

Issue - State: closed - Opened by ycm0k 5 months ago - 3 comments

#254 - Multiple MPI ranks using same GPU when conducting multi-node test

Issue - State: closed - Opened by ycm0k 5 months ago - 1 comment

#253 - question about pingpong example

Issue - State: closed - Opened by jinz2014 5 months ago - 5 comments

#252 - Test NCCL failure common with network error.

Issue - State: closed - Opened by ismailguzel 5 months ago - 11 comments

#251 - BW test on V100 4 GPUS is not matched with InfiniBand EDR (Connect-X4)

Issue - State: open - Opened by javak87 5 months ago - 1 comment

#250 - Enable P2P on pcie in a nvlink machine

Issue - State: open - Opened by cll24 6 months ago - 1 comment

#248 - Running in kubernetes pods Error

Issue - State: closed - Opened by drikster80 6 months ago - 2 comments

#246 - NCCL_Algo=Tree

Issue - State: open - Opened by afattaholman 7 months ago - 1 comment

#245 - What does dma_buf do when gpuDirectRdma is disabled ?

Issue - State: open - Opened by Pavani-Panakanti 7 months ago - 1 comment

#244 - Test NCCL Hang

Issue - State: closed - Opened by sdonoso 7 months ago - 2 comments

#243 - Enhance Multi-Node NCCL Testing with Torch C10D Gloo Framework

Pull Request - State: open - Opened by hexinw 7 months ago

#242 - 2 Node Nccl Test don’t work for A100

Issue - State: closed - Opened by jeffreyyjp 7 months ago - 4 comments

#240 - doc: add all2all factor

Pull Request - State: closed - Opened by OrenLeung 7 months ago - 1 comment

#239 - fix: nvls all reduce correction factor

Pull Request - State: open - Opened by OrenLeung 7 months ago - 4 comments

#236 - 2 Node Nccl Test don’t work

Issue - State: open - Opened by SdEnd 7 months ago - 7 comments

#235 - How do we comprehend the factor between algBw and busBw?

Issue - State: open - Opened by lianghao208 7 months ago - 5 comments

#234 - What's multi-allreduce ?

Issue - State: open - Opened by ProHuper 7 months ago - 1 comment

#233 - all_reduce_perf core dumped on 4 L20

Issue - State: closed - Opened by songh11 7 months ago - 23 comments

#231 - Test NCCL failure common.cu:997 'internal error

Issue - State: closed - Opened by sdonoso 8 months ago - 9 comments

#230 - what is cu:990 error? how to solve this problem?

Issue - State: open - Opened by MAKER-park 8 months ago - 5 comments

#229 - 2 Nodes nccl-test with mpi hangs

Issue - State: closed - Opened by sdonoso 8 months ago - 1 comment

#228 - has nvswitch, but uses 0 nvls channels

Issue - State: closed - Opened by MiyazonoKaori 8 months ago - 3 comments

#226 - improve parsing of stepbytes (increment size) argument

Pull Request - State: closed - Opened by StefanoSalsano 8 months ago - 1 comment

#225 - stepbytes (increment size) argument does not support 1M notation

Issue - State: open - Opened by StefanoSalsano 8 months ago - 1 comment

#224 - alltoall_perf: each rank is only sending to half of the other ranks

Issue - State: closed - Opened by russilwvong 8 months ago - 14 comments

#223 - mpirun all_reduce_perf hang with multi-device test

Issue - State: open - Opened by 913871734 8 months ago - 1 comment

#221 - how to support One Device per Process?

Issue - State: closed - Opened by jiangxiaobin96 9 months ago - 4 comments

#220 - 1 GiB headroom might be too small

Issue - State: open - Opened by Namnamseo 9 months ago

#218 - Rank Assignment Issue under four containers on two different servers.

Issue - State: closed - Opened by thsmfe001 9 months ago - 8 comments

#217 - all_reduce_perf hangs; using a single GPU on a 4GPU machine

Issue - State: closed - Opened by isaacgerg 9 months ago - 21 comments

#216 - NCCL initialization hangs with 4 GPUs, but works with 2 GPUs

Issue - State: open - Opened by mickaelseznec 9 months ago - 4 comments

#215 - NCCL_ALGO on multi-node and multi-GPU

Issue - State: open - Opened by MajidSalimi 9 months ago - 3 comments

#214 - SendRecv Time

Issue - State: open - Opened by osayamenja 10 months ago - 6 comments

#213 - Nccl test seems run seperately on multi nodes

Issue - State: closed - Opened by jianh619 10 months ago - 6 comments

#212 - H100 all reduce performance is poor

Issue - State: open - Opened by liminn 10 months ago - 13 comments

#211 - undefined reference nccl*

Issue - State: closed - Opened by gongyguo 10 months ago - 1 comment

#208 - Test NCCL failure common.cu:954 'unhandled cuda error

Issue - State: closed - Opened by YingYellow 10 months ago - 1 comment

#205 - cputime

Issue - State: open - Opened by tks2004 10 months ago