NVIDIA/apex issues and pull requests

#1471 - RuntimeError: Error compiling objects for extension error: subprocess-exited-with-error

Issue - State: open - Opened by SamAct over 2 years ago - 3 comments

#1455 - .cu files should not include torch/extension.h

Pull Request - State: open - Opened by lostmsu over 2 years ago - 4 comments

#1439 - Add BF16 support to FusedMixedPrecisionLamb

Pull Request - State: closed - Opened by nv-joseli over 2 years ago

#1420 - Could not find permutation search CUDA kernels, falling back to CPU path

Issue - State: closed - Opened by te-shi over 2 years ago - 5 comments
Labels: bug

#1415 - NVCC --threads option is hardcoded

Issue - State: open - Opened by wvidana over 2 years ago - 2 comments
Labels: bug

#1408 - how to invoke amp.initialize() and amp.scale_loss() from different module

Issue - State: closed - Opened by kehuanfeng over 2 years ago - 2 comments
Labels: bug

#1400 - [transformer] Port Sequence Parallelism (takeover of #1396)

Pull Request - State: closed - Opened by crcrpar over 2 years ago - 1 comment

#1394 - FusedDenseGeluDense output NAN

Issue - State: open - Opened by gongjingcs over 2 years ago - 2 comments
Labels: bug

#1326 - Installation Error

Issue - State: closed - Opened by GMN23362 almost 3 years ago - 2 comments

#1314 - `fused_weight_gradient_mlp_cuda` module not found. gradient accumulation fusion with weight gradient computation disabled.

Issue - State: open - Opened by adore1979 almost 3 years ago - 25 comments

#1293 - The following error occurred while installing apex

Issue - State: closed - Opened by xxw11 almost 3 years ago - 2 comments

#1282 - Handle len(cached_x.grad_fn.next_functions) == 1 in cached_cast

Pull Request - State: open - Opened by jiafatom almost 3 years ago - 8 comments

#1230 - Using apex leads to a `CUDA out of memory` on an A100

Issue - State: closed - Opened by StrangeTcy about 3 years ago - 2 comments

#1229 - [FMHA] add support for later CUDA (8.x)

Pull Request - State: closed - Opened by jqueguiner about 3 years ago - 4 comments

#1227 - I am a Research Institute of Microsoft Research Institute. When I used apex in mmdection software, the following error occurred, We look forward to your answer. Thank you very much

Issue - State: closed - Opened by xianglei3 about 3 years ago - 3 comments

#1204 - pipeline_parallel - ModuleNotFoundError: No module named 'amp_C'

Issue - State: open - Opened by MatthieuCed about 3 years ago - 20 comments

#1193 - RuntimeError: apex.optimizers.FusedAdam requires cuda extensions

Issue - State: open - Opened by life97 over 3 years ago - 18 comments

#1178 - BFloat16 support in multi_tensor_*

Issue - State: closed - Opened by zhengwy888 over 3 years ago - 2 comments

#1175 - no_sync equivalent used for gradient accumulation

Issue - State: open - Opened by amsword over 3 years ago - 2 comments

#1141 - install apex error, flatten_unflatten.obj cannot open

Issue - State: open - Opened by MrBook2019 over 3 years ago - 6 comments

#1089 - Failed to install apex on CUDA 10.1 torch 1.6.0

Issue - State: closed - Opened by Ema1997 over 3 years ago - 2 comments

#1072 - FastLayerNorm ext not found after install on master

Issue - State: closed - Opened by sshleifer almost 4 years ago - 3 comments

#990 - TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'

Issue - State: open - Opened by KrisWongz about 4 years ago - 14 comments

#965 - RuntimeError: expected scalar type Float but found Half

Issue - State: open - Opened by superlwx over 4 years ago - 7 comments

#961 - Error occurs when building 'apex_C' extension: no such file -> 'flatten_unflatten.o'

Issue - State: closed - Opened by selous123 over 4 years ago - 3 comments

#957 - fatal error: cublas_v2.h: No such file or directory

Issue - State: open - Opened by shizhediao over 4 years ago - 6 comments

#955 - could not install with "pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./"

Issue - State: open - Opened by songtaoshi over 4 years ago - 6 comments

#954 - ModuleNotFoundError: No module named 'fused_layer_norm_cuda'

Issue - State: closed - Opened by ajesujoba over 4 years ago - 3 comments

#874 - Anaconda fail to build with "--cpp_ext" and "--cuda_ext" options

Issue - State: open - Opened by BurguerJohn over 4 years ago - 2 comments

#865 - distributed lamb breaks python-only amp

Issue - State: closed - Opened by lisadunlap over 4 years ago - 10 comments

#855 - LAMB and gradient clipping (instructions vs api)

Issue - State: open - Opened by ggaemo over 4 years ago - 2 comments

#832 - Expected tensor for argument #1 'input' to have the same type as tensor for argument #2 'rois'; but type torch.cuda.HalfTensor does not equal torch.cuda.FloatTensor

Issue - State: open - Opened by sarmientoj24 over 4 years ago - 7 comments

#810 - super slow to build Apex from source in docker

Issue - State: open - Opened by alexucb over 4 years ago - 1 comment

#802 - Build error (error: expected primary-expression before 'some' token)

Issue - State: open - Opened by kkjh0723 over 4 years ago - 24 comments

#777 - " ZeroDivisionError: float division by zero" in scaler.py

Issue - State: closed - Opened by qmpzzpmq almost 5 years ago - 2 comments

#774 - Grad norm cut in half every 2000 steps?

Issue - State: closed - Opened by PCerles almost 5 years ago

#769 - cupy.cuda.driver.CUDADriverError: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Issue - State: open - Opened by MittalShruti almost 5 years ago - 3 comments

#715 - problems with fp16 on multi-gpu training

Issue - State: closed - Opened by ssp573 almost 5 years ago - 1 comment

#702 - Update pyprof for nsight

Pull Request - State: closed - Opened by ghost almost 5 years ago - 3 comments

#698 - Avoid exception when initializing FusedNovoGrad with amp

Pull Request - State: closed - Opened by henrymai almost 5 years ago

#694 - Multiple independent models, only one requires apex.amp, crash in non-amp CPU model

Issue - State: open - Opened by lopuhin almost 5 years ago - 13 comments

#635 - Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to

Issue - State: open - Opened by zsun1029 about 5 years ago - 11 comments

#621 - ImportError: cannot import name 'amp'

Issue - State: open - Opened by vr25 about 5 years ago - 13 comments

#573 - Original ImportError was: ModuleNotFoundError("No module named 'amp_C')

Issue - State: closed - Opened by misslibra about 5 years ago - 9 comments

#550 - cuda extensions are being compiled with a version of Cuda that does not match the version used to compile Pytorch binaries.

Issue - State: closed - Opened by antgr about 5 years ago - 15 comments

#548 - Problem installation

Issue - State: open - Opened by emsansone over 5 years ago - 7 comments

#547 - Module 'torch.nn' has no attribute 'backends'

Issue - State: closed - Opened by YuryBolkonsky over 5 years ago - 8 comments

#533 - Not able to observe any speedup on a Nvidia T4 (Turing arch)

Issue - State: open - Opened by aditya1709 over 5 years ago - 4 comments

#519 - RuntimeError: main thread is not in main loop

Issue - State: open - Opened by H-YunHui over 5 years ago - 3 comments

#497 - Installation Error.

Issue - State: open - Opened by chunyuanY over 5 years ago - 2 comments

#466 - remove deprecated backend.FunctionBackend calls

Pull Request - State: closed - Opened by ptrblck over 5 years ago - 2 comments

#465 - AttributeError: 'DistributedDataParallel' object has no attribute 'buckets_ready_size'

Issue - State: open - Opened by makslevental over 5 years ago - 3 comments

#464 - Keep certain modules as FP32

Issue - State: closed - Opened by yaysummeriscoming over 5 years ago - 3 comments

#393 - I try the example when init init_process_group got an error

Issue - State: closed - Opened by PistonY over 5 years ago - 15 comments

#370 - undefined symbol: __ZN2at19UndefinedTensorImpl10_singletonE

Issue - State: closed - Opened by rmrao over 5 years ago - 4 comments

#368 - FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda:/usr/local/cuda-10.1/bin/nvcc': ':/usr/local/cuda:/usr/local/cuda-10.1/bin/nvcc'

Issue - State: open - Opened by allianceai over 5 years ago - 21 comments

#323 - Hard error on mismatch between torch.version.cuda and + the Cuda toolkit version being used to compile Apex

Pull Request - State: closed - Opened by mcarilli over 5 years ago - 24 comments

#318 - How to handle gradient overflow when training a deep model with mixed precision?

Issue - State: open - Opened by tfwu over 5 years ago - 29 comments

#187 - bugs after apex installation

Issue - State: open - Opened by yinwenpeng almost 6 years ago - 7 comments
Labels: extension build

#161 - No module named 'fused_layer_norm_cuda'

Issue - State: closed - Opened by alvin-leong almost 6 years ago - 23 comments

#116 - TypeError: Class advice impossible in Python3

Issue - State: closed - Opened by lynnna-xu about 6 years ago - 15 comments

#107 - AttributeError: 'DistributedDataParallel' object has no attribute 'callback_queued'

Issue - State: closed - Opened by LightToYang about 6 years ago - 7 comments

#99 - ValueError: Error initializing torch.distributed using env:// rendezvous: environment variable WORLD_SIZE expected, but not set

Issue - State: closed - Opened by yuanfuqiang456 about 6 years ago - 5 comments

#86 - Warning: apex was installed without --cuda_ext.

Issue - State: closed - Opened by amuier about 6 years ago - 35 comments
