laekov/fastmoe issues and pull requests

#151 - Revert "convert input to same type as weight for mixed precision training"

Pull Request - State: closed - Opened by laekov over 1 year ago

#150 - convert input to same type as weight for mixed precision training

Pull Request - State: closed - Opened by santurini over 1 year ago - 3 comments

#149 - Training on Single GPU gives NCCL error

Issue - State: closed - Opened by santurini over 1 year ago - 1 comment

#148 - MoE DDP + Expert Parallelism

Issue - State: closed - Opened by santurini over 1 year ago - 6 comments

#147 - TypeError: linear_forward(): incompatible function arguments

Issue - State: closed - Opened by kamanphoebe over 1 year ago - 4 comments

#146 - make FasterMoE more general

Pull Request - State: closed - Opened by zms1999 almost 2 years ago

#145 - FastMoE with Megatron-LM v2.5

Pull Request - State: closed - Opened by zms1999 almost 2 years ago

#144 - remove synchronize

Pull Request - State: closed - Opened by Fragile-azalea almost 2 years ago

#143 - Is it necessary to use the synchronize operation after the allreduce operation here?

Issue - State: closed - Opened by Fragile-azalea almost 2 years ago - 1 comment

#142 - Compatibility to older cuda and torch 1.13

Pull Request - State: closed - Opened by laekov about 2 years ago

#141 - Diverge gshard gate

Pull Request - State: closed - Opened by laekov about 2 years ago

#140 - smart_schedule.h bug fixed

Pull Request - State: closed - Opened by lawrence-cj about 2 years ago - 1 comment

#139 - Does FastMoe have a plan to support pipeline parallel with Megatron?

Issue - State: closed - Opened by LitLeo about 2 years ago - 2 comments

#138 - fix bug: add proper comm group

Pull Request - State: closed - Opened by zms1999 about 2 years ago

#137 - More GPU number than expert number

Issue - State: closed - Opened by hanxuel about 2 years ago - 5 comments

#136 - Update version requirement in the documents

Pull Request - State: closed - Opened by laekov over 2 years ago

#135 - Document for examples

Pull Request - State: closed - Opened by laekov over 2 years ago

#134 - CUBLAS_STATUS_ARCH_MISMATCH

Issue - State: closed - Opened by Irenehere over 2 years ago - 2 comments

#133 - 'Namespace' object has no attribute 'balance_strategy'

Issue - State: closed - Opened by Irenehere over 2 years ago - 2 comments

#132 - fastmoe-master/build/temp.linux-x86_64-3.8/cuda/global_exchange.o: No such file or directory

Issue - State: closed - Opened by Irenehere over 2 years ago - 9 comments

#131 - During inference, I need to run forward on CPU, so FMOE does not support CPU inference now?

Issue - State: closed - Opened by snsun over 2 years ago - 2 comments

#130 - Fix GshardGate top1_idx

Pull Request - State: closed - Opened by Fragile-azalea over 2 years ago

#129 - The top_k in Gshard seems to be one.

Issue - State: closed - Opened by Fragile-azalea over 2 years ago - 3 comments

#128 - About balance loss

Issue - State: closed - Opened by LoganLiu66 over 2 years ago - 3 comments

#127 - update readme: enable NCCL by default

Pull Request - State: closed - Opened by heheda12345 over 2 years ago

#126 - NCCL Error at /home/xxx/fastmoe/cuda/global_exchange.cpp:127 value 5

Issue - State: closed - Opened by Fragile-azalea over 2 years ago - 2 comments

#125 - Performance difference when replacing FFN with FMoETransformerMLP in transformer

Issue - State: closed - Opened by LoganLiu66 over 2 years ago - 1 comment

#124 - module 'fmoe_cuda' has no attribute 'ensure_nccl'

Issue - State: closed - Opened by Fangbo0506 over 2 years ago - 4 comments

#123 - Fix nccl uid bcast for torch v1.12.0

Pull Request - State: closed - Opened by laekov over 2 years ago

#122 - Ninja Build Stopped Subcommand Failed

Issue - State: closed - Opened by QiyaoWei over 2 years ago - 2 comments

#121 - How to use Convolution operator as the expert?

Issue - State: closed - Opened by hobbitlzy over 2 years ago - 12 comments

#119 - nccl.h is not found or ncclUnhandledCudaError: Call to CUDA function failed

Issue - State: closed - Opened by Fragile-azalea over 2 years ago - 9 comments

#116 - Question about how to use DistributedGroupedDataParallel

Issue - State: closed - Opened by Fragile-azalea over 2 years ago - 7 comments

#111 - python setup.py install error with ["ninja", "-v"]

Issue - State: closed - Opened by louislau1129 over 2 years ago - 11 comments

#105 - How to support data parallel and model parallel for megatron at the same time.

Issue - State: closed - Opened by superqing001 almost 3 years ago - 3 comments

#96 - setup.py install fails with an error

Issue - State: closed - Opened by zxw866 almost 3 years ago - 4 comments

#82 - When running fastmoe with model parallel, the training process hanged

Issue - State: closed - Opened by sandyhouse over 3 years ago - 6 comments

#61 - Adaptation guidelines for Megatron v2.4

Issue - State: closed - Opened by ymjiang over 3 years ago - 6 comments
Labels: good first issue