laekov/fastmoe issues and pull requests

#214 - Detailed documentation about model parallelism

Issue - State: open - Opened by ZSL98 3 months ago

#213 - R operations do not overlap with C operations in smart schedule

Issue - State: open - Opened by WhatBrain 4 months ago - 5 comments

#212 - bash run_enwik8_base.sh train train --work_dir /dir/

Issue - State: closed - Opened by WYCAS 4 months ago

#211 - how to run transformer-xl with parallel experts with single gpu?

Issue - State: open - Opened by HudashiNeo 4 months ago - 6 comments

#210 - Do We support DeepSpeed training? Thanks.

Issue - State: open - Opened by lzl-mt 5 months ago - 1 comment

#209 - Forward pass return value is missing bal_loss

Issue - State: open - Opened by tisgotos 5 months ago - 2 comments

#208 - Hello, where can I get v2.2 of Megatron-LM?

Issue - State: closed - Opened by tisgotos 5 months ago - 7 comments

#207 - Error when running examples/transformer-xl/scripts/run_enwik8_base_moe.sh with Smart schedule enabled

Issue - State: open - Opened by WhatBrain 5 months ago - 6 comments

#206 - No hiding output when using `pytest -s`

Pull Request - State: closed - Opened by roastduck 8 months ago

#205 - Make the code neutral to device by removing `.cuda()`

Pull Request - State: closed - Opened by roastduck 8 months ago

#204 - FasterMoE Shadow Policy: Detailed Inquiry

Issue - State: closed - Opened by Guodanding 9 months ago - 7 comments

#203 - Update readme-cn.md

Pull Request - State: closed - Opened by HelloWorldLTY 9 months ago

#202 - DDP error

Issue - State: closed - Opened by Peg-Wu 10 months ago

#201 - CUDA memory increases after each loss.backward()

Issue - State: open - Opened by sreetamasarkar 10 months ago - 6 comments

#200 - Update switch_gate.py

Pull Request - State: closed - Opened by Heihaierr 11 months ago

#199 - A bug in switch_gate

Issue - State: open - Opened by Heihaierr 11 months ago - 6 comments

#198 - About switch_gate

Issue - State: open - Opened by Heihaierr 11 months ago - 1 comment

#197 - multi-node problem

Issue - State: open - Opened by Qianshaowei 11 months ago - 1 comment

#196 - Example to run Megatron

Issue - State: open - Opened by Juanhui28 11 months ago - 3 comments

#195 - [BUG] AttributeError: module 'fmoe_cuda' has no attribute 'assign_pos_'

Issue - State: open - Opened by pangsg 11 months ago - 3 comments

#194 - cudaErrorInvalidDevice reported when running FMOE

Issue - State: closed - Opened by pangsg 12 months ago - 6 comments

#193 - Does fastmoe support fine-tuning?

Issue - State: closed - Opened by PowerDispatch 12 months ago

#192 - Does fastmoe support fine-tuning, page-attention, flash-attention, kvcache, mixed precision, etc.?

Issue - State: open - Opened by PowerDispatch 12 months ago - 4 comments

#191 - Can fastmoe be integrated into VLLM?

Issue - State: open - Opened by pangsg 12 months ago - 4 comments

#190 - The prep_text8.py script does not exist

Issue - State: closed - Opened by PowerDispatch 12 months ago - 1 comment

#189 - Do we have an online chat group?

Issue - State: open - Opened by PowerDispatch 12 months ago - 1 comment

#188 - Hello, how do I define an MoE under dp+mp in fastmoe?

Issue - State: closed - Opened by daixiangzi 12 months ago - 6 comments

#187 - This PR resolves issue #186

Pull Request - State: closed - Opened by Cobalt-27 12 months ago

#186 - num_experts argument error for Megatron-LM

Issue - State: closed - Opened by Cobalt-27 12 months ago

#185 - [Feature] Make bias of gate optional for naive_gate and its subclasses.

Pull Request - State: closed - Opened by Zhang-RQ about 1 year ago

#184 - Segmentation fault when Smart schedule is enabled

Issue - State: open - Opened by Xingzhi107 about 1 year ago - 8 comments
Labels: bug

#183 - pytest error

Issue - State: open - Opened by R-QinQ about 1 year ago - 3 comments

#182 - setup.py error!

Issue - State: closed - Opened by R-QinQ about 1 year ago - 4 comments

#181 - ImportError: cannot import name 'get_args' from 'megatron'

Issue - State: open - Opened by peter-fei about 1 year ago - 5 comments

#180 - During inference, the output of noisy gate is nan.

Issue - State: open - Opened by zqhang about 1 year ago - 5 comments

#179 - Inconsistent evaluation result when clone expert parameters from original FFN

Issue - State: closed - Opened by Heihaierr about 1 year ago - 1 comment

#178 - MOELinear is much slower than torch.nn.Linear

Issue - State: closed - Opened by kamanphoebe about 1 year ago - 7 comments

#177 - ModuleNotFoundError: No module named 'fmoe_cuda'

Issue - State: open - Opened by Taskii-Lei about 1 year ago - 1 comment

#177 - ModuleNotFoundError: No module named 'fmoe_cuda'

Issue - State: open - Opened by Taskii-Lei about 1 year ago - 3 comments

#176 - how to use balance loss?

Issue - State: open - Opened by Heihaierr about 1 year ago - 1 comment

#175 - update clip-grad-v2.2.patch for grads_in_moe is empty

Pull Request - State: closed - Opened by Fragile-azalea over 1 year ago

#174 - Fix tests

Pull Request - State: closed - Opened by laekov over 1 year ago

#173 - Fit old code with new smgr

Pull Request - State: closed - Opened by laekov over 1 year ago

#172 - [BUG FIX] Fix bugs in stream manager.

Pull Request - State: closed - Opened by zms1999 over 1 year ago - 1 comment

#171 - fix cublas gemm call for bf16 input

Pull Request - State: closed - Opened by xptree over 1 year ago - 1 comment

#170 - MOELinear always returns a zero tensor for bf16 input

Issue - State: closed - Opened by xptree over 1 year ago - 1 comment

#169 - MoE L2 norm reduce in Megatron

Issue - State: closed - Opened by blankde over 1 year ago - 3 comments

#168 - No overlapping observed when enabling Smart Scheduling

Issue - State: open - Opened by chenyu-jiang over 1 year ago - 8 comments

#167 - Update outdated README

Pull Request - State: closed - Opened by zms1999 over 1 year ago

#166 - Outdated doc for smart schedule with num_expert > 1?

Issue - State: closed - Opened by chenyu-jiang over 1 year ago - 1 comment

#165 - Document for process groups

Pull Request - State: closed - Opened by laekov over 1 year ago

#164 - Doc-string / Documentation clarification for parallel groups

Issue - State: closed - Opened by XMaster96 over 1 year ago - 2 comments

#163 - Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example)

Issue - State: open - Opened by chenwydj over 1 year ago - 3 comments

#162 - fmoe with deepspeed

Pull Request - State: open - Opened by KimmiShi over 1 year ago

#161 - Mixture of Experts in Vision Task (Segmentation)

Issue - State: open - Opened by deep-matter over 1 year ago - 2 comments

#160 - bf16 support

Pull Request - State: closed - Opened by laekov over 1 year ago

#159 - [WIP] Megatron v3.0.2 with known issues

Pull Request - State: closed - Opened by xptree over 1 year ago - 1 comment

#158 - Is there any plan to adapt to newer version of Megatron-LM?

Issue - State: closed - Opened by lvcc2018 over 1 year ago - 1 comment

#157 - Fix ProcessGroupNCCL mismatch in pytorch2

Pull Request - State: closed - Opened by laekov over 1 year ago

#156 - Distributed Training is failing

Issue - State: closed - Opened by santurini over 1 year ago - 9 comments

#155 - Added link to installation guide

Pull Request - State: closed - Opened by santurini over 1 year ago

#154 - Create installation-guide.md

Pull Request - State: closed - Opened by santurini over 1 year ago - 1 comment

#153 - Added GitHub Gist link to installation tutorial

Pull Request - State: closed - Opened by santurini over 1 year ago - 4 comments

#152 - Cast input to weights type for AMP support

Pull Request - State: closed - Opened by santurini over 1 year ago

#151 - Revert "convert input to same type as weight for mixed precision training"

Pull Request - State: closed - Opened by laekov over 1 year ago
