microsoft/Megatron-DeepSpeed issues and pull requests

#54 - "RuntimeError: trying to initialize the default process group twice!" error with pretrain_gpt example script

Issue - State: closed - Opened by rraminen over 2 years ago - 2 comments

#53 - Adding GPT pretraining distillation and quantization examples

Pull Request - State: closed - Opened by minjiaz over 2 years ago - 1 comment

#52 - Add Codeowner

Pull Request - State: closed - Opened by conglongli over 2 years ago

#51 - [BUG] the gpt model cannot run in specified container

Issue - State: closed - Opened by starkhu over 2 years ago - 4 comments

#50 - Add support for DS comms

Pull Request - State: open - Opened by Quentin-Anthony over 2 years ago - 1 comment

#49 - [OLD] Support DeepSpeed Comms

Pull Request - State: closed - Opened by Quentin-Anthony over 2 years ago

#48 - MoE support

Pull Request - State: closed - Opened by awan-10 over 2 years ago

#47 - MoE support

Pull Request - State: closed - Opened by jeffra over 2 years ago

#46 - How efficient is the BERT and T5 code?

Issue - State: closed - Opened by StellaAthena over 2 years ago - 1 comment

#45 - how can I use the cpu_offload?

Issue - State: closed - Opened by cudaMancpy over 2 years ago

#44 - Update eval readme

Pull Request - State: closed - Opened by conglongli over 2 years ago

#43 - Cannot run the pretrain_gpt example using moe branch

Issue - State: open - Opened by getao over 2 years ago - 3 comments

#42 - Fix grad accum double scaling bug under no pp mode

Pull Request - State: closed - Opened by conglongli over 2 years ago

#41 - Fix typo

Pull Request - State: closed - Opened by mrm8488 over 2 years ago

#40 - ModuleNotFoundError: No module named 'lm_eval.datasets.coqa'

Issue - State: open - Opened by xwuShirley over 2 years ago - 2 comments

#39 - ds_pretrain_gpt_125M_MoE64.sh didn't convergence, loss fly after 3k steps?

Issue - State: closed - Opened by jerryli1981 over 2 years ago - 5 comments

#38 - Minjiaz/mos release

Pull Request - State: closed - Opened by minjiazhang over 2 years ago - 1 comment

#37 - Merged MoS staging to MoE

Pull Request - State: closed - Opened by awan-10 over 2 years ago

#36 - PR-MoE client changes to use the new DS-MoE API

Pull Request - State: closed - Opened by awan-10 over 2 years ago

#35 - PR-MoE changes to match new DS-MoE API

Pull Request - State: closed - Opened by awan-10 over 2 years ago

#34 - Eval harness for dense and MoE model, plus several feature/fixes for dense/MoE training

Pull Request - State: closed - Opened by conglongli over 2 years ago

#33 - fix MoE save interval

Pull Request - State: closed - Opened by conglongli over 2 years ago

#32 - Adding MoS support for Mixture-of-Experts in DeepSpeed

Pull Request - State: closed - Opened by minjiaz over 2 years ago

#31 - Support for MoS

Pull Request - State: closed - Opened by minjiaz over 2 years ago - 1 comment

#30 - How to merge the model partition that use both optimization about megatron's mp and deepspeed's zero 1?

Issue - State: closed - Opened by Tecmus over 2 years ago

#29 - Load moe checkpoint in generate_text.sh

Issue - State: open - Opened by Ag2S1 almost 3 years ago - 5 comments

#28 - Checkpoint for the MoE version

Issue - State: open - Opened by BDHU almost 3 years ago

#27 - DeepSpeed to DeepSpeed converter for changing tp/pp

Pull Request - State: closed - Opened by tjruwase almost 3 years ago - 5 comments

#26 - draft MoE training

Pull Request - State: closed - Opened by awan-10 almost 3 years ago - 1 comment

#25 - unpack list into a tuple constructor for python-3.7

Pull Request - State: closed - Opened by adammoody almost 3 years ago - 2 comments

#24 - Invalid syntax error when unpacking *moe_losses in python-3.7

Issue - State: closed - Opened by adammoody almost 3 years ago - 3 comments

#23 - [checkpoint conversion] meg-ds to meg-ds topology reshaping

Issue - State: open - Opened by stas00 almost 3 years ago - 1 comment

#22 - Fixing the MoE training when using model-parallelism

Pull Request - State: closed - Opened by RezaYazdaniAminabadi almost 3 years ago - 1 comment

#21 - Sync with Megatron-LM

Pull Request - State: closed - Opened by tjruwase almost 3 years ago - 1 comment

#20 - How to run bert with deepspeed?

Issue - State: closed - Opened by MagiaSN almost 3 years ago - 2 comments

#19 - make CL not truncate eval data

Pull Request - State: closed - Opened by conglongli about 3 years ago

#18 - CL script update

Pull Request - State: closed - Opened by conglongli about 3 years ago

#17 - Curriculum learning support

Pull Request - State: closed - Opened by conglongli about 3 years ago

#16 - LM Evaluation Harness Integration

Issue - State: closed - Opened by StellaAthena about 3 years ago

#15 - Convert meg ds to hf

Pull Request - State: closed - Opened by tjruwase about 3 years ago - 2 comments

#14 - Checkpoint conversion tools

Pull Request - State: closed - Opened by tjruwase about 3 years ago - 17 comments

#13 - Make attention mask boolean

Pull Request - State: closed - Opened by tjruwase about 3 years ago

#12 - syncing with the upstream?

Issue - State: open - Opened by stas00 about 3 years ago

#11 - merging the fix from downstream

Issue - State: closed - Opened by stas00 over 3 years ago

#10 - Use new zero.Init() API

Pull Request - State: closed - Opened by tjruwase over 3 years ago

#9 - Pass mpu in zero.Init()

Pull Request - State: closed - Opened by tjruwase over 3 years ago - 1 comment

#8 - query deepspeed global grad norm

Pull Request - State: closed - Opened by ShadenSmith over 3 years ago

#7 - zero.Init() with mpu

Pull Request - State: closed - Opened by tjruwase over 3 years ago - 1 comment

#6 - use pp engine even for pp=1

Pull Request - State: closed - Opened by jeffra over 3 years ago

#5 - improve DS integration docs + evaluation + logging

Pull Request - State: closed - Opened by ShadenSmith over 3 years ago

#4 - fix failure on restart after round 1 train and no eval

Pull Request - State: closed - Opened by stas00 over 3 years ago

#3 - fix failure on restart after round 1 train and no eval

Pull Request - State: closed - Opened by stas00 over 3 years ago

#2 - fix failure on restart after round 1 train and no eval

Pull Request - State: closed - Opened by stas00 over 3 years ago - 1 comment

#1 - Megatron + DeepSpeed + Pipeline Parallelism

Pull Request - State: closed - Opened by jeffra over 3 years ago
