microsoft/Megatron-DeepSpeed issues and pull requests

#450 - [Bug]Fix init issue for layer_norm in sequence_parallel for non-CUDA device.

Pull Request - State: open - Opened by ys950902 about 2 months ago - 2 comments

#449 - Model conversion problem

Issue - State: open - Opened by yuanzhiyong1999 about 2 months ago

#448 - [Bug]Fix init issue for rms_norm in sequence_parallel.

Pull Request - State: open - Opened by ys950902 about 2 months ago - 1 comment

#447 - Async allreduce for tensor-parallel

Issue - State: open - Opened by drcanchi about 2 months ago

#446 - [TRACKER] Customer support related PR tracker for Intel devices

Issue - State: open - Opened by delock about 2 months ago

#445 - fix moe tflops

Pull Request - State: open - Opened by ranzhejiang about 2 months ago

#444 - how to calcuate the training throughput

Issue - State: open - Opened by bigtree2020 2 months ago

#441 - Adding the new feature of FPDT

Pull Request - State: open - Opened by YJHMITWEB 3 months ago - 4 comments

#440 - Optimizer problem when using finetune_llama.sh

Issue - State: open - Opened by Kaiizx 3 months ago - 3 comments

#429 - Enable Sequence Parallelism

Pull Request - State: closed - Opened by polisettyvarma 4 months ago - 10 comments

#428 - [Bug] grad_weight can't be NoneType when running with DeepSpeed on Zero3.

Pull Request - State: closed - Opened by ys950902 4 months ago - 8 comments

#379 - AttributeError: 'Namespace' object has no attribute 'deepspeed_config_dict'. Did you mean: 'deepspeed_config'? && batch = next(self.data_iterator)

Issue - State: open - Opened by hi20240217 7 months ago - 2 comments

#100 - DeepSpeed Data Efficiency Library pretraining examples

Pull Request - State: closed - Opened by conglongli almost 2 years ago - 1 comment

#99 - Fix generate_text.sh Megatron text-generation example working w/ DS inference

Pull Request - State: closed - Opened by lekurile almost 2 years ago

#98 - The FLOPS per GPU reported for the Megatron GPT model by the DeepSpeed Flops Profiler is much lower than that reported in the logs when we run pretrain_gpt.py

Issue - State: open - Opened by shrutiramesh1988 almost 2 years ago - 1 comment

#97 - AttributeError: module 'transformer_inference' has no attribute 'layer_norm_fp16'

Issue - State: open - Opened by ranggihwang almost 2 years ago - 1 comment

#96 - Fix the bug of FusedLayerNorm on ROCm

Pull Request - State: closed - Opened by hubertlu-tw almost 2 years ago - 2 comments

#95 - Layer Norm kernel fails for ROCm

Issue - State: closed - Opened by NouamaneTazi almost 2 years ago - 3 comments

#94 - If I just want to pretrain a simple gpt model without these characteristics, which script should I refer to?

Issue - State: open - Opened by AQA6666 about 2 years ago - 1 comment

#93 - The process is stuck at this step:compiling and loading fused kernels ...

Issue - State: open - Opened by AQA6666 about 2 years ago - 1 comment

#92 - Modifying loss checking to support bf16.

Pull Request - State: closed - Opened by jomayeri about 2 years ago

#91 - deepspeed to megatron - mismatch in function definition and call

Issue - State: open - Opened by MatejUlcar about 2 years ago

#90 - Vocab size mismatch for T5

Issue - State: open - Opened by ShivanshuPurohit about 2 years ago

#89 - How to run use moe on T5?

Issue - State: closed - Opened by YijiaZhao about 2 years ago - 2 comments

#88 - Updated to Curated acpt env and removed deepspeed install from github

Pull Request - State: closed - Opened by savitamittal1 about 2 years ago

#87 - Fix a bug for gpt pre-training.

Pull Request - State: closed - Opened by FeixLiu about 2 years ago - 2 comments

#86 - Does Deepspeed compatible with megatron3.0 ?

Issue - State: open - Opened by pangsg about 2 years ago

#85 - MoE Checkpoint size

Issue - State: open - Opened by yunoJ about 2 years ago

#84 - GeLU approximation differs from paper, BERT

Issue - State: closed - Opened by yieldthought about 2 years ago - 1 comment

#83 - Issue generating text with GPT: "KeyError: 50284"

Issue - State: open - Opened by gcunhase about 2 years ago

#82 - Issue loading GPT2 checkpoint: "torch.nn.modules.module.ModuleAttributeError: 'ParallelTransformerLayer' object has no attribute 'self_attention'"

Issue - State: open - Opened by gcunhase about 2 years ago - 1 comment

#81 - megatron-deepspeed layernorm has different output compare with megatron-lm?

Issue - State: open - Opened by Kite0011 about 2 years ago

#80 - BERT QQP and RACE fine-tune examples

Pull Request - State: closed - Opened by conglongli about 2 years ago

#79 - integrate ort

Pull Request - State: closed - Opened by prathikr about 2 years ago - 2 comments

#78 - attempt at pipelining

Pull Request - State: open - Opened by siddharth9820 about 2 years ago

#77 - fix throughput_calculator

Pull Request - State: closed - Opened by conglongli about 2 years ago

#76 - pretrain_gpt_125M_MoE freezes during compilation

Issue - State: closed - Opened by yunoJ over 2 years ago - 1 comment

#75 - BERT example

Pull Request - State: closed - Opened by conglongli over 2 years ago

#74 - BERT example staging v1

Pull Request - State: closed - Opened by conglongli over 2 years ago

#73 - This repo is missing important files

Issue - State: closed - Opened by microsoft-github-policy-service[bot] over 2 years ago

#72 - Adding Microsoft SECURITY.MD

Pull Request - State: closed - Opened by microsoft-github-policy-service[bot] over 2 years ago

#71 - Offloading optimizer to CPU causes "expected input to be on cuda" error; Suggest to fallback to torch.optim.AdamW

Pull Request - State: closed - Opened by hibagus over 2 years ago - 3 comments

#70 - gpt_6.7B_PR-MoE16: CUDA out of memory

Issue - State: open - Opened by fighterhit over 2 years ago - 1 comment

#69 - add checkpoint throughput measurement

Pull Request - State: closed - Opened by GuanhuaWang over 2 years ago - 1 comment

#68 - Enable Megatron-LM workload on ROCm

Pull Request - State: closed - Opened by rraminen over 2 years ago - 3 comments

#67 - Question for usage of DeepSpeed transformer kernels

Issue - State: closed - Opened by delock over 2 years ago

#66 - add changes for enabling AML run

Pull Request - State: closed - Opened by msp8955 over 2 years ago - 2 comments

#65 - Merge azure branch manually

Pull Request - State: closed - Opened by awan-10 over 2 years ago

#64 - Azure draft PR -- to be closed after discussion/review

Pull Request - State: closed - Opened by awan-10 over 2 years ago

#63 - Tensor parallelism for Mixture of Experts

Pull Request - State: closed - Opened by siddharth9820 over 2 years ago - 2 comments

#62 - [fix]Solve checkpoint loading err with megatron bert_model

Pull Request - State: closed - Opened by kisseternity over 2 years ago - 2 comments

#61 - [Bug]Load checkpoint err using pretrain_bert.py with Megatron

Issue - State: closed - Opened by kisseternity over 2 years ago - 2 comments

#60 - Minjiaz/compression gpt

Pull Request - State: closed - Opened by minjiaz over 2 years ago

#59 - Debug

Pull Request - State: closed - Opened by rayzzq over 2 years ago - 1 comment

#58 - GPT-2 with pipeline parallel and bfloat16 doesn't work

Issue - State: open - Opened by assij over 2 years ago - 4 comments

#57 - AzureML: initial changes for benchmarking

Pull Request - State: closed - Opened by msp8955 over 2 years ago

#56 - Add Zero-offload support

Pull Request - State: closed - Opened by siddharth9820 over 2 years ago

#55 - Add ZeRO-offload support

Pull Request - State: closed - Opened by siddharth9820 over 2 years ago
