bigscience-workshop/Megatron-DeepSpeed issues and pull requests

#404 - Why pretrain_llama_distributed.sh use pretrain_gpt.py ?

Issue - State: closed - Opened by BrucePeng92 3 months ago

#403 - How can I set recomputation-granularity, like selective or full?

Issue - State: open - Opened by LordEdison 7 months ago

#402 - Bump black from 21.4b0 to 24.3.0

Pull Request - State: open - Opened by dependabot[bot] 8 months ago
Labels: dependencies

#401 - Hello, what version of the megatron-lm library is your code modified?

Issue - State: open - Opened by 4thGardenOfQMH 9 months ago

#400 - Is this assertion for mask wrong?

Issue - State: open - Opened by yinfangchen 9 months ago - 1 comment

#399 - Feature/tigerbot

Pull Request - State: closed - Opened by i4never about 1 year ago

#398 - Hello, can Megatron-DeepSpeed pre-train llama2?

Issue - State: open - Opened by 13416157913 about 1 year ago

#397 - Cannot run 3D parallelism with tp == 1 dp == 3 pp == 2 degrees

Issue - State: closed - Opened by Heelim-Hong about 1 year ago

#396 - Is a training log like this normal? I do not find loss in the logs, and what does the "grad norm: nan" mean?

Issue - State: open - Opened by alphanlp about 1 year ago

#395 - The difference between zero-3 and megatron with zero-2

Issue - State: open - Opened by nicosouth about 1 year ago

#394 - Question about the implementation of mpu.cross_entropy when using tensor parallel

Issue - State: open - Opened by robin087 over 1 year ago

#393 - Feature/tigerbot

Pull Request - State: closed - Opened by i4never over 1 year ago

#392 - questions about inconsistent evaluation result

Issue - State: open - Opened by coorful over 1 year ago

#391 - stage3 error: IndexError: list index out of range

Issue - State: closed - Opened by PhdShi over 1 year ago - 1 comment

#390 - ModuleNotFoundError: No module named 'packaging' when install apex

Issue - State: closed - Opened by SeekPoint over 1 year ago - 3 comments

#389 - ModuleNotFoundError: No module named 'torch' when run 'pip install -e .', but pytorch exists

Issue - State: closed - Opened by SeekPoint over 1 year ago - 2 comments

#388 - Question about ds to universal

Issue - State: open - Opened by saxh over 1 year ago

#387 - RuntimeError: Error building extension 'scaled_upper_triang_masked_softmax_cuda'

Issue - State: open - Opened by zll0000 over 1 year ago - 1 comment

#386 - hello, I meet a problem

Issue - State: open - Opened by etoilestar over 1 year ago - 8 comments

#385 - How to properly use Flops Profiler with pipelined parallelism?

Issue - State: open - Opened by flyingdown over 1 year ago

#384 - Fix/dataloader error

Pull Request - State: closed - Opened by EastInsure over 1 year ago

#383 - pip install -e . failed with ModuleNotFoundError: No module named 'torch'

Issue - State: open - Opened by SeekPoint over 1 year ago - 2 comments

#382 - Help me, I'm dying soon, error: command '/opt/rh/devtoolset-7/root/usr/bin/gcc' failed with exit code 1 error: subprocess-exited-with-error

Issue - State: open - Opened by listwebit over 1 year ago

#381 - Megatron-DeepSpeed only applies to specific models?

Issue - State: open - Opened by Bob-cby over 1 year ago

#380 - Universal checkpoints and MP states

Issue - State: closed - Opened by aitorormazabal over 1 year ago - 2 comments

#379 - The given group does not exist pytorch

Issue - State: open - Opened by germanjke over 1 year ago - 2 comments

#378 - upgrade megatron-lm

Issue - State: open - Opened by dz1iang over 1 year ago

#377 - How can we access to the gradients while the model is training?

Issue - State: open - Opened by BilgehanSel over 1 year ago

#376 - how to do prompt learning with bloom?

Issue - State: open - Opened by moseshu over 1 year ago

#375 - how to freeze some layers of GPT, only finetune last k layers?

Issue - State: open - Opened by joan126 over 1 year ago

#374 - How to convert model weights(e.g., bigscience/bloomz-560m-optimizer-states) to Hugging Face model.bin file?

Issue - State: closed - Opened by qazwsx042 over 1 year ago - 1 comment

#373 - Can I use python only apex for gpt_pretrain?

Issue - State: open - Opened by Luoyang144 over 1 year ago

#372 - how to pretrain t5-lm adapted?

Issue - State: open - Opened by nanyyyyyy over 1 year ago

#371 - How to preprocess data for t5 model?

Issue - State: open - Opened by xiu-ze over 1 year ago

#370 - Add xPos embeddings

Pull Request - State: open - Opened by janEbert over 1 year ago

#369 - Exception: cuda rng state model-parallel-rng is not added

Issue - State: open - Opened by 520jefferson over 1 year ago - 1 comment

#368 - Adapt to DCU

Pull Request - State: closed - Opened by hepj987 over 1 year ago

#367 - Fix various small problems

Pull Request - State: open - Opened by janEbert over 1 year ago

#366 - How to continue pre-training Bloom?

Issue - State: open - Opened by ShinoharaHare over 1 year ago - 2 comments

#365 - Bloom model training with AML

Pull Request - State: open - Opened by savitamittal1 almost 2 years ago

#364 - Are there any other layer norm functions, such as RMSNorm or DeepNorm

Issue - State: open - Opened by lvcc2018 almost 2 years ago

#363 - Is there any script for pretraining/finetuning Bloom?

Issue - State: open - Opened by drxmy almost 2 years ago

#362 - Bsevalharness

Pull Request - State: closed - Opened by Muennighoff almost 2 years ago

#361 - Does bigscience's Megatron-DeepSpeed support ZeRO-stage2+cpu offload?

Issue - State: closed - Opened by drxmy almost 2 years ago

#360 - Fatal error: cuda_fp16.h: No such file or directory on ROCm

Issue - State: open - Opened by lvcc2018 almost 2 years ago - 1 comment

#359 - finetuning bloom 176b with bitfit

Issue - State: closed - Opened by drxmy almost 2 years ago - 2 comments

#358 - Add UL2 data sampling and pretraining

Pull Request - State: open - Opened by janEbert almost 2 years ago - 3 comments

#357 - Add FlashAttention

Pull Request - State: open - Opened by NouamaneTazi almost 2 years ago - 3 comments

#356 - User Warnings for accessing grad attribute of non-leaf Tensors thrown with TP=1 and PP>1

Issue - State: open - Opened by chelseajohn almost 2 years ago - 3 comments

#355 - deepspeed_to_megatron several issues

Issue - State: open - Opened by MatejUlcar about 2 years ago - 4 comments

#354 - Distill BLOOM - tentative 2

Pull Request - State: open - Opened by younesbelkada about 2 years ago

#353 - Enable rocm-support

Pull Request - State: open - Opened by luukkonenr about 2 years ago

#352 - Distill megatron - test Draft WIP

Pull Request - State: closed - Opened by younesbelkada about 2 years ago

#351 - Distill megatron - WIP draft code

Pull Request - State: closed - Opened by younesbelkada about 2 years ago

#350 - Load Bloom Optimizer State (i.e. Bloom 1B1)

Issue - State: open - Opened by philippmtk about 2 years ago - 2 comments

#349 - Encoding checkpoint reshaping guide

Pull Request - State: open - Opened by tjruwase about 2 years ago - 1 comment

#348 - Slower inference results for BLOOM fp16 on identical hardware

Issue - State: open - Opened by sarthaklangde about 2 years ago - 5 comments

#347 - grad norm increase strangely

Issue - State: open - Opened by misska1 about 2 years ago - 12 comments

#346 - How to inference GPT2 with DeepSpeed?

Issue - State: closed - Opened by cdj0311 about 2 years ago - 1 comment

#345 - [bloom inference scripts] improvements

Pull Request - State: closed - Opened by stas00 about 2 years ago

#344 - [Bloom inference] further improvements

Pull Request - State: closed - Opened by stas00 about 2 years ago - 1 comment

#343 - About reshape deepspeed checkpoint

Issue - State: open - Opened by henan991201 about 2 years ago - 20 comments

#342 - Installing Apex on Windows

Issue - State: open - Opened by gordicaleksa about 2 years ago - 1 comment

#341 - pretrain_gpt_distributed.sh ERROR!

Issue - State: closed - Opened by cdj0311 about 2 years ago

#340 - [ds-inference bloom] tweaks

Pull Request - State: closed - Opened by stas00 about 2 years ago - 4 comments

#339 - Followup PR for adding generation-server

Pull Request - State: closed - Opened by mayank31398 about 2 years ago - 12 comments

#338 - About convert deepspeed to deepspeed checkpoint

Issue - State: open - Opened by henan991201 about 2 years ago - 4 comments

#337 - Finetuning BLOOM

Issue - State: open - Opened by AnaRhisT94 about 2 years ago - 5 comments

#336 - Add multiple evaluation compat

Pull Request - State: open - Opened by Muennighoff about 2 years ago

#335 - Changing a single example affects forward pass for other examples in a batch

Issue - State: closed - Opened by mayank31398 about 2 years ago - 4 comments
Labels: bug

#334 - Can we also train BLOOM model using tensor-Parallelism and efficient fused CUDA kernels

Issue - State: open - Opened by CloudedLeopard17 about 2 years ago - 4 comments

#333 - About convert DS checkpoint to Transformers

Issue - State: closed - Opened by misska1 about 2 years ago - 2 comments

#332 - disable CI

Pull Request - State: closed - Opened by stas00 over 2 years ago - 1 comment

#331 - merge main

Pull Request - State: closed - Opened by Muennighoff over 2 years ago

#330 - DeepSpeed inference support for int8 parameters on BLOOM?

Issue - State: closed - Opened by pai4451 over 2 years ago - 6 comments

#329 - how to convert huggingface model to megatron-deepspeed?

Issue - State: closed - Opened by yayaQAQ over 2 years ago - 8 comments

#328 - Add generation server scripts using HF accelerate and DS-inference

Pull Request - State: closed - Opened by mayank31398 over 2 years ago - 46 comments

#327 - [checkpoints] replace bf16 with fp32 checkpoint weights

Pull Request - State: open - Opened by stas00 over 2 years ago - 3 comments

#326 - Add option to normalize loss per target

Pull Request - State: closed - Opened by Muennighoff over 2 years ago

#325 - Add generation server scripts

Pull Request - State: closed - Opened by mayank31398 over 2 years ago - 1 comment

#324 - Errors in generation (Bloom) when changing options sampling/use_cache

Issue - State: open - Opened by thies1006 over 2 years ago - 29 comments

#323 - Question about downloading checkpoints of 6.3B, 2.5B, 1.3B

Issue - State: open - Opened by misska1 over 2 years ago - 3 comments

#322 - add args_deepspeed_gpt.sh

Pull Request - State: closed - Opened by xyn1201 over 2 years ago

#321 - Generation server using HF accelerate and DS inference

Pull Request - State: closed - Opened by mayank31398 over 2 years ago - 19 comments

#320 - "Mask is silently ignored due to the use of a custom kernel" with pretrain_gpt_single_node.sh

Issue - State: open - Opened by tianjianjiang over 2 years ago - 4 comments

#319 - where can I download the 176B checkpoint in deepspeed format?

Issue - State: open - Opened by xuyifan-0731 over 2 years ago - 17 comments

#318 - Multi-node inference with Bloom: Unhandled CUDA error in ProcessGroupNCCL.cpp (called from all_reduce in torch)

Issue - State: open - Opened by asaparov over 2 years ago - 31 comments

#314 - How to run generation?

Issue - State: closed - Opened by mayank31398 over 2 years ago - 1 comment

#313 - Prefix LM Eval

Pull Request - State: open - Opened by Muennighoff over 2 years ago - 4 comments

#311 - Add Bitfit

Pull Request - State: open - Opened by Muennighoff over 2 years ago

#309 - Enable loading ckpt for t0 finetuning

Pull Request - State: open - Opened by Muennighoff over 2 years ago

#308 - BLOOM Inference via DeepSpeed-Inference, Accelerate and DeepSpeed-ZeRO

Pull Request - State: closed - Opened by stas00 over 2 years ago - 46 comments

#291 - BigScience Eval Harness

Pull Request - State: open - Opened by Muennighoff over 2 years ago

#284 - MLM adaptation and Multitask Finetuning

Pull Request - State: closed - Opened by lintangsutawika over 2 years ago - 4 comments

#226 - Make sure deepspeed powered models are equivalent with their non deepspeed version

Issue - State: open - Opened by thomasw21 almost 3 years ago - 2 comments
Labels: Good First Issue

#163 - [Tensorboard] Log text prediction in evaluation

Issue - State: open - Opened by thomasw21 about 3 years ago - 14 comments
Labels: Good First Issue

#118 - Corby's numerically more stable self attn version

Pull Request - State: closed - Opened by stas00 about 3 years ago

#114 - Add checks to confirm that the checkpoint conversion script works perfectly correct

Issue - State: closed - Opened by ibeltagy about 3 years ago - 8 comments
Labels: Good First Issue

#100 - Import issues when using evaluation scripts : `module 'megatron' has no attribute 'model'`

Issue - State: closed - Opened by RomanCast about 3 years ago

#99 - Double counts in parameter count

Issue - State: open - Opened by TevenLeScao about 3 years ago - 2 comments
