InternLM/InternEvo issues and pull requests

#374 - fix(linear.py): linear module uneven split is forbidden

Pull Request - State: open - Opened by huangting4201 6 days ago

#373 - fix(monitor): send exception when feishu alert is enabled && remove light monitoring address

Pull Request - State: open - Opened by JiaoPL 7 days ago

#372 - [QA] Does internEvo support loongtrain selective checkpoint++?

Issue - State: open - Opened by wplf 7 days ago - 1 comment
Labels: question

#371 - fix(gmm): change communicator.grad_hook to async

Pull Request - State: open - Opened by blankde 7 days ago

#370 - fix(mha.py): fix evaluation argu key err

Pull Request - State: closed - Opened by huangting4201 9 days ago

#369 - feat(fp8): [Work In Progress] enable FP8 training

Pull Request - State: open - Opened by zigzagcai 21 days ago - 1 comment

#368 - remove unused moe changes, modify _q_kv_without_cu_seqlens and _SplitForwardGatherBackward

Pull Request - State: closed - Opened by KkHu-Kistch 26 days ago

#367 - Add hetero feat

Pull Request - State: closed - Opened by fumihwh 26 days ago

#366 - fix(isp.py): fix isp overlap backward allgather twice when activation ckpt 0.x

Pull Request - State: open - Opened by huangting4201 27 days ago

#365 - Add z loss to PipelineSchedule

Pull Request - State: closed - Opened by zhhsplendid 28 days ago

#364 - fix lumina model and add lumina ckpt support

Pull Request - State: closed - Opened by SHshenhao 28 days ago

#363 - fix lumina model and add lumina ckpt support

Pull Request - State: closed - Opened by SHshenhao 28 days ago

#362 - fix lumina model and add lumina ckpt support

Pull Request - State: closed - Opened by SHshenhao 28 days ago

#361 - A PR Provides Multi Machine MPI scripts

Pull Request - State: closed - Opened by zhhsplendid 28 days ago

#360 - fix(mlp.py): fix mlp w1w2w3 init order to w1w3w2

Pull Request - State: open - Opened by huangting4201 29 days ago

#359 - fix llava model device bugs

Pull Request - State: open - Opened by hellozmz 30 days ago

#358 - Feat/refactor process group

Pull Request - State: open - Opened by mwiacx 30 days ago

#357 - feat(pipeline): Zero Bubble V Shape Memory Efficient Edition

Pull Request - State: closed - Opened by li126com about 1 month ago

#356 - Tmp fix QK norm bug

Pull Request - State: closed - Opened by zhhsplendid about 1 month ago

#355 - Feat/heterogeneous x pu training

Pull Request - State: closed - Opened by KkHu-Kistch about 1 month ago

#354 - [QA] How do I fine-tune on a single card, and which settings need to be adjusted?

Issue - State: open - Opened by OkGuai about 1 month ago
Labels: question

#353 - [Feature] Add Lumina Model to InternEvo. Tested on MUXI single card

Pull Request - State: closed - Opened by zhhsplendid about 1 month ago

#352 - feat(moe): add gshard token rearrange optim

Pull Request - State: open - Opened by blankde about 1 month ago

#351 - fix(checkpoint/components.py): fix lr scheduler resume step count

Pull Request - State: closed - Opened by huangting4201 about 1 month ago

#350 - feat(moe): support moe zero1 setting

Pull Request - State: open - Opened by blankde about 1 month ago

#349 - feat(model): support kv head copy

Pull Request - State: closed - Opened by yingtongxiong about 2 months ago

#348 - fix(moe): dropless moe loss

Pull Request - State: closed - Opened by blankde about 2 months ago

#347 - doc(2d): docs for 2d-attention

Pull Request - State: closed - Opened by yingtongxiong about 2 months ago

#346 - [QA] Does loong train support packed_sample_into_one=false?

Issue - State: open - Opened by Lzhang-hub 2 months ago - 1 comment
Labels: question

#345 - feat(moe): support group mlp for moe

Pull Request - State: closed - Opened by blankde 2 months ago

#344 - feat(dataloader): refine implementation of mocked and megatron dataloader

Pull Request - State: open - Opened by zigzagcai 2 months ago

#343 - feat(zero bubble): update zbh1

Pull Request - State: open - Opened by li126com 2 months ago

#342 - [Bug] There will be timeout in some cases.

Issue - State: closed - Opened by kkscilife 2 months ago - 1 comment
Labels: bug

#341 - fix inject model and add multimodal dataloader

Pull Request - State: closed - Opened by sallyjunjun 2 months ago

#340 - fix(enable_qkv_fusion): minor fix for qkv fusion

Pull Request - State: closed - Opened by zigzagcai 2 months ago

#339 - fix dispatch model

Pull Request - State: closed - Opened by sallyjunjun 2 months ago

#338 - fix(enable_qkv_fusion): refine wqkv fusion

Pull Request - State: closed - Opened by zigzagcai 2 months ago

#337 - fix wqkv fusion

Pull Request - State: closed - Opened by zigzagcai 2 months ago

#336 - fix wqkv fusion

Pull Request - State: closed - Opened by zigzagcai 2 months ago

#335 - fix wqkv dim when enable qkv fusion

Pull Request - State: closed - Opened by sallyjunjun 2 months ago

#334 - fix(pipeline): fix zero bubble pipeline parallelism

Pull Request - State: closed - Opened by li126com 2 months ago

#333 - Feat(adam): support apex FusedAdam

Pull Request - State: closed - Opened by li126com 2 months ago

#332 - feat(moe): add moe async param handler

Pull Request - State: open - Opened by blankde 2 months ago

#331 - feat(usability): Refine model inject helper to support huggingface models

Pull Request - State: closed - Opened by zigzagcai 2 months ago

#330 - remove isp memory pool

Pull Request - State: closed - Opened by mwiacx 2 months ago

#329 - update test loss

Pull Request - State: open - Opened by li126com 2 months ago

#328 - fix(isp): fix unnecessary module gather for isp

Pull Request - State: closed - Opened by blankde 2 months ago - 2 comments

#327 - add qwen2moe and mixtral

Pull Request - State: closed - Opened by sallyjunjun 3 months ago - 1 comment

#326 - feat(model: impl gpt 567 b

Pull Request - State: closed - Opened by blankde 3 months ago

#325 - [Feature] Decouple zero and parallelism settings for dense layers and expert layers in MoE models

Issue - State: open - Opened by sunpengsdu 3 months ago
Labels: enhancement

#324 - [Feature] Do not use memory pool

Issue - State: open - Opened by sunpengsdu 3 months ago - 1 comment
Labels: enhancement

#323 - feat(dataloader): Implement megatron dataloader and mocked dataloader

Pull Request - State: closed - Opened by zigzagcai 3 months ago - 1 comment

#322 - feat(moe): support moe isp and no tp

Pull Request - State: closed - Opened by blankde 3 months ago

#321 - feat(moe): support moe no tp

Pull Request - State: closed - Opened by blankde 3 months ago

#320 - feat(moe): support dropless layer

Pull Request - State: closed - Opened by blankde 3 months ago - 3 comments

#319 - fix(ci): fix weekly ci

Pull Request - State: closed - Opened by zigzagcai 3 months ago - 1 comment

#318 - [Bug] There is an error in training : built-in model should inherited from BaseModel

Issue - State: closed - Opened by kkscilife 3 months ago - 1 comment
Labels: bug

#317 - fix(cross_entropy.py): replace the fa loss with apex loss

Pull Request - State: closed - Opened by yingtongxiong 3 months ago

#316 - fix(shard.py): fix isp unpack data indexes err in rotary emb

Pull Request - State: closed - Opened by huangting4201 3 months ago

#315 - add vocab parallel embedding

Pull Request - State: closed - Opened by mwiacx 3 months ago - 1 comment

#314 - fix(ci): fix error in train_CI

Pull Request - State: closed - Opened by zigzagcai 3 months ago

#313 - fix(model): fix bugs of batch generation & support min_new_tokens for inference

Pull Request - State: closed - Opened by x54-729 3 months ago
Labels: bug

#312 - Add new models

Pull Request - State: closed - Opened by sallyjunjun 3 months ago

#311 - fix(embedding): fix incorrect computing of indexes in _update_cos_sin_cache

Pull Request - State: closed - Opened by li126com 3 months ago

#310 - improve documentation

Pull Request - State: closed - Opened by sallyjunjun 3 months ago

#309 - fix(910B): fix bugs in 910B for varlen and fixlen FA

Pull Request - State: closed - Opened by li126com 3 months ago - 2 comments

#308 - fix(isp): fix dist-attn infer

Pull Request - State: closed - Opened by KimmiShi 3 months ago - 1 comment

#307 - [Bug] Known 910B bugs and their resolution status

Issue - State: closed - Opened by li126com 3 months ago
Labels: bug

#306 - [Feature] Optimize ce_loss computation

Issue - State: closed - Opened by zigzagcai 3 months ago
Labels: enhancement

#305 - add data flow doc

Pull Request - State: closed - Opened by sallyjunjun 3 months ago

#304 - feat(usability): Attempt for easier usability

Pull Request - State: closed - Opened by zigzagcai 3 months ago - 1 comment

#303 - Attempt for easier usability

Pull Request - State: closed - Opened by zigzagcai 3 months ago

#302 - [Bug] Import Error: Import "deeplink_ext.internlm_ops" could not be resolved

Issue - State: closed - Opened by kkscilife 3 months ago - 1 comment
Labels: bug

#301 - support pip install on npu environment

Pull Request - State: closed - Opened by sallyjunjun 3 months ago

#300 - [QA] check import system var at the start of training

Issue - State: open - Opened by sunpengsdu 3 months ago
Labels: question

#299 - Zmz/qwen2

Pull Request - State: closed - Opened by hellozmz 3 months ago

#298 - [Bug] Installing the internLM environment on Ascend 910 fails with an error that nvcc is required

Issue - State: closed - Opened by tungsten106 3 months ago - 2 comments
Labels: bug

#297 - fix(launch): remove use_paked_data=use_flash_atten assert

Pull Request - State: closed - Opened by yingtongxiong 4 months ago

#296 - fix(npu): fix npu dim incorrect squeeze when head num=1

Pull Request - State: closed - Opened by SolenoidWGT 4 months ago

#295 - fix hf internlm nan bug

Pull Request - State: closed - Opened by sallyjunjun 4 months ago

#294 - feat(modeling): support qwen2

Pull Request - State: closed - Opened by SolenoidWGT 4 months ago

#293 - feat(trainer_builder): refactor trainer_builder and preserve optional callable for custom model dispatch function in isp mode

Pull Request - State: closed - Opened by zigzagcai 4 months ago - 5 comments

#292 - [QA] Replace string comparisons in the code with enum type comparisons

Issue - State: closed - Opened by sallyjunjun 4 months ago
Labels: question

#291 - [QA] Review the code logic related to load_hf_llama_pretrained_weights and clean up unused code

Issue - State: closed - Opened by sallyjunjun 4 months ago
Labels: question

#290 - fix(data): fix the unpack data

Pull Request - State: closed - Opened by yingtongxiong 4 months ago

#289 - fix(moe): change moe norm reduced group

Pull Request - State: closed - Opened by blankde 4 months ago - 1 comment

#288 - Feat(*):loong train

Pull Request - State: closed - Opened by huangting4201 4 months ago

#287 - add isp support of huggingface model

Pull Request - State: closed - Opened by sallyjunjun 4 months ago

#286 - [Feature] how to finetuning lora

Issue - State: open - Opened by wen020 4 months ago - 1 comment
Labels: enhancement

#285 - [Bug] RuntimeError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout

Issue - State: open - Opened by kkscilife 4 months ago
Labels: bug

#284 - Hf isp support

Pull Request - State: closed - Opened by sallyjunjun 4 months ago

#283 - feat(varlen): support varlen training for huggingface models

Pull Request - State: closed - Opened by zigzagcai 4 months ago - 5 comments

#282 - feat(pipeline parallel): add zero bubble pipeline parallelism (ZB-H1)

Pull Request - State: closed - Opened by li126com 4 months ago

#281 - feat(setup and docs): add one-click setup and refine docs

Pull Request - State: closed - Opened by zigzagcai 4 months ago

#280 - fix: support newest internevo with deeplink

Pull Request - State: closed - Opened by POI-WX 4 months ago

#279 - [Bug] Only GShard-mode MoE models are supported for conversion to huggingface

Issue - State: open - Opened by Cerberous 4 months ago
Labels: bug

#278 - [Bug] NaN appears when training in bf16 and inferring in fp16

Issue - State: open - Opened by Cerberous 4 months ago
Labels: bug

#277 - fix(huggingface): fix huggingface dataloader when using some huggingface third-party tokenizers

Pull Request - State: closed - Opened by zigzagcai 4 months ago - 1 comment

#276 - Fix(ckpt): fix llama2 loading function

Pull Request - State: closed - Opened by zigzagcai 5 months ago

#275 - feat(checkpoint): TP recomputation communication optimization

Pull Request - State: open - Opened by li126com 5 months ago
