microsoft/DeepSpeedExamples issues and pull requests

#954 - fix: the json format of the training imagenet configuration file

Pull Request - State: open - Opened by muskonu 24 days ago

#953 - Cleanup CODEOWNERS

Pull Request - State: closed - Opened by loadams 25 days ago

#952 - Mosm/torch profile

Pull Request - State: closed - Opened by Dharshan-SK about 1 month ago

#951 - Is there any example about DeepSpeed Zero with Ulysses/Ulysses-offload

Issue - State: open - Opened by LSC527 about 1 month ago

#950 - Domino + PP

Issue - State: open - Opened by XZQshiyu about 1 month ago

#949 - Update references to torchvision

Pull Request - State: closed - Opened by loadams about 1 month ago

#948 - Error when running training example DeepSpeed-Domino/pretrain_gpt3_2.7b.sh

Issue - State: closed - Opened by ZhiyiHu1999 about 2 months ago - 1 comment

#947 - remove-redundant-code

Pull Request - State: closed - Opened by simonJJJ 2 months ago

#946 - Assertion `srcIndex < srcSelectDimSize` failed

Issue - State: open - Opened by boqiny 2 months ago - 1 comment

#945 - add checkpoint

Pull Request - State: open - Opened by zhangsmallshark 2 months ago - 1 comment

#944 - Question to attention computation

Issue - State: open - Opened by yuzhenmao 2 months ago

#943 - KV_cache offload

Issue - State: open - Opened by yuzhenmao 2 months ago

#942 - Example and benchmark of APIs to offload states

Pull Request - State: closed - Opened by tohtana 2 months ago

#941 - A bug in argument parser.

Issue - State: open - Opened by ChenDaiwei-99 2 months ago

#940 - Failed to run Domino example

Issue - State: closed - Opened by lucifer1004 3 months ago - 2 comments

#939 - Update DeepSpeed version requirement to >=0.16.0 for Domino

Pull Request - State: closed - Opened by shenzheyu 3 months ago

#938 - Bump the pip group across 9 directories with 15 updates #3

Pull Request - State: open - Opened by akaday 3 months ago

#937 - Bump the pip group across 2 directories with 1 update #2

Pull Request - State: closed - Opened by akaday 3 months ago - 1 comment

#936 - How can I change the master_port when using deepspeed for multi-GPU on single node, i.e. localhost

Issue - State: open - Opened by lovedoubledan 3 months ago - 4 comments

#935 - RuntimeError: CUDA error: no kernel image is available for execution on the device

Issue - State: closed - Opened by mrpeerat 3 months ago - 1 comment

#934 - No module named 'transformers.deepspeed'

Issue - State: closed - Opened by TianyuJIAA 4 months ago - 2 comments

#933 - Fixed mistake in readme

Pull Request - State: closed - Opened by SCheekati 4 months ago

#932 - Does DeepSpeed's Pipeline-Parallelism optimizer supports skip connections?

Issue - State: open - Opened by RoyMahlab 4 months ago

#931 - [cifar ds training]: Set cuda device during initialization of distributed backend.

Pull Request - State: closed - Opened by jagadish-amd 4 months ago - 3 comments

#930 - Enable reward model offloading option

Pull Request - State: closed - Opened by kfertakis 5 months ago - 2 comments

#929 - Deepspeed-Domino

Pull Request - State: closed - Opened by zhangsmallshark 5 months ago - 3 comments

#928 - After using steps 1, 2, and 3, the test reply content only replies Assistant: </s>.

Issue - State: closed - Opened by jianmomo 5 months ago

#927 - Remove the fixed `eot_token` mechanism for SFT

Pull Request - State: closed - Opened by Xingfu-Yi 5 months ago - 2 comments

#925 - Update requirements for opencv-python CVE

Pull Request - State: closed - Opened by loadams 6 months ago

#924 - AttributeError: 'DeepSpeedEngine' object has no attribute 'model'

Issue - State: closed - Opened by lovychen 6 months ago - 1 comment

#923 - How to calculate training efficiency ,i.e tokens/sec of step 1 fine tuning of llama2 model ?

Issue - State: open - Opened by sowmya04101998 6 months ago

#922 - Actor loss nan and Resizing model embedding

Issue - State: open - Opened by ouyanmei 6 months ago - 1 comment

#921 - DeepNVMe ZeRO-inf Tutorial

Pull Request - State: closed - Opened by jomayeri 6 months ago

#920 - FileNotFoundError: [Errno 2] No such file or directory: 'numactl'

Issue - State: closed - Opened by zhiwentian 6 months ago - 6 comments

#919 - DeepNVMe README.md add xref

Pull Request - State: closed - Opened by stas00 6 months ago

#916 - Update README.md

Pull Request - State: closed - Opened by keshavkowshik 6 months ago

#915 - step2 without any response for a long time

Issue - State: open - Opened by asfadfaf 6 months ago

#914 - DeepNVMe example scripts

Pull Request - State: closed - Opened by tjruwase 6 months ago

#913 - Add openai client to deepspeedometer

Pull Request - State: closed - Opened by delock 6 months ago - 2 comments

#912 - Different zero stage the training memory compute

Issue - State: open - Opened by Arcmoon-Hu 7 months ago

#911 - nvcc fatal : Unsupported gpu architecture 'compute_86' and nvcc fatal : Value 'c++17' is not defined for option 'std'

Issue - State: closed - Opened by Xccanxin 7 months ago - 1 comment

#910 - How to start deepspeed automatically?

Issue - State: closed - Opened by qwerfdsadad 8 months ago - 2 comments

#909 - Consult the first phase.

Issue - State: closed - Opened by csxrzhang 8 months ago - 2 comments

#908 - an error with gradient checkpointing in DeepspeedChat step 3

Issue - State: open - Opened by wangyuwen1999 8 months ago

#907 - Error when using a Qwen model as the Actor Model in step 3 of RLHF on a single node with multiple GPUs

Issue - State: open - Opened by Dakai798 8 months ago - 1 comment

#906 - DeepSpeed-Chat step-1 hanging for a long time

Issue - State: open - Opened by lemon-little 8 months ago

#905 - Enable cpu/xpu support for the benchmarking suite

Pull Request - State: closed - Opened by louie-tsai 9 months ago - 8 comments

#904 - CPU OOM when inferencing Llama3-70B-Chinese-Chat

Issue - State: open - Opened by GORGEOUSLCX 9 months ago

#903 - cannot pickle 'Stream' object

Issue - State: open - Opened by teis-e 9 months ago

#902 - can not run the test-gpt.sh because of assertionError

Issue - State: open - Opened by leachee99 9 months ago

#901 - Does FastGen support long-context and sequence-parallel inference?

Issue - State: open - Opened by AceCoder0 9 months ago

#900 - Add --client-only arg to mii benchmark

Pull Request - State: closed - Opened by delock 10 months ago

#899 - Refactored LLM benchmark code

Pull Request - State: closed - Opened by mrwyattii 10 months ago

#898 - fix bug with queue.empty not being reliable

Pull Request - State: closed - Opened by mrwyattii 10 months ago

#897 - Update tokens_per_sec calculation to work w/ stream and non-stream cases

Pull Request - State: closed - Opened by lekurile 10 months ago

#896 - run-example.sh fails with urllib3.exceptions.ProtocolError: Response ended prematurely

Issue - State: closed - Opened by awan-10 10 months ago - 11 comments

#895 - updating tokens per second to include the token count of generated tokens.

Pull Request - State: closed - Opened by guptha23 10 months ago

#894 - [Error] AutoTune: `connect to host localhost port 22: Connection refused`

Issue - State: open - Opened by wqw547243068 10 months ago

#893 - How to use deepspeed for multi-node and multi-card task in slurm cluster

Issue - State: open - Opened by dshwei 10 months ago

#892 - Does Zero-Inference support TP?

Issue - State: open - Opened by preminstrel 10 months ago - 11 comments

#891 - extend max_prompt_length and input text for 128k evaluation

Pull Request - State: closed - Opened by HeyangQin 10 months ago

#890 - Deepspeed support finetune extra model with lora ?

Issue - State: open - Opened by wanghongqu 10 months ago - 1 comment

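#889 - Python environment paths differ across machines, so DeepSpeed cannot find the Python environment on the other machines after launch; how can this be resolved?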

Issue - State: closed - Opened by liqwertyu 10 months ago

#888 - when calculating actor loss, why the mask is "action_mask[:, start: ] "

Issue - State: closed - Opened by fancghit 11 months ago

#887 - The actor constantly generates ['</s>'] or ['<|endoftext|></s>'] after 200 steps in RLHF with hybrid engine disabled

Issue - State: open - Opened by mousewu 11 months ago - 1 comment

#886 - About multiple-thread attention computation on CPU using zero-inference example.

Issue - State: open - Opened by luckyq 11 months ago

#885 - Suggested GPU to run the demo code of step2_reward_model_finetuning (DeepSpeed-Chat)

Issue - State: open - Opened by wenbozhangjs 11 months ago

#884 - [REQUEST] More fine-grained distributed strategies for RLHF training

Issue - State: open - Opened by youshaox 11 months ago

#883 - The reward value did not increase.

Issue - State: open - Opened by Sun-Shiqi 11 months ago - 1 comment

#882 - Fix response check in call_aml function

Pull Request - State: closed - Opened by HeyangQin 11 months ago

#881 - Update throughput-latency plot script

Pull Request - State: closed - Opened by lekurile 11 months ago

#880 - [Inference Benchmark] set `num_requests` based on `num_clients`

Pull Request - State: closed - Opened by mrwyattii 11 months ago

#879 - Confusion about Deepspeed Inference

Issue - State: open - Opened by ZekaiGalaxy 11 months ago - 1 comment

#878 - `AttributeError: readonly attribute` while trying to run training/HelloDeepSpeed

Issue - State: open - Opened by htjain 11 months ago
