Ecosyste.ms: Issues

An open API service for providing issue and pull request metadata for open source projects.

GitHub / stas00/ml-engineering issues and pull requests

#68 - f

Issue - State: closed - Opened by sbhavani 15 days ago

#67 - bb

Issue - State: closed - Opened by sbhavani 15 days ago

#66 - [Question] `FSDP` vs `Deepspeed ZeRO3 / ZeRO++`

Issue - State: closed - Opened by jeromeku 20 days ago - 2 comments

#65 - Change "Gbps" to "GBps" to fix a tiny typo in network.README.md

Pull Request - State: closed - Opened by Txxx926 24 days ago

#64 - grad checkpoint tiny error

Pull Request - State: closed - Opened by baochi0212 24 days ago - 4 comments

#63 - fix tiny error

Pull Request - State: closed - Opened by baochi0212 24 days ago

#62 - slurm job array change nodes

Pull Request - State: closed - Opened by ethanhe42 26 days ago - 1 comment

#61 - slurm job array change nodes

Issue - State: closed - Opened by ethanhe42 about 1 month ago - 1 comment

#60 - Update GH200 MAMF

Pull Request - State: closed - Opened by yaolu about 1 month ago - 1 comment

#59 - Max Achievable TFLOP/s on H100 without warmup

Pull Request - State: closed - Opened by OrenLeung about 1 month ago - 1 comment

#58 - MAMF - GH200

Issue - State: closed - Opened by frankschae about 1 month ago - 10 comments

#56 - MAMAF + AMD debug

Pull Request - State: closed - Opened by stas00 about 2 months ago

#55 - fix table

Pull Request - State: closed - Opened by 152334H 2 months ago - 1 comment

#54 - fix typo

Pull Request - State: closed - Opened by yaolu 3 months ago - 1 comment

#53 - add citation

Pull Request - State: closed - Opened by stas00 4 months ago

#52 - Adding another logbook (kinda)

Issue - State: open - Opened by boweiliu 4 months ago - 2 comments

#50 - Fix in ai-battlefield.md

Pull Request - State: closed - Opened by andy-yangz 5 months ago - 1 comment

#49 - Fix incorrect Nvidia retired GPU page size mention.

Pull Request - State: closed - Opened by cf-natali 5 months ago - 1 comment

#48 - Fix a couple formulas rendering.

Pull Request - State: closed - Opened by cf-natali 5 months ago - 1 comment

#47 - MFU + HFU redux

Pull Request - State: closed - Opened by stas00 5 months ago - 2 comments

#46 - SWIGLU: clarifications

Pull Request - State: closed - Opened by stas00 5 months ago - 4 comments

#45 - Question about the right hidden dim when using SwiGLU

Issue - State: closed - Opened by Thytu 5 months ago - 3 comments

#44 - fix bf16 <-> fp16 dtype statement

Pull Request - State: closed - Opened by stas00 6 months ago

#43 - fix tpu v4 hbm2 bw

Pull Request - State: closed - Opened by stas00 6 months ago

#42 - fix typo in emulate multi node

Pull Request - State: closed - Opened by Thytu 6 months ago - 1 comment

#41 - Question about changing precision post training

Issue - State: closed - Opened by Thytu 6 months ago - 2 comments

#40 - TPU v4 has 1,200GB/s of mem bandwidth and not 2,400, right?

Issue - State: closed - Opened by rodrigo-f-nogueira 6 months ago - 1 comment

#39 - Fix broken links.

Pull Request - State: closed - Opened by cf-natali 6 months ago - 1 comment

#38 - [AI battlefield] Update NVLink bandwidths to uni-directional numbers.

Pull Request - State: closed - Opened by cf-natali 6 months ago - 1 comment

#37 - ML

Issue - State: closed - Opened by lelikdr 6 months ago

#36 - Add num_processes and num_machines to accelerate launcher

Pull Request - State: closed - Opened by adamlin120 6 months ago - 1 comment

#35 - [Network] Complete missing sentence

Pull Request - State: closed - Opened by patrickvonplaten 7 months ago - 1 comment

#34 - [Network] Some typos in the README

Pull Request - State: closed - Opened by patrickvonplaten 7 months ago - 1 comment

#32 - discuss the solutions to Not fully recovering spikes

Issue - State: closed - Opened by pengzhangzhi 7 months ago - 7 comments

#31 - Update README.md in network chapter, update bandwidth info

Pull Request - State: closed - Opened by kisseternity 7 months ago - 1 comment

#30 - Conflicting opinions about streaming data from cloud storage?

Issue - State: closed - Opened by hacobe 7 months ago - 2 comments

#29 - Update ai-battlefield.md

Pull Request - State: closed - Opened by findmyway 7 months ago - 1 comment

#28 - Quarto Site

Issue - State: closed - Opened by saforem2 7 months ago - 3 comments

#27 - Fix single node networking analysis

Pull Request - State: closed - Opened by haidark 7 months ago - 1 comment

#26 - Update README.md

Pull Request - State: closed - Opened by pitmonticone 7 months ago - 1 comment

#25 - Reorg 2

Pull Request - State: closed - Opened by stas00 7 months ago

#24 - Add flash attention to overview

Pull Request - State: closed - Opened by Quentin-Anthony 7 months ago - 1 comment

#23 - Clarification for gradient memory in mixed precision training

Issue - State: closed - Opened by SumanthRH 8 months ago - 3 comments

#22 - Add cookbook and model co-design refs

Pull Request - State: closed - Opened by Quentin-Anthony 8 months ago - 1 comment

#21 - restructuring tools

Pull Request - State: closed - Opened by stas00 8 months ago

#20 - pip install -r build/requirements.txt fails due to github_md_utils

Issue - State: closed - Opened by ebowman 8 months ago - 3 comments

#19 - Fix typo in README.md

Pull Request - State: closed - Opened by nicolapace 8 months ago - 1 comment

#18 - fix typo

Pull Request - State: closed - Opened by g1y5x3 9 months ago - 1 comment

#17 - Update emulate-multi-node.md

Pull Request - State: closed - Opened by saforem2 9 months ago - 2 comments

#16 - Fix typo

Pull Request - State: closed - Opened by pitmonticone 9 months ago - 1 comment

#15 - Improve folder structure

Issue - State: closed - Opened by heyimjonas 10 months ago - 3 comments

#14 - Update ai-battlefield.md

Pull Request - State: closed - Opened by eryk-mazus 10 months ago - 1 comment

#13 - Daisy chain batch jobs

Issue - State: closed - Opened by adammoody 10 months ago - 1 comment

#12 - Update ai-battlefield.md

Pull Request - State: closed - Opened by evelynmitchell 10 months ago - 1 comment

#11 - Update GPU guide with IPU info

Pull Request - State: closed - Opened by thecharlieblake 10 months ago - 1 comment

#10 - Typo fixes

Pull Request - State: closed - Opened by BioGeek 10 months ago - 3 comments

#9 - GPU requirements and cost estimation.

Issue - State: closed - Opened by Anindyadeep 11 months ago - 4 comments

#8 - Minor Typo in emulate multi node

Issue - State: closed - Opened by anindya-saha 11 months ago - 4 comments

#7 - [feat] md2pdf

Pull Request - State: closed - Opened by pengzhangzhi 11 months ago - 13 comments

#6 - convert markdown to pdf

Issue - State: closed - Opened by pengzhangzhi 11 months ago - 10 comments

#5 - Missing `hparams` section

Issue - State: closed - Opened by jvmncs 11 months ago - 2 comments

#4 - PaLM training instability

Pull Request - State: closed - Opened by cx0 11 months ago - 1 comment

#3 - Fix typos

Pull Request - State: closed - Opened by pitmonticone 12 months ago - 2 comments

#2 - Convert to bfloat16 failing

Issue - State: closed - Opened by mhillebrand about 1 year ago - 2 comments

#1 - Parallel training hangs

Issue - State: closed - Opened by mhillebrand over 2 years ago - 10 comments