Ecosyste.ms: Issues
An open API service for providing issue and pull request metadata for open source projects.
GitHub / stas00/ml-engineering issues and pull requests
#79 - Support more dtype caculate max flops in mamf-finder tools
Pull Request -
State: open - Opened by BBuf 5 days ago
#77 - Empirical measurements of sustained bandwidth and flops
Issue -
State: open - Opened by fluidnumerics-joe about 1 month ago
#76 - AMD MI250X MAMF efficiency is wrong
Issue -
State: closed - Opened by rlrs about 2 months ago
- 3 comments
#75 - How can operations with the INT8 data type be performed using a GPU accelerator card? Is dequantization required?
Issue -
State: closed - Opened by cpollo55 about 2 months ago
- 5 comments
#74 - MAMF Finder usage of MxNxK differs from BLAS gemm MxNxK
Issue -
State: closed - Opened by GKolling about 2 months ago
- 2 comments
#73 - PDF link in readme doesn't work
Issue -
State: closed - Opened by sytelus about 2 months ago
- 3 comments
#72 - GPU utilization monitoring
Issue -
State: closed - Opened by fortminors about 2 months ago
- 6 comments
#71 - Performance Profiling
Issue -
State: closed - Opened by jeromeku 2 months ago
- 2 comments
#70 - Namespaces as a solution for performance measurements in shared clusters.
Pull Request -
State: closed - Opened by BrunoScaglione 2 months ago
- 4 comments
#69 - An Enhanced PyTorch distributed benchmarking script with Machine Learning for enhancing speed and efficiently of resources.
Pull Request -
State: closed - Opened by RahulVadisetty91 3 months ago
- 1 comment
#66 - [Question] `FSDP` vs `Deepspeed ZeRO3 / ZeRO++`
Issue -
State: closed - Opened by jeromeku 3 months ago
- 2 comments
#65 - Change "Gbps" to "GBps" to fix a tiny typo in network.README.md
Pull Request -
State: closed - Opened by Txxx926 3 months ago
#64 - grad checkpoint tiny error
Pull Request -
State: closed - Opened by baochi0212 3 months ago
- 4 comments
#63 - fix tiny error
Pull Request -
State: closed - Opened by baochi0212 3 months ago
#62 - slurm job array change nodes
Pull Request -
State: closed - Opened by ethanhe42 3 months ago
- 1 comment
#61 - slurm job array change nodes
Issue -
State: closed - Opened by ethanhe42 3 months ago
- 1 comment
#60 - Update GH200 MAMF
Pull Request -
State: closed - Opened by yaolu 4 months ago
- 1 comment
#59 - Max Achievable TFLOP/s on H100 without warmup
Pull Request -
State: closed - Opened by OrenLeung 4 months ago
- 1 comment
#58 - MAMF - GH200
Issue -
State: closed - Opened by frankschae 4 months ago
- 10 comments
#56 - MAMAF + AMD debug
Pull Request -
State: closed - Opened by stas00 4 months ago
#55 - fix table
Pull Request -
State: closed - Opened by 152334H 5 months ago
- 1 comment
#54 - fix typo
Pull Request -
State: closed - Opened by yaolu 5 months ago
- 1 comment
#53 - add citation
Pull Request -
State: closed - Opened by stas00 6 months ago
#52 - Adding another logbook (kinda)
Issue -
State: closed - Opened by boweiliu 7 months ago
- 4 comments
#50 - Fix in ai-battlefield.md
Pull Request -
State: closed - Opened by andy-yangz 7 months ago
- 1 comment
#49 - Fix incorrect Nvidia retired GPU page size mention.
Pull Request -
State: closed - Opened by cf-natali 8 months ago
- 1 comment
#48 - Fix a couple formulas rendering.
Pull Request -
State: closed - Opened by cf-natali 8 months ago
- 1 comment
#47 - MFU + HFU redux
Pull Request -
State: closed - Opened by stas00 8 months ago
- 2 comments
#46 - SWIGLU: clarifications
Pull Request -
State: closed - Opened by stas00 8 months ago
- 4 comments
#45 - Question about the right hidden dim when using SwiGLU
Issue -
State: closed - Opened by Thytu 8 months ago
- 3 comments
#44 - fix bf16 <-> fp16 dtype statement
Pull Request -
State: closed - Opened by stas00 8 months ago
#43 - fix tpu v4 hbm2 bw
Pull Request -
State: closed - Opened by stas00 8 months ago
#42 - fix typo in emulate multi node
Pull Request -
State: closed - Opened by Thytu 8 months ago
- 1 comment
#41 - Question about changing precision post training
Issue -
State: closed - Opened by Thytu 8 months ago
- 2 comments
#40 - TPU v4 has 1,200GB/s of mem bandwidth and not 2,400, right?
Issue -
State: closed - Opened by rodrigo-f-nogueira 8 months ago
- 1 comment
#39 - Fix broken links.
Pull Request -
State: closed - Opened by cf-natali 8 months ago
- 1 comment
#38 - [AI battlefield] Update NVLink bandwidths to uni-directional numbers.
Pull Request -
State: closed - Opened by cf-natali 8 months ago
- 1 comment
#36 - Add num_processes and num_machines to accelerate launcher
Pull Request -
State: closed - Opened by adamlin120 8 months ago
- 1 comment
#35 - [Network] Complete missing sentence
Pull Request -
State: closed - Opened by patrickvonplaten 9 months ago
- 1 comment
#34 - [Network] Some typos in the README
Pull Request -
State: closed - Opened by patrickvonplaten 9 months ago
- 1 comment
#32 - discuss the solutions to Not fully recovering spikes
Issue -
State: closed - Opened by pengzhangzhi 9 months ago
- 7 comments
#31 - Update README.md in network chapter, update bandwidth info
Pull Request -
State: closed - Opened by kisseternity 9 months ago
- 1 comment
#30 - Conflicting opinions about streaming data from cloud storage?
Issue -
State: closed - Opened by hacobe 9 months ago
- 2 comments
#29 - Update ai-battlefield.md
Pull Request -
State: closed - Opened by findmyway 9 months ago
- 1 comment
#28 - Quarto Site
Issue -
State: closed - Opened by saforem2 9 months ago
- 3 comments
#27 - Fix single node networking analysis
Pull Request -
State: closed - Opened by haidark 9 months ago
- 1 comment
#26 - Update README.md
Pull Request -
State: closed - Opened by pitmonticone 10 months ago
- 1 comment
#25 - Reorg 2
Pull Request -
State: closed - Opened by stas00 10 months ago
#24 - Add flash attention to overview
Pull Request -
State: closed - Opened by Quentin-Anthony 10 months ago
- 1 comment
#23 - Clarification for gradient memory in mixed precision training
Issue -
State: closed - Opened by SumanthRH 10 months ago
- 3 comments
#22 - Add cookbook and model co-design refs
Pull Request -
State: closed - Opened by Quentin-Anthony 10 months ago
- 1 comment
#21 - restructuring tools
Pull Request -
State: closed - Opened by stas00 10 months ago
#20 - pip install -r build/requirements.txt fails due to github_md_utils
Issue -
State: closed - Opened by ebowman 10 months ago
- 3 comments
#19 - Fix typo in README.md
Pull Request -
State: closed - Opened by nicolapace 10 months ago
- 1 comment
#18 - fix typo
Pull Request -
State: closed - Opened by g1y5x3 11 months ago
- 1 comment
#17 - Update emulate-multi-node.md
Pull Request -
State: closed - Opened by saforem2 11 months ago
- 2 comments
#16 - Fix typo
Pull Request -
State: closed - Opened by pitmonticone 11 months ago
- 1 comment
#15 - Improve folder structure
Issue -
State: closed - Opened by heyimjonas 12 months ago
- 3 comments
#14 - Update ai-battlefield.md
Pull Request -
State: closed - Opened by eryk-mazus 12 months ago
- 1 comment
#13 - Daisy chain batch jobs
Issue -
State: closed - Opened by adammoody 12 months ago
- 1 comment
#12 - Update ai-battlefield.md
Pull Request -
State: closed - Opened by evelynmitchell 12 months ago
- 1 comment
#11 - Update GPU guide with IPU info
Pull Request -
State: closed - Opened by thecharlieblake about 1 year ago
- 1 comment
#10 - Typo fixes
Pull Request -
State: closed - Opened by BioGeek about 1 year ago
- 3 comments
#9 - GPU requirements and cost estimation.
Issue -
State: closed - Opened by Anindyadeep about 1 year ago
- 4 comments
#8 - Minor Typo in emulate multi node
Issue -
State: closed - Opened by anindya-saha about 1 year ago
- 4 comments
#7 - [feat] md2pdf
Pull Request -
State: closed - Opened by pengzhangzhi about 1 year ago
- 13 comments
#6 - convert markdown to pdf
Issue -
State: closed - Opened by pengzhangzhi about 1 year ago
- 10 comments
#5 - Missing `hparams` section
Issue -
State: closed - Opened by jvmncs about 1 year ago
- 2 comments
#4 - PaLM training instability
Pull Request -
State: closed - Opened by cx0 about 1 year ago
- 1 comment
#3 - Fix typos
Pull Request -
State: closed - Opened by pitmonticone about 1 year ago
- 2 comments
#2 - Convert to bfloat16 failing
Issue -
State: closed - Opened by mhillebrand over 1 year ago
- 2 comments
#1 - Parallel training hangs
Issue -
State: closed - Opened by mhillebrand over 2 years ago
- 10 comments