microsoft/mup issues and pull requests

#81 - Add support for dataclasses

Pull Request - State: open - Opened by francois-rozet about 1 month ago - 1 comment

#80 - More options of input/output types in coord_check

Issue - State: open - Opened by francois-rozet about 1 month ago

#79 - CNN utility

Pull Request - State: closed - Opened by JeremyCCHsu 3 months ago

#78 - How to use with SSL methods like DINOv2?

Issue - State: open - Opened by josephcappadona 4 months ago

#77 - MuP for RNNs

Issue - State: open - Opened by norikazu99 4 months ago

#76 - Not getting perf improvements from muP at ~1.5B scale

Issue - State: open - Opened by gordicaleksa 4 months ago

#75 - fix: adopt mup/Transformers API for torch2.3

Pull Request - State: open - Opened by emergenz 5 months ago

#74 - MuP for Mamba

Issue - State: open - Opened by norikazu99 6 months ago

#73 - Refactor: Addressing Sources of User Error

Pull Request - State: open - Opened by thomasfortin1 7 months ago - 1 comment

#72 - Support FSDP usage

Pull Request - State: open - Opened by janEbert 7 months ago - 1 comment

#71 - Increasing coord check for the network output

Issue - State: open - Opened by AkshitaB 8 months ago - 2 comments

#70 - mu parametrization for gated-mlp and group-query attention

Issue - State: open - Opened by ftgreat 8 months ago

#69 - Reproducing Figure 1 using 'examples/Transformer/main.py'

Issue - State: open - Opened by jndean 11 months ago

#68 - coord_check for model that returns loss function directly

Issue - State: open - Opened by ad8e 11 months ago

#67 - Reproducing the validation accuracy vs learning rates curve on ResNet

Issue - State: open - Opened by liulei277 11 months ago - 1 comment

#66 - Questions for training gpt-2 using mup

Issue - State: closed - Opened by jiangjiadi about 1 year ago - 6 comments

#65 - add width_mult to optimizer dict

Pull Request - State: open - Opened by marcobellagente93 about 1 year ago

#64 - About Learning rate decay

Issue - State: open - Opened by afcruzs about 1 year ago - 2 comments

#63 - Demo notebook

Pull Request - State: closed - Opened by edwardjhu about 1 year ago

#62 - Unclear `assert_hidden_size_inf` triggers

Issue - State: closed - Opened by dreavjr about 1 year ago - 1 comment

#61 - dim_feedforward

Issue - State: closed - Opened by dreavjr about 1 year ago

#60 - Usage with torch.compile in Pytorch 2?

Issue - State: open - Opened by dreavjr about 1 year ago - 2 comments

#59 - FSDP support?

Issue - State: open - Opened by platers about 1 year ago - 3 comments

#58 - Interpreting jitter in coordcheck

Issue - State: closed - Opened by leenachennuru about 1 year ago - 2 comments

#57 - Some questions about the implementation of muP.

Issue - State: open - Opened by lepodl about 1 year ago

#56 - µTransfer across batch size && weight decay setting

Issue - State: open - Opened by PanYue2023 over 1 year ago

#55 - _rescale_parameters() inconsistent with the paper for the tied embedding scenario?

Issue - State: open - Opened by ofivite over 1 year ago - 2 comments

#54 - Is it possible to also scale the depth of the model?

Issue - State: open - Opened by ricomnl over 1 year ago - 5 comments

#53 - Once the best HPs have been found, does the final model have to be trained with `mup` or can one just use the found HPs and train the model in a standard way?

Issue - State: closed - Opened by ricomnl over 1 year ago

#52 - Reproducing the training loss vs learning rates curve on MLP

Issue - State: closed - Opened by jhj0411jhj over 1 year ago - 5 comments

#51 - Warmup schedule when changing the number of tokens/steps (GPT-3 experiment detail)

Issue - State: open - Opened by sashaDoubov over 1 year ago

#48 - Positional Embeddings should be MuReadout parameters ?

Issue - State: open - Opened by codedecde over 1 year ago - 2 comments

#47 - Question about the difference between init code and paper

Issue - State: closed - Opened by midori1 over 1 year ago - 2 comments

#46 - Does mup support fine tuning pretrained models

Issue - State: closed - Opened by jhj0411jhj over 1 year ago - 2 comments

#45 - Embedding Multiplier for Transformer - Clarification

Issue - State: closed - Opened by sashaDoubov over 1 year ago - 2 comments

#43 - Are Sequentials with list comprehension handled incorrectly?

Issue - State: open - Opened by RobertBaruch over 1 year ago - 2 comments

#42 - interpreting coord checks

Issue - State: closed - Opened by llucid-97 over 1 year ago - 2 comments

#41 - in mlp example: 2 problems

Issue - State: open - Opened by yjjinjie over 1 year ago - 1 comment

#40 - Questions on learning schedule and binary classification

Issue - State: closed - Opened by FlamingHorizon over 1 year ago - 12 comments

#39 - Can base model be larger than target model?

Issue - State: closed - Opened by jhj0411jhj over 1 year ago - 3 comments

#38 - coord check plot improvements

Pull Request - State: closed - Opened by TevenLeScao almost 2 years ago - 1 comment

#37 - Allowing users to create their own shapes

Pull Request - State: closed - Opened by TevenLeScao almost 2 years ago

#36 - Should query layers in self-attention be initialized to 0 in practice?

Issue - State: closed - Opened by xinwuyun almost 2 years ago - 2 comments

#35 - Plot bugfix

Pull Request - State: closed - Opened by TevenLeScao almost 2 years ago

#33 - fix: dtype for newer torch versions

Pull Request - State: closed - Opened by zanussbaum almost 2 years ago - 1 comment

#32 - Proper error return in coord_check.py

Pull Request - State: closed - Opened by TevenLeScao almost 2 years ago - 1 comment

#31 - Finetuning a Pretrained Model Using MuP

Issue - State: closed - Opened by zanussbaum almost 2 years ago - 3 comments

#30 - Issue in reproducing the training loss vs learning rates curve

Issue - State: closed - Opened by NicolasWinckler almost 2 years ago - 5 comments

#29 - Are parameters with no "infinite" dimensions allowed?

Issue - State: closed - Opened by callumm-graphcore about 2 years ago - 5 comments

#28 - LayerNorm Gain and Bias Multipliers

Issue - State: closed - Opened by AWildridge about 2 years ago - 2 comments

#27 - MuP Coord Check not Working with Electra Style Model

Issue - State: closed - Opened by zanussbaum about 2 years ago - 8 comments

#26 - Has MuP been tested on segmentation models?

Issue - State: open - Opened by isdj about 2 years ago - 4 comments

#25 - Should `base=None` be used in `set_base_shapes` for model used for tuning?

Issue - State: open - Opened by callumm-graphcore about 2 years ago - 2 comments

#24 - Batch size, Seq len, Step Transferring

Issue - State: closed - Opened by timothyxp about 2 years ago - 2 comments

#23 - Conv1D Coord check looks good (I think), but μTransfer does not seem to work?

Issue - State: closed - Opened by zanussbaum over 2 years ago - 20 comments

#22 - Coord check looks good, but μTransfer is not working as expected

Issue - State: closed - Opened by shjwudp over 2 years ago - 6 comments

#21 - Does mup support Swin Transformer v2 model?

Issue - State: open - Opened by shiyf129 over 2 years ago - 2 comments

#20 - muP for contrastive losses

Issue - State: closed - Opened by xwjabc over 2 years ago - 2 comments

#19 - missing os import in mup/examples/MLP/main.py ?

Issue - State: closed - Opened by james-simon over 2 years ago - 1 comment

#18 - mu parametrization for channel attention

Issue - State: closed - Opened by xwjabc over 2 years ago - 5 comments

#17 - mu parametrization for multi-head attention / grouped convolution

Issue - State: closed - Opened by xwjabc over 2 years ago - 3 comments

#16 - Optimizers for coord check

Issue - State: closed - Opened by xwjabc over 2 years ago - 2 comments

#15 - Torchdistx

Pull Request - State: closed - Opened by edwardjhu over 2 years ago - 2 comments

#14 - Coord-check for conv1d

Issue - State: closed - Opened by bob80333 over 2 years ago - 17 comments

#13 - ResNet readout_zero_init=True?

Issue - State: closed - Opened by D-X-Y over 2 years ago - 2 comments

#12 - Hyperparameter search on base models

Issue - State: closed - Opened by davisyoshida over 2 years ago - 2 comments

#11 - integration with Flax?

Issue - State: open - Opened by nestordemeure over 2 years ago - 4 comments

#10 - Examples with ConvNets

Issue - State: closed - Opened by Aboussejra over 2 years ago - 2 comments

#9 - Does MuReadout apply to all outputs on which loss is computed?

Issue - State: closed - Opened by jaivardhankapoor over 2 years ago - 2 comments
