Ecosyste.ms: Issues

An open API service providing issue and pull request metadata for open source projects.

GitHub / RLHFlow/Online-RLHF issues and pull requests

#32 - Unexpected keyword argument 'beta' in DPOTrainer initialization

Issue - State: open - Opened by YijuGuo about 2 months ago
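
This error usually means the repo's DPO scripts are being run against a newer trl release than they were written for: recent trl versions moved the DPO hyperparameters, including beta, from the DPOTrainer constructor into DPOConfig, so passing beta directly raises the TypeError in the title. A minimal sketch of the newer calling convention, assuming a recent trl release (roughly 0.12 or later); the model and dataset names are placeholders, not the repo's actual configuration:

    # In recent trl releases, `beta` lives on DPOConfig rather than being
    # passed to DPOTrainer directly (which now raises a TypeError).
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # placeholder
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    args = DPOConfig(
        output_dir="dpo-out",
        beta=0.1,  # formerly DPOTrainer(..., beta=0.1)
        per_device_train_batch_size=1,
    )
    trainer = DPOTrainer(
        model=model,
        args=args,
        train_dataset=load_dataset("trl-lib/ultrafeedback_binarized", split="train"),
        processing_class=tokenizer,  # named `tokenizer=` in older releases
    )
    trainer.train()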

#31 - Inquiry about the version of alignment-handbook used in the installation guide

Issue - State: open - Opened by YijuGuo about 2 months ago - 1 comment

#30 - Unable to install axolotl due to missing bitsandbytes==0.45.0 dependency

Issue - State: open - Opened by YijuGuo about 2 months ago - 2 comments

#29 - Questions about strategies for selecting "chosen" data

Issue - State: open - Opened by nantenT 2 months ago - 2 comments

#28 - About the results of vanilla DPO

Issue - State: open - Opened by lucasliunju 3 months ago - 1 comment

#27 - Reward-KL Comparison

Issue - State: open - Opened by vincezh2000 3 months ago - 1 comment
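
Reward-KL comparisons of this kind typically plot the average reward a policy attains against its KL divergence from the reference policy, tracing out the frontier implied by the standard KL-regularized RLHF objective. For reference, the textbook form, with β the KL coefficient (not necessarily the exact quantity plotted in the issue):

    \max_{\pi}\; \mathbb{E}_{x \sim d_0,\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
      - \beta\, \mathbb{E}_{x \sim d_0}\!\left[ \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\Vert\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]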

#26 - Update README.md

Pull Request - State: closed - Opened by ElegantLin 3 months ago

#25 - add v2 models

Pull Request - State: closed - Opened by xypan0 3 months ago

#24 - SFT training objective

Issue - State: open - Opened by ljb121002 4 months ago - 3 comments
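
For context, the SFT objective in pipelines like this one is ordinarily the next-token negative log-likelihood over the response, often with the prompt tokens masked out of the loss; whether and how the prompt is masked is exactly the kind of detail a question like this turns on. The standard form:

    \mathcal{L}_{\mathrm{SFT}}(\theta)
      = -\, \mathbb{E}_{(x, y) \sim \mathcal{D}}
        \sum_{t=1}^{|y|} \log \pi_{\theta}\!\left( y_t \mid x, y_{<t} \right)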

#23 - Negative reward when serving ArmoRM-Llama3-8B-v0.1

Issue - State: open - Opened by maoliyuan 5 months ago - 4 comments

#22 - Question about CUDA/NVCC setups

Issue - State: open - Opened by rqzhangberkeley 6 months ago - 1 comment

#21 - Question about the iteration dataset (information leakage)?

Issue - State: closed - Opened by hhhhzzzzz 6 months ago - 8 comments

#20 - Questions about Nectar Datasets

Issue - State: open - Opened by XinZhao0211 6 months ago - 4 comments

#19 - pip's dependency conflict: accelerate

Issue - State: closed - Opened by liwd190019 6 months ago - 2 comments

#18 - Reference policy ablations

Issue - State: closed - Opened by yesiam-png 7 months ago - 9 comments

#17 - Phi3 has a nearly constant DPO loss of 0.69xx

Issue - State: open - Opened by Arnav0400 7 months ago - 6 comments
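
The value in the title is diagnostic: 0.693 ≈ ln 2 is exactly what the DPO loss evaluates to when the policy's implicit reward margin between chosen and rejected responses is zero, i.e. when the policy has not moved relative to the reference. A loss pinned at that level usually indicates gradients are not updating the model (for example, frozen weights or a misconfigured optimizer) rather than a property of the data:

    \mathcal{L}_{\mathrm{DPO}}
      = -\log \sigma\!\left(
          \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right)
    % With \pi_\theta = \pi_{\mathrm{ref}} both log-ratios vanish, so:
    \mathcal{L}_{\mathrm{DPO}} = -\log \sigma(0) = \log 2 \approx 0.693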

#16 - large max_steps?

Issue - State: closed - Opened by hunterlang 7 months ago - 1 comment

#15 - One question about the loss function given a gold reward model

Issue - State: closed - Opened by srzer 8 months ago - 2 comments

#14 - numpy version and transformers version

Issue - State: closed - Opened by WayXG 8 months ago - 1 comment

#13 - More RLHF algorithms in the implementation

Issue - State: closed - Opened by WayXG 8 months ago - 1 comment

#12 - question about dpo dataset

Issue - State: closed - Opened by LiuChen19960902 8 months ago - 1 comment

#11 - Distributed training in stage 3.3 keeps hanging

Issue - State: closed - Opened by srzer 8 months ago - 2 comments

#10 - corrected max_model_len to be max_input_length

Pull Request - State: closed - Opened by eddyliu5 8 months ago - 2 comments

#9 - update the figure in readme

Issue - State: closed - Opened by WayXG 8 months ago - 1 comment

#8 - questions about dpo

Issue - State: closed - Opened by hong-xl 8 months ago - 5 comments

#7 - Iterative pipeline question

Issue - State: closed - Opened by matouk98 8 months ago - 4 comments

#6 - Model evaluation issue

Issue - State: closed - Opened by matouk98 8 months ago - 5 comments

#5 - Questions about training data during iterative DPO

Issue - State: closed - Opened by hong-xl 8 months ago - 3 comments

#4 - Fail to load weight from pair-preference-model-LLaMA3-8B

Issue - State: open - Opened by matouk98 8 months ago - 2 comments

#3 - Cannot Reproduce the DPO Checkpoint

Issue - State: closed - Opened by gesy17 9 months ago - 1 comment

#2 - How to train SFT on an RTX 4090?

Issue - State: closed - Opened by utrobinmv 9 months ago - 1 comment

#1 - Fix readme typo

Pull Request - State: closed - Opened by erjanmx 9 months ago - 1 comment