What is Direct Preference Optimization?
LLM
Fine Tuning
Reinforcement Learning from Human Feedback (RLHF) is the current state-of-the-art technique for fine-tuning LLMs. However, a recent and much simpler alternative was proposed in the paper titled 'Direct Preference Optimization: Your Language Model is Secretly a Reward Model'.
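
To give a rough sense of what the rest of this post unpacks, the heart of DPO (as stated in that paper) is a single classification-style loss over preference pairs, trained directly against a frozen reference model rather than a separately learned reward model:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]$$

Here $y_w$ and $y_l$ are the preferred and dispreferred completions for a prompt $x$, $\pi_{\mathrm{ref}}$ is the frozen reference (typically SFT) model, $\sigma$ is the logistic function, and $\beta$ controls how far the fine-tuned policy $\pi_\theta$ is allowed to drift from the reference.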