Reinforcement Learning from Human Feedback (RLHF)
Entity Type: Glossary
ID: reinforcement-learning-from-human-feedback
Definition: A training methodology that uses human preferences to fine-tune language models and other AI systems. Human annotators compare or rank model outputs, a reward model is trained on this preference data, and a reinforcement learning algorithm such as Proximal Policy Optimization (PPO) then optimizes the base model against the learned reward. This approach has been crucial for creating helpful, harmless, and honest AI assistants.
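A minimal sketch of the two quantities at the heart of this pipeline, assuming PyTorch; the function names, tensor shapes, and the `beta` value are illustrative and not taken from the cited sources. The first function is the pairwise (Bradley-Terry style) loss used to fit a reward model to human preference comparisons; the second is the KL-penalized reward commonly optimized during the PPO stage so the tuned policy stays close to the reference model.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Pairwise preference loss: push the reward of the human-preferred
    # ("chosen") response above the rejected one: -log sigmoid(r_c - r_r).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

def kl_shaped_reward(reward_model_score: torch.Tensor,
                     logprob_policy: torch.Tensor,
                     logprob_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    # Reward used in the RL stage: the reward-model score minus a KL-style
    # penalty that discourages the tuned policy from drifting far from the
    # original (reference) model. beta is an illustrative coefficient.
    return reward_model_score - beta * (logprob_policy - logprob_ref)

# Dummy usage: four preference pairs with scalar reward-model scores.
chosen = torch.tensor([1.2, 0.7, 2.1, 0.3])
rejected = torch.tensor([0.4, 0.9, 1.0, -0.2])
print(float(reward_model_loss(chosen, rejected)))
```

In practice the reward model is a language model with a scalar head scoring whole responses, and the KL-shaped reward is computed per token during PPO rollouts; the sketch above only isolates the loss and reward terms themselves.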
Related Terms: - reinforcement-learning - human-preference-learning - reward-modeling - proximal-policy-optimization - ai-alignment
Source Urls: - https://arxiv.org/abs/2203.02155 - https://openai.com/research/learning-from-human-preferences - https://huggingface.co/blog/rlhf
Tags: - ai-safety - alignment - reinforcement-learning - training-methods
Status: active
Version: 1.0.0
Created At: 2025-09-10
Last Updated: 2025-09-10