
Reinforcement Learning from Human Feedback (RLHF)

Entity Type: Glossary
ID: reinforcement-learning-from-human-feedback

Definition: A training methodology that fine-tunes language models and other AI systems using human preference judgments. A reward model is first trained on human preference data, typically pairwise comparisons of model outputs; a reinforcement learning algorithm such as Proximal Policy Optimization (PPO) then optimizes the base model against this learned reward. This approach has been central to building helpful, harmless, and honest AI assistants.
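The reward-modeling step can be illustrated with a short sketch. The snippet below is a minimal PyTorch example, assuming scalar reward scores for human-preferred and human-rejected responses to the same prompts; it computes the standard pairwise (Bradley-Terry) preference loss that a reward model is trained to minimize. The `reward_model_loss` helper and the tensor values are hypothetical, for illustration only.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise preference (Bradley-Terry) loss for reward-model training.

    reward_chosen / reward_rejected are the scalar rewards the model assigns
    to the human-preferred and human-rejected responses for the same prompt.
    """
    # Encourage a positive margin between preferred and rejected responses:
    # loss = -log(sigmoid(r_chosen - r_rejected))
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical reward scores for a batch of three preference comparisons.
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, 1.1])
print(reward_model_loss(r_chosen, r_rejected))  # lower when chosen > rejected
```

Once the reward model is trained this way, the RL stage (e.g. PPO) optimizes the base model to produce responses that score highly under it, usually with a KL penalty that keeps the policy close to its starting point.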

Related Terms:
- reinforcement-learning
- human-preference-learning
- reward-modeling
- proximal-policy-optimization
- ai-alignment

Source Urls:
- https://arxiv.org/abs/2203.02155
- https://openai.com/research/learning-from-human-preferences
- https://huggingface.co/blog/rlhf

Tags:
- ai-safety
- alignment
- reinforcement-learning
- training-methods

Status: active

Version: 1.0.0

Created At: 2025-09-10

Last Updated: 2025-09-10