
Generative Reward Modeling (GRM) and GRPO

Part 1: Generative Reward Modeling (GRM) and GRPO: Evaluation and Traditional PPO

The rapid evolution of Large Language Models (LLMs) has unlocked unprecedented capabilities, yet harnessing their full potential demands sophisticated fine-tuning. At Rihal, our commitment to delivering cutting-edge AI solutions means we must continuously explore and implement the most advanced methodologies. This is crucial for developing models that are not only powerful but also accurate, reliable, and deeply aligned with user needs. Adopting state-of-the-art techniques like Generative Reward Modeling (GRM) coupled with Group Relative Policy Optimization (GRPO) is therefore a strategic imperative for us. This approach, which moves beyond simple feedback to richer, descriptive critiques, allows Rihal to build superior, client-centric AI, maintain a competitive edge through innovation, and enhance model alignment with nuanced human values. By investing in such advanced paradigms, we aim to push the boundaries of LLM performance, ensuring our solutions are insightful, dependable, and truly beneficial. This article and its successor delve into the principles behind these techniques.

Evaluation

To see where these techniques fit, let's begin with how models are evaluated. One of the most effective methods, second only to human evaluation, is LLM-as-a-Judge, where a language model itself is used as an independent judge. This approach commonly relies on two main evaluation methods.

1. Pairwise Comparisons and the Elo System

In this scenario, two models—let’s call them Model A and Model B—are given the same task. They each generate a response, and an LLM judge, unaware of which model produced which answer, selects the superior one. After many such "duels," each model receives a rating using the Elo system, similar to how chess players are rated. This enables us to sort models by performance based on how often one "outperforms" another on specific tasks.
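To make the ranking concrete, here is a minimal Python sketch of a single Elo update after one judge "duel." The K-factor of 32 and the helper names are illustrative choices, not part of any specific leaderboard implementation.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after the judge picks a winner."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    rating_a += k * (s_a - e_a)
    rating_b += k * ((1.0 - s_a) - (1.0 - e_a))
    return rating_a, rating_b

# Example: Model A (1200) beats Model B (1250) in one pairwise comparison.
print(elo_update(1200, 1250, a_won=True))  # A gains ~18 points, B loses ~18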

2. Reference-Based Scoring on a 10-Point Scale

The second method assumes the presence of a ground truth answer. Here, the judge receives the task, a generated response, and the reference answer. Its task is to assign a score from 1 to 10, evaluating aspects like accuracy, completeness, logic, and style. This allows for a more nuanced understanding of how close a model's response is to an ideal one, especially in subjective tasks.
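As a rough illustration of the inputs involved, a reference-based judging prompt might be assembled along these lines. The wording and the criteria list are simplified assumptions for the sketch, not a fixed standard.

def build_judge_prompt(task: str, response: str, reference: str) -> str:
    """Assemble the three inputs the judge needs: task, candidate, reference."""
    return (
        "You are an impartial judge.\n"
        f"Task:\n{task}\n\n"
        f"Candidate response:\n{response}\n\n"
        f"Reference answer:\n{reference}\n\n"
        "Rate the candidate from 1 to 10 for accuracy, completeness, logic, "
        "and style relative to the reference. Reply with a single integer on the last line."
    )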

Both methods are widely used: the first is useful for leaderboard creation and identifying top-performing models, while the second offers deeper insights and diagnostics into generation quality.

We will focus on the second method, as our subsequent exploration builds upon its principles.

It's Useful for the Judge to Reflect

When a model acts as a judge and rates answers on a 10-point scale relative to a reference, it becomes crucial for the model to reason out loud before giving a final score. This "reasoning out loud" approach—where the model first articulates its analysis of the response's strengths and weaknesses before delivering a verdict—significantly improves the consistency and accuracy of the evaluation.

At Rihal, we observed that if you immediately ask the model for a score, it often relies on intuition without analyzing why the response is good or bad. This makes the scores less reliable—the same response might receive different ratings depending on prompt wording or the model’s "mood." However, if the model is forced to first explain what is right in the response, identify logical gaps, and assess how well the style fits the context, the final score tends to align more closely with human judgment.

This reflective step acts as a self-check: the model structures its understanding, gathers arguments, and thereby stabilizes the outcome. We noticed that this method reduces score variance and increases correlation with human ratings.
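A minimal way to enforce this "analysis first, verdict second" structure is to bake it into the prompt and parse only the final score line. The "Score:" marker below is an assumed convention for this sketch, not a requirement of any particular judge model.

import re

REFLECT_SUFFIX = (
    "First write a short Analysis section discussing strengths, weaknesses, "
    "logical gaps, and style fit. Then, on the final line, write 'Score: <1-10>'."
)

def parse_final_score(judge_output: str) -> int | None:
    """Read the verdict only from the last 'Score:' line, ignoring the analysis text."""
    matches = re.findall(r"Score:\s*(\d+)", judge_output)
    return int(matches[-1]) if matches else None

# Example judge output -> 8
print(parse_final_score("Analysis: clear and accurate, minor style issues.\nScore: 8"))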

Training

To see where reward signals come into play, let's talk about how a model is trained.

An LLM is the result of a multi-stage training process, with each phase shaping the model’s intelligence. Analogous to human learning, a model first learns to "read" (comprehend patterns), then to "follow instructions," and finally, to "converse and adapt." In AI, these stages are known as Pretraining, SFT (Supervised Fine-Tuning), and PPO (Proximal Policy Optimization).

1. Pretraining — the model’s childhood

In this initial stage, the model ingests massive amounts of text: books, articles, websites, code, and dialogues. It’s like giving a child everything there is to read. The model’s task is simple—predict the next word in a stream of text. This is where it gains its foundational knowledge of the world, logic, languages, writing styles, and structure. At this point, it cannot yet follow instructions, but it develops an intuitive "feel" for how language works.

2. Supervised Fine-Tuning (SFT) — the instruction-following school

After the basics, the model progresses to a new level. Now, it's taught to follow specific instructions: answer questions, summarize, translate, etc. It learns from labeled examples: given a prompt, here is the desired correct response. The model learns to “obey” and provide accurate, appropriate outputs. At this stage, it becomes more controllable and useful for practical tasks.

3. Proximal Policy Optimization (PPO) — maturing through feedback

Finally, once the basics are learned, the model begins to respond to feedback—much like praise or correction in human learning. PPO is a reinforcement learning algorithm where the model doesn’t just repeat what it’s seen, but optimizes its outputs to receive higher "rewards"—typically approval from another model or a human evaluator. This is how it learns which responses users find helpful, polite, or ethical. Here, models start to converse, display tact, and justify answers.
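For readers who want the mechanics, the clipped surrogate that gives PPO its name can be sketched for a single sample as follows. This is an illustration of the objective term only, not a full training loop.

import math

def ppo_clipped_term(logp_new: float, logp_old: float, advantage: float,
                     clip_eps: float = 0.2) -> float:
    """Clipped surrogate for one sample: discourages moving the policy too far
    from the version that generated the data, even when the advantage is large."""
    ratio = math.exp(logp_new - logp_old)                 # pi_new(a|s) / pi_old(a|s)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)    # maximized during training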

Together, these three stages form the backbone of a well-tuned language model. They enable models not just to memorize and repeat, but to learn to listen, understand, and respond as expected.

PPO with an LLM-based Reward Model

Once a model has learned to follow instructions, it enters the reinforcement learning phase—a stage where it begins to adapt its behavior based on feedback. This is where Proximal Policy Optimization (PPO) comes into play. Instead of simply imitating examples, the model now generates responses and receives a reward signal indicating how good the response is, typically on a scale from 0 to 1.

This reward usually comes from a reward model—an evaluator typically trained on human preference comparisons to score responses. The policy is then updated in a direction that increases the likelihood of producing responses that score higher. This phase is crucial for shaping responses to be not only correct but also helpful, polite, ethical, or aligned with human preferences.
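As a rough sketch, a classifier-style reward model can be queried like this. The checkpoint name is a placeholder, and the exact pre- and post-processing varies between implementations; the point is simply a scalar in [0, 1] per (prompt, response) pair.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder checkpoint name; a real run would point at an actual reward model.
RM_NAME = "your-org/your-reward-model"
rm_tokenizer = AutoTokenizer.from_pretrained(RM_NAME)
rm_model = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)

def reward_score(prompt: str, response: str) -> float:
    """Map (prompt, response) to a scalar in [0, 1] via the RM's single logit."""
    inputs = rm_tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logit = rm_model(**inputs).logits[0, 0]
    return torch.sigmoid(logit).item()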

Using an LLM as the Reward Model: More Nuanced Feedback

Traditionally, the reward models used in PPO have been relatively simple—often classifiers or regressors trained on preference data to return a single scalar score. However, they offer no transparency: they merely indicate “this is better,” without articulating why.

By replacing this component with a full-fledged LLM-as-a-Judge, we can significantly improve the quality and nuance of the reward signal. The judge receives the task, candidate responses, and optionally a reference answer. It evaluates each response through internal reasoning, analyzing aspects like clarity, logic, relevance, factual accuracy, and tone. Only then does it assign a numerical reward—typically still a single value used by PPO to guide learning.

The key advantage here is not merely the score itself but its foundation in multi-dimensional, language-based reasoning. This allows the reward to reflect more human-like criteria, improving training outcomes without changing the PPO mechanism.
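A minimal sketch of wrapping an LLM judge as the PPO reward might look like the following. The model choice, prompt wording, and the 1-10-to-0-1 rescaling are assumptions for illustration; any judge that reasons first and emits a parseable score would do.

import re
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_reward(task: str, response: str, reference: str | None = None) -> float:
    prompt = (
        "Analyze the response step by step (clarity, logic, relevance, accuracy, tone), "
        "then finish with 'Score: <1-10>'.\n\n"
        f"Task:\n{task}\n\nResponse:\n{response}\n"
        + (f"\nReference answer:\n{reference}\n" if reference else "")
    )
    out = client.responses.create(model="gpt-4o-mini", input=prompt)
    m = re.findall(r"Score:\s*(\d+)", out.output_text)
    score = int(m[-1]) if m else 1
    return (score - 1) / 9.0  # rescale 1-10 to the 0-1 range PPO expects here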

Why This Matters

Integrating an LLM-judge into PPO allows us to preserve the benefits of reinforcement learning—stability, scalability, and policy optimization—while enhancing the signal quality that guides the model’s evolution. The model learns not just to maximize a narrow metric, but to align more closely with evaluative principles that resemble human judgment: logical consistency, informativeness, helpfulness, and more.

Although the model doesn’t directly access the judge’s reasoning during training, the improved scoring it receives helps it generalize better. This results in more robust, aligned, and interpretable behavior over time—a step closer to learning not just what to say, but why it's the right thing to say.

Conclusion

The journey to superior LLMs involves meticulous evaluation and a structured training pipeline. Techniques like LLM-as-a-Judge, particularly when the judge "reasons out loud," provide robust evaluation. The standard training progression from Pretraining to SFT and finally to PPO with an LLM-based reward model lays a strong foundation. This setup, while powerful, primarily relies on feedback for a single generated response at a time. The next evolution in this process, which we will explore in the subsequent article, involves techniques that can leverage multiple samples and richer, descriptive critiques to further enhance model performance and alignment.

Part 2: Generative Reward Modeling (GRM) and GRPO: Enhancing LLM Fine-Tuning with Richer Feedback

In our previous discussion, we explored how Large Language Models (LLMs) are evaluated and trained, culminating in Proximal Policy Optimization (PPO) with an LLM-based reward model. While this approach enhances feedback quality by leveraging an LLM-as-a-Judge that can "reason out loud," it traditionally processes feedback on a single-response basis. This article introduces Generative Reward Modeling (GRM) and Group Relative Policy Optimization (GRPO) – a powerful paradigm that moves beyond single-sample feedback to leverage richer, descriptive critiques across multiple candidate answers, significantly boosting the fine-tuning process. At Rihal, we see this as a key to building superior, client-centric AI.

GRM & GRPO — Learning from Many Answers at Once

The LLM-as-a-Judge + PPO recipe we just covered rests on a simple loop: the policy generates one reply, the judge grades it, PPO tweaks the policy, and the process starts over. That minimal feedback works, but it also means the judge never sees the variety of answers the model could have produced in that moment. As a result, each gradient step is driven by a single, possibly lucky (or unlucky) sample.

Generative Reward Modeling (GRM) breaks that bottleneck. Instead of producing a lone scalar score, a GRM generates a concise textual critique and then assigns a score in plain language. Because the critique is free-form text, the same model can be asked to evaluate two, five, or ten candidate answers side-by-side. This broader view is what powers GRPO (Group Relative Policy Optimization), sketched in code right after the list below:

  1. During each GRPO iteration, the policy samples several candidate responses to the same prompt.
  2. The GRM examines the whole bundle, reasons out loud within its critique, and assigns each a score.
  3. Those scores form a richer reward signal; the policy moves toward patterns that consistently place the best-scored answer among its samples.
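A minimal sketch of one such iteration, assuming hypothetical policy_sample and grm_score helpers, shows the group-relative advantage computation at the heart of GRPO; the actual gradient update is omitted.

import statistics

def grpo_step(prompt, policy_sample, grm_score, k: int = 4):
    """One illustrative GRPO iteration: sample K candidates, score the whole
    group with the GRM, and turn scores into group-relative advantages."""
    candidates = [policy_sample(prompt) for _ in range(k)]   # step 1: sample a group
    scores = grm_score(prompt, candidates)                   # step 2: critique + scores
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores) or 1.0                   # avoid division by zero
    advantages = [(s - mean) / std for s in scores]          # step 3: relative signal
    # The policy update (not shown) pushes up candidates with positive advantage
    # and down those scoring below the group average.
    return list(zip(candidates, advantages))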

More Samples, Steadier Signals

Because the GRM’s output is textual, we can run the evaluation many times with different decoding temperatures. Each run yields a slightly different critique and score distribution. Averaging these independent verdicts smooths out randomness and sharpens the guidance the policy receives; a short averaging sketch follows the list below. In practice, we see:

  • Lower noise: Outlier judgments from any single GRM pass are diluted.
  • Better exploration: The policy is rewarded for producing sets of strong answers, not just one lucky shot.
  • Alignment with human rubrics: The GRM’s reasoning step forces it to articulate where each answer shines or fails, nudging scores toward criteria humans care about.
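Here is a small sketch of that averaging, assuming a hypothetical run_grm_once(temperature) helper that stands in for the full judging call shown later in the Code section (which, for simplicity, uses a single fixed temperature and relies on sampling randomness alone).

import asyncio
import statistics

async def averaged_grm_scores(run_grm_once, num_candidates: int,
                              temperatures=(0.3, 0.5, 0.7, 0.9)):
    """Run the GRM once per temperature and average the per-candidate scores."""
    passes = await asyncio.gather(*(run_grm_once(t) for t in temperatures))
    valid = [p for p in passes if p and len(p) == num_candidates]
    if not valid:
        return [0.0] * num_candidates
    return [statistics.mean(scores[i] for scores in valid) for i in range(num_candidates)]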

One Judge Versus Many Reviewers

Put side-by-side, the contrast is clear:

Aspect                        | LLM-Judge + PPO        | GRM + GRPO
Responses evaluated per step  | 1                      | Multiple (e.g., 2-10+)
Feedback form                 | Single number          | Free-form critique + number
Variance control              | None beyond KL penalty | Multiple evaluations, averaged
Reasoning visibility          | Optional               | Always required before scoring

Both frameworks rely on the judge thinking first and scoring second; the difference is that GRM turns that reflection into explicit text and applies it to multiple candidate answers, not just one. The result is a denser learning signal and, after a few training rounds, a policy that not only produces higher-quality first drafts but can generate diverse good answers within the same sampling window.

In other words, GRM + GRPO lets the model learn from all the ideas it has, not just the first one that escapes the sampling loop—a small architectural change that unlocks a much richer form of feedback.

Code

Before you copy the snippet into a notebook or a training script, keep these practical notes in mind—they can save the usual “why is nothing happening?” head-scratching:

  • Dependencies: Ensure unsloth, openai (which provides the AsyncOpenAI client), and httpx are installed in the same Python environment that will run GRPO; re, asyncio, and os ship with the Python standard library.
  • Environment variables: Set OPENAI_API_KEY (and any organization-specific keys) in your shell before you launch the run; otherwise, the asynchronous OpenAI client will raise an exception at the first request.
  • Chat-style completions expected: The completions argument must follow the OpenAI chat format ([{"role": "assistant", "content": "…"}]). If you call another backend, adapt the extraction logic so the reward function still receives a list of strings.
  • Safe fall-backs: When no <answer> tags are found, the function returns a zeroed reward vector on purpose; this prevents “nan” gradients and keeps the optimizer stable even on bad or malformed data.
  • Illustrative Logging: The print statements in the provided code are for illustrative and debugging purposes. They can be modified or removed for production use.

With those guard-rails in place, you can drop the code block below straight into the Unsloth GRPO notebook (or any RLHF loop) and start experimenting with richer, critique-based rewards.

And to keep things simple, we used Unsloth’s notebook implementations for Qwen GRPO training: https://docs.unsloth.ai/get-started/unsloth-notebooks

All we need to do to integrate GRM into GRPO is to replace one of the reward functions, something like this:

def correctness_reward_func(prompts, completions, answer, **kwargs) -> list[float]:
    responses = [completion[0]['content'] for completion in completions]
    q = prompts[0][-1]['content']
    extracted_responses = [extract_xml_answer(r) for r in responses]
    print('-'*20, f"Question:\n{q}", f"\nAnswer:\n{answer[0]}", f"\nResponse:\n{responses[0]}", f"\nExtracted:\n{extracted_responses[0]}")
    return [2.0 if r == a else 0.0 for r, a in zip(extracted_responses, answer)]

With something like this:

# Imports and retry settings used by the snippets below. The retry constants are
# not defined in the original notebook, so the values here are reasonable assumptions.
import asyncio
import os
import re

import httpx
from openai import APIConnectionError, AsyncOpenAI

MAX_RETRIES = 5        # assumed retry budget for transient connection errors
INITIAL_BACKOFF = 1.0  # assumed initial wait (seconds), doubled after each failed attempt

_BOXED_RE = re.compile(
    r"""            # Any of the following, case-insensitive
    \*?\*?          # optional '**' from Markdown bold
    \\?boxed        # 'boxed' with *optional* leading back-slash
    \*?\*?          # closing '**' if it was bold
    \s*[:\-–]?\s*   # optional punctuation after the word
    \{              # opening brace
    ([^}]*)         # capture everything up to the first closing brace
    \}              # closing brace
    """,
    flags=re.IGNORECASE | re.VERBOSE | re.DOTALL,
)

def _parse_scores(text: str) -> list[float]:
    """Pull a list of numeric scores out of the GRM's free-form verdict."""
    float_pattern = r"-?\d+(?:\.\d+)?"
    # 1) Prefer scores inside a \boxed{...} block.
    m = _BOXED_RE.search(text)
    if m:
        nums = re.findall(float_pattern, m.group(1))
        if nums:
            return [float(n) for n in nums]
    # 2) Otherwise, look for the last line mentioning "score".
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    for line in reversed(lines):
        if "score" in line.lower():
            nums = re.findall(float_pattern, line)
            if nums:
                return [float(n) for n in nums]
    # 3) Fall back to any numbers on the final non-empty line.
    if lines:
        nums = re.findall(float_pattern, lines[-1])
        if nums:
            return [float(n) for n in nums]
    return []
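# Quick illustration (assumed inputs) of what _parse_scores extracts from typical
# GRM verdicts - a \boxed{...} line and a plain "Scores:" line:
#   _parse_scores(r"Analysis: ... Scores: \boxed{7, 4.5}")  -> [7.0, 4.5]
#   _parse_scores("Analysis: ...\nScores: 6, 9")            -> [6.0, 9.0]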

async def _grm_reward_func_async(prompts, completions, **kwargs):
    n_iter = kwargs.get("grm_iterations", 8)   # number of independent GRM passes
    num_samples = len(prompts)

    # Rebuild the shared conversation context (everything except the system prompt).
    common_conversation_text = ""
    for msg in prompts[0]:
        if msg.get("role") == "system":
            continue
        role = msg.get("role", "")
        content = msg.get("content", "")
        common_conversation_text += f"{role}: {content}\n"

    grading_prompt = f"""You are a skilled little expert at scoring responses. You should evaluate given responses based
on the given judging criteria.\n Given the context of the conversation (the last round is the
User’s query) and multiple responses from the Assistant, you need to refer to the [General
Evaluation Criteria] to score the responses. Based on the general evaluation criteria, state
potential other specific criteria to the query, the weights of different criteria, and then provide
an overall comprehensive score upon them.\n Each score is an integer between 1 and 10,
with a higher score indicating that the response meets the relevant criteria more closely. For
example, a score of 1 means the response does not meet the criteria at all, a score of 6 means
the response meets only some parts, and a score of 10 means the response perfectly meets the
evaluation criteria.\n Before scoring, please analyze step by step. Your scoring needs to be as
strict as possible.
#### Evaluation Criteria ####
1. Instruction Adherence:\n - Fully Adhered (9-10 points): The response fully complies with
all instructions and requirements of the question.\n - Partially Adhered (6-8 points): The
response meets most of the instructions but has some omissions or misunderstandings.\n -
Basically Adhered (3-5 points): The response meets some instructions, but the main
requirements are not fulfilled.\n - Not Adhered (1-2 points): The response does not meet any
instructions.\n Example: If the question requires three examples and the response provides
only one, it falls under “Partially Adhered.”
2. Usefulness:\n - Highly Useful (9-10 points): The response provides comprehensive and
accurate information, fully addressing the issue.\n - Useful but Incomplete (6-8 points):
The response provides some useful information, but lacks details or accuracy.\n - Limited
Usefulness (3-5 points): The response offers little useful information, with most content
being irrelevant or incorrect.\n - Useless or Incorrect (1-2 points): The response is completely
irrelevant or incorrect.\n Example: If there are factual errors in the response but the overall
direction is correct, it falls under “Useful but Incomplete.”
3. Level of Detail:\n - Very Detailed (9-10 points): The response includes ample details
covering all aspects of the issue.\n - Detailed but Slightly Lacking (6-8 points): The response
is fairly detailed but misses some important details.\n - Basically Detailed (3-5 points): The
response provides some details but is not thorough enough overall.\n - Not Detailed (1-2
points): The response is very brief and lacks necessary details.\n Example: If the response
provides only a simple conclusion without an explanation, it falls under “Not Detailed.”
4. Relevance:\n - Highly Relevant (9-10 points): The response is highly relevant to the
question, with information closely aligned with the topic.\n - Generally Relevant (6-8 points):
The response is generally relevant but includes some unnecessary information.\n - Partially
Relevant (3-5 points): The response has a lot of content that deviates from the topic.\n - Not
Relevant (1-2 points): The response is completely irrelevant.\n Example: If the response strays
from the topic but still provides some relevant information, it falls under “Partially Relevant.”

#### Conversation Context ####
{common_conversation_text}

#### Responses to be Scored ####
"""

    pattern_answer = re.compile(r"<answer>(.*?)</answer>", flags=re.DOTALL)
    parsed_count = 0  # Track how many completions contain a valid <answer> tag

    for i, comp in enumerate(completions):
        raw_response_text = ""
        if comp and isinstance(comp, list) and comp[0].get("content"):
            raw_response_text = comp[0]["content"].strip()

        match_ans = pattern_answer.search(raw_response_text)
        if match_ans:
            new_response_text = match_ans.group(1).strip()
            parsed_count += 1
        else:
            new_response_text = "I do not know"

        grading_prompt += f"[The Begin of Response {i+1}]\n{new_response_text}\n[The End of Response {i+1}]\n"

    # If no valid <answer> tags were found, skip grading and assign 0 to all.
    if parsed_count == 0:
        return [0.0] * num_samples

    grading_prompt += """\n#### Output Format Requirements ####
Output with three lines:
Specific Criteria: <Other potential criteria specific to the query and the context, and the weights of each criteria>.
Analysis: <Compare different responses based on given Criteria>.
Scores: <the overall comprehensive score of all responses in order, separate by comma in the boxed, e.g., \\boxed{x, x} if there exists 2 responses>."""

    api_key = os.environ.get("OPENAI_API_KEY")
    if not api_key:
        raise EnvironmentError("OPENAI_API_KEY environment variable not set.")
    client = AsyncOpenAI(api_key=api_key)

    async def get_scores_once():
        backoff = INITIAL_BACKOFF
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                resp = await client.responses.create(
                    model="gpt-4o-mini",
                    input=grading_prompt,
                    temperature=0.5,
                )
                text = resp.output_text.strip()
                scores = _parse_scores(text)
                print("\n---Start---\nLLM Input:\n")
                print(grading_prompt)
                print("\nLLM Output:\n")
                print(text)
                print("\nRaw parsed scores:\n")
                print(scores)
                print("\n---End---\n")
                if len(scores) != num_samples:
                    print(f"Got {len(scores)} scores but expected {num_samples}, skipping this response")
                    return None
                return scores

            except (APIConnectionError, httpx.ConnectError) as e:
                if attempt < MAX_RETRIES:
                    print(f"[Attempt {attempt}/{MAX_RETRIES}] Connection failed: {e!r}")
                    print(f"→ retrying in {backoff} s…")
                    await asyncio.sleep(backoff)
                    backoff *= 2
                else:
                    print(f"[Attempt {attempt}/{MAX_RETRIES}] Connection failed, max retries reached. Skipping.")
                    return None

    # Fire n_iter independent judging passes concurrently, then average their verdicts.
    tasks = [get_scores_once() for _ in range(n_iter)]
    results = await asyncio.gather(*tasks)

    final_scores = []
    for response_idx in range(num_samples):
        valid_scores = []
        for res in results:
            if not res:
                continue
            if response_idx < len(res):
                # Constrain raw score to [0..10]. If out-of-range, treat as 0.
                raw_val = res[response_idx]
                if raw_val < 0 or raw_val > 10:
                    raw_val = 0
                valid_scores.append(raw_val)

        # Average the valid scores
        if valid_scores:
            avg_score = sum(valid_scores) / len(valid_scores)
        else:
            # No valid scores at all
            avg_score = 0.0

        final_scores.append(avg_score)

    return final_scores

def correctness_reward_func(prompts, completions, **kwargs):
    """
    GRM reward function that evaluates all responses together using _grm_reward_func_async.
    1) If no <answer> tags are found at all, returns all zeros.
    2) Otherwise, raw scores in [0..10]:
       - Raw = 0  => 0
       - (0..5]   => [-1..0]
       - (5..10]  => (0..2]
    """
    try:
        raw_scores = asyncio.run(_grm_reward_func_async(prompts, completions, **kwargs))
        normalized_scores = []
        for score in raw_scores:
            # If score is exactly 0, keep it at 0
            if score == 0:
                normalized_scores.append(0.0)
            # If score is in (0, 5], map linearly to [-1..0]
            elif 0 < score <= 5:
                # 5 -> 0, 0 -> -1
                # f(score) = (score - 5) / 5 => [-1, 0]
                norm = (score - 5) / 5
                normalized_scores.append(norm)
            # If score is in (5, 10], map linearly to (0..2]
            else:  # 5 < score <= 10
                # 5 -> 0, 10 -> 2
                # g(score) = (score - 5) * (2/5) => (0..2]
                norm = (score - 5) * (2.0 / 5.0)
                normalized_scores.append(norm)

        print("\n---Scores START!---")
        print('\nraw_scores:', [f'{s:.2f}' for s in raw_scores])
        print('\nnormalized_scores:', [f'{s:.2f}' for s in normalized_scores])
        print("\n---Scores END!---\n")

        return normalized_scores
    except RuntimeError as e:
        print("\nError:", e)
        return [0.0] * len(prompts)
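To wire this into the notebook, the function is simply passed to the trainer's reward function list. The sketch below follows current TRL/Unsloth GRPO examples; model, tokenizer, and dataset come from the notebook itself, and argument names may differ slightly between library versions.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    num_generations=8,          # candidates sampled per prompt and scored together by the GRM
    max_completion_length=512,
    output_dir="outputs",
)
trainer = GRPOTrainer(
    model=model,                            # the Unsloth-patched policy model from the notebook
    processing_class=tokenizer,
    reward_funcs=[correctness_reward_func], # other notebook rewards can stay in this list
    args=training_args,
    train_dataset=dataset,
)
trainer.train()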

Conclusion

Large Language Models (LLMs) achieve optimal performance when evaluation and training are tightly intertwined:

  1. LLM-as-a-Judge provides two complementary lenses: pairwise Elo rankings for quick model sorting and reference-based 10-point scoring for fine-grained diagnostics. (Covered in Part 1)
  2. For both approaches, asking the judge to reason out loud stabilizes scores and improves their correlation with human opinion. (Covered in Part 1)
  3. PPO converts these scores into policy updates, but a single scalar reward per step limits the signal's richness. (Covered in Part 1)
  4. GRM + GRPO widen this aperture by evaluating many candidate answers, incorporating written critiques, using multiple sampling passes, and averaging scores. The richer feedback lowers variance, rewards diversity, and teaches the policy why an answer is good—not just that it is.

In practice, the shift from a silent reward model to a generative one typically involves little more than an additional API call, yet it can unlock a markedly steadier learning curve. At Rihal, we have found that even modest experiments—such as eight GRM evaluations per batch and basic score normalization—produce policies that are both more helpful on the first draft and better at exploring alternative phrasings.

The next step is straightforward: swap in a GRM and see whether your gradients become steadier. From there, iterate on the critique prompt, tune the temperature mix, and measure—always measure—against human judgments. The loop is simple; the gains, in many cases, are quickly apparent.
