1Zhejiang University - University of Illinois Urbana-Champaign Institute, 2Zhejiang University
Figure: MT-RewardTree pipeline and reward tree structure.

Abstract

Process reward models (PRMs) have shown success in complex reasoning tasks for large language models (LLMs). However, their application to machine translation (MT) remains underexplored due to the lack of systematic methodologies and evaluation benchmarks. To address this gap, we introduce MT-RewardTree, a comprehensive framework for constructing, evaluating, and deploying process reward models in MT. We propose a novel method for automatically generating high-quality token-level preference pairs using approximate Monte Carlo Tree Search (MCTS), mitigating the prohibitive cost of human annotation. Our framework establishes the first MT-specific reward model benchmark and provides a systematic comparison of different reward modeling architectures, revealing that token-level supervision effectively captures fine-grained preferences.

Experimental results demonstrate that our MT-PRM-Qwen-2.5-3B achieves state-of-the-art performance in both token-level and sequence-level evaluation given the same input prefix. Furthermore, we showcase practical applications where PRMs enable test-time alignment for LLMs without additional training and significantly improve performance in hypothesis ensembling. Our work provides valuable insights into the role of reward models in MT research. Our code and data will be publicly available.

Token-level Preference Pairs Construction

We propose a token-centric approach that quantifies a token's quality by its potential to contribute to higher-quality translations. This method aligns with Monte Carlo-based PRM construction techniques in mathematical reasoning, where a step's quality is determined by its incremental contribution to deriving the correct answer.

Selection:
The first phase selects the portion of the existing tree that is most promising for further expansion. Starting from the root node, the tree is traversed down to a leaf using the PUCT algorithm, and the prompt together with the previously generated tokens is taken as the prefix y<t.
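
For reference, the standard PUCT rule selects, at each state s, the child a that maximizes the following score (c_puct is a generic exploration constant; the value used in practice is not specified here):

\[
a^{*} = \arg\max_{a} \left[ Q(s, a) + c_{\mathrm{puct}} \, P(a \mid s) \, \frac{\sqrt{\sum_{b} N(s, b)}}{1 + N(s, a)} \right]
\]

where Q(s, a) is the estimated value of child a, P(a | s) is the prior probability of token a (typically the LM's next-token probability), and N(s, a) is its visit count.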

Expansion:
If the selected leaf node is not an EOS token, it is expanded by generating k candidate children. In our paper, we select the two candidate tokens with the highest logits (k = 2). These top-2 tokens, sharing the same prefix y<t, form the basis for our token-level preference pair.
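
As a minimal sketch of this expansion step with a Hugging Face causal LM (the model name below is only a placeholder for the policy used to build the tree):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder policy model; the actual model used for tree construction may differ.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")
lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

def expand_top2(lm, prefix_ids: torch.Tensor) -> list[int]:
    """Return the two candidate next tokens with the highest logits for prefix y_<t."""
    with torch.no_grad():
        next_logits = lm(prefix_ids).logits[0, -1]        # logits for the next position
    return torch.topk(next_logits, k=2).indices.tolist()  # k = 2 candidate children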

Simulation:
From each expanded node a, we generate n complete translation rollouts until an EOS token is reached. We then evaluate the quality of each full translation sequence, average the resulting scores, and assign the average as the value of node a.
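
A sketch of this simulation step, assuming a generic sequence-level quality scorer score_fn(src, hyp) (for example, a COMET-style metric) and illustrative sampling settings:

import torch

def rollout_value(lm, tok, prefix_ids: torch.Tensor, score_fn, src: str, n: int = 4) -> float:
    """Estimate the value of a node as the average quality of n complete rollouts."""
    scores = []
    for _ in range(n):
        out = lm.generate(prefix_ids, do_sample=True, top_p=0.9,
                          max_new_tokens=256, pad_token_id=tok.eos_token_id)
        hyp = tok.decode(out[0], skip_special_tokens=True)
        # In practice, strip the prompt portion from hyp before scoring the translation.
        scores.append(score_fn(src, hyp))
    return sum(scores) / len(scores)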

Back-propagation:
We compare the values of the two expanded nodes, a_t,1 and a_t,2, to determine which candidate token is superior, and we retain the node with the higher value. This node, along with its corresponding prefix y<t, is then used as the starting point of the next search cycle, beginning again at the selection phase.
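
Putting the four phases together, a simplified single-path sketch of the procedure (reusing the expand_top2 and rollout_value helpers above; not the exact search implementation) could look like:

import torch

def build_preference_pairs(lm, tok, prompt: str, src: str, score_fn, max_steps: int = 64):
    """Single-path approximation of the tree search that emits token-level
    preference pairs (prefix_text, preferred_token_id, rejected_token_id)."""
    prefix_ids = tok(prompt, return_tensors="pt").input_ids
    pairs = []
    for _ in range(max_steps):
        # Expansion: two highest-logit candidate children sharing the prefix y_<t.
        cand = expand_top2(lm, prefix_ids)
        if tok.eos_token_id in cand:
            break
        # Simulation: value of each candidate = mean quality of its rollouts.
        values = [rollout_value(lm, tok,
                                torch.cat([prefix_ids, torch.tensor([[a]])], dim=-1),
                                score_fn, src)
                  for a in cand]
        # Back-propagation (simplified): keep the higher-valued token and record the pair.
        win, lose = (0, 1) if values[0] >= values[1] else (1, 0)
        pairs.append((tok.decode(prefix_ids[0]), cand[win], cand[lose]))
        prefix_ids = torch.cat([prefix_ids, torch.tensor([[cand[win]]])], dim=-1)
    return pairs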


Experimental Results

We also conducted an evaluation of our Qwen and LLaMA PRMs. The Qwen PRM slightly outperforms the LLaMA PRM in both translation tasks and across all metrics.

Figure: Reward model evaluation results for the Qwen and LLaMA PRMs.

Practical Insights

Test-time Alignment:

Test-time alignment, also known as decoding-time alignment, refers to the process of adjusting an LM's output during inference to better align with human preferences, without additional training or fine-tuning. Its application in MT remains underexplored.
In the context of MT, given the prior context s<t and timestep t, we define a reward-guided scoring function for a candidate token a that combines the policy model's likelihood of a with the token-level reward assigned by the PRM.
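
A common instantiation of such a function, following reward-guided search approaches (the exact form and the interpolation weight w used in the paper may differ), is:

\[
s(a) = \log \pi_{\mathrm{LM}}(a \mid s_{<t}) + w \cdot r\big([s_{<t}, a]\big)
\]

where \pi_{\mathrm{LM}} is the policy LM and r is the PRM's token-level reward for appending a to the context.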


Compared to standard decoding strategies, this approach offers a more refined scoring function, as it encourages the generated text to: 1) Maintain semantic coherence and relevance with the prior context, and 2) Align more closely with reward-based criteria and human preferences. Test-time alignment also substantially reduces the need for the extensive resources typically required for LM alignment training.
We use Qwen2.5-14B-Instruct to generate tokens and leverage MT-PRM-LLaMA-3.2-3B and MT-PRM-Qwen-2.5-3B as the models providing token-level rewards. We randomly sample 500 cases from the WMT 2023 test set. The reward-guided decoding methods outperform standard greedy decoding in both EN-RU and ZH-EN translation tasks, as evaluated by the COMET, COMETKiwi, and XCOMET-XL metrics. For instance, under the XCOMET-XL metric, the LLaMA PRM and Qwen PRM outperform standard greedy decoding by 17.5% and 17.9% in the EN-RU task, respectively.
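
A minimal sketch of reward-guided greedy decoding under this scoring function (the candidate pool size k, the weight w, and the prm_score(text) -> float interface are illustrative assumptions, not the paper's implementation):

import torch

def reward_guided_decode(policy, tok, prm_score, prompt: str,
                         k: int = 10, w: float = 1.0, max_new_tokens: int = 256) -> str:
    """At each step, pick the candidate token maximizing log-prob + w * PRM reward."""
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logprobs = torch.log_softmax(policy(ids).logits[0, -1], dim=-1)
        topk = torch.topk(logprobs, k=k)
        best_id, best_score = None, float("-inf")
        for lp, a in zip(topk.values.tolist(), topk.indices.tolist()):
            cand = torch.cat([ids, torch.tensor([[a]])], dim=-1)
            score = lp + w * prm_score(tok.decode(cand[0]))  # assumed PRM scoring interface
            if score > best_score:
                best_id, best_score = a, score
        ids = torch.cat([ids, torch.tensor([[best_id]])], dim=-1)
        if best_id == tok.eos_token_id:
            break
    return tok.decode(ids[0], skip_special_tokens=True)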


Hypothesis Ensembling:

Ensembling is widely recognized for its ability to combine multiple complementary models to improve performance in machine learning. In this work, we investigate two complementary ensembling approaches for MT, ranking-based selection and Minimum Bayes Risk (MBR) decoding, which correspond to the formulas below:

\[
\hat{y}_{\mathrm{rank}} = \arg\max_{y \in \mathcal{H}} f(x, y),
\qquad
\hat{y}_{\mathrm{MBR}} = \arg\max_{y \in \mathcal{H}} \frac{1}{|\mathcal{H}|} \sum_{y' \in \mathcal{H}} u(y, y')
\]
where \mathcal{H} denotes the set of candidate translations for a source sentence x.

Here, f represents either a strong LLM used as a judge (e.g., Gemini-2.0-Flash) or our PRM, converted to sequence-level scoring through weighted DPO rewards, and u(·) can be instantiated with traditional metrics such as BLEU, BERTScore, or the reference-based metric COMET.
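
A minimal sketch of the two selectors with generic f and u (the sacrebleu utility shown in the comments is only one illustrative choice for u):

def rank_select(src: str, hyps: list[str], f) -> str:
    """Ranking-based selection: return the hypothesis with the highest f(x, y)."""
    return max(hyps, key=lambda y: f(src, y))

def mbr_select(hyps: list[str], u) -> str:
    """Minimum Bayes Risk decoding: return the hypothesis with the highest
    average pairwise utility u(y, y') over the candidate set."""
    return max(hyps, key=lambda y: sum(u(y, other) for other in hyps) / len(hyps))

# Illustrative utility: sentence-level BLEU via sacrebleu.
# from sacrebleu.metrics import BLEU
# bleu = BLEU(effective_order=True)
# u = lambda y, other: bleu.sentence_score(y, [other]).score
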
We evaluate on 500 cases sampled from the WMT 2023 dataset, using the TowerInstruct-7B-v0.2 model with nucleus sampling to generate 8 candidate translations for each case. As shown below, our MT-PRM-LLaMA-3.2-3B outperforms MBR decoding with BLEU and BERTScore utilities by 0.93% and 0.27%, respectively. It even surpasses the commercial LLM Gemini-2.0-Flash by 0.51%.

Figure: Hypothesis ensembling results.