---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---
# Model Card for RootSignals-Judge-Llama-70B

**Root Judge** is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations.
Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge** surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves SOTA on hallucination detection compared to leading closed models, at a fraction of the cost.

# 1. Intended Use Cases
**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in various contexts, such as:
- Detecting context-grounded hallucinations, e.g. in *Retrieval-Augmented Generation* (RAG) settings, in an explainable manner that provides a justification for the score
- Pairwise preference judgments, thanks to strong evaluation instruction-following capabilities (see the sketch after this list)
- Serving as a custom evaluation metric powered by use-case-specific evaluation rubrics
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments
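
As a sketch of the pairwise preference use case referenced above, the snippet below asks the judge to choose between two candidate answers and return a structured verdict. It assumes a local OpenAI-compatible endpoint such as the SGLang or vLLM deployments described in section 3.2; the system prompt and the `PreferenceVerdict` schema are illustrative choices, not a prescribed interface.

```python
# Minimal pairwise preference sketch (prompt wording and schema are illustrative
# assumptions, not an official recipe). Assumes an OpenAI-compatible server from section 3.2.
from openai import OpenAI
from pydantic import BaseModel

class PreferenceVerdict(BaseModel):
    REASONING: str
    WINNER: str  # expected to be "A" or "B"

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

question = "Summarize the main finding of the report in one sentence."
candidate_a = "The report finds that churn dropped 12% after the pricing change."
candidate_b = "The report is about churn and pricing and other things."

verdict = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",
    messages=[
        {"role": "system", "content": "You are a fair judge. Compare the two candidate "
         "answers to the user question and pick the better one. Reply with your reasoning "
         "and the winner ('A' or 'B')."},
        {"role": "user", "content": f"<QUESTION>{question}</QUESTION>\n"
                                    f"<CANDIDATE_A>{candidate_a}</CANDIDATE_A>\n"
                                    f"<CANDIDATE_B>{candidate_b}</CANDIDATE_B>"},
    ],
    response_format=PreferenceVerdict,
).choices[0].message.parsed

print(verdict.WINNER, "-", verdict.REASONING)
```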

# 2. Performance Summary

**Root Judge** outperforms leading closed models at detecting instruction-following failures during evaluations,
while providing detailed, structured justifications on long inputs of up to 32k tokens, both on internal benchmarks and on the public HaluBench test set.

## 2.1 Hallucination Detection (in a RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench)

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 14900 | **86.3** | **3.98** |
| 2 | GPT-4o | 14900 | 86.1 | 33.12 |
| 3 | o1-preview | 14899 | 85.3 | 1062* |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |

`*` = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead.
Local costs are based on Lambda Labs instances at January 2025 prices.

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

## 2.2 Instruction Following

📊 Instruction-following performance on diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

| Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 70 | **94.6 ± 0.6** | **93.9** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
| 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
| 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
| 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
| 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
| 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
| 7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

[🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/rHq5RakEPkOlnC69MOl1e.png)
*Image 1: Total pass@1 rates and consistency (delta), assessed via an ensemble of leading 3rd-party models.*


![image/png](https://cdn-uploads.huggingface.co/production/uploads/6343d9d3e01a38440eeffc9c/zfsh6HTbYH1HpLItWgq8u.png)
*Image 2: Custom rubric instruction-following by high-level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.
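
As an illustration of such a user-defined rubric, the sketch below scores a response on a 1-5 conciseness scale with a justification, reusing the OpenAI-compatible setup from section 3.2. The rubric text and the `RubricScore` schema are example choices rather than a fixed format.

```python
# Illustrative custom-rubric sketch (rubric wording and RubricScore schema are
# assumptions, not a fixed interface). Assumes an OpenAI-compatible server from section 3.2.
from openai import OpenAI
from pydantic import BaseModel

class RubricScore(BaseModel):
    JUSTIFICATION: str
    SCORE: int  # 1 (worst) to 5 (best)

rubric = """
Score the RESPONSE for conciseness on a 1-5 scale:
5 = fully answers the question with no redundant content
3 = answers the question but repeats itself or adds filler
1 = mostly filler or off-topic content
Return your justification and the integer score.
"""

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

result = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",
    messages=[
        {"role": "system", "content": "You are a fair judge. Apply the user's rubric exactly."},
        {"role": "user", "content": f"<RUBRIC>{rubric}</RUBRIC>\n"
                                    "<RESPONSE>The meeting was moved to Tuesday at 10:00.</RESPONSE>"},
    ],
    response_format=RubricScore,
).choices[0].message.parsed

print(result.SCORE, "-", result.JUSTIFICATION)
```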

## 2.4 Other Benchmarks

<details>
<summary>📊 RewardBench</summary>

[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

| Benchmark Task | Score | Total | Accuracy |
|------------------------|-------|-------|-----------|
| alpacaeval-easy | 99.0 | 100 | 0.99 |
| alpacaeval-hard | 93.0 | 95 | 0.97894737 |
| alpacaeval-length | 86.0 | 95 | 0.90526316 |
| donotanswer | 73.5 | 136 | 0.54044118 |
| hep-cpp | 159.0 | 164 | 0.96951220 |
| hep-go | 159.0 | 164 | 0.96951220 |
| hep-java | 161.0 | 164 | 0.98170732 |
| hep-js | 159.0 | 164 | 0.96951220 |
| hep-python | 158.0 | 164 | 0.96341463 |
| hep-rust | 152.0 | 164 | 0.92682927 |
| llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
| llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
| llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
| llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
| llmbar-natural | 94.0 | 100 | 0.94 |
| math-prm | 357.0 | 447 | 0.79865772 |
| mt-bench-easy | 28.0 | 28 | 1.0 |
| mt-bench-hard | 32.0 | 37 | 0.86486486 |
| mt-bench-med | 40.0 | 40 | 1.0 |
| refusals-dangerous | 73.5 | 100 | 0.735 |
| refusals-offensive | 89.0 | 100 | 0.89 |
| xstest-should-refuse | 140.5 | 154 | 0.91233766 |
| xstest-should-respond | 245.0 | 250 | 0.98 |
| Chat | | | 0.96648045 |
| Chat Hard | | | 0.74561404 |
| Safety | | | 0.83986486 |
| Reasoning | | | 0.88103618 |

</details>

Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known performance regressions such as catastrophic forgetting. We find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weight quantization,
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via the Root Signals Python SDK

The model is available on our [platform](https://app.rootsignals.ai/register?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals) as part of our evaluation suite, at no additional cost.

Install our [python library](https://github.com/root-signals/rs-python-sdk):
```bash
pip install root-signals
```

Import:
```python
from root import RootSignals
client = RootSignals()
```

Create a custom evaluator powered by **Root Judge**:
```python
my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)
```

Execute:
```python
result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score between [0, 1]
print(result.justification)  # detailed reasoning for the score
```

## 3.2 Locally

We recommend [SGLang](https://github.com/sgl-project/sglang) for production use cases, together with *XML tags* around the important sections of your prompt. While the model can run on 80 GB of VRAM, we recommend at least 96 GB for evaluating long-context RAG inputs.

SGLang example for a single Nvidia H100 (80 GB):
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
    python3 -m sglang.launch_server \
    --model-path root-signals/RootSignals-Judge-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.89 \
    --grammar-backend xgrammar \
    --enable-torch-compile \
    --disable-cuda-graph
```
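
Once the container is running, a quick smoke test against the OpenAI-compatible endpoint it exposes confirms the model is being served; the prompt below is just a placeholder.

```python
# Minimal smoke test against the local OpenAI-compatible endpoint started above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="root-signals/RootSignals-Judge-Llama-70B",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    max_tokens=8,
)
print(completion.choices[0].message.content)
```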

We also validated the model with [vLLM](https://github.com/vllm-project/vllm) on arm64 (Nvidia GH200), with outputs of up to 64k tokens:
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
    --model root-signals/RootSignals-Judge-Llama-70B \
    --gpu-memory-utilization 0.95 \
    --max-model-len 64k \
    --block_size 16 \
    --enable_prefix_caching
```

Detect hallucinations against a provided context; this example uses HaluBench:
```python
import pandas as pd
from openai import OpenAI
from pprint import pprint
from pydantic import BaseModel

decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>

<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>

<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

# Load the HaluBench test set and pick a random example
testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]

class DecomposeResponse(BaseModel):
    REASONING: str
    VERDICT: str

client = OpenAI(base_url="http://localhost:8000/v1")  # point this at your SGLang/vLLM server, OpenRouter, etc.

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the Root Signals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"],
            document=example_row["passage"],
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)
```

```
> ('Following the instructions: #1, the key element in the question is the '
 "nationality of the magazines. #2, the document states that 'The Woman's "
 "Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
 "is a British weekly women's magazine'. #3, the answer claims both magazines "
 'are British. #4, checking each claim in the answer: a) The document does not '
 "support the claim that The Woman's Viewpoint is British, instead, it says "
 "the magazine was founded in Texas. b) There's no reasonable inference from "
 "the document that would suggest The Woman's Viewpoint is British. c) The "
 "claim about The Woman's Viewpoint is contradicted by the document. #5, the "
 'answer introduces information (both being British) not supported by the '
 'document. #6, additional information about both magazines being British is '
 'introduced in the answer without being present in the document or question. '
 '#7, the answer makes an unjustified assumption by stating both magazines are '
 "British despite the document clearly stating The Woman's Viewpoint was "
 'founded in Texas, implying it is not British. Therefore, the answer fails to '
 'accurately reflect the information provided in the document and makes '
 'unjustified assumptions based on the information given in the question and '
 "document.', ")
'FAIL'
```
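
To estimate a pass rate in the spirit of section 2.1, the same call can be looped over HaluBench rows and compared against the dataset labels. The sketch below is a rough approximation, not the exact benchmark harness: it reuses `client`, the prompts, and `DecomposeResponse` from the example above, the `judge_one` helper is hypothetical, and it assumes the dataset's `label` column holds the PASS/FAIL ground truth.

```python
# Rough pass-rate sketch over a HaluBench sample (not the exact benchmark harness).
# Reuses client, decompose_system_instruction, decompose_prompt, DecomposeResponse,
# and testset_df from the example above; assumes a "label" column with PASS/FAIL values.
def judge_one(row) -> str:  # hypothetical helper wrapping the call shown above
    parsed = client.beta.chat.completions.parse(
        model="root-signals/RootSignals-Judge-Llama-70B",
        messages=[
            {"role": "system", "content": decompose_system_instruction},
            {"role": "user", "content": decompose_prompt.format(
                question=row["question"], document=row["passage"], answer=row["answer"])},
        ],
        response_format=DecomposeResponse,
    ).choices[0].message.parsed
    return parsed.VERDICT

sample = testset_df.head(100)  # small sample; the full test set has ~15k rows
correct = sum(judge_one(row) == row["label"] for _, row in sample.iterrows())
print(f"pass@1 on sample: {correct / len(sample):.3f}")
```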

# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.scorable.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland


# 5. Contact

**Links**
- [Scorable Homepage](https://www.scorable.ai/)
- [Scorable Platform](https://app.scorable.ai/?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals)
- [Python SDK](https://github.com/root-signals/rs-sdk/blob/main/python/README.md)
- [Python SDK Docs](https://sdk.rootsignals.ai/en/latest/quickstart.html)
- [TypeScript SDK](https://github.com/root-signals/rs-sdk/blob/main/typescript/README.md)
- [Discord](https://discord.gg/EhazTQsFnj)

**Email**
- hello@scorable.ai