---
license: llama3.3
language:
- en
base_model:
- meta-llama/Llama-3.3-70B-Instruct
pipeline_tag: text-generation
tags:
- llm-as-judge
- evaluation
---
# Model Card for RootSignals-Judge-Llama-70B

**Root Judge** is a powerful mid-sized LLM that enables reliable and customizable LLM system evaluations.
Root Judge was post-trained from *Llama-3.3-70B-Instruct* on a high-quality, human-annotated dataset mix for pairwise preference judgments and multi-turn instruction following with source citing.
The model weights are freely available in FP8 to facilitate cost-effective research as well as commercial use.

**Root Judge** surpasses Llama-3.3-70B-Instruct and similarly sized open models on instruction following, and
achieves state-of-the-art hallucination detection compared to leading closed models, at a fraction of the cost.

# 1. Intended Use Cases
**Root Judge** is primarily intended to be used as an LLM-as-a-Judge in contexts such as:
- Detecting context-grounded hallucinations, e.g. in *Retrieval-Augmented Generation* (RAG) settings, in an explainable manner that provides a justification for the score
- Pairwise preference judgments, thanks to strong evaluation instruction-following capabilities
- Serving as a custom evaluation metric powered by use-case-specific evaluation rubrics
- Assisting inference-time search or synthetic data tasks that require Best-of-N decisions
- Privacy-focused settings that require local deployments

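The Best-of-N use case can be sketched in a few lines. `judge_score` below is a stand-in for any call to Root Judge (e.g. via the SDK or an OpenAI-compatible endpoint), and the example scores are made up for illustration:

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], judge_score: Callable[[str], float]) -> str:
    """Return the candidate the judge scores highest (ties go to the first)."""
    return max(candidates, key=judge_score)


# Stand-in scorer for illustration; in practice each lookup would be a judge call.
scores = {"draft A": 0.2, "draft B": 0.9, "draft C": 0.4}
best = best_of_n(list(scores), scores.get)
print(best)  # draft B
```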
# 2. Performance Summary

**Root Judge** outperforms leading closed models at detecting instruction-following failures,
providing detailed, structured justifications on long inputs of up to 32k tokens, on both internal benchmarks and the public HaluBench.

## 2.1 Hallucination Detection (in RAG setting)

📊 Benchmark: [HaluBench Test Set](https://huggingface.co/datasets/PatronusAI/HaluBench):

| Rank | Model | Test Samples | Pass@1 Rate (%) | Cost ($) |
| --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 14900 | **86.3** | **3.98** |
| 2 | GPT-4o | 14900 | 86.1 | 33.12 |
| 3 | o1-preview | 14899 | 85.3 | 1062* |
| 4 | Claude Sonnet-3.5 | 14797 | 85.2 | 42.94 |
| 5 | Llama3.1-70b-Instruct | 13969 | 84.7 | 27.43 |
| 6 | o1-mini | 14655 | 83.7 | 156 |
| 7 | Llama3.1-405b-Instruct | 14881 | 83.6 | 269.82 |

`*` = benchmarked as o1-preview; at current o1 prices, without reasoning tokens, the cost would start at $198.74 instead.
Local costs are based on Lambda Labs instances at January 2025 prices.

[🔎 Detailed Performance Breakdown - Hallucination Detection](https://docs.google.com/spreadsheets/d/1NM9VgGG9_-1kQbepeoueUTkvT1bDeRndTD4RM5iV7l4/edit?usp=sharing)

## 2.2 Instruction Following

📊 Instruction-following performance across diverse benchmarks, compared to other open-weights judge and reward models (higher is better):

| Rank | Model | VRAM (GB) | GSM8K (%) | IFEval (%) | MUSR-Murder (%) | MUSR-Object (%) | MUSR-Team (%) | Avg Score | Relative to Root Judge (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **1** | **Root Judge** | 70 | **94.6 ± 0.6** | **93.9** | 52.8 ± 3.2 | 24.6 ± 2.7 | **56.8 ± 3.1** | **64.5** | 100 |
| 2 | Llama-3.3-70B | 140 | 94.4 ± 0.6 | 93.4 | 54.0 ± 3.2 | 23.4 ± 2.7 | 56.0 ± 3.2 | 64.3 | 99.5 |
| 3 | Patronus-70B | 140 | 91.7 ± 0.8 | 83.7 | 54.4 ± 3.2 | 24.6 ± 2.7 | 48.8 ± 3.2 | 60.6 | 93.9 |
| 4 | Nemotron-70B | 70 | 80.1 ± 1.1 | 85.0 | 53.6 ± 3.2 | 23.8 ± 2.7 | 55.6 ± 3.1 | 59.6 | 92.4 |
| 5 | Qwen-2.5-32B | 64 | 87.4 ± 0.9 | 87.5 | 58.8 ± 3.1 | 23.1 ± 2.6 | 45.2 ± 3.2 | 60.4 | 93.6 |
| 6 | Flow Judge | 16 | 78.7 ± 1.1 | 64.6 | **60.8 ± 3.1** | 23.4 ± 2.7 | 35.6 ± 3.0 | 52.6 | 81.5 |
| 7 | Glider | 8 | 78.7 ± 1.1 | 56.5 | 59.2 ± 3.1 | **35.9 ± 3.0** | 43.2 ± 3.1 | 54.7 | 84.8 |

[🔎 Detailed Performance Breakdown | Instruction-following](https://docs.google.com/spreadsheets/d/1cTPQZbUvelSlLkqj4kO-EQXFDkw17WXKHAeGg02-8Qg/edit?usp=sharing)

## 2.3 Root Signals Internal Benchmarks

📊 Benchmark: Root Signals Internal Hallucination Detection Benchmark

![Performance Comparison](assets/hallu_detection_overall.png)
*Image 1: Total pass@1 rates and consistency (delta) assessed via an ensemble of leading 3rd-party models.*


![Rubric Following](assets/if_rates.png)
*Image 2: Custom rubric instruction-following by high-level task.*

**Root Judge** was tested to support complex, user-defined scoring (rating) rubrics over large context sizes. It provides granular qualitative feedback and supports structured evaluation outputs as well as tool calling.

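As a rough illustration of what a structured rubric verdict can look like, the sketch below uses a plain dataclass; the field names are hypothetical, not a fixed schema of the model or platform:

```python
from dataclasses import dataclass


@dataclass
class RubricVerdict:
    """Illustrative shape of a structured verdict for one rubric criterion."""
    criterion: str
    score: float  # normalized to [0, 1], as in the platform's evaluators
    feedback: str  # granular qualitative feedback

    def __post_init__(self) -> None:
        if not 0.0 <= self.score <= 1.0:
            raise ValueError("score must be normalized to [0, 1]")


verdict = RubricVerdict("groundedness", 0.8, "All claims are cited from the context.")
print(verdict.score)  # 0.8
```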
## 2.4 Other Benchmarks

<details>
<summary>📊 RewardBench</summary>

[RewardBench](https://huggingface.co/spaces/allenai/reward-bench)

| Benchmark Task | Score | Total | Accuracy |
|------------------------|-------|-------|-----------|
| alpacaeval-easy | 99.0 | 100 | 0.99 |
| alpacaeval-hard | 93.0 | 95 | 0.97894737 |
| alpacaeval-length | 86.0 | 95 | 0.90526316 |
| donotanswer | 73.5 | 136 | 0.54044118 |
| hep-cpp | 159.0 | 164 | 0.96951220 |
| hep-go | 159.0 | 164 | 0.96951220 |
| hep-java | 161.0 | 164 | 0.98170732 |
| hep-js | 159.0 | 164 | 0.96951220 |
| hep-python | 158.0 | 164 | 0.96341463 |
| hep-rust | 152.0 | 164 | 0.92682927 |
| llmbar-adver-GPTInst | 69.0 | 92 | 0.75 |
| llmbar-adver-GPTOut | 39.0 | 47 | 0.82978723 |
| llmbar-adver-manual | 32.0 | 46 | 0.69565217 |
| llmbar-adver-neighbor | 74.0 | 134 | 0.55223881 |
| llmbar-natural | 94.0 | 100 | 0.94 |
| math-prm | 357.0 | 447 | 0.79865772 |
| mt-bench-easy | 28.0 | 28 | 1.0 |
| mt-bench-hard | 32.0 | 37 | 0.86486486 |
| mt-bench-med | 40.0 | 40 | 1.0 |
| refusals-dangerous | 73.5 | 100 | 0.735 |
| refusals-offensive | 89.0 | 100 | 0.89 |
| xstest-should-refuse | 140.5 | 154 | 0.91233766 |
| xstest-should-respond | 245.0 | 250 | 0.98 |
| Chat | | | 0.96648045 |
| Chat Hard | | | 0.74561404 |
| Safety | | | 0.83986486 |
| Reasoning | | | 0.88103618 |

</details>

Despite our main focus on nuanced and transparent judgment of candidate responses,
we test the judge model checkpoints extensively on public and private benchmarks
to avoid known performance drops such as catastrophic forgetting. We find that the model
preserves the general capabilities of Llama-3.3-70B-Instruct after dynamic weights quantization,
while slightly outperforming it on public instruction-following benchmarks such as IFEval and MuSR.

# 3. Getting Started

## 3.1 Via Root Signals Python SDK

The model is available on our [platform](https://app.rootsignals.ai/register?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals) as part of our evaluation suite, at no additional cost.

Install our [python library](https://github.com/root-signals/rs-python-sdk):
```bash
pip install root-signals
```

Import:
```python
from root import RootSignals

client = RootSignals()
```

Create a custom evaluator powered by **Root Judge**:
```python
my_custom_judge = client.evaluators.create(
    name="Political Text Evaluator",
    intent="To measure the politics-relatedness of a given text",
    predicate="Assess if a text contains political jargon or talks about politics: {{response}}",
    model="RootJudge",
)
```

Execute:
```python
result = my_custom_judge.run(
    response="A defence spending target of 3% of GDP is more likely than the 5% aim pushed by US President Donald Trump, say members of the parliamentary Defence Committee."
)
print(result.score)  # normalized score in [0, 1]
print(result.justification)  # detailed reasoning for the score
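Judge results like these can gate a pipeline directly. A minimal sketch, assuming scored `(response, score)` pairs collected from `my_custom_judge.run(...)`; the data and threshold here are illustrative:

```python
def filter_by_score(results, threshold=0.7):
    """Keep (response, score) pairs whose normalized judge score clears the threshold."""
    return [(resp, score) for resp, score in results if score >= threshold]


# Scores here are made up; in practice each comes from a judge `run(...)` call.
scored = [("on-topic reply", 0.91), ("off-topic reply", 0.32)]
print(filter_by_score(scored))  # [('on-topic reply', 0.91)]
```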

## 3.2 Locally

We recommend [SGLang](https://github.com/sgl-project/sglang) for production use cases, together with *XML tags* around important sections of your prompt. While the model can run on 80 GB of VRAM, we recommend at least 96 GB for evaluating long-context RAG inputs.

SGLang example for a single Nvidia H100 (80 GB):
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d docker.io/lmsysorg/sglang:v0.4.2-cu124-srt \
    python3 -m sglang.launch_server \
    --model-path root-signals/RootSignals-Judge-Llama-70B \
    --host 0.0.0.0 \
    --port 8000 \
    --mem-fraction-static 0.89 \
    --grammar-backend xgrammar \
    --enable-torch-compile \
    --disable-cuda-graph
```

We also validated the model on arm64 with [vLLM](https://github.com/vllm-project/vllm) on Nvidia GH200, with max outputs of up to 64k tokens:
```bash
docker run \
    --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v huggingface:/root/.cache/huggingface \
    --volume /etc/localtime:/etc/localtime:ro \
    -d drikster80/vllm-gh200-openai:v0.6.4.post1 \
    --model root-signals/RootSignals-Judge-Llama-70B \
    --gpu-memory-utilization 0.95 \
    --max-model-len 64k \
    --block_size 16 \
    --enable_prefix_caching
```

Detect hallucinations from context; this example uses HaluBench:
```python
from pprint import pprint

import pandas as pd
from openai import OpenAI
from pydantic import BaseModel

decompose_system_instruction = """
<TASK>
You are a fair judge that detects hallucinations and unjustified assumptions from question-document-answer triplets provided by the user.
Always follow the instructions below and provide your reasoning and verdict in the format specified.
</TASK>

<INSTRUCTIONS>
#1. Identify key elements in the question.
#2. List all relevant facts provided in the document.
#3. Break down the answer into its component claims.
#4. For each claim in the answer:
#a. Is it explicitly supported by the document? If yes, quote the relevant part.
#b. Is it a reasonable inference from the document? If yes, explain the reasoning.
#c. Is it unsupported or contradicted by the document? If yes, explain why.
#5. Check for any information in the answer that's present in the question but not in the document.
#6. Verify that no additional information is introduced in the answer that isn't in the document or question.
#7. Assess if the answer makes any unjustified connections or assumptions.
</INSTRUCTIONS>

<OUTPUT_EXAMPLE>
{"REASONING": "Your reasoning here where you cite the instruction step by number and provide your reasoning", "VERDICT": "PASS" or "FAIL"}
</OUTPUT_EXAMPLE>
"""

decompose_prompt = """
<QUESTION>: {question} </QUESTION>
<DOCUMENT>: {document} </DOCUMENT>
<ANSWER>: {answer} </ANSWER>
""".strip()

testset_df = pd.read_parquet("hf://datasets/PatronusAI/HaluBench/data/test-00000-of-00001.parquet")
testset_df = testset_df.sample(frac=1).reset_index(drop=True)
example_row = testset_df.iloc[0]


class DecomposeResponse(BaseModel):
    REASONING: str
    VERDICT: str


client = OpenAI(base_url="http://localhost:8000/v1")  # point this at e.g. SGLang, OpenRouter, etc.

response = client.beta.chat.completions.parse(
    model="root-signals/RootSignals-Judge-Llama-70B",  # or `RootJudge` if you are using the Root Signals API
    messages=[
        {"role": "system", "content": decompose_system_instruction},
        {"role": "user", "content": decompose_prompt.format(
            question=example_row["question"],
            document=example_row["passage"],
            answer=example_row["answer"])},
    ],
    response_format=DecomposeResponse,
).choices[0].message.parsed

pprint(response.REASONING)
pprint(response.VERDICT)
```

```
> ('Following the instructions: #1, the key element in the question is the '
"nationality of the magazines. #2, the document states that 'The Woman's "
"Viewpoint was a woman's magazine founded in Texas in 1923' and 'Pick Me Up! "
"is a British weekly women's magazine'. #3, the answer claims both magazines "
'are British. #4, checking each claim in the answer: a) The document does not '
"support the claim that The Woman's Viewpoint is British, instead, it says "
"the magazine was founded in Texas. b) There's no reasonable inference from "
"the document that would suggest The Woman's Viewpoint is British. c) The "
"claim about The Woman's Viewpoint is contradicted by the document. #5, the "
'answer introduces information (both being British) not supported by the '
'document. #6, additional information about both magazines being British is '
'introduced in the answer without being present in the document or question. '
'#7, the answer makes an unjustified assumption by stating both magazines are '
"British despite the document clearly stating The Woman's Viewpoint was "
'founded in Texas, implying it is not British. Therefore, the answer fails to '
'accurately reflect the information provided in the document and makes '
'unjustified assumptions based on the information given in the question and '
"document.'")
'FAIL'
```
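To reproduce a pass@1 figure like those in Section 2.1, the parsed verdicts can be compared against HaluBench's gold labels. A minimal aggregation sketch, assuming verdicts were collected by looping the call above over the test set and that gold labels are PASS/FAIL strings:

```python
def pass_at_1(verdicts, labels):
    """Fraction of samples where the judge's PASS/FAIL verdict matches the gold label."""
    if len(verdicts) != len(labels):
        raise ValueError("verdicts and labels must be the same length")
    hits = sum(v.strip().upper() == g.strip().upper() for v, g in zip(verdicts, labels))
    return hits / len(labels)


# Toy data for illustration; real verdicts come from the judge loop above.
verdicts = ["FAIL", "PASS", "PASS", "FAIL"]
gold = ["FAIL", "PASS", "FAIL", "FAIL"]
print(f"pass@1 = {pass_at_1(verdicts, gold):.2%}")  # pass@1 = 75.00%
```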

# 4. Model Details

## 4.1 Overview

- **Developed by:** [Root Signals Inc](https://www.scorable.ai)
- **Model type:** Text-Only Decoder Transformer
- **Language(s) (NLP):** Primarily English
- **Finetuned from model:** meta-llama/Llama-3.3-70B-Instruct

## 4.2 Training Details

- **Training regime:** DPO with IPO loss for 3 epochs, bfloat16 mixed precision on 384 GPUs
- **Hardware Type:** LUMI-G / AMD Radeon Instinct™ MI250X
- **Cloud Provider:** [LUMI Supercomputer](https://lumi-supercomputer.eu)
- **Compute Region:** Finland


# 5. Contact

**Links**
- [Scorable Homepage](https://www.scorable.ai/)
- [Scorable Platform](https://app.scorable.ai/?utm_campaign=55516392-Hugging%20Face&utm_source=https%3A%2F%2Fhuggingface.co%2Froot-signals)
- [Python SDK](https://github.com/root-signals/rs-sdk/blob/main/python/README.md)
- [Python SDK Docs](https://sdk.rootsignals.ai/en/latest/quickstart.html)
- [TypeScript SDK](https://github.com/root-signals/rs-sdk/blob/main/typescript/README.md)
- [Discord](https://discord.gg/EhazTQsFnj)

**Email**
- hello@scorable.ai