# Financial Models Performance Leaderboard
Inspired by the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we evaluate model performance using FailSafe Long Context QA. This evaluation uses the FailSafeQA dataset to assess how reliably models handle long-context question answering in finance.
FailSafeQA reports three critical measures of model performance for finance, including a novel metric for model compliance:

- **LLM Robustness:** uses HELM's definition to assess a model's ability to provide a consistent and reliable answer despite perturbations of the query and context (a rough scoring sketch follows this list).
- **LLM Context Grounding:** assesses a model's ability to detect cases where the problem is unanswerable and to refrain from producing potentially misleading hallucinations.
- **LLM Compliance Score:** a new metric that quantifies the trade-off between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, it evaluates a model's tendency to hallucinate in the presence of missing or incomplete context.
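FailSafeQA's full judge-based scoring is not reproduced on this page, so the following is only a minimal sketch of how the first two measures could be computed, assuming each case receives a quality score in [0, 1]; every function name below is hypothetical:

```python
# Illustrative sketch only: FailSafeQA's actual judge-based scoring is
# more involved, and every name below is hypothetical.

def robustness_score(baseline: float, perturbed_scores: list[float]) -> float:
    """HELM-style robustness: worst-case quality across perturbed variants
    (misspelled, incomplete, out-of-domain, OCR'd) of the same query and
    context. Worst-case is one reading of HELM's definition; average-case
    is another."""
    return min([baseline, *perturbed_scores])

def grounding_score(model_refused: bool, case_answerable: bool) -> float:
    """Context grounding: reward refusals on unanswerable cases (missing or
    irrelevant context) and penalize answers fabricated without support."""
    if not case_answerable:
        return 1.0 if model_refused else 0.0  # refusal avoids a hallucination
    return 0.0 if model_refused else 1.0      # answerable cases should be answered
```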
These scores are combined to determine the top three models on the leaderboard. The combined score is the average of two sub-scores, computed as follows:

Robustness Avg = (Baseline + Robustness Δ) / 2
Context Grounding Avg = (sum of the seven Context Grounding columns) / 7
Combined Score = (Robustness Avg + Context Grounding Avg) / 2
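As a concrete illustration, here is that arithmetic in Python. The input values are hypothetical, not taken from the tables below:

```python
def combined_score(baseline: float, robustness_delta: float,
                   context_grounding_cols: list[float]) -> float:
    """The three formulas above: robustness_delta is the value in the
    Robustness (Δ) column, and context_grounding_cols holds the seven
    columns of the Context Grounding table."""
    robustness_avg = (baseline + robustness_delta) / 2
    context_grounding_avg = sum(context_grounding_cols) / 7
    return (robustness_avg + context_grounding_avg) / 2

# Hypothetical example values:
score = combined_score(0.95, 0.83, [0.90, 0.70, 0.80, 0.60, 0.80, 0.85, 0.80])
print(round(score, 3))  # 0.834
```

Averaging the two sub-scores weights robustness and context grounding equally in the final ranking.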
## Robustness Results
Model Name | Baseline | Misspelled (Δ) | Incomplete (Δ) | Out-of-Domain (Δ) | OCR Context (Δ) | Robustness (Δ) |
---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Llama-70B | **0.98** | 0.94 (↓0.01) | 0.94 (↓0.02) | 0.88 (↓0.07) | 0.91 (↓0.04) | 0.83 (↓0.12) |
## Context Grounding Results
Model Name | Irrelevant Ctx | No Ctx | Ctx Grounding QA | Ctx Grounding TG | Ctx Grounding | Robustness | Compliance |
---|---|---|---|---|---|---|---|
Gemini 2.0 Flash Exp | 0.81 | 0.66 | 0.77 | 0.46 | 0.74 | 0.83 | 0.76 |
Gemini 1.5 Pro 002 | 0.74 | 0.64 | 0.72 | 0.52 | 0.69 | 0.84 | 0.72 |
OpenAI GPT-4o | 0.52 | 0.43 | 0.5 | 0.25 | 0.47 | 0.85 | 0.52 |
OpenAI o1 | 0.56 | 0.55 | 0.57 | 0.45 | 0.55 | 0.81 | 0.59 |
OpenAI o3-mini | 0.67 | 0.51 | 0.63 | 0.27 | 0.59 | **0.9** | 0.63 |
DeepSeek-R1-Distill-Llama-8B | 0.32 | 0.27 | 0.3 | 0.25 | 0.3 | 0.64 | 0.34 |
DeepSeek-R1-Distill-Qwen-14B | 0.49 | 0.21 | 0.36 | 0.27 | 0.35 | 0.82 | 0.4 |
DeepSeek-R1-Distill-Qwen-32B | 0.54 | 0.24 | 0.4 | 0.35 | 0.39 | 0.86 | 0.44 |
DeepSeek-R1-Distill-Llama-70B | 0.5 | 0.27 | 0.41 | 0.22 | 0.38 | 0.89 | 0.43 |
DeepSeek-R1 | 0.51 | 0.22 | 0.39 | 0.2 | 0.37 | 0.8 | 0.41 |
Meta-Llama-3.1-8B-Instruct | 0.67 | 0.63 | 0.7 | 0.27 | 0.65 | 0.7 | 0.66 |
Meta-Llama-3.1-70B-Instruct | 0.46 | 0.37 | 0.48 | 0.37 | 0.47 | 0.8 | 0.51 |
Meta-Llama-3.3-70B-Instruct | 0.5 | 0.4 | 0.47 | 0.31 | 0.45 | 0.82 | 0.49 |
Qwen2.5-7B-Instruct | 0.75 | 0.64 | 0.75 | 0.31 | 0.7 | 0.75 | 0.71 |
Qwen2.5-14B-Instruct | 0.75 | 0.61 | 0.7 | 0.55 | 0.68 | 0.86 | 0.71 |
Qwen2.5-32B-Instruct | 0.89 | **0.68** | 0.82 | 0.55 | 0.79 | 0.85 | 0.8 |
Qwen2.5-72B-Instruct | 0.69 | 0.6 | 0.68 | 0.39 | 0.64 | 0.84 | 0.67 |
Qwen2.5-7B-Instruct-1M | 0.63 | 0.58 | 0.65 | 0.29 | 0.6 | 0.74 | 0.62 |
Qwen2.5-14B-Instruct-1M | 0.78 | 0.53 | 0.69 | 0.37 | 0.65 | 0.8 | 0.68 |
Nemotron-70B-Instruct-HF | 0.52 | 0.48 | 0.52 | 0.39 | 0.5 | 0.82 | 0.54 |
Phi-3-mini-128k-Instruct | 0.54 | 0.34 | 0.47 | 0.24 | 0.44 | 0.58 | 0.46 |
Phi-3-small-128k-Instruct | 0.37 | 0.26 | 0.34 | 0.1 | 0.31 | 0.7 | 0.35 |
Phi-3-medium-128k-Instruct | 0.36 | 0.25 | 0.33 | 0.14 | 0.3 | 0.63 | 0.34 |
Palmyra-Fin-128k-Instruct | **0.95** | 0.66 | **0.83** | **0.65** | **0.8** | 0.83 | **0.81** |
## Top 3 Models
Rank | Model Name | Combined Score |
---|---|---|
1 | Palmyra-Fin-128k-Instruct | 0.842 |
2 | Qwen2.5-32B-Instruct | 0.834 |
3 | Gemini 2.0 Flash Exp | 0.804 |
## About This Leaderboard
This Financial Models Performance Leaderboard compares the performance of various AI models across robustness and context grounding metrics. The data comes from evaluations conducted on February 18, 2025 and reflects each model's ability to handle financial tasks under different conditions.
For more information, or to submit your model for evaluation, contact us at support@writer.com.