# Financial Models Performance Leaderboard
Inspired by the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we evaluate model performance using FailSafe Long Context QA. This evaluation uses the FailSafeQA dataset to assess how reliably models handle long-context question answering in finance.
FailSafeQA reports three critical measures of model performance for finance, including a novel metric for model compliance:

- **LLM Robustness:** uses HELM's definition to assess a model's ability to provide a consistent and reliable answer despite perturbations of the query and context (a rough scoring sketch follows this list).
- **LLM Context Grounding:** assesses a model's ability to detect cases where the problem is unanswerable and to refrain from producing potentially misleading hallucinations.
- **LLM Compliance Score:** a new metric that quantifies the trade-off between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, it evaluates a model's tendency to hallucinate in the presence of missing or incomplete context.
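FailSafeQA's full judge-based scoring is not reproduced on this page, so the following is only a minimal sketch of how the first two measures could be computed, assuming each case receives a quality score in [0, 1]; every function name below is hypothetical:

```python
# Illustrative sketch only: FailSafeQA's actual judge-based scoring is
# more involved, and every name below is hypothetical.

def robustness_score(baseline: float, perturbed_scores: list[float]) -> float:
    """HELM-style robustness: worst-case quality across perturbed variants
    (misspelled, incomplete, out-of-domain, OCR'd) of the same query and
    context. Worst-case is one reading of HELM's definition; average-case
    is another."""
    return min([baseline, *perturbed_scores])

def grounding_score(model_refused: bool, case_answerable: bool) -> float:
    """Context grounding: reward refusals on unanswerable cases (missing or
    irrelevant context) and penalize answers fabricated without support."""
    if not case_answerable:
        return 1.0 if model_refused else 0.0  # refusal avoids a hallucination
    return 0.0 if model_refused else 1.0      # answerable cases should be answered
```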
These scores are combined to determine the top three models on the leaderboard. The combined score is the average of two sub-scores, computed as follows:

Robustness Avg = (Baseline + Robustness Δ) / 2
Context Grounding Avg = (sum of the seven Context Grounding columns) / 7
Combined Score = (Robustness Avg + Context Grounding Avg) / 2
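As a concrete illustration, here is that arithmetic in Python. The input values are hypothetical, not taken from the tables below:

```python
def combined_score(baseline: float, robustness_delta: float,
                   context_grounding_cols: list[float]) -> float:
    """The three formulas above: robustness_delta is the value in the
    Robustness (Δ) column, and context_grounding_cols holds the seven
    columns of the Context Grounding table."""
    robustness_avg = (baseline + robustness_delta) / 2
    context_grounding_avg = sum(context_grounding_cols) / 7
    return (robustness_avg + context_grounding_avg) / 2

# Hypothetical example values:
score = combined_score(0.95, 0.83, [0.90, 0.70, 0.80, 0.60, 0.80, 0.85, 0.80])
print(round(score, 3))  # 0.834
```

Averaging the two sub-scores weights robustness and context grounding equally in the final ranking.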
## Robustness Results
Model Name | Baseline | Misspelled (Δ) | Incomplete (Δ) | Out-of-Domain (Δ) | OCR Context (Δ) | Robustness (Δ) |
---|---|---|---|---|---|---|
DeepSeek-R1-Distill-Llama-70B | **0.98** | 0.94 (↓0.01) | 0.94 (↓0.02) | 0.88 (↓0.07) | 0.91 (↓0.04) | 0.83 (↓0.12) |
## Context Grounding Results
Model Name | Irrelevant Ctx | No Ctx | Ctx Grounding QA | Ctx Grounding TG | Ctx Grounding | Robustness | Compliance |
---|---|---|---|---|---|---|---|
Gemini 2.0 Flash Exp | 0.81 | 0.66 | 0.77 | 0.46 | 0.74 | 0.83 | 0.76 |
Gemini 1.5 Pro 002 | 0.74 | 0.64 | 0.72 | 0.52 | 0.69 | 0.84 | 0.72 |
OpenAI GPT-4o | 0.52 | 0.43 | 0.5 | 0.25 | 0.47 | 0.85 | 0.52 |
OpenAI o1 | 0.56 | 0.55 | 0.57 | 0.45 | 0.55 | 0.81 | 0.59 |
OpenAI o3-mini | 0.67 | 0.51 | 0.63 | 0.27 | 0.59 | **0.9** | 0.63 |
DeepSeek-R1-Distill-Llama-8B | 0.32 | 0.27 | 0.3 | 0.25 | 0.3 | 0.64 | 0.34 |
DeepSeek-R1-Distill-Qwen-14B | 0.49 | 0.21 | 0.36 | 0.27 | 0.35 | 0.82 | 0.4 |
DeepSeek-R1-Distill-Qwen-32B | 0.54 | 0.24 | 0.4 | 0.35 | 0.39 | 0.86 | 0.44 |
DeepSeek-R1-Distill-Llama-70B | 0.5 | 0.27 | 0.41 | 0.22 | 0.38 | 0.89 | 0.43 |
DeepSeek-R1 | 0.51 | 0.22 | 0.39 | 0.2 | 0.37 | 0.8 | 0.41 |
Meta-Llama-3.1-8B-Instruct | 0.67 | 0.63 | 0.7 | 0.27 | 0.65 | 0.7 | 0.66 |
Meta-Llama-3.1-70B-Instruct | 0.46 | 0.37 | 0.48 | 0.37 | 0.47 | 0.8 | 0.51 |
Meta-Llama-3.3-70B-Instruct | 0.5 | 0.4 | 0.47 | 0.31 | 0.45 | 0.82 | 0.49 |
Qwen2.5-7B-Instruct | 0.75 | 0.64 | 0.75 | 0.31 | 0.7 | 0.75 | 0.71 |
Qwen2.5-14B-Instruct | 0.75 | 0.61 | 0.7 | 0.55 | 0.68 | 0.86 | 0.71 |
Qwen2.5-32B-Instruct | 0.89 | **0.68** | 0.82 | 0.55 | 0.79 | 0.85 | 0.8 |
Qwen2.5-72B-Instruct | 0.69 | 0.6 | 0.68 | 0.39 | 0.64 | 0.84 | 0.67 |
Qwen2.5-7B-Instruct-1M | 0.63 | 0.58 | 0.65 | 0.29 | 0.6 | 0.74 | 0.62 |
Qwen2.5-14B-Instruct-1M | 0.78 | 0.53 | 0.69 | 0.37 | 0.65 | 0.8 | 0.68 |
Nemotron-70B-Instruct-HF | 0.52 | 0.48 | 0.52 | 0.39 | 0.5 | 0.82 | 0.54 |
Phi-3-mini-128k-Instruct | 0.54 | 0.34 | 0.47 | 0.24 | 0.44 | 0.58 | 0.46 |
Phi-3-small-128k-Instruct | 0.37 | 0.26 | 0.34 | 0.1 | 0.31 | 0.7 | 0.35 |
Phi-3-medium-128k-Instruct | 0.36 | 0.25 | 0.33 | 0.14 | 0.3 | 0.63 | 0.34 |
Palmyra-Fin-128k-Instruct | **0.95** | 0.66 | **0.83** | **0.65** | **0.8** | 0.83 | **0.81** |
## Top 3 Models
Rank | Model Name | Combined Score |
---|---|---|
1 | Palmyra-Fin-128k-Instruct | 0.842 |
2 | Qwen2.5-32B-Instruct | 0.834 |
3 | Gemini 2.0 Flash Exp | 0.804 |
## About This Leaderboard
This Financial Models Performance Leaderboard compares the performance of various AI models across robustness and context grounding metrics. The data comes from evaluations conducted on February 18, 2025 and reflects each model's ability to handle financial tasks under different conditions.
For more information, or to submit your model for evaluation, contact us at support@writer.com.