Financial Models Performance Leaderboard


Inspired by the 🤗 Open LLM Leaderboard and 🤗 Open LLM-Perf Leaderboard 🏋️, we evaluate model performance using FailSafe Long Context QA. This evaluation leverages the FailSafeQA dataset to assess how well models handle long-context question answering, ensuring robust and reliable performance in extended-context scenarios.
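For reference, the dataset can be pulled from the Hugging Face Hub with the `datasets` library. The snippet below is a minimal sketch; the repository id `Writer/FailSafeQA`, the split name, and the field layout are assumptions — check the Hub for the exact identifiers before running.

```python
# Minimal sketch: load FailSafeQA for long-context QA evaluation.
# The dataset id "Writer/FailSafeQA" and the split name are assumptions;
# verify them on the Hugging Face Hub before running.
from datasets import load_dataset

dataset = load_dataset("Writer/FailSafeQA", split="test")
print(dataset)      # overview of rows and columns
print(dataset[0])   # inspect one example (context, query, reference answer, ...)
```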


FailSafeQA returns three critical measures of model performance for finance, including a novel metric for model compliance:

LLM Robustness: Uses HELM’s definition to assess a model’s ability to provide a consistent and reliable answer despite perturbations of the query and context.

LLM Context Grounding: Assesses a model’s ability to detect cases where the problem is unanswerable and to refrain from producing potentially misleading hallucinations.

LLM Compliance Score: A new metric that quantifies the trade-off between Robustness and Context Grounding, inspired by the classic precision-recall trade-off. In other words, this compliance metric evaluates a model’s tendency to hallucinate in the presence of missing or incomplete context (see the illustrative sketch below).
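As a purely illustrative sketch of how a precision-recall-style trade-off can be summarized in a single number, the snippet below combines a robustness score and a context-grounding score with a harmonic mean, the way F1 balances precision and recall. This is an assumption for illustration only and is not the official FailSafeQA compliance formula, which is defined in the FailSafeQA paper.

```python
def compliance_style_score(robustness: float, grounding: float) -> float:
    """Illustrative only: harmonic-mean combination of robustness and
    context grounding, analogous to how F1 balances precision and recall.
    This is NOT the official FailSafeQA compliance formula."""
    if robustness + grounding == 0:
        return 0.0
    return 2 * robustness * grounding / (robustness + grounding)

# A model that answers aggressively (high robustness, poor grounding)
# scores lower than one that balances both.
print(compliance_style_score(0.95, 0.40))  # ~0.56
print(compliance_style_score(0.80, 0.75))  # ~0.77
```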

These scores are combined to determine the top three models on the leaderboard. The combined score is the average of two aggregates, computed as follows:

Robustness Avg = (Baseline + Robustness Delta) / 2

Context Grounding Avg = (sum of Context Grounding columns) / 7

Combined Score = (Robustness Avg + Context Grounding Avg) / 2
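The aggregation can be expressed directly from the formulas above. The sketch below is a minimal implementation assuming the per-model scores are already available as plain numbers; the function and variable names are illustrative, not part of the leaderboard codebase.

```python
def combined_score(baseline: float,
                   robustness_delta: float,
                   context_grounding_columns: list[float]) -> float:
    """Leaderboard score, following the formulas above:
    Robustness Avg        = (Baseline + Robustness Delta) / 2
    Context Grounding Avg = (sum of the 7 Context Grounding columns) / 7
    Combined Score        = (Robustness Avg + Context Grounding Avg) / 2
    """
    assert len(context_grounding_columns) == 7, "expected 7 Context Grounding columns"
    robustness_avg = (baseline + robustness_delta) / 2
    context_grounding_avg = sum(context_grounding_columns) / 7
    return (robustness_avg + context_grounding_avg) / 2
```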

Robustness Results

| Model Name | Baseline | Misspelled (Δ) | Incomplete (Δ) | Out-of-Domain (Δ) | OCR Context (Δ) | Robustness (Δ) |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Llama-70B | **0.98** | 0.94 (↓0.01) | 0.94 (↓0.02) | 0.88 (↓0.07) | 0.91 (↓0.04) | 0.83 (↓0.12) |
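As a worked example using the row above, and assuming "Robustness Delta" in the formula refers to the value in the Robustness (Δ) column:

```python
# Worked example from the table row above (assumption: "Robustness Delta"
# is the value in the Robustness (Δ) column).
robustness_avg = (0.98 + 0.83) / 2   # = 0.905 for DeepSeek-R1-Distill-Llama-70B
```

The Context Grounding columns are not shown in this table, so the Combined Score cannot be computed from it alone.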