Poster in Affinity Workshop: LatinX in AI
A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain
Hugo Massaroli · Leonardo Iara · Emmanuel Iarussi · Viviana Siless
Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains such as criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain. Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark Llama, DeepSeek, and Mistral models on two fairness-sensitive datasets: COMPAS for recidivism prediction and PISA for academic performance forecasting. Fairness is assessed using statistical parity, equal opportunity, and structured Context Association Tests (CAT). We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark, revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
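As a concrete reference for the two group-fairness criteria named in the abstract, the sketch below computes a statistical parity difference and an equal opportunity difference from binary predictions and a binary protected attribute. The function names, NumPy-based implementation, and toy data are illustrative assumptions for exposition, not the paper's actual on-chain evaluation code.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between group 1 and group 0.

    y_pred : array-like of 0/1 model predictions
    group  : array-like of 0/1 protected-attribute indicators
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true-positive rates (recall on the positive class)
    between group 1 and group 0."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(1) - tpr(0)

# Hypothetical toy example in the spirit of a recidivism-style task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group  = [1, 1, 1, 1, 0, 0, 0, 0]
print(statistical_parity_difference(y_pred, group))         # 0.0
print(equal_opportunity_difference(y_true, y_pred, group))  # ~ -0.33
```

A value of 0 on either quantity indicates parity between the two groups under that criterion; the sign shows which group is favored.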