Poster in Affinity Workshop: LatinX in AI
A Transparent Fairness Evaluation Protocol for Open-Source Language Model Benchmarking on the Blockchain
Hugo Massaroli · Leonardo Iara · Emmanuel Iarussi · Viviana Siless
Large language models (LLMs) are increasingly deployed in real-world applications, yet concerns about their fairness persist, especially in high-stakes domains such as criminal justice, education, healthcare, and finance. This paper introduces a transparent evaluation protocol for benchmarking the fairness of open-source LLMs using smart contracts on the Internet Computer Protocol (ICP) blockchain. Our method ensures verifiable, immutable, and reproducible evaluations by executing on-chain HTTP requests to hosted Hugging Face endpoints and storing datasets, prompts, and metrics directly on-chain. We benchmark Llama, DeepSeek, and Mistral models on two fairness-sensitive datasets: COMPAS for recidivism prediction and PISA for academic performance forecasting. Fairness is assessed using statistical parity, equal opportunity, and structured Context Association Tests (CAT). We further extend our analysis with a multilingual evaluation across English, Spanish, and Portuguese using the Kaleidoscope benchmark, revealing cross-linguistic disparities. All code and results are open source, enabling community audits and longitudinal fairness tracking across model versions.
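As a concrete reference for the two group-fairness criteria named in the abstract, the sketch below computes a statistical parity difference and an equal opportunity difference from binary predictions and a binary protected attribute. The function names, NumPy-based implementation, and toy data are illustrative assumptions for exposition, not the paper's actual on-chain evaluation code.

```python
import numpy as np

def statistical_parity_difference(y_pred, group):
    """Difference in positive-prediction rates between group 1 and group 0.

    y_pred : array-like of 0/1 model predictions
    group  : array-like of 0/1 protected-attribute indicators
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true-positive rates (recall on the positive class)
    between group 1 and group 0."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = lambda g: y_pred[(group == g) & (y_true == 1)].mean()
    return tpr(1) - tpr(0)

# Hypothetical toy example in the spirit of a recidivism-style task.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
group  = [1, 1, 1, 1, 0, 0, 0, 0]
print(statistical_parity_difference(y_pred, group))         # 0.0
print(equal_opportunity_difference(y_true, y_pred, group))  # ~ -0.33
```

A value of 0 on either quantity indicates parity between the two groups under that criterion; the sign shows which group is favored.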