


Expos
Expo Talk Panel
Yin Song · Chen Wu

[ West Ballroom C ]

Abstract
This presentation introduces OmniLong, a novel computational framework addressing the fundamental challenge of context length scaling in multimodal large language models (MLLMs). While extended contextual understanding across high-frame-rate videos and lengthy documents represents a critical frontier for practical applications, current approaches necessitate substantial computational infrastructure that creates significant barriers to entry. OmniLong offers a paradigm shift through a cohesive architecture that simultaneously extends context across textual and visual modalities while reducing computational requirements. Through advanced sequence parallelism and strategic CPU-GPU memory management techniques, OmniLong demonstrates superior computational efficiency by successfully fine-tuning models on high-density video content comprising up to 2048 sampled frames while utilizing only 8 A100 GPUs. Empirical evaluation shows OmniLong-enhanced models consistently outperform their foundational counterparts on established benchmarks, with OmniLong-Qwen2.5-VL-7B achieving particularly notable results on the VideoMME leaderboard for video analysis tasks. This talk will present a comprehensive analysis of OmniLong's technical architecture, optimization methodology, and broader implications for democratizing access to state-of-the-art multimodal AI capabilities across research institutions and industrial applications with diverse resource constraints.
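The two levers named above, sequence parallelism and CPU-GPU memory management, can be pictured with a toy PyTorch sketch. This is our reading of the abstract, not OmniLong's code: `shard_sequence` and `forward_with_offload` are hypothetical names, and a real system would overlap the CPU round-trips with compute rather than serialize them.

```python
# Toy sketch, assuming: (1) sequence parallelism = splitting the token/frame
# axis across workers, (2) memory management = parking activations in CPU RAM.
import torch

def shard_sequence(tokens, num_workers):
    """Split a long sequence along the time axis, one shard per worker."""
    return list(torch.chunk(tokens, num_workers, dim=1))

def forward_with_offload(layers, x):
    """Run layers one at a time, offloading each activation to CPU so only
    the active tensor has to fit in accelerator memory."""
    device = x.device
    for layer in layers:
        x = layer(x.to(device))   # compute on the accelerator
        x = x.to("cpu")           # park the activation between layers
    return x

seq = torch.randn(1, 2048, 64)               # stand-in for 2048 sampled frames
shards = shard_sequence(seq, num_workers=8)  # one shard per GPU, in spirit
mlp = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(4)])
out = forward_with_offload(list(mlp), shards[0])
print(out.shape)  # torch.Size([1, 256, 64])
```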
Expo Talk Panel
Yifu Chen · Cristobal Pais

[ West Ballroom B ]

Abstract
We present two innovative applications of Large Language Models (LLMs) at Amazon Pharmacy: (1) an AI assistant for customer support and (2) a medication direction copilot for patient safety. (1) The Pharmacy AI Assistant, launched in March 2024 and built on an LLM with retrieval-augmented generation (RAG), led to an 11% reduction in human support contact rate and a 15% improvement in issue resolution rates. We later introduced a new feature: a hybrid architecture integrating multi-armed bandits with LLMs to dynamically suggest follow-up questions, addressing challenges in customer inquiry articulation while balancing exploration and exploitation in conversational flows, which led to an additional 9% reduction in contact rate. (2) MEDIC (medication direction copilot) emulates pharmacist reasoning by fine-tuning an LLM on a compact set of expert-annotated directions to accurately extract and communicate core prescription components. When compared against two state-of-the-art LLM-based competitors, those systems recorded 1.51 and 4.38 times more near-miss events (errors caught before reaching our patients) than MEDIC. Production deployment demonstrated a 33% reduction in medication direction errors. These systems demonstrate how LLMs, when enhanced with domain expertise and strategic architectural decisions, can significantly improve both customer experience and patient safety in online pharmacy operations.
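As a rough illustration of the hybrid in (1), here is a minimal Thompson-sampling bandit choosing among LLM-drafted follow-up questions. The candidate questions and the feedback signal are invented; the abstract does not specify the production policy.

```python
# Hypothetical sketch: a bandit balances exploration and exploitation over
# which follow-up question to surface, learning from whether it helped.
import random

class ThompsonBandit:
    def __init__(self, arms):
        self.stats = {arm: [1, 1] for arm in arms}  # Beta(1, 1) priors

    def pick(self):
        # Sample a success rate per arm, play the best draw.
        return max(self.stats, key=lambda a: random.betavariate(*self.stats[a]))

    def update(self, arm, helped):
        self.stats[arm][0 if helped else 1] += 1

followups = [                                   # hypothetical LLM drafts
    "Is this about a current order?",
    "Would you like to speak to a pharmacist?",
    "Are you asking about insurance coverage?",
]
bandit = ThompsonBandit(followups)
for _ in range(1000):
    q = bandit.pick()
    helped = random.random() < 0.3 + 0.1 * followups.index(q)  # toy feedback
    bandit.update(q, helped)
print(max(bandit.stats, key=lambda a: bandit.stats[a][0] / sum(bandit.stats[a])))
```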
Expo Talk Panel
Tal Darom

[ West Ballroom D ]

Abstract
This lecture will present the advanced analytics and machine learning-powered features developed by Amazon to enhance the live viewing experience for Thursday Night Football on Prime Video. Leveraging player tracking data, the team developed three novel features: Defensive Vulnerability, which identifies defensive weak spots before the snap; Pressure Alerts, a deep learning model that predicts quarterback pressure; and Coverage Prediction, which forecasts man vs. zone coverage. These features harness Amazon's cloud and edge computing capabilities to deliver real-time insights to both casual and avid fans. The results highlight how sports broadcasting is shifting towards a data-driven approach powered by the latest advancements in artificial intelligence.
Expo Talk Panel
Sulbha Jain

[ West Ballroom D ]

Abstract
Humor is a nuanced and essential facet of human communication, often relying on incongruity, surprise, and cultural context to elicit amusement. This paper presents JokeEval, a computational framework designed to evaluate the quality of AI-generated jokes. Through empirical experiments on both synthetic and open-source datasets, we demonstrate that machine learning techniques—particularly a hybrid Convolutional Neural Network with recurrent layers—can effectively distinguish between “Funny” and “Not Funny” jokes, achieving a statistically significant F1-score of 71.2% on the ColBERT dataset. Our methodology leverages high-dimensional vector embeddings, crowd-sourced human annotations, and diverse evaluation pipelines—including supervised classifiers, deep neural networks, and LLM-as-a-judge protocols—to assess humor at scale. In doing so, we highlight both the promise and current limitations of AI in understanding and generating humor. The results pave the way for more engaging, human-aligned content generation and offer a feedback loop to iteratively improve joke-writing capabilities in virtual assistants and other AI-driven systems.
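For readers unfamiliar with the hybrid architecture named above, a PyTorch sketch of a CNN-plus-recurrent binary classifier follows. Layer sizes are illustrative, not JokeEval's.

```python
# Minimal sketch: convolutions over token embeddings feed a recurrent layer,
# ending in a binary Funny / Not Funny head.
import torch
import torch.nn as nn

class JokeClassifier(nn.Module):
    def __init__(self, vocab=10_000, emb=128, channels=64, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.conv = nn.Conv1d(emb, channels, kernel_size=3, padding=1)
        self.rnn = nn.LSTM(channels, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)               # logit for "Funny"

    def forward(self, tokens):                          # tokens: (batch, seq)
        x = self.embed(tokens).transpose(1, 2)          # (batch, emb, seq)
        x = torch.relu(self.conv(x)).transpose(1, 2)    # (batch, seq, channels)
        _, (h, _) = self.rnn(x)                         # final hidden state
        return self.head(h[-1]).squeeze(-1)

model = JokeClassifier()
logits = model(torch.randint(0, 10_000, (4, 32)))  # batch of 4 token sequences
print(torch.sigmoid(logits))                       # P(funny) per joke
```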
Expo Talk Panel
Pin-Yu Chen

[ West Ballroom C ]

Abstract
Large language models (LLMs) and Generative AI (GenAI) are at the forefront of frontier AI research and technology. With their rapidly increasing popularity and availability, challenges and concerns about their misuse and safety risks are becoming more prominent than ever. In this talk, we introduce a unified computational framework for evaluating and improving a wide range of safety challenges in generative AI. Specifically, we will show new tools and insights to explore and mitigate the safety and robustness risks associated with state-of-the-art LLMs and GenAI models, including (i) safety risks in fine-tuning LLMs, (ii) LLM jailbreak mitigation, (iii) prompt engineering for safety debugging, and (iv) robust detection of AI-generated content.
Expo Talk Panel
Greg Osuri

[ West Ballroom A ]

Abstract
The exponential growth in AI model size has created unprecedented energy demands that challenge traditional computing infrastructure. Recent industry reports have estimated that by 2040, AI inference and training will collectively require 600 terawatt-hours annually—equivalent to the energy consumption of a medium-sized industrial nation. Current hyperscaler architectures introduce critical bottlenecks: geographically concentrated energy demands, transmission constraints, and concerning environmental impacts, with some facilities resorting to fossil fuel consumption to meet power requirements.

Greg Osuri, founder and core contributor of Akash Network, will discuss how decentralized marketplaces efficiently allocate resources across geographically dispersed nodes. He will demonstrate how Akash has achieved approximately 70% resource utilization rates across heterogeneous hardware configurations, including recent breakthroughs in distributed training algorithms that overcome previous limitations in heterogeneous compute environments. The presentation will include a technical analysis of small modular data center architectures optimized for distributed AI workloads, including their integration with renewable energy sources. This will highlight how decentralized approaches can address current energy constraints while democratizing access to compute resources, potentially preventing market concentration that threatens open innovation in AI research.
Expo Talk Panel
Sulbha Jain

[ West Ballroom D ]

Abstract
Evaluating the quality of long-form AI-generated content remains a significant challenge, particularly in achieving consistent alignment with human judgment across diverse formats. This paper presents the Human-Aligned Long-Form Evaluation (HALF-Eval) framework, a generalizable, scalable, and systematic methodology for assessing the quality of AI-generated long-form content, e.g., articles, blogs, and essays. HALF-Eval utilizes a structured checklist-based evaluation to capture essential dimensions of content quality, including depth, coherence, relevance, and evidence support. By leveraging human-annotated data, the framework trains machine learning models to aggregate individual checklist scores into comprehensive quality assessments, enabling automated and reliable classification of content as high- or low-quality. Experimental results demonstrate that HALF-Eval outperforms conventional LLM-based scoring approaches, achieving closer alignment with human evaluators and providing actionable feedback for iterative content improvement. The proposed framework offers a robust foundation for advancing grounded, human-centric evaluation systems and supports the scalable generation of high-quality AI-driven long-form content.
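The aggregation step can be pictured with a small sketch: per-item checklist scores as features, a supervised model mapping them to a high/low-quality label. The checklist items and data below are invented for illustration, not HALF-Eval's.

```python
# Toy sketch of checklist-score aggregation into a quality classification.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Each row: [depth, coherence, relevance, evidence_support] in [0, 1].
scores = rng.random((200, 4))
labels = (scores.mean(axis=1) > 0.55).astype(int)  # toy ground-truth labels

clf = LogisticRegression().fit(scores, labels)
article = np.array([[0.9, 0.7, 0.8, 0.4]])
print("high quality" if clf.predict(article)[0] else "low quality",
      clf.predict_proba(article)[0, 1])
```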
Expo Talk Panel
Rachel Lawrence

[ West Ballroom C ]

Abstract
Recent Large Language Models (LLMs) have reported high accuracy on reasoning benchmarks. However, it is still unclear whether the observed results arise from true “reasoning” or from statistical recall of the training set. Inspired by the ladder of causation (Pearl, 2009) and its three levels (associations, interventions and counterfactuals), this paper introduces RE-IMAGINE: a framework to characterize a hierarchy of reasoning ability in LLMs, alongside an automated pipeline to generate problem variations at different levels of the hierarchy. By altering problems in an intermediate symbolic representation, RE-IMAGINE generates arbitrarily many problems that are not solvable using memorization alone. Moreover, the framework is general and can work across reasoning domains, including math, code, and logic. We demonstrate our framework on four widely-used benchmarks to evaluate several families of LLMs, and observe reductions in performance when the models are queried with problem variations. These assessments indicate a degree of reliance on statistical recall for past performance, and open the door to further research targeting skills across the reasoning hierarchy.
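A toy version of the mutation idea, assuming a problem held in an intermediate symbolic form (template plus variables). The template and mutation rule here are ours, not the paper's pipeline.

```python
# Intervene on symbolic variables to mint fresh problem variants whose
# answers cannot be recalled from training data.
import random

TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples now?"

def render(vars_):
    """Return the surface-form problem and its ground-truth answer."""
    return TEMPLATE.format(**vars_), vars_["a"] + vars_["b"]

def mutate(vars_):
    """Intervention: resample the symbolic variables."""
    out = dict(vars_)
    out["a"], out["b"] = random.randint(2, 99), random.randint(2, 99)
    return out

base = {"name": "Maya", "a": 3, "b": 4}
print(*render(base))             # the original (possibly memorized) problem
for _ in range(3):
    print(*render(mutate(base))) # unmemorizable variants with fresh answers
```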
Expo Talk Panel
Dan Busbridge · Jason Ramapuram · Russ Webb · Floris Weers

[ West Ballroom A ]

Abstract
Smaller models are cheaper to serve, faster, use less battery, produce less heat, have a lower inference carbon footprint, and are easier to study for academics. Historically, small, capable models have been expensive to train. Knowledge distillation can reduce pretraining costs, yet is poorly understood. We find a distillation scaling law that enables efficient pretraining strategies, bringing our understanding of distillation closer to our understanding of supervised pretraining.
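As background for the talk, a minimal sketch of the standard knowledge-distillation objective: the student matches the teacher's softened output distribution, mixed with the hard-label loss. Temperature and mixing weight are the usual hyperparameters, not values from the scaling law.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * T * T                               # standard T^2 gradient rescaling
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(8, 100, requires_grad=True)  # student logits (batch, vocab)
t = torch.randn(8, 100)                      # frozen teacher logits
y = torch.randint(0, 100, (8,))
loss = distillation_loss(s, t, y)
loss.backward()
print(loss.item())
```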
Expo Talk Panel
Toufique Ahmed

[ West Ballroom B ]

Abstract
Recent SWE agents generate code to resolve issues. While great for productivity, such systems make good tests even more important. Unfortunately, most prior work on test generation assumes that the code under test already exists. Instead, we are looking at the case where the code patch that resolves the issue has not yet been written. We introduce Otter, an LLM-based solution for generating tests from issues. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planning stage. As of March 9, 2025, Otter is the SOTA for this scenario, topping the SWT-Bench Verified leaderboard.
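The check-and-repair loop can be sketched in a few lines, with Python's own parser standing in for Otter's rule-based analysis; the single repair rule below is hypothetical.

```python
# Toy sketch: validate an LLM-drafted test and patch a common mistake
# before accepting it.
import ast

def check_and_repair(test_src):
    try:
        tree = ast.parse(test_src)
    except SyntaxError as err:
        raise ValueError(f"unrecoverable draft: {err}")   # send back to the LLM
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    if "pytest" in names and "import pytest" not in test_src:
        test_src = "import pytest\n" + test_src           # rule-based repair
    return test_src

draft = """\
def test_divide_by_zero():
    with pytest.raises(ZeroDivisionError):
        1 / 0
"""
print(check_and_repair(draft))
```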
Expo Talk Panel
Marc Khoury

[ West Ballroom D ]

Abstract
Hudson River Trading is a global trading firm and the leader in applying deep learning to financial markets. Every day our models process terabytes of market data from every financial product in the world, and must predict and act robustly under regime shifts, against adversarial participants, and under tight latency constraints. In this talk, we'll describe the problem that researchers at HRT actually solve. We'll introduce the rich but noisy market data that HRT processes and describe the businesses of providing liquidity and price discovery. Using data equivalent to trillions of tokens, we'll talk about the novel modeling, regularization, and engineering challenges we've solved to build the most predictive foundation models for financial markets in the world.
Expo Talk Panel
Jonathan Siddharth

[ West Meeting Room 220-222 ]

Abstract
To make AI models truly useful in real-world settings, we need better ways to measure their performance. This talk will focus on how we can improve benchmarks, ensuring LLMs are tested in ways that reflect actual business challenges.

Jonathan will discuss how using real user feedback and industry-specific examples can create more meaningful tests for AI models. We’ll explore ways to measure AI performance based on practical tasks that require applying the model’s conceptual understanding. This will complement the many existing benchmarks that focus on evaluating AI models across a range of conceptual understanding tasks.

By designing evaluation methods that reflect real-world use, we can help bridge the gap between research and business, making AI more effective and reliable in everyday applications.
Expo Talk Panel
Raya Horesh

[ West Ballroom A ]

Abstract
Codifying context in data represents not just a technical challenge, but a necessary evolution in how we imbue artificial systems with the nuanced understanding that defines human intelligence. As machine learning systems grow increasingly complex, the demand for high-quality data continues to rise dramatically, particularly in domains where real-world data is scarce or where expert annotations are prohibitively expensive. Despite significant advancements in synthetic data generation techniques, a fundamental challenge persists: synthetic data often lacks the rich contextual dimensions found in naturally occurring data. Synthetic data generation must evolve beyond non-robust performance metrics to incorporate crucial contextual elements—historical, social, human, and physical—that give data meaning in real-world applications. Current approaches to synthetic data frequently produce technically valid but contextually impoverished datasets, limiting their effectiveness when deployed in complex environments. Emerging strategies for codifying context include the use of personas, AI constitutions, value systems, and expert/domain knowledge. One such strategy is the Situated Principles (SPRI) framework, which demonstrates how context-situated principles, generated dynamically for each input query, can guide large language models to produce responses that align with complex human values without extensive human oversight. This approach suggests task-agnostic pathways for embedding contextual richness in synthetic data generation pipelines. As the …
Expo Talk Panel
Ido Levy

[ West Ballroom C ]

Abstract
How can AI revolutionize how businesses use and interact with software at every level? Today’s emerging computer-using generalist agents offer a glimpse of the future, acting as autonomous operators capable of orchestrating tasks across browsers, desktop applications, and enterprise APIs—whether for customer support, data analytics, or regulatory compliance. In this talk, we share IBM’s broader vision for universal, enterprise-ready AI agents that unify end-to-end workflows, seamlessly adapt to complex digital environments, and operate with minimal specialized programming. We highlight how flexible yet robust frameworks, embedded safety and reliability, and far-reaching business impact come together to enable end-to-end automation that reduces operational costs, enhances human productivity, and unlocks entirely new categories of innovation. We also discuss how these agents fit into IBM’s overarching AI roadmap, ensuring alignment with trustworthiness, scalability, smaller models, and human-machine collaboration.

Attendees will leave with a holistic understanding of the breakthroughs and the challenges ahead—and why universal generalist agents may represent the next great leap in enterprise AI.
Expo Talk Panel
Nick Erickson · Abdul Fatir Ansari · Boran Han · Huzefa Rangwala

[ East Ballroom A ]

Abstract
Real-world data is messy, heterogeneous, and increasingly complex. Simultaneously, production systems must operate at scale with consistent performance. This paradox creates a significant challenge: how do we build sophisticated models that can handle complex data while maintaining production reliability? We present AWS's OSS advancements to bridge this gap by automating critical but time-consuming aspects of the ML pipeline. By providing low-code, easy-to-use frameworks that can handle tabular, graph, time series, and multi-modal data, we're democratizing access to sophisticated ML capabilities. This means businesses of all sizes, not just tech giants, can leverage ML for competitive advantage. In this talk, we will showcase state-of-the-art algorithms and research advancements, such as techniques for automatic graph construction from tabular data and efficient tabular model selection. Furthermore, we will share our recent approaches to push the boundaries of the AutoML domain with AutoGluon, including the integration of foundation models for improved time series forecasting and tabular data prediction, and an LLM-powered agent system for automated data science.
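For a concrete feel of the low-code style the talk refers to, here is the pattern from AutoGluon's tabular quickstart (assuming `pip install autogluon`; the dataset URL is the one used in AutoGluon's own documentation).

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv")
test = TabularDataset("https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv")

# One call searches over models, trains them, and builds an ensemble
# for the chosen label column.
predictor = TabularPredictor(label="class").fit(train)
print(predictor.evaluate(test))      # held-out metrics
print(predictor.leaderboard(test))   # per-model comparison
```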
Expo Demonstration
Mandana Vaziri

[ West Exhibition Hall A-B1 ]

Abstract
Programming with LLMs requires careful orchestration of prompts with workflow logic and agentic patterns. Unfortunately, previous frameworks rely on deeply buried prompts that work well for a specific model and pattern but are difficult to adapt to new settings. We designed PDL, a new programming language for LLMs. PDL keeps prompts at the forefront using YAML, with a small set of simple but powerful logic blocks to assemble workflows or agents. Prompt contexts are accumulated implicitly, simplifying model chaining. PDL comes with visual tools for observability and experimentation and the implementation automatically parallelizes model calls. PDL supports a variety of model providers, including but not limited to IBM's watsonx-ai with Granite models and the new granite-io library.
Expo Demonstration
Jin Tan Ruan · Dario Fumarola · Jess Torres

[ West Exhibition Hall A-B1 ]

Abstract
Modern deep-RL policies crack under distribution shifts because every new environment demands another slog of back-prop. We flip the script: first train once, then lock every weight and steer behaviour with three neuromodulatory “mood knobs”:

- Dopamine-like reward gain fires up or damps down the urge to chase pay-offs.
- Serotonin 5-HT2-like exploration gain widens or narrows the agent’s repertoire.
- Serotonin 5-HT1-like risk penalty injects real-time caution when danger spikes.

These scalars mimic the way real neuromodulators gate cortical circuits: they change a neuron’s responsiveness in milliseconds without rewriting the synapse. That gives us a clean separation between slow structural learning (the frozen network) and fast functional adaptation (the gains). Shifting the knobs costs almost nothing computationally yet lets one policy jump across grid mazes, procedurally generated dungeons, and even onto a Jetson-Nano quadruped robot dog, all while dodging the usual “catastrophic forgetting” trap.

The takeaway: treating reinforcement learning agents like brains, plastic weights plus fluid neuro-chemistry, delivers instant, reversible behavioural tuning and makes real-world deployment far less brittle.
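One plausible reading of the three knobs in code: reward gain scales the frozen network's action values, exploration gain acts as a softmax temperature, and risk penalty subtracts a per-action risk estimate. This sketch is our interpretation of the abstract, not the authors' implementation.

```python
# Frozen policy, fluid "mood": three scalars reshape action selection
# at inference time with no gradient updates.
import numpy as np

def act(q_values, risk, reward_gain=1.0, explore_gain=1.0, risk_penalty=0.0):
    utility = reward_gain * q_values - risk_penalty * risk
    logits = utility / max(explore_gain, 1e-6)  # higher gain = flatter policy
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(q_values), p=probs)

q = np.array([1.0, 0.8, 0.2])      # frozen network's action values
risk = np.array([0.9, 0.1, 0.0])   # e.g., estimated chance of a fall
print(act(q, risk))                       # default mood
print(act(q, risk, risk_penalty=2.0))     # cautious mood
print(act(q, risk, explore_gain=5.0))     # exploratory mood
```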
Expo Demonstration
Aastha Varma · Sushant Moon · Jess Torres

[ West Exhibition Hall A-B1 ]

Abstract
We present a new approach to embodied intelligence—one grounded in modular AI systems combining small language models (SLMs), vision models, and speech interfaces. This architecture enables fast, intuitive agent behavior—even in low-resource, real-world environments.

Our prototype, an AI-powered exoskeleton, performs physical tasks through natural human interaction. It operates in three modes: Shadow (mimic gestures), Command (respond to voice), and Training (learn by demonstration). High-level reasoning is handled by SLMs, while fast, modular controllers manage low-level control.

This approach removes the need for heavy simulations and makes it easier for engineers and researchers to build real-world systems with limited resources.
Expo Demonstration
Hongfei Tian

[ West Exhibition Hall A-B1 ]

Abstract
Resolving issues from an issue tracker on a source-code repository is tedious and expensive when done by hand. Recently, the SWE-bench leaderboard has seen submissions by several LLM-based agents that do this automatically. Unfortunately, these agents rely on closed-source frontier models, making them expensive and raising data-sharing concerns for industrial use. In contrast, we built IBM Software Agent for Code (ISAC), which works with a variety of open-source models such as Llama, Granite, and Mistral. ISAC uses sub-agents that are specialized for the sub-tasks of localization, editing, and testing. Each sub-task is within reach of the capabilities of an open-source model. Furthermore, ISAC uses automated checking and repair of various common mistakes made by models, uses structured formats for data passed between sub-agents, and uses ensembling at multiple levels.
Expo Demonstration
Gegi Thomas

[ West Exhibition Hall A-B1 ]

Abstract
AI agents have been growing in significance as the next-generation technology for increasing user productivity. However, state-of-the-art agents have mostly been restricted to proofs of concept and great demos. We take a step towards production-ready AI agents by showcasing a number of innovative agentic middleware components. These components integrate seamlessly with the LLM to improve the robustness and efficiency of the agent.
Expo Demonstration
Yang Zhang

[ West Exhibition Hall A-B1 ]

Abstract
Speech language models refer to language models with speech processing and understanding capabilities. One key desirable capability for speech language models is the ability to capture the intricate interdependency between content and prosody, which many existing works fail to accomplish satisfactorily. We propose a speech language model that explicitly represents the prosody information and its relationship with text and thus is surprisingly capable of generating expressive speech appropriate to the context.

In this demo, we combine our speech modeling technology with multi-modal language models into an expressive AI sports commentary generation system. The system analyzes tennis game videos and generates expressive play-by-play speech commentary. Notably, the system can detect the excitement level of the play from crowd and player reactions and adjust the excitement level of the generated speech accordingly.
Expo Demonstration
Divya Sankar

[ West Exhibition Hall A-B1 ]

Abstract
Large Language Model (LLM) agents are increasingly being used for code intelligence, automated reasoning, and software analysis. However, their effectiveness depends on their ability to perform deep, structured analysis across multiple programming languages. This workshop will introduce CLDK (CodeLLM DevKit), a powerful multilingual static and dynamic analysis framework designed to supercharge LLM-based coding agents. The demos will illustrate how CLDK enables fine-grained program understanding, reasoning, and transformation across diverse programming languages. We will demonstrate how CLDK can be integrated with LLM agents to enhance tasks such as code refactoring, automated debugging, test generation, and vulnerability detection. The workshop will feature interactive discussions, live demos, and hands-on exercises to help researchers and developers build next-generation AI-powered software agents.
Expo Demonstration
Raviv Gal

[ West Exhibition Hall A-B1 ]

Abstract
Modernizing legacy mainframe applications is critical for enterprises seeking agility, security, and competitiveness. While these systems remain robust, they are becoming increasingly difficult to maintain due to skill shortages and integration complexity. We present WCA4Z, an AI-powered framework for understanding and transforming legacy code. WCA4Z combines static analysis, multi-agent systems, and large language models to automate key stages of the modernization lifecycle, including semantic code comprehension, modular refactoring, COBOL-to-Java transformation, and equivalence validation. The framework also offers strong support for LLM lifecycle management through configurable, reproducible workflows and an integrated dashboard.
Expo Workshop
Anoop Kumar · Alfy Samuel · Vivek Datla · Geoff Pleiss · Sanghamitra Dutta · Michael Kirchhof

[ West Ballroom B ]

Abstract
The ability of Large Language Models (LLMs) to accurately estimate uncertainty is not just a theoretical concern; it’s a fundamental bottleneck hindering their safe and effective deployment in high-stakes, industrial-scale applications. The gap between model confidence and actual correctness poses an immediate and escalating risk. To mitigate these risks, this workshop convenes leading industry experts and academic researchers to confront the urgent challenges in LLM uncertainty estimation. We must define the state of the art, establish rigorous evaluation standards, and forge a path toward reliable AI. This workshop will focus on:
- Calibration: How can we ensure LLMs’ confidence levels align with their true accuracy?
- Confidence-Aware Generation: What novel methods can enable LLMs to express their own uncertainty during content creation?
- Out-of-Distribution Detection: How do we equip LLMs to recognize and flag inputs that lie outside their training data?
- Uncertainty Communication: What are the most effective techniques for conveying LLM uncertainty to end-users, fostering trust and informed decision-making?
- Benchmarking: What metrics best measure how well models express and quantify uncertainty?

The insights and collaborations generated here will directly shape the future of LLM development and deployment.
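As one concrete handle on the calibration bullet above, a minimal sketch of expected calibration error (ECE), which bins predictions by stated confidence and compares each bin's average confidence to its accuracy:

```python
import numpy as np

def ece(confidences, correct, n_bins=10):
    """Expected calibration error: occupancy-weighted |confidence - accuracy|."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            total += mask.mean() * gap   # weight by bin occupancy
    return total

rng = np.random.default_rng(0)
conf = rng.uniform(0.5, 1.0, 1_000)         # model's stated confidence
correct = (rng.random(1_000) < conf * 0.8)  # a deliberately overconfident model
print(f"ECE = {ece(conf, correct.astype(float)):.3f}")
```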
Expo Workshop
Daniele Rosa · Austin Zhang · Melissa Rad · Matteo Sesia

[ West Ballroom A ]

Abstract
This dynamic 90-minute session features a series of engaging lightning talks showcasing the forefront of AI and Machine Learning within the financial services industry. Discover novel in-house innovations, including Grembe, a Capital One-developed system leveraging graph embeddings on transactional data for enhanced financial understanding across fraud detection and customer behavior modeling. Explore MACAW, an Advanced Quantitative Method employing multi-agentic Large Language Model workflows to tackle complex financial reasoning. Learn about our research on Fortifying Conversational AI through intelligent input guardrails that enhance the security and reliability of LLM-driven interactions. A key highlight will be spotlights on emerging research and innovative concepts from our academic partners as they address critical challenges in AI for finance. We aim to feature several distinct examples of this cutting-edge work, providing insights into novel approaches and potential future directions. Examples of explored areas include trustworthy and responsible AI (e.g., with Columbia University), pioneering reliable and ethical decision-making (e.g., with USC), and building world models for financial decision-making (e.g., with the University of Maryland). This session will provide ICML participants with a concise yet comprehensive overview of impactful AI research and applications in finance, fostering conversation and dialogue about notable advances, ongoing research, and critical challenges …
Expo Talk Panel
Bryan Perozzi · Michael Galkin

[ West Ballroom D ]

Abstract
This will be a 30-60 minute presentation covering our ongoing work to generalize graph learning models across tasks. We’ll provide an overview of Graph Foundation Models (GFMs), defining them as single models designed to learn transferable representations for generalization across diverse graphs and tasks, contrasting them with traditional graph learning approaches. Then we’ll discuss the motivation for GFMs, advocating for the need for transferable learning and generalization. We’ll highlight successful GFM examples in link prediction and node classification, while also acknowledging open challenges such as feature heterogeneity and task generalization. Finally, we’ll briefly explore the intersection of GFMs and Large Language Models (LLMs), including text-space approaches and enhancing LLM reasoning with graph structures.

We expect that this Expo will be of broad interest to ICML attendees.

All presenters are experts currently working in this area. Bryan Perozzi has been working on graph machine learning for 10+ years and has 20,000+ citations in the area. Michael Galkin has 2,800 citations and an h-index of 28.
Expo Workshop
Fnu Disha · Jason Whang · biancen xie

[ West Ballroom C ]

Abstract
This workshop will explore cutting-edge research in evaluating and ensuring the trustworthiness of Generative AI, including Large Language Models (LLMs) and Diffusion Models. As these models become increasingly integrated into decision-making, robust evaluation is crucial. We'll delve into diverse strategies for building more reliable Generative AI across various applications. Topics include:
- Holistic Evaluation: Datasets, metrics, and methodologies.
- Trustworthiness:
  - Truthfulness: Addressing misinformation, hallucinations, inconsistencies, and biases.
  - Safety & Security: Preventing harmful and toxic content, and protecting privacy.
  - Ethics: Aligning with social norms, values, regulations, and laws.
- User-Centric Assessment: Evaluating models from a user perspective.
- Multi-Perspective Evaluation: Focusing on reasoning, knowledge, problem-solving, and user alignment.
- Cross-Modal Evaluation: Integrating text, image, audio, and other modalities.

This workshop aims to bring together researchers from machine learning, data mining, and related fields to foster interdisciplinary collaboration. Through invited talks, paper presentations, and panel discussions, we aim to share insights and spark collaborations between academia and industry. Researchers from various fields, including Data Mining, Machine Learning, NLP, and Information Retrieval, are encouraged to participate.
Expo Talk Panel
Andrew McNamara · Ben Lafferty · Michael Garner

[ West Meeting Room 220-222 ]

Abstract
In this talk we will discuss how we leverage the latest LLM and agentic patterns to create a Shopify assistant, Sidekick, with multiple skills to perform actions on the Shopify platform on behalf of merchants. We’ll talk about curation of datasets, tooling, MCP, post-training techniques (SFT and reinforcement learning with GRPO), prompting, structured generation through CFG, agent evaluation, experimentation, and even peek down several briefly-explored rabbit holes. We will also demonstrate the orchestration of the models, systems, and solutions that serve and improve Sidekick as it is currently offered to Shopify merchants.
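One ingredient named above, GRPO, can be illustrated by its group-relative advantage: several sampled completions for the same prompt are scored, and each is credited relative to its group. The rewards below are toy values, not Sidekick data.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within a group of samples for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Four sampled responses to one merchant request, scored by an evaluator
# (hypothetical numbers):
rewards = [0.2, 0.9, 0.4, 0.9]
print(group_relative_advantages(rewards))
```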