Structured Event Reasoning with Large Language Models
Reasoning about real-life events is a unifying challenge in AI and NLP with profound utility across a variety of domains, yet any error in high-stakes applications like law, medicine, and science could be catastrophic. Able to work with diverse text in these domains, large language models (LLMs) have proven capable of answering questions and solving problems. In this talk, I demonstrate that end-to-end LLMs still systematically fail on reasoning tasks involving complex events. Moreover, their black-box nature affords little interpretability and user control. To address these issues, I propose two general approaches to using LLMs in conjunction with a structured representation of events. The first is a language-based representation involving relations among sub-events that LLMs can learn via fine-tuning. The second is a symbolic representation involving the states of entities, which can be leveraged by either LLMs or deterministic solvers. On a suite of event-reasoning tasks, I show that both approaches outperform end-to-end LLMs in both performance and trustworthiness.
Visual Concept Learning Beyond Appearances: Modernizing a Couple of Classic Ideas
The goal of Computer Vision, as framed by Marr, is to develop algorithms that answer "what", "where", and "when" from visual appearance. The speaker, among others, recognizes the importance of studying the underlying entities and relations beyond visual appearance, following an Active Perception paradigm. This talk presents the speaker's efforts over the last decade, ranging from 1) reasoning beyond appearance for vision-and-language tasks (VQA, captioning, T2I, etc.) and addressing their evaluation misalignment, through 2) reasoning about implicit properties, to 3) the roles of these ideas in a robotic visual concept learning framework. The talk will also feature projects from the Active Perception Group (APG) at the ASU School of Computing and Augmented Intelligence (SCAI) that address emerging national challenges in the automated mobility and intelligent transportation domains.
Advancing Multimodal Retrieval and Generation: From General to Biomedical Domains
This talk explores advances in multimodal retrieval and generation across general and biomedical domains. The first work introduces a multimodal retriever-reader pipeline for vision-based question answering, which uses image-text queries to retrieve and interpret relevant textual knowledge. The second work simplifies this approach with an efficient end-to-end retrieval model that removes dependencies on intermediate models such as object detectors. The final part presents a biomedical multimodal generation model that can classify images and explain the predicted labels in response to text prompts. Together, these works demonstrate significant progress in integrating visual and textual data processing across diverse applications.
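For a concrete picture of the retriever stage, below is a minimal sketch of an embedding-based multimodal retriever, assuming a simple late-fusion design: the image and the question are each mapped to a vector, fused into one query embedding, and scored against a text knowledge base by cosine similarity. The encoders here (embed_image, embed_text, embed_query) are hypothetical placeholders rather than the models from the talk; a real pipeline would use learned vision and text encoders with a trained reader on top.

# Hypothetical sketch of an embedding-based multimodal retriever: an image-text
# query is mapped to a single vector and scored against a text knowledge base.
# The encoders below are simple placeholders, not the speaker's actual models.
import numpy as np

rng = np.random.default_rng(0)
DIM = 256

def embed_text(text: str) -> np.ndarray:
    """Placeholder text encoder: hash tokens into a fixed-size vector."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_image(image: np.ndarray) -> np.ndarray:
    """Placeholder image encoder: random projection of flattened pixels."""
    proj = rng.standard_normal((DIM, image.size))
    vec = proj @ image.ravel()
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_query(image: np.ndarray, question: str) -> np.ndarray:
    """Fuse image and question embeddings into one query vector (simple sum)."""
    vec = embed_image(image) + embed_text(question)
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query_vec: np.ndarray, passages: list[str], k: int = 2) -> list[str]:
    """Rank knowledge-base passages by cosine similarity to the query."""
    scores = [float(query_vec @ embed_text(p)) for p in passages]
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

if __name__ == "__main__":
    knowledge_base = [
        "The Eiffel Tower is located in Paris and was completed in 1889.",
        "Mitochondria are the powerhouse of the cell.",
        "The Great Wall of China stretches thousands of kilometers.",
    ]
    image = rng.random((32, 32, 3))  # stand-in for an input image
    question = "When was this tower completed?"
    for passage in retrieve(embed_query(image, question), knowledge_base):
        print(passage)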
Making Machine Learning Models Safer: Data and Model Perspectives
As machine learning systems are increasingly deployed in real-world settings like healthcare, finance, and scientific applications, ensuring their safety and reliability is crucial. However, many state-of-the-art ML models still suffer from issues such as poor out-of-distribution generalization, sensitivity to input corruptions, large data requirements, and inadequate calibration, limiting their robustness and trustworthiness for critical real-world applications. In this talk, I will first present a broad overview of the different safety considerations for modern ML systems. I will then discuss our recent efforts to make ML models safer from two complementary perspectives: (i) manipulating data and (ii) enriching model capabilities through novel training mechanisms. I will discuss our work on designing new data augmentation techniques for object detection, followed by demonstrating how, in the absence of data from desired target domains, one can leverage pre-trained generative models for efficient synthetic data generation. Next, I will present a new paradigm for training deep networks called model anchoring and show how one can achieve ensemble-like properties with a single model. I will specifically discuss how model anchoring significantly enriches the class of hypothesis functions being sampled and demonstrate its effectiveness through improved performance on several safety benchmarks. I will conclude by highlighting exciting future research directions for producing robust ML models by leveraging multimodal foundation models.
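Since the abstract does not spell out the mechanism, here is a brief, hedged sketch of anchored training as it is commonly described in the literature, offered as an assumption rather than the speaker's exact formulation: each input is reparameterized relative to a reference "anchor" drawn from the training data, the network consumes the pair [anchor, input - anchor], and at inference averaging predictions over several anchors yields ensemble-like behavior from a single model. The names AnchoredMLP and predict_with_anchors are illustrative only.

# Hedged sketch of anchored training (an assumption, not the speaker's exact method):
# the model sees [anchor, input - anchor] pairs during training, and predictions
# are averaged over several randomly drawn anchors at test time.
import torch
import torch.nn as nn

class AnchoredMLP(nn.Module):
    def __init__(self, in_dim: int, num_classes: int, hidden: int = 64):
        super().__init__()
        # Input dimension is doubled because the network consumes the
        # concatenated [anchor, input - anchor] pair.
        self.net = nn.Sequential(
            nn.Linear(2 * in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([anchors, x - anchors], dim=-1))

def predict_with_anchors(model, x, anchor_pool, num_anchors=10):
    """Marginalize over randomly drawn anchors to mimic an ensemble."""
    probs = []
    for _ in range(num_anchors):
        idx = torch.randint(0, anchor_pool.size(0), (x.size(0),))
        probs.append(model(x, anchor_pool[idx]).softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)  # averaged predictive distribution

if __name__ == "__main__":
    torch.manual_seed(0)
    in_dim, num_classes = 16, 3
    train_x = torch.randn(128, in_dim)            # toy stand-in training data
    train_y = torch.randint(0, num_classes, (128,))
    model = AnchoredMLP(in_dim, num_classes)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(100):                          # toy training loop
        idx = torch.randint(0, train_x.size(0), (train_x.size(0),))
        logits = model(train_x, train_x[idx])     # random anchor per example
        loss = nn.functional.cross_entropy(logits, train_y)
        opt.zero_grad(); loss.backward(); opt.step()
    test_x = torch.randn(8, in_dim)
    print(predict_with_anchors(model, test_x, train_x).shape)  # (8, 3)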
Learning Actions from Humans in Video
The prevalent paradigm in computer vision for action understanding is to directly transfer advances in object recognition to actions. In this presentation, I discuss the motivations for an alternative, embodied approach centered on modelling actions rather than objects, survey our recent work along these lines, and outline promising future directions.