Structured Event Reasoning with Large Language Models
Reasoning about real-life events is a unifying challenge in AI and NLP with profound utility across a variety of domains, yet any error in high-stakes applications like law, medicine, and science could be catastrophic. Able to work with diverse text in these domains, large language models (LLMs) have proven capable of answering questions and solving problems. In this talk, I demonstrate that end-to-end LLMs still systematically fail on reasoning tasks involving complex events. Moreover, their black-box nature affords little interpretability and user control. To address these issues, I propose two general approaches to using LLMs in conjunction with a structured representation of events. The first is a language-based representation involving relations among sub-events that LLMs can learn via fine-tuning. The second is a symbolic representation involving the states of entities, which can be leveraged by either LLMs or deterministic solvers. On a suite of event-reasoning tasks, I show that both approaches outperform end-to-end LLMs in both performance and trustworthiness.
Visual Concept Learning Beyond Appearances: Modernizing a Couple of Classic Ideas
The goal of Computer Vision, as framed by Marr, is to develop algorithms that answer "what", "where", and "when" from visual appearance. The speaker, among others, recognizes the importance of studying the underlying entities and relations beyond visual appearance, following an Active Perception paradigm. This talk presents the speaker's efforts over the last decade, ranging from 1) reasoning beyond appearance for vision-and-language tasks (VQA, captioning, T2I, etc.) and addressing their evaluation misalignment, through 2) reasoning about implicit properties, to 3) the roles of these ideas in a robotic visual concept learning framework. The talk will also feature projects from the Active Perception Group (APG) at the ASU School of Computing and Augmented Intelligence (SCAI) that address emerging national challenges in the automated mobility and intelligent transportation domains.
Advancing Multimodal Retrieval and Generation: From General to Biomedical Domains
This talk explores advances in multimodal retrieval and generation across general and biomedical domains. The first work introduces a multimodal retriever-reader pipeline for vision-based question answering, which uses image-text queries to retrieve and interpret relevant textual knowledge. The second work simplifies this approach with an efficient end-to-end retrieval model that removes dependencies on intermediate models such as object detectors. The final part presents a biomedical multimodal generation model that can classify images and explain the predicted labels in response to text prompts. Together, these works demonstrate significant progress in integrating visual and textual data processing across diverse applications.
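For a concrete picture of the retriever stage, below is a minimal sketch of an embedding-based multimodal retriever, assuming a simple late-fusion design: the image and the question are each mapped to a vector, fused into one query embedding, and scored against a text knowledge base by cosine similarity. The encoders here (embed_image, embed_text, embed_query) are hypothetical placeholders rather than the models from the talk; a real pipeline would use learned vision and text encoders with a trained reader on top.

# Hypothetical sketch of an embedding-based multimodal retriever: an image-text
# query is mapped to a single vector and scored against a text knowledge base.
# The encoders below are simple placeholders, not the speaker's actual models.
import numpy as np

rng = np.random.default_rng(0)
DIM = 256

def embed_text(text: str) -> np.ndarray:
    """Placeholder text encoder: hash tokens into a fixed-size vector."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_image(image: np.ndarray) -> np.ndarray:
    """Placeholder image encoder: random projection of flattened pixels."""
    proj = rng.standard_normal((DIM, image.size))
    vec = proj @ image.ravel()
    return vec / (np.linalg.norm(vec) + 1e-8)

def embed_query(image: np.ndarray, question: str) -> np.ndarray:
    """Fuse image and question embeddings into one query vector (simple sum)."""
    vec = embed_image(image) + embed_text(question)
    return vec / (np.linalg.norm(vec) + 1e-8)

def retrieve(query_vec: np.ndarray, passages: list[str], k: int = 2) -> list[str]:
    """Rank knowledge-base passages by cosine similarity to the query."""
    scores = [float(query_vec @ embed_text(p)) for p in passages]
    top = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in top]

if __name__ == "__main__":
    knowledge_base = [
        "The Eiffel Tower is located in Paris and was completed in 1889.",
        "Mitochondria are the powerhouse of the cell.",
        "The Great Wall of China stretches thousands of kilometers.",
    ]
    image = rng.random((32, 32, 3))  # stand-in for an input image
    question = "When was this tower completed?"
    for passage in retrieve(embed_query(image, question), knowledge_base):
        print(passage)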
Making Machine Learning Models Safer: Data and Model Perspectives
As machine learning systems are increasingly deployed in real-world settings like healthcare, finance, and scientific applications, ensuring their safety and reliability is crucial. However, many state-of-the-art ML models still suffer from issues such as poor out-of-distribution generalization, sensitivity to input corruptions, large data requirements, and inadequate calibration, limiting their robustness and trustworthiness for critical real-world applications. In this talk, I will first present a broad overview of the different safety considerations for modern ML systems. I will then discuss our recent efforts to make ML models safer from two complementary perspectives: (i) manipulating data and (ii) enriching model capabilities through novel training mechanisms. I will discuss our work on designing new data augmentation techniques for object detection, followed by demonstrating how, in the absence of data from desired target domains, one can leverage pre-trained generative models for efficient synthetic data generation. Next, I will present a new paradigm for training deep networks called model anchoring and show how one can achieve ensemble-like properties with a single model. I will specifically discuss how model anchoring significantly enriches the class of hypothesis functions being sampled and demonstrate its effectiveness through improved performance on several safety benchmarks. I will conclude by highlighting exciting future research directions for producing robust ML models by leveraging multimodal foundation models.
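Since the abstract does not spell out the mechanism, here is a brief, hedged sketch of anchored training as it is commonly described in the literature, offered as an assumption rather than the speaker's exact formulation: each input is reparameterized relative to a reference "anchor" drawn from the training data, the network consumes the pair [anchor, input - anchor], and at inference averaging predictions over several anchors yields ensemble-like behavior from a single model. The names AnchoredMLP and predict_with_anchors are illustrative only.

# Hedged sketch of anchored training (an assumption, not the speaker's exact method):
# the model sees [anchor, input - anchor] pairs during training, and predictions
# are averaged over several randomly drawn anchors at test time.
import torch
import torch.nn as nn

class AnchoredMLP(nn.Module):
    def __init__(self, in_dim: int, num_classes: int, hidden: int = 64):
        super().__init__()
        # Input dimension is doubled because the network consumes the
        # concatenated [anchor, input - anchor] pair.
        self.net = nn.Sequential(
            nn.Linear(2 * in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, x: torch.Tensor, anchors: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([anchors, x - anchors], dim=-1))

def predict_with_anchors(model, x, anchor_pool, num_anchors=10):
    """Marginalize over randomly drawn anchors to mimic an ensemble."""
    probs = []
    for _ in range(num_anchors):
        idx = torch.randint(0, anchor_pool.size(0), (x.size(0),))
        probs.append(model(x, anchor_pool[idx]).softmax(dim=-1))
    return torch.stack(probs).mean(dim=0)  # averaged predictive distribution

if __name__ == "__main__":
    torch.manual_seed(0)
    in_dim, num_classes = 16, 3
    train_x = torch.randn(128, in_dim)            # toy stand-in training data
    train_y = torch.randint(0, num_classes, (128,))
    model = AnchoredMLP(in_dim, num_classes)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(100):                          # toy training loop
        idx = torch.randint(0, train_x.size(0), (train_x.size(0),))
        logits = model(train_x, train_x[idx])     # random anchor per example
        loss = nn.functional.cross_entropy(logits, train_y)
        opt.zero_grad(); loss.backward(); opt.step()
    test_x = torch.randn(8, in_dim)
    print(predict_with_anchors(model, test_x, train_x).shape)  # (8, 3)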
Learning Actions from Humans in Video
The prevalent paradigm in computer vision for action understanding is to directly transfer advances in object recognition to actions. In this presentation, I discuss the motivations for an alternative, embodied approach centered on modelling actions rather than objects, survey our recent work along these lines, and outline promising future directions.