Understanding people Can we learn to understand all aspects of people, i.e., appearance, behavior, from images and videos? Multi-modal learning Many interesting data we encounter in real applications are multi-modal: there exists multiple types of data tha reflect the same concept. How can we learn about them? *Figures were generated using Stable Diffusion.