Existing state-of-the-art computer vision models usually specialize in a single domain or task, while human-level recognition can be contextual across diverse scales and tasks. This specialization isolates different vision tasks and hinders the deployment of robust and effective vision systems. In this talk, I will discuss contextual image representations for different scales and tasks through the lens of pixel-level prediction. These connections, built through the study of dilated convolutions and deep layer aggregation, can interpret convolutional network behaviors and lead to model frameworks applicable to a wide range of tasks. I will also argue that image representations should be dynamic and predictive as well as contextual. I will illustrate the case with input-dependent dynamic networks, which lead to new insights into the relationship between zero-shot/few-shot learning and network pruning, and with semantic predictive control, which uses prediction for better driving policy learning. To conclude, I will discuss the ongoing system and algorithm investigations that couple representation learning and real-world interaction to build intelligent agents that can continuously learn from and interact with the world.
Fisher Yu is a postdoctoral researcher at UC Berkeley. He pursued his Ph.D. at Princeton University. He will join the ETH Zurich faculty as Assistant Professor in Computer Vision in 2020. His research interests lie in image representation learning and interactive data processing systems. His work focuses on seeking connections between computer vision problems and building unified image representation frameworks. Through the lens of image representation, he is also studying high-level understanding of dynamic 3D scenes. He serves as a reviewer for major conferences and journals in computer vision, machine learning, and robotics. He has also led the organization of multiple CVPR workshops.