Vision has become the primary sensory input for many systems and robots in the past few years. Efforts in camera development, dataset annotation, and model design have greatly advanced the frontiers of computer vision. However, many other sensory modalities remain under-explored, e.g. sound, thermal imaging, RF signals, and point clouds. In this talk, I will introduce a cross-modal self-supervised learning paradigm, showing how we can use our achievements in computer vision to assist the development of other sensory modalities. Such a learning paradigm can offer solutions for problems that suffer from a scarcity of annotations, and we envision it becoming the primary learning scheme for future robots equipped with an increasing number of sensors.
Hang Zhao is a PhD candidate at MIT CSAIL, supervised by Professor Antonio Torralba. His research focuses on computer vision and cross-modal learning. He received his Bachelor's degree from Zhejiang University in 2013. He is a 2019 Snap Research Fellow, and has interned at NVIDIA in 2015, MERL in 2016, and Facebook in 2017. His recent work on cross-modal self-supervised learning has been widely covered by media outlets such as the BBC, NBC, and MIT News.