
Reconstructing Hand-Object Interactions in 3D at Internet Scale

Speaker: Jane Wu, UC Berkeley
Time: 2024-08-14, 13:30-14:30
Venue: Seminar Room 2, 19th Floor, Tower C, TusPark (https://meeting.tencent.com/dm/UgDRVlhZf2Lq)


Objects manipulated by the hand are particularly challenging to reconstruct from in-the-wild RGB images or videos. Not only does the hand occlude much of the object, but the object is also often visible in only a small number of image pixels. We present a scalable paradigm for handheld object reconstruction that builds on recent breakthroughs in large language/vision models and 3D object datasets. Our model, MCC-Hand-Object (MCC-HO), jointly reconstructs hand and object geometry given a single RGB image and an inferred 3D hand as inputs. Subsequently, we use GPT-4(V) to retrieve a 3D object model that matches the object in the image and rigidly align the model to the network-inferred geometry; we call this alignment Retrieval-Augmented Reconstruction (RAR). We believe the combination of MCC-HO and RAR unlocks the possibility of labeling 3D hand-object interactions at Internet scale.
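The final step of RAR rigidly aligns a retrieved 3D model to network-inferred geometry. The talk abstract does not specify the alignment procedure, so as a rough illustration, the sketch below shows the classic Kabsch algorithm for least-squares rigid alignment of two corresponding 3D point sets; the function name `rigid_align` and the use of point correspondences are assumptions, not details from the talk.

```python
import numpy as np

def rigid_align(source, target):
    """Illustrative sketch (not the speakers' method): least-squares rigid
    (rotation + translation) alignment of corresponding 3D point sets
    via the Kabsch algorithm, as one might use to register a retrieved
    mesh's points to network-inferred geometry.

    source, target: (N, 3) arrays of corresponding points.
    Returns R (3x3 rotation) and t (3,) such that target ~ source @ R.T + t.
    """
    cs, ct = source.mean(axis=0), target.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (source - cs).T @ (target - ct)
    U, _, Vt = np.linalg.svd(H)
    # Correct the sign so the result is a proper rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ct - R @ cs
    return R, t
```

In practice, correspondences between a retrieved model and inferred geometry are not given, so a pipeline like this would typically wrap such a step inside ICP or a learned matching stage; the closed-form Kabsch solve above is just the core rigid-fitting primitive.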

Short Bio:

Jane is a postdoctoral fellow in EECS at UC Berkeley, advised by Jitendra Malik. She received her Ph.D. in Computer Science from Stanford University, advised by Ronald Fedkiw in the Stanford AI Lab. Her research interests are in designing machine learning models that reason about the geometry and dynamics of the physical 3D world, with applications in AR/VR and robotics. In her Ph.D. research, Jane investigated a general paradigm whereby high-frequency information is procedurally embedded into low-frequency data, so that when the network smooths the latter, the former still retains its high-frequency detail. This paradigm was applied to various aspects of the cloth capture pipeline, ranging from simulation to data acquisition to reconstruction. Concurrently, Jane has also spent time at Google and NVIDIA working on diffusion models for novel view synthesis and on autonomous vehicle perception.