Open access
Author
Date
2021
Type
Doctoral Thesis
ETH Bibliography
yes
Abstract
Humans develop a common-sense understanding of the physical behaviour of the
world within the first year of their life. We are able to identify 3D objects
in a scene, infer their geometric and physical properties, predict physical
events in dynamic environments and act based on our interaction with the world.
Our understanding of our surroundings relies heavily on our ability to properly
reason about the arrangement of elements in a scene. Early works in cognitive
science stipulate that the human visual system perceives objects as collections
of semantically coherent parts, and uses these parts to associate unfamiliar
objects with parts whose functionality is already known. Inspired by this,
researchers developed compositional representations that capture the functional
composition and spatial arrangement of objects and object parts in a scene.
In the first two parts of this dissertation, we propose learning-based
solutions for recovering the 3D object geometry using semantically consistent
part arrangements. Finally, we introduce a network architecture that
synthesizes indoor environments as object arrangements, whose
functional composition and spatial configuration follow clear patterns
that are directly inferred from data.
First, we present an unsupervised learning-based approach for recovering shape
abstractions using superquadric surfaces as atomic elements. We demonstrate
that superquadrics lead to more expressive part decompositions while being
easier to learn than cuboidal primitives. Moreover, we provide an analytical
solution to the Chamfer loss, which avoids the need for computationally
expensive reinforcement learning or iterative prediction.
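For illustration, the sketch below samples points on a superquadric from its standard parametric form and measures a Chamfer-style distance to a target point cloud. It is a minimal sketch, not the analytical loss formulation or learning setup from the thesis; all function names are ours.

```python
import numpy as np

def fexp(x, eps):
    """Signed exponentiation used in the superquadric parametrisation."""
    return np.sign(x) * np.abs(x) ** eps

def superquadric_surface(alpha, eps, n_eta=64, n_omega=64):
    """Sample surface points of a superquadric with sizes alpha=(a1,a2,a3)
    and shape exponents eps=(e1,e2)."""
    eta = np.linspace(-np.pi / 2, np.pi / 2, n_eta)
    omega = np.linspace(-np.pi, np.pi, n_omega)
    eta, omega = np.meshgrid(eta, omega, indexing="ij")
    x = alpha[0] * fexp(np.cos(eta), eps[0]) * fexp(np.cos(omega), eps[1])
    y = alpha[1] * fexp(np.cos(eta), eps[0]) * fexp(np.sin(omega), eps[1])
    z = alpha[2] * fexp(np.sin(eta), eps[0])
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def chamfer_distance(pts_a, pts_b):
    """Symmetric Chamfer distance between two point sets of shape (N,3) and (M,3)."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

Varying the two exponents continuously morphs the primitive between box-like, ellipsoidal and cylindrical shapes, which is what makes superquadrics more expressive than cuboids while keeping a compact parameterisation.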
Next, we introduce a novel 3D primitive representation that defines each
primitive via an Invertible Neural Network (INN) implementing a homeomorphic
mapping between a sphere and the target object. Since this representation does
not constrain the shape of the predicted primitives, they can
capture complex geometries using an order of magnitude fewer parts than
existing primitive-based representations. We consider this representation a
first step towards bridging the gap between interpretable and high fidelity
primitive-based reconstructions.
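As a rough illustration of an invertible, homeomorphic point mapping, the sketch below composes affine coupling layers (RealNVP-style) that deform points sampled on a unit sphere and can be inverted exactly. This is a minimal sketch under our own assumptions; the network in the thesis additionally conditions on a learned shape embedding and differs in architecture, and all class and parameter names here are ours.

```python
import torch
import torch.nn as nn

class CouplingLayer(nn.Module):
    """Affine coupling: an invertible map on 3D points that leaves a masked
    subset of coordinates fixed and scales/shifts the remaining ones."""
    def __init__(self, mask, hidden=64):
        super().__init__()
        self.register_buffer("mask", mask)       # (3,) binary mask
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 6))  # scale and shift

    def forward(self, x):
        x_fixed = x * self.mask
        s, t = self.net(x_fixed).chunk(2, dim=-1)
        s = torch.tanh(s) * (1 - self.mask)      # masked dims stay unchanged
        return x_fixed + (1 - self.mask) * (x * torch.exp(s) + t)

    def inverse(self, y):
        y_fixed = y * self.mask                  # identical to x * mask
        s, t = self.net(y_fixed).chunk(2, dim=-1)
        s = torch.tanh(s) * (1 - self.mask)
        return y_fixed + (1 - self.mask) * ((y - t) * torch.exp(-s))

class SphereToShape(nn.Module):
    """Compose coupling layers into an invertible, hence homeomorphic,
    deformation of points sampled on a unit sphere."""
    def __init__(self, n_layers=4):
        super().__init__()
        masks = [torch.tensor([1., 1., 0.]), torch.tensor([0., 1., 1.]),
                 torch.tensor([1., 0., 1.]), torch.tensor([1., 1., 0.])]
        self.layers = nn.ModuleList([CouplingLayer(masks[i % 4])
                                     for i in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

    def inverse(self, y):
        for layer in reversed(self.layers):
            y = layer.inverse(y)
        return y
```

Because every layer is invertible, the composed map never tears or self-intersects the sphere, which is what keeps each predicted primitive a closed, watertight surface.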
Subsequently, we introduce a structure-aware representation that jointly recovers
the geometry of a 3D object as a set of primitives as well as its latent
hierarchical structure without any part-level supervision. Our model recovers
the higher-level structural decomposition of various objects in the form of a
binary tree of primitives, where simple parts are represented with fewer
primitives and more complex parts are modeled with more components. We
demonstrate that considering the latent hierarchical decomposition of an
object into parts facilitates reasoning about the 3D object geometry.
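The sketch below only illustrates the binary-tree-of-primitives data structure, using a hypothetical error-driven splitting rule (poorly fitted nodes are split further, so geometrically complex parts receive more primitives). The thesis learns this hierarchy end-to-end without part supervision and without such hand-crafted rules; all names here are ours.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class PrimitiveNode:
    """A node in a binary tree of primitives: each node holds a crude
    primitive fit to its assigned points; only poorly fitted nodes are
    split, so complex parts end up with deeper subtrees."""
    points: np.ndarray                       # (N, 3) points this node must explain
    center: Optional[np.ndarray] = None      # crude axis-aligned ellipsoid fit
    radii: Optional[np.ndarray] = None
    left: Optional["PrimitiveNode"] = None
    right: Optional["PrimitiveNode"] = None

def fit_primitive(node):
    """Fit a crude ellipsoid and return its mean surface residual."""
    node.center = node.points.mean(axis=0)
    node.radii = node.points.std(axis=0) + 1e-6
    r = np.linalg.norm((node.points - node.center) / node.radii, axis=1)
    return np.abs(r - 1.0).mean()

def build_tree(points, max_depth=4, tol=0.25):
    """Recursively split nodes whose primitive fit is worse than tol."""
    node = PrimitiveNode(points)
    if fit_primitive(node) > tol and max_depth > 0 and len(points) > 16:
        axis = points.std(axis=0).argmax()           # cut along the widest axis
        median = np.median(points[:, axis])
        left_pts = points[points[:, axis] <= median]
        right_pts = points[points[:, axis] > median]
        if len(left_pts) and len(right_pts):
            node.left = build_tree(left_pts, max_depth - 1, tol)
            node.right = build_tree(right_pts, max_depth - 1, tol)
    return node
```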
Finally, we propose a neural network architecture for synthesizing indoor scenes
by plausibly arranging objects within the scene boundaries. In particular,
given a room type (e.g. bedroom, living room) and its shape, our model
generates meaningful object arrangements by sequentially placing objects in a
permutation-invariant fashion. In contrast to prior work, which poses scene
synthesis as a sequence generation problem, our model generates rooms as unordered sets
of objects. This enables various interactive applications, such as room
completion, failure-case correction, and object suggestions under
user-provided constraints.
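The sketch below illustrates the general idea of autoregressive, order-free scene generation: the objects already placed in the room are pooled into a permutation-invariant context that conditions the prediction of the next object. It is a minimal sketch under our own assumptions; the model in the thesis also conditions on the room type and floor-plan shape and uses a different architecture, and all names and dimensions here are ours.

```python
import torch
import torch.nn as nn

class NextObjectModel(nn.Module):
    """Autoregressive scene model: a permutation-invariant encoding of the
    objects already in the room conditions the prediction of the next one."""
    def __init__(self, obj_dim=8, ctx_dim=128):
        super().__init__()
        # obj_dim: per-object attributes, e.g. class logits + position + size + angle
        self.embed = nn.Sequential(nn.Linear(obj_dim, ctx_dim), nn.ReLU(),
                                   nn.Linear(ctx_dim, ctx_dim))
        self.head = nn.Sequential(nn.Linear(ctx_dim, ctx_dim), nn.ReLU(),
                                  nn.Linear(ctx_dim, obj_dim + 1))  # +1 "stop" logit

    def forward(self, placed):
        # placed: (B, K, obj_dim) objects already in the scene, in ANY order
        ctx = self.embed(placed).sum(dim=1)   # sum pooling -> order does not matter
        out = self.head(ctx)
        return out[:, :-1], out[:, -1]        # next-object attributes, stop logit

@torch.no_grad()
def complete_room(model, partial, max_objects=20):
    """Room completion (batch size 1): keep predicting objects until 'stop'."""
    objects = partial                         # (1, K, obj_dim) user-provided objects
    for _ in range(max_objects):
        attrs, stop = model(objects)
        if torch.sigmoid(stop).item() > 0.5:
            break
        objects = torch.cat([objects, attrs.unsqueeze(1)], dim=1)
    return objects
```

Because the context encoding ignores object order, the same trained model can start from any user-provided partial arrangement, which is what makes completion and correction possible without retraining.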
To summarize, we propose novel primitive-based representations that do not
limit the available shape vocabulary to a specific set of shapes such as
cuboids, spheres, and planes. Next, we introduce a structure-aware
representation that considers part relationships and represents object parts with
multiple levels of granularity, where geometrically complex parts are modeled
with more components and simpler parts with fewer components. Finally, we
propose a network architecture that generates indoor scenes by properly
arranging objects within a room's boundaries. Our model enables new interactive
applications for semi-automated scene authoring that were not possible before.
Permanent link
https://doi.org/10.3929/ethz-b-000521013
Publication status
published
External links
Search print copy at ETH Library
Contributors
Examiner: Van Gool, Luc
Examiner: Geiger, Andreas
Examiner: Ferrari, Vittorio
Examiner: Tombari, Federico
Examiner: Savva, Manolis
Publisher
ETH Zurich
Subject
Primitive-based representations; 3D reconstruction; Structure-aware representations; Scene understanding; Scene synthesis; Interpretable representations; Unsupervised learning; Generative modelling
Organisational unit
03514 - Van Gool, Luc (emeritus) / Van Gool, Luc (emeritus)