Over the past decade, several approaches have been proposed to learn disentangled representations for video prediction. However, reported experiments are mostly based on standard benchmark datasets such as Moving MNIST and Bouncing Balls. In this work, we address the problem of learning disentangled representations for video prediction in an industrial environment. To this end, we use a decompositional disentangled variational auto-encoder, a deep generative model that aims to decompose and recognize overlapping boxes on a pallet. Specifically, this approach disentangles each frame into a temporally invariant component (box appearance) and a temporally variant component (box location). We evaluate this approach on a new dataset containing 40,000 video sequences. The experimental results demonstrate that the model learns both to decompose the bounding boxes and to reconstruct them without explicit supervision.
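The abstract does not specify the network architecture. As a rough illustration only, the following PyTorch sketch shows one common way such a static/dynamic split can be realized in a sequence VAE: a sequence-level "content" latent for appearance and a per-frame "pose" latent for location. All names (FrameEncoder, DisentangledVAE), layer sizes, and the 64x64 grayscale input are assumptions, not the paper's method.

```python
# Minimal sketch of a static/dynamic disentangled VAE, assuming 64x64
# grayscale frames. Layer sizes and the content/pose split are illustrative
# assumptions, not the architecture reported in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameEncoder(nn.Module):
    """Maps a 64x64 frame to a flat feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 8x8
        )
        self.fc = nn.Linear(128 * 8 * 8, feat_dim)

    def forward(self, x):
        return self.fc(self.conv(x).flatten(1))

class DisentangledVAE(nn.Module):
    def __init__(self, content_dim=32, pose_dim=8, feat_dim=256):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim)
        # Content head: features pooled over time -> one appearance latent
        # per sequence (temporally invariant).
        self.content_head = nn.Linear(feat_dim, 2 * content_dim)
        # Pose head: one location latent per frame (temporally variant).
        self.pose_head = nn.Linear(feat_dim, 2 * pose_dim)
        self.decoder_fc = nn.Linear(content_dim + pose_dim, 128 * 8 * 8)
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    @staticmethod
    def reparameterize(stats):
        """Standard VAE reparameterization; returns a sample and its KL term."""
        mu, logvar = stats.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1)
        return z, kl

    def forward(self, frames):  # frames: (B, T, 1, 64, 64) in [0, 1]
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        # Appearance: average frame features so the latent is time-invariant.
        z_c, kl_c = self.reparameterize(self.content_head(feats.mean(1)))
        # Location: a separate latent for every frame.
        z_p, kl_p = self.reparameterize(self.pose_head(feats))
        z = torch.cat([z_c.unsqueeze(1).expand(-1, T, -1), z_p], dim=-1)
        h = self.decoder_fc(z.flatten(0, 1)).view(-1, 128, 8, 8)
        recon = self.decoder(h).view(B, T, 1, 64, 64)
        recon_loss = F.binary_cross_entropy(recon, frames, reduction="none")
        loss = recon_loss.flatten(1).sum(-1) + kl_c + kl_p.sum(-1)
        return recon, loss.mean()
```

Because the appearance latent is shared across all frames of a sequence while the pose latent is sampled per frame, reconstruction pressure alone pushes location information into the per-frame latent and appearance into the shared one, which is the usual rationale for this kind of split.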