Video Predictive Object Detector

Abstract

With the rise of video datasets and self-driving cars, many industries seek fast object detection on video together with predictive tracking of the detected objects. We propose a predictive video object detector (POD net) that integrates the YOLOv2 framework with the convolutional 2-dimensional (2D) Long Short-Term Memory (LSTM) model proposed by Shi et al. POD net performs object detection using YOLOv2 and object prediction using the LSTM model in an iterative manner, with a view to improving object detection in video streams via object prediction. In this study we present two approaches that we implemented to predict objects in subsequent video clips. The first approach, PODv1, applies a post-temporal pattern matching mechanism: the YOLOv2 detector detects objects in multiple images, and the LSTM layer then performs temporal feature mapping across the output tensors of the detectors. The second approach, PODv2, provides better results by applying the temporal feature mapping across the images first and then feeding the output into the YOLOv2 detector, which is wrapped in a Time Distributed layer. We tested POD net on the MOT 2017 dataset, and the network was able to perform predictive object detection and tracking, demonstrating that the LSTM layer is useful for a variety of video analysis problems.
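
The difference between PODv1 and PODv2 is the order in which the detector and the convolutional LSTM are applied. The sketch below illustrates only that ordering at the level of tensor shapes; it uses stand-in functions rather than real networks and is not the authors' implementation. The stride of 32 and the five anchors follow the standard YOLOv2 design. The single object class, the 416x416 input, the clip length of 4, the LSTM filter counts, and the choices of 'same' padding and one LSTM output per time step are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple

Shape = Tuple[int, ...]

STRIDE = 32       # YOLOv2 downsamples its input by a factor of 32
NUM_ANCHORS = 5   # standard YOLOv2 anchor count
NUM_CLASSES = 1   # assumption: a single hypothetical object class


def detector(frame: Shape) -> Shape:
    """Shape of a YOLOv2-style output grid for one (H, W, C) frame."""
    h, w, _ = frame
    return (h // STRIDE, w // STRIDE, NUM_ANCHORS * (5 + NUM_CLASSES))


def conv_lstm(seq: Shape, filters: int) -> Shape:
    """Shape after a 2D convolutional LSTM over (T, H, W, C).

    Assumes 'same' padding and one output per time step.
    """
    t, h, w, _ = seq
    return (t, h, w, filters)


def time_distributed(fn: Callable[[Shape], Shape], seq: Shape) -> Shape:
    """Apply a per-frame shape function to every time step of (T, ...)."""
    return (seq[0],) + fn(seq[1:])


def pod_v1(seq: Shape) -> List[Shape]:
    """Detect per frame first, then map temporally across detector outputs."""
    detections = time_distributed(detector, seq)
    # Assumption: the LSTM keeps the detector's channel count.
    fused = conv_lstm(detections, filters=detections[-1])
    return [seq, detections, fused]


def pod_v2(seq: Shape, lstm_filters: int = 3) -> List[Shape]:
    """Map temporally across raw frames first, then detect per time step."""
    # Assumption: lstm_filters=3 keeps an image-like tensor for the detector.
    features = conv_lstm(seq, filters=lstm_filters)
    detections = time_distributed(detector, features)
    return [seq, features, detections]


if __name__ == "__main__":
    clip = (4, 416, 416, 3)  # (T, H, W, C); sizes are illustrative
    print("PODv1 stages:", pod_v1(clip))
    print("PODv2 stages:", pod_v2(clip))
```

Under these assumptions both variants end with one detection grid per time step. The intermediate tensor is what differs: in PODv1 the LSTM operates on the small detector grids, while in PODv2 it operates on full-resolution frames before detection.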