Subscribe to the PwC Newsletter

Join the community, add a new evaluation result row, object detection.

2709 papers with code • 70 benchmarks • 233 datasets

Object detection is the task of detecting instances of objects of a certain class within an image. The state-of-the-art methods can be categorized into two main types: one-stage methods and two stage-methods. One-stage methods prioritize inference speed, and example models include YOLO, SSD and RetinaNet. Two-stage methods prioritize detection accuracy, and example models include Faster R-CNN, Mask R-CNN and Cascade R-CNN.

The most popular benchmark is the MSCOCO dataset. Models are typically evaluated according to a Mean Average Precision metric.

( Image credit: Detectron )

research papers on object detection

Benchmarks Add a Result

research papers on object detection

Most implemented papers

Deep residual learning for image recognition.

Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

YOLOv3: An Incremental Improvement

research papers on object detection

At 320x320 YOLOv3 runs in 22 ms at 28. 2 mAP, as accurate as SSD but three times faster.

YOLO9000: Better, Faster, Stronger

On the 156 classes not in COCO, YOLO9000 gets 16. 0 mAP.

YOLOv4: Optimal Speed and Accuracy of Object Detection

There are a huge number of features which are said to improve Convolutional Neural Network (CNN) accuracy.

SSD: Single Shot MultiBox Detector

weiliu89/caffe • 8 Dec 2015

Experimental results on the PASCAL VOC, MS COCO, and ILSVRC datasets confirm that SSD has comparable accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference.

Focal Loss for Dense Object Detection

Our novel Focal Loss focuses training on a sparse set of hard examples and prevents the vast number of easy negatives from overwhelming the detector during training.

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

research papers on object detection

In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals.

Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.

MMDetection: Open MMLab Detection Toolbox and Benchmark

In this paper, we introduce the various features of this toolbox.

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

We present a class of efficient models called MobileNets for mobile and embedded vision applications.

Captcha Page

We apologize for the inconvenience...

To ensure we keep this website safe, please can you confirm you are a human by ticking the box below.

If you are unable to complete the above request please contact us using the below link, providing a screenshot of your experience.

Please solve this CAPTCHA to request unblock to the website

research papers on object detection


Priyesh Sinha

Apr 23, 2021

5 AI/ML Research Papers on Object Detection You Must Read

Great papers…, mobilenets: efficient convolutional neural networks for mobile vision applications.

By Andrew G. Howard • Menglong Zhu • Bo Chen • Dmitry Kalenichenko • Weijun Wang • Tobias Weyand • Marco Andreetto • Hartwig Adam

We present a class of efficient models called MobileNets for mobile and embedded vision applications. MobileNets are based on a streamlined architecture that uses depth-wise separable convolutions to build light weight deep neural networks. We introduce two simple global hyper-parameters that efficiently trade off between latency and accuracy. These hyper-parameters allow the model builder to choose the right sized model for their application based on the constraints of the problem. We present extensive experiments on resource and accuracy tradeoffs and show strong performance compared to other popular models on ImageNet classification. We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face attributes and large scale geo-localization.

Paper can be found here :

Code can be found here :


Imagededup is a python package that simplifies the task of finding exact and near duplicates in an image collection…, context r-cnn: long term temporal context for per-camera object detection.

By Sara Beery • Guanhang Wu • Vivek Rathod • Ronny Votel • Jonathan Huang

In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3d convolution based baseline) by 11.2% mAP.


Creating accurate machine learning models capable of localizing and identifying multiple objects in a single image…, efficientdet: scalable and efficient object detection.

By Mingxing Tan • Ruoming Pang • Quoc V. Le

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multiscale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations and better backbones, we have developed a new family of object detectors, called EfficientDet, which consistently achieve much better efficiency than prior art across a wide spectrum of resource constraints. In particular, with single model and single-scale, our EfficientDet-D7 achieves state-of-the-art 55.1 AP on COCO test-dev with 77M parameters and 410B FLOPs, being 4x — 9x smaller and using 13x — 42x fewer FLOPs than previous detectors.


The pytorch re-implement of the official efficientdet with sota performance in real time, original paper link…, searching for mobilenetv3.

By Andrew Howard • Mark Sandler • Grace Chu • Liang-Chieh Chen • Bo Chen • Mingxing Tan • Weijun Wang • Yukun Zhu • Ruoming Pang • Vijay Vasudevan • Quoc V. Le • Hartwig Adam

We present the next generation of MobileNets based on a combination of complementary search techniques as well as a novel architecture design. MobileNetV3 is tuned to mobile phone CPUs through a combination of hardware-aware network architecture search (NAS) complemented by the NetAdapt algorithm and then subsequently improved through novel architecture advances. This paper starts the exploration of how automated search algorithms and network design can work together to harness complementary approaches improving the overall state of the art. Through this process we create two new MobileNet models for release: MobileNetV3-Large and MobileNetV3-Small which are targeted for high and low resource use cases. These models are then adapted and applied to the tasks of object detection and semantic segmentation. For the task of semantic segmentation (or any dense pixel prediction), we propose a new efficient segmentation decoder Lite Reduced Atrous Spatial Pyramid Pooling (LR-ASPP). We achieve new state of the art results for mobile classification, detection and segmentation. MobileNetV3-Large is 3.2\% more accurate on ImageNet classification while reducing latency by 15\% compared to MobileNetV2. MobileNetV3-Small is 4.6\% more accurate while reducing latency by 5\% compared to MobileNetV2. MobileNetV3-Large detection is 25\% faster at roughly the same accuracy as MobileNetV2 on COCO detection. MobileNetV3-Large LR-ASPP is 30\% faster than MobileNetV2 R-ASPP at similar accuracy for Cityscapes segmentation.


A big thank you to my github sponsors for their support in addition to the sponsors at the link above, i've received…, objects as points.

By Xingyi Zhou • Dequan Wang • Philipp Krähenbühl

Detection identifies objects as axis-aligned boxes in an image. Most successful object detectors enumerate a nearly exhaustive list of potential object locations and classify each. This is wasteful, inefficient, and requires additional post-processing. In this paper, we take a different approach. We model an object as a single point — — the center point of its bounding box. Our detector uses keypoint estimation to find center points and regresses to all other object properties, such as size, 3D location, orientation, and even pose. Our center point based approach, CenterNet, is end-to-end differentiable, simpler, faster, and more accurate than corresponding bounding box based detectors. CenterNet achieves the best speed-accuracy trade-off on the MS COCO dataset, with 28.1% AP at 142 FPS, 37.4% AP at 52 FPS, and 45.1% AP with multi-scale testing at 1.4 FPS. We use the same approach to estimate 3D bounding box in the KITTI benchmark and human pose on the COCO keypoint dataset. Our method performs competitively with sophisticated multi-stage methods and runs in real-time.


Object detection, 3d detection, and pose estimation using center point detection: objects as points , xingyi zhou…, mobilenetv2: inverted residuals and linear bottlenecks.

By Mark Sandler • Andrew Howard • Menglong Zhu • Andrey Zhmoginov • Liang-Chieh Chen

In this paper we describe a new mobile architecture, MobileNetV2, that improves the state of the art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes. We also describe efficient ways of applying these mobile models to object detection in a novel framework we call SSDLite. Additionally, we demonstrate how to build mobile semantic segmentation models through a reduced form of DeepLabv3 which we call Mobile DeepLabv3. The MobileNetV2 architecture is based on an inverted residual structure where the input and output of the residual block are thin bottleneck layers opposite to traditional residual models which use expanded representations in the input an MobileNetV2 uses lightweight depthwise convolutions to filter features in the intermediate expansion layer. Additionally, we find that it is important to remove non-linearities in the narrow layers in order to maintain representational power. We demonstrate that this improves performance and provide an intuition that led to this design. Finally, our approach allows decoupling of the input/output domains from the expressiveness of the transformation, which provides a convenient framework for further analysis. We measure our performance on Imagenet classification, COCO object detection, VOC image segmentation. We evaluate the trade-offs between accuracy, and number of operations measured by multiply-adds (MAdd), as well as the number of parameters


The torchvision package consists of popular datasets, model architectures, and common image transformations for…, speed/accuracy trade-offs for modern convolutional object detectors.

By Jonathan Huang • Vivek Rathod • Chen Sun • Menglong Zhu • Anoop Korattikara • Alireza Fathi • Ian Fischer • Zbigniew Wojna • Yang song • Sergio Guadarrama • Kevin Murphy

The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016] and SSD [Liu et al., 2015] systems, which we view as “meta-architectures” and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.


This repository contains code to instantiate and deploy an object detection model. this model recognizes the objects….

References and credits —

arXiv is a free distribution service and an open-access archive for 1,721,837 scholarly articles in the fields of…

Recommended articles —, 5 ai/ml research papers on image generation you must read.

The Top 5 Deep Learning Libraries And Frameworks

Deep learning.

Libraries And Frameworks Deep

51+ Data Sets for Beginner Data Science and Machine Learning Projects

5 Trending AI/ML Research Papers

Amazing papers…, best ai/ml research papers for nlp, more from datadriveninvestor.

empowerment through data, knowledge, and expertise. subscribe to DDIntel at

About Help Terms Privacy

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store

Priyesh Sinha

IITian| Mech Engineer| Pursuing Data Science Courses

Text to speech

People also looked at

Perspective article, object detection: current and future directions.


Object detection is a key ability required by most computer and robot vision systems. The latest research on this area has been making great progress in many directions. In the current manuscript, we give an overview of past research on object detection, outline the current main research directions, and discuss open problems and possible future directions.

1. Introduction

During the last years, there has been a rapid and successful expansion on computer vision research. Parts of this success have come from adopting and adapting machine learning methods, while others from the development of new representations and models for specific computer vision problems or from the development of efficient solutions. One area that has attained great progress is object detection. The present works gives a perspective on object detection research .

Given a set of object classes, object detection consists in determining the location and scale of all object instances, if any, that are present in an image . Thus, the objective of an object detector is to find all object instances of one or more given object classes regardless of scale, location, pose, view with respect to the camera, partial occlusions, and illumination conditions.

In many computer vision systems, object detection is the first task being performed as it allows to obtain further information regarding the detected object and about the scene. Once an object instance has been detected (e.g., a face), it is be possible to obtain further information, including: (i) to recognize the specific instance (e.g., to identify the subject’s face), (ii) to track the object over an image sequence (e.g., to track the face in a video), and (iii) to extract further information about the object (e.g., to determine the subject’s gender), while it is also possible to (a) infer the presence or location of other objects in the scene (e.g., a hand may be near a face and at a similar scale) and (b) to better estimate further information about the scene (e.g., the type of scene, indoor versus outdoor, etc.), among other contextual information.

Object detection has been used in many applications, with the most popular ones being: (i) human-computer interaction (HCI), (ii) robotics (e.g., service robots), (iii) consumer electronics (e.g., smart-phones), (iv) security (e.g., recognition, tracking), (v) retrieval (e.g., search engines, photo management), and (vi) transportation (e.g., autonomous and assisted driving). Each of these applications has different requirements, including: processing time (off-line, on-line, or real-time), robustness to occlusions, invariance to rotations (e.g., in-plane rotations), and detection under pose changes. While many applications consider the detection of a single object class (e.g., faces) and from a single view (e.g., frontal faces), others require the detection of multiple object classes (humans, vehicles, etc.), or of a single class from multiple views (e.g., side and frontal view of vehicles). In general, most systems can detect only a single object class from a restricted set of views and poses.

Several surveys on detection and recognition have been published during the last years [see Hjelmås and Low (2001) , Yang et al. (2002) , Sun et al. (2006) , Li and Allinson (2008) , Enzweiler and Gavrila (2009) , Dollar et al. (2012) , Andreopoulos and Tsotsos (2013) , Li et al. (2015) , and Zafeiriou et al. (2015) ], and there are four main problems related to object detection. The first one is object localization , which consists of determining the location and scale of a single object instance known to be present in the image; the second one is object presence classification , which corresponds to determining whether at least one object of a given class is present in an image (without giving any information about the location, scale, or the number of objects), while the third problem is object recognition , which consist in determining if a specific object instance is present in the image. The fourth related problem is view and pose estimation , which consist of determining the view of the object and the pose of the object.

The problem of object presence classification can be solved using object detection techniques, but in general, other methods are used, as determining the location and scale of the objects is not required, and determining only the presence can be done more efficiently. In some cases, object recognition can be solved using methods that do not require detecting the object in advance [e.g., using methods based on Local Interest Points such as Tuytelaars and Mikolajczyk (2008) and Ramanan and Niranjan (2012) ]. Nevertheless, solving the object detection problem would solve (or help simplifying) these related problems. An additional, recently addressed problem corresponds to determining the “objectness” of an image patch, i.e., measuring the likeliness for an image window to contain an object of any class [e.g., Alexe et al. (2010) , Endres and Hoiem (2010) , and Huval et al. (2013) ].

In the following, we give a summary of past research on object detection, present an overview of current research directions, and discuss open problems and possible future directions, all this with a focus on the classifiers and architectures of the detector, rather than on the used features.

2. A Brief Review of Object Detection Research

Early works on object detection were based on template matching techniques and simple part-based models [e.g., Fischler and Elschlager (1973) ]. Later, methods based on statistical classifiers (e.g., Neural Networks, SVM, Adaboost, Bayes, etc.) were introduced [e.g., Osuna et al. (1997) , Rowley et al. (1998) , Sung and Poggio (1998) , Schneiderman and Kanade (2000) , Yang et al. (2000a , b ), Fleuret and Geman (2001) , Romdhani et al. (2001) , and Viola and Jones (2001) ]. This initial successful family of object detectors, all of them based on statistical classifiers, set the ground for most of the following research in terms of training and evaluation procedures and classification techniques.

Because face detection is a critical ability for any system that interacts with humans, it is the most common application of object detection. However, many additional detection problems have been studied [e.g., Papageorgiou and Poggio (2000) , Agarwal et al. (2004) , Alexe et al. (2010) , Everingham et al. (2010) , and Andreopoulos and Tsotsos (2013) ]. Most cases correspond to objects that people often interact with, such as other humans [e.g., pedestrians ( Papageorgiou and Poggio, 2000 ; Viola and Jones, 2002 ; Dalal and Triggs, 2005 ; Bourdev et al., 2010 ; Paisitkriangkrai et al., 2015 )] and body parts [( Kölsch and Turk, 2004 ; Ong and Bowden, 2004 ; Wu and Nevatia, 2005 ; Verschae et al., 2008 ; Bourdev and Malik, 2009 ) e.g., faces, hands, and eyes], as well as vehicles [( Papageorgiou and Poggio, 2000 ; Felzenszwalb et al., 2010b ), e.g., cars and airplanes], and animals [e.g., Fleuret and Geman (2008) ].

Most object detection systems consider the same basic scheme, commonly known as sliding window : in order to detect the objects appearing in the image at different scales and locations, an exhaustive search is applied. This search makes use of a classifier, the core part of the detector, which indicates if a given image patch, corresponds to the object or not. Given that the classifier basically works at a given scale and patch size, several versions of the input image are generated at different scales, and the classifier is used to classify all possible patches of the given size, for each of the downscaled versions of the image.

Basically, three alternatives exist to the sliding window scheme. The first one is based on the use of bag-of-words ( Weinland et al., 2011 ; Tsai, 2012 ), method sometimes used for verifying the presence of the object, and that in some cases can be efficiently applied by iteratively refining the image region that contains the object [e.g., Lampert et al. (2009) ]. The second one samples patches and iteratively searches for regions of the image where it is likely that the object is present [e.g., Prati et al. (2012) ]. These two schemes reduce the number of image patches where to perform the classification, seeking to avoid an exhaustive search over all image patches. The third scheme finds key-points and then matches them to perform the detection [e.g., Azzopardi and Petkov (2013) ]. These schemes cannot always guarantee that all object’s instances will be detected.

3. Object Detection Approaches

Object detection methods can be grouped in five categories, each with merits and demerits: while some are more robust, others can be used in real-time systems, and others can be handle more classes, etc. Table 1 gives a qualitative comparison.

Table 1. Qualitative comparison of object detection approaches .

3.1. Coarse-to-Fine and Boosted Classifiers

The most popular work in this category is the boosted cascade classifier of Viola and Jones (2004) . It works by efficiently rejecting, in a cascade of test/filters, image patches that do not correspond to the object. Cascade methods are commonly used with boosted classifiers due to two main reasons: (i) boosting generates an additive classifier, thus it is easy to control the complexity of each stage of the cascade and (ii) during training, boosting can be also used for feature selection, allowing the use of large (parametrized) families of features. A coarse-to-fine cascade classifier is usually the first kind of classifier to consider when efficiency is a key requirement. Recent methods based on boosted classifiers include Li and Zhang (2004) , Gangaputra and Geman (2006) , Huang et al. (2007) , Wu and Nevatia (2007) , Verschae et al. (2008) , and Verschae and Ruiz-del-Solar (2012) .

3.2. Dictionary Based

The best example in this category is the Bag of Word method [e.g., Serre et al. (2005) and Mutch and Lowe (2008) ]. This approach is basically designed to detect a single object per image, but after removing a detected object, the remaining objects can be detected [e.g., Lampert et al. (2009) ]. Two problems with this approach are that it cannot robustly handle well the case of two instances of the object appearing near each other, and that the localization of the object may not be accurate.

3.3. Deformable Part-Based Model

This approach considers object and part models and their relative positions. In general, it is more robust that other approaches, but it is rather time consuming and cannot detect objects appearing at small scales. It can be traced back to the deformable models ( Fischler and Elschlager, 1973 ), but successful methods are recent ( Felzenszwalb et al., 2010b ). Relevant works include Felzenszwalb et al. (2010a) and Yan et al. (2014) , where efficient evaluation of deformable part-based model is implemented using a coarse-to-fine cascade model for faster evaluation, Divvala et al. (2012) , where the relevance of the part-models is analyzed, among others [e.g., Azizpour and Laptev (2012) , Zhu and Ramanan (2012) , and Girshick et al. (2014) ].

3.4. Deep Learning

One of the first successful methods in this family is based on convolutional neural networks ( Delakis and Garcia, 2004 ). The key difference between this and the above approaches is that in this approach the feature representation is learned instead of being designed by the user, but with the drawback that a large number of training samples is required for training the classifier. Recent methods include Dean et al. (2013) , Huval et al. (2013) , Ouyang and Wang (2013) , Sermanet et al. (2013) , Szegedy et al. (2013) , Zeng et al. (2013) , Erhan et al. (2014) , Zhou et al. (2014) , and Ouyang et al. (2015) .

3.5. Trainable Image Processing Architectures

In such architectures, the parameters of predefined operators and the combination of the operators are learned, sometimes considering an abstract notion of fitness. These are general-purpose architectures, and thus they can be used to build several modules of a larger system (e.g., object recognition, key point detectors and object detection modules of a robot vision system). Examples include trainable COSFIRE filters ( Azzopardi and Petkov, 2013 , 2014 ), and Cartesian Genetic Programming (CGP) ( Harding et al., 2013 ; Leitner et al., 2013 ).

4. Current Research Problems

Table 2 presents a summary of solved, current, and open problems. In the present section we discuss current research directions.

Table 2. Summary of current directions and open problems .

4.1. Multi-Class

Many applications require detecting more than one object class. If a large number of classes is being detected, the processing speed becomes an important issue, as well as the kind of classes that the system can handle without accuracy loss. Works that have addressed the multi-class detection problem include Torralba et al. (2007) , Razavi et al. (2011) , Benbouzid et al. (2012) , Song et al. (2012) , Verschae and Ruiz-del-Solar (2012) , and Erhan et al. (2014) . Efficiency has been addressed, e.g., by using the same representation for several object classes, as well as by developing multi-class classifiers designed specifically to detect multiple classes. Dean et al. (2013) presents one of the few existing works for very large-scale multi-class object detection, where 100,000 object classes were considered.

4.2. Multi-View, Multi-Pose, Multi-Resolution

Most methods used in practice have been designed to detect a single object class under a single view, thus these methods cannot handle multiple views, or large pose variations; with the exception of deformable part-based models which can deal with some pose variations. Some works have tried to detect objects by learning subclasses ( Wu and Nevatia, 2007 ) or by considering views/poses as different classes ( Verschae and Ruiz-del-Solar, 2012 ); in both cases improving the efficiency and robustness. Also, multi-pose models [e.g., Erol et al. (2007) ] and multi-resolution models [e.g., Park et al. (2010) ] have been developed.

4.3. Efficiency and Computational Power

Efficiency is an issue to be taken into account in any object detection system. As mentioned, a coarse-to-fine classifier is usually the first kind of classifier to consider when efficiency is a key requirement [e.g., Viola et al. (2005) ], while reducing the number of image patches where to perform the classification [e.g., Lampert et al. (2009) ] and efficiently detecting multiple classes [e.g., Verschae and Ruiz-del-Solar (2012) ] have also been used. Efficiency does not imply real-time performance, and works such as Felzenszwalb et al. (2010b) are robust and efficient, but not fast enough for real-time problems. However, using specialized hardware (e.g., GPU) some methods can run in real-time (e.g., deep learning).

4.4. Occlusions, Deformable Objects, and Interlaced Object and Background

Dealing with partial occlusions is also an important problem, and no compelling solution exits, although relevant research has been done [e.g., Wu and Nevatia (2005) ]. Similarly, detecting objects that are not “closed,” i.e., where objects and background pixels are interlaced with background is still a difficult problem. Two examples are hand detection [e.g., Kölsch and Turk (2004) ] and pedestrian detection [see Dollar et al. (2012) ]. Deformable part-based model [e.g., Felzenszwalb et al. (2010b) ] have been to some extend successful under this kind of problem, but further improvement is still required.

4.5. Contextual Information and Temporal Features

Integrating contextual information (e.g., about the type of scene, or the presence of other objects) can increase speed and robustness, but “when and how” to do this (before, during or after the detection), it is still an open problem. Some proposed solutions include the use of (i) spatio-temporal context [e.g., Palma-Amestoy et al. (2010) ], (ii) spatial structure among visual words [e.g., Wu et al. (2009) ], and (iii) semantic information aiming to map semantically related features to visual words [e.g., Wu et al. (2010) ], among many others [e.g., Torralba and Sinha (2001) , Divvala et al. (2009) , Sun et al. (2012) , Mottaghi et al. (2014) , and Cadena et al. (2015) ]. While most methods consider the detection of objects in a single frame, temporal features can be beneficial [e.g., Viola et al. (2005) and Dalal et al. (2006) ].

5. Open Problems and Future Directions

In the following, we outline problems that we believe have not been addressed, or addressed only partially, and may be interesting relevant research directions.

5.1. Open-World Learning and Active Vision

An important problem is to incrementally learn, to detect new classes, or to incrementally learn to distinguish among subclasses after the “main” class has been learned. If this can be done in an unsupervised way, we will be able to build new classifiers based on existing ones, without much additional effort, greatly reducing the effort required to learn new object classes. Note that humans are continuously inventing new objects, fashion changes, etc., and therefore detection systems will need to be continuously updated, adding new classes, or updating existing ones. Some recent works have addressed these issues, mostly based on deep learning and transfer learning methods [e.g., Bengio (2012) , Mesnil et al. (2012) , and Kotzias et al. (2014) ]. This open-world learning is of particular importance in robot applications, case where active vision mechanisms can aid in the detection and learning [e.g., Paletta and Pinz (2000) and Correa et al. (2012) ].

5.2. Object-Part Relation

During the detection process, should we detect the object first or the parts first? This is a basic dilemma, and no clear solution exists. Probably, the search for the object and for the parts must be done concurrently where both processes give feedback to each other. How to do this is still an open problem and is likely related to how to use of context information. Moreover, in cases the object part can be also decomposed in subparts, an interaction among several hierarchies emerge, and in general it is not clear what should be done first.

5.3. Multi-Modal Detection

The use of new sensing modalities, in particular depth and thermal cameras, has seen some development in the last years [e.g., Fehr and Burkhardt (2008) and Correa et al. (2012) ]. However, the methods used for processing visual images are also used for thermal images, and to a lesser degree for depth images. While using thermal images makes easier to discriminate the foreground from the background, it can only be applied to objects that irradiate infrared light (e.g., mammals, heating, etc.). Using depth images is easy to segment the objects, but general methods for detecting specific classes has not been proposed, and probably higher resolution depth images are required. It seems that depth and thermal cameras alone are not enough for object detection, at least with their current resolution, but further advances can be expected as the sensing technology improves.

5.4. Pixel-Level Detection (Segmentation) and Background Objects

In many applications, we may be interested in detecting objects that are usually considered as background. The detection of such “background objects,” such as rivers, walls, mountains, has not been addressed by most of the here mentioned approaches. In general, this kind of problem has been addressed by first segmenting the image and later labeling each segment of the image [e.g., Peng et al. (2013) ]. Of course, for successfully detecting all objects in a scene, and to completely understand the scene, we will need to have a pixel level detection of the objects, and further more, a 3D model of such scene. Therefore, at some point object detection and image segmentation methods may need to be integrated. We are still far from attaining such automatic understanding of the world, and to achieve this, active vision mechanisms might be required [e.g., Aloimonos et al. (1988) and Cadena et al. (2015) ].

6. Conclusion

Object detection is a key ability for most computer and robot vision system. Although great progress has been observed in the last years, and some existing techniques are now part of many consumer electronics (e.g., face detection for auto-focus in smartphones) or have been integrated in assistant driving technologies, we are still far from achieving human-level performance, in particular in terms of open-world learning. It should be noted that object detection has not been used much in many areas where it could be of great help. As mobile robots, and in general autonomous machines, are starting to be more widely deployed (e.g., quad-copters, drones and soon service robots), the need of object detection systems is gaining more importance. Finally, we need to consider that we will need object detection systems for nano-robots or for robots that will explore areas that have not been seen by humans, such as depth parts of the sea or other planets, and the detection systems will have to learn to new object classes as they are encountered. In such cases, a real-time open-world learning ability will be critical.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


This research was partially funded by the FONDECYT Projects 3120218 and 1130153 (CONICYT, Chile).

Agarwal, S., Awan, A., and Roth, D. (2004). Learning to detect objects in images via a sparse, part-based representation. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1475–1490. doi: 10.1109/TPAMI.2004.108

PubMed Abstract | CrossRef Full Text | Google Scholar

Alexe, B., Deselaers, T., and Ferrari, V. (2010). “What is an object?,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (San Francisco, CA: IEEE), 73–80. doi:10.1109/CVPR.2010.5540226

CrossRef Full Text | Google Scholar

Aloimonos, J., Weiss, I., and Bandyopadhyay, A. (1988). Active vision. Int. J. Comput. Vis. 1, 333–356. doi:10.1007/BF00133571

Andreopoulos, A., and Tsotsos, J. K. (2013). 50 years of object recognition: directions forward. Comput. Vis. Image Underst. 117, 827–891. doi:10.1016/j.cviu.2013.04.005

Azizpour, H., and Laptev, I. (2012). “Object detection using strongly-supervised deformable part models,” in Computer Vision-ECCV 2012 (Florence: Springer), 836–849.

Google Scholar

Azzopardi, G., and Petkov, N. (2013). Trainable cosfire filters for keypoint detection and pattern recognition. IEEE Trans. Pattern Anal. Mach. Intell. 35, 490–503. doi:10.1109/TPAMI.2012.106

Azzopardi, G., and Petkov, N. (2014). Ventral-stream-like shape representation: from pixel intensity values to trainable object-selective cosfire models. Front. Comput. Neurosci. 8:80. doi:10.3389/fncom.2014.00080

Benbouzid, D., Busa-Fekete, R., and Kegl, B. (2012). “Fast classification using sparse decision dags,” in Proceedings of the 29th International Conference on Machine Learning (ICML-12), ICML ‘12 , eds J. Langford and J. Pineau (New York, NY: Omnipress), 951–958.

Bengio, Y. (2012). “Deep learning of representations for unsupervised and transfer learning,” in ICML Unsupervised and Transfer Learning, Volume 27 of JMLR Proceedings , eds I. Guyon, G. Dror, V. Lemaire, G. W. Taylor, and D. L. Silver (Bellevue: JMLR.Org), 17–36.

Bourdev, L. D., Maji, S., Brox, T., and Malik, J. (2010). “Detecting people using mutually consistent poselet activations,” in Computer Vision – ECCV 2010 – 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part VI, Volume 6316 of Lecture Notes in Computer Science , eds K. Daniilidis, P. Maragos, and N. Paragios (Heraklion: Springer), 168–181.

Bourdev, L. D., and Malik, J. (2009). “Poselets: body part detectors trained using 3d human pose annotations,” in IEEE 12th International Conference on Computer Vision, ICCV 2009, Kyoto, Japan, September 27 – October 4, 2009 (Kyoto: IEEE), 1365–1372.

Cadena, C., Dick, A., and Reid, I. (2015). “A fast, modular scene understanding system using context-aware object detection,” in Robotics and Automation (ICRA), 2015 IEEE International Conference on (Seattle, WA).

Correa, M., Hermosilla, G., Verschae, R., and Ruiz-del-Solar, J. (2012). Human detection and identification by robots using thermal and visual information in domestic environments. J. Intell. Robot Syst. 66, 223–243. doi:10.1007/s10846-011-9612-2

Dalal, N., and Triggs, B. (2005). “Histograms of oriented gradients for human detection,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on , Vol. 1 (San Diego, CA: IEEE), 886–893. doi:10.1109/CVPR.2005.177

Dalal, N., Triggs, B., and Schmid, C. (2006). “Human detection using oriented histograms of flow and appearance,” in Computer Vision ECCV 2006, Volume 3952 of Lecture Notes in Computer Science , eds A. Leonardis, H. Bischof, and A. Pinz (Berlin: Springer), 428–441.

Dean, T., Ruzon, M., Segal, M., Shlens, J., Vijayanarasimhan, S., Yagnik, J., et al. (2013). “Fast, accurate detection of 100,000 object classes on a single machine,” in Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on (Washington, DC: IEEE), 1814–1821.

Delakis, M., and Garcia, C. (2004). Convolutional face finder: a neural architecture for fast and robust face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1408–1423. doi:10.1109/TPAMI.2004.97

Divvala, S., Hoiem, D., Hays, J., Efros, A., and Hebert, M. (2009). “An empirical study of context in object detection,” in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (Miami, FL: IEEE), 1271–1278. doi:10.1109/CVPR.2009.5206532

Divvala, S. K., Efros, A. A., and Hebert, M. (2012). “How important are deformable parts in the deformable parts model?,” in Computer Vision-ECCV 2012. Workshops and Demonstrations (Florence: Springer), 31–40.

Dollar, P., Wojek, C., Schiele, B., and Perona, P. (2012). Pedestrian detection: an evaluation of the state of the art. IEEE Trans. Pattern Anal. Mach. Intell. 34, 743–761. doi:10.1109/TPAMI.2011.155

Endres, I., and Hoiem, D. (2010). “Category independent object proposals,” in Proceedings of the 11th European Conference on Computer Vision: Part V, ECCV’10 (Berlin: Springer-Verlag), 575–588.

Enzweiler, M., and Gavrila, D. (2009). Monocular pedestrian detection: survey and experiments. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2179–2195. doi:10.1109/TPAMI.2008.260

Erhan, D., Szegedy, C., Toshev, A., and Anguelov, D. (2014). “Scalable object detection using deep neural networks,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 2155–2162. doi:10.1109/CVPR.2014.276

Erol, A., Bebis, G., Nicolescu, M., Boyle, R. D., and Twombly, X. (2007). Vision-based hand pose estimation: a review. Comput. Vis. Image Underst. 108, 52–73; Special Issue on Vision for Human-Computer Interaction. doi:10.1016/j.cviu.2006.10.012

Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 88, 303–338. doi:10.1007/s11263-009-0275-4

Fehr, J., and Burkhardt, H. (2008). “3d rotation invariant local binary patterns,” in Pattern Recognition, 2008. ICPR 2008. 19th International Conference on (Tampa, FL: IEEE), 1–4. doi:10.1109/ICPR.2008.4761098

Felzenszwalb, P. F., Girshick, R. B., and McAllester, D. (2010a). “Cascade object detection with deformable part models,” in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (San Francisco, CA: IEEE), 2241–2248.

Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. (2010b). Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32, 1627–1645. doi:10.1109/TPAMI.2009.167

Fischler, M. A., and Elschlager, R. (1973). The representation and matching of pictorial structures. IEEE Trans. Comput. C-22, 67–92. doi:10.1109/T-C.1973.223602

Fleuret, F., and Geman, D. (2001). Coarse-to-fine face detection. Int. J. Comput. Vis. 41, 85–107. doi:10.1023/A:1011113216584

Fleuret, F., and Geman, D. (2008). Stationary features and cat detection. Journal of Machine Learning Research (JMLR) 9, 2549–2578.

Gangaputra, S., and Geman, D. (2006). “A design principle for coarse-to-fine classification,” in Proc. of the IEEE Conference of Computer Vision and Pattern Recognition , Vol. 2 (New York, NY: IEEE), 1877–1884. doi:10.1109/CVPR.2006.21

Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). “Rich feature hierarchies for accurate object detection and semantic segmentation,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 580–587.

Harding, S., Leitner, J., and Schmidhuber, J. (2013). “Cartesian genetic programming for image processing,” in Genetic Programming Theory and Practice X, Genetic and Evolutionary Computation , eds R. Riolo, E. Vladislavleva, M. D. Ritchie, and J. H. Moore (New York, NY: Springer), 31–44.

Hjelmås, E., and Low, B. K. (2001). Face detection: a survey. Comput. Vis. Image Underst. 83, 236–274. doi:10.1006/cviu.2001.0921

Huang, C., Ai, H., Li, Y., and Lao, S. (2007). High-performance rotation invariant multiview face detection. IEEE Trans. Pattern Anal. Mach. Intell. 29, 671–686. doi:10.1109/TPAMI.2007.1011

Huval, B., Coates, A., and Ng, A. (2013). Deep Learning for Class-Generic Object Detection . arXiv preprint arXiv:1312.6885.

Kölsch, M., and Turk, M. (2004). “Robust hand detection,” in Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (Seoul: IEEE), 614–619.

Kotzias, D., Denil, M., Blunsom, P., and de Freitas, N. (2014). Deep Multi-Instance Transfer Learning . CoRR, abs/1411.3128.

Lampert, C. H., Blaschko, M., and Hofmann, T. (2009). Efficient subwindow search: a branch and bound framework for object localization. IEEE Trans. Pattern Anal. Mach. Intell. 31, 2129–2142. doi:10.1109/TPAMI.2009.144

Leitner, J., Harding, S., Chandrashekhariah, P., Frank, M., Frster, A., Triesch, J., et al. (2013). Learning visual object detection and localisation using icvision. Biol. Inspired Cogn. Archit. 5, 29–41; Extended versions of selected papers from the Third Annual Meeting of the {BICA} Society (BICA 2012). doi:10.1016/j.bica.2013.05.009

Li, J., and Allinson, N. M. (2008). A comprehensive review of current local features for computer vision. Neurocomputing 71, 1771–1787; Neurocomputing for Vision Research Advances in Blind Signal Processing. doi:10.1016/j.neucom.2007.11.032

Li, S. Z., and Zhang, Z. (2004). Floatboost learning and statistical face detection. IEEE Trans. Pattern Anal. Mach. Intell. 26, 1112–1123. doi:10.1109/TPAMI.2004.68

Li, Y., Wang, S., Tian, Q., and Ding, X. (2015). Feature representation for statistical-learning-based object detection: a review. Pattern Recognit. 48, 3542–3559. doi:10.1016/j.patcog.2015.04.018

Mesnil, G., Dauphin, Y., Glorot, X., Rifai, S., Bengio, Y., Goodfellow, I. J., et al. (2012). “Unsupervised and transfer learning challenge: a deep learning approach,” in JMLR W& CP: Proceedings of the Unsupervised and Transfer Learning Challenge and Workshop , Vol. 27, eds I. Guyon, G. Dror, V. Lemaire, G. Taylor, and D. Silver (Bellevue: 97–110.

Mottaghi, R., Chen, X., Liu, X., Cho, N.-G., Lee, S.-W., Fidler, S., et al. (2014). “The role of context for object detection and semantic segmentation in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 891–898. doi:10.1109/CVPR.2014.119

Mutch, J., and Lowe, D. G. (2008). Object class recognition and localization using sparse features with limited receptive fields. Int. J. Comput. Vis. 80, 45–57. doi:10.1007/s11263-007-0118-0

Ong, E.-J., and Bowden, R. (2004). “A boosted classifier tree for hand shape detection,” in Proceedings of the Sixth International Conference on Automatic Face and Gesture Recognition (Seoul: IEEE), 889–894. doi:10.1109/AFGR.2004.1301646

Osuna, E., Freund, R., and Girosi, F. (1997). “Training support vector machines: an application to face detection,” in Proc. of the IEEE Conference of Computer Vision and Pattern Recognition (San Juan: IEEE), 130–136. doi:10.1109/CVPR.1997.609310

Ouyang, W., and Wang, X. (2013). “Joint deep learning for pedestrian detection,” in Computer Vision (ICCV), 2013 IEEE International Conference on (Sydney, VIC: IEEE), 2056–2063. doi:10.1109/ICCV.2013.257

Ouyang, W., Wang, X., Zeng, X., Qiu, S., Luo, P., Tian, Y., et al. (2015). “Deepid-net: deformable deep convolutional neural networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (Boston, MA: IEEE), 2403–2412.

Paisitkriangkrai, S., Shen, C., and van den Hengel, A. (2015). Pedestrian detection with spatially pooled features and structured ensemble learning. IEEE Trans. Pattern Anal. Mach. Intell. PP, 1. doi:10.1109/TPAMI.2015.2474388

Paletta, L., and Pinz, A. (2000). Active object recognition by view integration and reinforcement learning. Rob. Auton. Syst. 31, 71–86. doi:10.1016/S0921-8890(99)00079-2

Palma-Amestoy, R., Ruiz-del Solar, J., Yanez, J. M., and Guerrero, P. (2010). Spatiotemporal context integration in robot vision. Int. J. Human. Robot. 07, 357–377. doi:10.1142/S0219843610002192

Papageorgiou, C., and Poggio, T. (2000). A trainable system for object detection. Int. J. Comput. Vis. 38, 15–33. doi:10.1023/A:1008162616689

Park, D., Ramanan, D., and Fowlkes, C. (2010). “Multiresolution models for object detection,” in Computer Vision ECCV 2010, Volume 6314 of Lecture Notes in Computer Science , eds K. Daniilidis, P. Maragos, and N. Paragios (Berlin: Springer), 241–254.

Peng, B., Zhang, L., and Zhang, D. (2013). A survey of graph theoretical approaches to image segmentation. Pattern Recognit. 46, 1020–1038. doi:10.1016/j.patcog.2012.09.015

Prati, A., Gualdi, G., and Cucchiara, R. (2012). Multistage particle windows for fast and accurate object detection. IEEE Trans. Pattern Anal. Mach. Intell. 34, 1589–1604. doi:10.1109/TPAMI.2011.247

Ramanan, A., and Niranjan, M. (2012). A review of codebook models in patch-based visual object recognition. J. Signal Process. Syst. 68, 333–352. doi:10.1007/s11265-011-0622-x

Razavi, N., Gall, J., and Van Gool, L. (2011). “Scalable multi-class object detection,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (Providence, RI: IEEE), 1505–1512. doi:10.1109/CVPR.2011.5995441

Romdhani, S., Torr, P., Scholkopf, B., and Blake, A. (2001). “Computationally efficient face detection,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on , Vol. 2 (Vancouver, BC: IEEE), 695–700. doi:10.1109/ICCV.2001.937694

Rowley, H. A., Baluja, S., and Kanade, T. (1998). Neural network-based detection. IEEE Trans. Pattern Anal. Mach. Intell. 20, 23–28. doi:10.1109/34.655647

Schneiderman, H., and Kanade, T. (2000). “A statistical model for 3D object detection applied to faces and cars,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (Hilton Head, SC: IEEE), 746–751.

Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2013). Overfeat: Integrated Recognition, Localization and Detection Using Convolutional Networks . arXiv preprint arXiv:1312.6229.

Serre, T., Wolf, L., and Poggio, T. (2005). “Object recognition with features inspired by visual cortex,” in CVPR (2) (San Diego, CA: IEEE Computer Society), 994–1000.

Song, H. O., Zickler, S., Althoff, T., Girshick, R., Fritz, M., Geyer, C., et al. (2012). “Sparselet models for efficient multiclass object detection,” in Computer Vision-ECCV 2012 (Florence: Springer), 802–815.

Sun, M., Bao, S., and Savarese, S. (2012). Object detection using geometrical context feedback. Int. J. Comput. Vis. 100, 154–169. doi:10.1007/s11263-012-0547-2

Sun, Z., Bebis, G., and Miller, R. (2006). On-road vehicle detection: a review. IEEE Trans. Pattern Anal. Mach. Intell. 28, 694–711. doi:10.1109/TPAMI.2006.104

Sung, K.-K., and Poggio, T. (1998). Example-based learning for viewed-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20, 39–51. doi:10.1109/34.655648

Szegedy, C., Toshev, A., and Erhan, D. (2013). “Deep neural networks for object detection,” in Advances in Neural Information Processing Systems 26 , eds C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger (Harrahs and Harveys: Curran Associates, Inc), 2553–2561.

Torralba, A., Murphy, K. P., and Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Trans. Pattern Anal. Mach. Intell. 29, 854–869. doi:10.1109/TPAMI.2007.1055

Torralba, A., and Sinha, P. (2001). “Statistical context priming for object detection,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on , Vol. 1 (Vancouver, BC: IEEE), 763–770. doi:10.1109/ICCV.2001.937604

Tsai, C.-F. (2012). Bag-of-words representation in image annotation: a review. ISRN Artif. Intell. 2012, 19. doi:10.5402/2012/376804

Tuytelaars, T., and Mikolajczyk, K. (2008). Local invariant feature detectors: a survey. Found. Trends Comput. Graph. Vis. 3, 177–280. doi:10.1561/0600000017

Verschae, R., and Ruiz-del-Solar, J. (2012). “Tcas: a multiclass object detector for robot and computer vision applications,” in Advances in Visual Computing, Volume 7431 of Lecture Notes in Computer Science , eds G. Bebis, R. Boyle, B. Parvin, D. Koracin, C. Fowlkes, S. Wang, et al. (Berlin: Springer), 632–641.

Verschae, R., Ruiz-del-Solar, J., and Correa, M. (2008). A unified learning framework for object detection and classification using nested cascades of boosted classifiers. Mach. Vis. Appl. 19, 85–103. doi:10.1007/s00138-007-0084-0

Viola, P., and Jones, M. (2001). “Rapid object detection using a boosted cascade of simple features,” in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition (Kauai: IEEE), 511–518. doi:10.1109/CVPR.2001.990517

Viola, P., and Jones, M. (2002). “Fast and robust classification using asymmetric adaboost and a detector cascade,” in Advances in Neural Information Processing System 14 (Vancouver: MIT Press), 1311–1318.

Viola, P., Jones, M., and Snow, D. (2005). Detecting pedestrians using patterns of motion and appearance. Int. J. Comput. Vis. 63, 153–161. doi:10.1007/s11263-005-6644-8

Viola, P., and Jones, M. J. (2004). Robust real-time face detection. Int. J. Comput. Vis. 57, 137–154. doi:10.1023/B:VISI.0000013087.49260.fb

Weinland, D., Ronfard, R., and Boyer, E. (2011). A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. 115, 224–241. doi:10.1016/j.cviu.2010.10.002

Wu, B., and Nevatia, R. (2005). “Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors,” in ICCV ‘05: Proceedings of the 10th IEEE Int. Conf. on Computer Vision (ICCV’05) Vol 1 (Washington, DC: IEEE Computer Society), 90–97.

Wu, B., and Nevatia, R. (2007). “Cluster boosted tree classifier for multi-view, multi-pose object detection,” in ICCV (Rio de Janeiro: IEEE), 1–8.

Wu, L., Hoi, S., and Yu, N. (2010). Semantics-preserving bag-of-words models and applications. IEEE Trans. Image Process. 19, 1908–1920. doi:10.1109/TIP.2010.2045169

Wu, L., Hu, Y., Li, M., Yu, N., and Hua, X.-S. (2009). Scale-invariant visual language modeling for object categorization. IEEE Trans. Multimedia 11, 286–294. doi:10.1109/TMM.2008.2009692

Yan, J., Lei, Z., Wen, L., and Li, S. Z. (2014). “The fastest deformable part model for object detection,” in Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on (Columbus, OH: IEEE), 2497–2504.

Yang, M.-H., Ahuja, N., and Kriegman, D. (2000a). “Mixtures of linear subspaces for face detection,” in Proc. Fourth IEEE Int. Conf. on Automatic Face and Gesture Recognition (Grenoble: IEEE), 70–76.

Yang, M.-H., Roth, D., and Ahuja, N. (2000b). “A SNoW-based face detector,” in Advances in Neural Information Processing Systems 12 (Denver: MIT press), 855–861.

Yang, M.-H., Kriegman, D., and Ahuja, N. (2002). Detecting faces in images: a survey. IEEE Trans. Pattern Anal. Mach. Intell. 24, 34–58. doi:10.1109/34.982883

Zafeiriou, S., Zhang, C., and Zhang, Z. (2015). A survey on face detection in the wild: past, present and future. Comput. Vis. Image Underst. 138, 1–24. doi:10.1016/j.cviu.2015.03.015

Zeng, X., Ouyang, W., and Wang, X. (2013). “Multi-stage contextual deep learning for pedestrian detection,” in Computer Vision (ICCV), 2013 IEEE International Conference on (Washington, DC: IEEE), 121–128.

Zhou, B., Khosla, A., Lapedriza, À., Oliva, A., and Torralba, A. (2014). Object Detectors Emerge in Deep Scene Cnns . CoRR, abs/1412.6856.

Zhu, X., and Ramanan, D. (2012). “Face detection, pose estimation, and landmark localization in the wild,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (Providence: IEEE), 2879–2886.

Keywords: object detection, perspective, mini review, current directions, open problems

Citation: Verschae R and Ruiz-del-Solar J (2015) Object Detection: Current and Future Directions. Front. Robot. AI 2:29. doi: 10.3389/frobt.2015.00029

Received: 20 July 2015; Accepted: 04 November 2015; Published: 19 November 2015

Reviewed by:

Copyright: © 2015 Verschae and Ruiz-del-Solar. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Rodrigo Verschae,

† Present address: Rodrigo Verschae, Graduate School of Informatics, Kyoto University, Kyoto, Japan

Please note that Internet Explorer version 8.x is not supported as of January 1, 2016. Please refer to this support page for more information.


Procedia Computer Science

Application of deep learning for object detection.

The ubiquitous and wide applications like scene understanding, video surveillance, robotics, and self-driving systems triggered vast research in the domain of computer vision in the most recent decade. Being the core of all these applications, visual recognition systems which encompasses image classification, localization and detection have achieved great research momentum. Due to significant development in neural networks especially deep learning, these visual recognition systems have attained remarkable performance. Object detection is one of these domains witnessing great success in computer vision. This paper demystifies the role of deep learning techniques based on convolutional neural network for object detection. Deep learning frameworks and services available for object detection are also enunciated. Deep learning techniques for state-of-the-art object detection systems are assessed in this paper.

Cited by (0)

Object detection in real time based on improved single shot multi-box detector algorithm

EURASIP Journal on Wireless Communications and Networking volume  2020 , Article number:  204 ( 2020 ) Cite this article

19k Accesses

64 Citations

Metrics details

In today’s scenario, the fastest algorithm which uses a single layer of convolutional network to detect the objects from the image is single shot multi-box detector (SSD) algorithm. This paper studies object detection techniques to detect objects in real time on any device running the proposed model in any environment. In this paper, we have increased the classification accuracy of detecting objects by improving the SSD algorithm while keeping the speed constant. These improvements have been done in their convolutional layers, by using depth-wise separable convolution along with spatial separable convolutions generally called multilayer convolutional neural networks. The proposed method uses these multilayer convolutional neural networks to develop a system model which consists of multilayers to classify the given objects into any of the defined classes. The schemes then use multiple images and detect the objects from these images, labeling them with their respective class label. To speed up the computational performance, the proposed algorithm is applied along with the multilayer convolutional neural network which uses a larger number of default boxes and results in more accurate detection. The accuracy in detecting the objects is checked by different parameters such as loss function, frames per second (FPS), mean average precision (mAP), and aspect ratio. Experimental results confirm that our proposed improved SSD algorithm has high accuracy.

1 Introduction

The information age has witnessed the rapid development of wireless network technology, which has attracted the attention of researchers and practitioners due to its unique characteristics such as flexible structure and efficiency. As wireless network technology continues to evolve, it has brought great convenience to people’s life and work with its powerful technical capabilities. Wireless networks have gradually facilitated the main stream of people’s online life. At the same time, the advent of 5G network will further enable the greater development and more advanced applications of wireless network technology. The future generations of wireless networks will provide strong support for related applications such as Internet of Things (IoT) and virtual reality (VR). Many of these applications connect to each other and transmit information within networks based on the detection of specific target objects. In order to achieve a comprehensive network connection between people and people, things and people, and things and things, one of the key tasks of future applications is to identify the target in a real-time manner in the wireless networks [ 1 ].

Identifying each object in a picture or scene with the help of computer/software is called object detection. Object detection is one of the most important problems in the area of wireless network computer vision. It is the basis of complex vision tasks such as target tracking and scene understanding and is widely used in wireless networks. The task of object detection is to determine whether there are objects belonging to the specified category in the image. If it exists, then the subsequent task is to identify its category and location information. Traditional object detection algorithms are mainly devoted to the detection of a few types of targets, such as pedestrian detection [ 2 ] and infrared target detection [ 3 ]. Due to the recent advance of deep learning technology [ 4 ], especially after the appearance of the deep convolution neural network (CNN) technology, object detection algorithms have made a breakthrough development. Within these algorithms, three major methods widely adopted in this field are You Only Look Once (YOLO), single shot multi-box detector (SSD), and faster region CNN (F-RCNN) [ 5 ].

However, with the upcoming of 5G, the characteristics of wireless network, such as massive data, service evolution, data diversification, and uneven spatial-temporal distribution of data, have posed severe challenges to object detection under a real-time environment. Besides, real-time object detection also needs to be completed on any device and in any environment. To address the challenges, this paper proposes object detection technique to detect objects in real time with a model that can be executed on any device in any environment. Specifically, our proposed method applies convolutional neural networks to develop a model that consists of multiple layers to classify the given objects into several defined classes. Based on the recent advancement in deep learning with image processing, the proposed schemes then use multiple images and detect the objects from these images, labeling them with their respective class label. These images can be from videos which are fed into the model we prepared, and the training of the model takes place until the error rate is reduced to an acceptable level. To speed up the computational performance of the object detection technique, we have used improved single shot multi-box detector (SSD) algorithm along with the faster region convolutional neural network. We also conduct experiments to check the accuracy of our proposed method in detecting the objects with different parameters including loss function, mean average precision (mAP), and frames per second. The experiment results demonstrate that the proposed model has a high performance in detect accurate objects for real-time applications.

Specifically, this research makes contributions to the existing literature by improving the accuracy of SSD algorithm for detecting smaller objects. SSD algorithm works well in detecting large objects but is less accurate in detecting smaller objects. Hence, we modify the SSD algorithm to achieve acceptable accuracy for detecting smaller objects. The images or scenes are taken from web cameras and we have used Pascal visual object class (VOC) and common objects in context (COCO) datasets to carry out experiments. We capture object detection (OD) datasets from our center for image processing lab. We make use of different libraries to form a network and use tensorflow-GPU 1.5. For experimental setup, tensorflow directory, SSD MobilenetV1 FPN Feature Extractor, tensorflow object detection API, and anaconda virtual environment are used. This entire setup enables us to produce real-time object detection in a better way.

The rest of this paper is organized as follows. The next section summarizes related work with a focus on the existing techniques of object detection. The third section discusses about the improved SSD algorithm. The fourth section represents the experimental results. The fifth section describes discussion and analysis, limitations, and future research directions. The final section concludes the paper.

2 Related work

2.1 computer vision detection.

In 2012, Alex [ 6 ] used the deep CNN Alex Net to win the championship in the task of ILSVRC 2012 image classification, which was superior to the traditional algorithms. Then scholars began to study the application of deep CNN in object detection. They used Alex Netto construct algorithms, such as R-CNN [ 7 , 8 , 9 ], YOLO [ 5 ], SSD [ 10 ], and others, which resulted in a surging research stream of computer vision detection.

Girshick et al. [ 8 ] proposed a method R-CNN by successfully combining region proposals with CNNs, which improves mean average precision (mAP) by more than 30%. The next year Girshick [ 11 ] named a new algorithm faster R-CNN, which employs spatial pyramid pooling networks. But it had a bottleneck in region proposal computation. In order to overcome this disadvantage, Ren et al. [ 12 ] successfully introduced a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network. Cao et al. [ 13 ] proposed a rotation invariant faster R-CNN target detection algorithm. By adding regularization constraints to the target function of the model, the invariance of the target CNN feature rotation is enhanced. The model improved the accuracy by an average of 2.4%. Dai et al. [ 14 ] proposed position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation variance in object detection and successfully executed them 2.5–20 times faster than the F-RCNN counterpart. Lin et al. [ 15 ] developed a top-down architecture with lateral connections, called Feature Pyramid Network (FPN), by building high-level semantic feature maps at all scales. All these algorithms successfully solved the problem in object detection. However, there are still defects in accuracy and speed for wireless network object detection applications.

In order to get a better computing speed, Redmon et al. [ 5 ] proposed a new YOLO algorithm to object detection. It achieved double mAP in real-time detectors. Then Redmon et al. [ 16 ] put forward an improved algorithm YOLO V2. On the basis of YOLO, batch normalization [ 17 ] was added in the algorithm to speed up the training and the algorithm adds anchor boxes and high resolution classifier to improve accuracy. Result shows that it runs significantly faster than F-RCNN with ResNet [ 18 ] and SSD [ 19 ]. To achieve high speed and accuracy rate, scholars further optimized the YOLO V2. For instance, Wei et al. [ 20 ] used the dimension clustering of object box, classified network pre-training, conducted multi-scale detection training, changed the candidate box filtering rules and other methods, and made the algorithm better adaptive to the location task and object detection. It increased the average accuracy rate of the detection network to 79.5%. Redmon et al. [ 21 ] proposed the third version of the YOLO series, YOLO V3, which improved the algorithm at the accuracy of detection.

For SSD algorithm, researchers also made many other improvements, such as DSSD [ 22 ], F-SSD [ 23 ], and R-SSD [ 24 ]. They all improved the fusion method of different features. However, it has some difficulties at expressing the shallow features of the prediction layer. To address some of the concerns, Fu et al. [ 22 ] offered a Deconvolutional Single Shot Detector (DSSD) by combining a classifier (Residual-101 [ 25 ]) with an SSD [ 26 ]. Wang et al. [ 27 ] proposed an improved SSD algorithm based on target proposal and one-stage target detection to improve the target detection performance. It improved the mAP in the small target detection by 14.46% and 13.92% compared to F-RCNN [ 13 ] and R-FCN [ 14 ] algorithms. Lin et al. [ 28 ] designed a detector RetinaNet to address extreme foreground-background class imbalance by reshaping the standard cross-entropy loss.

2.2 Relevant systems

A deep neural network is basically consisted of two different models: the first is convoluted and the other is non-linear relationships. In both models, an object is considered as a layered configuration of primitives. Numerous architectures and algorithms have implemented the concept of deep learning neural networks including belief network, stacked network, and gated recurrent unit. The first CNN was constructed by LeCun et al. [ 29 ]. The different application domains of CNN now include image-processing, handwriting character recognition, etc. Object detection is performed by estimating the coordinates and class of particular objects in the picture. The presence of these objects in a picture may be in random positions. We next summarize the details of faster RCNN and YOLO v3 architecture as they are directly relevant to our proposed method.

2.2.1 Faster RCNN

Region Proposal Network for generating regions and detecting objects uses two methods of fast RCNN. The first method proposes regions and uses the proposed regions respectively. In fast RCNN, Ren et al. [ 12 ] has used 16 architectures in convolution layers to achieve detection and classification accuracy on datasets. Kumar et al. [ 30 , 31 , 32 ] proposed a method to detect the objects with audio device in real time for blind people using deep neural network. Figure 1 demonstrates the architecture of Faster RCNN. There is a limitation in Faster R-CNN that it has a complex training process and slow processing speed.

figure 1

Architecture of F-RCNN [ 12 ]. Demonstrates the architecture of Faster RCNN. There is a limitation in faster R-CNN that it has a complex training process and slow processing speed

2.2.2 YOLO V3

YOLO V3 is a detector of objects which makes use of features learned by a deep convolutional neural network for detecting object in real time [ 21 ]. It consists of 75 convolutional layers with up-sampling layers and skips connections for the complete image one neural network being applied. Regions of the image are made. Later bounding boxes are displayed along with probabilities. The most noticeable feature of YOLO V3 is that the detections at three different scales can be done with the help of it. But the speed has been traded off for boosts in accuracy in YOLO v3, and it does not perform well with small objects that appear in groups. Figure 2 represents the working mechanism of the YOLO model.

figure 2

A YOLO model [ 21 ]. Represents the working of YOLO model for detecting the objects from the image. But the speed has been traded off for boosts in accuracy in YOLO v3 and it does not perform well with small objects that appear in groups. The fast YOLO model has lower scores but good real-time performances

2.2.3 Our contribution

To highlight our contribution to the existing literature, we next summarize some of the key points of our proposed object detection technique based on the improved SSD algorithm.

The improved SSD algorithm uses depth-wise separable convolution and spatial separable convolutions in their convolutional layers. The depth-wise separable convolution performs operations such that it maps each number of input channel with its corresponding number of output channel separately. Spatial separable convolution is the same as depth-wise convolution along the x - and y -axis.

This architecture reduces the number of operations to execute the algorithm in fast speed through ways used by depth-wise separable convolution to reduce the number of channels with the help of width multiplier and those used by spatial separable convolution to reduce the feature maps of spatial dimensions by applying resolution multiplier.

We use mAP and FPS as standard parameters for object detection. The major objective during the training is to get a high-class confidence score.

The proposed approach enables us to produce real-time object detection by using optimal values of aspect ratio.

Our improved SSD algorithm uses many default boxes, which results in more accurate detection of objects.

3 Methodology

This section presents our proposed approach for detecting the objects in real-time from images by using convolutional neural network deep learning process. The previous algorithms such as CNN, faster CNN, faster RCNN, YOLO, and SSD are only suitable for highly powerful computing machines and they require a large amount of time to train. In this paper, we have tried to overcome the limitations of the SSD algorithm by introducing an improved SSD algorithm with some improvement. The proposed scheme uses improved SSD algorithm for higher detection precision with real-time speed. However, SSD algorithm is not appropriate to detect tiny objects, since it overlooks the context from the outside of the boxes. To address this issue, the proposed algorithm uses depth-wise separable convolution and spatial separable convolutions in their convolutional layers. Specifically, our proposed approach uses a new architecture as a combination of multilayer of convolutional neural network. The algorithm comprises of two phases. First, it reduces the feature maps extraction of spatial dimensions by using resolution multiplier. Second, it is designed with the application of small convolutional filters for detecting objects by using the best aspect ratio values. The major objective during the training is to get a high-class confidence score by matching the default boxes with the ground truth boxes. The advantage of having multi-box on multiple layers leads to significant results in detection. Single shot multi-box detector was discharged at the tip of Gregorian calendar month 2016 and thus arrived at a new set of records on customary knowledge sets like Pascal VOC and COCO. The major problem with the previous methods was how to recover the fall in precision, for which SSD applies some improvements that include multi-scale feature map and default boxes. For detecting a small object with higher resolutions, feature maps are used. The training set of improved SSD algorithm depends upon three main sections, i.e., selecting the size of box, matching of boxes, and loss function. The proposed scheme can be understood by the system model given in Fig. 3 .

figure 3

The proposed system model. The training set of improved SSD algorithm depends upon three main sections, i.e., selecting the size of box, matching of boxes, and loss function. The proposed scheme can be understood by the system model given in Fig. 3

3.1 SSMBD algorithm

In order to interpret the role of SSD algorithm, we first formally denote the following concepts.

Single shot: This means that the tasks of the thing localization and classification are exhausted one passing play of the network.

Multi-box: Ground truth box and predicted box are the boxes in multi-box. This is introduced by Szegedy [ 33 ].

Detector: The network is an associate degree object detector that conjointly classifies those detected objects.

Default the size of the boxes: The selection of boxes is based on the minimum value of convolution layer and maximum values of change in intensity [ 34 ]. The first algorithm represents the procedure of producing specified feature maps F(m).

Truth boxes: After finding the size of boxes, the next phase is matching of the boxes with the corresponding truth boxes. A specific given picture to identify the truth boxes is explained in the second algorithm.

Loss function: The loss function is unbelievably simple, and it is a methodology of evaluating how well your role models your dataset. If your predictions are entirely of your loss function, it can operate next range. If the output range is less, it means that the model is good. The main objective is to minimize loss function. The loss function is also depending upon the sum of weighted localization and classification loss functions [ 35 ].

When a color image is fed into the input layer, SSD does the following.

Step 1: Image is passed through large number of convolutional layers extracting feature maps at different points.

Step 2: Every location in each of those feature maps uses a 4x4 filter to judge a tiny low default box.

Step 3: Predict the bounding box offset for each box.

Step 4: Predict the class probabilities for each box.

Step 5: Based on IOU, the truth boxes are matched with the predicted boxes.

Step6: Instead of exploiting all the negative examples, the result exploits the best-assured loss for every default box.

Steps in SSMBD Algorithm:

figure a

Steps in identifying box size:

figure b

Figure 4 shows the process of identifying total number of default boxes, and Fig. 5 demonstrates the process of detecting objects with different color boxes.

figure 4

Process of identifying total number of default boxes. Shows the process of identifying total number of default boxes

figure 5

Process of detecting objects with different color boxes. Demonstrates the process of detecting objects with different color boxes

4 Experimental results

This study proposes object detection technique to detect objects in real time on any device running the proposed model in any environment. We use python programming language and OpenCV 2.4 library to execute the proposed system. Python libraries are the open source framework for the construction, training, and identification of object detection. The chosen datasets taken into consideration for this research were bound to a group of people. Multi-scale feature extraction may improve the accuracy for detecting big object but does not exhibit a good precision of speed to detect small objects. Therefore, we have used depth-wise separable convolution along with spatial separable convolutions to achieve this. For conducting the experiments and producing the results, we use Pascal VOC Footnote 1 and COCO Footnote 2 object detection (OD) datasets from our center for image processing lab.

4.1 Experimental setup

We make use of different libraries to form a network and also use tensorflow-GPU 1.5. Once the training is done our next objective is to test the model for accuracy. The next objective is to optimize the model as a tensorflow serving and deploy that to the environment which we want to use. For experimental setup tensorflow directory, SSD MobilenetV1 FPN feature extractor, tensorflow object detection API, and anaconda virtual environment are used. This entire setup enables us to produce real-time object detection in a better way. To achieve great precision, we have increased the number of default boxes with less confidence and focus on the boxes having high confidence.

4.2 Performance etrics

The performance metrics which are used to evaluate the performance of improved SSD algorithm to predict the boundary boxes and truth boxes for classification of object are discussed here. These metrics include mAP, FPS, aspect ratio, logistic regression, and Intersection over Union (IoU). The box regression technique of SSD is used to identify the bounding box coordinate.

The accuracy is calculated using Equation ( 1 ) below, which could be improved over the original dataset.

In the equation, O correct represents the number of correctly detected object and T obj the total number of images.

IoU is calculated by the Jaccard index to find out the overlap between two bounding boxes [ 35 ]. Equation (2) shows the formula for IoU.

Logistic regression is a model which identifies the probability of a result being obtained. We have to segregate our problem dataset into different class labels. Logistic regression model usually gives one of the highest accuracies of all classification models [ 36 ]. Equation (3) indicates the logistic regression function.

Aspect ratio is used to find out the relationship between the width and height of the image. Basically, it represents a shape of an image. We have used A R to represent the aspect ratio.

Figure 6 demonstrates the different sample images taken from object detection (OD) datasets from our center for image processing lab used in the experimental setup.

figure 6

Some sample images and object detection using improved SSD model. Demonstrates the different sample images taken from object detection (OD) dataset from our Centre for Image processing Lab used in the experimental setup

Figure 7 represents the various objects detected by the proposed algorithm. In this research work, we have used different colors of boxes to show different class labels. Our scheme correctly detects and recognizes bottle, laptop, mouse, cup, teddy bear, umbrella, person, keyboard, TV, zebra, toy car, bowl, chair, bird, vassal, and suitcase.

figure 7

Detection of objects with different boxes using proposed approach. Represents the various objects detected by the proposed algorithm. We have used different colors of boxes to show different class labels. Our scheme correctly detects and recognizes bottle, laptop, mouse, cup, teddy bear, umbrella, person, keyboard, TV, zebra, toy car, bowl, chair, bird, vassal, and suitcase

5 Discussion and analysis

We have analyzed the correctness of our improved SSD algorithm which uses depth-wise separable convolution along with spatial separable convolutions generally called multilayer to increase the classification accuracy of detecting small objects without affecting the speed. These multilayer convolutional neural networks use the confidence value for improving the process of detecting accurate boxes. For experimental setup, tensorflow directory, SSD MobilenetV1 FPN feature extractor, tensorflow object detection API, and anaconda virtual environment are used. The algorithm includes width multiplier and resolution multiplier to minimize the channels, and feature maps as well. The proposed approach produces real-time object detection by using aspect ratio. Our improved SSD algorithm consists of large amounts of data, easy trained model, and faster GPUs, which allows to detect and classify multiple objects within an image with high accuracy. The key functions of the proposed algorithm are object detection, object localization, loss function, default boundary box, truth box, feature map, and localization. In the object detection techniques, the selection convolutional layer plays a vital role to improve from 65.5 to 74.3% with respect to mAP. In the case of default box shapes, it improves from 71.6 to 74.3% with respect to mAP. Our improved SSD algorithm uses the 4 × 4 feature maps along with a greater number of default boxes, resulting in a more accurate detection. While comparing with the other previous models, the testing speed of our proposed model is still faster because our approach gives 79.8% of mAP and 89 FPS. We also compare with other feature extraction model such as YOLO, SSD512, SSD300, and F-CNN to obtain the results. Table 1 demonstrates the comparison between F-CNN, YOLO, SSD512, SSD300, and our proposed model. We have combined faster R-CNN with SSD together to achieve high accuracy and FPS with good speed to detect objects in real time as well. Table 1 represents the different parameter of the improved SSD algorithm by using VOC and COCO test datasets.

Table 2 shows the performance of different machine learning algorithm as image classifiers namely convolution neural network, faster R-CNN, R-CNN, and faster R-CNN VGG, ZF.

Table 3 represents the different values of mAP on Pascal VOC and COCO datasets. It is obvious that improved SSD with multi-scale contexts meets our demand as the best solution.

Although we have improved the SSD algorithm, there are certain limitations in our research, such as blockage, deformable objects, corrupt objects, and interlaced objects. One more limitation of our object detection algorithm is its inability to deal with new object classes. Although we have trained our model for every possible object class, this problem can occur when an anonymous object is present in the image.

For detecting the object, we have used different deep learning algorithms as object classifiers namely convolution neural network and logistic regression. We have applied four different object detection algorithms like SSD512, SSD300, YOLO, and F-CNN to obtain the various small objects from the images with respect to Intersection over Union (IoU). The IoU curves and results demonstrate that our proposed approach gives the highest accuracy of 96.7%. Figure 8 a, b, and c show the different curves implemented on SSD512, SSD300, amd YOLO V3-based detector.

figure 8

a IOU Curve of SSD512. b IOU Curve of SSD300. c IOU Curve of Yolo V3. Our proposed approach gives us the highest accuracy of 96.7%. a – c Shows the different curves implemented on SSD312, SSD300, and YOLO

The proposed improved SSD approach has a higher recall value, i.e., 0.9, compared with that from YOLO, faster RCNN, NASNet, and R-FCN. Figure 9 demonstrates the graph of recall percentage versus threshold IoU compared with other object detection techniques such as YOLO, Faster-RCNN, NASNet, and R-FCN on object detection (OD) datasets from our center for image processing lab. The recall value of improved SSD algorithm is 79.8% when the different value of IoU is applied.

figure 9

Represent the different recall curves by using object detection algorithm on a particular threshold value of IoU. The proposed improved SSD approach has higher recall value, i.e., 0.9 compare with YOLO, faster R-CNN, NASNet, R-FCN as shown in a . b Demonstrates the graph of recall % vs threshold IoU compared with other object detection technique such YOLO, faster R-CNN, NASNet, and R-FCN on object detection (OD) dataset from our Centre for Image processing Lab. The recall value of improved SSD algorithm is 79.8% when the different value of IoU is applied

6 Conclusion

This study develops an object detector algorithm using deep learning neural networks for detecting the objects from the images. The research uses am improved SSD algorithm along with multilayer convolutional network to achieve high accuracy in real time for the detection of the objects. The performance of our algorithm is good in still images and videos. The accuracy of the proposed model is more than 79.8%. The training time for this model is about 5–6 h. These convolutional neural networks extract feature information from the image and then perform feature mapping to classify the class label. The prime objective of our algorithm is to use the best aspect ratios values for selecting the default boxes so that we can improve SSD algorithm for detecting objects.

For checking the effectiveness of the scheme, we have used Pascal VOC and COCO datasets. We have compared the values of different metrics such as mAP, loss function, aspect ratio, and FPS with other previous models, which indicates that the proposed algorithm achieves a higher mAP, uses more frames to gain good speed, and obtains acceptable accuracy for detecting objects from color images. This paper points out that the algorithm uses truth box to extract feature maps. Future research can extend our proposed algorithm by training the datasets for micro-objects.

Availability of data and materials

The research uses the Pascal VOC and COCO object detection (OD) datasets from the center for image processing lab at Vardhaman College of Engineering. Datasets are available upon request.


Convolution neural network

Faster convolutional neural networks

Faster region convolutional neural networks

Single shot multi-box detector

Mean average precision

Frames per second

Internet of Things

Virtual reality

You Only Look Once

Fifth generation

Object detection

Visual object classes

Common objects in context

Graphics processing unit

Feature Pyramid Network

Application program interface

ImageNet Large Scale Visual Recognition Challenge

Region Proposal Network

Deconvolutional Single Shot Detector

Open source computer vision

Feature maps

Intersection over Union

Visual Geometry Group

Y. Zhong, Y. Yang, X. Zhu, E. Dutkiewicz, Z. Zhou, T. Jiang, Device-free sensing for personnel detection in a foliage environment. IEEE Geoscience and Remote Sensing Letters 14 (6), 921–925 (2017).

Article   Google Scholar  

S.Z. Su, S.Z. Li, S.Y. Chen, G.R. Cai, Y.D. Wu, A survey on pedestrian detection. DianziXuebao 40 (4), 814–820 (2012).

M. Zeng, J. Li, Z. Peng, The design of top-hat morphological filter and application to infrared target detection. Infrared Physics & Technology 48 (1), 67–76 (2006).

L. Deng, D. Yu, Deep learning: methods and applications. Foundations and Trends® in. Signal Processing 7 (3–4), 197–387 (2014).

J. Redmon, S.Divvala, R.Girshick, A. Farhadi, You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788)(2016). .

A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (pp. 1097-1105)(2012). .

H. Jiang, E. Learned-Miller, Face detection with the faster R-CNN. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017) (pp. 650-657)(2017, May). IEEE..

R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587)(2014). .

X. Peng, C. Schmid, Multi-region two-stream R-CNN for action detection. In European conference on computer vision (pp. 744-759). Springer, Cham. (2016, October). .

J. Redmon, A.Angelova, Real-time grasp detection using convolutional neural networks. In 2015 IEEE International Conference on Robotics and Automation (ICRA) (pp. 1316-1322). IEEE(2015, May). .

R. Girshick, Fast r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 1440-1448)(2015). .

S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99) (2015). .

Y.J. Cao, G.M. Xu, G.C. Shi, Low altitude armored target detection based on rotation invariant faster R-CNN[J]. Laser & Optoelectronics Progress, 55(10): 101501(2018). .

J. Dai, Y. Li, K. He, J. Sun, R-fcn: Object detection via region-based fully convolutional networks. In Advances in neural information processing systems (pp. 379-387)(2016).

T.Y. Lin, P.Dollár, R.Girshick, K. He, B. Hariharan, S.Belongie, Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117-2125)(2017). .

J. Redmon, A. Farhadi, YOLO9000: better, faster, stronger. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7263-7271)(2017). .

S. Ioffe, C.Szegedy,Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.(2015).

C. Szegedy, S. Ioffe, V. Vanhoucke, A.A. Alemi, Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence. (2017, February)

W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, A.C. Berg, Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21-37)(2016, October). Springer. Cham..

Y.M. Wei, J.C. Quan, Y.Q.Y. Hou, Aerial image location of unmanned aerial vehicle based on YOLO V2[J]. Laser & Optoelectronics Progress, 54(11): 111002(2017). DOI: .

J. Redmon, A. Farhadi, Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767.(2018).

C.Y. Fu, W. Liu, A.Ranga, A. Tyagi, AC Berg, Dssd: Deconvolutional single shot detector. arXiv preprint arXiv:1701.06659.(2017).

Z. Li, F. Zhou, FSSD: feature fusion single shot multibox detector. arXiv preprint arXiv:1712.00960.(2017).

J. Jeong, H. Park, N. Kwak, Enhancement of SSD by concatenating feature maps for object detection. arXiv preprint arXiv:1705.09587.(2017).

K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770-778)(2016). .

W. Liu, ARabinovich, AC Berg, Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579.(2015).

J.Q. Wang,J.S. Li,X.W. Zhou,X. Zhang, Improved SSD algorithm and its performance analysis of small target detection in remote sensing images[J]. Acta Optica Sinica, 39(6): 0628005(2019). .

T.Y. Lin, P. Goyal, R.Girshick, K. He, P. Dollár, Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision (pp. 2980-2988)(2017). .

Y. LeCun, Y.Bengio, G. Hinton,Deep learning. nature 521(2015). .

A. Kumar, S.P.Ghrera, V. Tyagi, An ID-based secure and flexible buyer-seller watermarking protocol for copyright protection. Pertanika Journal of Science & Technology, 25(1)(2017).

A. Kumar, Design of secure image fusion technique using cloud for privacy-preserving and copyright protection. International Journal of Cloud Applications and Computing (IJCAC), 9(3), 22-36(2019). .

A. Kumar, S. S. S. S. Reddy and V. Kulkarni, “An object detection technique for blind people in real-time using deep neural network,” 2019 Fifth International Conference on Image Information Processing (ICIIP), Shimla, India, 2019, pp. 292-297.

Szegedy, S. Reed, D. Erhan, D.Anguelov, S.Ioffe, Scalable, high-quality object detection. arXiv preprint arXiv:1412.1441.(2014).

V. Thakar, W. Ahmed, M.M.Soltani, J.Y. Yu, Ensemble-based adaptive single-shot multi-box detector. In 2018 International Symposium on Networks, Computers and Communications (ISNCC) (pp. 1-6) (2018, June). IEEE. .

P. Saimadhu, How The Logistic Regression Model Works, , Accessed 2 March, 2017.

Jiang, R. Luo, J. Mao, T. Xiao, Y. Jiang, Acquisition of localization confidence for accurate object detection. In Proceedings of the European Conference on Computer Vision (ECCV) (pp. 784-799)(2018). .

Download references

The paper declares no funding support.

Author information

Authors and affiliations.

Department of Computer Science & Engineering, Vardhaman College of Engineering, Hyderabad, India

Ashwani Kumar

Coggin College of Business, University of North Florida, Jacksonville, FL, 32224, USA

Zuopeng Justin Zhang

Logistics and E-Commerce School, Zhejiang Wanli University, Ningbo, Zhejiang, 315100, China

You can also search for this author in PubMed   Google Scholar


AK conceived of the study and carried out the experiment. HL and ZZ conducted the literature view and helped motivate the research. AK drafted the initial manuscript and all authors participated in further improving the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Hongbo Lyu .

Ethics declarations

Competing interest.

The authors declare that they have no competing interests.

Additional information

Publisher’s note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit .

Reprints and Permissions

About this article

Cite this article.

Kumar, A., Zhang, Z.J. & Lyu, H. Object detection in real time based on improved single shot multi-box detector algorithm. J Wireless Com Network 2020 , 204 (2020).

Download citation

Received : 23 December 2019

Accepted : 01 October 2020

Published : 17 October 2020


Share this article

Anyone you share the following link with will be able to read this content:

Sorry, a shareable link is not currently available for this article.

Provided by the Springer Nature SharedIt content-sharing initiative

research papers on object detection

research papers on object detection

Towards Data Science

Ethan Yanjia Li

Aug 10, 2020

12 Papers You Should Read to Understand Object Detection in the Deep Learning Era

As the second article in the “Papers You Should Read” series, we are going to walk through both the history and some recent developments in a more difficult area of computer vision research: object detection. Before the deep learning era, hand-crafted features like HOG and feature pyramids are used pervasively to capture localization signals in an image. However, those methods usually can’t extend to generic object detection well, so most of the applications are limited to face or pedestrian detections. With the power of deep learning, we can train a network to learn which features to capture, as well as what coordinates to predict for an object. And this eventually led to a boom of applications based on visual perception, such as the commercial face recognition system and autonomous vehicle. In this article, I picked 12 must-read papers for newcomers who want to study object detection. Although the most challenging part of building an object detection system hides in the implementation details, reading these papers can still give you a good high-level understanding of where the ideas come from, and how would object detection evolve in the future.

As a prerequisite for reading this article, you need to know the basic idea of the convolution neural network and the common optimization method such as gradient descent with back-propagation. It’s also highly recommended to read my previous article “ 10 Papers You Should Read to Understand Image Classification in the Deep Learning Era ” first because many cool ideas of object detection originate from a more fundamental image classification research.

2013: OverFeat

OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks

Inspired by the early success of AlexNet in the 2012 ImageNet competition, where CNN-based feature extraction defeated all hand-crafted feature extractors, OverFeat quickly introduced CNN back into the object detection area as well. The idea is very straight forward: if we can classify one image using CNN, what about greedily scrolling through the whole image with different sizes of windows, and try to regress and classify them one-by-one using a CNN? This leverages the power of CNN for feature extraction and classification, and also bypassed the hard region proposal problem by pre-defined sliding windows. Also, since a nearby convolution kernel can share part of the computation result, it is not necessary to compute convolutions for the overlapping area, hence reducing cost a lot. OverFeat is a pioneer in the one-stage object detector. It tried to combine feature extraction, location regression, and region classification in the same CNN. Unfortunately, such a one-stage approach also suffers from relatively poorer accuracy due to less prior knowledge used. Thus, OverFeat failed to lead a hype for one-stage detector research, until a much more elegant solution coming out 2 years later.

2013: R-CNN

Region-based Convolutional Networks for Accurate Object Detection and Segmentation

Also proposed in 2013, R-CNN is a bit late compared with OverFeat. However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e, region proposal stage, and region classification and refinement stage.

In the above diagram, R-CNN first extracts potential regions of interest from an input image by using a technique called selective search. Selective search doesn’t really try to understand the foreground object, instead, it groups similar pixels by relying on a heuristic: similar pixels usually belong to the same object. Therefore, the results of selective search have a very high probability to contain something meaningful. Next, R-CNN warps these region proposals into fixed-size images with some paddings, and feed these images into the second stage of the network for more fine-grained recognition. Unlike those old methods using selective search, R-CNN replaced HOG with a CNN to extract features from all region proposals in its second stage. One caveat of this approach is that many region proposals are not really a full object, so R-CNN needs to not only learn to classify the right classes, but also learn to reject the negative ones. To solve this problem, R-CNN treated all region proposals with a ≥ 0.5 IoU overlap with a ground-truth box as positive, and the rest as negatives.

Region proposal from selective search highly depends on the similarity assumption, so it can only provide a rough estimate of location. To further improve localization accuracy, R-CNN borrowed an idea from “Deep Neural Networks for Object Detection” (aka DetectorNet), and introduced an additional bounding box regression to predict the center coordinates, width and height of a box. This regressor is widely used in the future object detectors.

However, a two-stage detector like R-CNN suffers from two big issues: 1) It’s not fully convolutional because selective search is not E2E trainable. 2) region proposal stage is usually very slow compared with other one-stage detectors like OverFeat, and running on each region proposal separately makes it even slower. Later, we will see how R-CNN evolve over time to address these two issues.

2015: Fast R-CNN

A quick follow-up for R-CNN is to reduce the duplicate convolution over multiple region proposals. Since these region proposals all come from one image, it’s naturally to improve R-CNN by running CNN over the entire image once and share the computation among many region proposals. However, different region proposals have different sizes, which also result in different output feature map sizes if we are using the same CNN feature extractor. These feature maps with various sizes will prevent us from using fully connected layers for further classification and regression because the FC layer only works with a fixed size input.

Fortunately, a paper called “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition” has already solved the dynamic scale issue for FC layers. In SPPNet, a feature pyramid pooling is introduced between convolution layers and FC layers to create a bag-of-words style of the feature vector. This vector has a fixed size and encodes features from different scales, so our convolution layers can now take any size of images as input without worrying about the incompatibility of the FC layer. Inspired by this, Fast R-CNN proposed a similar layer call the ROI Pooling layer. This pooling layer downsamples feature maps with different sizes into a fixed-size vector. By doing so, we can now use the same FC layers for classification and box regression, no matter how large or small the ROI is.

With a shared feature extractor and the scale-invariant ROI pooling layer, Fast R-CNN can reach a similar localization accuracy but having 10~20x faster training and 100~200x faster inference. The near real-time inference and an easier E2E training protocol for the detection part make Fast R-CNN a popular choice in the industry as well.

This dense prediction over the entire image can cause trouble in computation cost, so YOLO took the bottleneck structure from GooLeNet to avoid this issue. Another problem of YOLO is that two objects might fall into the same coarse grid cell, so it doesn’t work well with small objects such as a flock of birds. Despite lower accuracy, YOLO’s straightforward design and real-time inference ability makes one-stage object detection popular again in the research, and also a go-to solution for the industry.

2015: Faster R-CNN

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

As we introduced above, in early 2015, Ross Girshick proposed an improved version of R-CNN called Fast R-CNN by using a shared feature extractor for proposed regions. Just a few months later, Ross and his team came back with another improvement again. This new network Faster R-CNN is not only faster than previous versions but also marks a milestone for object detection with a deep learning method.

With Fast R-CNN, the only non-convolutional piece of the network is the selective search region proposal. As of 2015, researchers started to realize that the deep neural network is so magical, that it can learn anything given enough data. So, is it possible to also train a neural network to proposal regions, instead of relying on heuristic and hand-crafted approach like selective search? Faster R-CNN followed this direction and thinking, and successfully created the Region Proposal Network (RPN). To simply put, RPN is a CNN that takes an image as input and outputs a set of rectangular object proposals, each with an objectiveness score. The paper used VGG originally but other backbone networks such as ResNet become more widespread later. To generate region proposals, a 3x3 sliding window is applied over the CNN feature map output to generate 2 scores (foreground and background) and 4 coordinates each location. In practice, this sliding window is implemented with a 3x3 convolution kernel with a 1x1 convolution kernel.

Although the sliding window has a fixed size, our objects may appear on different scales. Therefore, Faster R-CNN introduced a technique called anchor box. Anchor boxes are pre-defined prior boxes with different aspect ratios and sizes but share the same central location. In Faster R-CNN there are k=9 anchors for each sliding window location, which covers 3 aspect ratios for 3 scales each. These repeated anchor boxes over different scales bring nice translation-invariance and scale-invariance features to the network while sharing outputs of the same feature map. Note that the bounding box regression will be computed from these anchor box instead of the whole image.

So far, we discussed the new Region Proposal Network to replace the old selective search region proposal. To make the final detection, Faster R-CNN uses the same detection head from Fast R-CNN to do classification and fine-grained localization. Do you remember that Fast R-CNN also uses a shared CNN feature extractor? Now that RPN itself is also a feature extraction CNN, we can just share it with detection head like the diagram above. This sharing design doesn’t bring some trouble though. If we train RPN and Fast R-CNN detector together, we will treat RPN proposals as a constant input of ROI pooling, and inevitably ignore the gradients of RPN’s bounding box proposals. One walk around is called alternative training where you train RPN and Fast R-CNN in turns. And later in a paper “Instance-aware semantic segmentation via multi-task network cascades”, we can see that the ROI pooling layer can also be made differentiable w.r.t. the box coordinates proposals.

2015: YOLO v1

You Only Look Once: Unified, Real-Time Object Detection

While the R-CNN series started a big hype over two-stage object detection in the research community, its complicated implementation brought many headaches for engineers who maintain it. Does object detection need to be so cumbersome? If we are willing to sacrifice a bit of accuracy, can we trade for much faster speed? With these questions, Joseph Redmon submitted a network called YOLO to only four days after Faster R-CNN’s submission and finally brought popularity back to one-stage object detection two years after OverFeat’s debut.

Unlike R-CNN, YOLO decided to tackle region proposal and region classification together in the same CNN. In other words, it treats object detection as a regression problem, instead of a classification problem relying on region proposals. The general idea is to split the input into an SxS grid and having each cell directly regress the bounding box location and the confidence score if the object center falls into that cell. Because objects may have different sizes, there will be more than one bounding box regressor per cell. During training, the regressor with the highest IOU will be assigned to compare with the ground-truth label, so regressors at the same location will learn to handle different scales over time. In the meantime, each cell will also predict C class probabilities, conditioned on the grid cell containing an object (high confidence score). This approach is later described as dense predictions because YOLO tried to predict classes and bounding boxes for all possible locations in an image. In contrast, R-CNN relies on region proposals to filter out background regions, hence the final predictions are much more sparse.

SSD: Single Shot MultiBox Detector

YOLO v1 demonstrated the potentials of one-stage detection, but the performance gap from two-stage detection is still noticeable. In YOLO v1, multiple objects could be assigned to the same grid cell. This was a big challenge when detecting small objects, and became a critical problem to solve in order to improve a one-stage detector’s performance to be on par with two-stage detectors. SSD is such a challenger and attacks this problem from three angles.

First, the anchor box technique from Faster R-CNN can alleviate this problem. Objects in the same area usually come with different aspect ratios to be visible. Introducing anchor box not only increased the amount of object to detect for each cell, but also helped the network to better differentiate overlapping small objects with this aspect ratio assumption.

SSD went down on this road further by aggregating multi-scale features before detection. This is a very common approach to pick up fine-grained local features while preserving coarse global features in CNN. For example, FCN, the pioneer of CNN semantic segmentation, also merged features from multiple levels to refine the segmentation boundary. Besides, multi-scale feature aggregation can be easily performed on all common classification networks, so it’s very convenient to swap out the backbone with another network.

Finally, SSD leveraged a large amount of data augmentation, especially targeted to small objects. For example, images are randomly expanded to a much larger size before random cropping, which brings a zoom-out effect to the training data to simulate small objects. Also, large bounding boxes are usually easy to learn. To avoid these easy examples dominating the loss, SSD adopted a hard negative mining technique to pick examples with the highest loss for each anchor box.

Feature Pyramid Networks for Object Detection

With the launch of Faster-RCNN, YOLO, and SSD in 2015, it seems like the general structure an object detector is determined. Researchers start to look at improving each individual parts of these networks. Feature Pyramid Networks is an attempt to improve the detection head by using features from different layers to form a feature pyramid. This feature pyramid idea isn’t very novel in computer vision research. Back then when features are still manually designed, feature pyramid is already a very effective way to recognize patterns at different scales. Using the Feature Pyramid in deep learning is also not a new idea: SSPNet, FCN, and SSD all demonstrated the benefit of aggregating multiple-layer features before classification. However, how to share the feature pyramid between RPN and the region-based detector is still yet to be determined.

First, to rebuild RPN with an FPN structure like the diagram above, we need to have a region proposal running on multiple different scales of feature output. Also, we only need 3 anchors with different aspect ratios per location now because objects with different sizes will be handle by different levels of the feature pyramid. Next, to use an FPN structure in the Fast R-CNN detector, we also need to adapt it to detect on multiple scales of feature maps as well. Since region proposals might have different scales too, we should use them in the corresponding level of FPN as well. In short, if Faster R-CNN is a pair of RPN and region-based detector running on one scale, FPN converts it into multiple parallel branches running on different scales and collects the final results from all branches in the end.

2016: YOLO v2

YOLO9000: Better, Faster, Stronger

While Kaiming He, Ross Girshick, and their team keep improving their two-stage R-CNN detectors, Joseph Redmon, on the other hand, was also busy improving his one-stage YOLO detector. The initial version of YOLO suffers from many shortcomings: predictions based on a coarse grid brought lower localization accuracy, two scale-agnostic regressors per grid cell also made it difficult to recognize small packed objects. Fortunately, we saw too many great innovations in 2015 in many computer vision areas. YOLO v2 just needs to find a way to integrate them all to become better, faster, and stronger. Here are some highlights of the modifications:

Note that YOLO v2 also experimented with a version that’s trained on 9000 classes hierarchical datasets, which also represents an early trial of multi-label classification in an object detector.

2017: RetinaNet

Focal Loss for Dense Object Detection

To understand why one-stage detectors are usually not as good as two-stage detectors, RetinaNet investigated the foreground-background class imbalance issue from a one-stage detector’s dense predictions. Take YOLO as an example, it tried to predict classes and bounding boxes for all possible locations in the meantime, so most of the outputs are matched to negative class during training. SSD addressed this issue by online hard example mining. YOLO used an objectiveness score to implicitly train a foreground classifier in the early stage of training. RetinaNet thinks they both didn’t get the key to the problem, so it invented a new loss function called Focal Loss to help the network learn what’s important.

Focal Loss added a power γ (they call it focusing parameter) to Cross-Entropy loss. Naturally, as the confidence score becomes higher, the loss value will become much lower than a normal Cross-Entropy. The α parameter is used to balance such a focusing effect.

This idea is so simple that even a primary school student can understand. So to further justify their work, they adapted the FPN model they previously proposed and created a new one-stage detector called RetinaNet. It is composed of a ResNet backbone, an FPN detection neck to channel features at different scales, and two subnets for classification and box regression as detection head. Similar to SSD and YOLO v3, RetinaNet uses anchor boxes to cover targets of various scales and aspect ratios.

A bit of a digression, RetinaNet used the COCO accuracy from a ResNeXT-101 and 800 input resolution variant to contrast YOLO v2, which only has a light-weighted Darknet-19 backbone and 448 input resolution. This insincerity shows the team’s emphasis on getting better benchmark results, rather than solving a practical issue like a speed-accuracy trade-off. And it might be part of the reason that RetinaNet didn’t take off after its release.

2018: YOLO v3

YOLOv3: An Incremental Improvement

YOLO v3 is the last version of the official YOLO series. Following YOLO v2’s tradition, YOLO v3 borrowed more ideas from previous research and got an incredible powerful one-stage detector like a monster. YOLO v3 balanced the speed, accuracy, and implementation complexity pretty well. And it got really popular in the industry because of its fast speed and simple components. If you are interested, I wrote a very detailed explanation of how YOLO v3 works in my previous article “ Dive Really Deep into YOLO v3: A Beginner’s Guide ”.

Simply put, YOLO v3’s success comes from its more powerful backbone feature extractor and a RetinaNet-like detection head with an FPN neck. The new backbone network Darknet-53 leveraged ResNet’s skip connections to achieve an accuracy that’s on par with ResNet-50 but much faster. Also, YOLO v3 ditched v2’s pass through layers and fully embraced FPN’s multi-scale predictions design. Since then, YOLO v3 finally reversed people’s impression of its poor performance when dealing with small objects.

Besides, there are a few fun facts about YOLO v3. It dissed the COCO mAP 0.5:0.95 metric, and also demonstrated the uselessness of Focal Loss when using a conditioned dense prediction. The author Joseph even decided to quit the whole computer vision research a year later, because of his concern of military usage.

2019: Objects As Points

Although the image classification area becomes less active recently, object detection research is still far from mature. In 2018, a paper called “CornerNet: Detecting Objects as Paired Keypoints” provided a new perspective for detector training. Since preparing anchor box targets is a quite cumbersome job, is it really necessary to use them as a prior? This new trend of ditching anchor boxes is called “anchor-free” object detection.

Inspired by the use of heat-map in the Hourglass network for human pose estimation, CornerNet uses a heat-map generated by box corners to supervise the bounding box regression. To learn more about how heat-map is used in Hourglass Network, you can read my previous article “ Human Pose Estimation with Stacked Hourglass Network and TensorFlow ”.

Objects As Points, aka CenterNet, took a step further. It uses heat-map peaks to represent object centers, and the network will regress the box width and height directly from these box centers. Essentially, CenterNet is using every pixel as grid cells. With a Gaussian distributed heat-map, the training is also easier to converge compared with previous attempts which tried to regress bounding box size directly.

The elimination of anchor boxes also has another useful side effect. Previously, we rely on IOU ( such as > 0.7) between the anchor box and the ground truth box to assign training targets. By doing so, a few neighboring anchors may get all assigned a positive target for the same object. And the network will learn to predict multiple positive boxes for the same object too. The common way to fix this issue is to use a technique called Non-maximum Suppression (NMS). It’s a greedy algorithm to filter out boxes that are too close together. Now that anchors are gone and we only have one peak per object in the heat-map, there’s no need to use NMS any more. Since NMS is sometimes hard to implement and slow to run, getting rid of NMS is a big benefit for the applications that run in various environments with limited resources.

2019: EfficientDet

EfficientDet: Scalable and Efficient Object Detection

In the recent CVPR’20, EfficientDet showed us some more exciting development in the object detection area. FPN structure has been proved to be a powerful technique to improve the detection network’s performance for objects at different scales. Famous detection networks such as RetinaNet and YOLO v3 all adopted an FPN neck before box regression and classification. Later, NAS-FPN and PANet (please refer to Read More section) both demonstrated that a plain multi-layer FPN structure may benefit from more design optimization. EfficientDet continued exploring in this direction, eventually created a new neck called BiFPN. Basically, BiFPN features additional cross-layer connections to encourage feature aggregation back and forth. To justify the efficiency part of the network, this BiFPN also removed some less useful connections from the original PANet design. Another innovative improvement over the FPN structure is the weight feature fusion. BiFPN added additional learnable weights to feature aggregation so that the network can learn the importance of different branches.

Moreover, just like what we saw in the image classification network EfficientNet, EfficientDet also introduced a principled way to scale an object detection network. The φ parameter in the above formula controls both width (channels) and depth (layers) of both BiFPN neck and detection head.

This new parameter results in 8 different variants of EfficientDet from D0 to D7. A light-weighed D0 variant can achieve similar accuracy with YOLO v3 while having much fewer FLOPs. A heavy-loaded D7 variant with monstrous 1536x1536 input can even reach 53.7 AP on COCO that dwarfed all other contenders.

From R-CNN, YOLO to recent CenterNet and EfficientDet, we have witnessed most major innovations in the object detection research in the deep learning era. Aside from the above papers, I’ve also provided a list of additional papers for you to keep reading to get a deeper understanding. They either provided a different perspective for object detection or extended this area with more powerful features.

Object Detection with Discriminatively Trained Part Based Models

By matching many HOG features for each deformable parts, DPM was one of the most efficient object detection models before the deep learning era. Take pedestrian detection as an example, it uses a star structure to recognize the general person pattern first, and then recognize parts with different sub-filters and calculate an overall score. Even today, the idea to recognize objects with deformable parts is still popular after we switch from HOG features to CNN features.

2012: Selective Search

Selective Search for Object Recognition

Like DPM, Selective Search is also not a product of the deep learning era. However, this method combined so many classical computer vision approaches together, and also used in the early R-CNN detector. The core idea of selective search is inspired by semantic segmentation where pixels are group by similarity. Selective Search uses different criteria of similarity such as color space and SIFT-based texture to iteratively merge similar areas together. And these merged area areas served as foreground predictions and followed by an SVM classifier for object recognition.

2016: R-FCN

R-FCN: Object Detection via Region-based Fully Convolutional Networks

Faster R-CNN finally combined RPN and ROI feature extraction and improved the speed a lot. However, for each region proposal, we still need fully connected layers to compute class and bounding box separately. If we have 300 ROIs, we need to repeat this by 300 hundred times, and this is also the origin of the major speed difference between one-stage and two-stage detector. R-FCN borrowed the idea from FCN for semantic segmentation, but instead of computing the class mask, R-FCN computes a positive sensitive score maps. This map will predict the probability of the appearance of the object at each location, and all locations will vote (average) to decide the final class and bounding box. Besides, R-FCN also used atrous convolution in its ResNet backbone, which is originally proposed in the DeepLab semantic segmentation network. To understand what is atrous convolution, please see my previous article “ Witnessing the Progression in Semantic Segmentation: DeepLab Series from V1 to V3+ ”.

2017: Soft-NMS

Improving Object Detection With One Line of Code

Non-maximum suppression (NMS) is widely used in anchor-based object detection networks to reduce duplicate positive proposals that are close-by. More specifically, NMS iteratively eliminates candidate boxes if they have a high IOU with a more confident candidate box. This could lead to some unexpected behavior when two objects with the same class are indeed very close to each other. Soft-NMS made a small change to only scaling down the confidence score of the overlapped candidate boxes with a parameter. This scaling parameter gives us more control when tuning the localization performance, and also leads to a better precision when a high recall is also needed.

2017: Cascade R-CNN

Cascade R-CNN: Delving into High Quality Object Detection

While FPN exploring how to design a better R-CNN neck to use backbone features Cascade R-CNN investigated a redesign of R-CNN classification and regression head. The underlying assumption is simple yet insightful: the higher IOU criteria we use when preparing positive targets, the less false positive predictions the network will learn to make. However, we can’t simply increase such IOU threshold from commonly used 0.5 to more aggressive 0.7, because it could also lead to more overwhelming negative examples during training. Cascade R-CNN’s solution is to chain multiple detection head together, each will rely on the bounding box proposals from the previous detection head. Only the first detection head will use the original RPN proposals. This effectively simulated an increasing IOU threshold for latter heads.

2017: Mask R-CNN

Mask R-CNN is not a typical object detection network. It was designed to solve a challenging instance segmentation task, i.e, creating a mask for each object in the scene. However, Mask R-CNN showed a great extension to the Faster R-CNN framework, and also in turn inspired object detection research. The main idea is to add a binary mask prediction branch after ROI pooling along with the existing bounding box and classification branches. Besides, to address the quantization error from the original ROI Pooling layer, Mask R-CNN also proposed a new ROI Align layer that uses bilinear image resampling under the hood. Unsurprisingly, both multi-task training (segmentation + detection) and the new ROI Align layer contribute to some improvement over the bounding box benchmark.

2018: PANet

Path Aggregation Network for Instance Segmentation

Instance segmentation has a close relationship with object detection, so often a new instance segmentation network could also benefit object detection research indirectly. PANet aims at boosting information flow in the FPN neck of Mask R-CNN by adding an additional bottom-up path after the original top-down path. To visualize this change, we have a ↑↓ structure in the original FPN neck, and PANet makes it more like a ↑↓↑ structure before pooling features from multiple layers. Also, instead of having separate pooling for each feature layer, PANet added an “adaptive feature pooling” layer after Mask R-CNN’s ROIAlign to merge (element-wise max of sum) multi-scale features.

2019: NAS-FPN

NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection

PANet’s success in adapting FPN structure drew attention from a group of NAS researchers. They used a similar reinforcement learning method from the image classification network NASNet and focused on searching the best combination of merging cells. Here, a merging cell is the basic build block of an FPN that merges any two input features layers into one output feature layer. The final results proved the idea that FPN could use further optimization, but the complex computer-searched structure made it too difficult for humans to understand.

Object detection is still an active research area. Although the general landscape of this field is well shaped by a two-stage detector like R-CNN and one-stage detector such as YOLO, our best detector is still far from saturating the benchmark metrics, and also misses many targets in complicated background. At the same time, Anchor-free detector like CenterNet showed us a promising future where object detection networks can become as simple as image classification networks. Other directions of object detection, such as few-shot recognition and NAS, are still at an early age, and we will see how it goes in the next few years. Nevertheless, as object detection technology becomes more mature, we need to be very cautious about its adoption by the military and police. A dystopia where Terminators hunt and shoot humans with a YOLO detector is the last thing we want to see in our life.

Originally published at on Aug 9, 2020

More from Towards Data Science

Your home for data science. A Medium publication sharing concepts, ideas and codes.

About Help Terms Privacy

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store

Ethan Yanjia Li

Can machines perceive the world as human do?

Text to speech

Skip to Main Content

IEEE Account

Purchase Details

Profile Information

A not-for-profit organization, IEEE is the world's largest technical professional organization dedicated to advancing technology for the benefit of humanity. © Copyright 2023 IEEE - All rights reserved. Use of this web site signifies your agreement to the terms and conditions.


  1. (PDF) Object Detection using Image Processing

    research papers on object detection

  2. Basic Object Detection Model.

    research papers on object detection

  3. Survey of Object Detection

    research papers on object detection

  4. Illustration of the object detection task and terms used in this paper....

    research papers on object detection

  5. (PDF) Object Detection

    research papers on object detection

  6. (PDF) Continual Universal Object Detection

    research papers on object detection


  1. Real Time Object Detection and Tracking Using Deep Learning and OpenCV

  2. Object Detection on the Web

  3. Object Detection


  5. Simple object detection and classification

  6. IGCSE Computer Science: chapter 7 Algorithm Design & Problem-Solving:Life Cycle by Sir Minhaj Akhtar


  1. Object Detection

    Object detection is the task of detecting instances of objects of a certain class within an image. The state-of-the-art methods can be categorized into two

  2. Object Detection with Deep Learning: A Review

    THIS PAPER HAS BEEN ACCEPTED BY IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS ... researchers have already tried to model object detection as.

  3. A review of research on object detection based on deep learning

    In this paper, the representative algorithms of each stage are introduced in detail.Then the public and special datasets commonly used in target detection are

  4. 5 AI/ML Research Papers on Object Detection You Must Read

    We then demonstrate the effectiveness of MobileNets across a wide range of applications and use cases including object detection, finegrain classification, face

  5. Object Detection: Current and Future Directions

    Object detection is a key ability required by most computer and robot vision systems. The latest research on this area has been making great progress in

  6. Application of Deep Learning for Object Detection

    Object detection is one of these domains witnessing great success in computer vision. This paper demystifies the role of deep learning techniques based on


    The first and foremost step in visual surveillance is identifying moving objects in a video sequence. The moving object of interest may be human

  8. Object detection in real time based on improved single shot multi

    It is the basis of complex vision tasks such as target tracking and scene understanding and is widely used in wireless networks. The task of

  9. 12 Papers You Should Read to Understand Object Detection in the

    However, this region-based approach eventually led to a big wave of object detection research with its two-stage framework, i.e, region proposal

  10. The Object Detection Based on Deep Learning

    The paper first makes an introduction of the classical methods in object detection, and expounds the relation and difference between the classical methods