Failing to Learn: Autonomously Identifying Perception Failures for Self-driving Cars

Manikandasriram Srinivasan Ramanagopal, Cyrus Anderson, Ram Vasudevan, Matthew Johnson-Roberson


One of the major open challenges for self-driving cars is the ability to detect cars and pedestrians in order to safely navigate the world. Deep learning-based object detectors have enabled great advances in using camera imagery to detect and classify objects. But for a safety-critical application such as autonomous driving, the error rates of the current state of the art are still too high to enable safe operation. Moreover, the characterization of object detector performance is primarily limited to testing on prerecorded datasets; errors that occur on novel data go undetected without additional human labels. In this paper, we propose an automated method to identify mistakes made by object detectors without ground truth labels. We show that inconsistencies in object detector output between a pair of similar images can be used as hypotheses for false negatives (i.e., missed detections), and that, using a novel set of features for each hypothesis, an off-the-shelf binary classifier can be used to find valid errors. In particular, we study two distinct cues, temporal and stereo inconsistencies, using data that is readily available on most autonomous vehicles. Our method can be used with any camera-based object detector, and we illustrate the technique on several sets of real-world data. We show that a state-of-the-art detector, tracker, and our classifier trained only on synthetic data can identify valid errors on the KITTI tracking dataset with an Average Precision of 0.88. We also release a new tracking dataset with over 100 sequences totaling more than 80,000 labeled pairs of stereo images, along with ground truth disparity from a game engine, to facilitate further research.
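To make the temporal-inconsistency cue concrete, the following is a minimal sketch (not the paper's implementation; the function names and IoU threshold are our own illustrative choices) of how false-negative hypotheses can be generated: any tracked box with no sufficiently overlapping detection in the current frame becomes a hypothesis.

```python
# Illustrative sketch of false-negative hypothesis generation from
# detector/tracker disagreement. Boxes are (x1, y1, x2, y2) tuples;
# the 0.5 IoU threshold is an assumed, typical value.

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def false_negative_hypotheses(tracked_boxes, detections, iou_thresh=0.5):
    """Tracked boxes unmatched by every detection are FN hypotheses."""
    return [t for t in tracked_boxes
            if all(iou(t, d) < iou_thresh for d in detections)]
```

In the paper's pipeline, each such hypothesis would then be scored by the binary classifier rather than accepted outright.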


Preprint on arXiv.


Note: code for the revised version of the paper will be released soon.

A beta version of the code is available on GitHub.


Note: this data may only be used for non-commercial applications.

Data is provided in the KITTI tracking format. The data was gathered at 10 Hz, at different times of day, from a game engine. There are 104 sequences of varying lengths, totaling 80,655 images.
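For convenience, here is a minimal parser for one line of a KITTI-tracking-format label file, following the field layout of the KITTI devkit (the helper name is ours, not part of this release):

```python
# Parse a single line of a KITTI tracking label file into a dict.
# Field order: frame, track_id, type, truncated, occluded, alpha,
# bbox (4), dimensions (3), location (3), rotation_y.

def parse_kitti_tracking_line(line):
    f = line.split()
    return {
        "frame": int(f[0]),
        "track_id": int(f[1]),
        "type": f[2],                               # e.g. "Car", "Pedestrian"
        "truncated": float(f[3]),
        "occluded": int(f[4]),
        "alpha": float(f[5]),                       # observation angle
        "bbox": tuple(map(float, f[6:10])),         # left, top, right, bottom (px)
        "dimensions": tuple(map(float, f[10:13])),  # height, width, length (m)
        "location": tuple(map(float, f[13:16])),    # x, y, z in camera coords (m)
        "rotation_y": float(f[16]),
    }
```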

Sequences and annotations (22.5GB)

We generated corresponding synthetic right-camera images using the depth buffer information and performed a simple inpainting operation with OpenCV to fill the holes. The corresponding ground-truth disparity images are also provided below.

Right Camera Images (54.5GB)

Disparity Images (4.8GB)