Vision systems with novel deep network methods

After a Friday break, forced by personal matters I couldn’t avoid, The Information Age is back today with another review. We continue with convolutional deep neural networks, this time applied to vision systems for robotics.

Vision systems are one of the greatest challenges in robotics. The core problem is how a machine recognizes all the relevant features amid the noise of a real image. The vision system must discern those features in order to, for instance, judge the proper force to apply when reaching for and grabbing an object from a shelf, or when opening a door, and this is a challenge for artificial neural networks. Beyond noise, there’s the issue of how to provide the system with proper training data in a way that is autonomous for the robot, i.e., automatic and in real time.

Those were the challenges that the authors of today’s paper – from Princeton University, MIT, Google and AutoX – set out to address:

Multi-view Self-supervised Deep Learning for 6D Pose Estimation in the Amazon Picking Challenge


Robot warehouse automation has attracted significant interest in recent years, perhaps most visibly in the Amazon Picking Challenge (APC). A fully autonomous warehouse pick-and-place system requires robust vision that reliably recognizes and locates objects amid cluttered environments, self-occlusions, sensor noise, and a large variety of objects. In this paper we present an approach that leverages multi-view RGB-D data and self-supervised, data-driven learning to overcome those difficulties. The approach was part of the MIT-Princeton Team system that took 3rd- and 4th- place in the stowing and picking tasks, respectively at APC 2016. In the proposed approach, we segment and label multiple views of a scene with a fully convolutional neural network, and then fit pre-scanned 3D object models to the resulting segmentation to get the 6D object pose. Training a deep neural network for segmentation typically requires a large amount of training data. We propose a self-supervised method to generate a large labeled dataset without tedious manual segmentation. We demonstrate that our system can reliably estimate the 6D pose of objects under a variety of scenarios. All code, data, and benchmarks are available at this http URL
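For readers new to the term, a "6D pose" is three translation parameters plus three rotation parameters that place an object model in the scene. The sketch below is my own illustration, not code from the paper: the function names `pose_matrix` and `place_model` are hypothetical, and the roll/pitch/yaw rotation order is just one common convention that I assume here.

```python
import math


def pose_matrix(x, y, z, roll, pitch, yaw):
    """Build a 4x4 transform from 3 translations (m) and 3 rotations (rad).

    Assumed convention: R = Rz(yaw) * Ry(pitch) * Rx(roll).
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr, x],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr, y],
        [-sp, cp * sr, cp * cr, z],
        [0.0, 0.0, 0.0, 1.0],
    ]


def place_model(pose, model_points):
    """Move model points (object frame) into the scene using the 6D pose."""
    return [
        tuple(
            sum(pose[r][c] * p[c] for c in range(3)) + pose[r][3]
            for r in range(3)
        )
        for p in model_points
    ]


# One model point 10 cm along the object's x axis; the object is rotated
# 90 degrees about z and shifted by (0.2, 0.0, 0.1) meters.
pose = pose_matrix(0.2, 0.0, 0.1, 0.0, 0.0, math.pi / 2)
placed = place_model(pose, [(0.1, 0.0, 0.0)])
print([tuple(round(v, 3) for v in p) for p in placed])
# [(0.2, 0.1, 0.1)]
```

Estimating the pose is the inverse problem: finding the six numbers that make the placed model best match the segmented sensor data.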

The paper lists the main challenges that a vision system built around a convnet has to overcome:

  • Cluttered environments: shelves and totes may have multiple objects and could be arranged as to deceive vision algorithms (e.g., objects on top of one another).
  • Self-occlusion: due to limited camera positions, the system only sees a partial view of an object.
  • Missing data: commercial depth sensors are unreliable at capturing reflective, transparent, or meshed surfaces, all common in product packaging.
  • Small or deformable objects: small objects provide fewer data points, while deformable objects are difficult to align to prior models.
  • Speed: the total time dedicated to capturing and processing visual information is under 20 seconds.

Another interesting methodological contribution is the way the team supplies the system with large training datasets automatically, through what they call a self-supervised method that trains deep networks by automatically labeling training data:

Training a deep neural network for segmentation requires a large amount of labeled training data. We have developed a self-supervised training procedure that automatically generated 130,000 images with pixel-wise category labels of the 39 objects in the APC. For evaluation, we constructed a testing dataset of over 7,000 manually-labeled images. In summary, this paper contributes with:

  • A robust multi-view vision system to estimate the 6D pose of objects;
  • A self-supervised method that trains deep networks by automatically labeling training data;
  • A benchmark dataset for estimating object poses. All code, data, and benchmarks are publicly available.

The whole setup was prepared for the robots to compete in the Amazon Picking Challenge 2016 (APC):

The APC 2016 posed a simplified version of the general picking and stowing tasks in a warehouse. In the picking task, robots sit within a 2×2 meter area in front of a shelf populated with objects, and autonomously pick 12 desired items and place them in a tote. In the stowing task, robots pick all 12 items inside a tote and place them in a pre-populated shelf. Before the competition, teams were provided with a list of 39 possible objects along with 3D CAD models of the shelf and tote. At run-time, robots were provided with the initial contents of each bin on the shelf and a work-order containing which items to pick. After picking and stowing the appropriate objects, the system had to report the final contents of both shelf and tote.

The Self-Supervised Deep Network Training Data

Gathering enough data of sufficient quality, without adding excessive overhead or cost, is a well-known challenge for convolutional deep neural networks. The authors approached it with a self-supervised scheme, which they motivate as follows:

By bringing deep learning into the approach we gain robustness. This, however, comes at the expense of amassing quality training data, which is necessary to learn high-capacity models with many parameters. Gathering and manually labeling such large amounts of training data is expensive. The existing large-scale datasets used by deep learning (e.g. ImageNet [20]) are mostly Internet photos, which have very different object and image statistics from our warehouse setting.

To automatically capture and pixel-wise label images, we propose a self-supervised method, based on three observations:

  • Batch-training on scenes with a single object can yield deep models that perform well on scenes with multiple objects [17] (i.e., simultaneous training on cat-only or dog-only images enables successful testing on cat-with-dog images);
  • An accurate robot arm and accurate camera calibration give us at-will control over camera viewpoint;
  • For single object scenes, with known background and known camera viewpoint, we can automatically obtain precise segmentation labels by foreground masking.
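The third observation is easy to picture with a small sketch. It assumes two depth images taken from the same viewpoint, one of the empty bin and one with a single known object inside; pixels whose depth changed by more than a threshold get the object's class label. This is my own simplified illustration, not the authors' labeling code: the function name, the "0 means missing depth" convention and the 2 cm threshold are all assumptions, and plain nested lists stand in for real image arrays.

```python
def label_single_object(scene_depth, background_depth, class_id, threshold=0.02):
    """Return a per-pixel label map: class_id on the object, 0 elsewhere.

    scene_depth and background_depth are equally sized 2D lists of depths
    in meters, captured from the same camera viewpoint. A depth of 0 marks
    missing sensor data (assumed convention); such pixels stay background.
    """
    labels = []
    for scene_row, bg_row in zip(scene_depth, background_depth):
        row = []
        for d_scene, d_bg in zip(scene_row, bg_row):
            valid = d_scene > 0 and d_bg > 0
            is_foreground = valid and abs(d_scene - d_bg) > threshold
            row.append(class_id if is_foreground else 0)
        labels.append(row)
    return labels


# Empty bin: flat back wall 1 m from the camera.
background = [[1.00, 1.00, 1.00],
              [1.00, 1.00, 1.00]]
# Same viewpoint with one object in the middle column.
scene = [[1.00, 0.80, 0.00],   # 0.00: the sensor returned no depth here
         [1.01, 0.82, 1.00]]   # 1.01: sensor noise below the threshold

print(label_single_object(scene, background, class_id=7))
# [[0, 7, 0], [0, 7, 0]]
```

Because the object identity is known for each single-object scene, the mask doubles as a pixel-wise class label with no manual annotation.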

The captured training dataset contains 136,575 RGB-D images of 39 objects, all automatically labeled.

But reaching the volume of data needed for good results still required some human intervention:

Semi-automatic data gathering. To semi-autonomously gather large quantities of training data, we place single known objects inside the shelf bins or tote in arbitrary poses, and configure the robot to move the camera and capture RGB-D images of the objects from a variety of different viewpoints. The position of the shelf/tote is known to the robot, as is the camera viewpoint, which we use to transform the collected RGB-D images in shelf or tote frame. After capturing several hundred RGB-D images, the objects are manually re-arranged to different poses, and the process is repeated several times. Human involvement sums up to re-arranging the objects and labeling which objects correspond to which bin/tote. Selecting and changing the viewpoint, capturing sensor data, and labeling each image by object is automated. We finally collect RGB-D images of the empty shelf and tote from the same exact camera viewpoints to model the background, in preparation for the automatic data labeling.
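The change of frame mentioned in that passage can be sketched as follows. I assume each camera viewpoint is available as a 4x4 homogeneous matrix that maps camera coordinates to shelf coordinates; the helper names `transform_point` and `fuse_views` and the example poses are hypothetical, chosen only to illustrate the idea.

```python
def transform_point(pose, point):
    """Apply a 4x4 homogeneous transform (nested lists) to a 3D point."""
    x, y, z = point
    return tuple(
        pose[r][0] * x + pose[r][1] * y + pose[r][2] * z + pose[r][3]
        for r in range(3)
    )


def fuse_views(views):
    """Merge per-view points into one cloud expressed in the shelf frame.

    views: list of (camera_to_shelf_pose, points_in_camera_frame) pairs.
    """
    cloud = []
    for pose, points in views:
        cloud.extend(transform_point(pose, p) for p in points)
    return cloud


# View A: camera axes aligned with the shelf, offset 0.5 m along shelf x.
pose_a = [[1, 0, 0, 0.5],
          [0, 1, 0, 0.0],
          [0, 0, 1, 0.0],
          [0, 0, 0, 1.0]]
# View B: camera rotated 90 degrees about z (camera x maps to shelf y).
pose_b = [[0, -1, 0, 0.0],
          [1, 0, 0, 0.0],
          [0, 0, 1, 0.0],
          [0, 0, 0, 1.0]]

cloud = fuse_views([(pose_a, [(0.0, 0.0, 1.0)]),
                    (pose_b, [(1.0, 0.0, 1.0)])])
print(cloud)
```

Once every view is expressed in the same shelf or tote frame, images of the empty shelf taken from the same viewpoints line up with the object images, which is what makes the background comparison possible.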



The full picture of the challenge tackled in this paper is best seen in its conclusion, which covers pose estimation, image segmentation, manipulation strategies for robotic gripping and the self-supervised deep network framework:

Designing robotic and vision systems hand-in-hand. Vision algorithms are too often designed in isolation. However, vision is one component of a larger robotic system with needs and opportunities. Typical computer vision algorithms operate on single images for segmentation and recognition. Robotic arms free us from that constraint, allowing us to precisely fuse multiple views and improve performance in cluttered environments. Computer vision systems also tend to have fixed outputs (e.g., bounding boxes or 2D segmentation maps), but robotic systems with multiple manipulation strategies can benefit from variety in output. For example, suction cups and grippers might have different perceptual requirements. While the former might work more robustly with a segmented point cloud, the latter often requires knowledge of the object pose and geometry.

To address the challenges posed by the warehouse setting, our framework leverages multi-view RGB-D data and data-driven, self-supervised deep learning to reliably estimate the 6D poses of objects under a variety of scenarios. We also provide a well-labeled benchmark dataset of APC 2016 containing over 7,000 images from 477 scenes.


The references and related research are worth a further look.


Featured Image: Andy Zeng – Princeton Vision & Robotics

