What is Amodal Panoptic Segmentation?


To navigate through their daily lives, humans rely on their ability to perceive the complete physical structure of objects even when they are only partially visible. This ability, known as amodal perception, serves as the link that connects our perception of the world to its cognitive understanding. However, unlike humans, robots are limited to modal perception, which restricts their ability to emulate the visual experience that humans have. In this work, we bridge this gap by proposing the amodal panoptic segmentation task.

Any given scene can broadly be categorized into two components: stuff and thing. Regions that are amorphous or uncountable belong to stuff classes (e.g., sky, road, sidewalk, etc.), and the countable objects of the scene belong to thing classes (e.g., cars, trucks, pedestrians, etc.). The amodal panoptic segmentation task aims to concurrently predict the pixel-wise semantic segmentation labels of visible regions of stuff classes, and instance segmentation labels of both the visible and occluded regions of thing classes. We believe this task is the ultimate frontier of visual recognition and will immensely benefit the robotics community. For example, in automated driving, perceiving the whole structure of traffic participants at all times, irrespective of partial occlusions, will minimize the risk of accidents. Moreover, by inferring the relative depth ordering of objects in a scene, robots can make complex decisions such as in which direction to move relative to the object of interest to obtain a clearer view without additional sensor feedback.

Amodal panoptic segmentation is substantially more challenging as it entails all the challenges of its modal counterpart (scale variations, illumination changes, cluttered backgrounds, etc.) while simultaneously requiring more complex occlusion reasoning. This becomes even harder for non-rigid classes such as pedestrians. These challenges are also reflected in the groundtruth annotation effort that the task necessitates. In essence, this task requires an approach to fully grasp the structure of objects and how they interact with other objects in the scene, in order to segment occluded regions even in cases that seem ambiguous.


APSNet Architecture

Network architecture
Figure: (a) Illustration of our proposed APSNet architecture, consisting of a shared backbone and parallel semantic and amodal instance segmentation heads, followed by a fusion module that fuses the outputs of both heads to yield the amodal panoptic segmentation output. (b) and (c) present the topologies of our proposed amodal instance segmentation head and semantic segmentation head, respectively.

APSNet follows the top-down approach. It consists of a shared backbone, comprising an encoder and a 2-way Feature Pyramid Network (FPN), followed by the semantic segmentation head and the amodal instance segmentation head. We employ the RegNet architecture as the encoder (depicted in red). It is built from a standard residual bottleneck block with group convolutions, repeated within each of its five stages, and it has fewer parameters than comparable encoders while offering higher representational capacity. After the 2-way FPN, our network splits into two parallel branches. The first branch consists of the Region Proposal Network (RPN) and ROI align layers that take the 2-way FPN output as input; the extracted ROI features are propagated to the amodal instance segmentation head. The second branch consists of the semantic segmentation head, which is connected to the fourth stage of the encoder.
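
As a rough illustration of this dataflow, the following PyTorch-style sketch wires the two branches together. The module names (encoder, two_way_fpn, rpn, roi_align, amodal_instance_head, semantic_head) and their signatures are hypothetical placeholders, not the released APSNet implementation.

# Minimal PyTorch-style sketch of the APSNet forward pass described above.
# All module names and signatures are illustrative placeholders.
import torch.nn as nn

class APSNetSketch(nn.Module):
    def __init__(self, encoder, two_way_fpn, rpn, roi_align,
                 amodal_instance_head, semantic_head):
        super().__init__()
        self.encoder = encoder              # RegNet-style encoder (5 stages)
        self.two_way_fpn = two_way_fpn      # 2-way Feature Pyramid Network
        self.rpn = rpn                      # Region Proposal Network
        self.roi_align = roi_align          # ROI align over the FPN levels
        self.amodal_instance_head = amodal_instance_head
        self.semantic_head = semantic_head  # connected to encoder stage 4

    def forward(self, images):
        # Shared backbone: multi-scale encoder features + 2-way FPN features.
        enc_feats = self.encoder(images)          # assumed dict: stage1 ... stage5
        fpn_feats = self.two_way_fpn(enc_feats)   # multi-scale pyramid

        # Branch 1: proposals -> ROI features -> amodal instance segmentation.
        proposals = self.rpn(fpn_feats)
        roi_feats = self.roi_align(fpn_feats, proposals)
        amodal_instances = self.amodal_instance_head(roi_feats)

        # Branch 2: semantic segmentation from stage-4 features, fused with FPN features.
        semantic_logits = self.semantic_head(enc_feats["stage4"], fpn_feats)

        # A fusion module (not shown) merges both outputs into the
        # final amodal panoptic segmentation.
        return semantic_logits, amodal_instances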


Our proposed amodal instance segmentation head comprises three parts, each focusing on one of the critical requirements for amodal reasoning. First, the visible mask head learns to predict the visible region of the target object in a class-specific manner. Simultaneously, an occluder head class-agnostically predicts the regions that occlude the target object. In other words, for a given proposal, the visible mask head learns to segment the target object (the background with respect to its occluders), while the occluder head learns to segment the occluding foreground objects. The occluder head provides an initial global estimate of where the occluded region of the target object lies. With the features from both the visible and occluder mask heads, the amodal instance segmentation head can reason about the presence of the occluded region as well as its shape. This is achieved by employing an occlusion mask head that predicts the occluded region of the target object given the visible and occluder features. Subsequently, the concatenated visible, occluder, and occlusion mask head features are further processed by a series of convolutions followed by a spatio-channel attention block, which together model the inherent relationship between the visible, occluder, and occlusion features. The amodal mask head then predicts the final amodal mask for the target object. Additionally, the visible mask is refined by a second visible mask head that takes the concatenated amodal and visible features as input to predict the final inmodal mask.
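
The wiring of these mask heads can be sketched as follows. All sub-modules (visible_head, occluder_head, occlusion_head, fuse_convs, sc_attention, amodal_head, visible_refine_head) are hypothetical placeholders that only illustrate the dataflow described above, not the actual layer configurations.

# Simplified sketch of the amodal instance segmentation head wiring.
# Sub-module internals (mask heads, spatio-channel attention) are placeholders.
import torch
import torch.nn as nn

class AmodalInstanceHeadSketch(nn.Module):
    def __init__(self, visible_head, occluder_head, occlusion_head,
                 fuse_convs, sc_attention, amodal_head, visible_refine_head):
        super().__init__()
        self.visible_head = visible_head            # class-specific visible mask
        self.occluder_head = occluder_head          # class-agnostic occluder mask
        self.occlusion_head = occlusion_head        # occluded region of the target
        self.fuse_convs = fuse_convs                # convs over concatenated features
        self.sc_attention = sc_attention            # spatio-channel attention block
        self.amodal_head = amodal_head              # final amodal mask
        self.visible_refine_head = visible_refine_head  # refined visible (inmodal) mask

    def forward(self, roi_feats):
        # Each placeholder head is assumed to return (features, mask logits).
        vis_feats, vis_mask = self.visible_head(roi_feats)
        occ_feats, occluder_mask = self.occluder_head(roi_feats)

        # The occlusion head reasons over the visible and occluder features.
        occl_feats, occlusion_mask = self.occlusion_head(
            torch.cat([vis_feats, occ_feats], dim=1))

        # Fuse all three feature streams and apply spatio-channel attention.
        fused = self.fuse_convs(torch.cat([vis_feats, occ_feats, occl_feats], dim=1))
        attended = self.sc_attention(fused)

        amodal_feats, amodal_mask = self.amodal_head(attended)

        # Refine the visible mask using the concatenated amodal and visible features.
        refined_visible_mask = self.visible_refine_head(
            torch.cat([amodal_feats, vis_feats], dim=1))

        return amodal_mask, refined_visible_mask, occluder_mask, occlusion_mask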


The semantic head takes the x16 downsampled feature maps from stage 4 of the RegNet encoder as input. We employ a block identical to stage 5 of RegNet, with the dilation factor of its 3x3 convolutions set to 2, which we refer to as the dilated RegNet block. Subsequently, we employ a DPC module to process the output of the dilated block. We then upsample the output to the x8 and x4 downsampling factors using bilinear interpolation. After each upsampling stage, we concatenate the output with the corresponding features from the 2-way FPN at the same resolution and employ two 3x3 depth-wise separable convolutions to fuse the concatenated features. Finally, we use a 1x1 convolution to reduce the number of output channels to the number of semantic classes, followed by bilinear interpolation to upsample the output to the input image resolution.
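
The dataflow of the semantic head can be summarized with the following sketch. The dilated_block and dpc modules are left abstract, and the assumed dictionary keys ("x8", "x4") for the 2-way FPN outputs, as well as the channel widths, are illustrative assumptions rather than the actual configuration.

# Sketch of the semantic head dataflow: dilated RegNet block -> DPC ->
# progressive upsampling with 2-way FPN fusion -> per-pixel class logits.
# dilated_block and dpc are placeholders for the modules described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

def separable_conv(in_ch, out_ch):
    # 3x3 depth-wise separable convolution used in the fusion steps.
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch),
        nn.Conv2d(in_ch, out_ch, 1),
        nn.ReLU(inplace=True),
    )

class SemanticHeadSketch(nn.Module):
    def __init__(self, dilated_block, dpc, fpn_channels, num_classes, channels=256):
        super().__init__()
        self.dilated_block = dilated_block  # stage-5-style RegNet block, dilation 2
        self.dpc = dpc                      # Dense Prediction Cell module
        self.fuse_x8 = nn.Sequential(
            separable_conv(channels + fpn_channels, channels),
            separable_conv(channels, channels))
        self.fuse_x4 = nn.Sequential(
            separable_conv(channels + fpn_channels, channels),
            separable_conv(channels, channels))
        self.classifier = nn.Conv2d(channels, num_classes, 1)

    def forward(self, stage4_feats, fpn_feats, image_size):
        x = self.dpc(self.dilated_block(stage4_feats))        # x16 resolution

        # Upsample to x8, fuse with the x8 FPN features.
        x = F.interpolate(x, size=fpn_feats["x8"].shape[-2:],
                          mode="bilinear", align_corners=False)
        x = self.fuse_x8(torch.cat([x, fpn_feats["x8"]], dim=1))

        # Upsample to x4, fuse with the x4 FPN features.
        x = F.interpolate(x, size=fpn_feats["x4"].shape[-2:],
                          mode="bilinear", align_corners=False)
        x = self.fuse_x4(torch.cat([x, fpn_feats["x4"]], dim=1))

        # Per-pixel class logits, upsampled to the input image resolution.
        logits = self.classifier(x)
        return F.interpolate(logits, size=image_size,
                             mode="bilinear", align_corners=False)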

KITTI-360-APS Dataset


We extend the KITTI-360 dataset, which provides semantic and instance labels, with amodal panoptic annotations, and name it the KITTI-360-APS dataset. It consists of nine sequences of urban street scenes with annotations for 61,168 images at a resolution of 1408x376 pixels. Our dataset comprises 10 stuff classes. We define a class as stuff if it has amorphous regions or is incapable of movement at any point in time. Road, sidewalk, building, wall, fence, pole, traffic sign, vegetation, terrain, and sky are the stuff classes. Further, the dataset consists of 7 thing classes, namely car, pedestrian, cyclist, two-wheeler, van, truck, and other vehicles. Please note that we merge the bicycle and motorcycle classes into a single two-wheeler class.
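
For reference, the class split described above can be written as a simple Python mapping. The dictionary layout and names are illustrative and do not reflect the dataset's official label format.

# Class split of the KITTI-360-APS dataset as a plain Python mapping;
# illustrative only, not the dataset's official label definition.
KITTI_360_APS_CLASSES = {
    "stuff": ["road", "sidewalk", "building", "wall", "fence", "pole",
              "traffic sign", "vegetation", "terrain", "sky"],
    "thing": ["car", "pedestrian", "cyclist", "two-wheeler", "van",
              "truck", "other vehicles"],
}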

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the KITTI-360-APS dataset, please consider citing the paper mentioned in the Publications section.

BDD100K-APS Dataset


Our BDD100K-APS dataset extends the Berkeley DeepDrive (BDD100K) instance segmentation dataset with amodal instance and stuff semantic segmentation groundtruth labels. We provide amodal panoptic annotations for 10 stuff classes and 6 thing classes. The stuff classes include road, sidewalk, building, fence, pole, traffic sign, terrain, vegetation, and sky, while pedestrian, car, truck, rider, bicycle, and bus are the thing classes.

License Agreement

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the BDD100K-APS dataset, please consider citing the paper mentioned in the Publications section.

Videos

Coming soon!

Code and Models

To be added

Publications

Rohit Mohan and Abhinav Valada
Amodal Panoptic Segmentation
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

(PDF) (BibTeX)


People