Locating individual objects in an image or video requires segmentation: deciding, for each pixel, whether it belongs to an object of interest or to the background.
Say, for example, we’re looking at an image of traffic and we want to count vehicles.
The ideal results of segmenting individual vehicles in the above image would be something like this:
Dotted lines represent inferred outlines. The important attribute is that overlapping (occluding) vehicles are outlined separately rather than being lumped together.
*LIDAR, stereo vision, and other 3D-sensing technologies might help with segmentation, but I’m focusing on cheap camera hardware for this application.
So, how do we design an algorithm to transform the inferior background-subtraction-based blobs/outlines into the ideal, vehicle-separating outlines?
One requirement is that our algorithm must have prior knowledge of vehicle appearances. Segmenting a single image is probably impossible without prior exposure to many training examples of vehicles; otherwise, there would be no basis for inferring the outlines of occluded objects.
On the other hand, it’s obviously not possible to store examples of every vehicle variation in existence. Some middle ground must be found: a representative basis set, learned from training data, that represents actual objects fairly accurately without overburdening computational resources.
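One concrete way to learn such a basis set is principal component analysis over aligned image crops (the classic "eigenfaces" idea applied to vehicles). This is a minimal sketch with synthetic data standing in for real vehicle crops; the sizes and component count are arbitrary assumptions:

```python
import numpy as np

# Synthetic stand-in for a training set: 200 flattened 16x16 "vehicle"
# crops generated from 5 underlying prototype shapes plus noise.
rng = np.random.default_rng(0)
prototypes = rng.random((5, 256))
weights = rng.random((200, 5))
training = weights @ prototypes + rng.normal(0, 0.01, (200, 256))

# Learn a compact basis via SVD of the mean-centered data (PCA):
mean = training.mean(axis=0)
U, S, Vt = np.linalg.svd(training - mean, full_matrices=False)
basis = Vt[:10]  # keep only the top 10 components

# Any crop is now ~10 coefficients instead of 256 raw pixels:
coeffs = (training[0] - mean) @ basis.T
reconstruction = mean + coeffs @ basis
print(np.abs(reconstruction - training[0]).max())  # small residual
```

Ten numbers per object is the kind of "representative but cheap" trade-off described above: expressive enough to reconstruct the examples closely, small enough to search quickly.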
So, assuming we have some basis set with which to represent the objects being segmented, the next question is how we deconstruct the image into elements generated from the basis set.
For now, let’s ignore the case where background subtraction splits a single object into multiple blobs. We’re considering only the cases where each blob is either a) a single vehicle or b) some small number of vehicles.
How do we identify the number of vehicles in each blob? One approach is a direct neural-network classifier, trained on hand-classified examples of one, two, three (etc.) vehicles. This emulates the cognitive process known as subitizing: judging a small count at a glance rather than counting items one by one. The resulting per-blob object count tells us how many elements to draw from our object basis set.
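As a sketch of the subitizing classifier, here is a small multilayer perceptron (scikit-learn's `MLPClassifier`) trained on synthetic, hand-made per-blob features. The features (blob area, bounding-box width/height) and their scaling with vehicle count are illustrative assumptions; a real system would more likely train a convolutional network directly on blob image crops:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

def make_blob_features(count, n):
    # Hypothetical per-blob features: area and bounding-box size scale
    # roughly with the number of vehicles merged into the blob.
    area = count * 400 + rng.normal(0, 20, n)
    width = count * 30 + rng.normal(0, 3, n)
    height = 20 + rng.normal(0, 2, n)
    return np.column_stack([area, width, height])

# "Hand-classified" examples of one-, two-, and three-vehicle blobs.
X = np.vstack([make_blob_features(k, 100) for k in (1, 2, 3)])
y = np.repeat([1, 2, 3], 100)

clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(16,),
                                  max_iter=2000, random_state=0))
clf.fit(X, y)
print(clf.predict(make_blob_features(2, 1)))  # classify a new blob's count
```

The classifier's output is exactly the per-blob count the pipeline needs before selecting basis-set elements to fit.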
The final task is locally fitting the basis-set elements to each blob. The details depend on the method of representation, but we can make some assumptions. Brute-force translational testing, sliding an image prototype horizontally and vertically across the image to maximize similarity, could run in real time if it covers a small enough area, and z-ordering can be tested exhaustively as well. But how do we choose the actual basis-set elements to be adjusted and translated? If we’re using restricted Boltzmann machines (RBMs) as the representative set, we could use activation energies to guide selection. If we’re storing raw images for a nearest-neighbor-type approach, we could compute the distance to each example image using some sparse feature set.
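The translational brute-force step can be sketched in a few lines: slide a template over the search region and keep the offset with the best match score (sum of squared differences here, though normalized cross-correlation is common too). The image and template are synthetic placeholders:

```python
import numpy as np

def best_translation(image, template):
    """Brute-force slide `template` over `image`, returning the
    (row, col) offset minimizing the sum of squared differences."""
    H, W = image.shape
    h, w = template.shape
    best, best_pos = np.inf, (0, 0)
    for r in range(H - h + 1):
        for c in range(W - w + 1):
            patch = image[r:r + h, c:c + w]
            ssd = np.sum((patch - template) ** 2)
            if ssd < best:
                best, best_pos = ssd, (r, c)
    return best_pos

# Synthetic check: plant the template at a known offset and recover it.
rng = np.random.default_rng(0)
image = rng.random((40, 40))
template = rng.random((8, 8))
image[5:13, 7:15] = template
print(best_translation(image, template))  # -> (5, 7)
```

The nested loop is O(H·W·h·w), which is why restricting the search to a small neighborhood around each blob matters for real-time use.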
Here’s the overall architecture:
- Background modeling
- Background subtraction
- Blob identification
- Subitizing (per-blob count classification)
- Local template-matching
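The first three stages can be sketched with NumPy and SciPy, using synthetic frames in place of real video (the median background model and the 0.5 threshold are illustrative choices; the final template-matching stage is omitted here):

```python
import numpy as np
from scipy import ndimage

# Synthetic "video": a static noisy background, plus a current frame
# containing two bright rectangles standing in for vehicles.
rng = np.random.default_rng(1)
background = rng.random((60, 80)) * 0.2
frames = [background + rng.random((60, 80)) * 0.01 for _ in range(10)]
current = background.copy()
current[10:20, 10:20] += 1.0   # "vehicle" 1
current[30:45, 50:65] += 1.0   # "vehicle" 2

# 1. Background modeling: per-pixel median over the frame history.
bg_model = np.median(np.stack(frames), axis=0)

# 2. Background subtraction: threshold the deviation from the model.
mask = np.abs(current - bg_model) > 0.5

# 3. Blob identification: connected-component labeling of the mask.
labels, n_blobs = ndimage.label(mask)
print(n_blobs)  # -> 2
```

Each labeled blob would then be cropped out, passed to the subitizing classifier for a count, and handed to the local template-matching stage for fitting.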