Siamese Networks
Introduction
Deep convolutional neural networks have become the state-of-the-art method for image classification. One of their biggest limitations, however, is that they require a large amount of labeled data, and in many applications collecting that much data is not feasible.
A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors. Often one of the output vectors is precomputed, forming a baseline against which the other output vector is compared. This is similar to comparing fingerprints, but can be described more technically as a distance function for locality-sensitive hashing.
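The weight sharing described above can be illustrated with a minimal sketch. The single tanh layer, the weight values, and the names `embed`, `compare`, and `SHARED_WEIGHTS` are assumptions made for illustration only; a real twin network would use a trained, much deeper model.

```python
import math

def embed(x, weights):
    """Map an input vector to an output vector with one tanh layer."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# One weight matrix (made-up values), used by both branches of the twin network.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]

# The baseline output vector is computed once and stored.
baseline = embed([1.0, 0.0, 2.0], SHARED_WEIGHTS)

def compare(query):
    """Embed the query with the same weights and measure distance to the baseline."""
    return euclidean(embed(query, SHARED_WEIGHTS), baseline)

same = compare([1.0, 0.0, 2.0])        # identical input, so the distance is zero
different = compare([-2.0, 1.0, 0.5])  # a different input lands farther away
```

Because both inputs pass through the same weights, their output vectors live in the same space and can be compared directly.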
It is possible to build an architecture that is functionally similar to a siamese network but implements a slightly different function. This is typically used for comparing similar instances in different type sets.
Twin networks are used wherever a similarity measure is needed, for example recognizing handwritten checks, automatically detecting faces in camera images, and matching queries with indexed documents. Perhaps the best-known application is face recognition, where images of known people are precomputed and compared against an image from a turnstile or similar device. It is not obvious at first, but there are two slightly different problems here. One is recognizing a person among a large number of other people, which is the facial recognition problem; DeepFace is an example of such a system. In its most extreme form, this means recognizing a single person at a train station or airport. The other is face verification, which checks whether the photo in a pass shows the same person as the one presenting it. The twin network might be the same in both cases, but the implementation can be quite different.
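The difference between the two problems can be sketched in a few lines. Everything here is hypothetical: the `gallery` of precomputed output vectors, the names in it, and the `threshold` value are made up for illustration, and in practice the vectors would come from a trained twin network.

```python
import math

def distance(u, v):
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical precomputed output vectors of known people (made-up values).
gallery = {
    "alice": [0.9, 0.1],
    "bob": [0.1, 0.8],
    "carol": [-0.7, 0.3],
}

def verify(probe, claimed_identity, threshold=0.5):
    """Face verification: a one-to-one check against a claimed identity."""
    return distance(probe, gallery[claimed_identity]) <= threshold

def identify(probe, threshold=0.5):
    """Face recognition: a one-to-many search; None if nobody is close enough."""
    name = min(gallery, key=lambda n: distance(probe, gallery[n]))
    return name if distance(probe, gallery[name]) <= threshold else None

probe = [0.85, 0.15]  # output vector of the image taken at the turnstile
```

Verification touches a single stored vector, while recognition has to search the whole gallery, which is one reason the two implementations can differ even when the network is shared.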
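Now consider how a traditional CNN would handle this task, for example a face recognition system trained as a classifier over the 10 people in a small organization.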
Problems with Traditional CNN:
a) To train such a system, we first need many different images of each of the 10 people in the organization, which might not be feasible. (Imagine doing this for an organization with thousands of employees.)
b) What if a person joins or leaves the organization? You would have to collect data again and retrain the entire model. This is impractical, especially for large organizations where hiring and attrition happen almost every week.
Siamese Neural Network:
Learning:
Learning in twin networks can be done with triplet loss or contrastive loss. With triplet loss, a baseline vector (the anchor image) is compared against a positive vector (a truthy image) and a negative vector (a falsy image). The negative vector forces learning in the network, while the positive vector acts as a regularizer. With contrastive loss, the weights must be regularized by weight decay or a similar operation such as normalization.
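As a rough sketch, the two losses can be written as follows. These follow the commonly used margin-based formulations; the `margin` values and the 0.5 factor in the contrastive loss are assumptions, not taken from the text above.

```python
import math

def squared_distance(u, v):
    """Squared Euclidean distance between two output vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin), with d the squared Euclidean distance."""
    return max(0.0, squared_distance(anchor, positive)
                    - squared_distance(anchor, negative) + margin)

def contrastive_loss(u, v, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs beyond the margin."""
    d = math.sqrt(squared_distance(u, v))
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2

anchor = [0.0, 0.0]
easy = triplet_loss(anchor, [0.1, 0.0], [2.0, 0.0])  # negative already far away
hard = triplet_loss(anchor, [0.1, 0.0], [0.5, 0.0])  # negative too close
```

The easy triplet contributes no loss because the negative is already more than a margin farther from the anchor than the positive; only the hard triplet produces a learning signal.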
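A distance metric $\delta$ used in a loss function may have the following properties:
$$
\begin{aligned}
&\text{Non-negativity:} && \delta(x, y) \ge 0 \\
&\text{Identity of indiscernibles:} && \delta(x, y) = 0 \iff x = y \\
&\text{Symmetry:} && \delta(x, y) = \delta(y, x) \\
&\text{Triangle inequality:} && \delta(x, z) \le \delta(x, y) + \delta(y, z)
\end{aligned}
$$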
In particular, the triplet loss algorithm is often defined with squared Euclidean distance at its core, which, unlike Euclidean distance, does not satisfy the triangle inequality.
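A small worked example makes the point about the triangle inequality concrete. The three points on a line are chosen only for illustration.

```python
import math

def euclidean(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def squared_euclidean(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

x, y, z = [0.0], [1.0], [2.0]

# Euclidean: d(x, z) = 2 and d(x, y) + d(y, z) = 1 + 1, so the inequality holds.
euclidean_ok = euclidean(x, z) <= euclidean(x, y) + euclidean(y, z)

# Squared Euclidean: d(x, z) = 4 but d(x, y) + d(y, z) = 1 + 1, so it fails.
squared_ok = squared_euclidean(x, z) <= squared_euclidean(x, y) + squared_euclidean(y, z)
```

Squaring stretches large distances more than small ones, which is why the direct path from x to z becomes longer than the detour through y.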
Predefined metrics, Euclidean distance metric:
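The common learning goal is to minimize a distance metric for similar objects and maximize it for distinct ones. This gives a loss function like
$$
\delta\big(x^{(i)}, x^{(j)}\big) =
\begin{cases}
\min \big\| f(x^{(i)}) - f(x^{(j)}) \big\|, & i = j \\
\max \big\| f(x^{(i)}) - f(x^{(j)}) \big\|, & i \ne j
\end{cases}
$$
where $i, j$ index a set of vectors and $f(\cdot)$ is the function implemented by the twin network.
The most common distance metric used is Euclidean distance, in which case the squared distance term can be written in matrix form as
$$
\big\| f(x^{(i)}) - f(x^{(j)}) \big\|^{2} = \big( f(x^{(i)}) - f(x^{(j)}) \big)^{T} \big( f(x^{(i)}) - f(x^{(j)}) \big)
$$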
Learned metrics, half-twin networks:
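This form also allows the twin network to be more of a half-twin, in which the two branches implement slightly different functions $f$ and $g$:
$$
\delta\big(x^{(i)}, x^{(j)}\big) = d\big( f(x^{(i)}), g(x^{(j)}) \big)
$$
where $d$ is the distance applied to the two output vectors, and $\delta$ should be small when the inputs match ($i = j$) and large otherwise.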
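A half-twin arrangement might look like the sketch below, assuming the two branches take inputs of different sizes and map them into a common two-dimensional output space. The layer shape, the weight values, and the names `f`, `g`, and `half_twin_distance` are all hypothetical.

```python
import math

def linear_tanh(x, weights):
    """One tanh layer; each row of weights produces one output component."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]

# Made-up weights: f takes 3-dimensional inputs, g takes 2-dimensional inputs.
F_WEIGHTS = [[0.4, -0.1, 0.2],
             [0.0, 0.5, -0.3]]
G_WEIGHTS = [[0.7, 0.1],
             [-0.2, 0.6]]

def f(x):
    """First branch."""
    return linear_tanh(x, F_WEIGHTS)

def g(x):
    """Second branch, with its own weights."""
    return linear_tanh(x, G_WEIGHTS)

def half_twin_distance(a, b):
    """Squared Euclidean distance between f(a) and g(b)."""
    return sum((p - q) ** 2 for p, q in zip(f(a), g(b)))

d = half_twin_distance([1.0, 2.0, 0.5], [0.3, -1.0])
```

Unlike the full twin, the branches no longer share weights, so each can be adapted to its own input type while the outputs remain comparable.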
Conclusion:
Twin networks have been used in object tracking because of their two tandem inputs and their similarity measurement. In object tracking, one input of the twin network is an exemplar image pre-selected by the user, and the other input is a larger search image; the twin network’s job is to locate the exemplar inside the search image. By measuring the similarity between the exemplar and each part of the search image, the twin network produces a map of similarity scores. Furthermore, using a Fully Convolutional Network, the process of computing each sector’s similarity score can be replaced with a single cross-correlation layer.
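The similarity map can be sketched with a plain cross-correlation. This is a simplification: here the exemplar is correlated with raw values in small nested lists, whereas a tracker of the kind described would correlate feature maps produced by the twin network. The function names and the toy data are assumptions for illustration.

```python
def similarity_map(search, exemplar):
    """Slide the exemplar over the search image (2-D lists) and return
    the cross-correlation score at every valid offset."""
    eh, ew = len(exemplar), len(exemplar[0])
    sh, sw = len(search), len(search[0])
    return [[sum(search[r + i][c + j] * exemplar[i][j]
                 for i in range(eh) for j in range(ew))
             for c in range(sw - ew + 1)]
            for r in range(sh - eh + 1)]

def best_offset(scores):
    """Return the (row, column) offset with the highest similarity score."""
    positions = ((r, c) for r in range(len(scores)) for c in range(len(scores[0])))
    return max(positions, key=lambda rc: scores[rc[0]][rc[1]])

exemplar = [[1, -1],
            [-1, 1]]
search = [[0, 0, 0, 0],
          [0, 0, 1, -1],
          [0, 0, -1, 1],
          [0, 0, 0, 0]]

scores = similarity_map(search, exemplar)
location = best_offset(scores)  # offset where the exemplar matches best
```

Every entry of `scores` comes from the same sliding operation, which is what allows a single cross-correlation layer to compute the whole map at once.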