Treating abnormal event detection as a binary classification problem is not ideal for two reasons:
- Abnormal events are challenging to obtain due to their rarity.
- There is a massive variety of abnormal events, and manually detecting and labeling such events is a difficult task that requires much manpower.
A better approach is to train on unlabelled video sequences that contain few or no abnormal events, since these are easy to obtain. Autoencoders require just that.
The above flowchart explains the working of a trained ConvLSTM Autoencoder:
- The Autoencoder outputs a reconstructed clip
- Based on the error (reconstruction cost) between the original and the reconstructed clip, we can decide whether a particular frame is abnormal
A vehicle in a pedestrian walk — Abnormal
For example, in wire rope inspection a normal clip shows rope without defects, and the abnormal frame is the one containing a defect. The autoencoder cannot reproduce the defect during reconstruction, so the reconstruction cost (RC) is higher for this frame.
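The scoring idea described above can be sketched in a few lines of plain Python. This is only an illustration: the pixel values, the "reconstructed" frames and the threshold of 0.05 are made-up assumptions, and a real pipeline would take the reconstruction from the trained autoencoder.

```python
def frame_errors(original, reconstructed):
    """Mean squared error between each original frame and its reconstruction."""
    errors = []
    for frame, recon in zip(original, reconstructed):
        squared = [(p - q) ** 2 for p, q in zip(frame, recon)]
        errors.append(sum(squared) / len(squared))
    return errors


def flag_abnormal(errors, threshold):
    """Indices of frames whose reconstruction error exceeds the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]


# Toy clip: three frames of four pixels each (flattened, values in [0, 1]).
original = [
    [0.1, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.2],
    [0.1, 0.9, 0.8, 0.2],  # frame containing a "defect"
]
# Hypothetical autoencoder output: it only knows how to draw the normal pattern.
reconstructed = [
    [0.1, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.2],
]

errors = frame_errors(original, reconstructed)
abnormal = flag_abnormal(errors, threshold=0.05)  # threshold is an assumption
print(errors, abnormal)
```

Only the last frame differs from its reconstruction, so it is the only one whose error rises above the threshold.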
Why Autoencoders?
What use is an AE if it is just going to reconstruct the input? The answer is that, while reconstructing, the AE is actually learning a lower dimensional vector representation 'h' of the high dimensional input 'x'.
So is this similar to PCA? Yes, in a way. If there are no activation functions such as sigmoid or tanh and mean squared error is used as the loss function, the latent vector spans the same subspace as the principal components of PCA. Here, however, we use a non-linear activation function at the end of each layer, so the autoencoder can be thought of as a non-linear form of PCA whose latent vector 'h' is a meaningful representation of the input image. The intuition is that when 'x' is reconstructed from this latent vector 'h', certain details are discarded, in our case the abnormalities. Comparing the reconstruction with the original image gives the reconstruction cost (RC).
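To make the PCA analogy concrete, here is a minimal sketch of the linear special case only, not of the ConvLSTM autoencoder itself. It compresses 2D points to a 1D latent value 'h' along the first principal component and reconstructs them. The data points and function names are hypothetical, chosen so the numbers are easy to follow.

```python
import math


def fit_principal_axis(points):
    """Return the mean and the unit vector of the first principal component (2D data)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form direction of the leading eigenvector of a 2x2 covariance matrix.
    theta = 0.5 * math.atan2(2 * cxy, cxx - cyy)
    return (mx, my), (math.cos(theta), math.sin(theta))


def encode(x, mean, axis):
    """Project x onto the principal axis: the 1D latent value h."""
    return (x[0] - mean[0]) * axis[0] + (x[1] - mean[1]) * axis[1]


def decode(h, mean, axis):
    """Map the latent value h back to 2D."""
    return (mean[0] + h * axis[0], mean[1] + h * axis[1])


def reconstruction_error(x, mean, axis):
    rx, ry = decode(encode(x, mean, axis), mean, axis)
    return (x[0] - rx) ** 2 + (x[1] - ry) ** 2


# "Normal" training data lies on the line y = x.
normal = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (3.0, 3.0)]
mean, axis = fit_principal_axis(normal)

err_normal = reconstruction_error((1.5, 1.5), mean, axis)    # on the line
err_abnormal = reconstruction_error((1.0, 3.0), mean, axis)  # off the line
print(err_normal, err_abnormal)
```

The point that follows the normal pattern is reconstructed almost perfectly, while the off-pattern point loses the part of itself that the latent value cannot express, which is exactly what shows up as reconstruction error.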
Latent representations also appear in facial recognition, where faces are reduced to 128D or 1024D vector representations.
There are different variants of AE serving different purposes, more of which can be explored in the link below.
With this approach we can find a spatially encoded vector 'h' (the bottleneck layer) for a single image, but to reduce a frame to a meaningful representation we also need to be aware of the representations of the previous frames. In other words, we need to encode a temporal component as well. That is where the LSTM comes to the rescue.
As you can see, there is a temporal encoder after the regular Conv layers; here the representation is reduced from a 256D vector to a 32D vector.
From this bottleneck layer we construct a decoder with a symmetrical structure that outputs the same 256D shape as the input. The deconv (transposed convolution) layers perform a reverse convolution operation, something like upsampling an image with an interpolation technique, except that here the upsampling is learned rather than engineered as bi-cubic or bi-linear. More detailed explanation is
This 256D output is the reconstruction, which is compared with the original image to calculate the error. Frames whose error exceeds a chosen threshold are then reported as anomalies.
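The symmetry between the encoder and the decoder can be checked with the standard output-size formulas for a convolution and a transposed convolution. The sketch below uses hypothetical layer settings (a 227-pixel input, an 11-pixel kernel and a stride of 4) that are not taken from the architecture above; they only show that a deconv layer with matching settings restores the original spatial size.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial size after a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_output_size(size, kernel, stride, padding=0):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Hypothetical layer settings, chosen only to illustrate the symmetry.
input_size = 227
encoded = conv_output_size(input_size, kernel=11, stride=4)
decoded = deconv_output_size(encoded, kernel=11, stride=4)
print(input_size, encoded, decoded)  # 227 55 227
```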
More about the LSTM is explained below:
Before we move on to the specifics, let's get some intuition behind the basics.
LSTM is a special type of recurrent neural network (RNN). Just as a CNN captures features such as edges, shapes and higher-level features, an RNN captures sequential information. For example, if a moon is present in the image the model should remember that it is nighttime; if people are closing their eyes on a beach, it should know that they are sunbathing and not sleeping.
But in a plain RNN this sequential information is short-lived: if it sees a frame set in the US and two or three frames later sees a person eating sushi, it may wrongly conclude that they are in Japan. Mathematically, the gradients that carry this information during backpropagation can vanish or explode. LSTMs address this problem by remembering long-term information.
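A rough way to see why gradients vanish or explode: backpropagating through time multiplies the gradient by a similar factor at every step. The sketch below is a deliberate simplification that treats that factor as one constant scalar; the values 0.5 and 1.5 and the 50 steps are arbitrary assumptions.

```python
def gradient_after(steps, recurrent_factor):
    """Gradient magnitude after being scaled by the same factor at every time step."""
    grad = 1.0
    for _ in range(steps):
        grad *= recurrent_factor
    return grad


vanishing = gradient_after(50, 0.5)  # factor below 1: the signal dies out
exploding = gradient_after(50, 1.5)  # factor above 1: the signal blows up
print(vanishing, exploding)
```

After 50 steps the first gradient is effectively zero and the second is enormous, so information from early frames either never reaches the weights or destabilises training.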
Below is an intuitive understanding of the differences between NN , RNN and LSTMs
More about the working of LSTM applied on character level predictions with visualizations can be found
Now let's move to the specifics:
An LSTM uses several gating mechanisms that work in unison to produce the output values and the cell state. The cell state holds the information learned so far and is updated as it is passed on to subsequent time steps.
Input gates :
Let’s use a different nomenclature for better understanding
At: candidate addition to the long term memory (ltm)
ft: remember gate [determines how much of the old ltm to keep]
It: save gate [determines how much of the candidate ltm to keep]
Ot: output gate [determines which parts of the updated ltm to focus on]
Out(t-1) is initially 0 (time = 0)
State: new ltm state = remember gate * old ltm + save gate * new candidate ltm
Each cell passes the above state and output values on to the next time step
This way the sequential information is passed on to make better inferences
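The nomenclature above maps onto a single forward step of an LSTM cell. The following is a minimal scalar sketch, assuming the standard LSTM equations; the weights in `params` are hypothetical values picked for illustration, and a real layer would use weight matrices (or convolutions, in a ConvLSTM) instead of single numbers.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, out_prev, state_prev, params):
    """One forward step of a scalar LSTM cell using the nomenclature above."""
    def pre(name):
        w, u, b = params[name]  # input weight, recurrent weight, bias
        return w * x + u * out_prev + b

    a = math.tanh(pre("a"))  # At: candidate addition to the long term memory
    i = sigmoid(pre("i"))    # It: save gate
    f = sigmoid(pre("f"))    # ft: remember gate
    o = sigmoid(pre("o"))    # Ot: output gate
    state = f * state_prev + i * a  # new ltm state
    out = o * math.tanh(state)      # value passed on as Out(t)
    return out, state


# Hypothetical weights (input weight, recurrent weight, bias) for each gate.
params = {
    "a": (0.5, 0.1, 0.0),
    "i": (0.6, 0.2, 0.0),
    "f": (0.7, 0.3, 1.0),
    "o": (0.8, 0.4, 0.0),
}

out, state = 0.0, 0.0  # Out(t-1) and the state both start at 0
for x in [1.0, 0.5, -1.0]:
    out, state = lstm_step(x, out, state, params)
    print(round(out, 4), round(state, 4))
```

Because the state is carried from one call to the next, each output depends on every earlier input and not only on the current one.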
More about the working of the LSTM along with backpropagation explained mathematically like the one below can be found here
ConvLSTM vs LSTM:
There are two approaches when working with sequential images:
- ConvLSTM: all the matrix multiplications inside the LSTM are replaced with convolution operations, so the input keeps its 3D shape.
- CNN followed by LSTM: each image passes through the convolution layers and the resulting features are flattened to a 1D array. Repeating this for every image in the time sequence gives a set of features over time, which is the input to the LSTM layer.
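The practical difference between the two approaches is the shape of the data handed to the recurrent layer. The sketch below uses plain nested lists and a hypothetical clip of 4 frames with 3 x 3 feature maps and 2 channels; the helper names are made up for illustration.

```python
def flatten_frame(feature_maps):
    """Flatten one frame's H x W x C feature maps into a 1D list (CNN followed by LSTM)."""
    return [value for row in feature_maps for pixel in row for value in pixel]


def shape(nested):
    """Shape of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


# Hypothetical conv features for a clip: T=4 frames, each 3 x 3 with 2 channels.
T, H, W, C = 4, 3, 3, 2
clip = [[[[0.0] * C for _ in range(W)] for _ in range(H)] for _ in range(T)]

convlstm_input = clip                                  # spatial layout is kept
lstm_input = [flatten_frame(frame) for frame in clip]  # one 1D vector per frame

print(shape(convlstm_input))  # (4, 3, 3, 2)
print(shape(lstm_input))      # (4, 18)
```

Flattening discards the information about which values were neighbours in the image, whereas the ConvLSTM keeps that layout available at every time step.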
More about their implementation can be found here
Useful links :
Originally published at http://docs.google.com.