Computer Vision is one of the hottest domains, especially at a time like this, when a pandemic is nudging industries toward vision-based solutions. Traditional computer vision was based mainly on image-processing algorithms, while deep-learning-based CV provides an end-to-end solution to vision problems, which has broadened its applicability to much wider areas. Some applications that add value are dimensioning solutions, people tracking in surveillance, and autonomous driving.
Now, all you developers out there may or may not have been exposed to these approaches, but we are going to discuss something you can definitely relate to: computation power.
Difference between the traditional CV workflow and the DL-based CV workflow:
For example, if we want to find a person in an image, the traditional approach needs a feature extractor and a classifier, and we must tell the feature extractor which attributes of a person to look for: two legs, two arms, posture, etc. DL-based approaches, by contrast, are end-to-end: there is no need to tell the extractor what to look for. We just need to give it enough images to learn the features for itself.
Some traditional CV algorithms :
- Scale Invariant Feature Transform (SIFT)
- Speeded Up Robust Features (SURF)
- Features from Accelerated Segment Test (FAST)
- Hough transforms
- Geometric hashing
These are coupled with machine learning classifiers like SVM (Support Vector Machine) and KNN (K-Nearest Neighbors) to classify objects.
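As a toy illustration of the classifier half of that pipeline, here is a minimal k-NN majority vote over hand-crafted feature vectors, using NumPy only; the two-dimensional descriptors and labels below are made up for the example.

```python
import numpy as np

def knn_predict(train_feats, train_labels, query, k=3):
    """Classify a query feature vector by majority vote among its
    k nearest training vectors (Euclidean distance)."""
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.argsort(dists)[:k]
    values, counts = np.unique(train_labels[nearest], return_counts=True)
    return values[np.argmax(counts)]

# Toy hand-crafted descriptors: class 0 clusters near the origin,
# class 1 clusters near (5, 5).
train_feats = np.array([[0.1, 0.2], [0.0, 0.4], [0.3, 0.1],
                        [5.0, 5.1], [4.8, 5.3], [5.2, 4.9]])
train_labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_feats, train_labels, np.array([0.2, 0.3])))  # 0
print(knn_predict(train_feats, train_labels, np.array([5.0, 5.0])))  # 1
```

In a real traditional-CV pipeline the feature vectors would come from SIFT, SURF, or HOG descriptors rather than being written by hand.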
Let’s see the steps involved in the popular SIFT algorithm:
- Scale space: make sure the features are scale-independent
- Keypoint localisation: identify the suitable features
- Orientation assignment: ensure the keypoints are rotation-invariant
- Keypoint descriptor: assign a unique fingerprint to each keypoint
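The scale-space step can be sketched in plain NumPy as a difference-of-Gaussians (DoG) stack, the construction SIFT searches for extrema in; the image, sigma values, and kernel radius below are illustrative choices, not SIFT's exact parameters.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel truncated at 3 sigma, normalized to sum 1."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def blur(img, sigma):
    """Separable Gaussian blur: filter rows, then columns."""
    k = gaussian_kernel(sigma)
    tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, tmp)

def dog_stack(img, sigmas=(1.0, 1.6, 2.56, 4.1)):
    """Difference-of-Gaussians: subtract adjacent blur levels.
    Extrema across space and scale are SIFT keypoint candidates."""
    blurred = [blur(img, s) for s in sigmas]
    return [b2 - b1 for b1, b2 in zip(blurred, blurred[1:])]

img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0            # a bright synthetic blob
dogs = dog_stack(img)
print(len(dogs), dogs[0].shape)    # 3 (32, 32)
```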
As can be seen, the entire process revolves around knowledge gathered through image-processing techniques to identify the same image again, taking rotation and scale invariance, among other things, into account.
Being hand-engineered, these features might not be robust enough to perform well on complex classification tasks with large variation in the number of categories, e.g., the ImageNet dataset with its 1,000 classes.
Most deep learning algorithms in the context of images, on the other hand, revolve around CNNs and the different architectures built from them. The beauty of these algorithms is that they take care of both the feature extraction and the classification, and, most importantly, they don't need to be given specific features to look for. In the former case, for example, we have to tell the extractor to look for edges at four different angles to find the object in an image, whereas here we don't need to specify any such instruction.
This was proved by AlexNet in 2012, which outperformed the shallow 2010 and 2011 networks built on some variant of SIFT + SVM, as shown below.
But this increase in accuracy comes at a cost. Because SIFT and the other methods rely on hand-engineered features, their computation time is relatively low compared to DL methods.
Some CNN architectures are:
Ideally we want to be in the top-left quadrant of this plot: higher accuracy with fewer operations per second. How these methods work is beyond the scope of this article, but like I said, we are interested in their computational cost, more precisely the multiply-accumulate (MAC) operations they need.
Let’s get into a little math. Stay with me!
For the sake of simplicity, suppose the cat image is made of just 4 pixels (the vector x); this is multiplied by a weight matrix, and a bias vector (b) is added, to get a classification score per class. This is just a toy example, because it says the image is a dog! No wonder the cat's pissed!
This simple operation, performed by a single neuron in a NN, is called a multiply-accumulate (MAC).
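Written out in NumPy, the toy score computation s = Wx + b looks like this; the pixel values, weights, and biases are made-up numbers chosen so that the "cat" image scores highest as "dog":

```python
import numpy as np

# Hypothetical 4-pixel "image" flattened into a vector x, scored
# against 3 classes (cat, dog, ship) via s = W @ x + b.
x = np.array([56, 231, 24, 2], dtype=float)

W = np.array([[ 0.2, -0.5,  0.1,  2.0],    # cat row
              [ 1.5,  1.3,  2.1,  0.0],    # dog row
              [ 0.0,  0.25, 0.2, -0.3]])   # ship row
b = np.array([1.1, 3.2, -1.2])

# Each output entry is a chain of multiply-accumulate (MAC) ops:
# multiply a pixel by a weight, add it into a running sum.
scores = W @ x + b
classes = ['cat', 'dog', 'ship']
print(dict(zip(classes, np.round(scores, 1))))
print('prediction:', classes[int(np.argmax(scores))])  # prediction: dog
```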
Clearly an image is going to have far more than just 4 pixels, so just imagine the MAC operations needed to run a classification network. I'll tell you what, don't imagine; here you go:
A typical deep neural network involves recurring combinations of convolution, pooling, activation, and fully connected layers. Shown below are the configurations of a few architectures.
As you can see, the MACs needed range from hundreds of millions to billions. AlexNet, which takes in a 224 x 224 image, performs 724M MAC operations! Phew! There's also a trend, if you notice: convolution layers are becoming more important, while FC layers are shrinking.
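Those per-layer counts come from simple shape arithmetic. A rough sketch (the conv1 shape below assumes AlexNet's 11 x 11 kernels, 3 input channels, 96 filters, and a 55 x 55 output map):

```python
def conv_macs(h_out, w_out, c_out, k, c_in):
    """MACs for one convolution layer: every output element needs
    a k*k*c_in dot product."""
    return h_out * w_out * c_out * k * k * c_in

def fc_macs(n_in, n_out):
    """A fully connected layer is one big matrix-vector product."""
    return n_in * n_out

# AlexNet conv1: ~105M MACs for a single layer
print(conv_macs(55, 55, 96, 11, 3))   # 105415200

# A 4096 -> 4096 fully connected layer: ~16.8M MACs
print(fc_macs(4096, 4096))            # 16777216
```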
This is where the GPU comes in: its highly parallel architecture enables matrix multiplications, and hence these convolutions, to run in significantly less time than on a CPU.
The above chart shows multiplication of 4096 x 4096 matrices on CPU and GPU with several optimization methods such as tiling, global memory coalescing, and avoiding shared-memory bank conflicts. Clearly the GPU provides a significant boost.
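We can't run CUDA in a short snippet, but the same idea can be hinted at on a CPU by contrasting a naive triple-loop multiply-accumulate with NumPy's `@`, which dispatches to an optimized (tiled, vectorized) BLAS kernel; absolute timings vary by machine, so only correctness is asserted here.

```python
import time
import numpy as np

def matmul_naive(A, B):
    """Straightforward triple loop: one scalar MAC at a time."""
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for p in range(k):
                acc += A[i, p] * B[p, j]   # one multiply-accumulate
            C[i, j] = acc
    return C

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 64))
B = rng.standard_normal((64, 64))

t0 = time.perf_counter(); C1 = matmul_naive(A, B); t_naive = time.perf_counter() - t0
t0 = time.perf_counter(); C2 = A @ B;              t_blas  = time.perf_counter() - t0

print(np.allclose(C1, C2))                 # True: same result
print(f'naive: {t_naive:.4f}s  blas: {t_blas:.6f}s')
```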
There, problem solved, right? Well, not exactly.
The number of MACs is just one issue; an even bigger issue with these architectures is memory access.
For one MAC op we need to read the image value, the weight, and the partial sum that ties it all together, and in the worst case each of those goes to off-chip memory, where a single access can cost around 200x more than the actual MAC op!
To better understand this, let's take conv1 of AlexNet:
Input reads: 224 x 224 x 3 input values, each read once per weight in the 3 x 3 x 32 filter bank = 43,352,064 reads from memory
Weight reads: 3 x 3 x 3 x 32 + 32 = 896
Output writes: 112 x 112 x 32 = 401,408 writes to memory
Total: 43,754,368 memory accesses for a single convolution layer
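The arithmetic above can be reproduced directly; the function assumes the worst case of no data reuse, i.e., every input value is re-read once per filter weight:

```python
def layer_memory_accesses(h_in, w_in, c_in, k, c_out, h_out, w_out):
    """Worst-case memory traffic for one conv layer, assuming no
    data reuse at all."""
    input_reads  = (h_in * w_in * c_in) * (k * k * c_out)  # re-read per weight
    weight_reads = k * k * c_in * c_out + c_out            # weights + biases
    output_writes = h_out * w_out * c_out                  # one write per output
    return input_reads, weight_reads, output_writes

# The conv layer from the example: 224x224x3 input, 3x3x32 filters,
# 112x112x32 output (stride 2)
r, w, o = layer_memory_accesses(224, 224, 3, 3, 32, 112, 112)
print(r, w, o, r + w + o)   # 43352064 896 401408 43754368
```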
The reason for this is the temporal architecture of CPUs and GPUs.
The ALUs, where the actual MAC ops take place, can only fetch data from the memory hierarchy and cannot communicate with each other.
A lot of research is going into optimizing this aspect. The most common techniques revolve around transforming these matrices to suit temporal architectures, reducing the number of multiplications at the cost of more additions.
One popular transformation is the Toeplitz matrix, which rearranges the input so that convolution can be performed as a simple matrix multiplication on existing platforms; the problem, as you can see in the input feature map, is that the data gets repeated. Some other techniques are:
- Fast Fourier Transform (FFT)
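A minimal im2col (Toeplitz-style) sketch makes the trade-off concrete: convolution becomes one matrix multiplication, but the overlapping patches duplicate input data.

```python
import numpy as np

def im2col(img, k):
    """Unroll every k x k patch of a 2-D image into a row.
    Overlapping patches duplicate pixels: this is the data
    repetition the Toeplitz rearrangement pays for."""
    h, w = img.shape
    rows = []
    for i in range(h - k + 1):
        for j in range(w - k + 1):
            rows.append(img[i:i + k, j:j + k].ravel())
    return np.array(rows)

def conv_direct(img, kern):
    """Reference sliding-window (cross-correlation) implementation."""
    k = kern.shape[0]
    out = np.zeros((img.shape[0] - k + 1, img.shape[1] - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + k, j:j + k] * kern)
    return out

img = np.arange(25, dtype=float).reshape(5, 5)
kern = np.array([[1., 0., -1.]] * 3)   # a simple edge filter

# Convolution as one matrix-vector product on the unrolled patches
out_mm = (im2col(img, 3) @ kern.ravel()).reshape(3, 3)
print(np.allclose(out_mm, conv_direct(img, kern)))  # True

# The cost: 9 patches x 9 values = 81 stored entries from a 25-pixel image
print(im2col(img, 3).size)  # 81
```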
All of these focus on reducing the number of multiplications, and hence the overall complexity, thereby increasing throughput.
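The FFT route can be shown in one dimension (the 2-D feature-map case works the same way): pad, multiply spectra, and inverse-transform, turning the sliding-window sum into pointwise products.

```python
import numpy as np

def conv1d_fft(x, k):
    """Linear convolution via the FFT: zero-pad both signals to
    length len(x)+len(k)-1 so circular wraparound doesn't corrupt
    the result, multiply the spectra, inverse-transform."""
    n = len(x) + len(k) - 1
    return np.real(np.fft.ifft(np.fft.fft(x, n) * np.fft.fft(k, n)))

x = np.array([1., 2., 3., 4., 5.])
k = np.array([1., 0., -1.])

# Matches NumPy's direct 'full' convolution
print(np.allclose(conv1d_fft(x, k), np.convolve(x, k)))  # True
```

The direct method costs O(N*K) multiplies; the FFT version costs O(N log N), which pays off for large kernels or feature maps.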
Another research community is focusing on spatial architectures such as ASICs and FPGAs, because data movement is expensive: as shown below, the lower levels of the memory hierarchy, like the register file and the processing engine's local storage, have significantly lower energy cost per access, although their capacity is small (kilobytes).
These designs rely more on the lower levels of the memory hierarchy, unlike GPUs and CPUs, which lean on costlier DRAM accesses. One such chip is the Eyeriss accelerator from the folks at MIT. Check it out!
Now that you have a sense of how computationally demanding DL architectures are, choosing a platform should be easier. Nevertheless, the key metrics to look for are the following:
- Accuracy: DL is valued for the quality of its predictions, so accuracy comes first
- Programmability: flexibility not only in terms of layers but also in terms of weights
- Energy / power: energy per operation, because edge devices run on low power
- Throughput / latency: GOPS (giga-operations per second), frame rate
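As a back-of-the-envelope use of those metrics, here is a hypothetical frame-rate estimate; the 100-GOPS device and the 50% utilization figure are assumptions for illustration, and the 724M MAC count for AlexNet comes from the discussion above.

```python
def frame_rate(macs_per_frame, hardware_gops, utilization=0.5):
    """Rough frames/sec estimate: each MAC counts as 2 ops
    (multiply + add); utilization is an assumed, illustrative
    factor covering memory stalls and scheduling overhead."""
    ops_per_frame = 2 * macs_per_frame
    return hardware_gops * 1e9 * utilization / ops_per_frame

# AlexNet forward pass (~724M MACs) on a hypothetical 100-GOPS
# edge accelerator
print(round(frame_rate(724e6, 100), 1))  # 34.5 frames per second
```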
A comprehensive tutorial and survey of recent advances in the efficient processing of deep neural networks: https://people.csail.mit.edu/emer/papers/2017.12.pieee.DNN_hardware_survey.pdf
Originally published at http://docs.google.com.