Facial Expression Analysis with Deep Learning & Computer Vision

By: Gordon Cooper, Product Marketing Manager, Synopsys

Recognizing facial expressions and emotions is a basic skill learned at an early age and important to human social interactions. Humans can look at a person’s face and quickly recognize the common emotions of anger, happiness, surprise, disgust, sadness, and fear. Transferring this skill to a machine is a complex task. Researchers have devoted decades of engineering time to writing computer programs that recognize one feature with accuracy, only to have to start over to recognize a slightly different feature.

What if, instead of programming a machine, you could teach it to recognize emotions with great accuracy?

Deep learning techniques are showing great promise in lowering error rates for computer vision recognition and classification. Implementing deep neural networks (Figure 1) in embedded systems can help give machines the ability to visually interpret facial expressions to almost human-like levels of accuracy.

Figure 1. Simple example of a deep neural network

A neural network can be trained to recognize patterns and is considered “deep” if it has an input layer, an output layer, and at least one hidden middle layer. Each node’s value is calculated from the weighted inputs of multiple nodes in the previous layer. These weight values can be adjusted so the network performs a specific image recognition task; adjusting them is referred to as the neural network training process.
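
As a small illustration of that per-node calculation, here is a Python sketch of one node computing its value from three previous-layer nodes; the numbers are arbitrary, and ReLU is just one common choice of nonlinearity:

    import numpy as np

    # Outputs of three nodes in the previous layer (arbitrary example values).
    previous_layer = np.array([0.2, 0.7, 0.1])

    # One weight per incoming connection plus a bias; training adjusts these.
    weights = np.array([0.4, -0.3, 0.9])
    bias = 0.05

    # Weighted sum of the inputs, passed through a nonlinearity (ReLU here).
    node_value = max(0.0, float(np.dot(weights, previous_layer) + bias))
    print(node_value)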

For example, to teach a deep neural network to recognize a photo showing happiness, it is presented with images of happiness as raw data (image pixels) on its input layer. Knowing that the result should be happiness, the network recognizes patterns in the picture and adjusts the node weights to minimize the errors for the happiness class. Each new annotated image showing happiness helps refine the weights. Trained with enough inputs, the network can then take in an unlabeled image and accurately analyze and recognize the patterns that correspond to happiness.
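
The weight adjustment itself is an optimization loop. The sketch below uses a toy single-layer classifier in plain NumPy rather than a real deep network, but it shows the same idea: a labeled “happiness” example pulls the weights in the direction that reduces the error for that class.

    import numpy as np

    rng = np.random.default_rng(0)
    pixels = rng.random(4)                        # a toy 4-pixel "image" as raw input
    target = np.array([0., 1., 0., 0., 0., 0.])   # annotated label: class 1 = happiness

    weights = rng.normal(scale=0.1, size=(6, 4))  # 6 emotion classes x 4 inputs

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    # Repeatedly nudge the weights to reduce the error for the labeled class.
    for _ in range(100):
        probs = softmax(weights @ pixels)
        gradient = np.outer(probs - target, pixels)  # cross-entropy gradient
        weights -= 0.5 * gradient                    # gradient-descent update

    print(softmax(weights @ pixels))  # probability mass shifts toward class 1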

Deep neural networks require a lot of computational horsepower to calculate the weighted values of all these interconnected nodes. Memory for the data and efficient data movement are also important. Convolutional neural networks (CNNs) (Figure 2) are the current state-of-the-art for efficiently implementing deep neural networks for vision. CNNs are more efficient because they reuse many weights across the image and take advantage of the two-dimensional structure of the input data to reduce redundant computation.
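
To make the weight-reuse point concrete, the sketch below slides one 3x3 set of weights across a tiny image; the same nine weights are applied at every position instead of a separate weight for every pixel-to-node connection (the values are arbitrary).

    import numpy as np

    image = np.arange(36, dtype=float).reshape(6, 6)   # a tiny 6x6 "image"
    kernel = np.array([[1., 0., -1.],
                       [1., 0., -1.],
                       [1., 0., -1.]])                 # one shared 3x3 filter

    # Valid convolution: 4x4 output, each value reusing the same 9 weights.
    out = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)

    print(out)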

Figure 2. Example of a Convolutional Neural Network architecture (or graph) for facial analysis

Implementing a CNN for facial analysis requires two distinct and independent phases: a Training Phase and a Deployment Phase.

The Training Phase (Figure 3) requires a deep learning framework – Caffe or TensorFlow, for example – that uses CPUs and GPUs for the training calculations, as well as the expertise to use the framework. These frameworks often provide example CNN graphs that can be used as starting points. The framework then allows fine tuning of the graph: layers may be added, removed, or modified to achieve the best possible accuracy.
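
As an illustration of what such a starting-point graph might look like in one of these frameworks, here is a small Keras/TensorFlow sketch; the 48x48 grayscale input, the layer sizes, and the six-class output are placeholder choices, not the specific graph described in this article.

    import tensorflow as tf
    from tensorflow.keras import layers

    # A small starting-point CNN for 48x48 grayscale face crops and 6 emotions.
    # Layers can be added, removed, or resized here while tuning for accuracy.
    model = tf.keras.Sequential([
        layers.Input(shape=(48, 48, 1)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(6, activation="softmax"),  # anger, happiness, surprise, ...
    ])

    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()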

Figure 3. CNN Training Phase

One of the biggest challenges in the Training Phase is finding the right labeled dataset to train the network. The accuracy of the deep network is highly dependent on the distribution and quality of the training data. Several options to consider for facial analysis are the emotion-annotated dataset from the Facial Expression Recognition Challenge (FREC) and the multi-annotated private dataset from VicarVision (VV).
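
If the challenge dataset is used, it is commonly distributed as a CSV file of integer emotion labels and space-separated pixel strings; the sketch below assumes that layout, and the file name, column names, and label order are assumptions to verify against the actual release.

    import numpy as np
    import pandas as pd

    # Assumed label order for the 48x48 grayscale challenge images.
    EMOTIONS = ["anger", "disgust", "fear", "happiness",
                "sadness", "surprise", "neutral"]

    df = pd.read_csv("fer2013.csv")        # placeholder path to the dataset
    labels = df["emotion"].to_numpy()
    images = np.stack([np.array(row.split(), dtype=np.uint8).reshape(48, 48)
                       for row in df["pixels"]])

    print(images.shape, labels.shape, EMOTIONS[labels[0]])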

The Deployment Phase (Figure 4), for a real-time embedded design, can be implemented on an embedded vision processor like the Synopsys DesignWare® EV6x Embedded Vision Processors with a programmable CNN engine. An embedded vision processor is the best choice for balancing performance with small area and low power.

Figure 4. CNN Deployment Phase

While the scalar unit and vector unit are programmed using C and OpenCL C (for vectorization), the CNN engine does not have to be manually programmed. The final graph and weights (coefficients) from the Training Phase are fed into a CNN mapping tool, which configures the embedded vision processor’s CNN engine so it is ready to execute facial analysis.
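
What gets handed to the mapping tool is simply the final graph plus its learned coefficients. A hedged sketch of collecting those from a Keras/TensorFlow training run follows; the file names are placeholders, and the exact format the mapping tool expects depends on the toolchain.

    import tensorflow as tf

    # The CNN produced during the Training Phase (placeholder file name).
    model = tf.keras.models.load_model("emotion_cnn.h5")

    # Persist the final graph and weights (coefficients) for hand-off to the
    # CNN mapping tool; inspect the tensors that will be mapped onto the engine.
    model.save("emotion_cnn_final.h5")
    for layer in model.layers:
        for weights in layer.get_weights():
            print(layer.name, weights.shape, weights.dtype)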

Images or video frames captured by the camera lens and image sensor are fed into the embedded vision processor. Since it can be difficult for a CNN to handle significant variations in lighting conditions or facial poses, the images are pre-processed to make the faces more uniform. The heterogeneous architecture of a sophisticated embedded vision processor with a CNN engine allows the CNN engine to classify the current image while the vector unit pre-processes the next one – light normalization, image scaling, plane rotation, etc. – and the scalar unit handles the decision making (i.e., what to do with the CNN detection results).
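
A short OpenCV sketch of that kind of per-frame pre-processing is shown below; the rotation angle would come from a face or landmark detector in a real pipeline, and the 48x48 target size and normalization choices are placeholders.

    import cv2
    import numpy as np

    def preprocess_face(frame, angle_deg=0.0, size=48):
        """Make a captured face crop more uniform before it reaches the CNN."""
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

        # Light normalization: spread the intensity histogram to reduce the
        # effect of varying illumination.
        gray = cv2.equalizeHist(gray)

        # Plane rotation: rotate so the face is roughly upright (angle_deg
        # would come from a face/landmark detector in a real pipeline).
        h, w = gray.shape
        rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle_deg, 1.0)
        gray = cv2.warpAffine(gray, rot, (w, h))

        # Image scaling to the fixed input resolution the CNN was trained on.
        face = cv2.resize(gray, (size, size))
        return face.astype(np.float32) / 255.0

    print(preprocess_face(np.zeros((240, 320, 3), dtype=np.uint8)).shape)  # (48, 48)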

Image resolution, frame rate, number of graph layers, and desired accuracy all factor into the number of parallel multiply-accumulates (MACs) needed and the overall performance requirements. Synopsys’ EV6x Embedded Vision Processors with CNN can run at up to 800 MHz on 28-nm process technologies and can perform up to 880 MACs in parallel.
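
To see how those factors combine, here is a back-of-the-envelope sizing calculation; the layer shapes, frame rate, and resolution are made-up numbers for illustration, not a characterization of any particular design.

    # Rough sizing: MACs for one convolutional layer are roughly
    # out_h * out_w * out_channels * kernel_h * kernel_w * in_channels.
    layer1_macs = 46 * 46 * 32 * 3 * 3 * 1      # first 3x3 conv on a 48x48 input
    layer2_macs = 21 * 21 * 64 * 3 * 3 * 32     # second 3x3 conv after pooling

    macs_per_frame = layer1_macs + layer2_macs  # ignoring dense layers for brevity
    frames_per_second = 30
    macs_per_second = macs_per_frame * frames_per_second

    # A CNN engine doing N MACs in parallel at f Hz sustains at most N * f MACs/s.
    parallel_macs, clock_hz = 880, 800e6
    print(f"{macs_per_second:,.0f} MACs/s needed; "
          f"{parallel_macs * clock_hz:,.0f} MACs/s peak available")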

Once the CNN is configured and trained to detect emotions, it can be more easily reconfigured to handle facial analysis tasks like determining an age range, identifying gender or ethnicity, and recognizing the presence of facial hair or glasses.
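
In the training framework, one common way to do that retargeting is to keep the learned feature layers and replace only the final classifier, as in this hedged Keras sketch (the file name and the two-class “glasses” task are placeholders).

    import tensorflow as tf
    from tensorflow.keras import layers

    # Reuse the emotion network's feature layers (placeholder file name) and
    # swap only the final softmax, e.g. for a glasses / no-glasses classifier.
    base = tf.keras.models.load_model("emotion_cnn_final.h5")
    features = tf.keras.Model(base.input, base.layers[-2].output)
    features.trainable = False                  # keep the already-learned weights

    retargeted = tf.keras.Sequential([
        features,
        layers.Dense(2, activation="softmax"),  # new task with two classes
    ])
    retargeted.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    retargeted.summary()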

Summary

CNNs running on embedded vision processors open up new applications for vision processing. Soon it will be commonplace for the electronics around us to interpret our feelings, from toys that detect happiness to electronic teachers that gauge a student’s level of understanding from facial expressions. The combination of deep learning, embedded vision processing, and high-performance CNNs will bring this vision closer to reality.