Designing Neural Network Architectures for Different Applications: From Facial Landmark Tracking to Lane Departure Warning System

Author: YiTa Wu, Vice President of Engineering, ULSee

Abstract

Deep learning is considered a more accurate tool than other machine learning strategies such as decision trees, genetic algorithms, and support vector machines. It stacks layers of perceptrons to form a deep structure and iteratively adjusts the parameters through backpropagation and gradient descent during training. However, to the end user, deep learning models are the “blackest” of all black boxes because it is very difficult to interpret how the system (the neural network architecture) works. In this article, we explain ULSee’s experience of designing a network architecture for multiple applications.

Design the Network Architecture for the Most Complicated Application First

If you want to design a flexible network architecture, we recommend designing it around the most complicated target application. Normally, an architecture that works on a complicated application can also work on a simple one, but an architecture designed for a simple application may have problems with a complicated one. ULSee targeted two applications: facial landmark tracking and lane departure warning. Facial landmark tracking requires high resolution, a large number of facial landmarks, and spatial information to produce highly accurate results, which makes it more complex than the lane line recognition needed for lane departure warning systems. We therefore designed the architecture to meet the facial landmark tracking requirements.

A neural network architecture contains two major parts: feature extraction and inference. To meet the high frame rate requirement, we built the feature extraction part on MobileNet [1], and the last layer of the inference part had 136 outputs, representing the x and y coordinates of 68 facial landmarks, as shown in Figure 1(a). Unfortunately, the 136-output architecture could not achieve high enough accuracy, probably because it did not use spatial information. We then redesigned the inference part around the heatmap idea [2], as shown in Figure 1(b), so that the x and y coordinates could be predicted precisely from the plentiful spatial information. The experimental results of the heatmap design supported our assumption: using spatial information to predict the x and y coordinates gave much higher accuracy than the 136-output architecture. The only problem with the heatmap design was that the processing frame rate became very low, so it could not meet the other critical requirement, real-time processing.

Figure 1. Two different neural network architecture designs (136 outputs and heatmap) for the last layer of the inference part
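
The difference between the two output designs can be illustrated with a minimal NumPy sketch. It is not ULSee's implementation: the function names, the interleaved x/y ordering of the 136 values, and the use of a plain argmax to read a landmark from its heatmap are all assumptions made for illustration.

```python
import numpy as np

NUM_LANDMARKS = 68
HEATMAP_SIZE = 112  # heatmap resolution quoted in the article


def decode_regression(outputs):
    """Interpret a flat 136-vector as 68 (x, y) pairs.

    The x0, y0, x1, y1, ... ordering is an assumption for illustration.
    """
    return np.asarray(outputs, dtype=np.float32).reshape(NUM_LANDMARKS, 2)


def decode_heatmaps(heatmaps):
    """Take the brightest pixel of each heatmap as that landmark's (x, y).

    heatmaps has shape (num_landmarks, height, width).
    """
    n, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    return np.stack([xs, ys], axis=1)


# Synthetic check: place one bright pixel per heatmap and recover it.
rng = np.random.default_rng(0)
true_xy = rng.integers(0, HEATMAP_SIZE, size=(NUM_LANDMARKS, 2))
heatmaps = np.zeros((NUM_LANDMARKS, HEATMAP_SIZE, HEATMAP_SIZE), dtype=np.float32)
heatmaps[np.arange(NUM_LANDMARKS), true_xy[:, 1], true_xy[:, 0]] = 1.0
decoded_xy = decode_heatmaps(heatmaps)
```

In the 136-output design the coordinates are read directly from the output vector, whereas in the heatmap design each coordinate pair is a position inside a 2D map, which is where the spatial information comes from.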

The reason for the low frame rate of the heatmap design can be seen in Figure 2. The final 68 facial landmarks, shown as bright pixels in the 112x112 resolution image, are generated from the 68 heatmaps shown in Figure 2(b). That is, each facial landmark is derived from its own 112x112 heatmap, so the computational complexity is much higher than that of the 136-output architecture.

Figure 2. High computational complexity of heatmap design
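
To give a rough sense of scale, the short calculation below compares how many values the last layer must produce in each design. It counts output values only, using the figures quoted above; the actual number of operations depends on layer details that the article does not specify.

```python
NUM_LANDMARKS = 68
HEATMAP_SIZE = 112

# 136-output design: one x and one y value per landmark.
regression_outputs = NUM_LANDMARKS * 2

# Heatmap design: one full 112x112 map per landmark.
heatmap_outputs = NUM_LANDMARKS * HEATMAP_SIZE * HEATMAP_SIZE

ratio = heatmap_outputs / regression_outputs

print(regression_outputs)  # 136
print(heatmap_outputs)     # 852992
print(ratio)               # 6272.0
```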

To reduce the computational complexity of the heatmap design, we lowered the resolution of each heatmap and combined it with the offset concept [3], as shown in Figure 3(a). We reduced the resolution of each heatmap from 112x112 to 28x28 and added two 28x28 offset maps for each 28x28 heatmap to indicate the offsets in the x and y coordinates, respectively. This increased the frame rate significantly by reducing the number of calculations, but it sacrificed some accuracy in the predicted x and y coordinates of the facial landmarks. To mitigate this effect, we adopted the residual network concept to improve accuracy, as shown in Figure 3(b). In the end, we arrived at a neural network architecture for facial landmark detection with both high accuracy and a high frame rate.

Figure 3. Two advanced neural network architecture designs (offset and residual) for the last layer of the inference part
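
The following NumPy sketch shows one way a 28x28 heatmap and its two offset maps could be combined into a coordinate at the original 112x112 scale. It is an illustration under stated assumptions, not ULSee's implementation: the function name is hypothetical, and the offsets are assumed to be stored in full-resolution pixels relative to the top-left corner of the coarse cell (stride 112 / 28 = 4).

```python
import numpy as np

NUM_LANDMARKS = 68
FULL_SIZE = 112
COARSE_SIZE = 28
STRIDE = FULL_SIZE // COARSE_SIZE  # 4 full-resolution pixels per coarse cell


def decode_with_offsets(heatmaps, offsets_x, offsets_y):
    """Coarse argmax location plus the offset stored at that location.

    All three inputs have shape (num_landmarks, 28, 28). The offset
    convention (pixels from the cell's top-left corner) is an assumption.
    """
    n, h, w = heatmaps.shape
    flat_idx = heatmaps.reshape(n, -1).argmax(axis=1)
    ys, xs = np.unravel_index(flat_idx, (h, w))
    idx = np.arange(n)
    x_full = xs * STRIDE + offsets_x[idx, ys, xs]
    y_full = ys * STRIDE + offsets_y[idx, ys, xs]
    return np.stack([x_full, y_full], axis=1)


# Synthetic check: encode sub-cell positions, then decode them again.
rng = np.random.default_rng(0)
true_xy = rng.integers(0, FULL_SIZE, size=(NUM_LANDMARKS, 2)).astype(np.float32) + 0.25
cells = (true_xy // STRIDE).astype(int)   # coarse (x, y) cell per landmark
residual = true_xy - cells * STRIDE       # offset inside the cell

shape = (NUM_LANDMARKS, COARSE_SIZE, COARSE_SIZE)
heatmaps = np.zeros(shape, dtype=np.float32)
offsets_x = np.zeros(shape, dtype=np.float32)
offsets_y = np.zeros(shape, dtype=np.float32)
idx = np.arange(NUM_LANDMARKS)
heatmaps[idx, cells[:, 1], cells[:, 0]] = 1.0
offsets_x[idx, cells[:, 1], cells[:, 0]] = residual[:, 0]
offsets_y[idx, cells[:, 1], cells[:, 0]] = residual[:, 1]

decoded_xy = decode_with_offsets(heatmaps, offsets_x, offsets_y)

# Output values per design: full-resolution heatmaps vs. coarse heatmaps + offsets.
full_res_values = NUM_LANDMARKS * FULL_SIZE * FULL_SIZE
coarse_values = NUM_LANDMARKS * 3 * COARSE_SIZE * COARSE_SIZE
```

Under these assumptions the last layer produces 159,936 values instead of 852,992, roughly a fivefold reduction, while the offset maps recover the position inside each coarse cell.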

Mapping the Architecture to a Simpler Application – Lane Departure Warning

In an ideal world, system designers could develop a single solution framework that applies to all kinds of applications. To test this, we tried mapping the facial landmark tracking neural network architecture to a lane departure warning application. Figure 4 illustrates the idea: the lane departure warning application is converted to the same problem domain as facial landmark tracking, shown in Figure 4(a), by predicting three points (the yellow and red circles) from an input image, as shown in Figure 4(b).

Figure 4. The concept behind converting a facial landmark application to lane departure warning is just mapping points

Since the neural network architecture shown in Figure 3(b) works well for facial landmark applications, we simply modified its inference layer so that the last inference layer contains three 28x28 heatmaps and six 28x28 offsets in the x and y coordinates, as shown in Figure 5(a). However, the results of this design were not good, so we further modified the network by making it deeper, as shown in Figure 5(b). Unfortunately, the results of that attempt were not accurate enough either.

Figure 5. Two neural network architecture designs that did not provide accurate lane departure warning results

To design a proper neural network architecture for lane departure warning, we considered the properties of neural networks shown in Figure 6. Figure 6(a) shows the two major parts of a deep convolutional neural network architecture: the backbone (feature extraction) and the inference (fully connected) layers. Figure 6(b) shows an example of how features progress from low level to high level through the layers of the feature extraction part. The lanes in a lane departure warning application are either straight lines or curves, which are low-level features, so we needed to increase the number of low-level features instead of increasing the number of network layers.

Figure 6. The low- and high-level features according to the layers of a neural network

Figures 7 and 8 show the final network architecture for the lane departure warning application and the experimental results, respectively. Note that the red, blue, and green lines in Figure 8 indicate the left, right, and ground-truth lane lines, respectively.

Figure 7. The final neural network architecture design for lane departure warning

Figure 8. The experimental results of the lane departure warning system

Conclusion

Although deep learning is a powerful tool in machine learning, its performance still depends on the neural network architecture. Therefore, we need to understand the fundamentals of the problem to design a proper neural network architecture. In this article, we showed one way to design neural network architectures for different applications. The network architecture for facial landmark detection was initially designed with MobileNet as the backbone network and 136 outputs for the x and y coordinates. It was then modified to use low-resolution heatmaps with additional offsets to meet the requirement of real-time processing. The exact same structure could not be used for lane departure warning, because lane departure warning requires more low-level features rather than deeper layers. In this way, network architecture design can proceed from the more complex application to the simpler one, and a proper neural network architecture can be designed for each.

Video: ULSee Real-Time Facial Recognition & Liveness Detection System with ARC EV7x Embedded Vision Processor IP

The ULSee UL100 AI module integrates DesignWare® ARC® EV62 Processor IP and the neural network architecture discussed in this article to perform real-time facial recognition with very low power consumption. The chip performs edge computing for fast facial recognition and liveness detection. The module is deployed for ADAS, facial payments, smart door unlocking, and more.

References

[1] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.

[2] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” ECCV, Oct. 2016.

[3] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, “Towards accurate multi-person pose estimation in the wild,” CVPR, 2017.