Digital imaging technology replaces film with bits and bytes, and image resolution is measured by the number of pixels: the more of these tiny colored dots in an image, the higher its resolution. In a traditional camera, lenses focus light onto film to create the image. In a digital camera, an image sensor, typically either a complementary metal-oxide-semiconductor (CMOS) sensor or a charge-coupled device (CCD), converts light into electrical charges. With a CMOS image sensor, commonly used in smartphones, a color-filter layer provides color, while photodiodes convert the light into electrical signals that ultimately form the digital image. For some applications, like artificial vision and image recognition, a CMOS sensor works with an on-chip image processing circuit to produce the final image. CCD image sensors, which are popular in machine-vision systems, are light-sensitive elements on an IC that integrate the light they receive as accumulated charge, then convert that charge into electrical signals that the camera ultimately outputs as video or still images.
Researchers at the Vienna University of Technology (TU Wien) have developed an ultra-fast image sensor with a built-in neural network that can be trained to recognize certain objects in nanoseconds. Because it does not require a computer to read and process the entire image, the chip, according to its creators, has potential in scientific experiments and other specialized applications. For now, though, neural networks typically run on embedded vision processors or neural processing units (NPUs), where they perform functions such as image quality improvement; object, person, or face identification; and, in more advanced cases, region-of-interest isolation.
Deep-learning neural networks are used in a variety of applications: speech recognition for smart speakers, facial recognition for mobile devices, and pedestrian detection in autonomous vehicles, to name just a few. Their value lies in their ability to identify patterns within datasets, often better than humans can. Several neural network types apply to camera-based applications, helping to sharpen blurry images, deliver more vivid colors, and clean up pixel bleeding. They can also perform specific tasks, like isolating a region of interest. For example, in surveillance systems, neural networks can build feature maps that highlight the most relevant parts of an image. This makes it possible to deliver a sharp image of a person’s face, or to count pedestrians in the street portion of a scene while ignoring the sky. By processing only the parts of the image that are of interest, the algorithms can help reduce the amount of memory and compute resources needed, a key consideration for edge applications.
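To make the resource argument concrete, here is a minimal sketch of region-of-interest cropping in plain Python. The frame contents, the ROI coordinates, and the helper names `crop_roi` and `pixel_count` are all hypothetical and chosen for illustration. In a real pipeline a neural network would propose the region; here it is hard-coded.

```python
def crop_roi(image, top, left, height, width):
    """Return the sub-image covered by the region of interest."""
    return [row[left:left + width] for row in image[top:top + height]]


def pixel_count(image):
    """Count the pixels in a row-major image stored as a list of rows."""
    return sum(len(row) for row in image)


# Hypothetical 8x12 grayscale frame: the top half stands in for "sky" (0),
# the bottom half for the "street" (200).
frame = [[0] * 12 for _ in range(4)] + [[200] * 12 for _ in range(4)]

# Assumed ROI: the bottom half of the frame. A detector would normally
# supply these coordinates.
roi = crop_roi(frame, top=4, left=0, height=4, width=12)

fraction = pixel_count(roi) / pixel_count(frame)
print(f"ROI covers {pixel_count(roi)} of {pixel_count(frame)} pixels ({fraction:.0%})")
```

Any downstream step that scales with pixel count, such as a convolution or a counting model, would then run on only the cropped region rather than the full frame.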
A few key neural networks relevant for vision applications include:
- Convolutional neural networks (CNNs): Among the most popular image processing algorithms, CNNs provide high-accuracy object classification while learning from relatively small datasets.
- Recurrent neural networks (RNNs): Able to learn very complex relationships from sequences of data, RNNs are designed for sequence modeling and, in image processing, can be used for image classification.
- Transformers: Initially used in voice and natural language processing, transformers can provide excellent vision results via self-attention and learn more about the image than CNNs; however, they must train on much larger datasets (often in the cloud).
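The difference between the local view of a CNN and the global view of a transformer can be sketched in a few lines of plain Python. This is an illustration only, under simplifying assumptions: the convolution uses a single hand-picked kernel rather than learned filters, and the self-attention uses identity projections (no learned query, key, or value weights). The function names, the sample image, and the patch vectors are hypothetical.

```python
import math


def conv2d(image, kernel):
    """Valid (no padding) 2D cross-correlation, the core operation of a CNN layer.

    Each output value depends only on a small local window of the input.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    Each output row is a weighted mix of ALL input tokens, so every patch
    can draw on every other patch in the image.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out


# Hypothetical image with a vertical edge between dark (0) and bright (9) columns.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
edge_kernel = [[-1, 1]]  # assumed hand-picked horizontal-gradient kernel
edges = conv2d(image, edge_kernel)
print(edges)  # [[0, 9, 0], [0, 9, 0], [0, 9, 0]]

# Hypothetical patch embeddings standing in for image patches.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(patches)
print(attended)
```

The convolution responds only where the local window straddles the edge, while each attention output blends information from all three patches. The attention step compares every patch with every other patch, so its cost grows quadratically with the number of patches.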
Each type has its pros and cons for camera-based applications. CNNs have proven to be the most production-worthy implementation over the past several years, especially at the edge. More recently, proposed approaches that combine CNNs with transformers and exploit regions of interest are achieving more accurate results.