1. Introduction
With the rapid development and popularization of computers and information technology, people can use end devices (e.g., modern smartphones and the Raspberry Pi) nearly anywhere, and considerable research has therefore been devoted to developing new applications for these ubiquitous devices. Although these applications provide significant benefits to users, their human–machine interfaces are still keyboards, mice, or touch screens (Wang et al., 2017). Hand gesture recognition (Kim & Toomajian, 2016) can provide users with a livelier, more natural, and more convenient human–machine interface for operating and invoking applications on end devices. It can also be used in human–robot interaction to create user interfaces that are natural to use and easy to learn. However, locating the hands and segmenting them from the background in an image sequence remains a challenging problem for hand gesture recognition.
In recent years, many studies (Costante et al., 2014; Dhingra & Kunz, 2019; Kim & Toomajian, 2016; Nanni et al., 2017; Shin & Sung, 2016; Žemgulys et al., 2018; Zou et al., 2018) have used hand gesture recognition models for human–machine interface applications. These models are largely based either on handcrafted features or on feature extraction through deep learning, and they can be divided into static and dynamic gesture recognition. Static gesture recognition methods consider only the spatial features of the hands, whereas dynamic gesture recognition methods extract temporal features in addition to spatial ones.
In contrast to models based on handcrafted features, models based on deep learning (Costante et al., 2014; Dhingra & Kunz, 2019; Shin & Sung, 2016) perform well at learning features automatically from image frames. Deep feature learning provides new insights into gesture recognition, and many researchers have attempted to use deep learning methods to extract gesture features from RGB, depth, and skeleton data. In (Shin & Sung, 2016), a dynamic hand gesture recognition technique was developed using a recurrent neural network (RNN). In (Costante et al., 2014), deep convolutional neural networks (CNNs) and the random forest (RF) algorithm were compared; the results indicated that, given sufficient data, the CNN slightly outperformed RF and achieved significantly better accuracy than the other methods. A deep CNN can learn hand gesture features from single-mode data or from multimodal fusion data. Because appearance and optical-flow sequences are relatively easy to obtain, most deep learning methods adopt these two modalities as input, and depth-based techniques remain comparatively rare.