Construction of a Bi-Modal Database for a Barrier-Free Teaching System

Jiling Tang, Ping Feng, Zhanlei Li
Copyright: © 2019 | Pages: 15
DOI: 10.4018/IJMBL.2019070105

Abstract

This article analyzes the application of Chinese speech recognition technology in a barrier-free education system and studies the construction of a bi-modal database for a barrier-free teaching system. Barrier-free teaching is a term used to define systems that are designed to assist deaf students, while bi-modal approaches use both audio and video to assist learning. Based on a case study of a course named “Foundations of Photoshop,” the article builds a corpus, covering both the acquisition of experimental data and the annotation of the corpus. The authors also analyze and design the organization of the data and build the essential dictionary and grammar network for the recognition system.

Introduction

According to a 2010 survey of disability in China, approximately 20,540,000 people had deafness, which ranked second among all types of disability. Education and teaching research for hearing-impaired students is therefore an important part of the development of special education in China (Yan jiaosheng, 2014).

The use of computerized speech-to-text technology in hearing-impaired education was first proposed by Stuckless in a 1981 paper (Furui, 2005). With strong technical support from IBM, Saint Mary's University, Canada, set up the Barrier-Free Teaching Union to make speech recognition more widely applied.

In China, the earliest study of lip-reading was started by Yao HongXun from Harbin Institute of Technology. She used a lip-color filter extraction method, based on the features of the lip, to identify five vowels. The lip-reading recognition rate of this method can exceed 90%. The Institute of Acoustics of the Chinese Academy of Sciences established the first dual-mode Chinese speech database by analyzing the structure of similar databases at home and abroad. The database reflects the characteristics of Chinese pronunciation, using a selected corpus that covers all the Chinese initials and finals. To an extent, it has helped later researchers in acquiring and processing visual information. Moreover, Wuyi University, Dalian University of Technology, Zhejiang University, Southeast University and other institutions have also pursued lip-reading research (Wudi, 2015). However, there has been no dedicated study of the technology as used in teaching, especially in teaching hearing-impaired students (Jing, 2007). As a result, speech recognition software used in teaching performs well overall, but it has more difficulty with specialized vocabulary and misrecognizes it frequently.

In the Chinese language environment, using dual-mode speech recognition technology that combines audio and video in the teaching system (shown in Figure 1) can improve the recognition rate and robustness of the system. The training process of speech recognition is a statistical process, so ensuring the accuracy of the model requires a substantial amount of audio and video data. In the field of pure speech recognition, many standard speech corpora provide baseline information, which plays a very important role in the development of speech recognition. In the audio-visual field, however, recording requirements differ from one application area to another, and no comparably comprehensive and systematic database exists. Therefore, this paper introduces the construction process of an audio and video database for such a system.

Figure 1. Structure of barrier-free teaching system

Dual-Mode Speech Recognition

The dual-mode speech recognition system with audio and video extends an audio-only, single-modal speech recognition system with a visual subsystem. This subsystem captures facial images of the speaker, locates the face and other main features, extracts the visual features associated with pronunciation, and feeds them to the recognizer together with the acoustic features. A bimodal speech recognition system offers better performance and higher robustness, especially when the application environment contains various kinds of noise, so it is better suited to real-world use.
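
As a rough illustration of how visual features might be passed to a recognizer together with acoustic features, the following Python sketch performs simple feature-level (early) fusion. It is a minimal sketch under stated assumptions, not the method used in the article: the text above does not specify a fusion strategy, and the function names `upsample_visual` and `fuse_features`, the feature dimensions, and the frame counts are all hypothetical. It assumes the video stream has fewer frames than the audio stream, so each visual frame is repeated to line up with the audio frames before the two vectors are concatenated.

```python
from typing import List, Sequence

Frame = List[float]


def upsample_visual(visual: Sequence[Frame], n_audio_frames: int) -> List[Frame]:
    """Repeat visual frames so that there is one visual frame per audio frame.

    Assumption: both streams cover the same time span, so audio frame i
    maps proportionally onto the visual timeline.
    """
    if not visual:
        raise ValueError("visual stream is empty")
    n_visual = len(visual)
    aligned = []
    for i in range(n_audio_frames):
        # Proportional index into the visual stream, clamped to the last frame.
        j = min(i * n_visual // n_audio_frames, n_visual - 1)
        aligned.append(list(visual[j]))
    return aligned


def fuse_features(audio: Sequence[Frame], visual: Sequence[Frame]) -> List[Frame]:
    """Early fusion: concatenate each audio frame with its aligned visual frame."""
    aligned_visual = upsample_visual(visual, len(audio))
    return [list(a) + v for a, v in zip(audio, aligned_visual)]


if __name__ == "__main__":
    # Hypothetical data: 8 audio frames (3 values each), 2 video frames (2 values each).
    audio = [[float(i), 0.0, 1.0] for i in range(8)]
    visual = [[0.1, 0.2], [0.3, 0.4]]
    fused = fuse_features(audio, visual)
    print(len(fused), len(fused[0]))
```

Each fused frame here has five values, three acoustic and two visual, and could in principle be handed to a single recognizer. Other designs, such as combining the scores of separate audio and visual recognizers, are equally plausible readings of the description.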
