Speech Technology

Speech is the primary means of communication between people. Speech technology comprises the technologies designed to duplicate and respond to the human voice.

Since the invention of computers, researchers have pursued the mechanical realization of human speech capabilities, the automation of tasks through human-machine interaction, and automatic speech recognition by machines.

During the 1950s and 1960s, the first generation of speech technology, computers were able to recognize vowels and consonants, as well as monosyllabic words, largely from a single human speaker, as recognition of differing human speech patterns was still a long way off. During the second generation, from the late 1960s through the 1970s, there was some progress toward solving the problem of non-uniformity in human speech. Using template-based speech technology, computers were able to recognize words, and even respond to simple database queries.
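To illustrate what template-based recognition can look like, here is a minimal sketch in Python. It uses dynamic time warping (DTW), a well-known template-matching technique; the text above does not name a specific method, so treat DTW as an assumption chosen for illustration. The one-dimensional "feature" sequences and the word templates are invented toy data, and the function names are hypothetical.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    DTW aligns sequences that differ in speed, which is one way to cope
    with the same word being spoken faster or slower.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def recognize(utterance, templates):
    """Return the word whose stored template is closest to the utterance."""
    return min(templates, key=lambda word: dtw_distance(utterance, templates[word]))


# Toy templates (invented values standing in for acoustic features).
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 1.0, 2.0],
}

# The same contour as "yes", spoken more slowly.
slow_yes = [1.0, 1.0, 3.0, 3.0, 4.0, 4.0, 3.0, 1.0]
best_match = recognize(slow_yes, templates)
```

Because the alignment may repeat elements, the slowed-down utterance still matches its template closely even though the two sequences have different lengths.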

Beginning in the late 1980s, a shift in methodology from a template-based approach to a statistical modeling framework ushered in the third generation of speech technology. Computers were able to recognize a fluently spoken string of connected words. The technology progressed more rapidly in the 1990s. Computer-human speech became more conversational and spontaneous, and computers were less likely to be tripped up by the partial words, hesitations, and word repairs that are common in human speech. The 2000s brought more efficient detection of sentence boundaries, fillers, and disfluencies, and computers were able to recognize natural, unconstrained human-to-human speech, such as that from radio and television broadcasts, and even foreign conversational speech in multiple languages. Computers are even being taught to recognize facial cues in human speech.

Computers have become far more efficient at duplicating human speech, as well, although most people are still able to tell the difference between computer and human speech. The lines are rapidly becoming blurred, however.

Automatic speech recognition systems are used for call processing in telephone networks and in query-based information systems. The technology is also used to assist the voice-disabled, the hearing-disabled, and the blind, as well as to communicate with computers without using a keyboard. Apple's Siri and Amazon's Alexa are good examples of the technology. Game software has also been greatly enhanced through speech technology.

Just as there are numerous uses for speech technology, there are several subfields, including speech synthesis, speech recognition, speaker recognition, speaker verification, speech encoding, and multimodal interaction.

Speech synthesis refers to the process of generating spoken language by a machine on the basis of written input. The first computer-based speech-synthesis system was created in the late 1950s, although the first English text-to-speech system was developed in Japan in 1968. Modern uses include utilities designed for the vision-impaired, reading the text of emails and webpages, as well as the e-book reader in the Amazon Kindle devices.

Speech recognition is the ability of a computer to identify and respond to human speech. Some of these systems require training, where a user reads text or isolated vocabulary into the system, allowing it to analyze the person's specific voice. Siri and Alexa both use this technology as well.

Speaker recognition is the technology that allows a computer to identify a person from the characteristics of that individual's voice. Speaker verification builds on the technologies developed through speaker recognition and uses them to accept or reject the identity claimed by the speaker. Some banking systems use speaker verification in their customer call centers.
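The difference between identifying a speaker and verifying a claimed identity can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each voice is represented by a short numeric "voiceprint" vector with invented values, similarity is measured with cosine similarity, and the 0.9 acceptance threshold is arbitrary. Real systems derive such vectors from audio and tune the threshold; all names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def identify(sample, enrolled):
    """Speaker recognition: pick the enrolled speaker most similar to the sample."""
    return max(enrolled, key=lambda name: cosine_similarity(sample, enrolled[name]))


def verify(claimed_id, sample, enrolled, threshold=0.9):
    """Speaker verification: accept or reject a claimed identity."""
    voiceprint = enrolled.get(claimed_id)
    if voiceprint is None:
        return False  # unknown identity, nothing to compare against
    return cosine_similarity(sample, voiceprint) >= threshold


# Invented voiceprints for two enrolled speakers.
enrolled = {
    "alice": [0.9, 0.1, 0.3],
    "bob": [0.1, 0.8, 0.5],
}

# A new sample that resembles alice's voiceprint.
sample = [0.85, 0.15, 0.35]

who = identify(sample, enrolled)
alice_accepted = verify("alice", sample, enrolled)
bob_accepted = verify("bob", sample, enrolled)
```

Identification answers "who is this?" by comparing against every enrolled speaker, while verification compares against a single claimed identity and returns a yes-or-no decision.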

Speech encoding is the compression of speech into a compact code for transmission. It is performed by speech codecs, which rely on audio signal processing and speech processing techniques. Mobile phones use speech encoding technologies.
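As a simple illustration of compressing speech samples into a code, the sketch below applies mu-law companding, a classic technique from digital telephony, to map each sample onto one of 256 codes (8 bits). This is an assumption chosen for illustration: the text above does not name a codec, and the codecs used in mobile phones are considerably more sophisticated. The function names are hypothetical, and samples are assumed to be normalized to the range -1 to 1.

```python
import math

MU = 255  # companding parameter; 255 yields the familiar 8-bit mu-law curve


def mulaw_encode(x):
    """Map a sample in [-1, 1] to an 8-bit code in 0..255."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    # Logarithmic compression: quiet samples get finer resolution.
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return int(round((y + 1) / 2 * 255))


def mulaw_decode(code):
    """Map an 8-bit code back to an approximate sample in [-1, 1]."""
    y = code / 255 * 2 - 1
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


# Encode a few samples and reconstruct them.
samples = [-0.5, -0.01, 0.0, 0.01, 0.5]
codes = [mulaw_encode(s) for s in samples]
restored = [mulaw_decode(c) for c in codes]
```

The reconstruction is approximate rather than exact. The logarithmic curve spends more of the 256 codes on quiet sounds, where the ear is more sensitive to error, than on loud ones.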

Through multimodal interaction, users are provided with multiple modes of interacting with a computer system. As an example, a multimodal question answering system might allow for text or photo at both the question and answer level.

Speech technology involves both hardware and software, and advances in both areas have resulted in the computer-human speech systems we have today and will drive further improvements in the future. Topics related to either the hardware or the software designed specifically for speech technology are the focus of this category. Websites that discuss the technology itself are also appropriate here.

 

 
