First Single-Chip Speech Synthesizer


Monolith to Monolith

On June 11, 1978, Texas Instruments announced that it had developed a new speech synthesis monolithic integrated circuit.


It marks the first time the human vocal tract has been electronically duplicated on a single chip of silicon. Measuring 44,000 square mils, the chip is fabricated using TI's low-cost metal-gate P-channel MOS process, the same process used for TI calculator MOS ICs.


The speech synthesis MOS/LSI integrated circuit, along with two 128K-bit dynamic ROMs (each with the capacity to store over 100 seconds of speech) and a special version of the TMS 1000 microcomputer, all developed by TI, serves as the main electronics for the new talking learning aid, Speak & Spell(TM), for seven-year-olds and up.


The new TI consumer product was introduced at the Summer Consumer Electronics Show in Chicago, June 11-14. Speech encoding is achieved through pitch-excited Linear Predictive Coding (LPC). As the name implies, LPC is based on a linear equation that formulates a mathematical model of the human vocal tract, and on the ability to predict a speech sample from previous ones. Linear Predictive Coding is a technique for analyzing and synthesizing human speech by determining, from the original speech, a description of a time-varying digital filter that models the vocal tract. This filter is then excited by either periodic or random inputs.
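
In other words, each new speech sample is estimated as a weighted sum of the samples that preceded it. A minimal sketch of the predictor idea in Python (illustrative only; the function and coefficient names are hypothetical, not TI's implementation):

    def predict_sample(history, coeffs):
        # Estimate the next speech sample as a weighted sum of the
        # previous len(coeffs) samples:
        #   s[n] ~ a1*s[n-1] + a2*s[n-2] + ... + a10*s[n-10]
        # history holds past samples, oldest first.
        return sum(a * s for a, s in zip(coeffs, reversed(history)))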


An on-chip 8-bit digital-to-analog (D/A) converter transforms digital information processed through the filter into synthetic speech. Codes for twelve synthesis parameters (10 filter coefficients, pitch and energy) serve as inputs to the synthesizer chip. These codes are stored in a ROM and, once decoded by on-chip circuitry, represent the time-varying description of the LPC synthesis model. Inputs to the digital filter take two forms: (1) periodic and (2) random. The periodic inputs are used to reproduce voiced sounds, which have a definite pitch, such as vowels or voiced consonants such as Z, B or D. A random input models unvoiced sounds such as S, F, T and SH.
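
A rough sketch of the two excitation types (a simplified illustration assuming NumPy; the function name and parameters are invented for this example, not taken from TI's design):

    import numpy as np

    def make_excitation(n_samples, pitch_period, energy, voiced, seed=0):
        # Voiced: a periodic impulse train, one pulse per pitch period.
        # Unvoiced: pseudo-random noise. Both scaled by the frame energy.
        rng = np.random.default_rng(seed)
        if voiced:
            e = np.zeros(n_samples)
            e[::pitch_period] = 1.0
        else:
            e = rng.choice([-1.0, 1.0], size=n_samples)
        return energy * e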


The speech synthesis chip has two separate logic blocks which generate the voiced and unvoiced excitation. The output of the digital filter drives a D/A converter, which in turn drives a speaker. Key to TI's high-quality LPC speech synthesizer is an advanced 10-stage lattice filter, which has an integrated array multiplier, an adder coupled to the multiplier output, and various delay circuits coupled to the adder output. With this increased computational sequencing capability and a fast, continuous data transfer rate, the multiplier can accept two inputs every five microseconds. Twenty multiply-and-accumulate operations are needed to generate each speech sample, and the circuit can generate up to 10,000 speech samples per second. The chip is operated at an eight-kilohertz rate for the Speak & Spell. This 10th-order Linear Predictive Coding (LPC-10) speech synthesizer IC accurately reproduces human speech from stored or transmitted digital data.
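
A software sketch of a 10-stage all-pole lattice filter (a textbook lattice structure with invented names, not TI's actual circuit) shows where the figure of twenty multiply-accumulates comes from: each of the ten stages performs two multiplies per sample, so one multiply every five microseconds (200,000 multiplies per second) works out to the 10,000-samples-per-second ceiling quoted above.

    def lattice_step(excitation, k, g):
        # One output sample of an M-stage all-pole lattice synthesis
        # filter. k holds the reflection coefficients for stages 1..M;
        # g is the backward-path delay state (length M + 1), carried
        # over between samples.
        M = len(k)
        f = excitation
        for m in range(M, 0, -1):
            f = f - k[m - 1] * g[m - 1]      # multiply-accumulate 1
            g[m] = g[m - 1] + k[m - 1] * f   # multiply-accumulate 2
        g[0] = f      # state for the lowest stage on the next sample
        return f      # this value then drives the 8-bit D/A converter

    # Usage sketch: k = [...] ten reflection coefficients; g = [0.0] * 11.
    # At the Speak & Spell's eight-kilohertz rate this loop would run
    # 8,000 times a second, within the 10,000-samples-per-second limit.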

At Bell Laboratories

Staff have developed a text-to-speech synthesizer that is highly intelligible in several languages, including English, German, French, Spanish, Russian, Chinese, and Navajo. The finest module in the synthesizer is the pronunciation module, which enables it to pronounce words and names as well as any educated American would. Yet, although capable of both reading and generating such complex text as e-mail or newspaper stories, the synthesizer does not replicate the human voice. It has a distinct "machine" sound. Which of the stages of the synthesis process accounts for this fault? Not one but many of the stages require improvement before we succeed in producing humanlike speech.


There are problems at both the text-analysis and the speech-synthesis stages. The greatest dilemma facing synthesis researchers, as well as those working on automatic speech recognition, is the machine's inability to comprehend what it is saying or hearing. This, of course, is part of the greater problem of artificial intelligence, which at present is very limited. Even so, a machine has been "taught" to play high-level chess and can defeat most human players. Compared to the problem of language understanding, however, chess is quite simple. Language acquisition is more analogous to the game of Go, as there are an enormous number of possible combinations of moves in the game and of sentences in the language. Go has approximately 10^768 sequences of moves, a number that is many orders of magnitude larger than the number of atoms in the universe.


Due to this complexity, machines programmed for Go play at only an elementary or novice level. The same holds true for machine language understanding. A computer can only perform tasks requiring very limited understanding. It can maintain a dialogue about ordering a pizza but not about a subject matter that has not been previously defined.


Consequently, when the computer reads a text, it may err in its analysis of hierarchical segmentation and assignment of sentence stress. Since pitch is largely determined by segmentation and stress, incorrect information about these elements can result in unintelligible speech. To minimize the effects of such errors, we limit the range of pitch movement. Although the synthesizer sounds more realistic than people trying to impersonate computers, it still sounds very mechanical. When we can annotate text to specify phrase structure and focus, or generate text with a computer whose range of pitch can expand to match the range of human speakers, synthetic speech will sound better.
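
One way to picture that pitch-range limiting (a hypothetical illustration, not Bell Labs' published method) is to compress the fundamental-frequency contour toward its median, so that wrongly placed pitch movements do less damage:

    import numpy as np

    def compress_pitch_range(f0, factor=0.5):
        # Shrink an F0 contour toward its median; frames with f0 == 0
        # are unvoiced and left alone. factor < 1 narrows the range.
        f0 = np.asarray(f0, dtype=float)
        voiced = f0 > 0
        median = np.median(f0[voiced])
        return np.where(voiced, median + factor * (f0 - median), 0.0)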


Other causes of the poor quality of synthetic speech arise from our inability to model the duration of the phonemes and the movement of the pitch as accurately as is needed to imitate human speech. More important, we still cannot analyze speech and use the resulting parameters in a way that accurately copies the human sound of the speech. At present, it is difficult to predict when we will solve these problems and build computers that sound like HAL.


Researchers in speech synthesis are now working in an area not portrayed in 2001. In the film, HAL is portrayed as a large machine whose connection to the world is a large red eye. At Bell Labs, researchers have attached a talking face to their computer, which simultaneously sends the same information to the synthesizer and to the talking head. The talking head thus receives information about the phonemes and their durations and uses it to compute the appropriate positions of its lips, jaw, and tongue. It also moves its eyebrows to emphasise the stressed portions of the speech. Although the talking head is a flat mask, it can be covered by a textured face mask portraying any person you choose. The talking face not only makes the speech synthesizer more attractive and personable, it also enhances the intelligibility of the speech by letting the listener lip-read while listening to the computer.


If HAL had had a real face, rather than one large red eye, would it have been so easy to turn him off, I wonder?

Acknowledgments

I would like to thank my wife, Bell Labs, Elian Informatics, HandP, and Texas Instruments; more to follow.

Today and Beyond: Conclusions


Speech synthesis has been developing steadily over the last decades, and it has been incorporated into several new applications. For most applications, the intelligibility and comprehensibility of synthetic speech have reached an acceptable level. However, in the fields of prosody, text preprocessing, and pronunciation there is still much work to be done to achieve more natural-sounding speech. Natural speech has so many dynamic changes that perfect naturalness may be impossible to achieve. However, since the markets for speech-synthesis-related applications are growing steadily, the interest in putting more effort and funding into this research area is also increasing.


Text-to-Speech Synthesis
It is quite clear that there is still a very long way to go before text-to-speech synthesis, especially high-level synthesis, is fully acceptable. However, development is proceeding steadily, and in the long run the technology seems to make progress faster than we can imagine. Thus, when developing a speech synthesis system, we may use almost all the resources available, because in a few years today's high-end resources will be available in every personal computer. Regardless of how fast the development process is, speech synthesis, whether used in low-cost calculators or in state-of-the-art multimedia solutions, probably has a most promising future.


Speech Recognition
If speech recognition systems someday reach a generally acceptable level, we may develop, for example, a communication system in which the system first analyzes the speaker's voice and its characteristics, transmits only the character string with some control symbols, and finally synthesizes the speech with an individual-sounding voice at the other end. Even interpretation from one language to another may become feasible. However, it is obvious that we must wait several years, maybe decades, before such systems are possible and commonly available.


GC/AC/2004

