A Short History of Synthesized Speech
Created | Updated Mar 21, 2004
Personal Notes
http://www.bbc.co.uk/dna/h2g2/alabaster/A692057
A Monolith, a Computer, and Two Astronauts.
In the 1960s, while watching a movie about two astronauts, a computer, and a monolith, I first encountered a computer that could speak. The computer, named HAL, not only spoke; he was friendly and understanding. One of the many novel aspects of HAL was his voice. Before HAL, a speaking computer was deliberately given a mechanical, "robotic" voice; this was the viewer's cue that the computer was speaking. HAL spoke in an attractive, mellow, expressive tone quite different from the usual mechanical monotone attributed to computers in that period. HAL's warm emotional nature was even more striking when contrasted with the demeanors of his traveling companions. The astronauts were cool scientists whose faces and voices were devoid of emotional expression. Their lack of human emotions accentuated the effect of HAL's warm and emotional voice.
For most of us in the audience, a computer was something out of science fiction.
A computer's typical embodiment was an array of tall cases containing spinning tapes, a large box for the computer's memory and CPU, and machines that printed out page after page of wide sheets filled with numbers and obscure symbols. In all likelihood, this was the extent of the audience's real-life experience with computers. I used one in the seventies; interacting with it was extremely cumbersome, using decks of punched cards to submit information and receiving large numbers of punched cards or printed pages in return. In the eighties I used the IBM PCjr, whose interface was a keyboard and a video screen, but it was still ungainly and slow to use.
A Short History of Synthesized Speech...
Early Mechanical Models.
Human fascination with talking machines is not new. For centuries, people have tried to empower machines with the ability to speak; prior to the machine age humans even hoped to create speech for inanimate objects. The ancients attempted to show that their idols could speak, usually by hiding a person behind the figure or channelling voices through air tubes.
The first recorded scientific attempts to construct talking machines came in the eighteenth century. Although mechanical talkers like these are still occasionally constructed today, they are generally used as measurement tools rather than as talking machines.
In 1779, C.G. Kratzenstein, working for the Imperial Academy of St. Petersburg, constructed a device that produced vowel sounds (/ a /, / i /, / o /, ...) by blowing air through a reed into a variable-resonance chamber that resembled a human vocal tract.
In 1791, W. von Kempelen constructed a device capable of speaking whole utterances. It consisted of bellows that forced air through a reed to excite a resonance chamber. The shape of the resonance chamber was manipulated by the fingers of one hand to produce different vowel sounds; consonant sounds were produced by different chambers controlled by the other hand.
In the mid-1800s, Sir Charles Wheatstone built an improved version of von Kempelen's machine.
In 1937, R. Riesz constructed a more sophisticated mechanical talking machine. Using a similar arrangement of air flow through a reed, this machine could change the reed length to create the intonation, or melody, of speech. The user employed finger-controlled sliders to modify the shape of the tube simulating the vocal tract.
Speech - Theoretical Considerations.
Before discussing synthetic voices in more detail, we need to introduce certain basic concepts of speech production.
We begin by defining speech as a sound signal used for language communication.
Superficially, the speech signal is similar to a sound produced by a musical instrument, although it is more flexible and varied. When we speak, we push air from our lungs through the vocal cords, sometimes tightening the cords to make them vibrate as the air passes over them -- like the reed of a musical instrument such as a clarinet. In the clarinet, the pitch of the sound is changed by closing and opening holes in the body of the instrument, which causes the column of air in the instrument to become longer or shorter. When we speak, however, we change the pitch by loosening and tightening our vocal cords. We can also completely relax our vocal cords to produce voiceless sounds such as / s / or / sh /. The capacity to produce both pitched (or voiced) sounds and noise-like (or voiceless) sounds with a single instrument is something musical instruments generally lack.
Our greatest flexibility, however, comes from the innate ability to vary the shape of our instrument, the vocal tract. Most musical instruments are rigid structures and so produce a sound with a unique color or timbre associated with their particular class of instruments; thus a clarinet has a sound that is distinct from the sound of a trumpet or a violin. The descriptive words color and timbre refer to the sound quality rather than the pitch range or loudness of instruments. We humans, by contrast, can change the shape of our oral cavity by moving our tongue, lips, and jaw, thus creating a variety of sound colors.
For example, the sound of / oo / in the word boot is "dark" and muffled compared to the sound of the / ee / in a word like beet, which has a bright sound. In addition to / oo / and / ee /, two of the vowel sounds, there are consonant sounds such as / l /, / r /, and / m /. This human facility to produce a variety of sounds is the basis for our ability to speak. By combining a small number of sounds to produce a large number of words, we can produce an unlimited number of sentences. We call the different sounds that make up language phonemes.
A speech signal and its constituent phonemes can be given visual form with a sound spectrogram, commonly known as a voiceprint. The term voiceprint, coined by a manufacturer of the machines used to display spectrograms, was intended to associate them with fingerprints, which are uniquely reliable means of identification.
In the 1970s, police departments bought spectrogram machines and used them for forensic purposes. Speech scientists, however, opposed this practice, because they believed the spectrogram was not reliable legal evidence. Eventually the judicial and forensic use of spectrograms disappeared. Today, computers can reliably perform voice verification, not by using a spectrogram but with techniques borrowed from Automatic Speech Recognition. Although spectrograms are extremely useful for visualizing speech events, they are still too complex for computers to extract the appropriate information from them.
The sound spectrogram shows many aspects of the speech signal. The light blue regions corresponding to / k /, / p /, and / b / show that the vocal tract is completely closed to pronounce the stop phonemes. In the vowel regions -- / a /, / i /, / o / and / e / -- as well as in the / r / and / l / regions, the repeated vertical lines indicate segments in which the vocal cords vibrate, causing a voiced speech signal. These segments contrast with the region corresponding to the latter part of the / k /, where such lines are not apparent, indicating that the sounds are voiceless.
A number of thick colored horizontal lines also appear in the voiced sections; they show the loudness of different frequencies at different times and represent the frequencies at which the sound is reinforced by the vocal tract. These resonances of the vocal tract are known as speech formants. The different configurations of the colored regions represent differences in the color or timbre of the sounds. When the colored regions appear in the higher frequencies (the higher areas of the spectrogram), such as during the vowel / a / in the word and or / i / in the word hit, the sound is brighter, while segments devoid of energy in the higher frequencies -- such as during the / l / or / o / -- tend to sound more muffled. You can simulate this effect by turning down the treble control on your stereo amplifier and observing the reduction of energy at higher frequencies.
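To see these formants and voiced regions for yourself, a spectrogram can be computed with standard signal-processing tools. The sketch below is a minimal example, assuming only that a short mono recording exists in a file called speech.wav (the filename is a placeholder).

    # Minimal spectrogram sketch; "speech.wav" is a placeholder filename.
    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.io import wavfile
    from scipy.signal import spectrogram

    rate, samples = wavfile.read("speech.wav")          # sample rate (Hz) and raw samples
    freqs, times, power = spectrogram(samples.astype(float), fs=rate,
                                      nperseg=256, noverlap=192)  # short analysis windows

    plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12))    # energy in dB
    plt.xlabel("Time (s)")
    plt.ylabel("Frequency (Hz)")
    plt.title("Sound spectrogram")
    plt.show()

The horizontal bands of energy that appear in the voiced regions of such a plot are the formants discussed above.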
Electroacoustic Models
In the late nineteenth century, before tools like the spectrogram were available for studying the speech signal, Hermann von Helmholtz and other scientists studied the relationship between the spectrum and the resultant sound. They postulated that speech-like sounds could be produced by carefully controlling the relative loudness of different regions of the spectrum and that, therefore, speech could be generated by electrical means instead of by mechanically replicating the vocal tract. Helmholtz also studied the influence of the shape of different cavities on their resonance frequencies. Early in the twentieth century, J. Q. Stewart, among others, built a device to test these theories. Stewart's machine consisted of two coupled resonators excited by periodic electrical impulses. By tuning these resonators to different frequencies, he produced different vowel-like sounds.
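Stewart's arrangement translates almost directly into a few lines of modern code: a periodic impulse train stands in for the electrical excitation, and two second-order digital resonators stand in for his tuned circuits. This is only a rough sketch; the resonators are cascaded here rather than coupled, and the formant frequencies and bandwidths are illustrative guesses, not Stewart's values.

    # Rough sketch of Stewart-style vowel generation: an impulse train
    # passed through two digital resonators.  All numbers are illustrative.
    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                         # sample rate in Hz
    f0 = 120                           # pitch of the impulse train in Hz
    n = int(fs * 0.5)                  # half a second of sound

    source = np.zeros(n)               # periodic impulse excitation ("buzz")
    source[::fs // f0] = 1.0

    def resonator(signal, freq, bandwidth, fs):
        """Second-order digital resonator: the electrical analogue of one
        vocal-tract resonance (one formant)."""
        r = np.exp(-np.pi * bandwidth / fs)
        theta = 2 * np.pi * freq / fs
        return lfilter([1.0 - r], [1.0, -2 * r * np.cos(theta), r * r], signal)

    # Two resonances in cascade, tuned very roughly like the vowel /a/.
    vowel = resonator(resonator(source, 800, 80, fs), 1200, 90, fs)

Writing the result to a sound file and listening to it gives a buzzy, vowel-like tone; retuning the two frequencies changes which vowel it suggests.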
In the 1930s, H. Dudley, R. Riesz, and S. Watkins constructed an electrical analog of von Kempelen's machine. This machine, the Voder, was displayed at the 1939 World's Fair. Like its mechanical predecessors, the Voder was operated manually, but its operator used a keyboard to control the relative loudness of the different regions of the spectrum instead of changing the shape of an artificial vocal tract, as in earlier machines. An electrical sound generator excited the spectral shaping apparatus. The Voder, the first electronic machine capable of producing speech, is the basis for today's acoustic synthesizers. It generated speech sounds but was not a true speaking machine, since a human operator controlled it.
http://www.obsolete.com/120_years/machines/vocoder/
"Parallel Bandpass Vocoder" (1939) Homer.W. Dudley: speech analysis and re-synthesis.
"The Voder speech synthesizer"(1940) Homer.W. Dudley: a voice model played by a human operator.
We have now seen that speech is made up of a combination of different sounds, or phonemes, and that we can generate speech-like sounds with electronic resonators that simulate the formants of the speech signal. Since a specific configuration of formants can simulate a given phoneme, we should be able to synthesize speech by configuring the frequencies of a set of resonances to produce the desired sequence of phonemes that make up a given speech signal. In practice, however, producing a complete speech utterance by simply connecting the different phonemes is a tricky process.
When we utter the sound of a phoneme, we move our articulators (lips, tongue, etc.) to shape the vocal tract to produce the desired sound. To say the vowel / ee / in the word beet, we move our tongue forward and raise it so it almost touches the roof of the mouth; when we say / a /, as in father, the tongue recedes to the back of the mouth and is lowered, along with the jaw. When we want to say an / a / followed by an / ee /, we produce a smooth transition from the / a / configuration of the articulators to the / ee / configuration by raising the jaw and moving the tongue forward and up. The motion of the tongue and the jaw is not instantaneous; there is a gap between the vowels in which the sound is neither / a / nor / ee / but something in between.
This can also be explained by observing the formants in a spectrogram. The first formant for / a / is quite high (850 Hz) for the range of the first formant, which is typically 250 to 900 Hz, and the second formant is low (just above the first formant). For / ee /, the first formant is extremely low, while the second formant is extremely high. Thus, when / a / is followed by / ee /, the first formant descends while the second formant rises. During the transition period, when the formants are moving from one configuration to the other, the sound is a mixture of the preceding and following sounds. This mix is clearly visible in the spectrogram for the word error, where the formants move smoothly between the different phonemes of the word. To synthesize an / a / followed by an / ee /, therefore, we have to model the motion of the articulators or the formants very accurately.
Another difficulty with configuring the articulators or formants for each phoneme arises when we utter very short vowels and the articulators are not able to move quickly enough to form the appropriate vocal-tract shape of the vowel. This can be seen by observing the short vowel /i/ in two different syllables, bil and dic. In the syllable dic, the second formant moves only slightly from the surrounding consonants; in the syllable bil, however, the second formant has to rise from the /b/ to the /i/ and fall again for the /l/. Since the vowel is short, the second formant cannot rise fast enough before it begins falling again, so it never reaches the position of the second formant in dic. The spectrogram of the two syllables demonstrates that the vowel in the two syllables is indeed different. In the context of the syllable bil, listeners do not notice that the /i/ falls short of its target; they abstract the phoneme from the context and perceive it as an appropriate /i/. This motion of the articulators and their failure to attain the proper vocal-tract shape is called coarticulation. To synthesize speech that sounds human, we need to model these effects carefully. If formants move too much or too quickly, the resulting speech will sound unnatural and over-articulated. If they move too slowly or do not move far enough, the speaking machine will sound tongue-tied or drunk.
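The /a/-to-/ee/ example above amounts to interpolating between two sets of formant targets over a short transition. The sketch below shows only that interpolation step; the target frequencies and the transition time are rough, illustrative values, and in a full synthesizer each frame of values would drive resonators like those described earlier.

    # Sketch of the /a/-to-/ee/ formant transition: linear interpolation
    # between illustrative formant targets.
    import numpy as np

    frame_rate = 200                   # parameter frames per second
    a_targets = (850.0, 1150.0)        # F1, F2 for /a/ (approximate)
    ee_targets = (270.0, 2300.0)       # F1, F2 for /ee/ (approximate)
    frames = int(0.08 * frame_rate)    # an 80 ms transition

    f1 = np.linspace(a_targets[0], ee_targets[0], frames)   # F1 falls
    f2 = np.linspace(a_targets[1], ee_targets[1], frames)   # F2 rises

    for i in range(0, frames, max(1, frames // 4)):
        print(f"frame {i:2d}:  F1 = {f1[i]:6.0f} Hz   F2 = {f2[i]:6.0f} Hz")

Making the transition too short reproduces the over-articulated effect described above, and making it too long the tongue-tied one.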
Before creating rules to control a speaking machine, it was necessary to develop methods of reproducing a human speech signal. At the turn of the century, two devices existed that could convert an acoustic speech signal (air vibrations) into an electrical signal with a microphone and change an electrical signal back into an acoustic signal through a loudspeaker.
These two devices -- the telephone for speech transmission and the phonograph for speech or sound storage and playback -- could store or transmit the signal but could not manipulate or alter it much. Operators could distort the signal or equalize it by boosting or reducing the bass or treble but could not convert the sound of one phoneme into another or change the pitch without changing the speed of the speech and its spectrum. To attain this flexibility, it was necessary to have independent control over the excitation of the signal, the pitch, the loudness, and the spectrum.
In 1951, F. S. Cooper, A. M. Liberman, and J. M. Borst attempted to recreate human speech with an electrical device that controlled a variable spectrum, using a black-and-white version of the sound spectrogram. Because the spectrogram is a visual recording of a sound signal, they used light to generate the speech. Their machine consisted of a light source and a rotating wheel with fifty concentric circles of variable density, which generated the different harmonics of the source signal. The light beams representing the different harmonics were aimed at the appropriate regions of a sound spectrogram. The intensity differences of the lines in the sound spectrogram varied the amount of light transmitted as the spectrogram moved through the beams. The light was converted to an electric current and then converted again, by a speaker, into an acoustic signal. In this way, the machine was able to speak the information encoded in the sound spectrogram. The device proved that speech could be generated electrically by a machine using time-varying parameters to control a spectral filter.
However, the ultimate aim of the project to build a speaking machine was to generate speech by defining a set of rules, not just to reproduce previously spoken words.
A major obstacle to using a spectrogram-reading machine as a component of such a system is its need for fifty different control parameters to reproduce speech. Generating speech by creating rules to control fifty different parameters is too complex; a simpler model was needed for controlling the time-varying spectral component of a synthesizer.
J. Holmes experimented with recreating speech by controlling the frequencies of the resonances. First, he carefully analyzed short speech segments and manually determined the formant values for each segment. He then used the data gathered from his analysis to re-create the speech signal. Holmes's experiment demonstrated that if we could predict how formants change over time for a desired phoneme sequence, we could program a machine to speak. Since then, several researchers have introduced techniques for automatically analyzing the parameters that control the time-varying spectral filter.
These techniques are extremely useful for encoding speech at a reduced storage and transmission rate and have provided a basis for studying methods of creating rules for generating speech by machine. Once this theoretical groundwork was established, we could begin to conceive of ways to generate speech by machines. Early work had consisted of creating specialized circuitry to control synthesizers.
When digital computers became available, however, research progressed rapidly. These computers made it possible to program a machine with independent control of the pitch, loudness, and spectrum and to compute time-varying parameters to control them. Researchers constructing talking machines then faced two issues: what parameters to use, and how to generate these parameters for a given sequence of phonemes. They investigated two ways to generate the synthesis parameters: one method employs rules to generate the parameters, while the other uses stored data.
Synthesis by Rules.
The choice of parameters is extremely important to developing rules for speech synthesis. Some scientists hold that the best basis for developing rules is the geometry of the human vocal tract itself. There is a good deal of information about the articulators and their movements during speech, because both are subject to physical constraints. Some researchers have studied the geometry of the vocal tract, especially the tongue, through X-ray movies of people speaking. However, the danger of prolonged exposure to X-rays, even X-ray microbeams, means that only a limited number of such films are available.
Other researchers have tried to map the geometry of the vocal tract by analyzing the speech signal itself. This is still a topic of ongoing research, although no satisfactory solutions have yet been formulated. The air flow through the vocal tract is still not fully understood, due to the complex geometry of the vocal tract. In addition, the fact that the walls of the vocal tract (particularly the cheeks and soft palate) are not rigid contributes to the difficulty of computing airflow.
Still other researchers have attempted to apply ad hoc rules and simplified geometries of the vocal tract. Although they have been able to produce machine speech, its quality is lower than that yielded by other methods of synthesis.
Finally, one group of speech scientists worked to formulate rules for synthesizing speech by using more accessible parameters, in particular the resonances of the vocal tract, the formants. By observing spectrograms or computing the frequencies of formants of spoken utterances, these researchers have derived rules for synthesizing the phonemes within their contextual dependencies and for creating the transitions between the phonemes. So far, using the formant frequencies as the parameters for synthesis is the most successful approach.
Synthesis from Stored Segments.
An alternative method of producing computer speech stores small segments of speech to retrieve when they are needed. Storing whole sentences or phrases is impractical, and even saving words is not feasible; there are too many of them and new ones are constantly being added to the language. Storing words would also leave unsolved the problem of connecting the individual words together; although a word is a linguistic unit, acoustically there are no apparent breaks between words and only unclear delineations of word boundaries.
There are, however, certain applications with limited vocabulary needs in which whole words can be the unit of synthesis. Telephone directory assistance is one such application. Even though the speech in this case consists of a string of ten digits, the vocabulary for the application must be larger than ten words, because the first digit in a string of ten digits is spoken differently from the third or the tenth one. An inventory of one hundred words -- all ten digits in ten different positions -- encompasses all the possibilities.
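A sketch of that position-dependent lookup is given below; the recording names are purely hypothetical, but they show how the same digit maps to a different stored word depending on where it appears in the string.

    # Toy position-dependent word selection for a stored-word digit reader.
    # The recording file names are hypothetical.
    def digit_recordings(number: str):
        """Return one stored-recording name per digit of a ten-digit string."""
        assert len(number) == 10 and number.isdigit()
        return [f"digit_{d}_pos{i}.wav" for i, d in enumerate(number)]

    print(digit_recordings("2481549000"))   # ten names, one per digit position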
Even so, the speech sounds like a series of isolated digits; it lacks the continuous flow of human speech. Storing syllables is also impractical for there are approximately fifteen thousand syllables in English and an adequate system would have to provide for smooth connections among them. Nor, as mentioned earlier, can phonemes serve as units for synthesis; their acoustic manifestations do not exist as independent entities and, besides, they are affected by the co-articulatory influence of neighboring sounds.
In 1958, G. Peterson, W. Wang, and E. Sivertsen experimented with using diphones to produce synthetic speech. These units consist of small speech segments that start in the middle of a phoneme and end in the middle of the next one. The authors theorized that phonemes are more stable in the middle and that segments between phonemes contain the necessary information about the transition from one phoneme to the next. Splicing the speech in the middle of each phoneme, therefore, should generate a smoother speech signal. The researchers did not attempt to construct a full system of diphones to produce all the possible speech-sound combinations of a given language (American English in this case). Instead, they selected several diphones and spliced them together to create phoneme sequences for a few utterances.
Although the experiment showed that the method was viable, there were some obvious problems. When speech segments are joined, discontinuities in loudness, pitch, or spectrum at the junctures are audible, usually as clicks or other undesirable sounds. Splicing together speech cut from different utterances does not prevent such discontinuities. Because they spliced tape to connect the diphones, Peterson and his colleagues had to carefully select diphones with similar acoustic characteristics at the junctures. In a system that includes all possible combinations of phonemes in the language, it would be impractical to use only diphones that match at the boundaries. Instead, we would have to smooth the connections between segments, which can only be done when the speech is parameterized. The first such system, generating synthesized speech from stylized stored parameters of formant tracks, was demonstrated in 1967.
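As a toy illustration of why juncture smoothing matters, the sketch below joins two stored waveforms with a short cross-fade instead of a hard splice. Real parameterized systems smooth in the parameter domain, as described above, so this waveform cross-fade is only the simplest stand-in for the idea; the input arrays stand in for recorded diphones.

    # Toy diphone join: cross-fade a few samples at the juncture instead of
    # splicing, to avoid an audible click.  Inputs stand in for recordings.
    import numpy as np

    def join_diphones(left: np.ndarray, right: np.ndarray, fade: int = 80):
        """Concatenate two waveforms, cross-fading `fade` samples at the join."""
        ramp = np.linspace(1.0, 0.0, fade)
        overlap = left[-fade:] * ramp + right[:fade] * (1.0 - ramp)
        return np.concatenate([left[:-fade], overlap, right[fade:]])

    # e.g. join_diphones(diphone_a_b, diphone_b_c) for two adjacent diphones
    # (the variable names here are hypothetical).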
The foregoing section describes the history of the talking machine prior to the late 1960s. Although research on talking machines had been under way for a long time, it was still in its infancy at that time. Computers were able to utter speech-like sounds, but they lacked the eloquence of HAL. In fact, the computer-generated speech-like sounds of the era were almost unintelligible, whether produced through synthesis by rule or synthesis from stored data.
Post-1960s Synthesis of Sounds.
In the 1970s, however, researchers made great advances in speech synthesis, mainly because of the wealth of data on spoken utterances and improved computational power. The best system of rules for synthesizing speech, developed by D. H. Klatt, utilized a digital implementation of an electroacoustic synthesizer. The spectral shaping module consisted of a complicated network of resonances with different branches for producing vowels, nasal consonants, fricatives, and stop consonants. By recording and observing the formant motions, Klatt was able to create speech synthesis of high quality. One derivative of his system, Digital's DECtalk, has been used by the noted physicist Stephen Hawking.
During the same decade, progress in the synthesis of speech from stored data was aided by research in speech coding and by the creation of new methods of speech analysis and of re-synthesizing speech from analysis parameters. Like synthesis by rule, synthesis from stored data can use different kinds of parameters; however, because the method is data-driven, the parameters do not need to be as intuitive -- they need only produce high-quality speech when previously analyzed speech segments are re-synthesized.
At present, two types of parameters are used for the data-driven method of synthesis: stored waveforms and a small set of spectral parameters mathematically derived from the speech signal. The latter are called LPCs (linear predictive coding) because one of their forms predicts the next set of speech-waveform values from a small set of previously computed waveform values. Although waveform parameters produce high-quality speech, it is impossible to independently control the spectrum of the stored waveforms; synthesizing with these parameters therefore lacks the flexibility needed to alter the speech spectrum.
The LPC parameters also produce high-quality speech, although it sounds somewhat mechanical. Their flexibility makes it easy to alter them to produce connected speech.
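The analysis half of LPC can be sketched compactly: for each short, windowed frame of speech, compute its autocorrelation and solve for the predictor coefficients with the Levinson-Durbin recursion. The frame contents and the predictor order below are assumptions made only for illustration.

    # Sketch of LPC analysis for one speech frame via Levinson-Durbin.
    import numpy as np

    def lpc(frame: np.ndarray, order: int = 10) -> np.ndarray:
        """Return the coefficients of the prediction filter A(z) for one frame."""
        # Autocorrelation up to lag `order`.
        r = np.array([np.dot(frame[: len(frame) - k], frame[k:])
                      for k in range(order + 1)])
        a = np.zeros(order + 1)
        a[0] = 1.0
        error = r[0] + 1e-9                      # guard against a silent frame
        for i in range(1, order + 1):
            # Reflection coefficient for this step of the recursion.
            k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / error
            a[1:i + 1] += k * a[i - 1::-1][:i]   # update predictor coefficients
            error *= (1.0 - k * k)
        return a                                 # resynthesis excites 1/A(z)

Resynthesis then runs an excitation signal (a pulse train for voiced sounds, noise for voiceless ones) through the all-pole filter 1/A(z), which is what makes the pitch, loudness, and spectrum separately controllable.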
Shortly after the introduction of LPC parameters, early research produced a synthesizer that used words as the unit of synthesis. With an inventory of twelve hundred common words, it was able to synthesize many paragraphs of text. Because it used parameterized speech, it could smooth the connections between words and impose an intonation over the utterance to make the speech sound continuous. However, the synthesizer was limited -- too many words were not in its inventory.
The speech synthesizer used at Bell Laboratories builds on the method introduced by Peterson and his colleagues: it generates speech from stored short segments of analyzed speech, using LPC-derived parameters. It is not a simple system of diphones but a complex system that contains many segments larger than diphones, to accommodate phonemes with complex coarticulation effects. For example, to synthesize the word incapable as spoken by HAL, the system first transcribes the word into a phonetic notation. Incapable becomes a string of phonetic symbols in which /*/ represents silence, /1/ is the neutral vowel schwa, and /U/ is the vowel a as in the word able.
The synthesizer then attempts to match the largest string of phonemes from the word to a string in its databank. If two adjacent phonemes do not interact -- that is, there is little coarticulation between them, as is the case for /n/ followed by a /k/ -- the synthesizer will not find a diphone. In this case, it will add a silence element of zero duration. When a phoneme is greatly influenced by its neighbors, as in the case of a schwa, a triplet of phonemes will be stored in the database. Thus the word incapable is synthesized from a sequence of such stored elements. The resultant speech is intelligible, although it sounds mechanical and would never be mistaken for a human voice.
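The greedy matching just described can be sketched as follows. The phoneme symbols and inventory contents are invented for the example and are not the actual Bell Laboratories segment set; the point is only the strategy of preferring the largest stored unit and inserting a zero-duration silence when no stored transition fits.

    # Toy greedy segment lookup: prefer triphones over diphones, and fall
    # back to a zero-duration silence element when no stored transition fits.
    inventory = {("I", "N"), ("K", "EY"), ("EY", "P"),
                 ("P", "1", "B"), ("B", "1", "L"), ("L", "*")}   # invented units

    def lookup(phonemes):
        segments, i = [], 0
        while i < len(phonemes) - 1:
            for length in (3, 2):                       # largest unit first
                unit = tuple(phonemes[i:i + length])
                if len(unit) == length and unit in inventory:
                    segments.append(unit)
                    i += length - 1                     # adjacent units share a phoneme
                    break
            else:
                segments.append(("silence", 0))         # zero-duration join
                i += 1
        return segments

    print(lookup(["I", "N", "K", "EY", "P", "1", "B", "1", "L", "*"]))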
Speech Generation and Text-to-Speech Conversion.
Thus far, we have seen a system capable of synthesizing speech from phonemic input. Given a sequence of phonemes, scientists can now generate a signal that sounds speech-like. This was a very important task and the main preoccupation of researchers for a long time. But it is not all there is to speech. Speech, a subset of language, is one method humans use to communicate with each other.
The most direct form of language communication happens when one human, the generator, speaks to one or more humans, the receptors. This mode of communication is easy for the generator; he or she needs only choose the proper words to represent an idea and produce the speech sounds that represent the words. Barring such problems as a noisy environment or language differences, receptors will usually understand the idea the generator is trying to transmit. This mode of communication is not always possible, however. Quite often the generator and receptor are separated by large distances and cannot communicate with a speech signal. More often, a receptor is not able or willing to receive the message when the generator is willing to transmit it; or a generator may want to send a message to future receptors and preserve it for posterity.
The invention of a method to record thoughts, a writing system, introduced new possibilities for transmitting ideas, though sometimes at the expense of total clarity. The generator uses words to convey an idea and writes them in the accepted symbols.
The receptor trying to derive the intended message from the writing has only the words themselves; without cues about the generator's real intentions, the emotional content, and the correct groupings of the words, the text may not convey its meaning exactly.
An even more complex mode of communication occurs when the originator's written text is transmitted to the receptor orally by another person, a reader. To speak the text the originator intended, the reader must first understand its meaning. A complete text-to-speech system operates in this, most difficult, mode of communication -- in which the computer reads a text written by a third party.
Reading Text.
The situation is very different, however, when the computer is reading text, either printed text or a stored database. An educated person can read text in a familiar language without difficulty; a reading machine, by contrast, is not familiar with the language and does not understand what it is reading.
The first problem a machine encounters is reading strings that are not ordinary words, such as non-alphabetic characters like 123. A person would have no difficulty with such items. Often, we rely on contextual cues to decide how to pronounce characters or words such as St. (saint or street), bass (a musical instrument or a fish) and 3/5 (March fifth or three-fifths). Reading numbers is always a problem: is 5 five or fifth? Is 248-1549 a telephone number or an arithmetic problem? And certainly we would not pronounce $1.5 million as dollar sign one point five million. Thus, the first task of a reading machine is to normalize the text by expanding non-word characters into words and, in the case of bass or read (present or past tense), deciding which is the correct pronunciation. When humans speak, they try to convey the structure of the message by segmenting the speech into a hierarchical structure of words, minor phrases, and sentences.
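Before turning to that structure, a small sketch of the normalization step is given below. The abbreviation table, digit words, and rules are illustrative only, and a real system would use context to choose between readings such as saint and street.

    # Toy text normalization: expand a few non-word tokens into speakable words.
    import re

    ABBREVIATIONS = {"Dr.": "doctor", "St.": "saint"}   # "St." may also mean "street"
    DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine"]

    def spell_number(token: str) -> str:
        return " ".join(DIGIT_WORDS[int(d)] for d in token if d.isdigit())

    def normalize(text: str) -> str:
        words = []
        for token in text.split():
            if token in ABBREVIATIONS:
                words.append(ABBREVIATIONS[token])        # real systems use context here
            elif re.fullmatch(r"\d[\d-]*", token):
                words.append(spell_number(token))         # e.g. a telephone number
            else:
                words.append(token)
        return " ".join(words)

    print(normalize("Call 248-1549"))   # Call two four eight one five four nine

With the text normalized, the machine can turn to marking that hierarchical structure. Consider, for example, the sentence: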
"This sort of thing has cropped up before and it has always been due to human error".
In English, luckily, white spaces in the text mark word boundaries; this is not true in, for example, Chinese and Japanese. The ends of sentences are also well marked in written English, with a period. (However, care must be taken not to mistake the period at the end of an abbreviation for the end of a sentence.) Minor phrases are often indicated by commas (not to be confused with the commas used in a list).
Moreover, a comma is sometimes omitted from text, as it is in the statement above, which has no comma after the word before. Conjunctions such as and are often a cue for a minor phrase break, but not always. When HAL mentions "putting Drs. Hunter, Kimball and Kamisky aboard," neither the comma nor the conjunction indicates a minor phrase boundary. Speaking the sentence with a minor boundary in either place alters its meaning.
Finally, the words in a sentence have to be grouped into minor phrases. Going back to our original example, (this sort of thing) (has cropped up) (before) are the proper groupings in the first minor phrase. A speaker who does not use such groupings either says the whole phrase as one unit or pronounces each word with equal emphasis. In either case, listeners will have difficulty comprehending the message, for they have no way to identify its important parts.
A talking computer also needs to determine the focus of each phrase.
"I enjoy working with people".
He could stress any word in the sentence and change its meaning. If he stresses I, he contrasts the meaning with "you enjoy ...". If he stresses enjoy, he implies a contrast with "I hate ...". When working is stressed, it means "rather than playing".
To convey the meaning of a message, the computer must assign a prominent stress to the correct word. Of course, for educated human readers familiar with the language, the numerous steps needed to speak a written or printed text are natural, because they understand what they are reading. Today, alas, we do not have machines that understand text, although their analysis of a text can help them sound as if they do. The Bell Labs synthesizer does paragraph-length analyses of texts. Using discourse information, statistics about word relations, and part-of-speech assignment (nouns, verbs, etc.), the synthesizer expands the input, segments the text, and assigns sentence-level stresses.
This process, though not perfect, works well enough to enable the machine to read very long sentences with only a minimal loss of intelligibility.
Generating Linguistic Units.
A number of issues are common to the task of reading text and generating computer speech. First, we assumed that when a computer generates the speech, it "knows" what it is trying to say (as hard a problem as that might be). HAL knows that he is trying to say "Dr. Poole" and not "drive Poole," just as he knows where the break for the phrasal hierarchies belongs and which word he needs to stress.
Next, the computer performing either task needs to know how to pronounce each word. Because the English language employs a limited alphabet, there are many ways to pronounce certain letter sequences. For example, only six letters (a, e, i, o, u and y) are used to describe vowel sounds in English, but there are thirteen different vowel phonemes in the language; the vowel in the word book, for instance, is quite different from that in boot. In the preceding section we touched on homograph disambiguation (i.e., distinguishing among the various meanings and sounds of words like bass, live, and read), but we, and the computer, also need to know how to correctly pronounce the letters sch in the words school, schedule, and mischief. Phoneticians have an alphabet that corresponds to pronunciation rather than to the spelling of words; most dictionaries use this alphabet to indicate pronunciation. Another important aspect of pronunciation is lexical, or word, stress.
When HAL says, "my mission responsibility, "he puts the stress on the syllable bi. If he were to say responsibility or responsibility, the listener might not understand the word, or might even hear two words instead of one. By storing a pronunciation dictionary in the computer, we can tell the computer how to pronounce many words. Still, because of prefixes and suffixes and the constant addition of new words to the language, it is impossible to store all the words and their variations.
We therefore need to supplement the dictionary with a morphological analyzer and a set of letter-to-sound rules. Movement of the stressed syllable makes writing a morphological program a complicated task. For example, when HAL says melodramatic, with the stress on the fourth syllable, ma, he compounds the morphemes melo and drama, both of which stress the first syllable. When they are combined to form melodrama, the morpheme melo maintains its first-syllable stress.
The addition of the suffix tic, however, shifts the stress to the penultimate syllable. Moreover, the moving stress does not always fall on the same syllable; the shifts in act, active, activity, and activation demonstrate the variety of stress options a computer's analyzer has to recognize. After completing the computations involved in language analysis, the computer -- whether reading text or generating speech -- has information about the hierarchical structure of the text, the focus or stress of the different segments, and the correct pronunciation, including lexical stress, of the words in the utterance. The result of the analysis is a string of phonemes annotated with several levels of stress marking and different levels of phrase marking. Once these linguistic units are generated, the computer is ready to synthesize speech.
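As a sketch of how a pronunciation dictionary and letter-to-sound rules fit together, consider the fragment below. The tiny lexicon, the phoneme symbols, and the spelling rules are all invented for illustration; a real system would be far larger and would also run the morphological analysis and lexical-stress assignment described above.

    # Toy pronunciation lookup: dictionary first, letter-to-sound rules as fallback.
    LEXICON = {
        "book": ["B", "UH", "K"],
        "boot": ["B", "UW", "T"],
        "school": ["S", "K", "UW", "L"],
    }

    LETTER_TO_SOUND = {"sch": ["S", "K"], "ee": ["IY"], "oo": ["UW"],
                       "b": ["B"], "t": ["T"], "l": ["L"], "k": ["K"]}

    def pronounce(word: str):
        word = word.lower()
        if word in LEXICON:                        # known word: use the dictionary
            return list(LEXICON[word])
        phones, i = [], 0
        while i < len(word):                       # otherwise, greedy spelling rules
            for length in (3, 2, 1):
                chunk = word[i:i + length]
                if chunk in LETTER_TO_SOUND:
                    phones.extend(LETTER_TO_SOUND[chunk])
                    i += length
                    break
            else:
                i += 1                             # no rule for this letter: skip it
        return phones

    print(pronounce("boot"))   # found in the dictionary
    print(pronounce("beet"))   # built from the letter-to-sound rules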
Synthesis from Linguistic Units.
It would seem a trivial task to synthesize speech, by either rule or stored data, once the desired sequence of phonemes is known. However, the computer still lacks information about the timing and pitch of the utterance.
These factors may seem unimportant as long as the computer can pronounce the phonemes correctly. Nonetheless, mistakes in timing and pitch are likely to result in unintelligible speech or, at best, the perception that the speaker is a non-native speaker.
We become aware of the role of pitch when actors impersonating a computer in a television commercial or science-fiction movie try to speak in a monotone. Notice that I said try: they are not really talking in a monotone, for if they were, it would sound more like singing than speaking. They do, however, severely restrict the range of their pitch. Humans normally talk with the timing and intonation appropriate to their native language, which they acquired as children by imitating adult speakers.
The computer, of course, does not learn by imitation; for the computer to speak correctly, we have to develop the rules for pitch and timing and program it to use them.
The timing of speech events is very complicated. First, phonemes have inherent durations; for example, the vowel in the word had is much longer than the vowel in pit. Yet the durations of phonemes are not invariable. They are affected by the position of the phoneme's syllable in the phrase, the degree of stress on the syllable, the influence of neighboring phonemes, and other factors. For example, the vowel in had is much longer than the vowel in hat, because of the difference between the following consonants /d/ and /t/. Researchers at Bell Laboratories recently devised a statistics-based analysis scheme that measures the contribution of various factors to phoneme durations and creates algorithms to compute them.
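The flavor of such a duration scheme can be conveyed with a toy multiplicative model: start from an inherent duration for each phoneme and scale it by contextual factors. Every number below is an illustrative assumption, not a measured value from the Bell Laboratories work.

    # Toy duration model: inherent phoneme durations scaled by context factors.
    INHERENT_MS = {"AE": 130, "IH": 70, "D": 60, "T": 55}   # illustrative values

    def duration_ms(phoneme, stressed=False, phrase_final=False, before_voiced=False):
        d = INHERENT_MS[phoneme]
        if stressed:
            d *= 1.2           # stressed syllables lengthen
        if phrase_final:
            d *= 1.4           # lengthening before a phrase boundary
        if before_voiced:
            d *= 1.3           # e.g. the vowel of "had" versus "hat"
        return d

    print(duration_ms("AE", stressed=True, before_voiced=True))   # vowel of "had"
    print(duration_ms("AE", stressed=True))                       # vowel of "hat"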
To program rules for the pitch contour of speech, we must first understand how intonation provides information about the sentence type, sentence structure, sentence focus, and lexical stress of a speech signal. We know, for example, that the pitch is lower at the end of a declarative sentence, while in many interrogative sentences it rises at the end. At the ends of phrases, non-terminal sentences, and parenthetical statements, we indicate that we will continue speaking by lowering the pitch and reducing its range. We also express focus and stress with large pitch variations. All of these phenomena must be programmed to make the computer deliver a message effectively.
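A very simplified sketch of such pitch rules follows: a gradually declining baseline over the utterance, a final fall for a statement or a rise for a question, and a pitch excursion on the focused word. All the constants are invented for illustration.

    # Toy pitch-contour rules: declination, a sentence-final fall or rise,
    # and an accent on the focused word.  All values are illustrative.
    import numpy as np

    def pitch_contour(n_frames, question=False, focus_frame=None):
        f0 = np.linspace(130.0, 110.0, n_frames)        # gradual declination (Hz)
        tail = max(1, n_frames // 10)
        if question:
            f0[-tail:] += np.linspace(0.0, 40.0, tail)  # final rise
        else:
            f0[-tail:] -= np.linspace(0.0, 25.0, tail)  # final fall
        if focus_frame is not None:                     # accent on the focused word
            width = max(1, n_frames // 20)
            lo, hi = max(0, focus_frame - width), min(n_frames, focus_frame + width)
            f0[lo:hi] += 30.0
        return f0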
Feeling and Singing.
So far, we have concentrated on the aspects of speech synthesis that convey linguistic information by analyzing the acoustics of speech sounds, as well as the manifestations of timing and pitch. Another dimension of human speech, the emotional state of the speaker, is as important as the linguistic content of the message. I will not explore computer feelings here; however, in 1974 an interest in computer music led someone to write a computer opera dealing with the intriguing subject of computer emotion. The opera featured a singing computer.
Work in computer singing stemmed from research in speech synthesis. To understand the effect of manipulating speech in the parameter domain, researchers constructed an interactive system to display and alter the synthesis parameters for a digital version of an electroacoustic synthesizer. Because the state of synthesis was not very advanced in 1974, they used analysis parameters from natural-speech segments. The system allowed them to adjust the timing of events by stretching and compressing the parameters and to change the pitch by simply drawing or typing a new pitch contour; for special effects, it could also change the spectral parameters. By adjusting the timing to fit the music and setting the pitch to the frequency of the desired musical notes, they were able to program a computer to sing. A singing formant developed by J. Sundberg added richness to the voice, and a vibrato contributed to its realistic sound.