These days, we’re all familiar with applications like Siri and Alexa, and automated speech recognition (ASR), a staple of the science fiction programs of my youth, is now a part of our everyday lives. Unsurprisingly, the development of ASR is a fascinating story. It has several parallels with a story I covered previously about the development of neural networks, a technology that modern ASR happens to leverage, but more on that later.
Probably the first commercial success of ASR was a toy called “Radio Rex” that was sold in the 1920s. The toy was a wooden doghouse a few inches high with a little celluloid bulldog that would automatically jump out when his name “Rex” was spoken. The “speech recognition” function of Radio Rex was, as you might imagine, very primitive. However, to understand how Radio Rex and more modern ASR applications work, we need to take a brief dip into acoustics and define 3 basic concepts: harmonics, resonance, and formants.
A harmonic is a pitch whose frequency is an integer multiple of another frequency in a complex sound. OK, that sounds more complicated than it should. “Complex sound” simply describes nearly every sound that occurs naturally, as opposed to a pure tone generated artificially. Complex sounds are composed of multiple frequencies; even a single note played on a guitar is really composed of many different pitches in a harmonic series. The relative strength (i.e., loudness or amplitude) of the various harmonics gives an instrument or voice its characteristic sound quality. That’s one reason why a trumpet and a clarinet sound different even if they are playing the same note.
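To make the idea of a harmonic series concrete, here is a minimal Python sketch. The function names and the two amplitude lists are made up for illustration and do not model any real instrument; they only show that the same fundamental with different harmonic strengths produces a different waveform.

```python
import math


def harmonic_series(fundamental_hz, count):
    """Return the first `count` harmonics: integer multiples of the fundamental."""
    return [fundamental_hz * k for k in range(1, count + 1)]


def complex_tone(fundamental_hz, amplitudes, t):
    """Value at time t (seconds) of a tone whose k-th harmonic has amplitude amplitudes[k-1]."""
    return sum(
        a * math.sin(2 * math.pi * fundamental_hz * k * t)
        for k, a in enumerate(amplitudes, start=1)
    )


# Two hypothetical "instruments" playing the same 110 Hz note.
# The amplitude profiles are invented, not measured.
bright = [1.0, 0.8, 0.6, 0.4]
mellow = [1.0, 0.0, 0.3, 0.0]

print(harmonic_series(110.0, 4))
print(complex_tone(110.0, bright, 0.001), complex_tone(110.0, mellow, 0.001))
```

Both tones share the same pitch, but because their harmonics have different strengths, their waveforms (and so their sound quality) differ.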
Resonance, with respect to acoustics, is the phenomenon where the structure of a system favors a dominant frequency. If you’ve ever blown across the top of a bottle, then you’re familiar with resonance: the size and shape of the airspace in the bottle give it a specific resonance frequency that produces a specific pitch. Rooms can have resonance too, which is best illustrated by Alvin Lucier’s work “I Am Sitting in a Room,” wherein the composer recorded himself narrating a text and repeatedly replayed and re-recorded the tape. After many iterations, the resonance frequencies of the room (and probably the recording equipment too) begin to dominate over the original voiced tones. I highly recommend listening to it; you can skip ahead in the video to hear a more heavily iterated version. The human vocal tract also has resonance frequencies, which we control to produce vowel sounds.
A formant, then, is a harmonic of a pitch that is reinforced through resonance. The difference between “ooooo” and “aaaaah” is a difference in resonance that results in two distinct sets of harmonics, even if they are voiced at an identical pitch. This is the key to how we differentiate vowel sounds in language as well as how old Rex could recognize his name.
So what does all this have to do with Radio Rex? A tiny copper pendulum on the side of Rex’s doghouse was weighted (with lead! This was before toy safety standards) so that it would vibrate at approximately 500 Hz. This roughly corresponds to the first (lowest pitch) formant of the short “e” sound in “Rex.” When the doghouse picks up a complex sound that includes a 500 Hz harmonic with high enough energy, the pendulum vibrates and acts like a switch to disconnect a battery from an electromagnet, releasing a spring and allowing Rex out of the doghouse.
It has been a long road from Rex to Siri, and the story of ASR’s early evolution, as covered in this article originally published by Descript, was characterized by optimism and enthusiasm. Yet a single open letter published in the Journal of the Acoustical Society of America in 1969 could have derailed the whole enterprise. Then-director of Bell Labs, John R. Pierce, wrote a scathing analysis of the field of ASR research, implying that the work had been grossly over-funded and unscientific in its approach. Pierce cited the “lush funding” environment of the “post-Sputnik” era as a driver of bad science, the conceit being that ASR is a fanciful idea, akin to “turning water into gasoline” or “going to the moon,” and therefore it attracts excessive funding, which in turn leads to a certain level of deception, intentional or not, in experimental design, reported results, and prognosis for future development. (The “going to the moon” comment is particularly intriguing: Neil Armstrong took his “giant leap for mankind” only 3 months before Pierce’s letter was published, yet Pierce mentions it in the same breath as “extracting gold from the sea.”)
It would be difficult to overstate Pierce’s importance and the weight his opinions carried. The man coined the term “transistor,” first built by the team he led at Bell Labs, and was the executive director of the group that put the first communications satellite into space. Of course, you can’t be right all the time: another quip famously attributed to Pierce is “funding artificial intelligence is real stupidity.” Pierce also de-funded Bell Labs’ research on ASR for the rest of his tenure. Fortunately, ASR funding and research continued elsewhere. Bell Labs continued to make massive scientific and technological contributions, but it never retook its position as a leader in ASR research.
Pierce’s letter is definitely worth a read; it’s short enough and provides fascinating insight into the mind of the man who wrote it as well as the time it was written. For example, “We communicate with children by words, coos, embraces, and slaps... It is not clear that we should resort to the same means with computers. In fact, we do very well with keyboards, cards, tapes, and cathode-ray tubes.”
Despite being essentially a diatribe against ASR research, Pierce’s letter touches upon a reason for the limited success of early ASR and hints at the solution that allows modern ASR to function. Consider this passage from the letter:
“...spoken English is, in general, simply not recognizable phoneme by phoneme or word by word, and that people recognize utterances, not because they hear the phonetic features or the words distinctly, but because they have a general sense of what a conversation is about and are able to guess what has been said.”
The problem with early ASR was that scientists were trying to create, as Pierce puts it, a “phonetic typewriter”: in other words, a machine that can accurately recognize individual words or sounds in isolation, independent of context, like a keyboard “recognizes” individual key presses. But, as Pierce writes, that’s not really how people recognize speech. In contrast with written language, which is systematized and structured in ways that allow direct interpretation from letter to word to meaning, understanding spoken language relies more heavily on context. Therefore, speech recognition developers would need to find a way to use contextual information, in yet another example of computational science mimicking biology.
The next big steps forward in ASR would come in the 1970s, as researchers began to apply a statistical method called Hidden Markov Models, first described by Ruslan Stratonovich in 1960. These models allowed systems to leverage contextual information to predict which words are being spoken, vastly improving the vocabulary and accuracy of ASR.
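As a rough illustration of how a Hidden Markov Model lets context resolve an ambiguous sound, here is a toy Python sketch. Everything in it is an assumption made for the example: the four-word vocabulary, the invented sound labels, and the probabilities are not taken from any real ASR system. It uses Viterbi decoding, a standard way to find the most likely sequence of hidden states (here, words) given a sequence of observations (here, sounds).

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely sequence of hidden states for the observations."""
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best = [{
        s: (start_p.get(s, 0.0) * emit_p.get(s, {}).get(observations[0], 0.0), None)
        for s in states
    }]
    for obs in observations[1:]:
        prev = best[-1]
        column = {}
        for s in states:
            column[s] = max(
                (
                    (prev[r][0]
                     * trans_p.get(r, {}).get(s, 0.0)
                     * emit_p.get(s, {}).get(obs, 0.0), r)
                    for r in states
                ),
                key=lambda pair: pair[0],
            )
        best.append(column)
    # Trace back from the most probable final state.
    path = [max(states, key=lambda s: best[-1][s][0])]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    return list(reversed(path))


# Toy model with invented numbers: "two" and "too" produce the same sound.
states = ["two", "too", "dogs", "tired"]
start_p = {"two": 0.5, "too": 0.5}
trans_p = {
    "two": {"dogs": 0.9, "tired": 0.1},
    "too": {"dogs": 0.1, "tired": 0.9},
}
emit_p = {
    "two": {"tu": 1.0},
    "too": {"tu": 1.0},
    "dogs": {"dogz": 1.0},
    "tired": {"tyrd": 1.0},
}

print(viterbi(["tu", "dogz"], states, start_p, trans_p, emit_p))
print(viterbi(["tu", "tyrd"], states, start_p, trans_p, emit_p))
```

The sound “tu” alone cannot distinguish “two” from “too,” but the word that follows tips the balance, which is the kind of guessing from context that Pierce described.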