Speech recognition, Radio Rex, and an open letter from John R. Pierce

These days, we’re all familiar with applications like Siri and Alexa, and automated speech recognition (ASR), a standby fixture of the science fiction programs of my youth, is now part of our everyday lives. Unsurprisingly, the development of ASR is a fascinating story, and it has several parallels with one I covered previously about the development of neural networks, a technology that modern ASR happens to leverage. But more on that later.

Probably the first commercial success of ASR was a toy called “Radio Rex” that was sold in the 1920s. The toy was a wooden doghouse a few inches high with a little celluloid bulldog that would automatically jump out when his name “Rex” was spoken. The “speech recognition” function of Radio Rex was, as you might imagine, very primitive. However, to understand how Radio Rex and more modern ASR applications work, we need to take a brief dip into acoustics and define 3 basic concepts: harmonics, resonance, and formants.

A harmonic is a pitch whose frequency is an integer multiple of another frequency in a complex sound. OK, that sounds more complicated than it should. A “complex sound” is essentially any naturally produced sound; a pure, single-frequency tone pretty much has to be generated artificially. Complex sounds are composed of multiple frequencies; even a single pitch played on a guitar is really composed of many different pitches in a harmonic series. The relative strength (i.e., loudness or amplitude) of the various harmonics gives an instrument or voice its characteristic sound quality. That’s one reason why a trumpet and a clarinet sound different even when they play the same note.

First four harmonics of a vibrating string. A string may vibrate in all of these (and more) modes simultaneously. (image credit: Juan M Aguirregabiria, CC BY-SA 3.0)
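
If you want to hear this for yourself, a complex tone is easy to synthesize. Here is a minimal Python sketch (using NumPy) that sums the first four harmonics of a 220 Hz fundamental; the amplitudes are arbitrary choices meant only to illustrate how relative harmonic strength shapes timbre.

```python
import numpy as np

SAMPLE_RATE = 44100                    # samples per second
FUNDAMENTAL = 220.0                    # Hz (the A below middle C)
AMPLITUDES = [1.0, 0.5, 0.33, 0.25]    # arbitrary relative strengths for harmonics 1-4

t = np.linspace(0.0, 1.0, SAMPLE_RATE, endpoint=False)   # one second of time points

# A complex tone is just a sum of sinusoids at integer multiples of the fundamental.
tone = sum(a * np.sin(2 * np.pi * FUNDAMENTAL * n * t)
           for n, a in enumerate(AMPLITUDES, start=1))

# Normalize to the range [-1, 1] so the result could be written out as audio.
tone /= np.max(np.abs(tone))
```

Change the amplitude list and the timbre changes, even though the perceived pitch stays the same.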

Resonance, with respect to acoustics, is the phenomenon where the structure of a system favors a dominant frequency. If you’ve ever blown across the top of a bottle, then you’re familiar with resonance: the size and shape of the airspace in the bottle has a specific resonance frequency that produces a specific pitch. Rooms can have resonance too, which is best illustrated by Alvin Lucier’s work “I Am Sitting in a Room,” wherein the composer recorded himself narrating a text and repeatedly replayed and re-recorded the tape. After many iterations, the resonance frequencies of the room (and probably the recording equipment too) begin to dominate over the original voiced tones. I highly recommend the link above; you can skip ahead in the video to listen to a more iterated version. The human vocal tract also has resonance frequencies, which we control to produce vowel sounds.

A formant, then, is a harmonic of a pitch that is reinforced through resonance. The difference between “ooooo” and “aaaaah” is a difference in resonance that results in two distinct sets of harmonics, even if they are voiced at an identical pitch. This is the key to how we differentiate vowel sounds in language as well as how old Rex could recognize his name.

So what’s all this have to do with Radio Rex? A tiny copper pendulum on the side of Rex’s doghouse was weighted (with lead! This was before toy safety standards) so that it would vibrate at approximately 500 Hz. This roughly corresponds to the first (lowest pitch) formant of the short “e” sound in “Rex.” When the doghouse picks up a complex sound that includes a 500 Hz harmonic with high enough energy, the pendulum vibrates and acts like a switch to disconnect a battery from an electromagnet, releasing a spring and allowing Rex out of the doghouse.
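
As a back-of-the-envelope software analogue of Rex’s pendulum, here is a sketch that checks whether an audio snippet carries enough energy near 500 Hz. The bandwidth and threshold are made-up illustration values, not measurements from the actual toy.

```python
import numpy as np

def rex_hears_his_name(samples, sample_rate=44100,
                       target_hz=500.0, bandwidth_hz=50.0, threshold=0.1):
    """Return True if the signal has a strong component near target_hz.

    The bandwidth and threshold are arbitrary illustration values.
    """
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)

    # Energy in the narrow band around the pendulum's resonance frequency...
    band = (freqs > target_hz - bandwidth_hz) & (freqs < target_hz + bandwidth_hz)
    band_energy = np.sum(spectrum[band] ** 2)

    # ...compared with the energy in the whole spectrum.
    total_energy = np.sum(spectrum ** 2)
    return total_energy > 0 and band_energy / total_energy > threshold
```

Like the real Rex, this detector would be triggered by any sufficiently loud sound with energy around 500 Hz, not just the word “Rex.”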

It has been a long road from Rex to Siri, and the story of ASR’s early evolution, as covered in this article originally published by Descript, was characterized by optimism and enthusiasm. Yet a single open letter published in the Journal of the Acoustical Society of America in 1969 could have derailed the whole enterprise. John R. Pierce, then an executive director of research at Bell Labs, wrote a scathing analysis of the field of ASR research, implying that the work had been grossly over-funded and unscientific in its approach. Pierce cited the “lush funding” environment of the “post-Sputnik” era as a driver of bad science. The conceit is that ASR is a fanciful idea, akin to “turning water into gasoline” or “going to the moon,” and therefore attracts excessive funding, which in turn leads to a certain level of deception, intentional or not, in experimental design, reported results, and prognosis for future development. (The “going to the moon” comment is particularly intriguing: Neil Armstrong took his “giant leap for mankind” only three months before Pierce’s letter was published, yet Pierce mentions it in the same breath as “extracting gold from the sea.”)

It would be difficult to overstate Pierce’s importance and the weight his opinions carried. The man coined the term “transistor,” a device first built by his colleagues at Bell Labs, and he was the executive director of the group that put the first communications satellite into space. Of course, you can’t be right all the time: another quip famously attributed to Pierce is “funding artificial intelligence is real stupidity.” Pierce also de-funded Bell Labs’ research on ASR for the rest of his tenure. Fortunately, ASR funding and research continued elsewhere. Bell Labs continued to make massive scientific and technological contributions, but they never retook their position as a leader in ASR research.

Pierce’s letter is definitely worth a read; it’s short enough and provides fascinating insight into the mind of the man who wrote it as well as the time it was written. For example, “We communicate with children by words, coos, embraces, and slaps... It is not clear that we should resort to the same means with computers. In fact, we do very well with keyboards, cards, tapes, and cathode-ray tubes.”

Despite being essentially a diatribe against ASR research, Pierce’s letter touches upon a reason for the limited success of early ASR and hints at the solution that allows modern ASR to function. Consider this passage from the letter:

“...spoken English is, in general, simply not recognizable phoneme by phoneme or word by word, and that people recognize utterances, not because they hear the phonetic features or the words distinctly, but because they have a general sense of what a conversation is about and are able to guess what has been said.”

The problem with early ASR was that scientists were trying to create, as Pierce puts it, a “phonetic typewriter”: in other words, a machine that can accurately recognize individual words or sounds in isolation, independent of context, the way a keyboard “recognizes” individual key presses. But, as Pierce writes, that’s not really how people recognize speech. In contrast with written language, which is systematized and structured in ways that allow direct interpretation from letter to word to meaning, understanding spoken language relies more heavily on context. Speech recognition developers would therefore need to find a way to use contextual information, in yet another example of computational science mimicking biology.

The next big steps forward in ASR would come in the 1970s, as researchers began to apply a statistical method called the Hidden Markov Model (HMM), first described by Ruslan Stratonovich in 1960, which allowed systems to leverage contextual information to predict which words are being spoken, vastly improving the vocabulary and accuracy of ASR.
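
To give a flavor of how an HMM uses context, here is a toy Viterbi decoder in Python. The “words,” “sounds,” and every probability below are invented for illustration; a real recognizer models phonemes and learns its probabilities from data.

```python
# Toy Hidden Markov Model: hidden states are words, observations are sound symbols.
# All probabilities below are invented for illustration only.

states = ["I", "scream", "ice", "cream"]
start_p = {"I": 0.6, "scream": 0.05, "ice": 0.3, "cream": 0.05}
trans_p = {  # P(next word | current word): this is where context lives
    "I":      {"I": 0.05, "scream": 0.6,  "ice": 0.3,  "cream": 0.05},
    "scream": {"I": 0.5,  "scream": 0.1,  "ice": 0.3,  "cream": 0.1},
    "ice":    {"I": 0.1,  "scream": 0.05, "ice": 0.05, "cream": 0.8},
    "cream":  {"I": 0.6,  "scream": 0.1,  "ice": 0.25, "cream": 0.05},
}
emit_p = {   # P(observed sound | word): acoustically similar words overlap
    "I":      {"AY-S": 0.6, "S-K-R-IY-M": 0.2, "K-R-IY-M": 0.2},
    "scream": {"AY-S": 0.0, "S-K-R-IY-M": 0.9, "K-R-IY-M": 0.1},
    "ice":    {"AY-S": 0.7, "S-K-R-IY-M": 0.0, "K-R-IY-M": 0.3},
    "cream":  {"AY-S": 0.0, "S-K-R-IY-M": 0.1, "K-R-IY-M": 0.9},
}

def viterbi(observations):
    """Return the most probable word sequence for a list of observed sounds."""
    # trellis[t][s] = (probability of best path ending in state s, previous state)
    trellis = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (trellis[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
            column[s] = (prob, prev)
        trellis.append(column)

    # Walk back through the stored back-pointers from the best final state.
    state = max(trellis[-1], key=lambda s: trellis[-1][s][0])
    path = [state]
    for t in range(len(trellis) - 1, 0, -1):
        state = trellis[t][state][1]
        path.append(state)
    return list(reversed(path))

print(viterbi(["AY-S", "S-K-R-IY-M"]))  # -> ['I', 'scream']
print(viterbi(["AY-S", "K-R-IY-M"]))    # -> ['ice', 'cream']
```

In this toy, the acoustics alone are ambiguous between “I”/“ice” and “scream”/“cream”; the start and transition probabilities, which encode which words tend to begin and follow one another, settle the question.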

Double helices, DNA dynamics, and I-motifs

Most are familiar with the classic Franklin-Watson-Crick DNA double helix; it shows up all over the place. The first appearance of the double helix in popular culture was a painting by Salvador Dalí in 1953 (in the painting, it’s on the left under the depiction of the prophet Isaiah), and I think the first time I saw a double helix may have been the film Jurassic Park (near the end of the clip).

The classic DNA double helix, the form that occurs most commonly in nature, is a right-handed helix known as B-DNA. “Handedness” refers to which way the helix twists. If a helix is right-handed, you can point your right thumb along the axis of the helix, and the turns of the helix will follow the curve of your fingers. Another way to distinguish right- from left-handed helices is to imagine yourself walking up the helix like a flight of stairs and then ask yourself whether you’d place your right or left hand on the “railing.” Whenever I come across a DNA helix in popular media or science journalism, I always like to check whether they’ve given it the “right” handedness. There have been some pretty egregious examples of people getting it wrong.

Of course, this is biology we’re talking about here, so there are bound to be exceptions that prove the rule. Although B-DNA may be the most common form, DNA can adopt many different shapes or “conformations.” Some of these are shown below.

Alternative DNA conformations. From left to right: A-DNA, B-DNA, Z-DNA, G-4 motif, i-motif. The structures correspond to PDB entries 440D, 5EJ1, 4OCB, 1K8P, and 1EL2, respectively.

The three DNA conformations on the left are double-stranded forms known to exist in nature; although A- and Z-DNA were once thought to be possible only in vitro (in a test tube), we now know that they can exist in cells and have, or at least reflect, important biological functions. Z-DNA is a left-handed helix; try comparing its handedness using one of the tests above.

The fourth DNA conformation pictured is called a G-quadruplex, or G-4 motif. Here, four guanine bases form a single, planar hydrogen bonding network called a “G-tetrad”; three of these are stacked in the structure above to form the G-4 motif. This particular G-4 motif is composed of two strands of DNA, but G-4 motifs can be made of four separate strands or even a single strand. G-4 motifs do occur in living cells and seem to play a role in regulating and maintaining telomeres, which are specialized regions at the ends of chromosomes.

The DNA structure shown at the far right is known as an i-motif. The “i” comes from the presence of intercalated cytosine:cytosine base pairs. This structure is particularly interesting because it can be formed by a single strand of DNA folded into a box-like quadruplex structure, with alternating C:C pairs occurring between the opposite “corners” of the “box.”

Because one of the cytosines in each pair must be hemiprotonated, these structures form more readily at low pH, when there is a greater abundance of free H+ in solution. Certain cytosine-rich sequences have long been known to form i-motifs in vitro, but researchers have been skeptical about the possibility that these motifs occur in living cells, given the requirement for special conditions. However, a recent study by Mahdi Zeraati et al. demonstrates that i-motifs occur in living cells. The research group from Australia used an antibody-based approach to show that i-motifs are present, and interestingly, appear at higher frequency during certain stages of the cell cycle, specifically the G1/S boundary and the early S phase, which is when DNA replication begins. They also show that i-motifs appear in telomeres, in the regulatory regions of some genes, and when cells are exposed to high concentrations of carbon dioxide.

Given that i-motifs occur at different frequencies throughout the cell cycle or under different conditions, it is likely that these structures are dynamic, forming and unforming depending on the conditions. It is fun to speculate what the purpose of these structures might be, although little is known about their specific function. Are they recognized or bound by some regulatory protein? Do they play a role in telomere identity or maintenance, as G-4 motifs are thought to? Could their function be related to the double-stranded context in which they would occur? It is also fun to imagine what the dynamic formation of these structures might look like. Below, we can see a stretch of typical B-DNA, with one strand morphing into an i-motif structure.

(Video: a stretch of typical B-DNA with one strand morphing into an i-motif. Uploaded by Scientific Studios, 2018-06-13.)

Natural language processing, the CARMEL toolkit, and the Zodiac Killer

You may have heard about the “computer that was taught to think like the Zodiac Killer,” which was featured on a History Channel series about the eponymous murderer and has received a fair amount of science journalism coverage. Incorrectly referred to as a “supercomputer” by several articles, it is actually a software package called “Carmel” that was developed by Jonathan Graehl, Kevin Knight, and co-workers at the University of Southern California’s Information Sciences Institute, though I suppose that they are running Carmel on some very powerful computers for the large-scale cipher decryption jobs that have attracted so much attention from the press. The Carmel toolkit was designed to address natural language processing problems, which we can think of as “computer hearing” or “computer reading” problems, in some ways analogous to the computer vision topics I’ve covered in previous blog entries. You can download the toolkit for non-commercial purposes here.

Technically, Carmel is a “finite state transducer” package. But what the heck does that mean? First, we should define a finite state machine, which can be either an abstract “machine” in the form of a piece of software or a real, physical machine. A finite state machine can exist in exactly one state at a time, determined by its inputs, and can “transition” to another state given appropriate inputs. That may sound confusing, but there are finite state machines all around us. One classic example is a vending machine. Given the right amount of money and button presses, all in a specific sequence, the machine transitions from a resting state into a state wherein it is vending a product. Following vending, the machine transitions again, back to its resting state. Other examples of common finite state machines include subway turnstiles, which transition from locked to open after a token is inserted and back to locked after the passenger passes, and combination locks, which transition from locked to open following a specific series of numerical inputs.
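
To make the idea concrete, here is the subway turnstile written out as a tiny finite state machine in Python. The states and inputs come from the description above; the exact transition table is a simplifying assumption.

```python
# The subway turnstile as a finite state machine: two states, two input symbols,
# and a transition table. Details (e.g., extra tokens being wasted) are assumptions.

TRANSITIONS = {
    ("locked", "token"): "open",     # inserting a token unlocks the arm
    ("locked", "push"):  "locked",   # pushing a locked arm does nothing
    ("open",   "push"):  "locked",   # walking through re-locks it
    ("open",   "token"): "open",     # an extra token is wasted
}

def run(inputs, state="locked"):
    """Feed a sequence of input symbols to the machine, printing each transition."""
    for symbol in inputs:
        state = TRANSITIONS[(state, symbol)]
        print(f"{symbol:>5} -> {state}")
    return state

run(["push", "token", "push"])   # stays locked, opens, then locks again
```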

Finite state machines can be “acceptors” or “transducers.” Acceptors, like our vending machine example above, produce an output if a given input is accepted by the current state of the machine: the machine vends only if the correct inputs are provided; otherwise it just sits there. Transducers go beyond simply accepting or rejecting an input; instead, they convert, or transduce, an input signal into a different output signal. Finite state transducers find practical application in language processing jobs like automatic translation. The example below, taken from Kevin Knight and Yaser Al-Onaizan’s primer on finite-state software, shows some simple finite state machines for language processing. On the left is an acceptor that accepts three simple inputs: “he saw me,” “he ran home,” and “she talked.” In the middle is a representation of a transducer that will change an input of “big dog” to “small dog.” Finally, on the right is a representation of a transducer that might be used as (a very small) part of a speech-to-text application, where the phonetic sounds represented by capital letters SH, AE, and K are converted into (possible) letter equivalents.

Schematic representation of example finite state machines. Left, a simple finite state acceptor; middle, a finite state transducer that converts the phrase “big dog” to “small dog;” right, a finite state transducer that converts phonemes (capital letters) to letters (lowercase). The symbol *e* indicates an empty transition label and is necessary here so that SH is translated only to “sh” and “s” alone is disallowed.
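
And here is a sketch of something like the middle machine in the figure: a word-level finite state transducer with a single state that rewrites “big” to “small” and passes everything else through. It is a toy illustration, not the representation Carmel actually uses.

```python
# A single-state finite state transducer, sketched at the word level.
# Each arc maps an input symbol to an output symbol; unknown words map to themselves.

ARCS = {"big": "small"}   # input word -> output word

def transduce(sentence):
    """Rewrite an input word sequence into an output word sequence."""
    return " ".join(ARCS.get(word, word) for word in sentence.split())

print(transduce("big dog"))          # -> "small dog"
print(transduce("the big big cat"))  # -> "the small small cat"
```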

The examples above are “unweighted,” or deterministic, machines. To address practical language processing problems, “probabilistic” machines are needed. Simply put, probabilistic acceptors and transducers are deterministic machines with weights added to the transitions. If these weights reflect the probability of words appearing in specific sequences in a natural language like English, then we have begun to create a “language model.” A language model that can predict the relative probability of specific words occurring in a given sequence is created first by building a vocabulary using a large body of the target language in normal use and syntax. Then, the language model can be trained to recognize how frequently specific word combinations are likely to occur. This is useful in voice-recognition algorithms: for example, a program might distinguish the phonetically similar “I scream” from “ice cream” based on the surrounding syntax. The training process of these algorithms is conceptually akin to the training of neural networks that I have covered in previous posts.
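
Adding weights is a small change in code but a big change in power. Below is a sketch of a weighted acceptor loosely based on the three sentences from the figure; the state structure and probabilities are invented for illustration, and a real language model would estimate such weights from a large corpus.

```python
# A weighted finite state acceptor: transitions[state][word] = (next_state, probability).
# States and weights are invented for illustration.

transitions = {
    "start": {"he": ("subj", 0.7), "she": ("subj", 0.3)},
    "subj":  {"saw": ("verb_obj", 0.4), "ran": ("verb_dest", 0.3), "talked": ("end", 0.3)},
    "verb_obj":  {"me": ("end", 1.0)},
    "verb_dest": {"home": ("end", 1.0)},
}

def score(sentence, state="start"):
    """Probability the acceptor assigns to a word sequence (0.0 if rejected)."""
    prob = 1.0
    for word in sentence.split():
        if state not in transitions or word not in transitions[state]:
            return 0.0
        state, p = transitions[state][word]
        prob *= p
    return prob if state == "end" else 0.0

print(score("he saw me"))     # 0.7 * 0.4 * 1.0, roughly 0.28
print(score("she talked"))    # 0.3 * 0.3, roughly 0.09
print(score("she saw home"))  # 0.0 -- rejected
```

A weighted transducer works the same way, except that each arc also rewrites its input symbol into an output symbol.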

Another useful application of trained, linked, and nested networks of finite state machines like those that can be created with Carmel is cipher decryption. In 2011, Carmel was used to successfully decrypt a work known as the Copiale cipher. This manuscript consists of 75,000 handwritten characters in a substitution cipher and remained undeciphered for more than 260 years. One of the initial steps was to identify the language of the plaintext version of the manuscript. This was accomplished by comparing character usage patterns of the manuscript to known languages, together with a hint in the form of one of the only unencrypted words in the manuscript, “Phillipp,” a German spelling. After determining the original language, and with some intuitive guidance by Kevin Knight, the Copiale cipher was decoded in a matter of weeks using a “brute force” computational method based on a network of finite state machines trained on German letter and word usage patterns and managed by the Carmel package.
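
As a toy illustration of that first, language-identification step (and only an illustration; the actual Copiale analysis was far more sophisticated), you could compare the shape of a text’s symbol-frequency distribution against candidate languages, since a simple substitution cipher changes which symbols are used but not how unevenly they are used. The frequency numbers below are rough approximations, trimmed for brevity.

```python
from collections import Counter

# Approximate relative frequencies of the seven most common letters in each
# candidate language (most to least common). Real profiles would be estimated
# from large corpora; these are illustrative values only.
PROFILES = {
    "German":  [0.17, 0.10, 0.08, 0.07, 0.07, 0.07, 0.06],
    "English": [0.13, 0.09, 0.08, 0.08, 0.07, 0.07, 0.06],
}

def sorted_frequencies(text, top_n=7):
    """Relative frequencies of the most common symbols, ignoring their identity."""
    counts = Counter(c for c in text if not c.isspace())
    total = sum(counts.values())
    return [n / total for _, n in counts.most_common(top_n)]

def guess_language(ciphertext):
    """Pick the candidate whose profile is closest (least squared error)."""
    observed = sorted_frequencies(ciphertext)
    return min(
        PROFILES,
        key=lambda lang: sum((o - e) ** 2 for o, e in zip(observed, PROFILES[lang])),
    )
```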

That brings us to the Zodiac Killer, who was active in northern California in the late 1960s and early 1970s. The Zodiac Killer is probably most famous for the taunting letters he sent to the press. In addition to the taunts and threats, these letters contained apparently encoded text. The first Zodiac cipher was decrypted within a few days by a schoolteacher and his wife, but three other Zodiac ciphers have remained unsolved to this day. In an attempt to decode the remaining unsolved ciphers, Carmel was used to train a network of finite state machines on the specific writing style of the Zodiac Killer, who sent many letters in plaintext to papers and the police while active. Although the ciphers appear not to have been solved yet, an interesting side effect has been the creation of an algorithm that can recreate the verbal style of the Zodiac Killer. Knight and co-workers have even created a tool you can use to make your own (bad) creepy Zodiac poetry (although I haven’t been able to get this to work in my browser…). Here’s an example.

Mourning moaning mournful whispers weeping,
Together through the sleepless strangers slumber,
Alone and lonely lovers sleeping thieving,
A helpless dead forgotten former lover.

Through the taxi and the prison break,
Alone and angry at a brutal murder.
Surrounded by an artificial lake,
Never a convicted murderer…

3D from 2D, Structure from Motion, and hours spent inside the Colosseum

Last post I discussed extracting 3D data from a very small subject using a pair of stereo microscopic images and calculating a disparity map. This time, I’m going to talk about how larger 3D structures, like historic landmarks and city streets, can be reconstructed from 2D data.

You can see in the figure below that a single pair of stereoscopic images can give only limited information about the 3D structure of the subject. The rotating surface below was calculated directly from the stereo pair shown at the top of the image.

A 3D surface (bottom) calculated based on a stereo pair of SEM images (top).

To get more information about a 3D object, we need more 2D images from different angles and perspectives. If possible, we’d like to have pictures from all sides of an object to fully reconstruct its structure. If our target were smaller than a breadbox but large enough to see with the naked eye, then it would be easy to place the object on a turntable and take a series of stereo or monocular pictures of it as it rotates (or rotate a camera around the object).

But what if our target is considerably larger than a breadbox? What if we want to reconstruct the Roman Colosseum, for example? To do this, we can use a set of techniques collectively called “structure from motion” or SFM. SFM has a lot in common with the methods I used to create the 3D surface in the example above. In fact, you can think of stereo vision reconstruction as a subset of SFM where the image inputs are limited to a pair of images taken from slightly offset (binocular) positions. In the example above, I first used a block-matching algorithm to identify areas in both images that correspond to the same features of the object. Then, I used an OpenCV module to calculate a disparity map that describes how far the various features appeared to move in one stereo image compared to the other, which in turn reflects the relative distance of the features from the camera.

Just as in the stereo reconstruction example, the first step in larger SFM tasks involving many images is to find correspondence between the images. However, the block-matching method is only useful for certain types of images, such as stereo pairs (stationary object, camera in two positions) or sequential frames of a video of a moving subject (stationary camera, object in two positions). When we have more sets of 2D information, like pictures of an object taken from multiple angles, we need to employ feature recognition to find correspondence among the images and connect them to the real object.

The two most common general methods for feature recognition for SFM are scale-invariant feature transform (SIFT) and speeded-up [sic] robust features (SURF). We’ll focus our discussion on the older, open-source SIFT algorithm; although SURF offers some advantages, methods based on SIFT have been more widely applied. The first step in SIFT is the identification of keypoints. To do this, the algorithm first smooths an image so that high-frequency local variations are reduced, giving robust, low-frequency features relatively more weight. Then, keypoints are assigned to the local minima or maxima of a difference of Gaussians (DoG) calculation. DoG is a feature-enhancement method based on subtracting a more blurred image from a less blurred image (both produced by convolving the image with Gaussian kernels of different widths; see this previous post about convolutional neural networks for more detail). This has the effect of reducing the complexity of the image so that only high-contrast, robust features (like edges) are retained.

A picture of the Roman Colosseum before processing (left), after Gaussian blur (middle), and following a Difference of Gaussians calculation (right). Some possible maxima (identified by eye) that might give rise to SIFT features are indicated with arrows.
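
Here is roughly what that DoG step looks like with OpenCV. The file name is a placeholder and the two sigma values are arbitrary; real SIFT implementations compute a whole pyramid of blurs at carefully spaced scales.

```python
import cv2
import numpy as np

# Load the photo as grayscale (the file name is a placeholder).
image = cv2.imread("colosseum.jpg", cv2.IMREAD_GRAYSCALE).astype(np.float32)

# Blur the same image with two different Gaussian widths (sigmas chosen arbitrarily).
slightly_blurred = cv2.GaussianBlur(image, (0, 0), sigmaX=1.0)
more_blurred     = cv2.GaussianBlur(image, (0, 0), sigmaX=2.0)

# Subtracting the blurrier image from the sharper one leaves mostly edges and
# blob-like features; SIFT looks for extrema of this response across scales.
dog = slightly_blurred - more_blurred
```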

The next step in SIFT is feature matching and indexing. Given many images of the Colosseum processed as above, the algorithm begins to assign putative features to “bins” of similar features. Using a clever computational time saver to compare thousands of features (or keypoints), the algorithm can identify corresponding keypoints across multiple images.
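
In practice you rarely code this from scratch; recent OpenCV builds expose SIFT directly. A sketch of detecting and matching keypoints between two photos (file names are placeholders) might look like this, with a brute-force matcher standing in for the approximate nearest-neighbor indexing (the “clever computational time saver”) that real pipelines use.

```python
import cv2

img1 = cv2.imread("colosseum_view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("colosseum_view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)   # keypoints and 128-d descriptors
kp2, desc2 = sift.detectAndCompute(img2, None)

# Brute-force matching with the ratio test: keep a match only if it is clearly
# better than the second-best candidate, which filters out ambiguous features.
matcher = cv2.BFMatcher()
candidates = matcher.knnMatch(desc1, desc2, k=2)
good = [m for m, n in candidates if m.distance < 0.75 * n.distance]

print(f"{len(good)} plausible correspondences between the two views")
```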

However, using DoG extrema alone produces too many candidate keypoints. To narrow the set of keypoints to the most useful ones, the SIFT algorithm uses image data near the keypoints to assign a relative physical position to each. For example, the edges of the Colosseum in the example above have high contrast with the sky at the angle the picture was taken, and therefore these locations would seem to be good candidates for keypoints. However, if we moved to the side of the Colosseum and took another picture, those edges would no longer be set against the background of the sky, and are therefore not persistent features that describe the physical object well from all angles. Features of the walls of the Colosseum itself would be more useful for mapping the physical location of additional keypoints because the image data around such features will persist to some degree from any angle. You can see how this works out in the extremely simplified example below.

 
Views of a cartoon clock tower from two angles. Stars indicate some points where Gaussian difference maxima would appear due to high contrast between neighboring pixels. On the right, the uninformative keypoints have been removed.


After keypoints have been identified, localized, and filtered, each one is assigned an orientation based on the pixel values and gradients around it. Now, keypoints can be related to physical features of the object regardless of image scale or camera position. The local image data around each keypoint (pixel values and directional gradients in pixel value extending from the keypoint position) are matched with local image data around neighboring keypoints until a physical 3D map is created. For a great example of SFM application to a large-scale subject—the city of Rome—read Agarwal et al. (2011) Building Rome in a Day. Communications of the ACM 54:10, 105-112, or check out the BigSFM group at Cornell.

3D from 2D, disparity maps, and one evil weevil

An interesting problem in computer vision is extracting three-dimensional information from two-dimensional data. Say we want to recreate a real object in imaginary 3D space on a computer, for an animation or for 3D printing, for example. These days, there are many different types of 3D scanners out there; some probe an object by touching it physically, and others project energy in the form of light or x-rays to probe the surface or volume, respectively.

But what if the object you want to scan is too large—or too small—to fit in a 3D scanner?

Here, computer vision technology once again takes inspiration from biology. An important function of the human visual cortex is to convert the 2D images projected onto each retina into a full understanding of the 3D world around us. We call this depth perception, which arises from two types of cues: monocular and binocular. We’ll leave monocular cues alone for now, not because they aren’t interesting, but because they have less relevance for the computer vision problem I want to discuss. The central idea that connects biology with 3D reconstruction from limited 2D data is the concept of “stereopsis,” AKA binocular disparity or binocular parallax.

Animals with forward-facing eyes (like us) experience stereopsis. Try this: focus on your screen and close one eye. Bring your index finger in front of your face. Now close your open eye and open the other one. Your finger will have appeared to move relative to the screen. The apparent difference, or disparity, in the position of your finger is because the focal points of your eyes are converged on the screen and each eye sees your finger from a slightly different angle. The closer your finger is to your eye, the greater the disparity.

So, if the amount of disparity reflects the relative distance of an object from the detectors (your eyes in the example above), maybe we can extract depth information from a pair of images taken of an object that isn’t a convenient size for 3D scanning. The image below shows a pair of scanning electron micrographs of a weevil taken at two different angles, mimicking the offset position of your left and right eyes in the example above.

 

If you live in the American South, like me, you may have heard of the infamous boll weevil, which has cost the US cotton industry over $13 billion since its introduction in the early part of the 20th century. Pictured below is a different species of weevil that is also a crop pest.

The first two panels show individual SEM images taken at two different angles, mimicking an offset pair of eyes. The last panel cycles the two images so the disparity is more obvious. Image credit: Dr. James Kremer, AgBiome.

Now that we have a stereo pair of images, we need to create a disparity map that will tell us which parts of the images are closer to the camera. For this I used OpenCV, an open-source computer vision library. Specifically, I used a block matching algorithm, which is commonly applied for motion estimation. Basically, block matching is a way to identify which pixels in two images (or sequential frames of a video) belong to the same objects or features. To get depth information, a disparity map is then calculated based on the distance by which the various individual features appear to be displaced. In the disparity map shown below, calculated from the weevil image pair, lighter shades indicate greater disparity, meaning that those regions are closer to the observer.

Disparity map calculated from the weevil stereo pair. The background has been “zeroed out” to black.
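
For the curious, the disparity calculation itself takes only a few lines with OpenCV’s block matcher. This is a sketch rather than my exact script: the file names are placeholders and the tuning parameters (number of disparities, block size) are typical starting values, not the ones used for the weevil.

```python
import cv2

left  = cv2.imread("weevil_left.png",  cv2.IMREAD_GRAYSCALE)
right = cv2.imread("weevil_right.png", cv2.IMREAD_GRAYSCALE)

# Block matching: slide a small window from the left image along the same row
# of the right image and record how far each block appears to have shifted.
stereo = cv2.StereoBM_create(numDisparities=64, blockSize=15)
disparity = stereo.compute(left, right)

# Scale to 0-255 for viewing; brighter pixels are closer to the camera.
disparity_view = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("weevil_disparity.png", disparity_view)
```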

I then used the pixel values of this disparity map to create a surface that reflects the depth information contained in the stereo pair.

3D surface created from the disparity map.
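
One way to sketch the disparity-to-surface step is to treat the (smoothed) disparity map as a height field and plot it. This isn’t necessarily how the surface above was produced, and the smoothing and downsampling values are arbitrary stand-ins.

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # noqa: F401 (registers the 3d projection)

# Use the disparity map as a height field (the file name is a placeholder).
heights = cv2.imread("weevil_disparity.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
heights = cv2.GaussianBlur(heights, (0, 0), sigmaX=3.0)   # smooth out block artifacts
heights = heights[::4, ::4]                               # downsample for plotting speed

rows, cols = heights.shape
x, y = np.meshgrid(np.arange(cols), np.arange(rows))

fig = plt.figure()
ax = fig.add_subplot(111, projection="3d")
ax.plot_surface(x, y, heights, cmap="viridis")   # greater disparity = taller surface
plt.show()
```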

Then, I applied one of the original weevil images to the surface as a texture.

3D surface created from the disparity map and colored using the original SEMs.

Looks pretty good, but far from perfect. One of the obvious shortfalls of using stereo pairs to re-create a 3D object is that we have no information about the back of the subject. Going back to our first example, you can’t see what the side of your finger facing the screen looks like. As a result, a reconstruction based only on binocular disparity looks more like a sheet draped over the object, or at best a shrink-wrapped surface. However, our experience with other objects in the world and our brains’ ability to process complex visual patterns allow us to intuit properties of the real objects from still images beyond what disparity calculations can give us. For example, we can predict that the two features closest to the camera (feet? pedipalps? I’m really not an entomologist) are above the head of the subject with space between them. We can also intuit the rounded shape of the “arms” and “head,” and we can assume some degree of bilateral symmetry for the parts we can’t see directly. Combining these predictions with the depth information from the disparity map and adding a few other details, we get the following complete reconstruction.