Reading in the Brain as a Machine Learning Model

The words GRAPHEME and PHONEME are behind a translucent illustration of a person reading.

In the book Reading in the Brain: The New Science of How We Read, Stanislas Dehaene takes us on a tour of the brain that drills all the way down to the individual neurons involved in reading. Among the hundreds of interesting facts in the book (doctors have identified the specific neuron that a patient with epilepsy used to recognize Jennifer Aniston, for example), Dehaene makes a case that, when it comes to reading, humans are both faster and more capable than computers.

First, let’s consider a computer program that converts a scanned page of a book into text. For every word, the computer examines each letter in sequence, and each letter it reads eliminates many candidate words (for instance, if the first letter is “c”, then any word that starts with a different letter is discarded). By the time it reaches the end of the word, every word in the computer’s dictionary has been eliminated except one.
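
Here is a rough sketch of that sequential approach in Python. The tiny vocabulary is made up purely for illustration:

```python
# A toy version of sequential, letter-by-letter word lookup.
# The vocabulary here is a made-up stand-in for a real dictionary.
VOCABULARY = ["cat", "cream", "crane", "scream", "rain", "rail", "tank"]

def identify_word_sequentially(scanned_letters):
    """Narrow the candidates one letter at a time, like the scanner example."""
    candidates = list(VOCABULARY)
    for position, letter in enumerate(scanned_letters):
        # Discard every word that doesn't have this letter at this position.
        candidates = [
            word for word in candidates
            if len(word) > position and word[position] == letter
        ]
    # Keep only words of exactly the right length.
    return [word for word in candidates if len(word) == len(scanned_letters)]

print(identify_word_sequentially("cream"))   # ['cream']
print(identify_word_sequentially("scream"))  # ['scream']
```

Notice that the work grows with the length of the word: every extra letter means another filtering pass over the remaining candidates.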

Now let’s look at a human brain. There are a few things to marvel at here:

  1. Most people know roughly 50,000 words.
  2. When we look at a word, we do not check it against our vocabulary one letter at a time. We compare it with every word we know, all at once (see the sketch just after this list).
  3. Generally speaking, humans take about the same amount of time to recognize a word regardless of its length, whereas a sequential computer search takes longer the longer the word is.
  4. Humans are better than computers at error correction: computers get derailed when things aren’t spelled right, and we don’t.
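
To see the contrast, here is a toy sketch of “comparing against every word at once”: every word in a made-up vocabulary gets a match score in a single pass, with no early elimination, and a single misspelled letter doesn’t derail the winner. The vocabulary and the scoring rule are invented for illustration.

```python
# A toy version of parallel word recognition: score the whole vocabulary
# at once instead of filtering letter by letter.
VOCABULARY = ["cream", "scream", "rain", "rail", "tank", "crocodile"]

def score(word, stimulus):
    """Count letters that match by position; penalize length differences."""
    matches = sum(1 for a, b in zip(word, stimulus) if a == b)
    return matches - abs(len(word) - len(stimulus))

def recognize(stimulus):
    # Every candidate is scored "simultaneously" -- one pass, no pruning.
    scores = {word: score(word, stimulus) for word in VOCABULARY}
    return max(scores, key=scores.get)

print(recognize("scream"))     # scream
print(recognize("crocqdile"))  # crocodile -- one bad letter doesn't matter
```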

We can do this because the brain is a neural network: billions of neurons working together on the problem in a massively interconnected way. Artificial neural networks borrow this idea, and they are the machinery behind the “deep learning” that computers do. With AI so prominent in the public consciousness, it’s helpful to understand what a neural network actually is.

Let’s look at an excerpt from Dehaene’s book (page 42). First we’ll consider the allegory of An Assembly of Daemons, and then a practical example of how the brain uses nature’s neural networks to read a word: features of letters inform letter detectors, which in turn inform words:


An Assembly of Daemons

This lively metaphor holds that the mental lexicon can be pictured as an immense semicircle where tens of thousands of daemons compete with one another. Each daemon responds to only one word, and makes this known by yelling whenever the word is called and must be defended. When a letter string appears on the retina, all the daemons examine it simultaneously.

Those that think that their word is likely to be present yell loudly. Thus when the word “scream” appears, the daemon in charge of the response to it begins to shout, but so does its neighbor who codes for the word “cream.” “Scream” or “cream”? After a brief competition, the champion of “cream” has to yield—it is clear that his adversary has had stronger support from the stimulus string “s-c-r-e-a-m.” At this point the word is recognized and its identity can be passed on to the rest of the system.
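
Here is one way to picture that shouting match in code. This is a loose sketch, not Dehaene’s (or anyone’s) actual model: the four-word lexicon, the scoring rule, and all the constants are invented for illustration.

```python
# A toy pandemonium: each word-daemon's "shout" grows with the evidence for
# its word, rival shouting dampens it, and the race ends once one daemon
# leads by a comfortable margin.
VOCABULARY = ["scream", "cream", "dream", "steam"]

def support(word, stimulus):
    """How strongly the stimulus letters argue for this word (0 to 1)."""
    in_place = sum(1 for a, b in zip(word, stimulus) if a == b)
    # Shared letters that are out of position still count a little, which is
    # why the "cream" daemon also shouts when it sees "scream".
    shared = len(set(word) & set(stimulus))
    return (in_place + 0.5 * max(0, shared - in_place)) / max(len(word), len(stimulus))

def pandemonium(stimulus, steps=50, margin=0.3):
    shouts = {word: 0.0 for word in VOCABULARY}
    for _ in range(steps):
        total = sum(shouts.values())
        for word in VOCABULARY:
            rivals = total - shouts[word]
            # Evidence pushes the shout up; rival shouting pushes it down.
            shouts[word] = max(0.0, shouts[word] + 0.2 * support(word, stimulus) - 0.05 * rivals)
        ranked = sorted(shouts, key=shouts.get, reverse=True)
        if shouts[ranked[0]] - shouts[ranked[1]] > margin:
            break
    return ranked[0]

print(pandemonium("scream"))  # scream -- beats its neighbor "cream"
```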

Daemons fight for the right to represent the correct word. This competition process yields both flexibility and robustness. The pandemonium automatically adapts to the complexity of the task at hand. When there are no other competitors around, even a rare and misspelled word like “astrqlabe” can be recognized very quickly—the daemon that represents it, even if it initially shouts softly, always ends up beating all the others by a comfortable margin. If, however, the stimulus is a word such as “lead,” many daemons will activate (those for “bead,” “head,” “read,” “lean,” “leaf,” “lend” . . .) and there will be a fierce argument before the “lead” daemon manages to take over.

A diagram of how features of letters (horizontal lines like in E and F, for example) inform which letter is being examined, which in turn informs the word that is being looked at. Every feature votes on each letter and every letter votes on each word.

FIGURE 1.5

The word identification process is similar to a vast assembly where thousands of letter and word units conspire to provide the best possible interpretation of the input string. In McClelland and Rumelhart’s model, of which only a fragment is shown here, basic features of the input string activate letter detectors, which in turn preferentially connect to detectors of the words that contain them. The links can be excitatory (arrows) or inhibitory (lines ending with discs). A fierce competition between lexical units finally identifies a dominant word that represents the network’s preferred hypothesis about the incoming string.


Note that at the top, input units are sensitive to line segments presented on the retina. In the middle lie letter-detector units that fire whenever a given letter is present. And at the bottom, units code for entire words.

All of these units are tightly linked by a swarm of connections. This enormous connectivity turns the network dynamics into a complex political game in which letters and words support, censor, or eliminate each other. If you study the graph carefully, you will see excitatory connections, represented by small arrows, as well as inhibitory connections, represented by small circles.

Their role is to propagate the votes of each of the daemons. Each input detector, which codes for a specific feature such as a vertical bar, sends stimulation to all the letters that contain this particular feature—one might say, for the sake of simplicity, that each visual neuron “votes” for the presence of these letters. At the next level, similarly, letter detectors conspire to elect specific words through stimulation of their corresponding units. The presence of letters “A” and “N,” for instance, supports the words “RAIN” and “TANK,” but only partially argues for the word “RAIL” and not at all for “PEST.”
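
In code, that bottom-up voting might look roughly like this. It is a toy fragment in the spirit of the figure, not the actual McClelland and Rumelhart model: the stroke features assigned to each letter and the four-word lexicon are invented for illustration.

```python
# Toy bottom-up voting: detected stroke features "vote" for letters, and
# active letters "vote" for the words that contain them.
FEATURES_OF = {                       # which strokes make up each letter (made up)
    "R": {"vertical", "loop", "diagonal"},
    "A": {"diagonal", "horizontal"},
    "I": {"vertical"},
    "N": {"vertical", "diagonal"},
    "T": {"vertical", "horizontal"},
    "K": {"vertical", "diagonal"},
    "L": {"vertical", "horizontal"},
    "P": {"vertical", "loop"},
    "E": {"vertical", "horizontal"},
    "S": {"curve"},
}
LEXICON = ["RAIN", "RAIL", "TANK", "PEST"]

def letter_votes(detected_features):
    """Each detected feature votes for every letter that contains it."""
    return {letter: len(strokes & detected_features)
            for letter, strokes in FEATURES_OF.items()}

def word_votes(active_letters):
    """Each active letter votes for every word that contains it."""
    return {word: sum(1 for letter in active_letters if letter in word)
            for word in LEXICON}

print(letter_votes({"diagonal", "horizontal"}))  # the "A" detector collects the most votes
print(word_votes({"A", "N"}))  # RAIN and TANK lead, RAIL gets partial support, PEST gets none
```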

Inhibition also contributes toward selection of the best candidate. Thanks to inhibitory connections, letters can vote against the words that do not contain them. For instance, the unit that codes for letter “N” votes against the word “RAIL” by inhibiting it. Furthermore, words that compete inhibit each other. Thus identification of the word “RAIN” is incompatible with the presence of the word “RAIL,” and vice versa.
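
Adding the inhibitory votes to the toy sketch (again, the lexicon and the weights are invented):

```python
# Toy inhibitory voting: an active letter supports the words that contain it,
# votes against the words that don't, and rival words drag each other down.
LEXICON = ["RAIN", "RAIL", "TANK", "PEST"]
EXCITE, INHIBIT, LATERAL = 1.0, 0.5, 0.2   # invented weights

def net_support(active_letters):
    support = {}
    for word in LEXICON:
        score = 0.0
        for letter in active_letters:
            score += EXCITE if letter in word else -INHIBIT
        support[word] = score
    # Lateral inhibition: each word is pushed down by its rivals' support.
    total = sum(max(s, 0.0) for s in support.values())
    return {word: support[word] - LATERAL * (total - max(support[word], 0.0))
            for word in LEXICON}

# With R, A, I, and N active, "N" votes against "RAIL" and "RAIN" pulls ahead.
print(net_support({"R", "A", "I", "N"}))
```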

Finally, it is useful to incorporate top-down connections, from words to their component letters. This process can be compared to a senate where letters are represented by words that, in return, support the letters that voted for them. Reciprocal connections allow for the creation of stable coalitions that can resist an occasional missing letter: if one letter “o” is lacking in the word “crocqdile,” for instance, its neighbors will still conspire to elect the word “crocodile,” and in turn the latter will vote for the presence of a middle letter “o” that is not physically present. In the end, millions of connections are needed to incorporate the numerous statistical constraints that link the levels of words, letters, and features.
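
Here is a tiny sketch of that top-down step, using Dehaene’s “crocqdile” example (the three-word lexicon and the crude scoring rule are my own invention):

```python
# Toy top-down feedback: bottom-up votes still pick "crocodile" despite the
# bad letter, and the winning word then votes for the letter that was
# missing from the stimulus.
LEXICON = ["crocodile", "crocus", "chronicle"]

def bottom_up(stimulus):
    """Score each word by letters matching in position (a crude stand-in)."""
    return {word: sum(1 for a, b in zip(word, stimulus) if a == b)
            for word in LEXICON}

def read_with_feedback(stimulus):
    scores = bottom_up(stimulus)
    winner = max(scores, key=scores.get)
    # Top-down pass: the winning word supports each of its letters, even at
    # positions where the stimulus disagreed.
    restored = {position: letter
                for position, (letter, seen) in enumerate(zip(winner, stimulus))
                if letter != seen}
    return winner, restored

word, restored = read_with_feedback("crocqdile")
print(word)      # crocodile
print(restored)  # {4: 'o'} -- the missing middle "o" is filled in top-down
```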

Other subtleties also allow for the whole network to operate smoothly. For instance, word units can have different thresholds for firing. A word encountered frequently has a lower threshold than a rare word, and with an equal amount of bottom-up support has a better chance to win the race. The most recent models also incorporate additions such as a fine-grained coding of letter position. The resulting network has such a complex set of dynamics that it is impossible to fully describe it mathematically. One has to resort to computer simulations in order to determine how long the system takes to converge to the correct word, and how often it misidentifies it.
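
One last piece of the toy sketch: frequency-sensitive thresholds, plus a simulation loop that runs until a word fires. Everything here (the two-word lexicon, the frequencies, the constants) is invented for illustration.

```python
# Toy frequency effect: both words get identical bottom-up support, but the
# more frequent word has a lower firing threshold, so it wins the race.
FREQUENCY = {"house": 500, "horse": 50}   # made-up occurrences per million

def threshold(word):
    """More frequent words need less accumulated evidence before firing."""
    return 10.0 / (1.0 + 0.01 * FREQUENCY[word])

def simulate(bottom_up_support=0.5, max_steps=1000):
    activation = {word: 0.0 for word in FREQUENCY}
    for step in range(1, max_steps + 1):
        for word in activation:
            activation[word] += bottom_up_support   # equal evidence for both
            if activation[word] >= threshold(word):
                return word, step                   # converged
    return None, max_steps

word, steps = simulate()
print(word, steps)   # "house" fires first because its threshold is lower
```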


Phew! Reading any one word, processing it, and deriving meaning from it takes a ton of effort. This example is a good metaphor for how neural networks in machine learning work. We have on the order of 100 billion neurons in our noggins. Eat that, computers!


Image modified from pxhere by Mohamed Hassan – CC0