Shakespeare’s Vocabulary Considered Unexceptional
Shakespeare’s vocabulary is held to be extraordinary among writers. Its relative enormity is unquestioned in the popular and academic literature, bolstered by – and reflexively reaffirming – the peculiar status Shakespeare holds within our culture. 1
I wrote a few simple programs to analyse his works and those of other writers with corpuses of similar size, to see how they compared.
What I discovered suggests that Shakespeare’s vocabulary, while far from small, is far from extraordinary among writers when size of corpus is taken into account.
What I found is expressed in the two graphs below.
Each graph shows the relative size of corpus, number of unique word tokens, and number of unique word stems for various authors’ available works (Figure 1).
Each writer is represented by three numbers.
The first column is the size of the corpus examined (i.e. how many “real” word tokens there were in the texts I gathered together), divided by 20 so that the numbers are comparable to the other data points. The “corpus” for each writer consisted of the works I could download from the Project Gutenberg website. (Note that Joyce’s corpus here does not include Finnegans Wake.) 2
The second column is the number of unique tokens found in the above corpus. 3
The third column is the size of the writer’s vocabulary, measured as the number of unique stemmed word tokens in the corpus. A stemmed word is a “root” form of a word that may have several distinct relations in other word tokens. A stemming algorithm might reduce the words “fishing”, “fished”, “fish”, and “fisher” to the root word “fish”. To determine the word stems I used a Porter stemming algorithm implemented in Perl and freely available on the web. 4
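The three numbers described above can be sketched in a few lines of Python. This is only an illustration, not the Perl that was actually used: the function names `crude_stem` and `summarise` are hypothetical, and `crude_stem` is a toy suffix stripper standing in for the Porter algorithm, so its stems will differ from Porter's on real text.

```python
def crude_stem(word):
    """Toy suffix stripper; a stand-in for illustration, NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Only strip when at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def summarise(tokens, scale=20):
    """Return (corpus size / scale, unique tokens, unique stems) for a token list."""
    unique_tokens = set(tokens)
    unique_stems = {crude_stem(t) for t in unique_tokens}
    return len(tokens) / scale, len(unique_tokens), len(unique_stems)


tokens = "fish fishing fished fish the the nets".split()
print(summarise(tokens))  # (0.35, 5, 3)
```

Dividing the corpus size by 20 only rescales the first column so it can sit on the same axis as the other two; it does not affect the token or stem counts.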
The first thing to acknowledge is that Shakespeare’s vocabulary is larger than that of some other notable writers with similar-sized corpuses. It is significantly larger, for example, than that of Dickens or Richardson.
At first glance it appears that Shakespeare’s vocabulary was markedly larger than Marlowe’s. However, taking a similar-sized corpus of Shakespeare’s younger works, you can see that the vocabulary size for these works is almost identical to Marlowe’s. To test the hypothesis that Shakespeare’s vocabulary grew as he got older, I also examined a similar-sized corpus of his later works. Again, the results were very similar.
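One way to make the idea of comparing like with like concrete is to count distinct tokens at a matched corpus size. The sketch below is an assumption about how that could be done, not the method used here: the comparison above was made by choosing whole works that added up to a similar size, whereas this simply truncates the longer token list. The names `vocabulary_at` and `compare_at_matched_size` are hypothetical.

```python
def vocabulary_at(tokens, n):
    """Number of distinct tokens among the first n tokens."""
    return len(set(tokens[:n]))


def compare_at_matched_size(corpus_a, corpus_b):
    """Compare two token lists at the size of the smaller one.

    Returns (matched size, vocabulary of a, vocabulary of b).
    """
    n = min(len(corpus_a), len(corpus_b))
    return n, vocabulary_at(corpus_a, n), vocabulary_at(corpus_b, n)


a = "a b c a b c a b".split()   # 8 tokens
b = "x y z w".split()           # 4 tokens
print(compare_at_matched_size(a, b))  # (4, 3, 4)
```

Truncation favours whatever happens to come first in each text, so selecting whole works, as was done here, may be the fairer choice for literary corpora.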
Melville, with fewer words in Moby Dick than the younger Shakespeare, has a greater vocabulary than displayed there and in Marlowe’s works.
Milton is often cited as having a smaller vocabulary than Shakespeare, but this is also not borne out by the analysis. In fact, given the relatively small size of his available corpus, his vocabulary is very large indeed. 5
Hardy – with a similar sized corpus – also shows a vocabulary not dissimilar to Shakespeare’s. Far more unique words than any other writer, even given his smaller corpus, and the only writer in the study with more than 20,000 stemmed words.
The vocabulary king among writers is Joyce, whose vocabulary towers over Shakespeare’s (Finnegans Wake was not included) even with a significantly smaller corpus.
Shakespeare’s vocabulary might be reduced further if we took out place and other names from his works, and removed the variant spellings more common in the era before standardized spelling.
The myth of Shakespeare’s unusually large vocabulary suggests that our view of Shakespeare has been warped by our veneration of his work. Rather than see him as an unusually successful writer whose works have remained popular over centuries, we have tried to make his literary abilities seem extraordinary too. Shakespeare is also said to have invented many words. Is this a myth too?
Does the size of a writer’s vocabulary matter? Isn’t it even more impressive that he managed to do so much with nothing more than the tools other writers possess?
It may also be worth researching further whether this analysis indicates a difference in vocabulary size between playwrights (e.g. Shakespeare, Marlowe), poets (e.g. Milton, Shakespeare, Marlowe(?)) and novelists (e.g. Richardson and Dickens), or even whether our intuitive understanding of these categories is aligned with writers’ displayed vocabularies.
“However, the single most remarkable feature about Shakespeare’s poetic language is his extraordinary vocabulary, his choice of particular words to convey particular emotional attitudes. Earlier I have had occasion to note that Shakespeare’s working vocabulary is enormous (about 25,000 words, more than twice as many as his nearest rival, John Milton)” Ian Johnston, “Studies in Shakespeare: Some Observations on Shakespeare’s Dramatic Verse in Richard III and Macbeth”, 1999, http://records.viu.ca/~johnstoi/eng366/lectures/poetry.htm
“Critics have long recognized that Shakespeare had an unusually large mental lexicon that was perhaps organized around particularly strong image-based mental models. […] Shakespeare’s almost uniquely rich use of language.” M. T. Crane, Shakespeare’s Brain: Reading with Cognitive Theory (Princeton NJ: Princeton University Press, 2000), 24
G. L. Brook, The Language of Shakespeare (London: Andre Deutsch, 1976), pp. 26-64
S. S. Hussey, The Literary Language of Shakespeare (New York: Longman, 1982), pp. 37-60
Works used in analysis:
Charles Dickens: “A Christmas Carol”, “Bleak House”, “Barnaby Rudge”, “David Copperfield”
Samuel Johnson: “Grammar of the English Tongue”, “Lives of the English Poets: Prior, Congreve, Blackmore, Pope”, “Notes to Shakespeare, Volume III: The Tragedies”, “Johnson’s Notes to Shakespeare Vol. I Comedies”, “Prefaces and Prologues to Famous Books”, “Preface to a Dictionary of the English Language”, “Preface to Shakespeare”
Thomas Hardy: “A Pair of Blue Eyes”, “The Mayor of Casterbridge”, “The Return of the Native”, “Tess of the d’Urbervilles”, “Jude the Obscure”, “Far from the Madding Crowd”, “Return of the Native”
George Eliot: “Middlemarch”
Henry James: “The Bostonians”, “The Portrait of a Lady”, “The Wings of the Dove”
James Joyce: “Dubliners”, “Ulysses”, “A Portrait of the Artist as a Young Man”
Christopher Marlowe: “Various minor poems”, “Dido, Queen of Carthage”, “Dr Faustus”, “Edward II”, “The Jew of Malta”, “Massacre at Paris”, “Tamburlaine the Great (part i, ii)”
Herman Melville: “Moby Dick”
John Milton: “Areopagitica”, “Milton’s Comus”, “Minor Poems by Milton”, “Paradise Lost”, “Paradise Regained”
Samuel Richardson: “Clarissa”
Shakespeare: “The Sonnets”, “A Lover’s Complaint”, “All’s Well That Ends Well”, “Antony and Cleopatra”, “As You Like It”, “The Comedy of Errors”, “Coriolanus”, “Cymbeline”, “Hamlet”, “Henry IV (parts i, ii)”, “Henry V”, “Henry VI (parts i, ii, iii)”, “Henry VIII”, “King John”, “Julius Caesar”, “King Lear”, “Love’s Labour’s Lost”, “Macbeth”, “The Merchant of Venice”, “Measure for Measure”, “The Merry Wives of Windsor”, “A Midsummer Night’s Dream”, “Much Ado About Nothing”, “Othello”, “Richard II”, “Richard III”, “Romeo and Juliet”, “The Taming of the Shrew”, “The Tempest”, “Timon of Athens”, “Titus Andronicus”, “Troilus and Cressida”, “Twelfth Night”, “The Two Gentlemen of Verona”, “The Winter’s Tale”
Shakespeare Younger: “The Comedy of Errors”, “Henry VI (parts i, ii, iii)”, “King John”, “Richard III”, “The Taming of the Shrew”, “Titus Andronicus”, “Twelfth Night”, “Love’s Labour’s Lost”, “Romeo and Juliet”
Shakespeare Older: “The Sonnets”, “Cymbeline”, “Hamlet”, “Henry VIII”, “King Lear”, “Macbeth”, “Measure for Measure”, “Othello”, “The Tempest”, “Timon of Athens”, “The Winter’s Tale”, “A Lover’s Complaint”
Burton’s “Anatomy of Melancholy” was also analysed, but contains a great deal of Latin text interspersed, making his vocabulary anomalously large.
A “word” is any set of contiguous non-space, non-punctuation characters or punctuation not beginning with a number. Possessives (“.*’s”) were removed. In words shortened with “.*’d”, the “’d” was replaced with “ed”.
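The rules in this note could be approximated as follows. This is a minimal sketch under stated assumptions rather than the original Perl: the function name `tokenize` and the regular expression are hypothetical, and lower-casing, treating curly and straight apostrophes alike, and keeping other internal apostrophes are all my assumptions, since the note does not say how those cases were handled.

```python
import re

# Runs of letters/digits, optionally joined by internal apostrophes (assumed pattern).
CANDIDATE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def tokenize(text):
    """Return normalised word tokens under the assumed reading of the rules."""
    # Assumption: fold case and treat curly apostrophes as straight ones.
    text = text.lower().replace("\u2019", "'")
    tokens = []
    for raw in CANDIDATE.findall(text):
        if raw[0].isdigit():       # drop anything beginning with a number
            continue
        if raw.endswith("'s"):     # possessive: remove the 's
            raw = raw[:-2]
        elif raw.endswith("'d"):   # shortened ending: 'd becomes ed
            raw = raw[:-2] + "ed"
        tokens.append(raw)
    return tokens


print(tokenize("The King\u2019s crown\u2019d head, 3 times; 2nd try."))
# ['the', 'king', 'crowned', 'head', 'times', 'try']
```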
Porter, M. F., 1980, “An algorithm for suffix stripping”, Program, Vol. 14, no. 3, pp. 130-137.
See also: http://en.wikipedia.org/wiki/Stemming
See footnote 1, Johnston.