The Most Pointless Docker Command Ever

What?

This article will show you how you can undo the things Docker does for you in a Docker command. Clearer now?

OK, Docker relies on Linux namespaces to isolate (effectively, copy) parts of the system, so it ends up looking like you are on a separate machine.

For example, when you run a Docker container:

$ docker run -ti busybox ps -a
PID USER COMMAND
 1 root ps -a

it only ‘sees’ its own process IDs. This is because it has its own PID namespace.

Similarly, you have your own network namespace:

$ docker run -ti busybox netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node Path

You also have your own view of inter-process communication and the filesystem.
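
If you want to see these namespaces for yourself, Linux exposes them under /proc. Here is a minimal sketch in Python, assuming a Linux system with /proc mounted; namespace_ids is just a name I made up for this illustration. Run it on the host and again inside a container that has Python available, then compare the identifiers: with a plain docker run, entries such as pid, net and ipc should differ from the host's.

```python
import os


def namespace_ids(pid="self"):
    """Return {namespace name: identifier} for a process, read from /proc.

    Returns an empty dict where /proc/<pid>/ns is unavailable (e.g. not Linux).
    """
    ns_dir = f"/proc/{pid}/ns"
    ids = {}
    try:
        names = sorted(os.listdir(ns_dir))
    except OSError:
        return ids
    for name in names:
        try:
            # Each entry is a symlink whose target looks like "pid:[4026531836]".
            ids[name] = os.readlink(os.path.join(ns_dir, name))
        except OSError:
            # Some entries may be unreadable without extra privileges.
            continue
    return ids


if __name__ == "__main__":
    for name, ident in namespace_ids().items():
        print(f"{name:10} {ident}")
```

Two processes that print the same identifier for an entry share that namespace.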

Go on then

This is possibly the most pointless Docker command ever run, but here goes:

docker run -ti \
    --privileged \
    --net=host --pid=host --ipc=host \
    --volume /:/host \
    busybox \
    chroot /host

The three ‘=host’ flags bypass the network, PID and IPC namespaces. The volume flag mounts the root filesystem of the host to the ‘/host’ folder in the container (you can’t mount to ‘/’ in the container). The privileged flag gives the user full access to the root user’s capabilities.

All we need is the chroot command, so we use a small image (busybox) to chroot to the filesystem we mounted.

What we end up with is a Docker container that is running as root with full capabilities in the host’s filesystem, with full access to the network, process table and IPC constructs on the host. You can even ‘su’ to other users on the host.

If you can think of a legitimate use for this, please drop me a line!

Why?

Because you can!

Also, it’s quite instructive. And starting from this, you can imagine scenarios where you end up with something quite useful.

Imagine you have an image – called ‘filecheck’ – that runs a check on the filesystem for problematic files. Then you could run a command like this (which won’t work BTW – filecheck does not exist):

docker run --workdir /host -v /:/host:ro filecheck

This modified version of the pointless command dispenses with the chroot in favour of changing the workdir to ‘/host’, and – crucially – the mount now uses the ‘:ro’ suffix to mount the host’s filesystem read-only, preventing the image from doing damage to it.

So you can check your host’s filesystem relatively safely without installing anything.
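
filecheck is hypothetical, but to make it concrete, here is a minimal sketch of the kind of check such an image might run. It assumes that ‘problematic’ means world-writable regular files, and that the host's filesystem is mounted at /host as in the command above. The function name and the check itself are my own illustration, not an existing tool.

```python
import os
import stat


def world_writable_files(root):
    """Yield paths of regular files under root that any user may write to."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                # lstat so that symlinks are inspected, not followed.
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # vanished or unreadable: skip it
            if stat.S_ISREG(mode) and mode & stat.S_IWOTH:
                yield path


if __name__ == "__main__":
    # /host is where the docker command above mounts the host's root filesystem.
    for path in world_writable_files("/host"):
        print(path)
```

Because the mount is read-only, a check like this can report on the host's files but cannot change them.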

You can imagine similar network or process checkers running for their namespaces.

Can you think of any other uses for modifications of this pointless command?

This post is based on material from Docker in Practice, available on Manning’s Early Access Program.

 

Docker, ShutIt, and The Perfect 2048 Game

I cured my 2048 addiction by (almost) completing it. Obviously I didn’t get this far without cheating, but I did play every move.

I could have hacked the Javascript or the client-side db, but I actually wanted to see if it was possible to complete the game. So I used the old-fashioned “save game” functionality to keep state.

Here’s how it’s done, and it should work on pretty much anything.

1) Get Docker

See the getting-started guide: https://www.docker.io/gettingstarted/#1

2) Get ShutIt

$ git clone https://github.com/ianmiell/shutit.git

ShutIt is used to build Docker containers in a Lego-like fashion. It’s designed to allow you to string together complex builds of containers in a linear way while maintaining control and reproducibility.

We use it here to build a container with a VNC environment.

3) Build win2048 Image

$ cd shutit

$ # Do some bogus password setup
$ cat > library/win2048/configs/$(hostname)_$(whoami).cnf << END
[container]
password:acontainerpassword
[host]
username:ausername
password:apassword
END
$ # secure config files first
$ find . | grep cnf | xargs chmod 0600

$ cd library/win2048

$ python ../../shutit_main.py --shutit_module_path ../vnc

Wait for it to finish:

[...]
Building: shutit.tk.vnc.vnc with run order: 0.322000000000000008437694987151189707219600677490234375
Completed module: shutit.tk.vnc.vnc
Building: shutit.tk.win2048.win2048 with run order: 0.326000000000000011990408665951690636575222015380859375
Completed module: shutit.tk.win2048.win2048
# BUILD REPORT FOR BUILD END lp01728_imiell_1399663295.26
################################################################################

4) Commit and Tag the Container

Pay attention to the container ID, image ID and username below – they will need to be changed for your run

$ sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
4335e86a64ca ubuntu:12.04 /bin/bash 9 minutes ago Exit 0 sick_franklin
 
$ sudo docker commit 4335e86a64ca
e3a7a9926654c8c28c1252407a4b7ee272cb7fb6ad5640ad1f54e1cacf402bb2

$ sudo docker tag e3a7a9926654c8c28c1252407a4b7ee272cb7fb6ad5640ad1f54e1cacf402bb2 yourusername/mywin2048

5) Run Up the Container

Pay attention to the image name below – it will need to be changed for your run

$ sudo docker run -t -i -p 5901:5901 -p 6080:6080 yourusername/mywin2048 /bin/bash
$ /root/start_win2048.sh

6) Play

(In another terminal)

$ vncviewer localhost:1

Password:

vncpass

7) Save Game

In another terminal:

$ sudo docker tag $(sudo docker commit $(sudo docker ps -a | grep -F -e '5901->5901' | grep -w Up | awk '{print $1}')) yourusername/mywin2048

 

Can you play through to the end and get the perfect 2048 board?


My Favourite Secret Weapon – strace

Why strace?

I’m often asked in my technical troubleshooting job to solve problems that development teams can’t solve. Usually these do not involve knowledge of API calls or syntax, rather some kind of insight into what the right tool to use is, and why and how to use it. Probably because they’re not taught in college, developers are often unaware that these tools exist, which is a shame, as playing with them can give a much deeper understanding of what’s going on and ultimately lead to better code.

My favourite secret weapon in this path to understanding is strace.

strace (or its Solaris equivalents, truss and dtruss) is a tool that tells you which operating system (OS) calls your program is making.

An OS call (or just “system call”) is your program asking the OS to provide some service for it. Since this covers a lot of the things that cause problems not directly to do with the domain of your application development (I/O, finding files, permissions etc) its use has a very high hit rate in resolving problems out of developers’ normal problem space.

Usage Patterns

strace is useful in all sorts of contexts. Here’s a couple of examples garnered from my experience.

My Netcat Server Won’t Start!

Imagine you’re trying to start an executable, but it’s failing silently (no log file, no output at all). You don’t have the source to hand, and even if you did, it might be neither ready to compile nor readily comprehensible.

Simply running it through strace will likely give you clues as to what’s gone on.

$  nc -l localhost 80
nc: Permission denied

Let’s say someone’s trying to run this and doesn’t understand why it’s not working (let’s assume manuals are unavailable).

Simply put strace at the front of your command. Note that the following output has been heavily edited for space reasons (deep breath):

 $ strace nc -l localhost 80
 execve("/bin/nc", ["nc", "-l", "localhost", "80"], [/* 54 vars */]) = 0
 brk(0)                                  = 0x1e7a000
 access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
 mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f751c9c0000
 access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
 open("/usr/local/lib/tls/x86_64/libglib-2.0.so.0", O_RDONLY) = -1 ENOENT (No such file or directory)
 stat("/usr/local/lib/tls/x86_64", 0x7fff5686c240) = -1 ENOENT (No such file or directory)
 [...]
 open("libglib-2.0.so.0", O_RDONLY)      = -1 ENOENT (No such file or directory)
 open("/etc/ld.so.cache", O_RDONLY)      = 3
 fstat(3, {st_mode=S_IFREG|0644, st_size=179820, ...}) = 0
 mmap(NULL, 179820, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f751c994000
 close(3)                                = 0
 access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
 open("/lib/x86_64-linux-gnu/libglib-2.0.so.0", O_RDONLY) = 3
 read(3, "\177ELF\2\1\1\3>\1\320k\1"..., 832) = 832
 fstat(3, {st_mode=S_IFREG|0644, st_size=975080, ...}) = 0
 mmap(NULL, 3072520, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f751c4b3000
 mprotect(0x7f751c5a0000, 2093056, PROT_NONE) = 0
 mmap(0x7f751c79f000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xec000) = 0x7f751c79f000
 mmap(0x7f751c7a1000, 520, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f751c7a1000
 close(3)                                = 0
 open("/usr/local/lib/libc.so.6", O_RDONLY) = -1 ENOENT (No such file or directory)
[...]
 mmap(NULL, 179820, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f751c994000
 close(3)                                = 0
 access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
 open("/lib/x86_64-linux-gnu/libnss_files.so.2", O_RDONLY) = 3
 read(3, "\177ELF\2\1\1\3>\1\20\""..., 832) = 832
 fstat(3, {st_mode=S_IFREG|0644, st_size=51728, ...}) = 0
 mmap(NULL, 2148104, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f751b8b0000
 mprotect(0x7f751b8bc000, 2093056, PROT_NONE) = 0
 mmap(0x7f751babb000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0xb000) = 0x7f751babb000
 close(3)                                = 0
 mprotect(0x7f751babb000, 4096, PROT_READ) = 0
 munmap(0x7f751c994000, 179820)          = 0
 open("/etc/hosts", O_RDONLY|O_CLOEXEC)  = 3
 fcntl(3, F_GETFD)                       = 0x1 (flags FD_CLOEXEC)
 fstat(3, {st_mode=S_IFREG|0644, st_size=315, ...}) = 0
 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f751c9bf000
 read(3, "127.0.0.1\tlocalhost\n127.0.1.1\tal"..., 4096) = 315
 read(3, "", 4096)                       = 0
 close(3)                                = 0
 munmap(0x7f751c9bf000, 4096)            = 0
 open("/etc/gai.conf", O_RDONLY)         = 3
 fstat(3, {st_mode=S_IFREG|0644, st_size=3343, ...}) = 0
 fstat(3, {st_mode=S_IFREG|0644, st_size=3343, ...}) = 0
 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f751c9bf000
 read(3, "# Configuration for getaddrinfo("..., 4096) = 3343
 read(3, "", 4096)                       = 0
 close(3)                                = 0
 munmap(0x7f751c9bf000, 4096)            = 0
 futex(0x7f751c4af460, FUTEX_WAKE_PRIVATE, 2147483647) = 0
 socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
 connect(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
 getsockname(3, {sa_family=AF_INET, sin_port=htons(58567), sin_addr=inet_addr("127.0.0.1")}, [16]) = 0
 close(3)                                = 0
 socket(PF_INET6, SOCK_DGRAM, IPPROTO_IP) = 3
 connect(3, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = 0
 getsockname(3, {sa_family=AF_INET6, sin6_port=htons(42803), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
 close(3)                                = 0
 socket(PF_INET6, SOCK_STREAM, IPPROTO_TCP) = 3
 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
 bind(3, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
 close(3)                                = 0
 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
 bind(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EACCES (Permission denied)
 close(3)                                = 0
 write(2, "nc: ", 4nc: )                     = 4
 write(2, "Permission denied\n", 18Permission denied
 )     = 18
 exit_group(1)                           = ?

To most people that see this flying up their terminal this initially looks like gobbledygook, but it’s really quite easy to parse when a few things are explained.

For each line:

  • the first entry on the left is the system call being performed
  • the bits in the parentheses are the arguments to the system call
  • the right side of the equals sign is the return value of the system call

For example:

open("/etc/gai.conf", O_RDONLY)         = 3

Therefore for this particular line, the system call is open, the arguments are the string /etc/gai.conf and the constant O_RDONLY, and the return value was 3.
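
The same three-part reading can be written down as a few lines of Python. This is only a sketch for illustration: the regular expression and the name parse_strace_line are my own, and it assumes each call sits on a single line in the layout shown above.

```python
import re

# syscall name, then "(" arguments ")", then "=" and the return value
# (plus, on failure, an errno name and message).
STRACE_LINE = re.compile(
    r"^\s*(?P<call>\w+)\((?P<args>.*)\)\s+=\s+(?P<ret>.+?)\s*$"
)


def parse_strace_line(line):
    """Split one strace output line into (call, args, return value), or None."""
    match = STRACE_LINE.match(line)
    if match is None:
        # e.g. "[...]" or a line broken up by the program's own output
        return None
    return match.group("call"), match.group("args"), match.group("ret")


if __name__ == "__main__":
    print(parse_strace_line('open("/etc/gai.conf", O_RDONLY)         = 3'))
```

Lines that don't fit the pattern, such as the write calls whose output is interleaved with the program's own stderr, simply come back as None.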

How to make sense of this?

Some of these system calls can be guessed or enough can be inferred from context. Most readers will figure out that the above line is the attempt to open a file with read-only permission.

In the case of the above failure, we can see that before the program calls exit_group, there are a couple of calls to bind that return “Permission denied”:

 bind(3, {sa_family=AF_INET6, sin6_port=htons(80), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, 28) = -1 EACCES (Permission denied)
 close(3)                                = 0
 socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
 bind(3, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EACCES (Permission denied)
 close(3)                                = 0
 write(2, "nc: ", 4nc: )                     = 4
 write(2, "Permission denied\n", 18Permission denied
 )     = 18
 exit_group(1)                           = ?

We might therefore want to understand what “bind” is and why it might be failing.

You need to get a copy of the system call’s documentation. On Ubuntu and related distributions of Linux, the documentation is in the manpages-dev package, and can be invoked by eg man 2 bind (I just used strace to determine which file man 2 bind opened and then did a dpkg -S to determine from which package it came!). You can also look it up online if you have access, but if you can install via a package manager you’re more likely to get docs that match your installation.

Right there in my man 2 bind page it says:

ERRORS
 EACCES The address is protected, and the user is not the superuser.

So there is the answer – we’re trying to bind to port 80 which, like every port below 1024, can by default only be bound to if you are the superuser.
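
On a long trace, spotting the interesting failures by eye gets tedious. Here is a minimal sketch of one way to filter for them, assuming you saved the trace to a file (for example with strace -o trace.txt nc -l localhost 80). The function name and the choice to ignore ENOENT by default are my own assumptions: the library search above produces a lot of harmless "No such file or directory" lines.

```python
import re

# A call that failed: "name(args) = -1 ERRNO (message)".
FAILED_CALL = re.compile(
    r"^\s*(?P<call>\w+)\(.*\)\s+=\s+-1\s+(?P<errno>E[A-Z0-9]+)\b"
)


def failed_calls(trace_lines, ignore=("ENOENT",)):
    """Return (call, errno) pairs for calls that returned -1, skipping noisy errnos."""
    failures = []
    for line in trace_lines:
        match = FAILED_CALL.match(line)
        if match and match.group("errno") not in ignore:
            failures.append((match.group("call"), match.group("errno")))
    return failures


if __name__ == "__main__":
    import sys

    # Usage: python failed_calls.py trace.txt
    if len(sys.argv) > 1:
        with open(sys.argv[1]) as trace:
            for call, errno_name in failed_calls(trace):
                print(call, errno_name)
```

Run against the nc trace above, this should leave just the two bind calls with EACCES.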


My book, Learn Bash the Hard Way, available at $5:


My Library Is Not Loading!

Imagine a situation where developer A’s Perl script is working fine, but developer B’s identical one is not (again, the output has been edited).

In this case, we strace the script on developer A’s computer to see how it’s working:

$ strace perl a.pl
execve("/usr/bin/perl", ["perl", "a.pl"], [/* 57 vars */]) = 0
brk(0)                                  = 0xa8f000
[...]
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0
fstat(3, {st_mode=S_IFREG|0664, st_size=14, ...}) = 0
rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
brk(0xad1000)                           = 0xad1000
read(3, "use blahlib;\n\n", 4096)       = 14
stat("/space/myperllib/blahlib.pmc", 0x7fffbaf7f3d0) = -1 ENOENT (No such file or directory)
stat("/space/myperllib/blahlib.pm", {st_mode=S_IFREG|0644, st_size=7692, ...}) = 0
open("/space/myperllib/blahlib.pm", O_RDONLY) = 4
ioctl(4, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fffbaf7f090) = -1 ENOTTY (Inappropriate ioctl for device)
[...]
mmap(0x7f4c45ea8000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 5, 0x4000) = 0x7f4c45ea8000
close(5)                                = 0
mprotect(0x7f4c45ea8000, 4096, PROT_READ) = 0
brk(0xb55000)                           = 0xb55000
read(4, "swrite($_[0], $_[1], $_[2], $_[3"..., 4096) = 3596
brk(0xb77000)                           = 0xb77000
read(4, "", 4096)                       = 0
close(4)                                = 0
read(3, "", 4096)                       = 0
close(3)                                = 0
exit_group(0)                           = ?

We observe that the file is found in what looks like an unusual place.

open("/space/myperllib/blahlib.pm", O_RDONLY) = 4

Inspecting the environment, we see that:

$ env | grep myperl
PERL5LIB=/space/myperllib

So the solution is to set the same env variable before running:

export PERL5LIB=/space/myperllib
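
If you have traces from both machines (say, saved with strace -o on each), comparing the files each run managed to open is a quick way to find a difference like this one. The following is a minimal sketch under that assumption; opened_paths is a made-up name, and the pattern only handles plain open/openat lines like the ones shown above.

```python
import re

# A successful open()/openat(): a quoted path and a non-negative return value.
OPENED = re.compile(
    r'^\s*open(?:at)?\((?:[^"]*,\s*)?"(?P<path>[^"]+)".*\)\s+=\s+\d+\s*$'
)


def opened_paths(trace_lines):
    """Return the set of paths that open()/openat() opened successfully."""
    paths = set()
    for line in trace_lines:
        match = OPENED.match(line)
        if match:
            paths.add(match.group("path"))
    return paths


def only_in_first(trace_a, trace_b):
    """Paths opened successfully in trace_a but not in trace_b, sorted."""
    return sorted(opened_paths(trace_a) - opened_paths(trace_b))
```

Anything that only one run opened, such as a library under /space/myperllib, is a good place to start looking.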

Get to know the internals bit by bit

If you do this a lot, or idly run strace on various commands and peruse the output, you can learn all sorts of things about the internals of your OS. If you’re like me, this is a great way to learn how things work. For example, just now I’ve had a look at the file /etc/gai.conf, which I’d never come across before writing this.

Once your interest has been piqued, I recommend getting a copy of “Advanced Programming in the UNIX Environment” by Stevens & Rago, and reading it cover to cover. Not all of it will go in, but as you use strace more and more, and (hopefully) browse C code more and more, understanding will grow.

Gotchas

If you’re running a program that calls other programs, it’s important to run with the -f flag, which “follows” child processes and straces them. -ff, used together with -o, writes each process’s trace to a separate file with the pid suffixed to the name.

If you’re on Solaris, this program doesn’t exist – you need to use truss instead.

Many production environments will not have this program installed for security reasons. strace doesn’t have many library dependencies (on my machine it has the same dependencies as ‘echo’), so if you have permission (or are feeling sneaky), you can just copy the executable up.

Other useful tidbits

You can attach to running processes (can be handy if your program appears to hang or the issue is not readily reproducible) with -p.

If you’re looking at performance issues, then the time flags (-t, -tt, -ttt, and -T) can help significantly.

 



Shakespeare’s Vocabulary Considered Unexceptional

Summary

Shakespeare’s vocabulary is held to be extraordinary among writers. Its relative enormity is unquestioned in the popular and academic literature, bolstered by – and reflexively reaffirming – the peculiar status Shakespeare holds within our culture. 1

I wrote a few simple programs to analyse his works alongside those of other writers with corpuses of similar size, to see how they compared.

What I discovered suggests that Shakespeare’s vocabulary, while far from small, is far from extraordinary among writers when size of corpus is taken into account.

Results

What I found is expressed graphically below.

Figure 1 shows the relative size of corpus, number of unique word tokens, and number of unique word stems for various authors’ available works.

Figure 1: Calculated vocabulary of authors from corpus

Each writer is represented by three numbers.

The first column is the size of the corpus examined (ie how many “real” word tokens there were in the texts I gathered together), divided by 20 so that the numbers are comparable to the other data points. The “corpus” for each writer was the works I could download from the Project Gutenberg website. (Note that Joyce’s corpus here does not include Finnegans Wake). 2

The second column is the number of unique tokens found in the above corpus. 3

The third column is the size of the writer’s vocabulary, measured as the number of unique stemmed word tokens in the corpus. A stemmed word is a “root” form of a word that may have several distinct relations in other word tokens. A stemming algorithm reduces the words “fishing”, “fished”, “fish”, and “fisher” to the root word, “fish”. To determine the word stems I used a Porter stemming algorithm implemented in Perl, freely available on the web. 4
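
To make the three numbers concrete, here is a minimal sketch in Python of how they could be computed for a piece of text. It is not the program used for the analysis: the tokenizer only approximates the rules in footnote 3, and crude_stem is a toy suffix-stripper standing in for the Porter algorithm, so its counts will not match the figures reported here. All function names are made up for this illustration.

```python
import re


def tokenize(text):
    """Lower-case word tokens; an approximation of the rules in footnote 3."""
    text = text.lower().replace("’", "'")
    text = re.sub(r"'s\b", "", text)    # drop possessives: "king's" -> "king"
    text = re.sub(r"'d\b", "ed", text)  # expand elisions: "banish'd" -> "banished"
    # Runs of letters (apostrophes allowed inside) that do not follow a digit.
    return re.findall(r"\b[a-z]+(?:'[a-z]+)*", text)


SUFFIXES = ("ing", "ed", "er", "s")


def crude_stem(word):
    """Strip one common suffix; a toy stand-in, NOT the Porter algorithm."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def vocabulary_stats(text):
    """Corpus size, unique tokens and unique stems for a piece of text."""
    tokens = tokenize(text)
    return {
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "unique_stems": len({crude_stem(token) for token in tokens}),
    }
```

Fed the four “fish” words from the paragraph above, this reports four unique tokens but a single stem, which is the distinction the second and third columns are drawing.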

Analysis

The first thing to acknowledge is that Shakespeare’s vocabulary is larger than that of some other notable writers with similar-sized corpuses. It’s significantly larger, for example, than that of Dickens or Richardson.

At first glance it appears that Shakespeare’s vocabulary was markedly larger than Marlowe’s. However, taking a similar sized corpus of Shakespeare’s younger works, you can see that the vocabulary size for these works is almost identical to Marlowe’s. To test the hypothesis that Shakespeare’s vocabulary grew as he got older, a similar-sized corpus of his later works was examined. Again, the results were very similar.

Melville, with fewer words in Moby Dick than in the younger Shakespeare’s corpus, displays a greater vocabulary than either the younger Shakespeare or Marlowe.

Milton is often cited as having a smaller vocabulary than Shakespeare, but this is also not borne out by the analysis. In fact, given the relatively small size of his available corpus, his vocabulary is very large indeed. 5

Hardy – with a similar sized corpus – also shows a vocabulary not dissimilar to Shakespeare’s. Far more unique words than any other writer, even given his smaller corpus, and the only writer in the study with more than 20,000 stemmed words.

The vocabulary king among writers is Joyce, whose vocabulary towers over Shakespeare’s (Finnegans Wake was not included) even with a significantly smaller corpus.

Shakespeare’s vocabulary might be reduced further if we took out place and other names from his works, and removed the variant spellings more common in the era before standardized spelling.

Conclusion

The myth of Shakespeare’s unusually large vocabulary suggests that our view of Shakespeare has been warped by our veneration of his work. Rather than see him as an unusually successful writer whose works have remained popular over centuries, we have tried to make his literary abilities seem extraordinary too. Shakespeare is also said to have invented many words. Is this a myth too?

Does the size of a writer’s vocabulary matter? Isn’t it even more impressive that he managed to do so much with nothing more than the tools other writers possess?

It may also be worth researching further whether this analysis indicates that there is a difference in vocabulary size exhibited between playwrights (eg Shakespeare, Marlowe), poets (eg Milton, Shakespeare, Marlowe(?)) and novelists (eg Richardson and Dickens), or even whether our intuitive understanding of the categories is aligned with writers’ displayed vocabularies.

Footnotes

1
“However, the single most remarkable feature about Shakespeare’s poetic language is his extraordinary vocabulary, his choice of particular words to convey particular emotional attitudes. Earlier I have had occasion to note that Shakespeare’s working vocabulary is enormous (about 25,000 words, more than twice as many as his nearest rival, John Milton)” Ian Johnston, “Studies in Shakespeare: Some Observations on Shakespeare’s Dramatic Verse in Richard III and Macbeth”, 1999, http://records.viu.ca/~johnstoi/eng366/lectures/poetry.htm

“Critics have long recognized that Shakespeare had an unusually large mental lexicon that was perhaps organized around particularly strong image-based mental models. […] Shakespeare’s almost uniquely rich use of language.” M. T. Crane, Shakespeare’s Brain: Reading with Cognitive Theory (Princeton NJ: Princeton University Press, 2000), 24

G. L. Brook, The Language of Shakespeare (London: Andre Deutsch, 1976), pp. 26-64

S. S. Hussey, The Literary Language of Shakespeare (New York: Longman, 1982), pp. 37-60

2
Works used in analysis:

Charles Dickens: “A Christmas Carol”, “Bleak House”, “Barnaby Rudge”, “David Copperfield”

Samuel Johnson: “Grammar of the English Tongue”, “Lives of the English Poets: Prior, Congreve, Blackmore, Pope”, “Notes to Shakespeare, Volume III: The Tragedies”, “Johnson’s Notes to Shakespeare Vol. I Comedies”, “Prefaces and Prologues to Famous Books”, “Preface to a Dictionary of the English Language”, “Preface to Shakespeare”

Thomas Hardy: “A Pair of Blue Eyes”, “The Mayor of Casterbridge”, “The Return of the Native”, “Tess of the D’urbervilles”, “Jude the Obscure”, “Far from the Madding Crowd”, “Return of the Native”

George Eliot: “Middlemarch”

Henry James: “The Bostonians”, “The Portrait of a Lady”, “The Wings of the Dove”

James Joyce: “Dubliners”, “Ulysses”, “A Portrait of the Artist as a Young Man”

Christopher Marlowe: “Various minor poems”, “Dido, Queen of Carthage”, “Dr Faustus”, “Edward II”, “The Jew of Malta”, “Massacre at Paris”, “Tamburlaine the Great (part i, ii)”

Herman Melville: “Moby Dick”

John Milton: “Areopagitica”, “Milton’s Comus”, “Minor Poems by Milton”, “Paradise Lost”, “Paradise Regained”

Samuel Richardson: “Clarissa”

Shakespeare: “The Sonnets”, “A Lover’s Complaint”, “All’s Well That Ends Well”, “Antony and Cleopatra”, “As You Like It”, “The Comedy of Errors”, “Coriolanus”, “Cymbeline”, “Hamlet”, “Henry IV (parts i, ii)”, “Henry V”, “Henry VI (parts i, ii, iii)”, “Henry VIII”, “King John”, “Julius Caesar”, “King Lear”, “Love’s Labour’s Lost”, “Macbeth”, “The Merchant of Venice”, “Measure for Measure”, “The Merry Wives of Windsor”, “A Midsummer Night’s Dream”, “Much Ado About Nothing”, “Othello”, “Richard II”, “Richard III”, “Romeo and Juliet”, “The Taming of the Shrew”, “The Tempest”, “Timon of Athens”, “Titus Andronicus”, “Troilus and Cressida”, “Twelfth Night”, “The Two Gentlemen of Verona”, “The Winter’s Tale”

Shakespeare Younger: “The Comedy of Errors”, “Henry VI (parts i, ii, iii)”, “King John”, “Richard III”, “Taming of the Shrew”, “Titus Andronicus”, “Twelfth Night”, “Love’s Labour’s Lost”, “Romeo and Juliet”

Shakespeare Older: “The Sonnets”, “Cymbeline”, “Hamlet”, “Henry VIII”, “King Lear”, “Macbeth”, “Measure for Measure”, “Othello”, “The Tempest”, “Timon of Athens”, “The Winter’s Tale”, “A Lover’s Complaint”

Burton’s “Anatomy of Melancholy” was also analysed, but contains a great deal of Latin text interspersed, making his vocabulary anomalously large.

3
A “word” is any set of contiguous non-space, non-punctuation characters or punctuation not beginning with a number. Possessives (“.*’s”) were removed. Words shortened with “.*’d” have been replaced with “ed”.

4
Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, no. 3, pp 130-137.
See also: http://en.wikipedia.org/wiki/Stemming

5
See footnote 1, Johnston.