The Runbooks Project

Previously, in 2017, I wrote about Things I Learned Managing Site Reliability for Some of the World’s Busiest Gambling Sites. A lot of it focussed on runbooks, or checklists, or whatever you want to call them (we called them Incident Models, after ITIL).

It got a lot of hits (mostly from HackerNews), and privately quite a few people reached out to me to ask for advice on embedding similar. It even got name-checked in a Google SRE book.

Since then, I’ve learned a few more things about trying to get operational teams to follow best practice by writing and maintaining runbooks, so this is partly an update of that.

All these experiences have led me to help initiate a public Runbooks project to try and collect and publish similar efforts and reduce wasted effort across the industry.

tl;dr

We’ve set up a public Runbooks project to expose our private runbooks to the world.

We’re looking for contributions. Do you have any runbooks lying around that could benefit from being honed by many eyes? The GitHub repo is here if you want to get involved, or contact me on Twitter.

Back to the lessons learned.

Things I Learned Since Things I Learned

The Logic is Inarguable, the Practice is Hard

I already talked about this in the previous post, but every subsequent attempt I made to get a practice of writing runbooks going was hard going. No-one ever argues with the logic of efficiency and saved time, but when it comes to putting the barn up, pretty much everyone is too busy with something else to help.

In summary, you can’t tell people anything. You have to show them, get them to experience it, or incentivise them to work on it.

Some combination of these four things is required:

  • Line-management/influence/control to encourage/force the right behaviours
  • A critical mass of material to demonstrate value
  • Resources allocated to sustain the effort
  • A process for maintaining the material and ensuring it remains relevant

With a prevailing wind, you can get away with less in one area, but these are the critical factors that seem to need to be in place to actually get results.

A Powerful External Force Is Often Needed

Looking at the history of these kinds of efforts, it seems that people need to be forced – against their own natures – into following these best practices that invest current effort for future operational benefit.

Examples from The Checklist Manifesto included:

  • Boeing and checklists (“planes are falling from the sky – no matter how good the pilots!”)
  • Construction and standard project plans (“falling buildings are unacceptable, we need a set of build patterns to follow and standards to enforce”)
  • Medicine and ‘pre-flight checklists’ (“we’re getting sued every time a surgeon makes a mistake, how can we reduce these?”)

In the case of my previous post, it was frustration for me at being on-call that led me to spend months writing up runbooks. The main motivation that kept me going was that it would be (as a minimal positive outcome) for my own benefit. This intrinsic motivation got the ball rolling, and the effort was then sustained and developed by the other three more process-oriented factors.

There’s a commonly-seen pattern here:

  • you need some kind of spontaneous intrinsic motivation to get something going and snowball, and then
  • a bureaucratic machine behind it to sustain it

If you crack how to do that reliably, then you’re going to be pretty good at building businesses.

It Doesn’t Always Help

That wasn’t the only experience I had trying to spread what I thought was good practice. In other contexts, I learned, the application of these methods was unhelpful.

In my next job, I worked on a new, centralised, and fast-changing system in a large org, and tried to write helpful docs so that we didn’t have to keep solving the same issues over and over. Aside from the authority and ‘critical mass’ problems outlined above, I hit a further one: the system was changing too fast for the learnings to be that useful. Bugs were being fixed quickly (putting my docs out of date similarly quickly) and new functionality was being added, leading to substantial wasted effort and reduced benefit.

Discussing this with a friend, I was pointed at an existing framework called Cynefin, which classifies these differences of context and suggests an appropriate response to each. Through that lens, my mistake had been to try and impose what might be best practice in a ‘Complicated’/‘Clear’ context on a context that was ‘Chaotic’/‘Complex’. ‘Chaotic’ situations are too novel or under-explored to be susceptible to standard processes. Fast action and equally fast evaluation of system response is required to build up practical experience and prepare the way for later stabilisation.

‘Why Don’t You Just Automate It?’

I get this a lot. It’s an argument that gets my goat, for several reasons.

Runbooks are a useful first step to an automated solution

If a runbook is mature and covers its ground well, it serves as an almost perfect design document for any subsequent automation solution. So it’s in itself a useful precursor to automation for any non-trivial problem.

Automation is difficult and expensive

It is never free. It requires maintenance. There are always corner cases that you may not have considered. It’s much easier to write: ‘go upstairs’ than build a robot that climbs stairs.

Automation tends to be context-specific

If you have a wide-ranging set of contexts for your problem space, then a runbook provides the flexibility to be applied in any of these contexts when paired with a human mind. For example: your shell script solution will need to reliably cater for all these contexts to be useful; not every org can use your Ansible recipe; not every network can access the internet.

Automation is not always practicable

In many situations, changing or releasing software to automate a solution is outside your control or influence.

A Public Runbooks Project

All my thoughts on this subject so far have been predicated on writing proprietary runbooks that are consumed and maintained within an organisation.

What I never considered was gaining the critical mass needed by open sourcing runbooks, and asking others to donate theirs so we can all benefit from each others’ experiences.

So we at Container Solutions have decided to open source the runbooks we have built up that are generally applicable to the community. They are growing all the time, and we will continue to add to them.

Call for Runbooks

We can’t do this alone, so are asking for your help!

  • If you have any runbooks that you can donate to the cause lying around in your wikis, please send them in
  • If you want to write a new runbook, let us know
  • If you want to request a runbook on a particular subject, suggest it

However you want to help, you can either raise a PR or an issue, or contact me directly.

Some Relatively Obscure Bash Tips

Following on from previous posts on bash, here are some more bash tips that are relatively obscure, or rarely seen, but still worth knowing about.

1) Mid-Command Comments

Usually when I want to put a comment next to a shell command I put it at the end, like this:

echo some command # This echoes some output

But until recently I had no idea that you could embed comments within a chain of commands using the colon operator:

echo before && : this  && echo after

Combined with subshells, this means you can annotate things really neatly, like this:

(echo banana; : IF YOU ARE COPYING \
  THIS FROM STACKOVERFLOW BE WARNED \
  THIS IS DANGEROUS) | tr 'b' 'm'
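One caveat is worth a quick sketch: the colon builtin ignores its arguments, but bash still expands them before running it, so a ‘comment’ written this way can have side effects. The example below is illustrative only (the variable names are made up), and also shows that an ordinary # comment after a pipe is a side-effect-free alternative.

```bash
#!/bin/bash
# ':' does nothing, but bash still expands its arguments before running it.
unset counter
: this looks like a comment "${counter:=5}"   # the := expansion assigns counter
echo "counter is now ${counter}"

# For pure annotation, an ordinary '#' comment after a pipe has no side effects.
first=$(
  printf '%s\n' banana apple cherry |  # emit three lines
    sort |                             # sort them alphabetically
    head -n 1                          # keep only the first
)
echo "${first}"
```

So keep things like $(...), backticks, and redirections out of any text you pass to the colon.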

2) |&

You may already be familiar with 2>&1, which redirects standard error to standard output, but until I stumbled on it in the manual, I had no idea that you can pipe both standard output and standard error into the next stage of the pipeline like this:

if doesnotexist |& grep 'command not found' >/dev/null
then
echo oops
fi
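To see the difference from a plain pipe, here is a minimal sketch using a made-up emit function that writes one line to each stream. It assumes bash 4 or later, which is where |& is available.

```bash
#!/bin/bash
# A hypothetical helper that writes one line to each stream.
emit() {
  echo "to stdout"
  echo "to stderr" >&2
}

# A plain pipe only carries standard output to the next stage
# (standard error is discarded here to keep the terminal quiet).
plain_count=$(emit 2>/dev/null | wc -l)

# '|&' is shorthand for '2>&1 |', so both lines reach wc.
both_count=$(emit |& wc -l)
longhand_count=$(emit 2>&1 | wc -l)

echo "plain: $((plain_count)), |&: $((both_count)), 2>&1 |: $((longhand_count))"
```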

3) $''

This construct allows you to specify specific bytes in scripts without fear of triggering some kind of encoding problem. Here’s a command that recursively greps through files looking for UK currency (‘£’) signs, with the character’s two UTF-8 bytes given in hexadecimal:

grep -r $'\xc2\xa3' *

You can also use octal:

grep -r $'\302\243' *
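If you want to convince yourself that the two forms really are the same bytes, a small sketch like this will do it (the variable names are just for illustration):

```bash
#!/bin/bash
# The hex and octal escapes describe the same two bytes (UTF-8 for '£').
pound_hex=$'\xc2\xa3'
pound_oct=$'\302\243'

# od shows the raw bytes regardless of the terminal's encoding.
bytes=$(printf '%s' "${pound_hex}" | od -An -tx1 | tr -d ' \n')
echo "bytes: ${bytes}"

# $'' also handles escapes such as tab and newline.
printf '%s\n' $'col1\tcol2'
```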

4) HISTIGNORE

If you are concerned about security, and ever type in commands that might have sensitive data in them, then this one may be of use.

This shell variable keeps the commands you specify out of your history when you type them in. The patterns are separated by colons:

HISTIGNORE="ls *:man *:history:clear:AWS_KEY*"

You have to specify the whole line, so a glob character may be needed if you want to exclude commands and their arguments or flags.


If you like this, you might like one of my books:
Learn Bash the Hard Way

Learn Git the Hard Way
Learn Terraform the Hard Way

LearnGitBashandTerraformtheHardWay
Buy in a bundle here

5) fc

If readline key bindings aren’t under your fingers, then this one may come in handy.

It calls up the last command you ran, and places it into your preferred editor (specified by the EDITOR variable). Once edited, it re-runs the command.

6) ((i++))

If you can’t be bothered with faffing around with variables in bash with the $[] construct, you can use the C-style compound command.

So, instead of:

A=1
A=$[$A+1]
echo $A

you can do:

A=1
((A++))
echo $A

which, especially with more complex calculations, might be easier on the eye.
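Here is a slightly fuller sketch of what that looks like with more involved expressions, using $(( )) (the replacement for the deprecated $[] form) alongside (( )). The variable names are made up. It also shows one gotcha: (( )) returns a non-zero exit status when the expression evaluates to 0, which matters if your script runs under set -e.

```bash
#!/bin/bash
# $(( )) yields the value of an expression; (( )) evaluates one as a command.
width=7
height=6
area=$(( width * height ))          # no '$' needed on names inside
(( area += 2 ** 3 ))                # compound assignment and exponentiation
echo "area=${area}"

# (( )) sets its exit status from the result, so it works as a condition.
if (( area > 40 && area % 2 == 0 )); then
  verdict="large and even"
else
  verdict="something else"
fi
echo "${verdict}"

# Gotcha: a result of 0 gives a non-zero exit status. Post-increment from 0
# evaluates to 0, which would abort a script running under 'set -e'.
count=0
(( count++ )) || status_after_first=$?
(( count++ )) && status_after_second=$?
echo "count=${count} first=${status_after_first} second=${status_after_second}"
```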

7) caller

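Another bash builtin, caller reports the context of an active function call (or sourced script): its line number, function name, and source file.

SHLVL is a related shell variable which gives the nesting depth of the shell itself: it goes up by one each time a new bash instance is started. The function below uses it as the starting point for numbering the frames.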

This can be used to create stack traces for more complex bash scripts.

Here’s a die function, adapted from the bash hackers’ wiki that gives a stack trace up through the calling frames:

#!/bin/bash
die() {
  local frame=0
  ((FRAMELEVEL=SHLVL - frame))
  echo -n "${FRAMELEVEL}: "
  while caller $frame; do
    ((frame++));
    ((FRAMELEVEL=SHLVL - frame))
    if [[ ${FRAMELEVEL} -gt -1 ]]
    then
      echo -n "${FRAMELEVEL}: "
    fi
  done
  echo "$*"
  exit 1
}

which outputs:

3: 17 f1 ./caller.sh
2: 18 f2 ./caller.sh
1: 19 f3 ./caller.sh
0: 20 main ./caller.sh
*** an error occured ***
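If you just want to see caller in action without the exit, here is a cut-down sketch. It is not the wiki’s version: stack_trace, inner, and outer are names invented for this example, and it only prints the raw frames that caller reports.

```bash
#!/bin/bash
# Print one line per active frame: "<line> <function> <file>".
stack_trace() {
  local frame=0
  while caller "${frame}"; do
    frame=$(( frame + 1 ))
  done
}

# Hypothetical call chain, for illustration only.
inner() { stack_trace; }
outer() { inner; }

trace=$(outer)
echo "${trace}"
```

The loop stops as soon as caller returns a non-zero status, which is how it signals that there are no more frames.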

8) /dev/tcp/host/port

This one can be particularly handy if you find yourself on a container running within a Kubernetes cluster service mesh without any network tools (a frustratingly common experience).

Bash provides you with some virtual files which, when referenced, can create socket connections to other servers.

This snippet, for example, makes a web request to a site and returns the output.

exec 9<>/dev/tcp/brvtsdflnxhkzcmw.neverssl.com/80
echo -e "GET /online HTTP/1.1\r\nHost: brvtsdflnxhkzcmw.neverssl.com\r\nConnection: close\r\n\r\n" >&9
cat <&9

The first line opens up file descriptor 9 to the host brvtsdflnxhkzcmw.neverssl.com on port 80 for reading and writing. Line two sends the raw HTTP request to that socket connection’s file descriptor; the Connection: close header asks the server to close the connection once it has responded, so that the final cat does not sit waiting on a kept-alive socket. The final line retrieves the response.

Obviously, this doesn’t handle SSL for you, so its use is limited now that pretty much everyone is running on https, but when running from application containers within a service mesh can still prove invaluable, as requests there are initiated using HTTP.
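The same trick gives you a poor man’s port checker when there is no nc or telnet to hand. This is a rough sketch rather than a hardened tool: port_open is a made-up helper, and the example assumes nothing is listening on port 1 of localhost.

```bash
#!/bin/bash
# Return 0 if a TCP connection to host:port can be opened, non-zero otherwise.
# The connection is attempted in a subshell so the descriptor closes on exit.
port_open() {
  local host=$1 port=$2
  (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

# Port 1 on localhost is assumed to be closed here, purely for illustration.
if port_open 127.0.0.1 1; then
  result="open"
else
  result="closed"
fi
echo "127.0.0.1:1 is ${result}"
```

Bear in mind that a port which silently drops packets (rather than refusing the connection) can leave this waiting for the TCP timeout, so you may want to wrap the call in timeout where that command is available.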

9) Co-processes

Since version 4, bash has offered the capability to run named coprocesses.

It seems to be particularly well-suited to managing the inputs and outputs to other processes in a fine-grained way. Here’s an annotated and trivial example:

coproc testproc (
  i=1
  while true
  do
    echo "iteration:${i}"
    ((i++))
    read -r aline
    echo "${aline}"
  done
)

This sets up the coprocess as a subshell with the name testproc.

Within the subshell, there’s a never-ending while loop that counts its own iterations with the i variable. It outputs two lines: the iteration number, and a line read in from standard input.

After creating the coprocess, bash sets up an array with that name holding two file descriptor numbers: index 0 is connected to the coprocess’s standard output, and index 1 to its standard input. So this:

echo "${testproc[@]}"

in my terminal outputs:

63 60

Bash also sets up a variable with the process identifier for the coprocess, which you can see by echoing it:

echo "${testproc_PID}"

You can now input data to the standard input of this coprocess at will like this:

echo input1 >&"${testproc[1]}"

In this case, the command resolves to: echo input1 >&60, and the >&[INTEGER] construct ensures the redirection goes to the coprocess’s standard input.

Now you can read the output of the coprocess’s two lines in a similar way, like this:

read -r output1a <&"${testproc[0]}"
read -r output1b <&"${testproc[0]}"

You might use this to create an expect-like script if you were so inclined, but it could be generally useful if you want to manage inputs and outputs. Named pipes are another way to achieve a similar result.

Here’s a complete listing for those who want to cut and paste:

#!/bin/bash
coproc testproc (
  i=1
  while true
  do
    echo "iteration:${i}"
    ((i++))
    read -r aline
    echo "${aline}"
  done
)
echo "${testproc[@]}"
echo "${testproc_PID}"
echo input1 >&"${testproc[1]}"
read -r output1a <&"${testproc[0]}"
read -r output1b <&"${testproc[0]}"
echo "${output1a}"
echo "${output1b}"
echo input2 >&"${testproc[1]}"
read -r output2a <&"${testproc[0]}"
read -r output2b <&"${testproc[0]}"
echo "${output2a}"
echo "${output2b}"


Riding the Tiger: Lessons Learned Implementing Istio

Recently I (along with a few others much smarter than me) had occasion to implement a ‘real’ production system with Istio, running on a managed cloud-provided Kubernetes service.

Istio has a reputation for being difficult to build with and administer, but I haven’t read many war stories about trying to make it work, so I thought it might be useful to actually write about what it’s like in the trenches for a ‘typical’ team trying to implement this stuff. The intention is very much not to bury Istio, but to praise it (it does so much that is useful/needed for ‘real’ Kubernetes clusters – skip to the end if impatient) while warning those about to step into the breach what comes if you’re not prepared.

In short, I wish I’d found an article like this before we embarked on our ‘journey’.

None of us were experienced implementers of Istio when combined with other technologies. Most of us had about half a year’s experience working with Kubernetes and had spun up vanilla Istio more than a few times on throwaway clusters as part of our research.

1) The Number Of People Doing This Feels Really Small

Whenever we hit up against a wall of confusion, uncertainty, or misunderstanding, we reached out to expertise in the usual local/friendly/distant escalation path.

The ‘local’ path was the people on the project. The ‘friendly’ path was people in the community we knew to be Istio experts (talks given at Kubecon and the like). One such expert admitted to us that they used Linkerd ‘until they absolutely needed Istio for something’, which was a surprise to us. The ‘distant’ path was mostly the Istio forum and the Istio Slack channel.

Whenever we reached out beyond each other we were struck by how few people out there seemed to be doing what we were doing.

‘What we were doing’ was trying to make Istio work with:

  • applications that may not have conformed to the purest ideals of Kubernetes
  • a strict set of network policies (Calico global DENY-ALL)
  • a monitoring stack we could actually configure to our needs without just accepting the ‘non-production ready’ defaults

Maybe we were idiots who couldn’t configure our way out of a paper bag, but it felt that, beyond doing 101 guides or accepting the defaults, there just wasn’t that much prior art out there.

Eventually we got everything to work the way we wanted, but we burned up significant developer time in the process, and nearly abandoned our efforts more than once on the way.

2) If You Go Off The Beaten Path, Prepare For Pain

Buoyed by our success running small experiments by following blogs and docs, we optimistically tried to leap to get everything to work at the same time. Fortunately, we ran strict pipelines with a fully GitOps’d workflow which meant there were vanishingly few ‘works on my cluster’ problems to slow us down (if you’re not doing that, then do so, stat. It doesn’t have to be super-sophisticated to be super-useful).

A great example of this was monitoring. If you just read the literature, then setting up a monitoring stack is a breeze. Run the default Istio install on a bare server, and everything comes for free. Great. However, we made the mistake of thinking that this meant fiddling with this stack for our own ends would be relatively easy.

First, we tried to make this work with a strict mTLS mode (which is not the default, for very good reason). Then we tried to make the monitoring stack run in a separate namespace. Then we tried to make it all work with a strict global network policy of DENY-ALL. All three of these things caused enormous confusion when it didn’t ‘just work’, and chewed up weeks of engineering time to sort out.

The conclusion: don’t underestimate how hard it will be to make changes you might want to make to the defaults when using Istio alongside other Kubernetes technologies. Better to start simple, and work your way out to build a fuller mental model that will serve you better for the future.



3) Build Up A Good Mental Glossary

Istio has a lot of terms that are overloaded in other related contexts. Terms commonly used, like ‘cluster’, or ‘registry’ may have very specific meanings or significance depending on context. This is not a disease peculiar to Istio, but the denseness of the documentation and the number of concepts that must be embedded in your understanding before you can parse it fluently make it especially acute here.

We spent large amounts of time interpreting passages of the docs, like theologians arguing over Dead Sea Scrolls (“but cluster here means the mesh”, “no, it means ingress to the kubernetes cluster”, “that’s a virtual service, not a service”, “a headless service is completely different from a normal service!”).

Here’s a passage picked more or less at random:


An ingress Gateway describes a load balancer operating at the edge of the mesh that receives incoming HTTP/TCP connections. It configures exposed ports, protocols, etc. but, unlike Kubernetes Ingress Resources, does not include any traffic routing configuration. Traffic routing for ingress traffic is instead configured using Istio routing rules, exactly in the same way as for internal service requests.

I can read that now and pretty much grok what it’s trying to tell me in real time. Bully for me. Now imagine sending that to someone not fully proficient in networking, Kubernetes, and Istio in an effort to get them to help you figure something out. As someone on the project put it to me: ‘The Istio docs are great… if you are already an Istio developer.’

As an aside, it’s a good idea to spend some time familiarising yourself with the structure of the docs, as it very quickly becomes maddening to try and orient yourself: ‘Where the hell was that article about Ingress TLS I saw earlier? Was it in Concepts, Setup, Tasks, Examples, Operation, or Reference?’


4) It Changes Fast

While working on Istio, we discovered that things we couldn’t do in one release started working in another, while we were debugging it.

While we’re on the subject, take upgrades seriously too: we did an innocuous-looking upgrade, and a few things broke, taking the system out for a day. The information was all there, but was easy to skip over in the release notes.

You get what you pay for, folks, so this should be expected from such a fundamental set of software components (Istio) running within a much bigger one (Kubernetes)!

5) Focus On Working On Your Debug Muscles

If you know you’re going to be working on Istio to any serious degree, take time out whenever possible to build up your debug skills on it.

Unfortunately the documentation is a little scattered, so here are some links we came across that might help:

https://istio.io/docs/ops/diagnostic-tools/

https://github.com/istio/istio/wiki/Troubleshooting-Istio

https://github.com/istio/istio/wiki/Analyzing-Istio-Performance

6) When It All Works, It’s Great

When you’re lost in the labyrinth, it’s easy to forget what power and benefits Istio can bring you.

Ingress and egress control, mutual TLS ‘for free’, a jump-start to observability, traffic shaping… the list goes on. Its feature set is unparalleled, it’s already got mindshare, and it’s well-funded. We didn’t want to drop it because all these features solved a myriad of requirements in one fell swoop. These requirements were not that recherché, so this is why I think Istio is not going away anytime soon.

The real lesson here is that you can’t hide from the complexity of managing all this functionality and expect to be able to manipulate and control it at will without any hard work.

Budget for it accordingly, and invest the time and effort needed to benefit from riding the tiger.

The Astonishing Prescience of Nam June Paik

I hadn’t heard of Nam June Paik until I went to his exhibition at Tate Modern a few weeks ago.

I left feeling sheepish that I hadn’t heard of him before I went. I knew a bit about Warhol and Duchamp, but had no idea there was another artist so far ahead of his time working in the US during the 60s and 70s.

Not only was his work moving, thoughtful, and provocative, it was breathtakingly far-sighted. He seems to have understood the implications of the technical changes happening in the last century far more clearly than any other science fiction writer or artist I’m aware of.

Here are some examples of how he saw further than others.

1) He Might Have Invented Music Sampling

In 1959, several decades before Grandmaster Flash, Paik spliced together different sounds to create a fresh musical work.

He originally studied music history, and gave up on it after meeting John Cage around this time, as he figured Cage already had avant-garde music covered.

EDIT: But see: Musique Concrete

2) He Invented Video Art

An early adopter of all technology, he bought a Portapak as soon as it was released in 1964, and presented the first Video Art on the same day, at the Cafe Au Go Go in Greenwich Village, New York.

3) He predicted the Internet… in the early 70s

Just read how he struggles to find the language in 1974 to get across his vision:

“TV will gain many branches… Picture-Phone, tele-facsimilie, two way inter-active TV for shopping, library research, opinion polling, health consultation, inter-office data transmission and … 1001 new applications … a new nuclear energy in information and society-building, which I would call tenatively
‘BROADBAND COMMUNICATION NETWORK’.”
Nam June Paik, 1974

In 1973, he saw that ‘point to point communication’ was a ‘Copernican change’.

And he coined the phrase ‘Electronic Super-Highway’ in 1974…


“The building of new electronic super highways will become an even huger enterprise. Assuming we connect New York with Los Angeles by means of an electronic telecommunication network that operates in strong transmission ranges, as well as with continental satellites, wave guides, bundled coaxial cable, and later also via laser beam fiber optics: the expenditure would be about the same as for a Moon landing, except that the benefits in term of by-products would be greater.”

Nam June Paik, Media Planning for the Postindustrial Society – The 21st Century is now only 26 years away (1974)

4) He Might Have Invented Digital Film at Bell Labs

Paik spent time with the engineering elite at Bell Labs in 1966 and learned FORTRAN to produce a small animation of a dot moving around the screen. It’s not exactly a Pixar feature film, but quite staggering to think that this was the state of the art just 50 years ago.

5) He predicted YouTube

Paik put all these insights together and worked out that video would be freely exchanged across the world.

He called it the ‘Video Common Market’. Again, you can see how his insights outstrip the language available to him in a wonderfully quaint way.

He produced works like Global Groove to explore these ideas of cultural exchange:

‘This is a glimpse of a video landscape of tomorrow when you will be able to switch on any TV station on the earth and TV guides will be as fat as the Manhattan telephone book.’

and kept pace with technology as it developed, overseeing ‘Wrap Around the World’ where David Bowie performed a song before handing over live in real time to Ryuichi Sakamoto to play another (an exquisite Japanese piece I can’t find) while an estimated 50 million people watched, followed by a car race in Ireland, some Kung Fu, and a few other musical performances. To add to the surreality, Al Franken was the compere.

6) He Predicted eBooks

Well before their advent, he talked about the implications of ‘paperless books’, and declared that ‘paper is dead’. He saw that magnetic storage was a radical historical shift, and an opportunity to use resources more effectively. I guess he didn’t anticipate how many cat gifs it would be deemed necessary to store and distribute.

‘Nietzsche said “God is dead” and I would say now “Paper is dead”… and paper has been as omni-present as God in human civilization for many thousand [sic] years.. and replacement of chief information storage from paper to magnetic way will have fundamental shake up in every sphere of our life, and preparation for this change, better utilization new resource should be explored by all sorts of talent’

Letter from Paik to Mr Stewart, undated

I think it’s fair to say that paper is as dead as God was 50 years after Nietzsche wrote those words.

7) He Did The Artwork For ‘OK Computer’

OK, well that one’s a lie.

But I think he could have:


You can see more about the exhibition here:

A Call For Artworks

I felt quite sad that I don’t know whether or not similar interesting work is going on now. I’ve struggled to find anything similar using the tools that Paik perceived would exist. What would he be doing today?

Who is creating art with blockchain, or manipulating social media to create happenings of significance instead of manipulating elections? The last really interesting thing I remember hearing about was the Luther Blissett mob in the 1990s, though that says more about my vintage than anything about the cutting edge.

Tell me about the interesting experiments with media going on now, please!



Get 39% off Docker in Practice with the code: 39miell2


Notes on Books Read in 2019

Following on from last year’s notes on books read in 2018, here are my notes for 2019. Again, these are not reviews, more notes on things I found interesting in them, or capsule summaries. This year I’ve categorised them into General Interest, The History of Technological Change, and Other.

I already wrote longer posts on two management books I read: Turn the Ship Around, and The Toyota Way, so won’t go over those again here.

General Interest

Debt: The First 5,000 Years,
by David Graeber

One of those books, like Sapiens, that spans the whole of recorded history to give you a fresh way of looking at the world. It’s also pretty iconoclastic, challenging a significant chunk of mainstream economics.

For example, it points out that the standard explanation of the origin of money (that it replaced barter with a means of exchange) is dismissed pretty handily by pointing out that there’s no evidence for it. Instead, barter is a rare and painful process when it is observed by anthropologists. It frequently involves lengthy rituals to build trust before the exchange takes place, and violence when it doesn’t work out as planned. If it comes up at all, it’s when systems of money have broken down, such as 90s Russia, or prison currency of cigarettes or tins of tuna.

In fact, this book argues, debt came first, and was part of the fabric of everyday existence deep within the human psyche. And that’s obvious, when you think about it. It is odd not to mentally keep score of what other parties have done for us, and want to return the favour. It’s also odd to keep very close track of what our closest friends and family have done for us. There’s an innate communism at the heart of society, and we are not rational actors always thinking about our own benefit vs cost all the time.

The book lost me halfway through as it went deep into the idea of debt as a profound part of human nature, linking the concept of debt to the history of religion and human sacrifice. The book soon recovered and had hundreds of interesting nuggets of information, such as:

  • Americans are the least sympathetic to personal debtors, which is odd, since America was largely settled by absconding debtors. In colonial days, a debtor’s ear was nailed to a post, and in 2010 a judge sentenced a Kenney, Illinois man to ‘indefinite incarceration’ for having a 300USD lumber yard debt
  • The invention of credit appears to have preceded the creation of money, eg on Mesopotamian Cuneiform and Egyptian hieroglyphics
  • The Sumerians invented modern accountancy (among other things, like the 24-hour day)
  • The Bank of England was founded under William III on the issuance of a loan that could never be repaid, and indeed still hasn’t
  • Santa Claus is the patron saint of children… and thieves
  • In the ten commandments, ‘Coveting your neighbour’s wife’ was about getting her to be your slave, not your mistress

A History of the World in 100 Objects,
by Neil MacGregor

This is a great book to read if you’re busy, as it consists of 100 interesting mini-essays, each describing one man-made object, which together span 2 million years of human history. The oldest object is a stone chopping tool found in Olduvai Gorge, Tanzania, a squarish block about the right size to fit in a fist.

Other highlights:

  • The first cities came to be in about 3000 BC in Mesopotamia, the most famous being Ur in Sumer
  • The Indus valley civilization rose and fell 2500-2000 BC, and had grid layout cities, sanitation systems and home plumbing
  • Jade was more precious than gold in Chinese and central American civilizations
  • The story of the Flood (ie the same story as Noah) was found on a clay tablet, now in the British Museum and dating from between 700-600 BC, by George Smith in 1872, a little over a decade after The Origin of Species was published. Must have been a hard time to be a thinking Christian
  • Ozymandias is the same historical figure as Rameses II. Napoleon’s men tried to remove the statue, but couldn’t. When it was finally moved it was considered a great technical achievement – 3000 years after it was moved there
  • Zoroaster was the first prophet to teach that the universe was a battleground between good and evil, around 1000 BC
  • The face of Christ was not represented until 300 AD
  • The first state structure of South America seems to have been undone around 600 AD… by climate change
  • Every day 3000 euros are thrown into the Trevi fountain
  • It’s estimated that at one point 3 million kg of tea per year was smuggled into Britain, vs 2 million kg imported legally
  • Kerosene consumes 20% of rural African income, and causes 3 million deaths a year, mostly to women through cooking fumes. This is one of the reasons getting solar panels to rural Africa in an economic way might be hugely beneficial

Factfulness, by Hans Rosling

A paean to progress, this widely celebrated book helpfully reminds us that despite all the bad news we hear, human progress quietly continues to improve our lot.

On a more practical level, it reminds us that the idea of the ‘them and us’ ‘Third World’ that we Generation Xers and Boomers were raised on is becoming increasingly anachronistic as the economies and cultures develop. Rosling uses child mortality as a proxy for these developments: if a society can afford to allocate its resources to reducing these deaths, then they are likely to have better education, spare capacity, an infrastructure to support decent healthcare, and so on.

As an example of how much things have changed, Rosling reminds us that 30-40 million people died in China in 1960 due to the Great Leap Forward’s famine. And very few in the west knew anything about this. He invited us to imagine how easily that could happen now, given modern communications.

You can play with the figures yourself at gapminder.org.


First Man, by James Hansen

The 50th anniversary of the moon landings in 2019 meant that it was hard not to learn something about Neil Armstrong. This book told me a few things I had not heard elsewhere, like that he was expected to make an ‘average pilot’ in his naval aviation evaluation in 1949:

Student obviously knew all work and was able to fly most of it average to above.
Towards last of period he got so nervous it began to show up in his work. Should
be able to continue on in program and make an average pilot.

Naval aviation evaluation of Neil Armstrong, 1949

That should give hope to any of us that have been deemed of average potential in any field.

I also learned that Armstrong was so averse to debt that he refused to take up a loan to renovate his house, so his wife complained about the seven-year delay.

The geek in me was intrigued to learn that the Gemini Program (which preceded the Apollo space program) used a computer that had 159744 bits of information (less than 20k).
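For fellow geeks, the arithmetic behind that figure is easy to check. This is only a sketch: the bit count comes from the text above, while the layout of 4,096 words of 39 bits each is my assumption about where the number comes from, not something quoted from the book.

```python
# Sanity check of the Gemini computer's memory size quoted above.
TOTAL_BITS = 159_744  # figure quoted in the text

# Assumption (not from the text): memory organised as 4,096 words of 39 bits.
ASSUMED_WORDS = 4096
ASSUMED_BITS_PER_WORD = 39


def bits_to_kibibytes(bits: int) -> float:
    """Convert a bit count to kibibytes (1 KiB = 1024 bytes of 8 bits)."""
    return bits / 8 / 1024


total_bytes = TOTAL_BITS // 8
print(total_bytes)                    # 19968
print(bits_to_kibibytes(TOTAL_BITS))  # 19.5
print(ASSUMED_WORDS * ASSUMED_BITS_PER_WORD == TOTAL_BITS)  # True
```

So ‘less than 20k’ holds whether you count in units of 1,000 or 1,024 bytes.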

And if you’ve had a bad day at work, spare a thought for those working on the Russian Luna project. While Apollo 11 was taking people to the moon, the unmanned Luna 15 was also on its way there – the US and USSR space agencies had to check they wouldn’t interfere with one another. Luna 15 crashed into the moon on July 21, the day after the lunar module landed.


The Death and Life of Great American Cities,
by Jane Jacobs

The history of architecture has long been an interest of mine, so after reading Scale last year I was reminded to look up this 1961 book about the effects of ‘rational’ city planning in the US. It’s probably the most influential book on town planning, as it quietly overturned all the modernist assumptions about the effects of rationalist town planning on its inhabitants.

It emphasised the importance of diversity of occupancy and usage in helping keep an area safe. If locals use, say, a park for a variety of reasons throughout the day, then it will be less dangerous as it will be ‘self-policed’ by its users. If a shared space is avoided because it is dark, obscured, or unpleasant, then it will lose that self-policing and become more dangerous.

Such ideas were anathema to the modernist architect, who thought that an isolated square of grass beneath a tower block should be enough to keep thousands of inhabitants happy.

After a stimulating opening, I found the book itself quite a dull read, and gave up halfway through, as the points it made were re-iterated in later chapters. A good summary is available here.


Thinking in Systems,
by Donella Meadows

While on the subject of complex, organic systems, I moved onto this book. Considered a classic in its field, it’s a mind-expanding book that describes how to think about systems both simple and complex in the abstract. Once you read it, you might find that you apply systems thinking to almost everything.

It starts with one of the simplest of systems: a bathtub. A bathtub has flows of inputs (the tap), and outputs (the plug), and stock (the water in the tub). You can’t change the stock immediately, but must wait for the levels to change in response to the changes in flows.
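The bathtub is simple enough to sketch in a few lines of code. This is my own toy illustration rather than anything from the book: the function name, the one-minute time step and the numbers are all made up for the example.

```python
def simulate_bathtub(stock, inflow, outflow, minutes):
    """Return the water level (litres) at the end of each minute.

    stock   -- litres in the tub at the start
    inflow  -- litres per minute from the tap
    outflow -- litres per minute down the plughole
    """
    levels = []
    for _ in range(minutes):
        # The stock only changes through its flows, and can't drop below empty.
        stock = max(0.0, stock + inflow - outflow)
        levels.append(stock)
    return levels


# Hypothetical numbers: a 100-litre bath, tap off, draining at 20 litres/minute.
print(simulate_bathtub(100.0, 0.0, 20.0, 6))
# [80.0, 60.0, 40.0, 20.0, 0.0, 0.0]
```

Pulling the plug changes the outflow at once, but the level itself takes five minutes to reach zero. That gap between changing a flow and seeing the stock respond is the point of the example.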

Once you’ve grasped these basics, Meadows demonstrates how the way we think about these elements can lead us astray. Economists again come in for a hard time, as Meadows critiques their tendency to focus on easily-measurable and manipulable flows such as inputs and outputs (interest rates, GDP etc) more than stocks (level of education, or the state of infrastructure). They’re also pilloried for assuming that actors in a system can immediately respond to changes in levels (prices, stocks), whereas in reality consumption and production lag these changes.

These lags and difficulty in observation, along with the number of flows that can affect stocks in a system mean that systems can become very complex indeed. This complexity – and the resultant skill required to get results needed – will be familiar to anyone that tries to manage any group of people or system of complexity. Think about, say, managing a football team, and how difficult that is, as money invested in your junior team (inputs) may result in benefits (outputs/wins) years or decades from now.

I particularly liked this quote:

I have come up with no quick or easy formulas for finding leverage points in complex and dynamic systems. Give me a few months or years and I’ll figure it out. And I know from bitter experience that, because they are so counter-intuitive, when I do discover a system’s leverage points, hardly anyone will believe me.

Meadows, Thinking in Systems

Learn Bash the Hard Way
Learn Git the Hard Way
Learn Terraform the Hard Way

Get 39% off Docker in Practice with the code: 39miell2


Never Split the Difference,
by Chris Voss

Written by a former FBI hostage negotiator, this is an entertaining riposte to the standard theory you might read about negotiation in an MBA. He rejects all the ‘give a little, take a little’ stuff, and instead focuses on exploiting human biases and weaknesses to get the outcome you want.

Despite Voss’s experience coming from life-or-death negotiations, he relates it well to everyday situations. In fact, I immediately used some of the techniques to help persuade my daughter to take reading more seriously (‘What do you think we’re trying to achieve by encouraging you to read?’). I’m not going to claim it’s changed my (or her) life, but it’s certainly resulted in more productive exchanges than I had previously had.

Particularly useful is the simple advice to keep asking what Voss calls ‘calibrated questions’ that begin with ‘What’ or ‘How’ in order to put the onus on the other side to help solve the problem. The starkest example is given at the beginning of the book. When told: ‘Give us a million dollars or we’ll kill your son’, rather than saying ‘No’, he says, ‘How am I supposed to do that?’. This makes the demand the hostage-takers’ problem, and sets up the conversation for a genuine negotiation, while buying time and gathering more information about the situation from the antagonist’s responses.

Other advice includes:

  • Use a ‘late-night DJ voice’ to keep the conversation calm
  • Start with ‘I’m sorry’. It always works, is free, and makes them more likely to help you
  • Name / label their feelings (‘You probably think I’m an asshole’), and they will open up and calm further
  • Mirror their words, then wait. They’ll keep talking
  • When getting their agreement, make them articulate the implementation. This makes it more likely the outcome agreed will happen

The Culture Code,
by Daniel Coyle

A stimulating pop science book about how effective teams work together, The Culture Code gives plenty of anecdotal information about what leaders do to make groups tick along.

I was especially struck by this quote from a Navy SEAL:

When we talk about courage, we think it’s going against an enemy with a machine gun. The real courage is seeing the truth and speaking the truth to each other. People never want to be the person who says ‘what’s really going on here?’ But inside the squadron, that is the culture and that’s why we’re successful.

This is exactly the kind of language used to describe a productive therapy group.

Some other titbits:

  • Thinking about your ancestors makes you perform better in intelligence tests. We don’t know why, but suspect it’s because it makes you feel part of a group
  • If you physically move more than 8 meters away from people at work, then interactions drop exponentially
  • A hospital sending a postcard to people that attempted suicide asking them to drop the hospital a line to tell them how they are doing reduced readmittance rates by half
  • The brain defaults to intensive thinking about social groups when there’s nothing else to think about

Measurable factors in team performance:

  • People talk in roughly equal measure, and contributions are short
  • High levels of eye contact, and energy in conversation
  • Communication is between members, not just with the leader
  • Back-channel or side conversations take place within the team
  • Team members take periodic breaks to go outside the team, bringing back information they share with others

Ideas for action to improve group health:

  • Name and rank priorities
  • Be 10 times as clear about priorities as you think you should be
  • Figure out where the team needs proficiency and where it needs creativity
  • Embrace catchphrases
  • Measure what really matters

All this resonated very strongly with what I read in Turn the Ship Around.


The History of Technological Change

The Master Switch,
by Tim Wu

This book looks at the rise and fall of various ‘information empires’, from the telegraph to the modern media conglomerate.

It’s a good read, especially on the incidental details of these histories:

  • Many people know that when Bell first transmitted speech over an electric wire, he said ‘Watson, come here. I want you.’ Less often reported in the US is that he followed it up with ‘God save the Queen!’
  • Magnetic recording was invented by Bell Labs in the 1930s, along with the answering machine, which was then suppressed because it was seen as a threat to the core business
  • The ‘Hush-a-Phone’ was a popular invention that resulted in lawsuits from Bell, which demanded that its monopoly on devices to do with the telephone be retained. The inventor, Harry Tuttle, eventually won on appeal, and this was a turning point in the monopoly battle against Bell. Tuttle is remembered in the name of Robert De Niro’s character in the film Brazil
  • The US was massively behind in TV technology for a long time because key inventions were ignored or effectively suppressed by the large radio networks
  • The inventor of television, John Logie Baird, was ruined by the Crystal Palace fire, and returned to solo invention in 1936. He developed a prototype of high-definition TV which wouldn’t reach the public until the 21st century
  • A staggering 83% of US households watched Elvis’ appearance on the Ed Sullivan show. This is still the most powerful centralised information system in human history
  • Why are superheroes all over the cinema these days? Because a distinctive character can be licensed as intellectual property, where a story cannot. The films are adverts for licensed products

Reading this book got me thinking about technological change going back further in time…


The second group of books I read in 2019 revolved around the history of technological change. The last couple of centuries have seen many significant societal changes as a result of technological and engineering innovation, but I’ve long wondered about the nature of changes before that, and how long those changes took to spread.

The Medieval Machine,
by Jean Gimpel

As is so often the case with a powerful history book, the biggest surprise reading this was how contemporary historical concerns seem.

I didn’t know, for example, that deforestation was a serious problem in medieval Europe. At Douai in the 13th century, wood was so scarce that people rented coffins, and the undertaker would retrieve coffins after burial. In England, hunting grounds were protected (not for ecological reasons, but for the enjoyment of the aristocracy) in the much-hated forest law.

Population growth in the period 1150-1250, at a staggering 20%, was a significant driver of historical change, setting the scene for the flowering of knowledge in the renaissance.

Even a throwaway quote from a medieval doctor in 1267 reminded me of today’s plethora of JavaScript software frameworks, and the feeling that scientific development is going on at an ever-faster pace: ‘Every day a new instrument and a new method is invented’.

The Medieval Machine makes the argument that the story told by renaissance thinkers that technological development was static in the dark ages is mostly propaganda, and underneath that ill-documented period, change was pervasive. Interestingly, innovation both scientific and intellectual was powered by the monasteries. Cistercian monasteries developed water power, the book making the argument that their philosophy of order and structure was a precursor of Henry Ford’s factory model.

The spread of the water mill drove other kinds of innovation. Because they were capital-intensive, the first corporations were created to share the cost of ownership and maintenance. The oldest, the Societe des Moulins du Bazacle lasted over 700 years until its nationalisation in the 20th century.

Animal power also increased dramatically, as inventions as simple as the harness and the horseshoe increased the amount a horse could pull from 500kg in the time of the Roman Empire to a staggering 6,400kg. Since the vast majority of human endeavour before 1700 was related to tillage and haulage, this was a huge leap forward.

The availability of more energy resulted in more production – more stone was quarried in France between the 11th and 13th centuries than in the whole history of Ancient Egypt.

More energy also meant more time to devote to intellectual pursuits, and the 12th and 13th centuries saw a ‘translation boom’ in ancient texts, which laid the foundations of modern science and humanist learning.

One might wonder whether this sweeping historical perspective gave Gimpel a greater ability to divine the future. Unfortunately, it did not, as Gimpel wrote in the introduction:

No more fundamental innovations are likely to be introduced to change the structure of our society… we have reached a technological plateau.

The Medieval Machine (Gimpel, 1976), p. xi

This was written just as the internet was being invented…


Medieval Technology and Social Change,
by Lynn Townsend White

Similar in subject matter to The Medieval Machine, White’s book covers a much longer span of history, looking at technological developments going back to ancient times.

Strikingly, he argues that the invention of the stirrup heralded a new age in the 8th century, going so far as to reason that the Battle of Hastings was lost by the side with the greater numbers because the more numerous native English didn’t understand the significance of the stirrup. It was 7th century warfare fighting 11th century warfare.

White also argues that the effects of something as simple but revolutionary as the plough can be seen today in the ‘two cultures’ of North and South in France, resulting from the differences between scratch ploughing in the South and the heavy ploughing of the Northern soil.

Other fascinating tit-bits:

  • In Roman times the overland haulage of goods doubled the price every hundred miles. Water was therefore the key to trade. In the 13th century the cost increased 30% per hundred miles
  • The architect of Hagia Sophia (built in the 6th century), Anthemius of Tralles, terrified his bothersome neighbour Zeno by simulating an earthquake using steam pressure
  • The Greeks invented the cam, but the first crank appeared in Europe as late as the first half of the 9th century
  • The Chinese used magnetized needles for navigation in 1122
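To get a feel for the haulage figures in the first of those points, here is a rough sketch. The function name is made up, and it assumes the increase compounds every hundred miles, which is my reading rather than something the book spells out.

```python
def price_after_haulage(base_price, miles, increase_per_100_miles):
    """Price of goods after overland haulage.

    Assumes the increase compounds every hundred miles (an assumption;
    the text only gives the per-hundred-mile rates).
    """
    return base_price * (1 + increase_per_100_miles) ** (miles / 100)


roman = price_after_haulage(1.0, 300, 1.00)     # doubling every hundred miles
medieval = price_after_haulage(1.0, 300, 0.30)  # 30% every hundred miles
print(roman)               # 8.0
print(round(medieval, 3))  # 2.197
```

On that reading, goods hauled 300 miles overland cost eight times their original price in Roman times, against a little over double in the 13th century.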

Machines may be made by which the largest ships, with only one man steering them, will be moved faster than if they were filled with rowers, wagons may be built which will move with incredible speed and without the aid of beasts, flying machines will be constructed in which a man … may beat the air with wings like a bird … machines will make it possible to go to the bottom of seas and rivers.

Peter of Maricourt, ca. 1260

Technological Revolutions and Financial Capital, by Carlota Perez

Going back to the Industrial Revolution rather than Ancient Greece, Perez’s book argues for the utility of a schema for understanding history that revolves around technological advancement and its relationship with financial capital.

She posits that history since the industrial revolution can be divided into five phases:

  • Industrial Revolution (1771-1829)
  • Age of Steam and Railways (1829-1875)
  • Age of Steel, Electricity and Heavy Engineering (1875-1908)
  • Age of Oil, Automobile and Mass Production (1908-1971)
  • Age of Information and Telecoms (1971-20?)

Each of these phases has a structure:

  • Big bang (invention, eg microprocessor in 1971)
  • Irruption (growth of technology and its complements, eg 1971-1987)
  • Frenzy (markets go crazy for it, eg 1987-2001)
  • Turning point (return to realistic growth through crash, 2001-2008?)
  • Synergy (business and investment moves towards the new technology, 2008?-2020?)
  • Maturity (technology fully absorbed into society, 2020?-2040?)

The book was written around the turn of the millennium, so the question marks are my guesses as to what the dates should be since then. For all I know another large crash is around the corner, and the turning point could last two decades (Perez argues for a 14-year turning point between 1929 and 1943, and a 2 year turning point between 1848 and 1850).

It makes for a riveting read, and some passages seemed to be very prescient. For example, when Perez says:

One of the features of the current surge is the importance of innovations as creations of value and the ease with which changes can be introduced in production, due to flexible equipment and organisations.

Technological Revolutions and Financial Capital, Perez, p.136

I couldn’t help but think about the software industry’s obsession with agile methodologies over more mass-production-oriented waterfall ones.

It’s certainly an interesting thesis, but how ‘true’ it is intended to be taken by the author is unclear. She hedges around the subject, but seems to argue that it’s a useful schema for analysis rather than anything fundamental about the economies of the world. Certainly, the eras overlap, and there are differences in the cycles seen in the different parts of the world (Japan had its crack-up boom in the 1980s, exactly at the opposite end of the cycle to the west’s around the turn of the millennium).

What to make of all this noise, and the potential arbitrariness of the technologies, isn’t obvious. Is the microprocessor really the centre of technological change since the 70s? Isn’t the microprocessor an iteration on the transistor, which was invented much earlier, and isn’t the internet the real agent of change in the world economy? How can we tell?

Regarding the details of each phase, the S-shaped curves seem to me to be the same as the curves in all descriptions of change in history: surge as people jump on a valuable new idea, crack-up boom, crash, and return to normal. In other words: what goes up must come down.

All that said, I’ll be returning to this book a lot in the future, as its schema is a very useful map of global change that makes interpreting events more tractable, and seems as good as any other I’ve seen.


The Innovator’s Dilemma,
by Clayton Christensen

This is a standard work on business change I’d been hearing about for years.

At root its argument is pretty simple: companies develop an expertise within a paradigm that comes to define them, and they fail to develop new products or services outside that paradigm. Eventually they get overtaken by companies who develop products in adjacent spaces that – due to technical improvements in price or features – eventually move into their territory.

The principal reason for this failure to compete is that profits are more easily found in the near term by making minor improvements to existing products than developing orthogonal ones that may undermine their principal revenue streams. The lower risk and greater reward means that managers avoid these opportunities, and the companies miss their window to capture the new market.

Simply developing new product lines is not guaranteed to work even if supported by the leadership, as they may be developed too early for the market, or cost too much to be profitable for too long. This isn’t just theory: Christensen gives real-world examples of companies from the burgeoning disk drive industry of recent decades that developed smaller drives too early, or failed to market them well enough because their existing customers were not interested and new customers were not effectively sought.

Christensen suggests some ways to fight this failure mode:

  • Accept failures on different dimensions in different contexts
    • A mature product line needs to succeed on profit and features
    • A new product line needs to capture market share and market effectively to new customers
  • Accept your company has a more context-specific set of capabilities than you realise
    • Developing capabilities in other areas will seem more expensive than developing them in areas where you are already competent
  • Accept that the information you might want in order to make decisive moves in new areas may not be available to you
    • You may have to make relatively small bets on new markets to hedge your position
  • Avoid positioning your organisation as always leading or always following

Other

Body by Science,
by McGuff and Little

I’m about as far from being a body-builder as can be, but I found this often-recommended book interesting on the science of fitness. It’s most famous for making the argument that short intense bursts of activity are more effective for improving fitness than marathons or endless jogging (steady-state activity) that can simply add stress to the body, shortening life.

It also emphasises the importance of rest periods. A thing I learned from this was that ‘fast twitch’ muscles refer to the speed of fatigue, not the speed of response, and they can take a very long time to recover. Sprinters exercise extremely fast-twitch muscles, so if they rest before a race they are more likely to break records.


The Inner Game of Tennis,
by Timothy Gallwey

Another often-recommended book I finally got around to reading this year. I didn’t get a great deal out of it. In many ways it’s similar to Thinking, Fast and Slow, in that it posits two personas, the conscious and the unconscious.

As a tennis teacher, Gallwey found it more effective to work on the unconscious player than the conscious one by simply getting the player to carefully watch good examples, then watch themselves trying to perform the same actions. Eventually, he argues, the unconscious mind catches up and the player improves without conscious effort. By contrast, telling yourself to follow a complex sequence of steps can result in choking. By analogy, a child doesn’t learn to walk consciously, and when an adult thinks about how to walk it all falls apart.

He derides ‘positive thinking’ as merely the flip side of negative thinking, ie buying into the notion that physical movements can be consciously controlled.

If you accept the basic premise of the book, then there’s not a great deal more to learn here, but Gallwey’s anecdotes and life story are well told and diverting.


The Charisma Myth,
by Olivia Fox Cabane

As appalling as this book sounds (‘HOW ANYONE CAN MASTER THE ART OF PERSONAL MAGNETISM’), it’s actually a pretty practical guide as to how to behave in order to get people not to hate you. It reminded me of Getting Things Done in that it seems to have been written by someone with field experience of helping ordinary people in this respect.

It goes beyond the basic tips at the start (keep your voice low at the end of sentences; nod slowly; wait two seconds before speaking) to more therapy-like advice about managing your own thoughts, and discussion of how this can affect you and others negatively.

I don’t think it will turn you into Will Smith or Bill Clinton, but it certainly might be helpful in managing your own emotions and reflecting on how you come across to others.



The First Non-Bullshit Book About Culture I’ve Read

I’ve always been frustrated that people often talk about culture without giving actionable or realistic advice, and was previously prompted by this tweet to write about what I did when put in charge of a broken team:

Then the other week I met a change management type at a dinner who’d previously worked in manufacturing, and I asked him to recommend me some books. One of them was Turn the Ship Around and it was exactly the book I wanted to read.

The Story

The book tells the story of David Marquet, newly-elevated commander of the worst-performing nuclear submarine in the US Navy. It was considered a basket case, and he was given it at the last moment, meaning his previous year’s meticulous preparation (for another ship) was for nought. He was under-prepared, and the odds were against him.

Within a year he’d turned it round to be the best-performing, with the staff going on to bigger and better things, and the ship sustaining its newly-acquired status.

Just the abstract blew my mind – it’s hard enough to turn around a group of IT types whose worst failure might be to lose some data. Never mind an actual nuclear submarine, where as commander, you are personally responsible for anything that goes wrong.

I was greatly intrigued as to how he did it, and the book did not disappoint.

What Marquet Did

By his own account, what Marquet did was improvise. Faced with the constraints he had on delivering any improvement, given:

  • The poor state of crew morale on arrival
  • His relative lack of knowledge about the ship itself
  • The lack of time available to show an improvement

he had little option but either to: fail to make a significant improvement and ‘get by’ with the traditional management techniques, or do something drastic.

As he explains, his drastic course of action was to overthrow the principle of commander-control the US Navy had assumed to be the best form of management for generations. The US Navy’s traditional approach had been to give the commander absolute authority and responsibility on a ship. This resulted in what Marquet calls a ‘leader-follower’ mentality, which in many ways is a great way to run things.

With good discipline (something the services excel at training for) and a highly trained leader, you can get good results, especially in a safety-critical environment. You can also get a demotivated, reactive, apathetic crew who develop a culture that focusses on ‘doing the minimum’. When the culture is broken, it’s hard to change this by simply shouting at the crew louder or doubling down on discipline.

Leader-Follower to Leader-Leader

Marquet sought to replace the leader-follower culture with a leader-leader one. Since Marquet didn’t even fully understand his own ship, he had to delegate authority and responsibility down the ship’s command structure.

This is brought home to him dramatically when he issues an order that is impossible to fulfil. He orders a navigator to move the ship at a certain speed. He hears an ‘Aye, aye, sir!’, and then moments later wonders why his order doesn’t seem to have been followed. It turns out the ship he is on literally cannot move at that speed!

This impresses on him that he has to do two things:

  • Abandon the pretence of his own omniscience
  • Encourage his staff to feed back information to him

In other words, he has to ‘give control’ to his crew without endangering the world in the process. The book discusses how he achieves this, and gives some retrospective structure to his actions that make it easier to apply his experience to different environments where culture needs to be changed.

What Makes This Book So Different?

It’s Credible

Marquet talks not only about what he did, but his concerns about his actions as he carried them out. For example, he describes how when he made chiefs responsible for signing off leave (and removed several layers of bureaucracy in the process), he worried that they would misuse their new power, or just make mistakes he himself would not make.

In fact, these fears turned out to be unfounded, and by that action, he demonstrated that he wanted to make real change to the way the ship worked. This ceding of control had far more effect, he says, than any exhortation from above to more proactivity or responsibility from his underlings. He argues that such exhortations don’t work, as people don’t take words anywhere near as seriously as actions when making change.

Anyone who’s undergone any kind of corporate transformation effort while in the rank and file will know the difference between words and actions.

It’s Actionable

Far from offering vague advice, Marquet goes to the level of supplying specific sets of questions to put to your staff in meetings, and useful advice on how to implement the policies and encourage the behaviours you need in your team.

Early on in the process he uses a CBT-style technique of ‘act as though we are proud of the ship’ to kick-start the change he wants to see. Literally anyone in a leadership role looking to improve morale can implement something like that quickly.


It’s Honest

There’s very little sense of Marquet trying to sell a ‘perfect world’ story as he tells you what happened. In one vivid section, a normally dependable officer goes AWOL halfway through the year, and Marquet has to track him down. Marquet then takes the massive risk of letting the officer off, which further risks losing the respect of some of his subordinates, some of whom are hard-liners on discipline. None of this sounds like fun, or clear-cut.

In another section, he describes how an officer ‘just forgot’ that a switch should not be flicked, even though there was a standard ‘red tag’ on it signalling that it shouldn’t be touched. Again, rather than just punishing, he spent 8 hours discussing with his team how they could prevent a recurrence in a practical way.

After rejecting impractical solutions like ‘get sign off for every action from a superior’, they arrived at a solution that reduced mistakes like this massively. It was another implementable tactic: ‘deliberate action’. Staff were required to call out what they were about to do, then pause before they did it, allowing others to intervene, while giving them literal pause for thought to correct their own mistakes.

It’s Well-Structured

The book ends up having a schema that is useful, and (mercifully) is not presented as a marketable framework, and which follows naturally from the story:

  • He wants to give people control
  • He can’t do that because: 1) they lack competence, and 2) they don’t know the broader context
  • He gives control piece by piece, while working on 1) and 2) using various replicable techniques

Some of the techniques Marquet uses to achieve the competence and knowledge of context have been covered, but essentially he’s in a constant state of training everyone to be leaders rather than followers within their roles.

Some highlighted techniques:

  • Encourage staff to ask for permission (‘I intend to […] because’) rather than wait for orders
  • Don’t ‘brief’ people on upcoming tasks, ‘certify’ (give them their role, ask them to study, and test them on their competence)
  • Creation of a creed (yes, a kind of ‘mission statement’, but one that’s in Q&A form and is also specific and actionable)
  • Specify goals, not methods

All of this makes it very easy to apply these teachings to your own environment where needed.

Caveats

Despite my enthusiasm, I was left with a few question marks in my mind about the story.

The first is that Marquet seems to have had great latitude to break the rules (indeed the subtitle of the book is ‘A True Story of Building Leaders by Breaking the Rules’). His superiors explicitly told him they were more focussed on outcomes than methods. This freedom isn’t necessarily available to everyone. Or maybe one of the points of the book is that to lead effectively you have to be prepared to ‘go rogue’ to some extent and take risks to effect real changes?

Another aspect I wondered about was that I suspected Marquet started from a point where he had a workforce that were very strong in one particular direction: following orders, and that it’s easier to turn such people around than a group of people who are not trained to follow orders so well. Or maybe it’s harder, who knows?

Also, the ship was at rock bottom in terms of morale and performance, and everyone on board knew it. So there was a crisis that needed to be tackled. This made making change easier, as his direct subordinates were prepared to make changes to achieve better things (and get promotion themselves).

This makes me wonder whether a good way to make needed change as a leader when there is no obvious crisis is to artificially create one so that people get on board…

Why Everyone Working in DevOps Should Read The Toyota Way

Ignore the Noise, Go to the Signal

In a former life I was a history student. I wasn’t very good at it, and one of my weaknesses was an unwillingness to cut out the second-hand nonsense and read the primary texts. I would read up on every historian’s views on (say) the events leading up to the first world war, thinking that would give me a short-cut to the truth.

The reality was that just reading the recorded deliberations of senior figures at the time would give me a view of the truth, and a way to evaluate all the other opinions I felt bombarded by.

What I should have learned, in other words, was: ignore the noise, and go to the signal.

Lean Schmean?

I was reminded of this learning moment recently when I finally read The Toyota Way. I had heard garbled versions of its message over the years through:

  • Reading blog after blog exhorting businesses to be ‘lean’ (I was rarely the wiser as to what that really meant)
  • Hearing senior leadership in one company use the verb ‘lean’ (as in: ‘we need to lean this process’ – I know, right?)
  • One colleague telling me in an all-hands that we should all stop development whenever there was a problem ‘like they do at Toyota’ (‘How and why the hell is that going to help with 700 developers, and how do I explain that to customers?’, I thought. It came to nothing)

In other words, ‘lean’ seemed to be a content-free excuse to just vaguely tell people to deliver something cheaper, or give some second-hand cargo-cult version of ‘what Toyota do’.

So it was with some scepticism I found myself with some time and a copy of The Toyota Way in my hands. Once I started reading it, I realised that it was the real deal, and was articulating better many of the things I’d done to make change in business before, some of which I wrote about here and here.

‘That’s Manufacturing. I Work in a Knowledge Industry.’

One of the most obvious objections to anyone that foists The Toyota Way (TTW) on you is that its lessons apply to manufacturing, which is obviously different from a knowledge industry. How can physical stock levels, or assembly line management principles apply to what is done in such a different field?

The book deals with that objection on a number of levels.

The Toyota Way is a Philosophy, Not a Set of Rules

First, it emphasises that TTW is a philosophy for production in general, and not a set of rules governing efficient manufacturing. This philosophy can (depending on context) result in certain methods and approaches being taken that can feel like, or in effect become, rules, but can be applied to any system of production, whether it be production of pins, medicines, cars, services, or knowledge.

What that will result in in terms of ‘rules’ for your business will depend on your specific business’s constraints. So you’re under no obligation to do things the same way Toyota do them, because even they break their own supposed ‘rules’ if it makes sense for them. One example of this is the ‘rule’ that’s often cited that stock levels must always be low or minimal to prevent waste.

A high-level overall goal of TTW is to create a steady flow of quality product output in a pipeline that reduces waste. That can mean reducing stock levels in some cases (commonly considered a ‘rule’ of lean manufacturing), or even increasing them in others, depending on the overall needs of the system to maintain a steady flow.

So while the underlying principles of TTW are relatively fixed (such as ‘you should go and see what is going on on the floor’, ‘visual aids should be used collaboratively’, and so on), the implementation of those principles is relatively loose and non-prescriptive.

This maps perfectly to DevOps or Agile, which have a relatively clear set of principles (CALMS, and the Agile Manifesto, respectively) which can be applied in all sorts of ways, none of which are necessarily ‘correct’ for any given situation. In this context, the agile and DevOps industry that’s been built up around these movements is just noise.

Waste and Pipelines are Universal

Secondly, the concept of waste and pipeline is not unique to manufacturing. If your job is to produce weekly reports for a service industry, then you might consider that time spent making that report is wasted if its contents are not acted upon, or even read in a timely way.

A rather shocking amount of time can be spent in knowledge industries producing information that doesn’t get used. In my post on documentation I wrote about the importance of the ‘knowledge factory’ idea in running an SRE team, and the necessity to pay a ‘tax’ on maintaining those essential resources (roughly 5% of staff time in that case). The dividend, of course, was far greater.

Most of that tax was spent on removing or refining documentation rather than writing it. That was time well spent, as the biggest problem with documentation I’ve seen in decades of looking at corporate intranets is too much information, leading to distrust and the gradual decay of the entire system. So it was gratifying to read in TTW that:

  • Documentation audits take place regularly across the business
  • The audits are performed by a third party who ensures the documents are in order and follow standards
  • The principal check performed is to search for out of date documentation rather than quantity or correctness (how can an outsider easily determine that anyway?)

The root of the approach is summed up perfectly in the book:

‘Capturing knowledge is not difficult. The hard part is getting people to use the standards and contribute to improving it’

The Toyota Way

So in my experience, the fact that a car is being created instead of knowledge or software is not a reason to ignore TTW. Just like software delivery, a car is a product that requires both repeated activity in a pipeline, and creative planning of features and the building of technology in a bespoke way. All these parts of the process are covered and examined in TTW.

How Flow is Not Achieved

So how do you achieve a harmonious flow of output in your non-material factory? Again, this is essentially no different to manufacturing: what you’re dealing with is a system that has various sub-processes that themselves have inputs, outputs, dependencies and behaviours whose relationships need to be understood in order to increase throughput through the system.

How Flow is Achieved: Visualise for Collaboration First

Understanding, visualising and communicating your view of these relationships with your colleagues is hard, and critical to getting everyone pointing in the same direction.

This is something I’d also stumbled towards in a previous job as I’d got frustrated with the difficulty of visualising the constraints we were working under, and over which we had no control. I wrote about this in the post ‘Project Management with Graphviz’, where I used code to visualise and maintain the dependencies in a graph. I had to explain to so many people in so many different meetings why we couldn’t deliver for them that these graphs saved me lots of time.
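To give a flavour of what ‘dependencies as code’ can look like, here is a minimal Python sketch that turns a dependency mapping into Graphviz DOT text. It is not the code from that post: the to_dot helper and the task names are invented purely for illustration.

```python
def to_dot(dependencies):
    """Render {task: [tasks it depends on]} as Graphviz DOT text."""
    lines = ["digraph dependencies {"]
    for task in sorted(dependencies):
        for upstream in sorted(dependencies[task]):
            # An edge points from the prerequisite to the task that waits on it.
            lines.append(f'    "{upstream}" -> "{task}";')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical delivery dependencies, for illustration only.
example = {
    "deploy app": ["firewall change", "new VM"],
    "new VM": ["capacity approval"],
}

if __name__ == "__main__":
    print(to_dot(example))
```

The printed text can be fed to Graphviz’s dot command (for example with -Tpng) to draw the graph, and because the source is plain text it can be kept in version control alongside everything else.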


Interlude – Visual Representations

Another principle outlined in TTW: visual representations should be simple and share-able. Unfortunately, this is the kind of thing you get delivered to your inbox as an engineer in an enterprise:

Now, I’m sure Project Managers eat this kind of thing for breakfast, and it makes perfect sense to them, but unless it corresponds to a commonly-seen and understood reality, it’s infinitely compressible information to the typical engineer. I used to almost literally ignore them. That’s the point of the visual representations principle of TTW: effective collaboration first, not complex schemas that don’t drive useful conversations.


In retrospect, and having read TTW, the answer to the problem of slow enterprise delivery is logically quite obvious: upstream processes need to be improved before the downstream processes that depend on them can achieve flow. For many IT organisations, that means infrastructure must be focussed on first, as it provides the services the development teams depend on.

But what often happens in real world businesses (especially those that do not specialise in IT)? Yup, centralised infrastructure gets cut first, because it is perceived that it ‘doesn’t deliver value’. Ironically, cutting centralised infrastructure causes more waste by cutting off the circulatory systems other parts of the business depend on for air.

So the formerly centralised cost gets mostly duplicated in every development team (or business unit) as they slowly learn they have to battle the ‘decagon of despair’ themselves separately from the infrastructure team that specialised in that effort before.

This is the infrastructure gap that AWS jumped headlong into: by scaling up infrastructure services to a global level, they could extract a tax from each business that uses it in exchange for providing services in a finite but sufficient way that removes dependencies on internal teams.

It is also the infrastructure gap that Kubernetes is jumping headlong into. By standardising infrastructure needs such as mutual TLS authentication and network control via sidecars, Kubernetes’ nascent open source companion product Istio is pulling those needs back together in a centralised and industry-standard way.

1) How Flow is Not Achieved: No Persistence in Pursuing Change

A key takeaway from the book is that efforts to make real change take significant lengths of time to achieve. TTW reports that it took Ford 5 years to see any benefits from adopting the Toyota Production System, and 10 years for any kind of comparable culture to emerge.

It’s extremely rare that we see this kind of patience in IT organisations (or their shareholders) trying to make cultural change. The only examples I can think of spring from existential crises that result in ‘do-or-die’ attempts to change where the change needed is the last roll of the dice before the company implodes. Apple is the most notable (and biggest) of these, but many other smaller examples are out there. You can probably think of analogous examples from your own life, where feeling that you had no choice but to make a change helped you achieve it.

2) How Flow is Not Achieved: Problems are Not Surfaced

The book contains anecdotes on the importance Toyota place on surfacing problems rather than hiding them. One example of this approach is the famous andon principle, where problems are signalled as quickly and clearly as possible to all appropriate people, so that the right focus can be given to resolving the problem before production stops. If it can’t be fixed quickly, they ‘stop the line’ to ensure the problem is properly resolved before continuing.

Examples include the senior manager who criticised a junior one for not having any line stoppages on the junior manager’s watch: if there are no line stoppages then everything must be perfect, and it clearly never can be (unless quality control is doing its job and finding no problems, which was not the case in this instance).

This is the opposite to most production systems in IT, where problems are generally covered up or worked around in order to hide challenges from managers up the chain. This approach can only work for so long and results in a general deterioration in morale.

3) How Flow is Not Achieved: Focus on Local Optimisation

There is a great temptation, when trying to optimise production systems, to focus on local optimisations to small parts of the system that seem to be ripe for optimisation. While it can be satisfying to make small parts of the system run faster than before, it is ultimately pointless if the overall system is not constrained on those parts.

In manufacturing cars, optimising the production rate of the wing mirrors is pointless if wing mirrors are already produced faster than the engines are. Similarly, shaving a small amount off the cost of a wing mirror is (relatively speaking) effort wasted if the overriding cost is the engine. Better to focus on improving the engine.

In a software development flow, making your tests run a little faster is pointless if your features are never waiting for the tests to complete to deploy. Maybe you’re always waiting 2 days elapsed time for a manager to sign off a release, and that’s the bottleneck you should focus on.
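The arithmetic behind this is simple enough to sketch in a few lines of Python. The stage names and rates below are invented for illustration; the only point is that the pipeline as a whole cannot go faster than its slowest stage.

```python
def find_bottleneck(stage_rates):
    """Return (stage name, rate) for the slowest stage in a pipeline.

    stage_rates maps stage name -> releases per day that stage can handle.
    Overall throughput is capped by the slowest stage.
    """
    name = min(stage_rates, key=stage_rates.get)
    return name, stage_rates[name]


# Hypothetical numbers: a 2-day sign-off means 0.5 releases per day.
pipeline = {"build": 40, "test": 20, "manager sign-off": 0.5}
before = find_bottleneck(pipeline)

# Doubling test speed is a local optimisation: the bottleneck is unchanged.
faster_tests = dict(pipeline, test=40)
after = find_bottleneck(faster_tests)
```

Here before and after are the same stage at the same rate, which is the whole argument: the effort spent on the tests bought nothing at the system level.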

4) How Flow is Not Achieved: Failure to ‘Go and See’

Throughout TTW, the importance of ‘going and seeing’ as a principle of management is reiterated many times. I wrote about the importance of this before in my blog on changing culture (Section 1: Get on the floor), but again it was good to see this intuition externally validated.

Two examples stuck in my mind: the story of the senior leader who did nothing but watch the production line for four hours so he could see for himself what was going on, and the minivan chief designer who insisted on personally driving in all 50 US states and Canada. The minivan designer then went back to the drawing board and made significant changes to the design that made sense in North America, but not in Japan (such as having multiple cup-holders for the long journeys typical of that region).

Both of these leaders could have had an underling do this work for them, but the culture of Toyota goes against this delegatory approach.

Implicit in this is that senior leadership need to be bought into and aware of the detail in the domain in order to drive through the changes needed to achieve success.


Go Read the Book

I’ve just scratched the surface here of the principles that can be applied to DevOps from reading TTW.

It’s important not to swallow the kool aid whole here as well. Critiques of The Toyota Way exist (see this article from an American who worked there), and are worth looking at to remind yourself Toyota have not created the utopia that reading The Toyota Way can leave you thinking they have. However, the issues raised there seem to deal with the general challenges of the industry, and the principles not being followed in certain cases (Toyota is a human organisation, after all, not some kind of spiritual production nirvana).

Oh, and at the back of the book there’s also a free ‘how to do Lean consulting’ section that gives you something like a playbook for those that want to consult in this area, or deconstruct what consultants do with you if you bring them in.



Surgically Busting the Docker Cache

What is ‘Busting the Cache’?

If you’ve ever spent any time building Docker images, you will know that Docker caches layers as they are built, and as long as the Dockerfile lines that produce them don’t change, Docker treats the outputted layers as identical.

There’s a problem here. If you go to the network to pick up an artefact, for example with:

RUN curl https://myartefactserver.local/myjar.jar > myjar.jar

then Docker will treat that command as cache-able, even if the artefact has changed.
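If it helps to see why, here is a deliberately simplified Python model of a layer cache. It is a toy, not Docker’s actual implementation (which, for example, also looks at file contents for COPY and ADD), and the build function is a made-up name. The point it illustrates is that the cache key depends on the text of the instruction and its parent layer, never on what the command would fetch.

```python
import hashlib


def build(instructions, cache):
    """Toy layer cache: return the instructions that actually had to run."""
    executed = []
    parent_key = ""
    for instruction in instructions:
        # The key is derived from the parent layer and the instruction text only.
        key = hashlib.sha256((parent_key + "\n" + instruction).encode()).hexdigest()
        if key not in cache:
            cache.add(key)  # "run" the instruction and remember the layer
            executed.append(instruction)
        parent_key = key
    return executed


dockerfile = [
    "FROM ubuntu",
    "RUN apt-get update -y",
    "RUN curl https://myartefactserver.local/myjar.jar > myjar.jar",
]
cache = set()
first = build(dockerfile, cache)   # every instruction runs
second = build(dockerfile, cache)  # nothing runs, even if the jar has changed
```

In this model, changing the text of one instruction changes its key and the key of every instruction after it, while everything before it stays cached.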

Solution 1: --no-cache

The sledgehammer solution to this is to add a --no-cache flag to your build. This removes the caching behaviour, meaning your build will run fully every time, no matter whether the lines of your Dockerfile change or not.

Problem solved? Well… not really. If your build is installing a bunch of other more stable artefacts, like this:

FROM ubuntu
RUN apt-get update -y && apt-get install -y many packages you want to install
# ...
# more commands
# ...
RUN curl https://myartefactserver.local/myjar.jar > myjar.jar
CMD ./run.sh

Then every time you want to do a build, the cycle time is slow as you wait for the image to fully rebuild. This can get very tedious.

Solution 2: Manually Change the Line

You can get round this problem by dropping the --no-cache flag and manually changing the line every time you build. Open up your editor, and change the line like this:

RUN [command]  # sdfjasdgjhadfa

Then the build will re-run from that line onwards, because the changed line no longer matches the cache. But this can get tedious.

Solution 3: Automate the Line Change

But this can get tedious too. So here’s a one-liner that you can put in an alias, or your makefile to ensure the cache is busted at the right point.

First change the line to this:

RUN [command] # bustcache: 

and then change your build command to:

perl -p -i -e "s/(.bustcache:).*/\1 $RANDOM/" Dockerfile && docker build -t tag .

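The perl command replaces whatever follows ‘bustcache:’ on that line with a random number generated by the shell, so the line changes on every build.

There’s a 1/32,768 chance that the number will repeat itself in two consecutive runs (bash’s $RANDOM only ranges from 0 to 32767), but I’m going to ignore that…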
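If perl isn’t to hand, the same substitution can be sketched in Python. This is a rough port rather than the original method, and bust_cache is a hypothetical helper name; it works on the Dockerfile’s text and leaves reading and writing the file to the caller.

```python
import random
import re


def bust_cache(dockerfile_text, token=None):
    """Replace whatever follows 'bustcache:' on each marked line with a token."""
    if token is None:
        # Mirrors bash's $RANDOM, which yields an integer from 0 to 32767.
        token = str(random.randint(0, 32767))
    # '.' does not match newlines by default, so '.*' stops at the end of the line.
    # A function replacement avoids any backslash or group-reference surprises.
    return re.sub(
        r"(bustcache:).*",
        lambda match: f"{match.group(1)} {token}",
        dockerfile_text,
    )
```

Passing an explicit token (a timestamp or a commit hash, say) instead of a random number would remove the small chance of a repeat altogether.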



Software Security Field Guide for the Bewildered

If you have made your way in software for a number of years and you’re not a security specialist, you might be occasionally confronted by someone from ‘security’ who generally says ‘no’ to things you deliver.

For a long time I was in this position and was pretty bewildered by how to interpret what they were saying, or understand how they thought.

Without being trained or working in the field, it can be difficult to discern the underlying principles and distinctions that mark out a security magus from a muggle.

While it’s easy to find specific advice on what practices to avoid

…if you’ve ever been locked in a battle with a security consultant to get something accepted then it can be hard to figure out what rules they are working to.

So here I try and help out anyone in a similar position by attempting to lay out clearly (for the layperson) some of the principles (starting with the big ones) of security analysis before moving onto more detailed matters of definition and technology.

Principles

‘There’s no such thing as a secure system’

The broadest thing to point out that is not immediately obvious to everyone is that security is not a science, it’s an art. There is no such thing as a secure system, so to ask a security consultant ‘is that secure?’ is to invite them to think of you as naive.

Any system that contains information that is in any way private is vulnerable, whether to a simple social engineering attack, or a state-funded attempt to infiltrate your systems that uses multiple ways to attack your system. What security consultants generally try to do is establish both where these weaknesses may be, and how concerned to be about them.

IT Security Is An Art, Not A Science

This makes IT security an art, not a science, which took me some time to catch onto. There’s usually no magic answer to getting your design accepted. Often you can get to a position where some kind of tradeoff between security and risk is evaluated, and that evaluation may get you to acceptance.

Anecdote: I was once in a position where a ‘secrets store’ that used base64 encoding was deemed acceptable for an aPaaS platform because the number of users was deemed low enough for the risk to be acceptable. A marker was put down to review that stance after some time, in case the usage of the platform spread, and a risk item added to ensure that encryption at rest was addressed within two years at the latest.

A corollary of security being an art is that ‘layer 8’ of the stack (politics and religion) can get in the way of your design, especially if it’s in any way novel. Security processes tend to be an accretion of: specific directions derived from regulations; the vestigial scars of past breaches; personal prejudice; and plain superstition.

Trust Has to Begin Somewhere

Often when you are discussing security with people you end up in a ‘turtles all the way down’ scenario, where you wonder how anything can be done because nothing is ever trusted.

Anecdote: I have witnessed a discussion with a (junior) security consultant where a demand was made to encrypt a public key, based on a general injunction that ‘all data must be encrypted’. ‘Using what key?’ was the natural question, but an answer was not forthcoming…

The plain fact is that everyone has to trust something at some point in order to move information around anything. Examples of things you might (or might not) trust are:

  • The veracity of the output of dmesg on a Linux VM
  • The Chef server keys stored on your hardened VM image
  • Responses to calls to the metadata IP address when running on AWS (viz: http://169.254.169.254)
  • That Alice in Accounts will not publish her password on Twitter
  • That whatever is in RAM has not been tampered with or stolen
  • The root public keys shipped with your browser

Determine Your Points of Trust

Very often determining what you are allowed to trust is the key to unlocking various security conundrums when designing systems. When you find a point of trust, exploit it (in a good way) as much as you can in your designs. If you’ve created a new point of trust as part of your designs, then prepare to be challenged.

Responsibility Has to End Somewhere

When you trust something, usually someone or something must be held responsible when it fails to honour that trust. If Alice publishes her password on Twitter, and the company accounts are leaked to the press, then Alice is held responsible for that failure of trust. Establishing and making clear where responsibility would lie in the event of a failure of trust is also a way of getting your design accepted in the real world.

What counts as an acceptable level of trust to place in Alice will depend on what her password gives her access to. Often there are data classification levels which determine minimum requirements before trust can be given for access to that data. At the extreme end of “secret”, root private keys can be subject to complex ceremonies that attempt to ensure that no one person can hijack the process for their own ends.

Consequences of Failure Determine Level of Paranoia

Another principle that follows from the ‘security is an art, not a science’ principle is that the extent to which you fret about security will depend on the consequences of failure. The loss of a password that allows someone to read some publicly-available data stored on a company server will not in itself demand much scrutiny from security.

The loss of a root private key, however, is about as bad as it can get from a security standpoint, as that can potentially give access to all data across the entire domain of that key hierarchy.

If you want to reduce the level of scrutiny your design gets put under, reduce the consequences of a breach.



Key Distinctions

If you want to keep pace with a security consultant as they explain their concerns to you, then there are certain key distinctions that they may frequently refer to, and assume you understand.

Getting these distinctions and concepts under your belt will help you convince the security folks that you know what you’re doing.

Encryption vs Encoding

This is a 101 distinction you should grasp.

Encoding is converting some data into some other format. Anyone who understands the encoding can convert the data back into readable form. ASCII and UTF-8 are examples of encodings that map between numbers and characters. If you give someone some encoded data, it won’t take them long to figure out what the data is, unless the encoding is extremely complex or obscure.

Encryption involves needing some secret or secure process to get access to the data, like a private ‘key’ that you store in your ~/.ssh folder. A key is just a number that’s very difficult to guess, like your house key’s (probably) unique shape. Without access to that secret key, you can’t work out what that data is without a lot of resources (sometimes more than all the world’s current computing power) to overcome the mathematical challenge.
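A quick way to convince yourself that encoding is not protection is to round-trip a value through base64, the encoding from the ‘secrets store’ anecdote above. This is a small Python sketch using only the standard library; the secret value is made up.

```python
import base64

secret = "hunter2"  # hypothetical value, for illustration only

# Encoding: anyone who recognises base64 can reverse it. No key is involved.
encoded = base64.b64encode(secret.encode("utf-8")).decode("ascii")
decoded = base64.b64decode(encoded).decode("utf-8")

if __name__ == "__main__":
    print(encoded)
    print(decoded)
```

The encoded form looks scrambled to a human, but decoding it needs nothing that an attacker doesn’t already have. Encryption differs precisely in that the reverse step needs a key.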

Hashing vs Encryption

Hashing and encryption are also easily confused. Hashing is the process of turning one set of data into another through a reproducible algorithm. The key point about hashing is that the data goes one-way. If you have the hash value (say, ae5690f1aff) then you can’t easily reverse that to the original.

Hashing has a weakness. Let’s say you ‘md5sum’ an insecure password like password. You will always get the value 5f4dcc3b5aa765d61d8327deb882cf99 from the hash.

If you store that hashed password in a database, then anyone can google it to find out what your password really is, even though it’s a hash. Try it with other commonly-used passwords to see what happens.

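This is why it’s important to ‘salt’ your hash, mixing a random value unique to each password into the input before hashing (the salt is stored alongside the hash), so that knowledge of the hash algorithm and a list of common passwords isn’t enough to crack a lot of passwords at once.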
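Here is a minimal Python sketch of the difference, using only the standard library. It assumes the usual approach of a random salt stored next to each hash, and it swaps MD5 for PBKDF2 for the salted version; the function names and the iteration count are illustrative choices, not a recommendation.

```python
import hashlib
import hmac
import os


def md5_hex(password):
    """Unsalted hash: the same password always gives the same, googleable value."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()


def hash_password(password, salt=None):
    """Return (salt, digest) using a random per-password salt."""
    if salt is None:
        salt = os.urandom(16)
    # PBKDF2 is deliberately slow; 100_000 iterations is an illustrative figure.
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash with the stored salt and compare in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)
```

Two users who both choose password now end up with different stored values, so a single lookup table no longer cracks them both.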

Authentication vs Authorization

Sometimes shortened to ‘authn’ and ‘authz’, this distinction is another standard one that gets slipped into security discussions.

Authentication

Authentication is the process of determining what your identity is. The one we’re all familiar with is photo id. You have a document with a name and a photo on it that’s hard to fake (and therefore ‘trusted’), and when asked to prove who you are you produce this document and it’s examined before law enforcement or customs accepts your claimed identity.

There have been many interesting ways to identify authenticity of identity. My favourite is the scene in Big where the Tom Hanks character has to persuade his friend that he is who he says he is, even though he’s trapped in the body of a man:

Shared Secret Authentication

To achieve this he uses a shared secret: a song (and associated dance data) that only they both know. Of course it’s possible that the song was overheard or some government agency had listened in to their conversations for years to fake the authentication, but the chances of this are minimal, and would raise the question of: why would they bother?

What would justify that level of resources just to trick a boy into believing something so ludicrous? This is another key question that can be asked when evaluating the security of a design.
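In software, a shared secret is usually proven without saying it out loud. One common way (my choice for this sketch, not something from the film) is a challenge-response using an HMAC: the verifier sends a random challenge, and the other side returns a value that can only be computed with the secret. The names and the secret below are made up.

```python
import hashlib
import hmac
import os

SHARED_SECRET = b"the song only we know"  # hypothetical secret


def respond(challenge, secret):
    """Prove knowledge of the secret without sending the secret itself."""
    return hmac.new(secret, challenge, hashlib.sha256).digest()


def verify(challenge, response, secret):
    """Recompute the expected response and compare in constant time."""
    expected = respond(challenge, secret)
    return hmac.compare_digest(expected, response)


# A fresh random challenge each time stops an eavesdropper replaying an old answer.
challenge = os.urandom(16)
```

An eavesdropper who overhears one exchange learns a response to one challenge, not the song itself.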

The other example I like is the classic spy trope of using two halves of a torn postcard, giving one half to each side of a communication, making a ‘symmetric key’ that is difficult to forge unless you have access to one side of it:

Symmetric Key Encryption

Symmetric vs Asymmetric Keys

This also exemplifies nicely what a symmetric key is. It’s a key that is ‘the same’ one used on both sides of the communication. A torn postcard is not ‘the same’ on both sides, but it can be argued that if you have one part of it, it’s relatively easy to fake the other. This could be complicated if the back of the postcard had some other message known only to both sides written on it. Such a message would be harder to fake since you’d have to know the message in both people’s minds.

An asymmetric key is one where access to the key used to encrypt the message does not imply access to decrypt the message. Public key encryption is an example of this: anyone can encrypt a message with the public key, but the private key is kept secret by the receiver. Anyone can know the public key (and write a message using it), but only the holder of the private key can read the message.
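To see the asymmetry in action, here is a toy RSA example in plain Python. The primes are tiny and the scheme has no padding, so it is hopelessly insecure and for illustration only; real systems use a vetted library and keys thousands of bits long.

```python
# Toy RSA with tiny primes: insecure, for illustration only.
p, q = 61, 53
n = p * q                  # public modulus
phi = (p - 1) * (q - 1)
e = 17                     # public exponent; (e, n) is the public key
d = pow(e, -1, phi)        # private exponent (modular inverse, Python 3.8+)


def encrypt(message, public_exponent, modulus):
    """Anyone holding the public key can do this."""
    return pow(message, public_exponent, modulus)


def decrypt(ciphertext, private_exponent, modulus):
    """Only the holder of the private exponent can undo it."""
    return pow(ciphertext, private_exponent, modulus)


message = 65               # must be an integer smaller than n
ciphertext = encrypt(message, e, n)
recovered = decrypt(ciphertext, d, n)
```

Knowing e and n is enough to encrypt, but working out d means knowing p and q, which for realistically sized numbers means factoring n.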

No authentication process is completely secure (remember, nothing is secure, right?), but you can say that you have prohibitively raised the cost of cheating security by demanding evidence of authenticity (such as a passport or a driver’s license) that is costly to fake, to the point where it’s reasonable to say acceptably few parties would bother.

If the identification object itself contains no information (like a bearer token), then there is an additional level of security, as you have to both own the object and know what it’s for. So even if the key is lost, more has to happen before there is a compromise of the system.

Authorization

Authorization is the process of determining whether you are allowed to do something or not. While authentication is a binary fact about one piece of information (you are either who you say you are, or you are not), authorization will depend on both who you are and what you are asking to do.

In other words: Dave is still Dave. But Dave can’t open the bay doors anymore. Sorry Dave.
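In code, the two questions usually end up as two separate checks. The Python sketch below keeps them apart; the credential store, the permission table and the function names are all made up, and a real system would hold salted password hashes rather than plain-text secrets.

```python
import hmac

# Hypothetical data, for illustration only.
CREDENTIALS = {"dave": "open-sesame"}
PERMISSIONS = {"dave": {"read logs"}}


def authenticate(user, secret):
    """Authn: is this caller really who they claim to be?"""
    expected = CREDENTIALS.get(user)
    if expected is None:
        return False
    return hmac.compare_digest(expected.encode("utf-8"), secret.encode("utf-8"))


def authorize(user, action):
    """Authz: is this (already authenticated) user allowed to do this?"""
    return action in PERMISSIONS.get(user, set())
```

With this data, Dave passes authenticate with the right secret, but authorize refuses him the bay doors because that action is not in his permission set.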

Concepts

RBAC

Following on from Authentication and Authorization, Role-Based Access Control gives permission to a more abstract entity called a role.

Rather than giving access to that user directly, you give the user access to the role, and then that role has the access permissions set for it. This abstraction allows you to manage large sets of users more easily. If you have thousands of users that have access to the same role, then changing that role is easier than going through thousands of users one-by-one and changing their permissions.

To take a concrete example, you might think of a police officer as having access to the ‘police officer’ role in society, which gives them permission to stop someone acting suspiciously in addition to their ‘civilian’ role permissions. If they quit, that role is taken away from them, but they’re still the same person.
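A minimal Python sketch of the same idea, with invented role and permission names: permissions hang off roles, users hang off roles, and nothing is granted to a user directly.

```python
# Hypothetical roles and permissions, for illustration only.
ROLE_PERMISSIONS = {
    "civilian": {"walk down the street"},
    "police officer": {"stop a suspicious person"},
}

user_roles = {"alice": {"civilian", "police officer"}}


def permissions_for(user):
    """Union of the permissions of every role the user currently holds."""
    perms = set()
    for role in user_roles.get(user, set()):
        perms |= ROLE_PERMISSIONS.get(role, set())
    return perms


def is_allowed(user, action):
    return action in permissions_for(user)
```

Removing ‘police officer’ from Alice’s roles takes the stop-and-search permission away in one step while leaving her civilian permissions untouched, and changing what a role may do updates every holder of that role at once.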

Security Through Obscurity

Security through obscurity is security that relies on the design of a system staying secret. In other words, if the design of your system were to become public then it would be easy to break.

Placing your house key under a plant next to the door, or under the doormat would be the classic example. Anyone aware of this security ‘design’ (keeping the key in some easy-to-remember place near the door) would have no trouble breaking into that house.

By contrast, the fact that you know that I use public key encryption for my ssh connections, and even the specifics of the algorithms and ciphers used in those communications does not give you any advantage in breaking in. The security of the system depends on maths, specifically the difficulty in factoring a specific class of large numbers.

If there are weaknesses in these algorithms then they’re not publicly known. That doesn’t preclude the possibility that someone, somewhere can break them (state security agencies are often well ahead of their time in cryptography, and don’t share their knowledge, for obvious reasons).

‘Anybody wanna shut down the Federal Reserve?’

It’s a cliche to say that security through obscurity is bad, but it can be quite effective at slowing an attacker down. What’s bad about it is when you depend on security through obscurity for the integrity of your system.

An example of security through obscurity being ‘acceptable’ might be if you run an ssh server on (say) port 8732 rather than 22. You depend on ssh security, but the security through obscurity of running on a non-standard port prevents casual attackers from ‘seeing’ that your port 22 is open, and as a secondary effect can also prevent your ssh logs from getting overloaded (perhaps exposing you to other kinds of attack). But any cracker worth her salt wouldn’t be put off by this security measure alone.

If you really want to impress your security consultant, then casually mention Kerckhoffs’s Principle, which is a more formal way of saying ‘security through obscurity is not sufficient’.

Principle of Least Privilege

The principle of least privilege states that any process, user or program should have only the privileges it needs to do its job.

Authentication works the same way, but authorization is only allowed for a minimal set of functions. This reduces the blast radius of compromise.

Blast radius is a metaphor from nuclear weapons technology. IT people use it in various contexts to make what they do sound significant.

A simple example might be a process that starts as root (because it might need access to a low-numbered port, like an http server), but then drops down to an unprivileged user. This ensures that if the server is compromised after that initial startup then the consequences would be far less severe than before. It is then up for debate whether that level of security is sufficient.
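Here is a rough sketch of what that ‘drop down’ step can look like on a POSIX system. The function name and the choice of the ‘nobody’ account are my own assumptions for illustration; a real server would do this straight after binding its low-numbered port.

```python
import os
import pwd


def drop_privileges(username: str = "nobody") -> bool:
    """Switch from root to an unprivileged account.

    Returns True if privileges were dropped, False if we were not root.
    """
    if os.getuid() != 0:
        return False  # not running as root, so there is nothing to drop
    account = pwd.getpwnam(username)
    os.setgroups([])            # shed root's supplementary groups first
    os.setgid(account.pw_gid)   # change group while we are still allowed to
    os.setuid(account.pw_uid)   # finally give up root; this cannot be undone
    return True
```

The order matters: once setuid has given up root, the process no longer has the right to change its groups, so the group changes have to come first.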

Anecdote: I once worked somewhere where the standard http server had this temporary root access removed. Users had to run on a higher-numbered port and low-numbered ports were run on more restricted servers.

In certain NSA-type situations, you can even get data stores that users can write to, but not read back! For example, if a junior security agent submits a report to a senior, they then get no access to that document once submitted. This gives the junior the minimal level of privilege they need to do their job. If they could read the data back, then that increases the risk of compromise as the data would potentially be in multiple places instead of just one.

Blast Radius

There are other ways of reducing the blast radius of compromise. One way is to use tokens for authentication and authorization that have very limited scope.

At an extreme, an admin user of a server might receive a token to log into it (from a highly secured ‘login server’) that:

  • can only be used once
  • limits the session to two minutes
  • expires in five minutes
  • can only perform a very limited action (eg change a single file)
  • can only be used from a specific subnet

If that token is somehow lost (or copied) in transit, then it can only be used before the intended recipient uses it (and within five minutes of being issued), for a session of at most two minutes. The damage should be limited to a specific file, and then if (and only if) the user misusing the token already has access to the specified network.
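To make those restrictions concrete, here is a small sketch of how a server might check such a token. It is not how any particular product does it; the class, field names and example values are all hypothetical.

```python
import ipaddress
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScopedToken:
    """A hypothetical single-use, short-lived, narrowly scoped token."""
    allowed_action: str            # e.g. "write:/etc/motd" (made-up format)
    allowed_subnet: str            # e.g. "10.0.5.0/24"
    issued_at: float = field(default_factory=time.time)
    ttl_seconds: int = 300         # expires in five minutes
    session_seconds: int = 120     # session limited to two minutes
    used: bool = False

    def redeem(self, action: str, client_ip: str,
               now: Optional[float] = None) -> float:
        """Validate the token; return the time the session must end."""
        now = time.time() if now is None else now
        if self.used:
            raise PermissionError("token already used")
        if now > self.issued_at + self.ttl_seconds:
            raise PermissionError("token expired")
        if action != self.allowed_action:
            raise PermissionError("action not permitted by this token")
        network = ipaddress.ip_network(self.allowed_subnet)
        if ipaddress.ip_address(client_ip) not in network:
            raise PermissionError("client is outside the permitted subnet")
        self.used = True           # single use: burn the token on success
        return now + self.session_seconds
```

Each check corresponds to one bullet in the list above, and any one of them failing is enough to refuse the request.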

By limiting the privileges and access that the token has, the cost of failure is greatly reduced. Of course, this focusses a large amount of risk onto the login server. If the login server itself were compromised then the blast radius would be huge, but it’s often easier for organisations to manage that risk centrally as a single cost rather than spreading it across a wide set of systems. In the end, you’ve got to trust something.

Features like these are available in HashiCorp’s Vault product, which centralises secrets management with open source code. It’s the most well-known, but other products are available.

N-Factor Authentication

You might have noticed in the ‘Too Many Secrets’ clip from the film Sneakers above that access to all the systems was granted simply by being able to decrypt the communications. You could call this one-factor authentication, since it was assumed that the identity of the user was ‘admin’ just by virtue of having the key to the system.

Of course, in the real world that situation would not exist today. I would hope that the Federal Reserve money transfer system would at least have a login screen as well before you identify yourself as someone that can move funds arbitrarily around the world.

A login page can also be regarded as one-factor authentication, as the password (or token) is the only secret piece of information required to prove authenticity.

Multi-factor authentication makes sure that the loss of one piece of authentication information is not sufficient to get access to the system. You might need a password (something you know), and a secret pin (another thing you know), and a number generated by your mobile phone, and a fingerprint, and the name of your first pet. That would be 5-factor authentication.
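The ‘number generated by your mobile phone’ is often a time-based one-time password. I am assuming the TOTP scheme from RFC 6238 here, since nothing above says which scheme is in use, and the function names are mine. The sketch shows a second factor being checked alongside a password: both have to pass.

```python
import hashlib
import hmac
import struct
import time


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) for a given counter."""
    digest = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation offset
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret: bytes, at=None, step: int = 30, digits: int = 6) -> str:
    """Time-based variant: the counter is the number of 30-second steps."""
    at = time.time() if at is None else at
    return hotp(secret, int(at // step), digits)


def verify_login(password_ok: bool, submitted_code: str,
                 secret: bytes, at=None) -> bool:
    """Two factors: a correct password AND a correct phone code."""
    code_ok = hmac.compare_digest(submitted_code, totp(secret, at))
    return password_ok and code_ok
```

The phone and the server share the secret, so each can compute the same six digits for the current 30-second window. Stealing the password alone is not enough, and neither is stealing the phone alone.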

Of course, all this is undermined if the recovery process sends a link to an authentication reset to an email address that isn’t so well secured. All it takes then is for an attacker to compromise your email, and then tell the system that you’ve lost your login credentials. If your email is zero- or one-factor authentication then the system is only as secure as that and all the work to make it multi-factor has been wasted.

This is why you get those ‘recovery questions’ that supposedly only you know (name of your first pet). Then, when people forget those, you get other recovery processes, like sending a letter to your home with a one-time password on it (which of course means trusting the postal service end-to-end), or an SMS (which means trusting the network carrier’s security). Once again, it’s ‘things you can trust’ all the way down.

So it goes.

Acceptable Risk and Isolation

We’ve touched on this already above when discussing the ‘prohibitive cost of compromising a system’ and the ‘consequences of a breach’, but it’s worth making explicit the concept of ‘acceptable risk’. An acceptable risk is a risk that is known about, but whose consequences of compromise are less than the effort of mitigating it.

A sensible organisation concerned about security in the real world will have provisions for these situations in their security standards, as it could potentially save a lot of effectively pointless effort at the company level.

For example, a username/password combination may be sufficient to secure an internal hotel booking system. Even if that system were compromised, then (it might be argued) you would still need to compromise the credit card system to exploit it for material gain.

The security consultant may raise another factor at this point, specifically: whether the system is appropriately isolated. If your hotel booking system sits on the same server as your core transaction system, then an exploit of the booking system could result in the compromise of your core transaction system.

Sometimes, asking a security consultant “is that an acceptable risk?” can yield surprising results, since they may be so locked into saying ‘no’ that they may have overlooked the possibility that the security standards they’re working to do indeed allow for a more ‘risk-based’ approach.

Conclusion

That was a pretty quick tour through a lot of security concepts that will hopefully help you if you are bewildered by security conversations.

If I missed anything out, please let me know: @ianmiell on Twitter.

The Lazy Person’s Guide to the Info Command

Most people who use Linux pretty quickly learn about man pages, and how to navigate them with their preferred pager (usually less these days).

Less well known are the info pages. If you’ve never come across them, these look like man pages, and contain similar information, but are invoked like this:

info grep

Over the past couple of decades I often found myself looking at an info page and wondering how to navigate it, hitting various keys and getting lost and frustrated.

What Do I Do Now?

I tried man info, but that didn’t tell me how to navigate the pages. More rarely I would try info info, but didn’t have the time or patience to follow the tutorial there and then as I was busy trying to get some information, stat.

The other day I finally had enough and decided to take the time to sit down and learn it properly. It didn’t take that long, but I figured there was a case for writing down a helpful guide for new users that just want to get going.

The Bare Minimum

Here’s the bare minimum you need to read through an info page without ever getting lost:

  • ] – next page
  • [ – previous page
  • space – page down within page
  • backspace – page up within page (b jumps back to the top of the page)
  • q – quit

If you want to get commands into your muscle memory as fast as possible, focus on these. It won’t get you round pages efficiently, but you won’t wonder how to get back to where you were, or how you got where you are. If you’re a very casual user, stop here and come back later when you get fed up of spinning forwards and backwards through pages to find something.

Try it with something like info sed.

Levelling Up

If you want to get to the next level with info, then these commands will help:

  • n – next page in this level
  • p – previous page in this level
  • return – jump to page ‘lower down’
  • l – go back to the last node seen
  • u – go ‘up’ a level

info has a hierarchical structure. There is a top-level page, and then ‘child’ pages that can have other pages at the same ‘level’. To go to the next page at the same level you can hit the n key. To go back to the previous page at the same level you hit p.

Occasionally you will get an item that allows you to ‘jump down’ a level by hitting the return key. For example, by placing the cursor on the ‘Definitions’ line below and hitting return you will be taken to the ‘Definitions’ page:

* Introduction::                An introduction to the shell.
* Definitions::                 Some definitions used.

To return to the page you were last on at any point, you can hit l (for ‘last page’) and you will be returned to the top of that page. Or if you want to go ‘up’ a level, type u.
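If the difference between ] and n is still fuzzy, it may help to picture the pages as a tree. The sketch below is my own toy model of that structure, not how info is implemented, and the page names are only illustrative.

```python
class Node:
    """One info page, linked to its parent and its children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)


def sibling(node, offset):
    """n is sibling(node, +1) and p is sibling(node, -1); u is node.parent."""
    if node.parent is None:
        return None
    pages = node.parent.children
    i = pages.index(node) + offset
    return pages[i] if 0 <= i < len(pages) else None


def next_in_document(node):
    """]: the next page in reading order, descending into children first."""
    if node.children:
        return node.children[0]
    while node is not None:
        following = sibling(node, +1)
        if following is not None:
            return following
        node = node.parent
    return None


top = Node("Top")
intro = Node("Introduction", top)
what_is = Node("What is Bash?", intro)   # a child page of Introduction
defs = Node("Definitions", top)
```

From ‘Introduction’, n skips straight to ‘Definitions’ because it stays at the same level, whereas ] steps down into ‘What is Bash?’ first and only reaches ‘Definitions’ on the next press. That is why ] and [ never leave you lost: they visit every page in order.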

Still Interested?

If you’re still interested then you might want to read through info info carefully, but before you do here’s a couple of final tips to help avoid getting lost in that set of pages (which I have done more than once).

First, when you get stuck or want to dig in further, you can get help:

  • ? – show the info commands window
  • h – open the general help window

Confusingly, these options open up a half-window that, in the case of h at least, gives no indication of how to close it down again. Here’s how:

  • C-x 0 – close the window

Hitting CTRL and x together, followed by 0 gets you out.

Why Bother?

You might wonder what the point of learning to read info pages is.

For me, the main reasons are:

  • They are often far more detailed (and more structured) than man pages
  • They are more definitive and complete. The grep info page, for example, contains a great set of examples, a discussion on performance, and an introduction to regular expressions. In fact, they’re intended to be mini books that can be printed off when converted to the appropriate format
  • You can irritate and/or intimidate colleagues by dismissing man page usage as ‘inferior’ and asserting that real engineers use info (joke)

Aside from anything else, I find getting fluent with these pieces of relative arcana satisfying. Maybe it’s just me.

