“Who Should Write the Terraform?”

The Problem

Working in Cloud Native consulting, I’m often asked about who should do various bits of ‘the platform work’.

I’m asked this in various forms, and at various levels, but the title’s question (‘Who should write the Terraform?’) is a fairly typical one. Consultants are often asked simple questions that invite simple answers, but it’s our job to frustrate our clients, so I invariably say “it depends”.

The reason it depends is that the answers to these seemingly simple questions are very context-dependent. Even if there is an ‘ideal’ answer, the world is not ideal, and the best thing for a client at that time might not be the best thing for the industry in general.

So here I attempt to lay out the factors that help me answer that question as honestly as possible. But before that, we need to lay out some background.

Here’s an overview of the flow of the piece:

  • What is a platform?
  • How we got here
    • Coders and Sysadmins became…
    • Dev and Ops, but silos and slow time to market, so…
    • DevOps, but not practical, so…
    • SRE and Platforms
  • The factors that matter
    • Non-negotiable standards
    • Developer capability
    • Management capability
    • Platform capability
    • Time to market

What is a Platform?

Those old enough to remember when the word ‘middleware’ was everywhere will know that many industry terms are so vague or generic as to be meaningless. However, for ‘platform’ work we have a handy definition, courtesy of Team Topologies:

The purpose of a platform team is to enable stream-aligned teams to deliver
work with substantial autonomy. The stream-aligned team maintains full
ownership of building, running, and fixing their application in production.
The platform team provides internal services to reduce the cognitive load
that would be required from stream-aligned teams to develop these
underlying services.

Team Topologies, Matthew Skelton and Manuel Pais

A platform team, therefore, (and putting it crudely) builds the stuff that lets others build and run their stuff.

So… is the Terraform written centrally, or by the stream-aligned teams?

To explain how I would answer that, I’m going to have to do a little history.

How We Got Here

Coders and Sysadmins

In simpler times – after the Unix epoch and before the dotcom boom – there were coders and there were sysadmins. These two groups speciated from the generic ‘computer person’ that companies found they had to have on the payroll (whether they liked it or not) in the 1970s and 80s.

As a rule, the coders liked to code and make computers do new stuff, and the sysadmins liked to make sure said computers worked smoothly. Coders would eagerly explain that with some easily acquired new kit, they could revolutionise things for the business, while sysadmins would roll their eyes and ask how this would affect user management, or interoperability, or stability, or account management, or some other boring subject no-one wanted to hear about anymore.

I mention this because this pattern has not changed. Not one bit. Let’s move on.

Dev and Ops

Time passed, and the Internet took over the world. Now we had businesses running websites as well as their internal machines and internal networks. Those websites were initially given to the sysadmins to run. Over time, these websites became more and more important for the bottom line, so eventually, the sysadmins either remained sysadmins and looked after ‘IT’, or became ‘operations’ (Ops) staff and looked after the public-facing software systems.

Capable sysadmins had always liked writing scripts to automate manual tasks (hence the t-shirt), and this tendency continued (sometimes) in Ops, with automation becoming the defining characteristic of modern Ops.

Eventually a rich infrastructure emerged around the work. ‘Pipelines’ started to replace ‘release scripts’, and concepts like ‘continuous integration’, and ‘package management’ arose. But we’re jumping ahead a bit; this came in the DevOps era.

Coders, meanwhile, spent less and less time doing clever things with chip registers and more and more time wrangling different software systems and APIs to do their business’s bidding. They stopped being called ‘coders’ and started being called ‘developers’.

So ‘Devs’ dev’d, and ‘Ops’ ops’ed.

These groups grew in size and proportion of the payroll as software started to ‘eat the world’.

In reality, of course, there was a lot of overlap between the two groups, and people would often move from one side of the fence to the other. But the distinction remained, and became organisational orthodoxy.

Dev and Ops Inefficiencies

As this Dev and Ops pattern became bedded into organisations, people noted some inefficiencies with this state of affairs:

  • Release overhead
  • Misplaced expertise
  • Cost

First, there was a release overhead as Dev teams passed changes to Ops. Ops teams typically required instructions for how to do releases, and in a pre-automation age these were often prone to error without app-specific or even release-specific knowledge. About 15 years ago I was present at a very fractious argument between a software supplier and its client’s Ops team after an outage. The Ops team had attempted to follow the release instructions, had not followed them correctly, and an outage resulted. There was much swearing as the Ops team remonstrated that the instructions were not clear enough, while the Devs argued that if the instructions had been followed properly then it would have worked. Fun.

Second, Ops teams didn’t know in detail what they were releasing, so couldn’t fix things if they went wrong. The best they could do was restart things and hope they worked.

Third, Ops teams looked expensive to management. They didn’t deliver ‘new value’, just farmed existing value, and appeared slow to respond and risk-averse.

I mention this because this pattern has not changed. Not one bit. Let’s move on.

These and other inefficiencies were characterised as ‘silos’ – unhelpful and wasteful separations of teams for (apparently) no good purpose. Frictions increased as these mismatches were exacerbated by embedded organisational separation.

The solution was clearly to get rid of the separation: no more silos!

Enter DevOps

The ‘no more silos’ battle cry got a catchy name – DevOps. The phrase was usefully vague and argued over for years, just as Agile was and is (see here). DevOps is defined by Wikipedia as ‘a set of practices that combines software development (Dev) and IT operations (Ops)’.

At the purest extreme, DevOps is the movement of all infrastructure and operational work and responsibilities (ie ‘delivery dependencies’) into the development team.

This sounded great in theory. It would:

  • Place the operational knowledge within the development team, where its members could more efficiently collaborate in tighter iterations
  • Deliver faster – no more waiting weeks for the Ops team to schedule a release, or waiting for Ops to provide some key functionality to the development team
  • Bring the costs of operations closer to the value (more exactly: the development team bore the cost of infrastructure and operations as part of the value stream), making P&L decisions closer to the ‘truth’

DevOps Didn’t

But despite a lot of effort, the vast majority of organisations couldn’t make this ideal work in practice, even when they tried. The reasons for this were systemic; some of them are listed below:

  • Absent an existential threat, the necessary organisational changes were more difficult to make. This constraint limited the willingness or capability to make any of the other necessary changes
  • The organisational roots of the Ops team were too deep. You couldn’t uproot the metaphorical tree of Ops without disrupting the business in all sorts of ways
  • There were regulatory reasons to centralise Ops work which made distribution very costly
  • The development team didn’t want to – or couldn’t – do the Ops work
  • It was more expensive. Since some work would necessarily be duplicated, you couldn’t simply distribute the existing Ops team members across the development teams, you’d have to hire more staff in, increasing cost

I said ‘the vast majority’ of organisations couldn’t move to DevOps, but there are exceptions. The exceptions I’ve seen in the wild implemented a purer form of DevOps when there existed:

  • Strong engineering cultures where teams full of T-shaped engineers want to take control of all aspects of delivery AND
  • No requirement for centralised control (eg regulatory/security constraints)

and/or,

  • A gradual (perhaps guided) evolution over time towards the breaking up of services and distribution of responsibility

and/or,

  • Strong management support and drive to enable

The most famous example of ‘strong management support’ is Amazon, where so-called ‘two-pizza’ teams must deliver and support their products independently. (I’ve never worked for Amazon, so I have no direct experience of the reality of this.) This, notably, was the product of a management edict to ensure teams operated independently.

When I think of this DevOps ideal, I think of a company with multiple teams each independently maintaining their own discrete marketing websites in the cloud. Not many businesses have that kind of context and topology.

Enter SRE and Platforms

Of the reasons listed above for the failure of DevOps, one was critical: expense.

Centralisation, for all its bureaucratic and slow-moving faults, can result in vastly cheaper and more scalable delivery across the business. Any dollar spent at the centre can save n dollars across your teams, where n is the number of teams consuming the platform.

The most notable example of this approach is Google, who have a few workloads to run, and built their own platform to run them on. Kubernetes is a descendant of that internal platform.

It’s no coincidence that Google came up with DevOps’s fraternal concept: SRE. SRE emphasised the importance of getting Dev skills into Ops rather than making Dev and Ops a single organisational unit. This worked well at Google primarily because there was an engineering culture at the centre of the business, and an ability to understand the value of investing in the centre rather than chasing features. Banks (who might well benefit from a similar way of thinking) are dreadful at managing and investing in centralised platforms, because they are not fundamentally tech companies (they are defenders of banking monopoly licences, but that’s a post for another day, also see here).

So across the industry, those that might have been branded sysadmins first rebranded themselves as Ops, then as DevOps, and finally SREs. Meanwhile they’re mostly the same people doing similar work.

Why the History Lesson?

What’s the point of this long historical digression?

Well, it’s to explain that, with a few exceptions, the division between Dev and Ops, and between centralisation and distribution of responsibility has never been resolved. And the reasons why the industry seems to see-saw are the same reasons why the answer to the original question is never simple.

Right now, thanks to the SRE movement (and Kubernetes, which is a trojan horse leading you away from cloud lock-in), there is a fashion-swing back to centralisation. But that might change again in a few years.

And it’s in this historical milieu that I get asked questions about who should be responsible for what with respect to work that could be centralised.

The Factors

Here are the factors, in rough order of importance, that play into the advice I might give in answer to these questions.

Factor One: Non-Negotiable Standards

If you have standards or practices that must be enforced on teams for legal, regulatory, or business reasons, then at least some work needs to be done at the centre.

Examples of this include:

  • Demonstrable separation of duties between Dev and Ops
  • User management and role-based access controls

Performing an audit on one team is obviously significantly cheaper than auditing a hundred teams. Further, with an audit, the majority of expense is not in the audit but the follow-on rework. The cost of that can be reduced significantly if a team is experienced at knowing from the start what’s required to get through an audit. For these reasons, the cost of an audit across your 100 dev teams can be more than 100x the cost of a single audit at the centre.

Factor Two: Engineer Capability

Development teams vary significantly in their willingness to take on work and responsibilities outside their existing domain of expertise. This can have a significant effect on who does what.

Anecdote: I once worked for a business that had a centralised DBA team, who managed databases for thousands of teams. There were endless complaints about the time taken to get ‘central IT’ to do their bidding, and frequent demands for more autonomy and freedom.

A cloud project was initiated by the centralised DBA team to enable that autonomy. It was explained that since the teams could now provision their own database instances in response to their demands, they would no longer have a central DBA team to call on.

Cue howls of despair from the development teams: they needed a centralised DBA service, they said, because they didn’t want to take this responsibility on and didn’t have the skills.

Another example is embedded in the title question about Terraform. Development teams often don’t want to learn the skills needed for a change of delivery approach. They just want to carry on writing in whatever language they were hired to write in.

This is where organisational structures like ‘cloud native centres of excellence’ (who just ‘advise’ on how to use new technologies), or ‘federated devops teams’ (where engineers are seconded to teams to spread knowledge and experience) come from. The idea with these ‘enabling teams’ is that once their job is done they are disbanded. Anyone who knows anything about political or organisational history knows that these plans to self-destruct often don’t pan out that way, and you’re either stuck with them forever, or some put-upon central team gets given responsibility for the code in perpetuity.

Factor Three: Management Capability

While the economic benefits of having a centralised team doing shared work may seem intuitively obvious, senior management in various businesses are often not able to understand its value, and manage it as a pure cost centre.

This is arguably due to internal accounting assumptions. Put simply, the value gained from centralised work is not traced back to profit calculations, so it is seen as pure cost. (I wrote a little about non-obvious business value here.)

In companies with competent technical management, the value gained from centralised work is (implicitly, due to an understanding of the actual work involved) seen as valuable. This is why a tech firm like Google can successfully manage a large-scale platform, and why it gave birth to SRE and Kubernetes, two icons of tech org centralisation. It’s interesting that Amazon – with its roots in retail, distribution, and logistics – takes a radically different, distributed approach.

If your organisation is not capable of managing centralised platform work, then it may well be more effective to distribute the work across the feature teams, so that cost and value can be more easily measured and compared.

Factor Four: Platform Team Capability

Here we are back to the old-fashioned silo problem. One of the most common complaints about centralised teams is that they fail to deliver what teams actually need, or deliver it in a way that those teams can’t easily consume.

Often this is because of the ‘non-negotiable standards’ factor above resulting in security controls that stifle innovation. But it can also be because the platform team is not interested, incentivised, or capable enough to deliver what the teams need. In these latter cases, it can be very inefficient or even harmful to get them to do some of the platform work.

This factor can be mitigated with good management. I’ve seen great benefits from moving people around the business so they can see the constraints other people work under (a common principle in the DevOps movement) rather than just complain about their work. However, as we’ve seen, poor management is often already a problem, so this can be a non-starter.

Factor Five: Time to Market

Another significant factor is whether it’s important to keep the time to delivery low. Retail banks don’t really care about time to delivery. They may say they do, but the reality is that they care far more about not risking their banking licence and not causing outages that attract the interest of regulators. Hedge funds, by contrast, might care very much about time to market: they are far more lightly regulated and want to exploit any edge they have as quickly as possible. Retail banks therefore tend towards centralised organisational architectures, while hedge funds devolve responsibility as close to the feature teams as possible.

So, Who Should Write the Terraform?

Returning to the original question, ‘who should write the Terraform?’ can now be more easily answered, or at least approached. Depending on the factors discussed above, it might make sense for the work to be either centralised or distributed.

More importantly, by not simply assuming that there is a ‘right’ answer, you can make decisions about where the work goes with your eyes open about what the risks, trade-offs, and systemic preferences of your business are.

Whichever way you go, make sure that you establish which entity will be responsible for maintaining the resulting code as well as producing it. Code, it is important to remember, is an asset that needs maintenance to remain useful; ignore this and you store up confusion for the future.
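
To make the trade-off concrete, here is one hedged sketch of how the split can look when the answer is ‘a bit of both’ (all the repository, module, and team names here are hypothetical):

platform-terraform-modules/     # owned by the platform team: reusable, audited modules
  modules/
    k8s-cluster/
    postgres/
payments-team-infra/            # owned by the stream-aligned team: a thin root configuration
  main.tf                       # instantiates the central modules with team-specific values
  backend.tf                    # state/backend settings, often mandated by the centre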



Business Value, Soccer Canteens, Engineer Retention, and the Bricklayer Fallacy

Having the privilege of working in software in the 2020s, I hear variations on the following ideas expressed frequently:

  • ‘There must be some direct relationship between your work and customer value!’
  • ‘The results of your actions must be measurable!’

These ideas manifest in statements like this, which sound very sensible and plausible:

  • ‘This does not benefit the customer. This is not a feature to the customer. So we should not do it.’
  • ‘We are not in the business of doing X, so should not focus on it. We are in the business of serving the customer’
  • ‘This does not improve any of the key metrics we identified’

I want to challenge these ideas. In fact, I want to turn them on their head:

  • Many people’s work generates value by focussing on things that appear to have no measurable or obviously justifiable customer benefit.
  • Moreover, judgements on these matters are what people are (and should be) paid to exercise.

Alex Ferguson and Canteen Design

To encapsulate these ideas I want to use an anecdote from the sporting world, that unforgiving laboratory of success and failure. In that field, the record of Alex Ferguson, manager of Manchester United (a UK football, or soccer, team) during one of their ‘golden eras’ from 1986 to 2013, is unimpeachable. During those 27 years, he took them from second-from-bottom of English football’s top division in 1986 to winners of the treble, including the Champions League, in 1998-99.

Fortunately, he’s recorded his recollections and lessons in various books, and these books provide a great insight into how such a leader thinks, and what they’re paid to do.

Alex Ferguson demonstrating how elite-level sports teams can be coached to success

Now, to outsiders, the ‘business value’ he should be working towards is obvious. Some kind of variation of ‘make a profit’, or ‘win trophies’, or ‘score more goals than you concede in every match’ is the formulation most of us would come up with. Obviously, these goals break down to sub-goals like:

  • Buy players cheaply and extract more value from them than you paid for
  • Optimise your tactics for your opponents
  • Make players work hard to maintain fitness and skills

Again, we mortals could guess these. What’s really fascinating about Ferguson’s memoirs is the other things he focusses on, which are less obvious to those of us that are not experts in elite-level soccer.

Sometimes if I saw a young player, a lad in the academy, eating by himself, I would go and sit beside him. You have to make everyone feel at home. That doesn’t mean you’re going to be soft on them–but you want them to feel that they belong. I’d been influenced by what I had learned from Marks & Spencer, which, decades ago in harder times, had given their staff free lunches because so many of them were skipping lunch so they could save every penny to help their families. It probably seems a strange thing for a manager to be getting involved in–the layout of a canteen at a new training ground–but when I think about the tone it set within the club and the way it encouraged the staff and players to interact, I can’t overstate the importance of this tiny change.

Alex Ferguson, Leading

Now, I invite you to imagine a product owner, or scrum master for Manchester United going over this ‘update’ with him:

  • How does spending your time with junior players help us score more goals on Saturday?
  • Are we in the business of canteen architecture or soccer matches?
  • How do you measure the benefit of these peripheral activities?
  • Why are you micromanaging building design when we have paid professionals hired in to do that?
  • How many story points per sprint should we allocate to your junior 1-1s and architectural oversight?

It is easy to imagine how that conversation would have gone, especially given Ferguson’s reputation for robust plain speaking. (The linked article is also a mini-goldmine of elite talent management subtleties hiding behind a seemingly brutish exterior.)

Software and Decision Horizons

It might seem like managing a soccer team and working in software engineering are worlds apart, but there’s significant commonality.

Firstly, let’s look at the difference of horizon between our imagined sporting scrum master and Alex Ferguson.

The scrum master is thinking in:

  • Very short time periods (weeks or months)
  • Specific and measurable goals (score more goals!)

Alex Ferguson, by contrast, is thinking in decades-long horizons, and (practically) unmeasurable goals:

  • If I talk to this player briefly now, they may be motivated to work for us for the rest of their career
  • I may encourage others to help their peers by being seen to inculcate a culture of mutual support

I can think of a specific example of such a clash of horizons that resulted in a questionable decision in a software business.

Twenty years ago I worked for a company that had an ‘internal wiki’ – a new thing then. Many readers of this piece will know the phenomenon of ‘wiki-entropy’ (I just made that word up, but I’m going to use it all the time now), whereby an internal documentation system gradually degrades into uselessness, whatever the value of some of the content on there, as it gets overwhelmed by unmaintained information.

Well, twenty years ago we didn’t have that problem. We decided to hire a young graduate with academic tendencies to maintain our wiki. He assiduously ranged across the company, asking owners of pages whether the contents were still up to date, whether information was duplicated, complete, no longer needed, and so on.

The result of this was a wiki that was extremely useful, up to date, where content was easily found and minimal time was wasted getting information. The engineers loved it, and went out of their way to praise his efforts to save them from their own bad habits.

Of course, the wiki curator was first to be let go when the next opportunity arose. While everyone on the ground knew the high value of his work in saving hundreds of engineers the time and energy of chasing around bad information, the impact was hard to measure and never was measured, and in any case, shouldn’t the engineers be doing that themselves?

For years afterwards, whenever we engineers were frustrated with the wiki, we always cursed whoever it was that made the short-sighted decision to let his position go.

So-called ‘business people’, such as shareholders, executives, project managers, and product owners, are strongly incentivised to deliver in the short term, which most often means prioritising short-term goals (‘mission accomplished’) over longer-term value. Those that don’t think short-term often have a strong background in engineering and have succeeded in retaining their position despite this handicap.

What To Do? Plan A – The Scrum Courtroom

So your superiors don’t often think long term about the work you are assigned, but you take pride in what you do, and want the value of your work to be felt over a longer time than just a sprint or a project increment. And you don’t want people cursing your name as they suffer from your short-term self-serving engineering choices.

Fortunately, a solution has arisen that should handle this difference of horizon: scrum. This methodology (in theory, but that’s a whole other story) strictly defines project work to be done within a regular cadence (eg two weeks). At the start of this cadence (the sprint), the team decides together what items should go in it.

At the beginning of each cadence, therefore, you get a chance to argue the case for including in the sprint the improvement or investment you want to make in the system you are working on.

The problem with this is that these arguments mostly fail because the cards are still stacked against you, in the following ways:

  • The cadence limit
  • Uncertainty of benefit
  • Uncertainty of completion
  • Uncertainty of value

Plan A Mitigators – The Cadence Limit

First, the short-term nature of the scrum cadence has an in-built prejudice against larger-scale and more speculative/innovative ideas. If you can’t get your work done within the cadence, then it’s more easily seen as impractical or of little value.

The usual counter to this is that the work should be ‘broken down’ in advance into smaller chunks that can be completed within the sprint. This often has the effect both of making the work seem profoundly insignificant (‘talk to a young player in the canteen’) and of losing sight of the overall picture of the work being proposed (‘change/maintain the culture of the organisation’).

Plan A Mitigators – Uncertainty of Benefit

The scrum approach tries to increment ‘business value’ in each sprint. Since larger-scale and speculative/innovative work is generally riskier, it’s much harder to ‘prove’ the benefit for the work you do in advance, especially within the sprint cadence.

The result is that such riskier work is less likely to be sanctioned by the scrum ‘court’.

Plan A Mitigators – Uncertainty of Completion

Similarly, there is an uncertainty as to whether the work will get completed within the sprint cadence. Again, this makes the chances of success arguing your case less likely.

Plan A Mitigators – Uncertainty of Value

‘Business Value’ is a very slippery concept the closer you look at it. Mark Schwartz wrote a book I tell everyone to read deconstructing the very term, and showing how no-one really knows what it means. Or, at the very least, it means very different things to different people.

The fact is that almost anything can be justified in terms of business value:

  • Spending a week on an AWS course
    • As an architect, I need to ensure I don’t make bad decisions that will reduce the flow of features for the product
  • Spending a week optimising my dotfiles
    • As a developer, I need to ensure I spend as much time coding efficiently as possible so I can produce more features for the product
  • Tidying up the office
    • As a developer, I want the office to be tidier so I can focus more effectively on writing features for the product
  • Hiring a Michelin starred chef to make lunch
    • As a developer, I need my attention and nutrition to be optimised so I can write more features for the product without being distracted by having to get lunch

The problem with all these things is that they are effectively impossible to measure.

There’s generally no objective way to prove customer value (even if we can be sure what it is). Some arguments just sound rhetorically better to some ears than others.

If you try to justify them within some business framework (such as improving a defined and pre-approved metric), you get bogged down in discussions that you effectively can’t win.

  • How long will this take you?
    • “I don’t know, I’ve never done this before”
  • What is the metric?
    • “Um, culture points? Can we measure how long we spend scouring the wiki and then chasing up information gleaned from it? [No, it’s too expensive to do that]”

‘Plan A’ Mitigators Do Have Value

All this is not to say that these mitigators should be removed, or have no purpose. Engineers, as any engineer knows, can have a tendency to underestimate how hard something will be to build, how much value it will bring, and even do ‘CV-driven development’ rather than serve the needs of the business.

The same could be said of soccer managers. But we still let soccer managers decide how to spend their time, and the more experienced they are and the more success they have demonstrated, the more latitude we give them.

But…

In any case, I have been involved in discussions like this at numerous organisations, and they end up taking longer than actually doing the work, or at least than doing the work necessary to prove the value in proof-of-concept form.

So I mostly move to Plan B…

What To Do? Plan B – Skunkworks It

Plan B is to skip court and just do the work necessary to be able to convince others that yours is the way to go without telling anyone else. This is broadly known as ‘skunkworks’.

The first obvious objection to this approach is something like this:

‘How can this be done? Surely the time taken for work in your sprint has been tightly defined and estimated, and you therefore have no spare time?’

Fortunately this is easily solved. The thing about leaders who don’t have strong domain knowledge is that their ignorance is easily manipulated by those they lead. So the workers simply bloat their estimates, telling the leaders that the easy official tasks will take longer than they actually will, leaving time left over to work on the things they think are really important or valuable to the business.

Yes, that’s right: engineers actually spend time they could be spending doing nothing trying to improve things for their business in secret, simply because they want to do the right thing in the best way. Just like Alex Ferguson spent time chatting to juniors, and micromanaging the design of a canteen when he could have enjoyed a longer lunch alone, or with a friend.


It’s Not A Secret

Good leaders know this happens, even encourage it explicitly. A C-level leader (himself a former engineer) once said to me “I love that you hide things from me. I’m not forced to justify to my peers why you’re spending time on improvements if I don’t know about them and just get presented with a solution for free.”

The Argument

When you get paid to make decisions, you are being paid to exercise your judgement exactly in the ways that can’t be justified within easily measurable and well-defined metrics of value.

If your judgement could be quantified and systematised, then there would be no need for you to be there to make those judgements. You’d automate it.

This is true whether you are managing a soccer team, or doing software engineering.

Making software is all about making classes of decision that are the same in shape as Alex Ferguson’s. Should I:

  • Fix or delete the test?
  • Restructure the pipeline because its foundations are wobbly, or just patch it for now?
  • Mentor a junior to complete this work over a few days, or finish the job myself in a couple of hours?
  • Rewrite this bash script in Python, or just add more to it?
  • Spend the time to containerise the application and persuade everyone else to start using Docker, or just muddle along with hand-curated environments as we’ve always done?
  • Spend time getting to know the new joiner on the team, or focus on getting more tickets in the sprint done?

Each of these decisions has many consequences which are unclear and unpredictable. In the end, someone needs to make a decision where to spend the time based on experience as the standard metrics can’t tell you whether they’re a good idea.

Conclusion

At the heart of this problem in software is what I call the ‘bricklayer fallacy’. Many view software engineering as a series of tasks analogous to laying bricks: once you are set up, you can say roughly how long it will take to do something, because laying one brick takes a predictable amount of time.

This fallacy results in the treatment of software engineering as readily convertible into what business leaders want: a predictable graph of delivery over time. Attempts to maintain this illusion for business leaders result in the fairy stories of story points, velocity, and burn-down charts. All of these can drive the real value work underground.

If you want evidence of this not working, look here. Scrum is conspicuously absent as a software methodology at the biggest tech companies. They don’t think of their engineers as bricklayers.

Soccer managers don’t suffer as much from this fallacy because we intuitively understand that building a great soccer team is not like building a brick wall.

But software engineering is also a mysterious and varied art. It’s so full of craft and subtle choices that the satisfaction of doing the job well exceeds the raw compensation for attendance and following the rules. Frequently, I’ve observed that ‘working to rule’ gets the same pay and rewards as ‘pushing to do the right thing for the long term’, but results in real human misery. At a base level, your efforts and their consequences are often not even noticed by anyone.

If you remove this judgement from people, you remove their agency.

This is a strange novelty of knowledge work that didn’t exist in the ‘bricklayer’ age of piece-work and Taylorism. In the knowledge-work era, the engineers who like to actually deliver work of true long-term value get dissatisfied and quit. And paying them more to put up with it doesn’t necessarily help, as the ones that stay are the ones that have learned to optimise for getting more money rather than doing better work. These are exactly the people you don’t want doing the work.

If you want to keep the best and most innovative staff – the ones that will come up with 10x improvements to your workflows that result in significant efficiencies, improvements, and savings – you need to figure out who the Alex Fergusons are, and give them the right level of autonomy to deliver for you. That’s your management challenge.



Five Reasons To Master Git On The Command Line

If you spend most of your days in a browser watching pipelines and managing pull requests, you may wonder why anyone would prefer the command line to manage their Git workflow.

I’m here to persuade you that using the command line is better for you. It’s not easier in every case, you will find, and it can be harder to learn. But investing the time in building those muscles will reap serious dividends for as long as you use Git.

Here are five (far from exhaustive) reasons why you should become one with Git on the command line.

1. git log is awesome

There are so many ways that git log is awesome I can’t list them all here. I’ve also written about it before.

If you’ve only looked at Git histories through GitHub or BitBucket then it’s unlikely you’ve seen the powerful views of what’s going on with your Git repository.

This is the capsule command that covers most of the flags I use on the regular:

git log --oneline --all --decorate --graph

--oneline – shows a summary per commit in one line, which is essential for seeing what’s going on

--graph – arranges the output into a graph, showing branches and merges. The format can take some getting used to, especially for complex repositories, but it doesn’t take long to get the hang of it

--all – shows commits from all the references stored locally (all branches and tags), not just the current branch

--decorate – shows any reference names

This is what kubernetes looks like when I check it out and run that command:

Most versions of git these days implicitly use --decorate so you won’t necessarily need to remember that flag.

Other arguments that I regularly use with git log include:

--patch – show the changes associated with a commit

--stat – summarise the changes made by file

--simplify-by-decoration – only shows commits that have a reference associated with them. This is particularly useful if you don’t want to see all commits, just ‘significant’ ones associated with branches or tags.
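
For instance, a couple of combinations I find myself reaching for regularly (nothing here is specific to any particular repository):

# Just the 'shape' of the history: commits that branches or tags point at
git log --oneline --graph --all --simplify-by-decoration

# The most recent commit, with a per-file summary and the full diff
git log -1 --stat --patch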

In addition, you have a level of control that the GitBucketLab tools lack when viewing histories. By setting the pager in your ~/.gitconfig file, you can control how the --patch output looks. I like the diff-so-fancy tool. Here’s my config:

[core]
        pager = diff-so-fancy | less -RF

The -R argument to less above passes the colour escape codes through raw (so the coloured diff displays properly), and -F quits immediately if the output fits in one screen.


If you like this post, you may like my book Learn Git the Hard Way


2. The git add flags

If you’re like me you may have spent years treating additions as a monolith, running git commit -am 'message' to add and commit changes to files already tracked by Git. Sometimes this results in commits that prompt you to write a message that amounts to ‘here is a bunch of stuff I did’.

If so, you may be missing out on the power that git add can give you over what you commit.

Running:

git add -i

(or --interactive) gives you an interactive menu that allows you to choose what is to be added to Git’s staging area ready to commit.

Again, this menu takes some getting used to. If you choose a command but don’t want to do anything about it, you can hit return with no data to go back. But sometimes hitting return with no input means you choose the currently selected item (indicated with a ‘*‘). It’s not very intuitive.

Most of the time you will be adding patches. To go direct to that, you can run:

git add -p

which takes you straight to choosing which hunks to stage.

But the real killer command I use regularly is:

git add --edit

which allows you to use your configured editor to decide which changes get added. This is a lot easier than using the interactive menu’s ‘splitting’ and ‘staging hunks’ method.
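
A minimal sketch of the flow I tend to use with it (the commit message is just an example):

git add --edit          # edit the diff in your editor; delete the hunks you don't want staged
git diff --cached       # review exactly what is about to be committed
git commit -m 'tighten input validation'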


3. git difftool is handy

If you go full command line, you will be looking at plenty of diffs. Your diff workflow will become a strong subset of your git workflow.

You can use git difftool to control how you see diffs, eg:

git difftool --tool=vimdiff

To get a list of all the available tools, run:

git difftool --tool-help
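
If you settle on a tool you like, you can persist the choice rather than passing --tool every time; something like this should do it:

git config --global diff.tool vimdiff       # used by 'git difftool' when no --tool is given
git config --global difftool.prompt false   # don't ask before launching the tool for each file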

4. You can use it anywhere

If you rely on a particular GUI, then there is always the danger that that GUI will be unavailable to you at a later point. You might be working on a very locked-down server, or be forced to change OS as part of your job. Or it may fall out of fashion and you may want to try a new one.

Before I saw the light and relied on the command line, I went through many different GUIs for development, including phases of Kate, IntelliJ, Eclipse, even a brief flirtation with Visual Studio. These all have gone in and out of fashion. Git on the command line will be there for as long as Git is used. (So will vi, so will shells, and so will make, by the way).

Similarly, you might get used to a source code site that allows you to rebase with a click. But how do you know what’s really going on? Which brings me to…

It’s closer to the truth

All this leads us to the realisation that the Git command is closer to the truth than a GUI (*), and gives you more flexibility and control.

* The ‘truth’ is obviously in the source code https://github.com/git/git, but there’s also the plumbing/porcelain distinction between Git’s ‘internal’ commands and its ‘user-friendly’ commands. But let’s not get into that here: its standard interface can be considered the ‘truth’ for most purposes.

When you’re using git on the command line, you can quickly find out what’s going on now, what happened in the past, the difference between the remote and the local, and you won’t be gnashing your teeth in frustration because the GUI doesn’t give you exactly the information you need, or gives you a limited and opinionated view of the world.
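
To get a feel for that plumbing/porcelain distinction, compare a porcelain command with the plumbing underneath it (run these in any Git repository):

git log -1 --oneline    # porcelain: a human-friendly summary of the latest commit
git rev-parse HEAD      # plumbing: just the raw commit hash
git cat-file -p HEAD    # plumbing: the commit object itself (tree, parents, author, message)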

5. You can use ‘git-extras’

Finally, using Git on the command line means you can make use of git-extras. This commonly-available package contains a whole bunch of useful shortcuts that may help your git workflow. There are too many to list, so I’ve just chosen the ones I use most commonly.

When using many of these, it’s important to understand how they interact with any remote repositories, whether you need to configure anything to make them work, and whether they affect the history of your repository, potentially making pushing or pulling problematic. If you want to get a good practical understanding of these things, check out my Learn Git The Hard Way book.

git fork

Allows you to fork a repository on GitHub. Note that for this to work you’ll need to add a personal access token to your git config under git-extras.github-personal-access-token.

git rename-tag

Renames a tag, locally and remotely. Much easier than doing all this.

git cp

Copies (rather than git mv, which renames) the file, keeping the original’s history.

git undo

Removes the latest commit. You can optionally give a number, which undoes that number of commits.

git obliterate

Uses git filter-branch to destroy all evidence of a file from your Git repo’s history.

git pr / git mr

These allow you to manage pull requests locally. git pr is for GitHub, while git mr is for GitLab.

git sync

Synchronises the history of your local branch with the remote’s version of that branch.



How 3D Printing Kindled A Love For Baroque Sculpture

I originally got a 3D printer in order to indulge a love of architecture. A year ago I wrote about my baby steps with my printer, and after that got busy printing off whatever buildings I could find.

After a few underwhelming prints, I stumbled on a site called ‘Scan the World’, which contains scans of all sorts of famous artefacts.

Pretty soon I was hooked.

Laocoön and His Sons

The first one I printed was Laocoön and His Sons, and I was even more blown away than I was when I saw it in the Vatican nearly 30 years ago.

While it’s not perfect (my print washing game isn’t quite there yet), what was striking to me was the level of detail the print managed to capture. The curls, the ripples of flesh and muscle, the fingers, they’re all there.

More than this, nothing quite beats being able to look at it close up in your own home. Previously I’d read art history books and not been terribly impressed by, or interested in, the sculpture I’d only seen in pictures and read about.

This encouraged me to look into the history of this piece, and it turned out to be far more interesting than I’d ever expected.

The sculptor is unknown. It was unearthed in a vineyard in 1506 in several pieces, and the Pope got wind of it, getting an architect and his mate to give it a once-over. His mate was called Michelangelo, and it soon ended up in the Vatican.

It was later nicked by the French in 1798, and returned to the Vatican 19 years later after Napoleon’s Waterloo.

It blows my mind that this sculpture was talked about by Pliny the Elder, carved by Greeks (probably), then left to the worms for a millennium, then dug up and reconstructed. I’m not even sure that historians are sure that the one in the Vatican now was the ‘original’, or whether it was itself a copy of another statue, maybe even in bronze.

Farnese Bull

The next one I printed has a similar history to Laocoön and His Sons. It was dug up during excavations of a Roman bath. Henry Peacham in ‘The Complete Gentleman’ (1634) said that it “outstrippeth all other statues in the world for greatness and workmanship”. Hard to argue with that.

La Pieta

Before Michelangelo had a look at the Laocoön, he carved this statue of the dead Jesus with his mother from a single block of marble. It’s not my favourite, but I couldn’t not print it. It’s worth it for Mary’s expression alone.

Aside from being one of the most famous sculptures ever carved, this sculpture has the distinction of being one of the few works Michelangelo ever signed, apparently because he was pissed off that people thought a rival had done it.

Apollo and Daphne

Then I moved onto the sculptor I have grown to love the most: Bernini. The master.

When I printed this one, I couldn’t stop looking at it for days.

The story behind it makes sense of the foliage: consumed by love (or lust) Apollo is chasing after Daphne, who, not wanting to be pursued because of some magical Greek-mythical reason, prays to be made ugly or for her body to change. So she becomes a tree while she runs. Apollo continues to love the tree.

Amazed by that one, I did another Bernini.

Bernini’s David

I have to call this one “Bernini’s David” to distinguish it from Michelangelo’s which seems to have the monopoly on the name in sculpture terms.

I don’t know why, though: Bernini’s David is almost as breathtaking as Apollo and Daphne. Although – like Michelangelo’s David – the figure is somewhat idealised, this David feels different: a living, breathing fighter about to unleash his weapon rather than a beautiful boy in repose. Look at how his body is twisted, and how his feet are even partially off the base.

My Own Gallery

What excites me about this 3D printing lark is the democratisation of art collection. Twenty years ago the closest I could get to these works was either going on holiday, traipsing to central London to see copies, or looking at them in (not cheap) books at home.

Now, I can have my own art collection wherever I want, and if they get damaged, I can just print some more off.



grep Flags – The Good Stuff

While writing a post on practical shell patterns I had a couple of patterns that used grep commands.

I had to drop those patterns from the post, though, because as soon as I thought about them I got lost in all the grep flags I wanted to talk about. I realised grep deserved its own post.

grep is one of the most universal and commonly-used commands on the command line. I count about 50 flags you can use on it in my man page. So which ones are the ones you should know about for everyday use?

I tried to find out.

I started by asking on Twitter whether the five flags I have ‘under my fingers’ and use 99% of the time are the ones others also use.

It turns out experience varies widely on what the 5 most-used are.

Here are the results, in no particular order, of my researches.

I’ve tried to categorize them to make it easier to digest. The categories are:

  • ABC – The Context Ones
  • What To Match?
  • What To Report?
  • What To Read?
  • Honourable Mention

If you think any are missing, let me know in the comments below.

ABC – The Context Ones

These arguments give you more context around your match.

grep -A
grep -B
grep -C

I hadn’t included these in my top five, but as soon as I was reminded of them, -C got right back under my fingertips.

I call them the ‘ABC flags’ to help me remember them.

Each of these gives a specified context around your grep’d line. -A gives you lines after the match, -B gives you lines before the match, and -C (for ‘context’) gives you both the before and after lines.

$ mkdir grepflags && cd grepflags
$ cat > afile <<EOF                                                                   
a
b
c
d
e
EOF
$ grep -A2 c afile
c
d
e
$ grep -B2 c afile
a
b
c
$ grep -C1 c afile
b
c
d

This is especially handy for going through configuration files, where the ‘before’ context can give you useful information about where the thing you’re matching sits within the wider file.
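
For example, something like this (assuming a standard sshd_config is present on your machine) shows the setting along with the surrounding section of the file it sits in:

$ grep -C3 PermitRootLogin /etc/ssh/sshd_config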

What To Match?

These flags relate to altering what you match and don’t match with your grep.

grep -i

This flag ignores the case of the match. It’s very handy, and routine for me, to avoid missing matches I might want to see (I often grep through large amounts of plain text prose).

$ cat > afile <<EOF
SHOUT
shout
let it all out
EOF
$ grep shout afile
shout
$ grep -i shout afile
SHOUT
shout

grep -v

This matches any lines that don’t match the regular expression (inverts), for example:

$ touch README.md
$ IFS=$'\n'   # avoid problems with filenames with spaces in them
$ for f in $(ls | grep -v README)
> do echo "top of: $f"
> head $f
> done
top of: afile
SHOUT
shout
let it all out

which outputs the heads of all files in the local folder except any files with README in their names.

grep -w

The -w flag only matches ‘whole-word’ matches, ignoring cases where the submitted word is part of a longer word.

This is a useful flag to narrow down your matches, and also especially useful when searching through prose:

$ cat > afile <<EOF
na
NaNa
na-na-na
na_na_na
hey jude
EOF
$ grep na afile
na
na-na-na
na_na_na
$ grep -w na afile
na
na-na-na
$ grep -vwi na afile
NaNa
na_na_na
hey jude

You might be wondering what characters are considered part of a word. The manual tells us that ‘Word-constituent characters are letters, digits, and the underscore.’ This is useful to know if you’re searching code where word separators in identifiers might switch between dashes and underscore. You can see this above with the na-na-na vs na_na_na differences.

What To Report?

These grep flags offer choices about how the output you see is rendered.

grep -h

grep -h suppresses the prefixing of filenames on output. An example is demonstrated below:

$ rm -f afile
$ cat > afile1 << EOF
a
EOF
$ cp afile1 afile2 
$ grep a *
afile1:a
afile2:a
$ grep -h a *
a
a

This is particularly useful if you want to process the matching lines without the filename spoiling the input. Compare these to the output without the -h.

$ grep -h a * | uniq
a 
$ grep -h a * | uniq -c
      2 a

grep -o

This outputs only the text matched by your regular expression. Each match is output on its own line, and a single input line can produce multiple matches.

This can result in more lines of output than lines of input, as in the example below, where we first look for the last word of lines that end in ‘ay’, and then for any words (preceded by a space) that contain the letter ‘e’.

$ rm -f afile1 afile2
$ cat > afile << EOF
Yesterday
All my troubles seemed so far away
Now it looks as though they're here to stay
Oh I believe
In yesterday
EOF
$ grep -o ' [^ ]*ay$' afile
 away
 stay
 yesterday
$ grep -o ' [^ ]*e[^ ]*' afile
 troubles
 seemed
 they're
 here
 believe
 yesterday

grep -l

If you’re fighting through a blizzard of output and want to focus only on which files your matches are in rather than the matches themselves, then using this flag will show you where you might want to look:

$ cat > afile << EOF
a
a
EOF
$ cp afile afile2
$ cp afile afile3
$ grep a *
afile:a
afile:a
afile2:a
afile2:a
afile3:a
afile3:a
$ grep -l a *
afile
afile2
afile3

What To Read?

These flags change which files grep will look at.

grep -r

A very popular flag, this flag recurses through the filesystem looking for matches.

$ grep -r securityagent /etc
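
In practice I usually add -n for line numbers and (with GNU grep) --include to restrict which files are searched; the pattern and glob here are just illustrative:

$ grep -rn --include='*.yaml' 'image:' .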

grep -I

This one is my favourite, as it’s incredibly useful, and not so well known, despite being widely applicable.

If you’ve ever been desperate to find where a string is referenced in your filesystem (usually as root) and run something like this:

$ grep -rnwi specialconfig /

then you won’t have failed to notice that it can take a good while. This is partly because it’s looking at every file from the root, whether it’s a binary or not.

The -I flag only considers text files. This radically speeds up recursive greps. Here we run the same command twice (to ensure it’s not only slow the first time due to OS file cacheing), then run the command with the extra flag, and see a nearly 50% speedup.

$ time sudo grep -rnwi specialconfig / 2>/dev/null 
sudo grep -rnwi specialconfig /  418.01s user 382.19s system 70% cpu 19:03.09 total
$ time sudo grep -rnwi specialconfig /
sudo grep -rnwi specialconfig /  434.19s user 411.62s system 70% cpu 19:56.25 total
$ time sudo grep -rnwiI specialconfig /
sudo grep -rnwiI specialconfig /  33.54s user 322.64s system 52% cpu 11:19.03 total

Honourable mention

There are many other grep flags, but I’ll just add one honourable mention at the end here.

grep -E

I spent an embarrassingly long time trying to get regular expressions with + signs in them to work in grep before I realised that default grep didn’t support so-called ‘extended’ regular expressions.

By using -E you can use those regular expressions just as the regexp gods intended.

$ cat > afile <<< 'aaaaaaaaaab a+b'
$ grep -o 'a+b' afile        # + is treated literally
a+b
$ grep -o 'aa*b' afile       # a workaround
aaaaaaaaaab
$ grep -oE 'a+b' afile       # extended regexp
aaaaaaaaaab


Why It’s Great To Be A Consultant

I spent 20 years slaving away at companies doing development, maintenance, troubleshooting, architecture, management, and whatever else needed doing. For all those years I was a permanent hire, working on technology from within.

I still work at companies doing similar things, but now I’m not a permie. I’m a consultant working for a ‘Cloud Native’ consultancy that focusses on making meaningful changes across organisations rather than just focussing on tech.

I wish I’d made this move years ago. So here are the reasons why it’s great to be a consultant.

1) You are outside the company hierarchy

Because you come in as an outsider, and are not staying forever, all sorts of baggage that exists for permanent employees does not exist for you.

When you’re an internal employee, you’re often encouraged to ‘stay in your box’ and not disturb the existing hierarchy. By contrast, as a consultant you don’t need to worry as much about the internal politics or history when suggesting changes. Indeed, you’re often encouraged to step outside your boundaries and make suggestions.

This relative freedom can create tensions with permanent employees, as you often have more asking power than the teams you consult for. Hearing feedback of “I have been saying what you’re saying for years, and no-one has listened to me” is not uncommon. It’s not fair but it’s a fact: you both get to tell the truth, and get listened to more when you’re a consultant.

2) You have to keep learning

If you like to learn, consulting is for you. You are continually thrown into new environments, forced to consider new angles on industry verticals, organisational structures, technologies, and questions that you may not previously have come across, thought about, or answered. You have to learn fast, and you do.

Frequently, when you are brought in to help a business, you meet the people in the business that are the most knowledgeable and forward-thinking. Meeting with these talented people can be challenging and intimidating, as they may have unrealistically high expectations of your ‘expertise’, and an unrealistically low opinion of their own capabilities.

It’s not uncommon to wonder why you’ve been brought in at all when the people you’re meeting already seem to have the capability to do what you are there to help with. When this happens, you usually go deeper into the business after a while and realise that they need you to help spread the word from outside the hierarchy (see 1, above).

These top permie performers can be used to being ‘top dog’ in their domain, and unwilling to cede ground to people who they’ve not been in the trenches with. You may hear phrases like ‘consulting bullshit’ if you are lucky enough for them to be honest and open with you. However, this group will be your most energetic advocates once you turn them around.

If you can get over the (very human) urge to compete and ‘prove your expertise’, and focus on just helping the client, you can both learn and achieve a lot.

3) You get more exposure

Working within one organisation for a long time can narrow your perspective in various dimensions, such as technology, organisational structure, and culture.

To take one vivid example, we recently worked with a ‘household name’ business that brought us in because they felt that their teams were not consistent enough and that they needed some centralisation to make their work more consistent. After many hours of interviews we determined that they were ideally organised to deliver their product in a microservices paradigm. We ended up asking them how they did it!

I was surprised, as I’d never seen a company with this kind of history move from a ‘traditional’ IT org structure to a microservices one, and was skeptical it could be done. This kind of ‘experience-broadening’ work allows you to develop a deeper perspective on any work you end up doing, what’s possible, and how it happens.

And it’s not only at the organisational level that you get to broaden your experience. You get to interact with, and even work for multiple execution teams at different levels of different businesses. You might work with the dreaded ‘devops team’, the more fashionable ‘platform team’, traditional ‘IT teams’, ‘SRE teams’, traditional ‘dev teams’, even ‘exec teams’.

All these experiences give you more confidence and authority when debating decisions or directions: red flags become redder. You’ve got real-world examples to draw on, each of which is worth a thousand theoretical and abstract powerpoint decks, especially when dealing with the middle rank of any organisation you need to get onside. Though you still have to write some of those decks (and some code) too.

4) You meet lots of people

For me, this is the big one.

I’ve probably meaningfully worked with more people in the last two years than the prior ten. Not only are the sheer numbers I’ve worked with greater, they are from more diverse backgrounds, jobs, teams, and specialties.

I’ve got to know talented people working in lots of places and benefitted from their perspectives. And I hope they’ve benefitted from mine too.

Remember, there is no wealth but life.


Join Us

If you want to experience some of the above, get in touch:
Twitter: @ianmiell
email: ian.miell \at\ gmail.com


Practical Shell Patterns I Actually Use

Over the decades I’ve been using the shell, there are thousands of reusable patterns I’ve picked up from looking over others’ shoulders and googling.

Unfortunately, I’ve forgotten about 95% of them.

So here, I list many of the patterns I actually use often enough to be able to remember. Whether you want to get them under your fingers is up to you: your mileage may vary depending on your tastes and what you most commonly use the shell for.

I’m acutely aware that for most of these tips there are better/faster/more elegant ways to achieve the same thing, but that’s not the point here. The point is to reflect on what actually stuck, so that others may save time by spending their time learning what is more likely to stick. I will mention alternative methods and why they didn’t take as we go, as well as theoretical limitations or flaws in the methods I actually use.

I’m going to cover:

  • Get The Last Field From The Output
  • Use sed To Extract
  • ‘Do For All’ with xargs
  • Kill All Processes
  • Find Files Ending With…
  • Process Files With sed | sh
  • Give Me All The Output With 2>&1
  • Separate Lines With tr
  • Quick Infinite Loop
  • Inline Files

Get The Last Field From The Output

$ [commands] | awk '{print $NF}' 

This is what I most commonly use awk for on the command line. I also use it where I might otherwise use cut, by selecting a specific field with, for example, awk '{print $2}' for the second field (see ‘Kill All Processes’ below).

In the top example, NF is awk’s ‘number of fields’ variable, so $NF refers to the last field (awk fields are indexed from 1, not 0). The last field in a pipeline’s output is commonly a filename, so I often chain this command with xargs to process each file in turn with a new command (see “‘Do For All’ With xargs“ below).
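
For instance, here’s a sketch of that awk-plus-xargs chain, assuming a hypothetical set of .log files and filenames without spaces:

$ ls -l /var/log/*.log | awk '{print $NF}' | xargs -n1 wc -l

The long listing’s last field is the file’s path, which awk extracts and xargs then feeds to wc one file at a time.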

You can also use cut for this kind of thing, but I have found that a mixture of awk and sed have sufficed for me to achieve what I want. I do use cut every now and then, though.

Use sed To Extract

When using pipelines, you frequently want to extract a specific part of each line that is output.

My goto command for this is sed, which is well worth investing time in. Before you do that, you have to have a reasonably good understanding of regular expressions, which is even more worth investing time in.

The sed pattern I use most often is the search and replace one (s/FIND/REPLACE/), an example of which is below. This example takes the contents of the /etc/passwd database and outputs the username and default shell for each account on the system:

$ cat /etc/passwd | sed 's/\([^:]*\):.*:\(.*\)/user: \1 shell: \2/'

sed (which is short for ‘stream editor’) can take a filename as an argument, but if none is supplied it assumes it’s receiving lines through standard input.

The first character of the sed script (which is ‘s‘ in the example) indicates the command sed is being given (in bold below), followed by the default separator (which is a forward slash).

s/\([^:]*\):.*:\(.*\)/user: \1 shell: \2/

Then, what follows (up to the next forward slash) is the regular expression pattern to match in each line (in bold below):

 s/\([^:]*\):.*:\(.*\)/user: \1 shell: \2/

Within those toenail clippings, you see two sets of opening and closing parentheses. Each of these is escaped by a backslash (to distinguish them from just matching the parentheses characters as characters):

  • \([^:]*\)
  • \(.*\)

The first one ‘captures’ the username, while the second one ‘captures’ their shell. These are then referenced in the ‘replace’ part of the sed command by their number order:

 s/\([^:]*\):.*:\(.*\)/user: \1 shell: \2/

which produces the output (on my system)…

user: nobody shell: /usr/bin/false
user: root shell: /bin/sh
user: daemon shell: /usr/bin/false
[...]

sed definitely requires some effort to learn, but it will quickly repay you if you ever do any text processing.
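
Incidentally, if the backslash-escaped parentheses feel noisy, most modern sed implementations accept -E to switch to extended regular expressions, where plain parentheses capture (a sketch; check your sed’s man page for support):

$ sed -E 's/([^:]*):.*:(.*)/user: \1 shell: \2/' /etc/passwd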




‘Do For All’ With xargs

xargs is one of the most powerful and time-saving commands to use on the terminal. But it remains impenetrable to some, which is a shame, as with a little work it’s not that difficult to get to grips with.

Before giving a real-world example, let’s go through it with a simple example. Create and move into a folder, creating three files:

$ mkdir xargs_example && cd xargs_example && touch 1 2 3 && ls
1 2 3

Now, by default, xargs takes all the items passed in, and passes them as arguments to the given command:

$ ls | xargs -t ls -l
ls -l 1 2 3
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 1
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 2
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 3

(We are using the -t flag here for explanatory purposes, to show the commands that actually get run; generally, you don’t need it.)

The -n flag allows you to process a number of arguments at once. Try this to see what I mean:

$ ls | xargs -n2 -t ls -l
ls -l 1 2
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 1
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 2
ls -l 3
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 3

Most often, I use -n1, to run the command on each argument separately:

$ ls | xargs -n1 -t ls -l
ls -l 1
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 1
ls -l 2
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 2
ls -l 3
-rw-r--r--  1 imiell  staff  0  3 Jan 11:07 3

Here’s a real-world example I used recently:

find . | \
  grep azurerm | \
  grep tf$ | \
  xargs -n1 dirname | \
  sed 's/^.\///'

It:

  • outputs all non-hidden files in or under the current working folder
  • of those files, selects only those files that have azurerm in their name
  • of those files, selects only those that end with tf (eg ./azurerm/somefile.tf)
  • for each of those files, strips the filename from the path, leaving the directory that contains it, preceded by a dot and forward slash (eg ./azurerm)
  • for each of those directories, removes the leading dot and forward slash, leaving the bare directory path (eg azurerm)

But what if the argument doesn’t go at the end of the command given to xargs? In that case I use the -I flag, which allows you to replace the arguments that would be applied with a string of your choice. In this example I moved all files with ‘Aug‘ in them to a specific folder:

$ ls | grep Aug | xargs -IXXX mv XXX aug_folder

Be aware that naive use of xargs can lead to problems in scripts. What if your files have spaces in them, or even newlines? What if there are more arguments than can be handled by the command? I’m not going to cover these nuances here, but it’s well covered in this excellent resource for more advanced bash usage.
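
One common mitigation, for the record, is to have find emit NUL-separated filenames and tell xargs to expect them. Here’s a sketch of the Terraform example above in that style (assuming a find/xargs pair that supports -print0 and -0, as GNU and BSD versions do):

$ find . -path '*azurerm*' -name '*.tf' -print0 | xargs -0 -n1 dirname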

I also regularly tidy up dodgy filenames with detox on my servers.

Kill All Processes

Now you’ve seen awk and xargs, you can use these to quickly kill all processes that match. I used this quite often to kill off some pesky Virtual Machine processes that sometimes get left over in a corner case and prevent me from running up more:

$ ps -ef | grep VBoxHeadless | awk '{print $2}' | xargs kill -9

Again, you have to be careful with your grep here to ensure that you don’t accidentally kill processes you didn’t intend to match.

Also be careful with the -9 argument to kill. You should only use that when the process doesn’t respond to the default kill signal (TERM, rather than -9‘s KILL), which gives the process a chance to tidy up after itself if it chooses to.
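
For completeness, pkill can collapse the whole pipeline into one command (its -f flag matches against the full command line), though the longer pipeline above shows its working more clearly:

$ pkill -f VBoxHeadless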

Find Files Ending With…

I often find myself looking for where files are on my system. The mlocate database is easily installable if you don’t have it, and invaluable for speeding up file lookups that would otherwise mean a slow find across the whole filesystem. For example, I often need to find files across the filesystem that end with a specific suffix:

$ sudo updatedb
$ sudo locate cfg | grep '\.cfg$'
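
For comparison, the plain find equivalent, which is much slower on a large filesystem (which is the point of locate):

$ sudo find / -name '*.cfg' 2>/dev/null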

Process Files With sed | sh

Often you want to run a command on files extracted (or transformed) by a sed command. With a little tweaking this is easily done by creating a shell script using sed, and then piping it to a shell. This example looks for https links at the start of lines in the doc.md file, and opens them up in a browser using the open command available on Macs:

$ grep ^.https doc.md | sed 's/^.\(h[^])]*\).*/open \1/' | sh

There are alternative ways to do this with xargs, but I use this when I want to see what the resulting script will actually look like before running it (by leaving off the ‘| sh‘ at the end until I’m happy with it).

Give Me All The Output With 2>&1

Some commands separate their output into ‘standard’ output and ‘error’ output. A pipe only passes the ‘standard’ output on to the next command (such as grep), and the ‘error’ output is ignored (because it goes to a separate ‘file handle’, but you don’t need to understand that right now).

For example, I was searching for a particular flag in the openssl command recently, and realised that openssl‘s help flag outputs to standard error by default. So adding 2>&1 (which redirects ‘error’ output to wherever the ‘standard’ output is pointed) ensures that the output is grep-able.

$ openssl x509 -h 2>&1 | grep -i common 

If you want to redirect the output to a file, you need to get the ordering right:

$ openssl x509 -h > openssl_help.txt 2>&1   # RIGHT!

If the file redirect comes after the 2>&1, then the standard error output still goes to the terminal.

$ openssl x509 -h 2>&1 > openssl_help.txt   # WRONG!

It’s best to think of this by considering that the command is read from left to right, so in the ‘right’ one, the interpreter ‘sees’:

  • Redirect standard output to the file openssl_help.txt, then
  • Redirect standard error to wherever standard output is pointing

and both outputs are pointed at the file. In the ‘wrong’ one:

  • Redirect standard error to wherever standard output is pointing (which at the moment is the terminal), then
  • Redirect standard output to the file openssl_help.txt

and standard error is still pointed at the terminal, while standard output is redirected to the file openssl_help.txt.
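
A quick way to convince yourself of the difference, given that openssl’s help goes to standard error:

$ openssl x509 -h > /dev/null        # help text still appears: it's on standard error
$ openssl x509 -h > /dev/null 2>&1   # silence: both streams now point at /dev/null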

Separate Lines With tr

tr is a handy command used in a variety of contexts. Its job is to replace individual characters in a stream of output. While sed can be used for this purpose, tr has a couple of advantages over it in certain contexts:

  • It’s easier to use than sed
  • It’s not line-oriented, so it ‘dumbly’ just replaces characters without concern for line separation

Here’s an example I used recently, to show each item in my PATH variable, one per line.

$ env | grep -w PATH | tr ':' '\n'
PATH=/usr/local/sbin
/usr/local/Caskroom/google-cloud-sdk/latest/google-cloud-sdk/bin
/usr/local/opt/coreutils/bin
/usr/local/opt/grep/libexec/gnubin

Also, tr is often used to remove problematic characters from a stream of output. For example, to turn a set of lines into a single line, you can run:

$ tr -d '\n'

which removes all the ‘newlines’ from a stream.
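
For example:

$ printf 'one\ntwo\nthree\n' | tr -d '\n'
onetwothree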

Quick Infinite Loop

A pattern you very often need is an infinite loop. This is the way I usually get one on the command line.

$ while true; do … ; done

You can use the break keyword to escape this infinite loop.
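
A typical concrete use, for example when polling something every few seconds:

$ while true; do date; sleep 5; done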

Inline Files

Creating files can be a faff, so it’s really useful to be able to just create a file inline on the command line.

You can do this with a ‘heredoc’ like this:

$ cat > afile << SOMESTRING
The contents of the file are written here.
Just keep typing until you are done, then
end with the string you specified at the
end of the first line.
SOMESTRING

That creates a file called afile with the contents between the first and last line.

You can even go one stage further, and substitute where you would formerly have used the filename using the <() construct (see point 6 here).

$ kubectl apply -f <(cat << EOF
…
EOF
)
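
Another everyday use of the same <() construct is comparing the output of two commands as though they were files (dir1 and dir2 here being whatever two directories you care about):

$ diff <(ls dir1) <(ls dir2)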


Why I Keep Coming Back to Cynefin

Working as a consultant in helping clients to change the way they work, I often struggle to explain to them how the way they usually attack problems is not always appropriate to the situation they’ve brought us in to help with. They might be a start-up that’s always taken an ad-hoc, JFDI approach that is now struggling with scale up or maturity challenges, or a large corporation used to planning every detail up front before acting who find themselves having to experiment with new capabilities in fast-changing fields.

In these situations, Cynefin is a useful conceptual tool for bringing leaders to the point where they understand that they may need to change their usual approach for this new context.

Cynefin is a meta-framework designed to help managers identify how they perceive situations and make sense of their own and other people’s behaviour.

But first, here’s how it helped me understand where I’d been going wrong.

How Cynefin helped me

I wrote about writing runbooks and doing SRE in a real-world context in a previous post. The post was well-received, and I figured I had this nailed. However, I learned something deeper about knowledge management and decision-making with a client a couple of years after that, when I failed to make the same techniques work in a different context.

The context of the original post was a company with a 15-year-old software stack, where prior examples of incidents and their histories of resolution were available to put together in a more coherent and regularised form. Even though the situation was crying out for runbooks, no-one had put them together, and getting to the point where they became part of the fabric of our work was still a long, costly and arduous process.

I found myself in a different situation some time later. I was in a team that was trying to deliver a new software delivery platform against a dynamic technical background. While there, I tried to foster a culture of writing runbooks to cover situations we’d seen in development, but I completely failed to make it stick. The reasons for this failure were numerous, but probably the most significant was that while I’d learned my lessons in a stable, ‘best practice’ context suitable for runbooks, the situation I was in was one of ’emergent practice’ where things were changing so fast that the documents were either out of date or redundant almost as soon as they were written.

While talking to a friend about this learning experience, they mentioned Cynefin (pronounced /kəˈnɛvɪn/, or kuh-NEV-in) to me as a framework for thinking about this kind of situation.

What is Cynefin?

A great overview of the Cynefin framework, from https://sketchingmaniacs.com/decision-making/

A great way to introduce Cynefin is to consider this often-mocked quote from Donald Rumsfeld.

“As we know, there are known knowns, there are things we know we know. We also know there are known unknowns, that is to say, we know there are some things we do not know. But there are also unknown unknowns – the ones we don’t know we don’t know.”

Donald Rumsfeld

I never really understood why it was mocked so much – it clearly states an important point about the nature of knowledge and ignorance in different contexts. The fact that it requires some concentration to follow should not have made it a cause of mockery. But hey.

Cynefin does something similar for business decision-making. It categorises the various types of context we might find ourselves in, and helps us orient ourselves within them. This categorisation helps us to adapt our behaviour appropriately to the context.

What are these categorised contexts? There are five:

  • Simple
  • Complicated
  • Complex
  • Chaotic

The fifth state (Disordered) just means we haven’t categorised the context yet.

1) The ‘simple’ context

Let’s take the two extremes first. The ‘simple’ context might also be called the ‘known known’, or ‘best practice’ context. It’s the context where any reasonable person can work out what to do if they know the domain. For example, if you’re an airline pilot and a light in your cabin flashes, there is a documented checklist to follow that describes what you need to do. If you’re a trained pilot, what to do is well-understood and consistently applicable. It’s a ‘simple’ context.

2) The ‘chaotic’ context

By contrast, the ‘chaotic’ context is one where no-one knows what to do. In this situation, I like to think of an improv comic. They are placed in situations where there is no ‘right way’ forward; by design, they are thrown into an unfamiliar context where they are forced to be creative and experiment. If there were a ‘best practice’ here, it just wouldn’t be funny. What is called for is ‘novel practice’: it’s vital to do something, see what happens, and respond to what happens next.

So, an approach seeking to find ‘best practice’ is not always best practice…

Decision-making

Cynefin proposes that there are different patterns of decision-making in these different contexts. All of them end in ‘respond’ (which seems to just mean ‘take action’), but are preceded by different approaches.

For ‘simple’, the steps are: sense, categorise, respond. In other words, figure out what the state is, which category of situation this is, and act accordingly.

For ‘chaotic’, the steps are: act, sense, respond. In other words, when you don’t know what the right thing to do is (or even where to start) just do something, and see what happens.

You can see how this maps to situations we’ve seen in different contexts. If you’re running an Accident and Emergency department, you seek to apply known best practice with every patient that comes in. So you sense (detect that a patient has come in), categorise (triage them), and respond (schedule them for appropriate and timely treatment). That’s a ‘simple’ (though not ‘easy’) situation.

If you’re fleeing from persecution in a war-torn country, then it’s by no means clear what the right thing to do is. You don’t have time to sit and think, and you don’t have enough information to evaluate a best path. Even if you did have information, the situation is changing rapidly and in unpredictable ways. In such a situation, just picking something to do (eg run to the airport), and re-evaluating your situation at the next appropriate point is the best thing to do. So you act (do something), sense (re-evaluate the situation), and respond (decide what to do next).

The other two categories

Between the two extremes of ‘simple’ and ‘chaotic’ are two more states: ‘complicated’ and ‘complex’.

3) The ‘complicated’ context

The ‘complicated’ situation is one where there is ‘good practice’ (ie there is a ‘good’ – but maybe not ‘best’ – ‘answer’ to the question posed by the situation of what to do), but it requires an expert to analyse the situation first. One might think of an architect called in to design a building. There are site- and client-specific things to consider in a broader context requiring expertise to know how to proceed in a ‘good practice’ way.

In these complicated situations, you ‘sense’ (gather relevant information), ‘analyse’ (work out, using your expertise, what a good solution looks like), then ‘respond’ (proceed with the build).

4) The ‘complex’ context

‘Complex’ sits between ‘complicated’ and ‘chaotic’. It’s not complete chaos, as there is some prior knowledge or experience that can be brought to bear, but even knowing how to get to a good answer is unknown. In these situations, you need to figure out the best way by experimentation. This is called ’emergent practice’, and is appropriately handled by a ‘probe’, ‘sense’, ‘respond‘ decision-making process.

Working in Cloud Native technology transformation, our consultancy often works with businesses used to operating in the complicated or simple areas, and we find that our job involves helping them understand that their previously successful approach does not apply to their current context of ’emergent practice’.

By contrast, when we work with those already operating in the complex space, we generally augment their teams with our technologists, who have more experience of getting towards good practice in Cloud Native than they do. If they are in a chaotic state, then we can use our experience to help them guide their decision making towards the complex or complicated space.

Why is Cynefin such a powerful consulting tool?

So far this all might sound quite trivial. It’s obvious to any reasonable person that your decision-making process needs to be different if you’re fleeing from a war zone than if you’re flying a plane.

What makes Cynefin such a powerful tool for business consulting is that it gives a framework to the common management problem: ‘What got your business here won’t get you there’

If you’ve read the book of the same title, you’ll know that its key message is that as you age in your career, the more self-centred and driven approaches to success that worked for you when younger become less effective as you seek to lead others to collective success. You need to change your approach and whole attitude to work if you want to succeed in your new, more elevated context.

In an analogous way, the success your business had in its earlier states may be completely inappropriate in a new state.

Imagine a startup used to operating in a chaotic or complex context. They get used to ‘just doing it’ and giving their staff freedom to solve problems in new and creative ways. This works really well for them as they grow fast, but after some years they find that their offerings to consumers have matured, and novel and creative solutions result more and more in waste, disruption (of the bad kind) and inefficiency.

What they need at this point might be staff who take a different approach more aligned to ‘good’ or ‘best’ practice. However, when people advocating such approaches arrive, challenging group dynamics can come into play. The tight monoculture that’s worked in the past becomes difficult to question, and those advocating a different style get alienated and leave, or just conform to the prevailing approach.

Similarly, a company used to using ‘best practice’ to solve their challenges may be at a loss as to why their approach does not work. As a consultant, I’m often asked directly “what’s the right answer to this question?”, to which the answer is all too often a disappointing “it depends”. What’s usually happened is that they are seeking a ‘best’ or ‘good practice’ answer to a question where an ’emergent’ or ‘novel practice’ one is called for.

Many of these patterns are documented on our transformation patterns website.

And for you?

It’s not just consulting or work that Cynefin can help with. It’s also worth considering whether you have a preference for a particular way of approaching problems, and whether this preference stops you from acting in an appropriate way.

My wife and I often clash over whether to plan or not: she likes to plan holidays in advance, for example, and I find that process like pulling teeth, preferring to improvise. There are situations where planning is absolutely essential (got kids? you’d better plan activities!), and other more dynamic situations where time spent planning in detail is wasted as circumstances change (don’t book that restaurant weeks in advance, we may change our minds about where we’re going if the weather is bad).

Of course, in these situations, my approach is always best, and I don’t need no decision-making meta-framework to help me.





Is Agility Related to Commitment? – Money Flows Part II

Previously, I wrote about how a software company’s cultural challenges can be traced back to how money flows through it, using the example of an ‘accidental product’ B2B type of business that tries to go from project to product.

In this article, I want to extend that ‘cultural economics’ thinking to the subject of Agility.

An Agile Anecdote

Around 2005, when teams calling themselves agile were the new fashion (and few really knew what it meant), I worked for a company that delivered software to the sports betting industry. The industry is notoriously unregulated and cut-throat, and its combatants were (and probably still are) in a constant arms race to outdo each other with new features to attract and keep their non-sticky customers.

The pressure to deliver fast is exacerbated by another factor: major events don’t move. If you have a great new soccer bet type then it needs to be ready for the World Cup, where a huge number of bets will be placed. Afterwards, it’s effectively commercially worthless. So if a feature isn’t quite ready, even if it’s slightly flawed, you rush it out and try to deal with the fallout later.

In this context, we had a job to do for a client by a deadline and didn’t have the staff to do it. Skills in this area were rare but we found a company ready and seemingly able to do the work. When we sat down with them, they told us they liked to work ‘in an agile way’. We didn’t really know what they meant by this, and ploughed ahead. We were desperate.

We soon hit a big problem. They ended their first sprint, and said they would have to re-evaluate the timelines as they had, in true agile fashion, reflected on their experience and decided it would take longer to deliver than they thought. We explained there was an immovable deadline, and that this was a non-negotiable constraint. They said they were sorry, but they were agile, so that was the deal.

We fired them pretty quickly after that, and managed to do the work in-house. We made a few glib remarks about the lack of agility in their agile approach and moved on.

Does Being Agile Even Make Sense?

Reflecting on this war story the other week while thinking about a consulting client’s challenges, I was prompted to ask whether aiming for true Agility is the right thing.

In software now it can feel like an unquestionable dogma that being Agile is always good, but it’s worth highlighting the situations in which trying to be agile is like fitting a square peg in a round hole.

Before we do that, I want to outline what I think the ideal conditions for agility are, and then we can talk about the common conditions for not being agile.


What is Agility?

I don’t really want to get into a theological discussion of what agility truly is, so I’m going to refer to the original Manifesto only, rather than any of its descendants, and here just emphasise the part that says ‘Responding to change over following a plan’.


Conditions For Agility

The Agile Manifesto’s original themes emphasise internal team agency over external demands.

Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan

https://agilemanifesto.org/

Each of the points ‘valued more’ on the left privileges team autonomy. The team gets to decide when the thing gets built, how it gets built, and even what the thing they’re building consists of. Because the customer can’t be ignored, they are brought into the team rather than faced down over a contract.

The conditions required for team autonomy, and therefore for Agile, can be summed up in one word: trust.

All the values prized in the Agile Manifesto require trust to be nurtured and sustained.

In low-trust environments:

  • Process and tools are centrally mandated
  • Documentation is centrally mandated
  • Collaboration is low – demands are made of teams in a transactional fashion
  • Plans are king. Deviation from the Plan is always a problem rather than a response.

In high-trust environments, all these aspects of building are decided within the team.

Behind this trust must lie one of two relationships between the builders and those who ultimately provide the money (ie the customer). Either:

  • The customer is embedded within the team, or
  • The customer has only an indirect voice on the team’s performance

Implicit in this is a trust model between patron and producer where there are no hard commitments to delivery.

Back to the Iron Triangle

So far we’ve related Agility to trusting the delivery team to deliver, and this means that the customer has to accept that when, how and what will be delivered will change during the project. In other words, the team makes no binding commitment to delivery.

Customers might accept this in two situations: where they are directly involved in this dynamic decision-making, or if the nature of the commitment made to them does not involve what is delivered to them. If Netflix don’t deliver a ‘favourites’ feature on their SmartTV interface by February I can’t sue them – I just stop paying them. It’s up to them what they deliver to me, and up to me whether I pay them.

Either way, the nature of the commitment made to the patron is central to whether Agility is possible. And when we talk about commitments in projects, we talk about the iron triangle.

This old project management saw delineates the areas that one can make commitments to in a project. In other words, these are the things that are usually in contracts: when (time), what (scope), and how much it will cost. All other things being equal, changing the size of any of these areas will affect the quality of the delivery.

Is There A Contract Behind Your Work?

The first question you must ask yourself when thinking about whether it makes sense to try to be Agile is whether, at the root of your efforts, there is a binding commitment to deliver on those three points to whoever is ultimately funding you.

Flowing from this question will be all the behaviours that militate against the innovative thinking that Agility exists to enable:

  • ‘You can’t take longer to deliver this, so you will have to make choices that affect the quality of the delivery’
  • ‘You can’t reduce the scope delivered in the time, you will have to make choices ‘
  • ‘We can’t add more resources to deliver the project (even if that would help), because this would make it unprofitable.’
  • ‘We have to co-ordinate everyone to deliver a specific thing at a specific time, so we need to plan ahead in a waterfall way (though we may call it Agile).’

Binding contracts also prevent the slack that all creative or innovative teams require to work for the longer-term interests of the producing business.

Are You Mining, Or Prospecting?

Another way to look at this question of whether it makes sense to be Agile is to consider whether your work is ultimately thought of as ‘mining’ or ‘prospecting’.

When you mine, you’ve found the value in the ground, and you know how to extract it. The task is therefore simple:

  • Get the value out of the ground
  • As cheaply as possible

When you prospect, you don’t know where the value is, so you have to find it as efficiently as possible. So you have to:

  • Decide where to start looking
  • Search in that area
  • Reflect on what you’ve found
  • Decide where to look next, or start mining

This is Agility in a nutshell: we have some expertise, but we don’t know all the answers, so we will have to prospect for value. We may discover new things along the way and change the outcome through innovative reflection on what we learn as we go. We can’t make commitments: we may strike gold, or find nothing at all.

Implicit in a fixed contract is that the work you are doing is ‘mining’. You’ve promised to get the value out. Whether you destroy your tooling doing it, or exhaust your miners, or break laws is not (formally) the concern of the patron. You’ve committed to deliver the value, and how to do it is your problem.

So Legal Contracts Are the Problem?

Bear in mind that whether a formal signed contract exists is not the sole determinant of whether a commitment prevents you from being Agile. Many organisations have written contracts, but the nature of their relationship is such that the contract is only there as insurance if the already-existing trust between the parties completely breaks down. Commitments can be renegotiated because both sides trust each other, and both sides find a way to satisfy each other’s needs and demands.

As the old saying goes: “if you need to get the contract off the shelf, you’ve both already lost”. Once trust goes, everything falls apart.

By the same token, internal patrons within businesses may commission projects or efforts without a formal contract, but with an implicit contract that signals low trust or little flexibility.

Fundamentally, the question should not be ‘is there a contract?’, rather ‘is there a binding and inflexible commitment?’

If a formal contract exists that defines cost, scope, or time, there is always a danger that the patron believes such a commitment has been made. If no formal contract exists but money is changing hands, then it’s simpler: the duty of the producer is to keep the client happy enough that they keep paying, but how they do that is entirely up to them.

Trust, Commitment, and Contracts

When you hear exhortations to be Agile at your work, think about the aspects of trust, commitment, and contracts, and how they relate to your work:

  • Has a commitment been made (or does the patron believe a commitment has been made)?
  • Does the patron trust the delivery team to deliver in their best interests?
  • Is there a contract ‘behind’ the work?

Considering these questions can help you determine whether Agility is a realistic approach for your working context.

The corollary of this is that if Agility is what you want, you may need to work on these fundamentals first if you really want things to change. Otherwise you will be swimming against the tide.



Five Ansible Techniques I Wish I’d Known Earlier

If you’ve ever spent ages waiting for an Ansible playbook to get through a bunch of tasks so yours can be tested, then this article is for you.

Ansible can be pretty tedious to debug and obscure to develop at times (“What’s the array I need to access the IP address on the en2 interface again?”), so I went looking for various ways to speed up the process, and make it easier to figure out what is going on.

Eventually I found five tools or techniques that can help, so here they are.

These tips go in order from easiest to hardest to implement/use.

1) --step

This is the simplest of the techniques to implement and follow. Just add --step to your ansible-playbook command, and for each task you run you will get a prompt that looks like this:

PLAY [Your play name] ****************************************************************************************
Perform task: TASK: Your task name (N)o/(y)es/(c)ontinue:

For each task, you can choose to run the task (yes), not run the task (no, the default), or run the rest of the play (continue).

Note that continue will run until the end of the play, not the end of the entire run. Quite handy if you know you want to say yes to everything left in the current play.

The downside is that if there are many tasks to get through, you have to be careful not to keep your finger on the return key and accidentally go too far.

It would be a nice little open source project for someone to make this feature more powerful, adding ‘back’ and ‘skip this playbook’ features.

2) Inline logging

In addition to runtime control, you can use old-fashioned log lines to help determine what’s going on. The following snippet of code will ‘nicely’ dump out json representations of the variables set across all the hosts. This is really handy if you want to know where Ansible has some information you want to reference in your scripts.

- name: dump all
  hosts: all
  tasks:
    - name: Print some debug information
      vars:
        msg: |
          Module Variables ("vars"):
          --------------------------------
          {{ vars | to_nice_json }}
          ================================

          Environment Variables ("environment"):
          --------------------------------
          {{ environment | to_nice_json }}
          ================================

          Group Variables ("groups"):
          --------------------------------
          {{ groups | to_nice_json }}
          ================================

          Host Variables ("hostvars"):
          --------------------------------
          {{ hostvars | to_nice_json }}
          ================================
      debug:
        msg: "{{ msg.split('\n') }}"
      tags: debug_info
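
If you drop that play into a file of its own (say dump_vars.yml; the name is arbitrary), you can run just this task via its tag:

$ ansible-playbook -i hosts.yml dump_vars.yml --tags debug_info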

As you’ll see later, you can also interrogate the Python environment interactively…

3) Run ansible-lint

As with most linters, ansible-lint can be a great way to spot problems and anti-patterns in your code.

Its output includes lines like this:

roles/rolename/tasks/main.yml:8: risky-file-permissions File permissions unset or incorrect

You configure it with a .ansible-lint file, where you can suppress classes of error entirely, or downgrade them to warnings.

The list of rules is available here, and more documentation is available here.
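
As a minimal sketch of that configuration (the rule names here are just examples):

# .ansible-lint
skip_list:
  - risky-file-permissions   # ignore this rule entirely
warn_list:
  - no-changed-when          # report it, but don't fail the run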



4) Run ansible-console

This can be a huge timesaver when developing your Ansible code, but unfortunately there isn’t much information or guidance out there on how to use it, so I’m going to go into a bit more depth here.

The simplest way to run it is just as you would a playbook, but with console instead of playbook:

$ ansible-console -i hosts.yml
Welcome to the ansible console.
Type help or ? to list commands.
imiell@all (1)[f:5]$

You are greeted with a prompt and some advice. If you type help, you get a list of all the commands and modules available to you to use in the context in which you have run ansible-console:

Documented commands (type help <topic>):
========================================
EOF             dpkg_selections  include_vars   setup
add_host        exit             iptables       shell
apt             expect           known_hosts    slurp
apt_key         fail             lineinfile     stat
apt_repository  fetch            list           subversion
assemble        file             meta           systemd
assert          find             package        sysvinit
async_status    forks            package_facts  tempfile
async_wrapper   gather_facts     pause          template
become          get_url          ping           timeout
become_method   getent           pip            unarchive
become_user     git              raw            uri
blockinfile     group            reboot         user
cd              group_by         remote_user    validate_argument_spec
check           help             replace        verbosity
command         hostname         rpm_key        wait_for
copy            import_playbook  script         wait_for_connection
cron            import_role      serial         yum
debconf         import_tasks     service        yum_repository
debug           include          service_facts
diff            include_role     set_fact
dnf             include_tasks    set_stats

You can ask for help on these. If it’s a built-in command, you get a brief description, eg:

imiell@all (1)[f:5]$ help become_user
Given a username, set the user that plays are run by when using become

or, if it’s a module, you get a very handy overview of the module and its parameters:

imiell@all (1)[f:5]$ help shell
Execute shell commands on targets
Parameters:
  creates A filename, when it already exists, this step will B(not) be run.
  executable Change the shell used to execute the command.
  chdir Change into this directory before running the command.
  cmd The command to run followed by optional arguments.
  removes A filename, when it does not exist, this step will B(not) be run.
  warn Whether to enable task warnings.
  free_form The shell module takes a free form command to run, as a string.
  stdin_add_newline Whether to append a newline to stdin data.
  stdin Set the stdin of the command directly to the specified value.

Where the console comes into its own is when you want to experiment with modules quickly. For example:

imiell@basquiat (1)[f:5]$ shell touch /tmp/asd creates=/tmp/asd
basquiat | CHANGED | rc=0 >>

imiell@basquiat (1)[f:5]$ shell touch /tmp/asd creates=/tmp/asd
basquiat | SUCCESS | rc=0 >>
skipped, since /tmp/asd exists

If you have multiple hosts, it will run across all those hosts. This is a great way to broadcast commands across a wide range of hosts.

If you want to work on specific hosts, then use the cd command, which (misleadingly) changes your host context rather than directory. You can choose a specific host, or a group of hosts. By default, it uses all:

imiell@all (4)[f:5]$ cd basquiat
imiell@basquiat (1)[f:5]$ command hostname
basquiat | CHANGED | rc=0 >>
basquiat

If a command doesn’t match an Ansible command or module, it assumes it’s a normal shell command and runs it through one of the Ansible shell modules:

imiell@basquiat (1)[f:5]$ echo blah
basquiat | CHANGED | rc=0 >>
blah

The console has autocomplete, which can be really handy when you’re playing around:

imiell@basquiat (1)[f:5]$ expect <TAB><TAB>
chdir=      command=    creates=    echo=       removes=    responses=  timeout=
imiell@basquiat (1)[f:5]$ expect

5) The Ansible Debugger

Ansible also contains a debugger that you can use to interrogate a running Ansible process. In this example, create a file called playbook.yml containing the following play, add the play to an existing playbook, or set debugger: on_failed on an existing play:

- hosts: all
  debugger: on_failed
  gather_facts: no
  tasks:
    - fail:

$ ansible-playbook playbook.yml
PLAY [all] ***************************
TASK [fail] **************************
Friday 27 August 2021  12:16:24 +0100 (0:00:00.282)       0:00:00.282 *********
fatal: [Ians-Air.home]: FAILED! => {"changed": false, "msg": "Failed as requested from task"}
[Ians-Air.home] help
EOF  c  continue  h  help  p  pprint  q  quit  r  redo  u  update_task

From there, you can execute Python statements directly to examine the context:

[Ians-Air.home] TASK: wrong variable (debug)> dir()
['host', 'play_context', 'result', 'task', 'task_vars']

Or use the provided commands to help you debug. For example, p maps to a pretty-print command:

[Ians-Air.home] TASK: wrong variable (debug)> p dir(task)
['DEPRECATED_ATTRIBUTES',
 '__class__',
 '__delattr__',
 '__dict__',
 '__doc__',
[...]
 'tags',
 'throttle',
 'untagged',
 'until',
 'validate',
 'vars',
 'when']
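
A typical debugger workflow, sketched here with a hypothetical variable name, is to fix up the failing task’s variables in place, re-template the task with update_task, and then re-run it with redo:

[Ians-Air.home] TASK: wrong variable (debug)> task_vars['myvar'] = 'corrected value'
[Ians-Air.home] TASK: wrong variable (debug)> update_task
[Ians-Air.home] TASK: wrong variable (debug)> redo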
