This post (Part I) walks you through a Git repo I put together to allow anyone to fine-tune a model to write emails in their own ‘voice’.
I did this to use a small model that I could run at home to help me draft emails that didn’t suck.
You don’t need a hefty GPU, but you will need a credit card to rent one on vast.ai for the training (note: affiliate link – I will benefit if you sign up and spend). It costs me about $1 per run to create a useful model that I can use on my home (non-GPU) server.
Part II will dive deeper into how the script works and what’s going on.
What It Does
At the very highest level, the script creates a new large language model (LLM), building from a base model, and training on sample emails you give it. Here’s some sample output (truncated to 40 tokens):
INPUT: Write an email to a friend arranging a drink: OUTPUT: Hey, how are you? I was thinking of going out for a drink this evening. Are you free? Let me know what time and where we could meet. Ian On 10 Jul
You can see how it’s put my name (Ian) at the bottom and a (spurious) date, because that’s what a typical email from me looks like.
Compare this to the ‘base’ model this was built from (Qwen/Qwen2.5-3B-Instruct), which gives a far more ‘generic’ and formal response (also truncated to 40 tokens):
INPUT: Write an email to a friend arranging a drink: OUTPUT: The date is 10/24/10 at 6:00pm. Subject: Exciting Plans for Our Drink Date!
Dear [Friend's Name],
I hope
That, by contrast, does not sound like me at all.
…in a bit more detail
Here’s what the script (run_fine_tune.sh) does:
Picks the cheapest available GPU from vast.ai that fits the criteria needed, and provisions it
Copies up a script to run on the machine (build_and_push_model.sh), the gmail samples you want to train it on (compressed), and a huggingface token (to store the resulting models)
Runs build_and_push_model.sh on the GPU, which:
Installs the needed software (LLaMA-Factory, llama.cpp)
Logs into huggingface
Creates the huggingface repos (if needed)
Trains the new model
Merges the new model
Converts the new model to .gguf format (allowing it to be run in llama.cpp)
Quantizes the new model (so it can be run more easily on a non-GPU machine); these last two steps are sketched just after this list
Pushes the merged model, and the .gguf, to huggingface repos
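To give a flavour of those last two steps, the conversion and quantization look roughly like this (a hedged sketch rather than the exact commands in build_and_push_model.sh; llama.cpp’s script and binary names have changed between releases, and the paths and filenames here are illustrative):

# convert the merged Hugging Face-format model into a single .gguf file
python llama.cpp/convert_hf_to_gguf.py ./merged-model --outfile email-model-f16.gguf
# quantize it (here to 4-bit) so it runs comfortably on a CPU-only home server
./llama.cpp/llama-quantize email-model-f16.gguf email-model-q4_k_m.gguf Q4_K_M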
Use Google Takeout to download your sent email to a .mbox file. You probably only need your ‘Sent’ emails, as they contain your responses. If you don’t use Gmail, that’s fine: the script just expects a .mbox file. Then run the Python script to extract the emails and put them in the correct format, and compress the result with xz.
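In practice the preparation step looks something like this (a sketch: the extraction script name and output filename here are illustrative rather than the repo’s exact ones):

# pull the plain-text bodies out of the Takeout .mbox and write them in the training format
python extract_emails.py Sent.mbox > train_data.json
# compress the result ready to be copied up to the GPU instance
xz -9 train_data.json    # produces train_data.json.xz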
The book’s primary goal is to help those engineers and managers whose efforts to change their organisations with technology hit the buffers as they scale. The direct application of sweat and the resultant problem-solving works well when you’re a more junior engineer, but as you try to make more fundamental and wide-ranging change, it seems to work less and less effectively. This book is intended to cast light on the invisible barriers that block your progress, and give you the tools to understand how to change them.
Its secondary goal is to be a more practical guide to making change within technical organisations, and even across wider businesses. While many books and blog posts exist on what the ideal state of various aspects of your organisation (culture, team structure, test practices, platform technologies etc) should be, not enough of them give practical advice on how to get there, which is a far messier process. It might even help your career by allowing you to see why – and just how challenging – it will be to get there, and perhaps to consider setting yourself up for success elsewhere.
I hope the book will distil the experiences I’ve had over the past five years as a Cloud Native consultant with Container Solutions, and the last 30 years as an engineer, to give the reader a head start in navigating these murky waters.
The book will be replete with practical advice and real-world case studies to illustrate and provide inspiration (and caution) about what lies ahead for those that want to make positive change.
I hope you’ll join me on this journey. If these themes resonate, stay tuned. I’ll be sharing updates, excerpts, and insights here along the way. In the meantime, this video (‘How Money Flows Trump Conway’s Law’) shares some of my more embryonic thoughts from a couple of years ago.
Your thoughts, questions, and experiences are always welcome!
This post is aimed at those interested in continuous compliance, an extension of cloud native principles to the area of software compliance, an under-developed field of software automation.
My consultancy, Container Solutions, is working on an open source project to help automate the management and reporting of controls, and this post arose from that work.
OSCAL?
OSCAL stands for “Open Security Controls Assessment Language”.
It’s a standard maintained by NIST. Like many standards, it’s big and full of jargon, difficult for outsiders to grok, and obviously designed by committee, but trying to encompass an area like this across a whole set of industries is hard, and there’s no other game in town.
The OSCAL ‘Schema’
OSCAL was created to be a flexible but complete standard for exchanging information about security controls.
It’s designed to define machine-readable documents that represent different parts of the controls lifecycle, such as assessments, risk identification, and ‘plans of action’. The idea is that participants and software within this lifecycle can exchange information using a common, machine-readable language.
For example, a regulator might publish “Control Catalogs” using OSCAL as a document format. These publications can then be read by software used by various parties involved in compliance to facilitate their work.
To give you an idea, here’s an example: a minimal, edited OSCAL document from the examples repo. It represents an Assessment Plan, one of the phases in the control lifecycle:
You can imagine how that document could be read by a desktop application which presents this information in a way that can be used by someone about to perform an assessment of controls. Note that the uuids here can be referenced by entities in other documents, and that some of them could refer to entities in other documents. This will be relevant later.
Other entities modelled by OSCAL include:
Catalogs (lists of controls and their details)
Profiles (sets of controls derived from catalogs)
Implementations (details of how controls are applied)
Assessment Plans (how the controls will be tested)
Assessment Results (findings from the assessment process)
Plan of Action and Milestones (corrective actions for addressing findings)
and here is a visual representation from the OSCAL site:
The Problem
For someone old like me who is used to a more monolithic and relation-oriented schema, OSCAL feels odd. Different documents at different levels may or may not have elements that relate to one another, and might be completely independent. For example, a control definition that exists in one doc might be referenced in another by sharing the same UUID, but there’s no definition of whether such a relation needs to exist or not.
However, both in practice and in the examples given by NIST, the data ‘feels’ relational.
Visualising OSCAL with Neo4J
While working on this, a colleague mentioned this talk from Alexander Koderman (slides), where he demonstrated how to use Neo4J to visualise OSCAL documents. Along the way he discovered errors in the documents, and could relate, isolate, and more easily visualise the parts of the documents he was looking at.
Building on this work, I set up a repo to run up a Neo4J instance in docker, after which I could play with the nodes in an interactive way. This made learning about OSCAL and the relations between the nodes a lot more fun.
An OSCAL example System Security Plan (highlighted) and its related nodes. Where possible, size and colour have been used to distinguish different types of node. The type of node can be inferred from the relation described between them in the arrows, but if you click on them, the details can be reviewed.
Catalog
An OSCAL example catalog (highlighted) and its related nodes.
Assessment Plan and Plan of Action and Milestones
This graph shows how the Plan of Action and Milestones (highlighted, large brown node) is related to the Assessment Plan (the large orange node). Tracing through, you can see that ‘Assessment Plan’ nodes are related to ‘Assessment Result’ nodes via ‘Party’ nodes and ‘Role’ nodes.
Code
The code for running this on your own machine on Docker, along with the above images, is available here. Simply clone it and run make full-rebuild and you will see instructions on how to get the same views on the OSCAL examples.
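If you’d rather see roughly what’s going on under the hood, this is the kind of thing the Makefile automates (a hedged sketch, not the repo’s exact commands; the APOC plugin environment variable name differs between Neo4J image versions, and the password here is illustrative):

# run a throwaway Neo4J with the APOC plugin, which is handy for loading JSON documents
docker run --rm -d --name oscal-neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/oscal-demo-password \
  -e NEO4J_PLUGINS='["apoc"]' \
  neo4j:5
# then browse to http://localhost:7474, log in, and start exploring with Cypher queries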
If you don’t already know, Crossplane is billed as an:
Open source, CNCF project built on the foundation of Kubernetes to orchestrate anything. Encapsulate policies, permissions, and other guardrails behind a custom API line to enable your customers to self-service without needing to become an infrastructure expert.
Another way to view Crossplane is as a tool that uses a commodity, open source, and well-supported control plane (Kubernetes) to support the creation of other control planes.
We’ve been using it at Container Solutions for a while, and have recently been talking about how we think it’s going to become more important in future:
Just as IBM buys Terraform, Crossplane seems to be becoming a default for our client engagements.
Recently I’ve been watching Viktor Farcic’s fantastic set of tutorial videos on Crossplane. If you are an engineer or interested architect, then these primers are ideal for finding out what’s going on in this area.
While following Viktor’s work I saw another Crossplane-related video by Viktor on a subject we both seem to get asked about a lot: does Crossplane replace Terraform/Ansible/Chef/$Tool?
This is a difficult question to answer briefly (aside from just saying “yes and no”), because understanding the answer requires you to grasp what is new and different about Crossplane, and what is not. It doesn’t help that – from the user point of view – they can seem to do the exact same thing.
To get to the answer, I want to reframe a few things Viktor says in that video that confused me, in the hope that the two pieces of content taken together help people understand where Crossplane fits into the Cloud Native firmament. Although Viktor and I agree on the role Crossplane plays now and in the future, we do differ a little on defining and interpreting what is new about Crossplane, and how the industry got here.
This post follows the logic of our ‘Cloud Native Family Tree’, which seeks to explain the history of devops tooling. It’s recently been updated to include Crossplane.
Three Questions
Before we get into the debate, we may want to ask ourselves two deceptively simple questions:
What is an API?
What is a cloud service?
And one non-simple question:
What is a control plane?
Understanding exactly what the answers to these are is key to defining what’s new and useful about Crossplane.
If you think you know these already, or aren’t interested in the philosophy, skip to the end.
APIs are mechanisms that enable two software components to communicate with each other using a set of definitions and protocols.
Nowadays, most people think of an API as a set of services you can call using technologies like HTTP and JSON. But HTTP and JSON (or YAML, or XML etc) are not necessary. In this context, I like to explain to people that the venerable mkdir command is an API. mkdir conforms in these ways:
enables two software components to communicate with each other [the shell and the Linux API] using
a set of definitions [the standard flags to mkdir] and
protocols [shell standard input/output and exit codes]
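You can see the ‘protocol’ part of that at the shell yourself: the exit code and the error stream are the contract mkdir offers to whatever calls it:

mkdir /tmp/example && echo "created"     # first run: exit code 0, nothing on stderr
mkdir /tmp/example; echo "exit code: $?" # second run: 'File exists' on stderr, non-zero exit code
# a calling script can branch on that exit code just as it would on any other API response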
Pretty much all code is something that calls an API, down to whatever goes on within the hardware (which, by definition, isn’t a software component). Technically speaking, code is “APIs all the way down”. But this isn’t a very helpful definition if it essentially describes all code.
APIs all the way down: The Linux API calls that mkdir makes in order to create a folder.
It might be argued that mkdir is not an API because it is used by humans, and not for ‘two software components to communicate’. However, you can telnet to a server and call its API by hand (I used to do this via HTTP a lot when debugging). Further, mkdir can be (and is also designed to be) used within scripts.
APIs are Stable
What people really want and expect from an API is stability. As a rule, the lower down the stack an API is, the more stable it needs to be. The Intel x86 API has had very few breaking changes since it came into being in 1978, and even carries over idiosyncrasies from the Datapoint 2200 terminal of 1970 (such as the 8086’s ‘little endian’ design). Similarly, the Linux Kernel API has also had very few changes (mostly removals) since version 2.6‘s release over 20 years ago (2003).
The Linux CLI, by contrast, is much less stable. This is one of the main reasons shell scripts get such a bad rep. They are notoriously difficult to write in such a way that they can be run on a wide variety of different machines. Who knows if the ifconfig command in my shell script will run in your target shell environment? Even if it’s installed and on the $PATH, and not some other command with the same name, will it have the same flags available? Will those flags do the same thing consistently? Defensively writing code against these challenges is probably the main reason people avoid writing shell scripts, alongside the ease with which you can write frighteningly broken code.
This is why tools like Ansible came into being. They abstracted away the messiness of different implementations of configuration commands, and introduced the notion of idempotence to configuration management. Rather than running a mkdir command which might succeed or fail, in Ansible you simply declare that the folder exists. This code will create a folder on ‘all’ your defined hosts.
- hosts: all
  tasks:
    - name: Create a folder
      file:
        path: /path/to/your/folder
        state: directory
Ansible will ssh into them and create the folder if it doesn’t already exist, running mkdir, or whatever it needs to run to get the Linux API to deliver an equivalent result.
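The same declaration can be made ad hoc from the command line (this assumes an inventory that defines the ‘all’ group):

# idempotently ensure the folder exists on every host in the inventory
ansible all -m ansible.builtin.file -a "path=/path/to/your/folder state=directory"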
Viktor says that Ansible, Chef et al focussed on ‘anything but APIs’, and this is where I disagree. They did focus on APIs, but not http-based (or ‘modern’) APIs; they simplified the various command line APIs into a form that was idempotent and (mostly) declarative. Just as mkdir creates a new API in front of the Linux API, Ansible created a means to use (or create your own) APIs that simplified the complexity of other APIs.
Terraform: An Open Plugin and Cloud First Model
Terraform not only simplified the complexity of other APIs, but then added a rich and open plugin framework and a ‘cloud first’ model (as opposed to Ansible’s ‘ssh environment first’ model). In theory, there was no reason that Ansible couldn’t have done the same things Terraform did, but Ansible wasn’t designed for infrastructure provisioning the way Terraform was (as Viktor points out).
This begs the second question: if Terraform was ‘cloud first’…
What is a Cloud Service?
Many people think of a cloud service as something sold by one of the big three hyperscalers. In fact, a cloud service is the combination of three things:
A remote network connection
An API
A delegation of responsibility to a third party
That’s it. That’s all a cloud service is.
We’ve already established that an API (as opposed to just ‘running software’) is a stable way for two software components to communicate. Cloud simply takes this and places it on the network. Finally – and crucially – it devolves responsibility for delivering the result to a third party.
So, if I ask my Linux desktop (y’know, the one literally on my desk) for more memory, and it can’t give it to me because it’s run out, then that’s my responsibility to resolve, therefore it’s not a cloud service.
A colo is not a cloud service, for example, because the interface is not an API over a network. If I want a new server I’ll send them an email. If they add an API they become a cloud service.
This table may help clarify:
                                     | Remote Network Connection | API | Delegation of Responsibility
Abacus                               | No                        | No  | No
Linux Server                         | Yes                       | No  | No
mkdir CLI Command on Desktop Linux   | No                        | Yes | No
Outsourced On-Prem Server            | No                        | No  | Yes
Self-managed API service             | Yes                       | Yes | No
Windows Operating System on Desktop  | No                        | Yes | Yes
Colo Server                          | Yes                       | No  | Yes
AWS EKS                              | Yes                       | Yes | Yes
GitHub                               | Yes                       | Yes | Yes
An abacus is a simple calculation tool that doesn’t use a network connection, has an interface (moving beads), but not an API, and if the abacus breaks, that’s your problem.
A Linux server has a remote network connection, but no API for management. (SSH and the CLI might be considered an API, but it’s certainly not stable).
mkdir has an API (see above), but from mkdir‘s point of view, disk space is your problem.
If you engage a company to supply you with an on-prem server, then it’s their problem if it breaks down (probably), but you don’t generally have an API to the outsourcer.
If you build your own API and manage it yourself then you can’t pick up the phone to get it fixed if it returns an error.
If the Windows API breaks (and you paid for support), then you can call on Microsoft support, but the Windows API doesn’t need a network connection to invoke.
A colo server is supplied by you without an API, but if it doesn’t get power/bandwidth/whatever else the colo supports, you can get them to fix it, and can connect to it over the network.
Some of these may be arguable on detail, but it’s certainly true that only EKS and GitHub qualify as ‘cloud services’ in the above table, as they fulfil all three criteria for a cloud service.
What is a Control Plane?
A less commonly understood concept that we also need is the ‘control plane’. The phrase comes from network routing, which divides the router architecture into three ‘planes’: the ‘data plane’, the ‘control plane’, and the ‘management plane’.
In networking, the data plane is the part of the software that processes the data requests. By contrast, the control plane is the part of the software that maintains the routing table and defines what to do with incoming packets, and the management plane handles monitoring and configuration of the network stack.
You might think of the control plane as the state management of the data that goes through the router, as opposed to the general management and configuration of the system (management plane).
This concept has been co-opted by other technologies, but I haven’t been able to find a formal definition of what a control plane is when used outside of networking. I think of it as ‘whatever manages how the useful work will be done by the thing’ rather than the thing that does the actual work. If that doesn’t seem like a rigorous definition to you, then I won’t disagree.
For Kubernetes, the control plane is the etcd database and the core controllers that make sure your workloads are appropriately placed and running.
All cloud services need a control plane. They need something that orchestrates the delivery of services to clients. This is because they have a remote API and a delegation of responsibility.
So Does Crossplane Replace Terraform?
OK, now we know what the following things are:
APIs
Cloud services
Control planes
We can more clearly explain how Crossplane and Terraform (et al) relate.
Resources, APIs, Cloud Services
Crossplane and Terraform both deal with the creation of resources, and are both designed to help manage cloud services. In this sense, Crossplane can replace Terraform. However…
‘One-shot’ vs Continuous
…whereas Terraform is ‘one-shot‘ (you run it once and then it’s done), Crossplane is continuous. Part of its job is to provision resources, but that’s not its only job. Its design, and main purpose, is to give you a framework to ensure that resources remain in a ‘known state’, ultimately deriving its source of truth from the configuration of its own Kubernetes control plane (or Git, if this configuration is synchronised with a Git repository).
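To make ‘continuous’ concrete, here’s a hedged sketch; it assumes kubectl is pointing at a cluster with Crossplane and a provider installed, and the resource kind and name are illustrative:

# list everything Crossplane is currently reconciling (providers register their CRDs
# under the 'managed' category)
kubectl get managed
# watch one resource; if someone changes or deletes the real cloud resource out-of-band,
# the provider notices the drift on its next reconcile and moves it back to the declared state
kubectl get bucket my-example-bucket --watch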
Terraform ‘Under’ Crossplane?
If you want, you can run your Terraform code in Crossplane with the Terraform provider. One thing to note here, though, is that you can’t just take your existing Terraform code or other shell scripts and run them unchanged ‘within’ Crossplane’s control plane just as you would have done before. Some work will need to be done to integrate the code to run under Crossplane’s control. In this sense, Crossplane does replace Terraform, subsuming the code into its own provider.
Control Planes
In a way, Crossplane is quite close to Chef and Puppet. Both those tools had ‘control planes’ (the Chef and Puppet servers) that ensured the targets were in a conformant state. However, Chef and Puppet (along with Ansible) were designed to configure individual compute environments (physical servers, VMs etc), and not orchestrate and compose different APIs and resources into another cloud service-like API.
When I started my career as an engineer in the early noughties, I was very keen on developer experience (devex).
So when I joined a company whose chosen language was TCL (no, really), I decided to ask the engineering mailing list what IDEs they used. Surely the senior engineers, with all their wisdom and experience, would tell me which of the many IDEs available at the time made them the most productive? This was decades before ‘developer experience’ had a name, but nonetheless it was exactly what I was talking about, and what people fretted about.
Minutes later, I received a terse one-word missive from the CTO:
From: CTO
Subject: IDEs
> Which IDEs do people recommend using here? Thanks,
> Ian
vim
Being young and full of precocious wisdom, I ignored this advice for a while. I installed Eclipse (which took up more RAM than my machine had, and quickly crashed it), and settled on Kate (do people still use Kate?). But eventually I twigged that I was no more productive than those around me that used vim.
So, dear reader, I married vim. That marriage is still going strong twenty years later, and our love is deeper than ever.
This pattern has been repeated multiple times with various tools:
see the fancy new GUI
try it
gradually realise that the command-line, text-only, steep-learning-curve, low-tech approach is the most productive
Some time ago it got to the point where, when someone shows me the GUI, I ask where the command-line version is, as I’d rather use that. I often get funny looks both at this and when I say I’d rather not use Visual Studio if possible.
I also find looking at gvim makes me feel a bit queasy, like seeing your dad dancing at a wedding.
This is not a vim vs not-vim post. Vim is just one of the oldest and most enduring examples of an approach which I’ve found has served me better as I’ve got older.
I call this approach ‘low-tech devex’, or ‘LTD’. It prefers:
Long-standing, battle-hardened tools that have stood the test of time
Tools with stable histories
Small tools with relatively few dependencies
Text-based input and output
Command-line approaches that exemplify unix principles
Tools that don’t require daemons/engines to run
LTD also is an abbreviation of ‘limited’, which seems appropriate…
Anyone else out there love tools for 'low tech devex'?
I have grown to love shell, make, vim, docker over the years and ignore all the gui-based fashions that tempt.
All LTD tools arguably reflect the aims and principles of the UNIX philosophy, which has been summarised as:
Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
Portability
LTD is generally very portable. If you can use vim, it will work in a variety of situations, both friendly and hostile. vim, for example, can easily be run on Windows, MacOS, Linux, or even that old HPUX server which doesn’t even run bash, yet does run your mission-critical database.
Speed
These tools tend to run fast when doing their work, and take little time to start up. This can make a big difference if you are in a ‘flow’ state when coding. They also tend to offer fewer distractions as you’re working. I just started up VSCode on my M2 Mac in a random folder with one file in it, and it took over 30 seconds until I could start typing.
Power
Although the learning curve can be steep, these tools tend to allow you to do extremely powerful things. The return on the investment, after an initial dip, is steep and long-lasting. Just ask any emacs user.
Composability / Embeddability
Because these tools have fewer dependencies, they tend to be more easily composable with – or embeddable in – one another. This is a consequence of the Unix philosophy.
For example, using vim inside a running Docker container is no problem, but if you want to exec onto your Ubuntu container running in production and run VSCode quickly and easily, good luck with all that. (Admittedly, VSCode is a tool I do use occasionally, as it’s very powerful for certain use cases, very widely used, and seems designed to make engineers’ lives easier rather than collect a fee. But I do so under duress, usually.)
And because they also typically use standard *nix conventions, these tools can be chained together to produce more and more bespoke solutions quickly and flexibly.
Users’ Needs Are Emphasised
LTD tools are generally written by users and for users’ needs, rather than for any purchaser’s needs. Purchasers like fancy GUIs and point-and-click interfaces, and ‘new’ tools. Prettiness and novelty don’t help the user in the long term.
More Sustainable
These tools have been around for decades (in most cases), and are highly unlikely to go away. Once you’ve picked one for a task, it’s unlikely you’ll need a new one.
More Maintainable
Again, because these tools have been around for a long time, they tend to have a very stable interface and feature set. There’s little more annoying than picking up an old project and discovering you have to upgrade a bunch of dependencies and even rewrite code to get your project fired up again (I’m looking at you, node).
Here’s an edited example of a project I wrote for myself which mirrors some of the projects we have built for our more engineering-focussed clients.
Developer experience starts with a Makefile. Traditionally, Makefiles were used for compiling binaries efficiently and accounting for dependencies that may or may not need updating.
In the ‘modern’ world of software delivery, they can be used to provide a useful interface for the engineer for what the project can do. As the team works on the project they can add commands to the list for tasks that they often perform.
docker_build: ## Build the docker image to run in
	docker build -t data-scraper . | tee /tmp/data-scraper-docker-build
	docker tag data-scraper:latest docker.io/imiell/data-scraper

get_latest: docker_build ## Get latest data
	@docker run \
		-w="/data-scraper" \
		--user="$(shell id -u):$(shell id -g)" \
		--volume="$(PWD):/data-scraper" \
		--volume="$(HOME)/.local:/home/imiell/.local" \
		--volume="$(HOME)/.bash_history:/home/imiell/.bash_history" \
		--volume="/etc/group:/etc/group:ro" \
		--volume="/etc/passwd:/etc/passwd:ro" \
		--volume="/etc/shadow:/etc/shadow:ro" \
		--network="host" \
		--name=get_latest_priority \
		data-scraper \
		./src/get-latest.sh
It’s a great way of sharing ‘best practice’ within a team.
I also have a make.sh script which wraps calls to make with features such as capturing logs in a standard format and cleaning up any left-over files once a task is done. I use that script in a crontab in production like this:
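Something along these lines (the schedule and paths are illustrative, not the real production entry):

# crontab entry: run the scrape hourly via the make.sh wrapper, which handles logging and cleanup
0 * * * * /home/imiell/data-scraper/make.sh get_latest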
A challenge with make is that it has quite a steep learning curve when it comes to writing Makefiles, and an idiosyncratic syntax (to younger eyes, at least).
Whenever I learn a new technology an old one falls out of my brain.
However, while it can seem tricky to write Makefiles, it’s relatively easy to use them, which makes them quite a useful tool for getting new team members onboarded quickly. And for my personal projects – given enough time passing – new team members can mean me! If I return to a project after a time, I just run make help and I can see instantly what my developer workflow looked like.
Here’s an example make help on a project I’m working on:
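The listing is produced by the standard ‘self-documenting Makefile’ trick: a help target that scrapes the ## comments you can see on docker_build and get_latest above (a sketch, not necessarily the exact rule in my Makefile):

help: ## Show this help
	@grep -E '^[a-zA-Z0-9_-]+:.*## ' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*## "} {printf "%-20s %s\n", $$1, $$2}'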
This means if I have to onboard a new engineer, they can just run make help to determine what actions they can perform, and because most of the tasks run in containers or use standard tooling, it should work anywhere the command line and Docker are available.
This idea is similar to Dagger, which seeks to be a portable CI/CD engine where steps run in containers. However, running Dagger on Kubernetes requires privileged access and carries with it a Dagger engine, which requires installation and maintenance. make requires none of these moving parts, which makes maintenance and installation significantly simpler. The primitives used in Make can be used in your CI/CD tool of choice to create a similar effect.
What Low-Tech Devex Tools Should You Know About?
Here is a highly opinionated list of LTD tooling that you should know:
shell
Mastery of shells is essential to LTD; they underpin almost all of the other tools here.
vim/emacs
Already mentioned above, vim is available everywhere and is very powerful and flexible. Other similar editors are available and inspire similar levels of religious fervour.
make
Make is the cockroach of build tools: it just won’t die. Others have come and gone, some are still here, but make is always there, does the job, and can’t go out of fashion because it’s always been out of fashion.
However, as noted above, the learning curve is rather steep. It also has its limits in terms of whizz-bang features.
Docker
The trendiest of these. I could talk instead about chroot jails and how they can improve devex and delivery a lot on their own, but I won’t go there, as Docker is now relatively ubiquitous and mature. Docker is best enjoyed as a command line interface (CLI) tool, and its speed and flexibility, composed with some Makefiles or shell scripts, can improve developer experience enormously.
tmux / screen
When your annoying neighbour bangs on about how the terminal can’t give you the multi-window joy of a GUI IDE, crack your knuckles and show them one of these tools. Not only can they give you windowing capabilities, they can be manipulated faster than someone can move a mouse.
I adore tmux and key sequences like :movew -r and CTRL+B z CTRL+B SPACE are second nature to me now.
git
Distributed source control at the command line. ’nuff said.
curl
cURL is one of my favs. I prefer to use it over GUI tools like Postman, as you can just save command lines in Markdown files for how-to guides and playbooks. And it’s super useful when you want to debug a network request by ‘exporting as cURL command’ from Chrome devtools.
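For example, the kind of one-liner that ends up in a Markdown playbook (the URL and header here are illustrative):

# check a service is healthy and pretty-print the JSON response
curl -fsS -H 'Accept: application/json' https://api.example.com/v1/health | jq .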
cloud shell
Cloud providers offer in-browser terminals with the provider’s CLI installed. This avoids the need to configure keys (if you are already logged in) and to worry about the versioning or dependencies of the CLI. I agree with @hibri here.
This post was originally triggered – and I choose that word carefully – by a recent experience on a cloud cost-optimisation project. These experiences prompted me to consider how things had changed since I started working in software.
As part of the project that provoked me, I was researching the topic of cloud cost analysis and was struck by how many people complained that the tools the big three providers give you are not adequate to the task. I had a look myself and indeed found that the questions I wanted answering were difficult to answer using the GUI tools.
No problem, I thought: I’ll just download the raw CSVs and find an open source project that sticks the data in a database, allowing me to write whatever queries I want on the data. After some searching I couldn’t find any such project, so I wrote one myself and stuck it on GitHub. Once this was built I could write all sorts of queries like ‘Compare costs per GCP service/project/sku with previous month, showing any percentage change greater than x’, modifying them at will to home in on the specific data I was interested in much faster and more effectively than could be achieved with GUIs. Some of the more generic ones are in the repo.
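The repo loads the CSVs into a proper database, but just to illustrate the kind of query, here’s a hedged sketch run directly over an exported CSV with DuckDB (the column names are hypothetical – real GCP/AWS billing exports differ):

# month-on-month cost per service, straight from the billing export
duckdb <<'SQL'
SELECT service,
       SUM(cost) FILTER (WHERE invoice_month = '2024-05') AS this_month,
       SUM(cost) FILTER (WHERE invoice_month = '2024-04') AS last_month
FROM read_csv_auto('billing.csv')
GROUP BY service
ORDER BY this_month DESC;
SQL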
While working on this with a very bright and knowledgeable younger colleague, I was struck by the fact that he’d never needed to learn SQL, and he made sense of these queries by comparing them to other JSON-based *QLs that he’d come across (PromQL, GraphQL etc). This surprised me.
The Good Old Days
I’m going to grossly simplify matters here, but when I was coming up in the industry (around 2001) three-tier backend applications were where it was at. There were webservers (usually Apache), application servers, and databases. Knowledge and experience was thin on the ground, so you had to know a bit of everything. HTML, Javascript (mostly form submission fiddling), a language to do business logic in the application server layer (usually Java in those days, but I’d sworn off it for life), and SQL for the database. Different people had different preferences and specialisms, but you had to have a basic knowledge of all those technologies to be an effective developer.
The entire engineering industry was about ‘dev’ at this point. ‘Ops’ was a challenge that was at the time usually either done in a half-hearted way by the developers themselves (in smaller orgs), or passed over to the sysadmins to manage (in bigger orgs). The smaller orgs grew into a devops mindset (“we’re not hiring a separate team to support the operation of the system, it’s cheaper to get the devs to do it”), and the bigger ones embraced SRE (“we need capable site reliability engineers to support and manage live systems, and there’s economies of scale in centralising that”). Data science was also not really a thing then.
There were DBAs but they were there to ensure backups were done, and punish those who polluted their treasured databases with poorly written queries or badly defined indexes. They therefore sometimes doubled up as SQL tutors to the devs.
Since almost everyone was a dev, and devs did a bit of everything, engineers were often asked SQL questions in job interviews, usually starting with a simple join and then asking about indexes and performance characteristics for the more advanced candidate.
The Rise of the Specialist
Along with the rise of ops in the noughties came the continuation of increasing specialisation in the IT industry. We’d had the DBA for some time, but the ‘dev’ was gradually supplanted by the front-end engineer, the tester, the data scientist, the DevOps engineer, and later the SRE, the security engineer… the list goes on.
I say this was a continuation because this process has been going on since computers existed, from the original physicists and technicians that maintained and programmed machines the size of rooms bifurcating into COBOL and FORTRAN programmers in the 60s and 70s, leading to system admins in the 80s, and then network admins, and so on.
Parallels to this can be seen in the medical profession. Once upon a time we had a ‘ship’s surgeon‘, who had to deal with pretty much any ailment you might expect from having cannon balls fired at you while lugging your heavy cannons to face the enemy, and (in more peaceful moments) disease-ridden sailors sharing whatever diseases they’d brought in from shore leave. Fast-forward hundreds of years and we now have plastic surgeons that specialise just in ear surgery (otologists).
And a similar thing seems to be happening with computing jobs. Whereas once upon a time everyone who interacted with a computer at work would likely have had to type in (or even write) a few SQL queries, now there’s a data scientist for that. The dev now attends scrum meetings and complains about JIRA (and doesn’t even write much JQL).
Is SQL Becoming a Niche Skill?
So back to the issue at hand. Opinion seems to be divided on the subject. These interactions were typical when I asked on Twitter:
There’s no doubt that historical demand for SQL has always been strong. While SQL’s elder sibling COBOL clings to life on the mainframe, SQL remains one of the most in-demand skills in the industry, according to the TIOBE index. In the 20 years since 2004 it’s gone from 7th to 8th most popular language, alongside mainstays like C and upstarts like C++, Java, and Visual Basic.
Not only that, but SQL is still a widely used data querying language outside its common domains. Those who prefer SQL to jq (and who among us without a PhD in mathematics do not?) might be interested in DuckDB, a SQL-fronted alternative to jq. There’s also Facebook’s osquery, which gives a common SQL interface to query different operating systems’ state for compliance, security, or performance purposes.
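For example, osquery’s interactive shell takes plain SQL against tables that describe the running system (a quick sketch using tables from its standard schema):

# the five processes using the most resident memory, as a SQL query
osqueryi "SELECT name, pid, resident_size FROM processes ORDER BY resident_size DESC LIMIT 5;"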
‘Schemas Are Bad, M’Kay?’
In another recent project with a similar demographic of engineer, I was horrified to be voted down about using a relational database as the consensus was ‘schemas are bad’ and ‘MongoDB is more flexible’. This, I thought, is a world gone topsy-turvy. To gain dominion over your data, you need to wrestle it to the floor with a schema. Without that, you won’t have the power to range over it with whatever questions you have to ask it, and you won’t have a database, just a set of documents or a fancier filesystem.
To this it’s often objected that schemas are ‘hard to change’, but this is only true when you’ve got many users and code dependencies on the database or huge amounts of data, not when you’re initially building up a system. They can be annoying to change, because the syntax varies between vendors and you often end up looking up whether it’s ALTER TABLE ADD INDEX or CREATE INDEX. LLMs almost banish this problem, however, as they are incredibly good at saving you time on this.
I still find it pretty insane how good LLMs are at writing SQL
I spent 15 years writing optimised SQL and for anything more than simple-ish joins I go straight to ChatGPT
My only skill now is knowing what queries it produces are unsafe to run or clearly wrong (which is rare)
Poking my head out of that world I’ve been shocked at how few people currently in IT really understand data programming/management these days. npm and Javascript sure, python probably, but data, its management and any programming around it seems to have been pushed to the sides for the ‘data scientists’ to deal with, as though data were not an essential aspect of pretty much any computing work.
But perhaps I’m just an old ship’s surgeon who might know how to saw a leg off and administer smelling salts, but doesn’t have the deft skill of an otologist who practices on their specialism on the daily. I’ll just call myself a ‘full stack’ engineer and hope no-one notices.
As someone who has worked in software since 2001, and in the Cloud Native (containerisation and Kubernetes) space since 2013, I’m getting old enough to have seen trends come and go a few times. VMs came (and stayed), continuous integration went from a fad talked about by gurus to the mainstream of software delivery, and containers went from some changes Google made to the Linux kernel to the de facto standard for software packaging, and then on to the foundation for Kubernetes, an industry-standard software deployment platform.
But it’s another thing to see a wave come, and then see it recede for a time before you expect it to rise again. And in the last few years, at my work we’ve observed that more and more of the decision-makers in our industry have shifted their stance on the cloud to the point where it became impossible for us to ignore.
One might say they’ve changed their strategy, but most of them don’t couch this shift in those terms. What we’re seeing is that businesses are moving from a wholesale ‘migration to cloud’ strategy to a more piecemeal, hybrid approach. After experimentation with moving workloads to the cloud (with mixed success), companies have altered their ambitions. Instead, they are now leaving workloads on-prem (or in colos), and only moving existing workloads to cloud if there is a compelling reason to. New workloads are going to the cloud by default.
In most cases we’ve observed, this is quietly accepted as a new reality rather than trumpeted as a new strategy. We think this is a change that’s been borne of experience and financial necessity, two very strong motivators.
To be clear, we’re not talking about repatriation here. We’ve discussed repatriation in previous content, but we’re not seeing clients exit the cloud or even move workloads back into the data centre. Famously, David Heinemeier Hansson trumpeted his company HEY’s repatriation of their workloads back on-prem. It was so frequently cited and discussed that he wrote an FAQ about it to answer questions. His situation, however, was a relatively rare combination of a single-product tech company that had decided it no longer needed the agility that the cloud brings, and had the skills and confidence to bring their workloads back.
There are hints that others are seeing the same trends. James Watters of VMWare Tanzu’s Research and Development department talks here about how he’s:
“seeing a lot of organisations where the trend isn’t going one direction or the other. I’m seeing people benchmarking the cost and efficiency of the private cloud stack versus the public cloud stack, and people are having a lot of thoughtful conversations about that and in some cases are growing their private cloud deployments because of the results they see”
We’ve Been Here Before
As we all know, those who do not learn from history are doomed to repeat it. And the mainframe to PC migration of the latter 20th century offers a perfect example from living memory of a similar evolution. Putting precise dates on these kinds of shifts is difficult, but my researches suggested that this migration took place over a roughly 20-year period, from the early 80s to the first few years of the 21st century. The start date matches the time Bill Gates was said to have said that his vision was ‘a computer on every desk and in every home’.
This 1982 article from the Sarasota Herald Tribune suggests that in 1981, IBM moved from ‘large central computers’ to ‘small, smart computer equipment’ whose potential IBM’s rivals had seen in the late 70s. Interestingly, this shift followed a 13-year antitrust suit from the Justice Department that charged IBM with a mainframe monopoly. This poses an interesting question of which came first in this decision: the market changes, or the legal pressures? It’s generally believed that the market rendered the monopoly discussion irrelevant as PCs took over global compute spend.
Discussion about mainframe to PC migrations petered out around the turn of the century, as references began to consistently use the past tense when talking about it. Again, it’s interesting to note that this coincided with the ‘Y2K bug’ spending frenzy and the dot-com boom, suggesting that these investment bumps were what tipped discussion of mainframe-to-PC migration into the past tense.
Although it’s relatively easy in retrospect to say when these trends started and finished, as it happened there was confusion about where we were in the cycle, or even whether the cycle was still in play. As ‘late’ as 1986 – over half a decade since IBM had decided they were late to the party – PCs were not considered rivals to mainframes, and even eight years after that in 1994, experts were still being asked whether client-server was the future of computing.
So we might say that the secular trend away from the mainframe was strong, but within that trend there were periods of growth and deceleration for the PC. What we’re seeing now may be a similar reduction in the rate of growth of a trend of cloud transformation that’s been going on arguably since 2006, the year Amazon introduced the S3 object storage service.
Why Now?
The causes of this retreat from wholesale cloud migration are manifold. We’ll look at three of them, in rising order of importance.
Migration exhaustion (and IT worker conservatism)
Cost of hardware/colos
Macroeconomic trends
Migration Exhaustion and Conservatism
Enterprises are wearying of transformation narratives. The last 10-15 years have seen revolving doors of CIOs proclaiming cloud transformations that have not lived up to their billing. Our founder even wrote a book called Cloud Native Transformation that captured this zeitgeist (and how to do it right with the help of the patterns we open sourced, natch).
The hardest (and most underestimated) parts of these transformations are the least technical ones. Once an API exists for a service, then moving your software to run on it is – in principle – straightforward.
What really constrains cloud transformations includes the innate conservatism of the broad mass of IT employees, and the organisational inertia of businesses whose organisational architectures were not shaped by cloud economics. In other words, the ‘money flows‘ of businesses are designed for organisational paradigms out of kilter with the new computing paradigm (we’ve written extensively on this under-explored constraint in the past, having seen it as a major blocker to cloud adoption).
Natives, Converts, Hobbyists and Others
We see a strong divide in the staff we encounter. The first people we usually meet on a client engagement are the enthusiastic and capable cloud ‘converts’. These internal evangelists are usually from a background specialism such as development, system administration, or networking, in which they first used – and later in their career embraced – cloud technologies; or they are career cloud natives. We often wonder why they need our help when they have such smart, committed people saying many of the right things. We find out why as we go deeper into their business and talk to the second and third groups.
The second group are feature team platform ‘hobbyists’, whose day job is typically to deliver features, but are motivated to improve delivery. They often become platform engineers over time if that’s the way they want to go, as their skills (programming, source control, automation) are readily transferable from development into that field. Such people are relatively rare (this was my own path).
Finally, we have the broad mass of IT employees who don’t have much interest in cloud computing. They are generally either satisfied with shipping features, or with maintaining complex systems they have come to know well, or are at a stage of their career where operating in a new paradigm does not fill them with excitement. I once worked at a client with a DBA tasked with aiding a cloud migration who was quite resistant even to using shell scripts and Git, let alone Terraform or Python cloud libraries. These people are like those who, in the nineties, either clung to the mainframe and survived, repurposed their career to a less technical path, or got out completely. There are more of them out there than many clued-up engineers might think.
This is particularly marked in mainland EU, where employment law makes it relatively difficult and costly to turn over staff compared to the US and UK. Time and again we have discussions with business leaders that are frustrated by their own inability to move the cloud needle within their own organisation, and effectively unable (or unwilling) to part with long-standing staff. This is one of several factors that results in the frozen middle we’ve written about addressing previously.
The last ten years have seen several waves of cloud transformation, and these waves, along with the natural conservatism of the bulk of the IT workforce, have resulted in a retreat of enthusiasm for wholesale migration. But this by itself would not stop an industry motivated to change. So we must look deeper.
Cost of hardware
While the IT world was focussed on cloud migrations and transformations, the cost of hardware continued to drop.
It’s always been known that the performance of a cloud CPU and memory is not the same as physical hardware, but putting exact numbers on this is tricky, as it depends on so much besides the headline numbers, not to mention the fact that you’re sharing your CPUs with other workloads. This study by Retailic suggests that 2 cloud ‘vCPUs’ (virtual CPU) is roughly equivalent to one ‘real’ CPU on-prem. And memory performance is even more murky, as this post attests.
So looking at just the CPUs, you get ‘twice’ the value per CPU you pay for on-prem. So on-prem is cheaper? Not so fast. You still need to take into account the cost of running your own data centre (software maintenance, electricity costs, risk of outages, network costs…) to do a proper comparison. Building all these capabilities and paying for all these ‘extras’ is the basis of the arguments for the cloud’s value in the first place.
And, of course, you have to compare the yearly rental costs of servers with the yearly depreciation costs of buying the hardware in the first place. And here, hardware has also continued to get cheaper. Traditionally, computing equipment was depreciated over three years, but anyone who has bought servers or desktops in the last ten years will know that they typically have a useful life far longer than that.
For example: at home I have a nine year old Dell Xeon workstation that I bought for £650 three years ago (to run virtual Kubernetes clusters on, natch) that is still going strong. Inflation adjusted, the 2015 cost of that machine has depreciated at about £180 per year. According to AWS’s cost calculator, a similarly-spec’d m6g.12xlarge instance on AWS would set you back over £11,000. This eye-watering sum includes a one-year reservation, and paying all upfront for a hefty discount. A three-year reservation would only bring that down to £7,000 per year.
At the other end of the ‘serious computing’ spectrum, Google saved billions of dollars at the stroke of their CFO’s pen when they updated their servers’ lifespans from four to six years. That’s a 33% reduction in cost on a pretty large line item. (Incidentally, Google started out by saving money and increasing compute power by buying up commodity PCs instead of large servers or mainframes in the early noughties.)
Even if you don’t fancy running your own data centre (the usual counter to these arguments), then you can rent colocated server space very cheaply. I quickly found a provider in London who would give me a 10U space for £119 per month. Data egress costs are cheaper, and you can scale up or down other features (such as network bandwidth) depending on your requirements.
Or you can buy ‘clouds in a box’ like Oxide at the higher end, or SoftIron’s Hypercloud at the cheaper end, which offer options to self-host a cloud offering either in your data centre or in a colo; or there’s HPE Greenlake, which offers metered usage of a similar product. If you combine these with genuinely Cloud Native workloads then you can pick the best cloud for your use case, which might be tempting as interest rates rise.
Macroeconomic Trends
Finally, we come to the most fundamental cause of this migration slowdown. Across the world, bond and interest rates have risen from historical lows during the COVID pandemic in 2020 to levels not seen since 2008. The tech boom we saw during the pandemic, where cheap money was thrown at tech in a bid for fast growth, became a bust as many businesses woke up to the fact that they had overspent in the previous years.
This caused a cutting-back of spending and funding as companies and investors adjusted to the new reality, and CFOs cut back spending en masse. This in turn meant that ambitions for moving to the cloud were either dropped, put on hold, or cut back significantly. The result for us was noticeably more requests for cost optimisation or finops work than we had previously, and noticeably fewer requests for large-scale cloud transformations.
What Next?
If the current position is one of a slowing in the growth of cloud provider take-up, then what does the history of mainframes and PCs teach us about what comes next?
There are stirrings of alternative offerings to the big providers coming to the market that leverage cheaper and less feature-rich cloud providers. Cloudfanatics is one such company, about to come to market offering cheaply provisioned packages of multi-tier infrastructure for cost-conscious SMEs that don’t have the expertise to build such things themselves. The packages use whichever cloud provider is most appropriate for the use case in terms of cost and features/performance.
Cloudfanatics’ founder, Andrew Philp, was inspired to start this company up by his experiences with SMEs’ struggles with both cost and maintenance of cloud systems. Although their needs were generally simple, he found that lack of availability of staff who could pick up a cloud setup from a departing employee, and poor practices around cloud topics such as IAM setup and secrets management meant that there was a sweet spot of functionality that could be delivered in a cost-effective way.
This may be the beginnings of the next ‘mainframe vs PC’ war. This time the big three cloud providers are the ‘mainframe’ vendors, and these smaller cloud providers are the more agile and simpler ‘PC’. APIs such as Kubernetes and the AWS S3 protocol are the equivalent of the IBM PC standard, allowing customers to port their applications if needed.
Already we see alliances of cloud providers such as the bandwidth alliance clubbing together (facilitated by Cloudflare) to offer a looser agglomeration of cost-saving options for cash-constrained IT leaders, including free data transfers between providers. Data egress costs are often a bone of contention between customer and cloud.
What we are unlikely to see is a wholesale retreat from cloud to either on-prem or the colo. A whole generation of engineers and business owners have been raised on the cloud and Cloud Native tooling and technology. While the prem will never go away, wholesale repatriation programs like HEY’s will be relatively rare.
What’s critical in this new world is to ensure that your workloads can run portably. This is what is meant by the phrase ‘Cloud Native’. Portability has always been the dream of the CIO looking to cut costs and increase choice, from bash scripts to mainframe software, and a source of contention between big IT and the consumer. As an enterprise architect for a bank, I once had a heated discussion with an AWS representative who told me to ‘just use ECS’ when our strategy involved making our workloads Kubernetes-native. Soon afterwards AWS announced EKS, to the surprise of many, including the AWS rep.
What Does AI Mean for Cloud?
Opinion is divided over the effect AI workloads will have on these trends. On the one hand, Michael Dell cites a Barclays CIO Survey report which suggests that private cloud repatriation is now on 83% of CIOs’ radars, up from a low of 43% at the height of COVID. It makes one wonder whether COVID drove a lot of necessary short-term costly cloud spending which is now being repatriated in a piecemeal fashion on a case-by-case basis.
Multi cloud including on prem and collocation is the clear trend. Somewhat driven by the growth in AI inference and data gravity, 83% of enterprise CIOs in Barclays survey plan to repatriate at least some workloads in 2024, up from low point of 43% in 2020 H2. pic.twitter.com/f6yLK2IVKt
Michael Dell mentions another factor in repatriation: data gravity and AI inference workloads, and here there’s a split in opinion. On the one hand, companies looking to get ahead in AI may want to buy the latest hardware and run it themselves; on the other hand, renting cloud compute for short term agility may make more sense than buying AI hardware that will be out of date in the blink of a Moore’s Law cycle.
Portability More Important than Ever
With bond rates continuing to rise, expect more and more pressure on costs to arise. But this doesn’t necessarily mean the end of cloud computing. In fact, cloud spend is continuing to rise, just at a slower rate than before, according to a recent CIO survey.
Just as with the mainframe-to-PC computing paradigm shift, we are likely to continue to move to a cloud-first world. The big three cloud players will have their moats eaten away by smaller, cheaper and interoperable solutions. Some workloads will return to on-prem from the cloud, and a few specialised workloads will remain there until well past the time when the cloud is the default choice.
In such a world, it will become more important than ever that your software workloads are portable, and properly implemented cloud native build and delivery methods will help you ensure that you have that portability. It just won’t matter that much whether it’s actually running in the cloud or on touchable tin.
This post was originally published here, and is re-posted with permission.
Writing A 1996 Essay Again in 2023, This Time With Lots More Transistors
ChatGPT 3
Gathering the Sources
PrivateGPT
Ollama (and Llama2:70b)
Hallucinations
What I Learned
TL;DR
I used private and public LLMs to answer an undergraduate essay question I spent a week working on nearly 30 years ago, in an effort to see how the experience would have changed in that time. There were two rules:
No peeking at the original essay, and
No reading any of the suggested material, except to find references.
The experience turned out to be radically different with AI assistance in some ways, and similar in others.
Although I work in software now, there was a time when I wanted to be a journalist. To that end I did a degree in History at Oxford. Going up in 1994, I entered an academic world before mobile phones, and with an Internet that was so nascent that the ‘computer room’ was still fighting with the snooker table for space next to the library (the snooker table was ejected a few years later, sadly). We didn’t even have phones in our dorms: you had to communicate with fellow students either in person, or via an internal ‘pigeon post’ system.
My typical week then was probably similar to a week of a History student in 1964: tutorials were generally on Friday, where I would be given the essay question and reading list for the following week:
A typical History essay reading list from 1996. My annotations at the top related to the library reference details, and the notes at the bottom (barely legible even to me now) I suspect were the notes I made during the hour as my tutor was telling me what I should have written about.
That afternoon I would go to several libraries, hoping that other students studying the same module hadn’t borrowed the books on the list. Over the following week I would try to read and make notes on the material as much as possible given my busy schedule of drinking incredibly cheap and nasty red wine, feeling depressed, and suffering from the endless colds living in Oxford (which is built on a bog) gifts you free of charge.
The actual writing of the essay would take place on the day before delivery, and involve a lot of procrastination (solitaire and minesweeper on my Compaq laptop, as I recall) while the painful process of squeezing the juices from my notes took place.
I didn’t go to lectures. They were way too early in the afternoon for me to get out of bed for.
The essay question I chose to answer was from a General British History (19th-20th century) course:
How far was the New Liberalism a Departure from the Old?
ChatGPT 3
I first went to ChatGPT (3, the free version – I’m a poor student after all) to try and familiarise myself with the topic:
Already I was way ahead of my younger self. I had an idea of what the broad strokes I needed to hit were to get a passable essay together. ChatGPT is great for this kind of surveying or summarizing stuff, but is pretty bad at writing an undergraduate-level essay for you. It can write a passable essay (literally), but it will be very obvious it was written by an LLM, and probably at best get you a 2:2 or a third, due to the broad-brush tone and supine mealy-mouthed ‘judgements’ which contain very little content.
After a few more interactions with ChatGPT over the next few minutes:
I was now ready to gather the information I needed to write the essay.
Gathering the Sources
The first step was to get the materials. It took me a couple of hours to find all the texts (through various, uh, avenues) and process them into plain text. I couldn’t get the journal articles (they were academically paywalled), which was a shame, as they tended to be the best source of figuring out what academic debate was behind the question.
Those couple of hours would have been reduced significantly for any future essay I might choose to write, as I spent some time figuring out exactly where to get them and installing software to extract the plain text. So I would now allocate an hour to getting the vast majority of texts, as opposed to the small subset of texts I would hunt down over a significantly longer time in the physical libraries 30 years ago.
PrivateGPT
One of the well-known limitations of ChatGPT is that it doesn’t tell you which sources it drew on to generate the text it gives you.
I found out that the privateGPT repo gives you a quick way to augment an LLM with a set of documents, so I installed it on an old Dell Xeon desktop with 24 cores and 128GB of RAM. This machine cost me about £600 a couple of years ago, and still works a treat (I bought it to run Kubernetes clusters). It doesn’t have a graphics card in it, so I wasn’t sure it would be usable for this kind of work, but it turns out to be just fine with ‘normal’ CPUs (as long as you don’t demand instant responses).
$ python3 ingest.py
Appending to existing vectorstore at db
Using embedded DuckDB with persistence: data will be stored in: db
Loading documents from source_documents
Loading new documents: 100%|███████████████████| 12/12 [00:00<00:00, 274.84it/s]
Loaded 12 new documents from source_documents
Split into 38046 chunks of text (max. 350 tokens each)
Creating embeddings. May take some minutes…
Ingestion complete! You can now run privateGPT.py to query your documents
Ingestion of the 10MB or so of sources took about five minutes. I asked it a few questions, and this was a typical output:
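(The questions were asked via the interactive query script named in the ingestion output above, which – going by that output – you start like this:)
$ python3 privateGPT.py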
> Question: what was the changing role of the state in new versus old liberalism
> Answer (took 169.94 s.): The main difference between "new" and "old" liberalism lies in their approach towards the role of the government within society. While traditional or 'Old' liberals believed that minimizing intervention by the State would lead to a more prosperous economy, ‘New’ Liberals acknowledged the need for state action and social reform as necessary measures against industrial poverty. In essence, new liberalism marked an evolution of thought among liberals towards greater acceptance of government responsibility in promoting general welfare through interventionist policies.
> source_documents/the_age_of_lloyd_george_kenneth_o_morgan.txt: It was, therefore, the Old Liberalism which helped to repair the fortunes of the party—nonconformity and industrial free trade, the very essence of the Liberal faith for half a century. Where, however, was the ‘New Liberalism’ which could embrace a more positive attitude towards the central government, for which many
> source_documents/the_new_liberalism_weiler.txt: In the broadest sense, the term new Liberalism refers both to the welfare legislation of the 1906-14 administration and to the changes in Liberal social and political theory. At the time, however, the term first referred to the theoretical changes which preceded the legislative achievements of the Liberal government.
> source_documents/the_new_liberalism_weiler.txt: What then was the new Liberalism? As defined here, it was the modification of Liberal ideology from a vehicle for a mid-Victorian ideal of laissez-faire to a philosophy based on state action and social reform to cope with industrial poverty. This modification was achieved by Liberal journalists and theorists seeking to
> source_documents/the_new_liberalism_weiler.txt: state, a development that represents the fruition of the new Liberalism.
New Liberalism The Political Economy of J.A. Hobson
UNIVERSITY OF TORONTO PRESS Toronto Buffalo London
Toronto Buffalo London Printed in Canada ISBN 0-8020-5558-3
> source_documents/the_new_liberalism_weiler.txt: understanding of the phenomenon of the new Liberalism itself.
> source_documents/the_new_liberalism_freeden.txt: THE SOCIAL POLICY OL THE NEW LIBERALISM
You can see that the output is mixed in quality (the model I used was relatively small, see Ollama section below), but it did point me at potentially relevant sources for quotes and bits to read. I spent about another hour rooting around these texts and gathering bits of text that I might use later. Note the response time, however (about three minutes is typical). I got a bit frustrated with the response time vs quality and went looking for other tech to use.
Ollama (and Llama2:70b)
This was the combination of technology that blew my mind. It absolutely changes the game for this kind of work. Ollama is a project designed to ‘get up and running with large language models locally’, and it borrows concepts from Docker: a Modelfile (similar to the Dockerfile), and commands such as ollama pull and ollama list.
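A minimal sketch of the workflow (the prompt here is illustrative, not my exact session):
$ ollama pull llama2:70b
$ ollama run llama2:70b "How far was the New Liberalism a departure from the Old? Give me an essay plan."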
Compared to the ChatGPT output, Llama2:70b (a Llama2 model with 70 billion parameters) allowed me to automate a significant chunk of the work needed to write the essay, reducing the time required from days to less than a day.
I downloaded the biggest model I could find (Llama2:70b) and asked it the question in a very unspecific way. It returned a much better essay plan than ChatGPT did:
At this point I started writing the body of the essay (introductions and conclusions are best left to the end), basing each paragraph on the numbered points above. As I wrote, looked at specific texts to find references, and asked the LLMs questions, something unexpected happened: I was learning about the nuances of the debate (mainly through looking at the Collini book I hadn’t been able to find in the 90s).
I was getting sucked into a temptation to actually read the texts, and had to remind myself that I wasn’t allowed to do that, according to the rules of the ‘game’ I was playing here.
This appreciation of which texts were important, and what I was being invited to write about, happened much faster than it would have done pre-Internet. I really do wonder how students do this today (get in touch if you can tell me!).
Hallucinations
A well-known limitation of LLMs is their tendency to ‘hallucinate’, or ‘bullshit’ a plausible-seeming answer. I definitely saw this, and it wasted some time for me. Llama2 gave me this nice quote, which raised suspicions with me as it seemed too good to be true:
Evidence: In his book "Industrial Democracy" (1902), Webb argues that collective ownership and control of the means of production is necessary to address the systemic issues that perpetuate poverty and inequality. He writes, "The only way to secure industrial democracy is through the collective ownership and control of the means of production."
But I couldn’t find that quote anywhere in the book. Llama2 just made it up.
What I Learned
After writing my AI-assisted essay, I took a look at my original essay from 1996. To my surprise, the older one was far longer than I remembered essays being (about 2500 words to my AI-assisted one’s 1300), and it seemed to me to be of much higher quality. It may have been longer because it was written in my final year, when I was probably at the peak of my essay-writing skills.
It took me about six hours of work (spread over 4 days) to write the AI-assisted essay. The original essay would have taken me a whole week, but the actual number of hours spent actively researching and writing it would have been at minimum 20, and at most 30 hours.
I believe I could have easily doubled the length of my essay to match my old one without reducing the quality of its content. But without actually reading the texts I don’t think I would have improved from there.
I learned that:
There’s still no substitute for hard study. My self-imposed rule that I couldn’t read the texts limited the quality to a certain point that couldn’t be made up for with AI.
History students using AI should be much more productive than they were in my day! I wonder whether essays are far longer now than they used to be.
Private LLMs (I used llama2:70b) can be way more useful than ChatGPT3.5 for this type of work, not only in the quality of the generated response, but also in their ability to identify relevant passages of text.
I might have reduced the time significantly with some more work to combine the llama2:70b model with the reference-generating code I had. More research is needed here.
Hope you enjoyed my trip down memory lane. The world of humanities study is not one I’m in any more, but it must be being changed radically by LLMs, just as it must have been by the internet. If you’re in that world now, and want to update me on how you’re using AI, get in touch.
The git repo with essay and various interactions with LLMs is here.
The pipe is the next-most used feature of jq after filters. If you are already familiar with pipes in the shell, it’s a very similar concept.
How Important is this Post?
Pipes are fundamental to using jq, so this section is essential to understand. We will cover:
The pipe (|)
The .[] filter
Nulls
Array references
Setup
Create a folder to work in, and move into it:
$ mkdir ljqthw_pipes
$ cd ljqthw_pipes
The .[] Filter
Before we get to the pipe, we’re going to talk about another filter. This filter doesn’t have a snappy name, so we’re going to refer to it as the .[] filter. It’s technically known as the ‘array/object value iterator’.
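The examples that follow work on a file called doc1. Its creation isn’t shown here, so this is a reconstruction that matches the description below (the exact values are my own):
$ echo '[{"username": "alice", "id": 1}, {"username": "bob", "id": 2}]' > doc1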
This document is an array representing users on a system, with two objects in it. Each object has a username and id name-value pair.
Since the file is a simple array, we can filter the individual objects by referring to the array index in a fashion similar to other languages, by putting the index number in square brackets:
$ jq '.[0]' doc1
$ jq '.[1]' doc1
If the referenced index doesn’t exist, then a null is returned:
$ jq '.[2]' doc1
You can also refer to negative indexes to reference the array from the end rather than the start:
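For example, this returns the last item of the doc1 array above:
$ jq '.[-1]' doc1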
If you don’t provide an index, then all the items in the array are returned:
$ jq '.[]' doc1
Look at the output. How are the items returned?
In a new array?
In a series of JSON documents?
In a comma-separated list?
Running these commands may help you decide. Note that in the below commands we are using a shell pipe, not a jq pipe. The two uses of the pipe (|) character are analogous, but not the same:
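(The commands aren’t reproduced here; something along these lines will do. The second one pipes the output back into another jq, which happily parses it – a hint that what you are looking at is a series of valid JSON documents:)
$ jq '.[]' doc1
$ jq '.[]' doc1 | jq '.'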
It’s really important to understand that the .[] filter does not just work on arrays, even though it uses an array’s square brackets. If a series of JSON objects is supplied, then it will iterate through the name-value pairs, returning the values one by one. That’s why it’s called the ‘array/object value iterator’.
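To see this, try it on a file containing a bare series of objects rather than an array (again, the file contents are my reconstruction):
$ echo '{"username": "alice", "id": 1} {"username": "bob", "id": 2}' > doc2
$ jq '.[]' doc2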
This time you had no array, just a series of two JSON objects. The .[] filter iterates over a series of objects, returning the values in it. Remember, it’s formally known as the ‘array/object value iterator’.
Now try this. What happens? Why?
$ echo '"username"' > doc3
$ jq '.[]' doc3
This filter confused me for a long time when I was learning jq, as I thought the .[] filter just removed the square brackets from the json, returning an array’s contents, like this example does:
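(The example isn’t reproduced here; a plain array of strings, of my own devising, shows the effect:)
$ echo '["alice", "bob"]' > doc_simple
$ jq '.[]' doc_simple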
While it obviously does do that in this case, you should now be aware that this filter does more than just ‘remove the square brackets’!
Introducing The Pipe
OK, back to doc1. This time you’re going to run the .[] filter over it, and then use a jq pipe to process it further. Run these commands and think about what the pipe character (|) is doing.
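(Based on the descriptions that follow, the three commands were along these lines – my reconstruction:)
$ jq '.' doc1
$ jq '.[]' doc1
$ jq '.[] | .["username"]' doc1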
The first command shows you the JSON document pretty-printed, to remind you of the raw content.
The second command (the ‘array/object value iterator’) iterates over the objects in the array, and produces two JSON documents. This is an example of what was pointed out above: the .[] filter does not simply ‘remove the square brackets’. If it did, the resulting two objects would have commas between them. It turns the two objects within the array into a series of JSON documents. Make sure you understand this before continuing.
The third command introduces the jq pipe. Note that this time, the pipe character (|) is inside the jq string’s single quotes, making it a jq pipe rather than the ‘shell pipe’ we saw earlier. The jq pipe passes the output of the .[] operator on to the next filter (.["username"]). This filter then effectively retrieves the value of the username name/value pair from each JSON document from the output of the first .[] filter.
You can run the last command in other ways too, with identical results:
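(The exact commands aren’t shown here, but the shorthands referred to are these:)
$ jq '.[] | .username' doc1
$ jq '.[] | ."username"' doc1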
jq will figure out what you mean if you use either of those shorthands.
Nulls
Not all JSON documents have objects in them with consistent names. Here you create a document with inconsistent names in its objects (username and uname):
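(The creation command isn’t shown here; a file along these lines matches the description:)
$ echo '[{"username": "alice", "id": 1}, {"uname": "bob", "id": 2}]' > doc4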
Now see what happens when you reference one of the names in your query:
$ jq '.[] | .username' doc4
The object with the matching name outputs its value, but the other object outputs a null value, indicating that there was no match for the query for that object.
Referencing the other object’s name swaps which value it outputs and which results in a null:
$ jq '.[] | .uname' doc4
What You Learned
What the jq pipe (|) does, and how it differs from the shell pipe
What the .[] (array/object value iterator) filter does, and how it works on arrays and objects
How to reference array items
The various ways you can refer to names in JSON objects
Exercises
1) Create a file with three JSON documents in it: an array with a single name-value pair object in it, an array with two name-value pair objects in it, and a single bare object. Run it through the .[] filter and predict what output it will give.
In this section we introduce the most frequently used feature of jq: the filter. Filters allow you to reduce a given stream of JSON to another smaller, more refined stream of JSON that you can then do more filtering or processing on if you want.
How Important is this Post?
Filters are fundamental to using jq, so this post’s content is essential to understand.
We will cover:
The most commonly-used filters
Selecting values from a stream of JSON
Railroad diagrams, and how arrays cannot contain name-value pairs in JSON
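The examples below work on a file called json_object. Its creation isn’t shown here, so this is a reconstruction matching the description that follows:
$ echo '{"user1": "alice", "user2": "bob"}' > json_object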
This file contains a simple JSON object with two name-value pairs (user1 and user2 being the names, and alice and bob being their values, respectively).
The Dot Filter
The concept of the filter is central to jq. The filter allows us to select those parts of the JSON document that we might be interested in.
The simplest filter – and one you have already come across – is the ‘dot’ filter. This filter is a simple period character (.) and doesn’t do any selection on the data at all:
$ jq . json_object
Note that here we are using the filename as the last argument to jq, rather than passing the data through a UNIX pipe with cat.
Arrays, Name-Value Pairs, and Railroad Diagrams
Now let’s try and create a similar array with the same name-value pairs, and run the dot filter against it:
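(The commands aren’t reproduced here; an attempt along these lines – my guess at the original – shows the problem, with jq rejecting the file with a parse error:)
$ echo '["user1": "alice", "user2": "bob"]' > json_array
$ jq . json_array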
Was that what you expected? When I ran this I expected it to just work, based on my experience of data structures in other languages. But no. An array in JSON cannot contain a name-value pair as one of its values – a name-value pair is not a JSON value, and arrays must be composed of JSON values.
What can an array contain? Have a look at this railroad diagram, from https://json.org/:
The above diagram defines what an array consists of. Make sure you understand how to read the diagram before continuing, as being able to read such diagrams is useful in many contexts in software development.
Railroad Diagram
A railroad diagram is also known as a syntax diagram.
It visually defines the syntax for a particular language
or format. As your code or document is read, you can
follow the line,choosing which path to take as it splits.
If your code or document can be traced through the
'railroad', then it is syntactically correct. As far as I can
tell, there is no 'official' format for railroad diagrams
but the conventional signs can easily be found
and deciphered by searching online for examples.
The ‘value’ in the array diagram above is defined here:
Following the value diagram, you can see that there is no ‘name-value’ pair defined within a value. Name-value pairs are defined in the object railroad diagram:
A JSON object consists of zero or more name-value pairs, separated by commas, which makes it fundamentally different from an array, which contains only JSON values, also separated by commas.
It’s worth understanding these diagrams, as they can help you a lot as you try and create and parse JSON with jq. There are further diagrams on the https://json.org site (string, number, whitespace). They are not reproduced in this section, as you only need to reference them in the rarer cases when you are not sure (for example) whether something is a number or not in JSON.
Let’s say we want to know what user2‘s value is in this document. To do this, we need a filter that outputs only the value for a specific name.
First, we can select the object, and then select by name:
$ jq '.user2' json_object
This is our first proper query with jq. We’ve filtered the contents of the file down to the information we want.
In the above we just put the bare string user2 after the dot filter. jq accepts this, but you can also refer to the name by placing it in quotes, or placing it in quotes inside square brackets:
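(Those two alternative forms aren’t shown in the text here; they look like this:)
$ jq '."user2"' json_object
$ jq '.["user2"]' json_object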
However, using the square brackets without the quotes does not work:
$ jq '.[user2]' json_object
This is because the [""] form is the official way to look up a name in an object. The simpler .user2 and ."user2" forms we saw above are just a shorthand for this, so you don’t need the four characters of scaffolding ([""]) around the name. As you are learning jq it may be better to use the longer form to embed the ‘correct’ form in your mind.
Types of Quote Are Important
It’s important to know that the types of quote you use in your JSON are significant to jq. Type this in, and think about how it is different to the ‘non-broken’ json above.
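(The broken example isn’t reproduced here; my best guess is a document using single quotes instead of double quotes, which jq will refuse to parse:)
$ echo "{'user1': 'alice', 'user2': 'bob'}" > json_object_broken
$ jq . json_object_broken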
What You Learned
The various allowed forms for performing the selection
The syntactic difference between an object, an array, and a name-value pair
The types of quote are important
Exercises
1) Create another file like json_object, but this time containing two JSON objects (ie the same line twice). Re-run the queries above and see what happens, to demonstrate that jq works on a stream of JSON objects.