Recently I (along with a few others much smarter than me) had occasion to implement a ‘real’ production system with Istio, running on a managed cloud-provided Kubernetes service.
Istio has a reputation for being difficult to build with and administer, but I haven’t read many war stories about trying to make it work, so I thought it might be useful to actually write about what it’s like in the trenches for a ‘typical’ team trying to implement this stuff. The intention is very much not to bury Istio, but to praise it (it does so much that is useful/needed for ‘real’ Kubernetes clusters – skip to the end if impatient) while warning those about to step into the breach what comes if you’re not prepared.
In short, I wish I’d found an article like this before we embarked on our ‘journey’.
None of us had experience implementing Istio in combination with other technologies. Most of us had about half a year’s experience working with Kubernetes, and we had spun up vanilla Istio more than a few times on throwaway clusters as part of our research.
1) The Number Of People Doing This Feels Really Small
Whenever we hit up against a wall of confusion, uncertainty, or misunderstanding, we reached out to expertise in the usual local/friendly/distant escalation path.
The ‘local’ path was the people on the project. The ‘friendly’ path was people in the community we knew to be Istio experts (those who had given talks at KubeCon and the like). One such expert admitted to us that they used Linkerd ‘until they absolutely needed Istio for something’, which was a surprise to us. The ‘distant’ path was mostly the Istio forum and the Istio Slack channel.
Whenever we reached out beyond each other we were struck by how few people out there seemed to be doing what we were doing.
‘What we were doing’ was trying to make Istio work with:
- applications that may not have conformed to the purest ideals of Kubernetes
- a strict set of network policies (Calico global DENY-ALL)
- a monitoring stack we could actually configure to our needs without just accepting the ‘non-production ready’ defaults
Maybe we were idiots who couldn’t configure our way out of a paper bag, but it felt like, beyond following 101 guides or accepting the defaults, there just wasn’t that much prior art out there.
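For context, the ‘global DENY-ALL’ mentioned above was a Calico policy along these lines. This is a hedged sketch, not our actual config: the policy names and the kube-dns selector are illustrative, and in practice every required flow (istiod, telemetry scrapes, and so on) needed its own explicit allow.

```yaml
# Illustrative sketch: deny everything mesh-wide by default.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: default-deny
spec:
  selector: all()      # applies to every workload
  types:
  - Ingress
  - Egress
---
# With everything denied, each flow must be allowed explicitly.
# DNS is the first one every sidecar-injected pod needs:
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: allow-dns-egress
spec:
  selector: all()
  egress:
  - action: Allow
    protocol: UDP
    destination:
      selector: k8s-app == "kube-dns"   # illustrative label
      ports: [53]
  types:
  - Egress
```

Once a policy like the first one is in place, anything you forgot to allow simply times out, which is where much of the ‘enormous confusion’ described later came from.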
Eventually we got everything to work the way we wanted, but we burned up significant developer time in the process, and nearly abandoned our efforts more than once on the way.
2) If You Go Off The Beaten Path, Prepare For Pain
Buoyed by our success running small experiments by following blogs and docs, we optimistically tried to leap to get everything to work at the same time. Fortunately, we ran strict pipelines with a fully GitOps’d workflow which meant there were vanishingly few ‘works on my cluster’ problems to slow us down (if you’re not doing that, then do so, stat. It doesn’t have to be super-sophisticated to be super-useful).
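As an illustration of what a GitOps’d Istio install can look like, a declarative, version-pinned manifest is one way to avoid ‘works on my cluster’. This is a hypothetical minimal sketch, not our actual setup; the profile, tag, and names are illustrative:

```yaml
# Hypothetical declarative install, checked into a GitOps repo and
# applied by the pipeline rather than by hand.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  name: istio-control-plane
  namespace: istio-system
spec:
  profile: default
  tag: 1.6.8            # pin the exact version so every cluster matches
  meshConfig:
    accessLogFile: /dev/stdout
```

Because the version is pinned in git, an upgrade becomes a reviewable diff rather than a surprise.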
A great example of this was monitoring. If you just read the literature, then setting up a monitoring stack is a breeze. Run the default Istio install on a bare server, and everything comes for free. Great. However, we made the mistake of thinking that this meant fiddling with this stack for our own ends would be relatively easy.
First, we tried to make this work with a strict mTLS mode (which is not the default, for very good reason). Then we tried to make the monitoring stack run in a separate namespace. Then we tried to make it all work with a strict global network policy of DENY-ALL. All three of these things caused enormous confusion when they didn’t ‘just work’, and chewed up weeks of engineering time to sort out.
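For reference, mesh-wide strict mTLS is a one-liner to turn on, which is part of the trap: the policy itself is trivial, but its knock-on effects (plain-text health checks and Prometheus scrapes suddenly being refused) are not. A hedged sketch:

```yaml
# Illustrative: a PeerAuthentication named 'default' in the root
# namespace (istio-system by default) applies mesh-wide.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # sidecar-less clients can no longer connect
```

Anything outside the mesh that previously talked plain text to your pods, your monitoring stack included, has to be rethought once this is applied.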
The conclusion: don’t underestimate how hard it will be to make the changes you want to the defaults when using Istio alongside other Kubernetes technologies. Better to start simple and work your way out, building a fuller mental model that will serve you well in the future.
3) Build Up A Good Mental Glossary
Istio has a lot of terms that are overloaded in related contexts. Commonly used terms like ‘cluster’ or ‘registry’ may have very specific meanings or significance depending on context. This is not a disease peculiar to Istio, but the denseness of the documentation, and the number of concepts that must be embedded in your understanding before you can parse it fluently, make it especially acute here.
We spent large amounts of time interpreting passages of the docs, like theologians arguing over the Dead Sea Scrolls (“but cluster here means the mesh”, “no, it means ingress to the kubernetes cluster”, “that’s a virtual service, not a service”, “a headless service is completely different from a normal service!”).
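To make the last of those arguments concrete: a headless Service differs from a normal one by a single line of YAML, yet behaves completely differently. Names here are invented for illustration:

```yaml
# A 'normal' Service: clients get a stable virtual IP,
# and kube-proxy load-balances to the pods behind it.
apiVersion: v1
kind: Service
metadata:
  name: normal-svc
spec:
  selector:
    app: myapp
  ports:
  - port: 8080
---
# A headless Service: clusterIP is None, so DNS returns the
# individual pod IPs directly and there is no virtual IP at all.
apiVersion: v1
kind: Service
metadata:
  name: headless-svc
spec:
  clusterIP: None
  selector:
    app: myapp
  ports:
  - port: 8080
```

That one `clusterIP: None` changes how endpoints are resolved, and therefore how the sidecar proxies route to them, which is why the distinction kept coming up in our arguments.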
Here’s a passage picked more or less at random:
An ingress Gateway describes a load balancer operating at the edge of the mesh that receives incoming HTTP/TCP connections. It configures exposed ports, protocols, etc. but, unlike Kubernetes Ingress Resources, does not include any traffic routing configuration. Traffic routing for ingress traffic is instead configured using Istio routing rules, exactly in the same way as for internal service requests.
I can read that now and pretty much grok what it’s trying to tell me in real time. Bully for me. Now imagine sending that to someone not fully proficient in networking, Kubernetes, and Istio in an effort to get them to help you figure something out. As someone on the project put it to me: ‘The Istio docs are great… if you are already an Istio developer.’
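For what it’s worth, the passage quoted above maps onto two resources, and seeing them side by side is what finally made it click for us. A hedged sketch (hosts, names, and ports are invented):

```yaml
# The Gateway only exposes ports and protocols at the mesh edge;
# note there is no routing here.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: my-ingress-gw
spec:
  selector:
    istio: ingressgateway   # bind to Istio's default ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "example.com"
---
# The routing lives in a VirtualService that attaches to the Gateway,
# exactly as it would for internal service-to-service traffic.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-routes
spec:
  hosts:
  - "example.com"
  gateways:
  - my-ingress-gw
  http:
  - route:
    - destination:
        host: my-service    # an ordinary Kubernetes Service
        port:
          number: 8080
```

The split is the point: unlike a Kubernetes Ingress resource, the Gateway knows nothing about where traffic goes next.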
As an aside, it’s a good idea to spend some time familiarising yourself with the structure of the docs, as it very quickly becomes maddening to try and orient yourself: ‘Where the hell was that article about Ingress TLS I saw earlier? Was it in Concepts, Setup, Tasks, Examples, Operation, or Reference?’
4) It Changes Fast
While working with Istio, we discovered that things we couldn’t get to work in one release started working in another, sometimes while we were still debugging them.
While we’re on the subject, take upgrades seriously too: we did an innocuous-looking upgrade, and a few things broke, taking the system out for a day. The information was all there in the release notes, but it was easy to skip over.
You get what you pay for, folks, so this should be expected from such a fundamental set of software components (Istio) running within a much bigger one (Kubernetes)!
5) Focus On Working On Your Debug Muscles
If you know you’re going to be working on Istio to any serious degree, take time out whenever possible to build up your debug skills on it.
Unfortunately the documentation is a little scattered, so here are some links we came across that might help:
6) When It All Works, It’s Great
When you’re lost in the labyrinth, it’s easy to forget what power and benefits Istio can bring you.
Ingress and egress control, mutual TLS ‘for free’, a jump-start to observability, traffic shaping… the list goes on. Its feature set is unparalleled, it’s already got mindshare, and it’s well-funded. We didn’t want to drop it because all these features solved a myriad of requirements in one fell swoop. These requirements were not that recherché, which is why I think Istio is not going away anytime soon.
The real lesson here is that you can’t hide from the complexity of managing all this functionality and expect to be able to manipulate and control it at will without hard work.
Budget for it accordingly, and invest the time and effort needed to benefit from riding the tiger.