Late Night SRE / Observability Session

So today marks the end of SREcon19 Asia/Pacific here in Singapore. Unfortunately I wasn't able to make it to the conference this year, but hey, it's Friday night and what better way to spend the time than to catch up on what I missed from the conference, right?

I head straight to the #srecon tag on Twitter and there it is: a fellow named Dan Lüdtke has kindly written a summary of (some of) the talks from each day. Here are links to his posts from day 1, day 2 and day 3.

These already provide plenty of food for thought on the subject of SRE and observability, but I inadvertently went down the Twitter hole and emerged with several more interesting links, ranging from low-level technical talks like Life of a packet through Istio, a post from Honeycomb titled Toward a Maturity Model for Observability, a really interesting post on InfoQ titled Sustainable Operations in Complex Systems with Production Excellence, a presentation from Monitorama 2019 about Evolution of Observability at Coinbase and an older presentation from QCon London 2017 titled Avoiding Alerts Overload from Microservices, and, to round it off, the release notes for the latest Consul version, which is an especially intriguing one.

That's quite a list to ingest and digest - it's looking like an excellent weekend ahead :) I'll get right down to it and update this post as I go along.

Evolution of Observability at Coinbase

Link to presentation: https://vimeo.com/341145849.

This one was pretty entertaining, as it actually covered real production issues Coinbase encountered with the rise of Bitcoin in mid-2017, like being down for an entire day (or several) and not knowing what was wrong.

Observability wasn’t seen as a priority. Arbitrary logs everywhere, so people would be spending hours staring at logfiles trying to grok what was happening and why.

The first major breakthrough was to invest in structured logging, with Elasticsearch as a backend. ES has no problems with high-cardinality data, so this allowed them to store verbose logs from most system components.
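
As a rough illustration of the idea (not Coinbase's actual setup - the logger name and field names below are made up), structured logging just means emitting one machine-parseable record per event, so a backend like Elasticsearch can index every field:

```python
import json
import logging
import time

class JSONFormatter(logging.Formatter):
    """Render each log record as a single JSON object so a log backend
    (e.g. Elasticsearch) can index every field individually."""
    def format(self, record):
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured fields passed via the `extra` argument.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# High-cardinality fields like user_id are fine in a log store,
# unlike in most metrics systems.
logger.info("order settled", extra={"fields": {"user_id": "u-12345", "latency_ms": 182}})
```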

Then they grew the team and Elasticsearch started catching fire, simply because there's no way to limit expensive queries from users. They were spending more time fixing the ES cluster than fixing production issues.

Went over to Datadog and liked it, a lot (back when it was still only doing metrics). It had its issues, like getting percentiles over a bunch of things (you can't add percentiles). Also the general issues with metrics - fixed granularity (a lot can happen in 10 seconds), no context around each datapoint, and cardinality limits that constrain what kind of questions you can ask of the metrics in the future.
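
A quick aside on why percentiles can't simply be added or averaged - a toy example with made-up latency distributions (using numpy):

```python
import numpy as np

rng = np.random.default_rng(42)

# Two hosts with very different latency profiles and traffic volumes (made-up numbers).
fast_host = rng.normal(loc=50, scale=5, size=10_000)   # ~50ms, most of the traffic
slow_host = rng.normal(loc=500, scale=50, size=1_000)  # ~500ms, a small fraction

# Naive "aggregation": average the per-host p99s.
naive_p99 = (np.percentile(fast_host, 99) + np.percentile(slow_host, 99)) / 2

# Correct: compute the percentile over all raw measurements combined.
true_p99 = np.percentile(np.concatenate([fast_host, slow_host]), 99)

print(f"average of per-host p99s: {naive_p99:.0f}ms")
print(f"p99 over all requests:    {true_p99:.0f}ms")
# The two numbers differ substantially - the naive average hides the true tail,
# which is why a backend needs the raw data (or a sketch of it) to aggregate percentiles.
```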

Ended up using both Kibana (logs) & Datadog (metrics). Next issue - these are separate systems (hey did you also want tracing?). Product for this, product for that. Naturally you want the middle of the triad of the Three Pillars of Observability.

Revisiting events - Datadog, Honeycomb & Lightstep are cool, but really need developers who understand what this is about (don’t end up with arbitrary unstructured logs!).

Sustainable Operations in Complex Systems with Production Excellence

Production ownership, aka "you build it, you run it", accelerates feedback loops and results in higher-quality software. Ownership can provide engineers with the power and autonomy that feed our natural desire for agency and high-impact work, and can deliver a better experience for users.

Responsibility without empowerment is a problem. People suddenly put on alerts don't necessarily have the skills, and often there's no time to learn them. Tooling alone isn't going to help - tools only automate an existing system. Production excellence can be learned, but requires changes in people, processes and culture.

Silo-ing operations doesn’t work

Such teams prioritise feature velocity over service operability. This leads to an ever-growing need for operator involvement, something that isn't sustainable for very long.

A silo ensures both dev and ops aren't aware of the needs of the other team. Teams end up with piles of manual and error-prone work, or end up automating the wrong thing and increasing risk instead of reducing it (automatically restarting a service instead of understanding why it failed). Such toil (break / fix work) scales with the popularity of the service and leads to operational overload.

Over time, this will also slow down feature velocity - issues in production will take longer and longer to fix.

Handing out pagers doesn’t work

Unless everyone in the joint DevOps team is operationally savvy and participates in production, there'll be one or two experts to whom everyone escalates at all times.

Sharing the pager only helps in sharing the production pain more equally, but doesn't reduce the pain on its own. It's not enough to eliminate toil if the underlying technical debt, the source of the toil, isn't being addressed.

A better approach: Production excellence

Production excellence is a set of skills and practices that allow teams to be confident in their ownership of production. These skills need to be spread across everyone in the team to close the feedback loop; operations becomes everyone's responsibility, even if some people aren't doing operations full-time. People and teams investing in acquiring these skills need to be supported and rewarded.

There are four key elements to making a team and the service it supports perform predictably in the long term.

1. Measure what matters
2. Debug effectively
3. Collaborate with others
4. Prioritise what to fix

Cultivating Production Excellence

This talk is so packed with good content that I've now listened to it twice and still ended up with this elaborate and verbose summary - it's almost a transcript. Totally worth the while and an extra view count - unbelievable that the video has only been viewed 76 times, need to spread the word.

Preamble - let's do some of these cool things and expect decent results

It's tempting to just do things without fully understanding them first - buy DevOps tools, CI/CD everything, get Kubernetes, put developers on-call. The tooling isn't necessarily helping - developers are creating dashboards left, right and centre, but do you know which one to look at when you're in the middle of an outage? Putting developers on-call makes them unhappy - they'll be getting woken up by meaningless alerts, like a disk filling up. It drains the people on rotation and makes them less able to fix those problems.

Real incidents are taking longer and longer to fix, and customers start looking elsewhere. Solving incidents means relying on the "expert" member of the team for their help. Those people are usually so over-subscribed that they only have time to give you the slightest bit of information you need. They never have the time to write down the knowledge they have in their heads, so that people don't have to bother them again.

Deploys become unpredictable - unit and integration tests catch fewer and fewer problems, developer velocity slows down. Problems are between the systems, or caused by real user traffic, something you can never test for in the CI/CD system.

This is the description of Operational Overload - ops is taking up all of your time and nobody has time for meaningful project work. If any project hours do come by, you find there’s no plan. If there’s no plan on where you want to go, no amount of tooling is going to help. So, engineers are burned out, the site is always on fire, customers are unhappy.

What’s missing?

Invest in the people. It's the people who operate the systems, for other people. Don't focus on technology: tools help you get to the destination quicker, but first you need a plan and you need to know where you want to be.

Invest in people, culture and process. Production Excellence.

Computers can actually run fine without the constant human sacrifice, systems can be more reliable. Need to plan - think about what you’re trying to do and how you’re going to get there.

Figure out what matters in the long term and prioritise it. Focusing on engineering isn’t going to be enough, need Product Management, Customer Support, the whole business and customers as well.

Encourage people to ask questions, build confidence to question things and further their understanding.

How to get started

Know when things are "too broken" and have the ability to debug them with others. Afterwards, eliminate the complexity that resulted in the outage.

The systems are always failing. Take a green lawn as an example - we don't care whether every individual blade of grass is green, it's OK for there to be a few brown ones. Instead focus on knowing when the lawn is too brown, when it becomes a problem.

Need to have the ability to decide whether each customer-facing event is good or bad. This must be done by a robot, not a human. Start by understanding what makes the users unhappy - ask the Product Managers, they should understand those needs. It might be something like a specific latency or response code: if the user is constantly retrying and keeps getting an error, or the response is taking too long, they'll be unhappy. For batch jobs, you might want to look at freshness - how old is the latest event? Need to look at events in context. These are Service Level Indicators - data points that indicate whether things are too broken.
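
A minimal sketch of what that per-event decision could look like (the field names and thresholds here are illustrative, not from the talk):

```python
from dataclasses import dataclass

@dataclass
class RequestEvent:
    status_code: int
    latency_ms: float
    is_synthetic: bool = False  # e.g. load-test or DDoS traffic we don't count

def is_eligible(event: RequestEvent) -> bool:
    """Only real customer traffic counts towards the SLI."""
    return not event.is_synthetic

def is_good(event: RequestEvent) -> bool:
    """SLI: an event is 'good' if it succeeded and was fast enough.
    The non-5xx / 300ms thresholds are placeholders."""
    return event.status_code < 500 and event.latency_ms < 300
```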

This understanding leads to sorting each of the customer-facing events into either the "good" or the "bad" bucket. Then look at how many eligible events we saw - "eligible" here needs to be very specific and customer-facing; discard things like DDoS attack events, as they're not real customer traffic. Then you work out your availability - good events versus eligible events - and set a target for that availability: the Service Level Objective.

The SLO should use a window and a percentage - customers have a fairly good memory of the quality of your service, so don't look at the last 2 days, look at the last 30 or even 90 days. So an SLO is something like: 99.9% of events were good (e.g. HTTP 200 and latency < 300ms) in the last 30 days.
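
Continuing the sketch above, the SLO check itself is just a ratio over the window (the 99.9% target and the event counts are example numbers):

```python
SLO_TARGET = 0.999   # 99.9% of eligible events must be good
WINDOW_DAYS = 30     # rolling window the ratio is computed over

def availability(good_events: int, eligible_events: int) -> float:
    """Availability over the window: good events / eligible events."""
    if eligible_events == 0:
        return 1.0  # nothing eligible happened, so nothing was bad
    return good_events / eligible_events

def slo_met(good_events: int, eligible_events: int) -> bool:
    return availability(good_events, eligible_events) >= SLO_TARGET

# Example: 1,000,000 eligible events in the last 30 days, 1,200 of them bad.
print(slo_met(good_events=998_800, eligible_events=1_000_000))  # False: 99.88% < 99.9%
```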

It follows that if we're achieving the SLO, the users are going to be happy. Don't set a higher SLO than what the users are expecting - that's wasteful of development effort, since you'll be improving reliability beyond what users care about and can notice, instead of giving them new features. In most cases, if the user gets an error, reloads / retries and then it works, it's probably OK - that's a micro-outage.

The SLO targets are the primary sources for alerting. Look at SLO trends - the Error Budget. With a 99.9% SLO and 1,000,000 requests over 30 days, we can have 1,000 failures. The SLO trends indicate how long you have till you run out of the Error Budget if things continue as they are. If it's going to be hours, i.e. lots and lots of errors being served, wake up a human; if it's going to be days, raise a ticket for a human to look at during the next day.
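
A back-of-the-envelope version of that error-budget arithmetic (the page-vs-ticket threshold of 24 hours is my own placeholder, not from the talk):

```python
def error_budget(slo_target: float, eligible_events: int) -> float:
    """Total bad events we can tolerate in the window.
    E.g. a 99.9% SLO over 1,000,000 requests allows 1,000 failures."""
    return (1.0 - slo_target) * eligible_events

def hours_until_exhausted(budget_remaining: float, bad_events_per_hour: float) -> float:
    """If errors keep arriving at the current rate, how long until the budget is gone?"""
    if bad_events_per_hour <= 0:
        return float("inf")
    return budget_remaining / bad_events_per_hour

def alert_decision(hours_left: float) -> str:
    # Placeholder policy: page if the budget is gone within a day, otherwise ticket.
    return "page a human now" if hours_left < 24 else "raise a ticket for tomorrow"

budget = error_budget(0.999, 1_000_000)   # 1000.0 allowed failures
remaining = budget - 700                  # say 700 have already been spent
hours = hours_until_exhausted(remaining, bad_events_per_hour=50)  # 300 / 50 = 6 hours
print(alert_decision(hours))              # "page a human now"
```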

Error Budgets also enable data-driven business decisions - do we need to slow down and make the service more reliable, or can we continue to ship features and experiment, i.e. is it safe to run this risky experiment? If you have plenty of error budget left and can easily roll back the experiment or feature-toggle it off, then ship. If the site is constantly on fire though, then it's time to invest in more reliability.

The SLO doesn't have to be perfect. A good SLO is still way better than no SLO - without one you don't know what you're aiming for or when the customers are going to be unhappy. Start measuring what you can, involve Product Management and gather feedback on whether you're meeting customers' needs and, if not, look at how to measure things better.

If you have good SLOs in place, you can confidently turn off all the alerts that aren't connected to whether users are having a good experience and only alert on what matters, and actually get a good night's sleep.

This is only half of the picture

SLIs & SLOs aren't enough. No two outages are ever the same (if you've done the Post Mortem process right), so failure modes can't be predicted.

The way forward is to be able to debug an outage in production, where they actually happen, rather than expecting to catch it in staging.

No matter how novel the outage, if you make it easy to form hypotheses about what's going on and check whether they're true, you'll be able to resolve it faster. You need the necessary data and the ability to explore it in new ways - novel failures require new kinds of analysis.

Observability (a term borrowed from control theory) - being able to determine what the system is doing by looking at its externally-observable outputs. Can you explain the variance between a good and a bad event?

Also, aim to be able to mitigate the impact and debug later if you can - drain the affected node / availability zone and investigate during the daytime hours.

SLOs and Observability go together - SLOs help to know when things are too broken, Observability helps to fix things when they are too broken.

Both aren't enough though - complex systems cross team boundaries, so teams need to feel confident about escalating to each other and talking to each other; it's not just a single person on-call. If people don't feel safe escalating an issue, they'll stop escalating altogether. Don't ship the org chart, work together.

Think about how you're making things better for your future self. Make sure people know how to mitigate an outage quickly. Share knowledge, don't be the single point of failure, and reward people for working together. Practice the teamwork - do drills and game days so people are used to working together.

Outages don't repeat, but they rhyme. Think about what you do after the outage, think about risk analysis - what's the most impactful thing you can be doing to improve production today? Think about frequency and impact - how often does it happen and how bad is it when it happens? Think about how long it takes you to find out, how long it takes you to resolve and how many users are affected. Identify which risks are most significant and address those first; specifically, things that would cause you to exceed your Error Budget, or burn a quarter of it, are probably things you should deal with.
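
As a rough sketch of that kind of risk analysis (the risks, numbers and scoring formula below are entirely made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Risk:
    name: str
    incidents_per_year: float     # how often it happens
    minutes_to_detect: float      # how long until you find out
    minutes_to_resolve: float     # how long until it's fixed
    fraction_of_users_hit: float  # how many users are affected

    def expected_bad_minutes_per_year(self) -> float:
        """Rough expected user-impacting minutes per year for this risk."""
        duration = self.minutes_to_detect + self.minutes_to_resolve
        return self.incidents_per_year * duration * self.fraction_of_users_hit

risks = [
    Risk("primary DB failover is manual", 4, 10, 50, 1.0),
    Risk("bad deploy to one region", 12, 5, 15, 0.2),
]

# Tackle the biggest expected impact first, especially anything that on its
# own would exceed (or burn a large chunk of) the error budget.
for r in sorted(risks, key=lambda r: r.expected_bad_minutes_per_year(), reverse=True):
    print(f"{r.name}: ~{r.expected_bad_minutes_per_year():.0f} user-impacting minutes/year")
```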

Having risk analysis data helps to make the case for the business - this thing here has such and such a negative impact on user happiness and this will address that. Prioritise Post Mortem action items - they won’t get done by themselves, think about how to find time in the engineering schedule to make it work.

Lack of observability is a systemic risk - if you can't debug your systems during an outage, you can't understand what the system is doing. Lack of collaboration is the same - if people don't feel safe bringing up issues, you're going to struggle to mitigate outages quickly when they do happen.

Summary

Measure your level of service, debug effectively, collaborate with others and prioritise what you fix.