DevOpsDays Singapore 2018

Published: 11 Oct, 2018

DevOpsDays Singapore 2018 was here

This year I was lucky enough to attend this event again, and I thoroughly enjoyed it. It retained the single-track style, which made it easy to catch all the talks and contributed to the cozy feel of the event; it wasn’t too busy. As always, the hallway conversations were one of the best parts of the event - the smart people gathered together, and the setting & context provided by the conference, ensured there were plenty of interesting topics to discuss.

Additionally, it was great to catch up with friends made at last year’s conference, who were visiting Singapore for this event again. It’s become a real regional gathering of DevOps / SRE practitioners.

Below are my rough notes from some of the talks, mostly things that stood out to me for one reason or another (a nice idea, something to look into later, etc).

The talks were recorded by Engineers.sg, of course, so the videos will be up soon. I suggest you review them, as there are likely topics that you’ll find interesting.

Opening Keynote

How to do DevOps: only need to change one thing - everything.

“You build it, you run it” - the point isn’t to have no Ops, it’s to improve the feedback loop. Engineers get to hear from customers and hear their problems directly.

DevOps isn’t a tool / role / team; this message seems similar to Pragmatic Dave’s talk from the GOTO 2015 conference.

It’s verbs, not nouns - not Devs and Ops (and Security etc), it’s developing and operating.

Culture (organisational structure, communication methods, learning organisation)
Automation
Lean (how long from idea till it’s making / saving you money)
Measurement
Sharing

DevOps - culture where people, regardless of title or background, work together to imagine, build & operate things.

CI is a prerequisite for moving fast. Don’t just move fast & break things. Presence of a CI server doesn’t mean CI is actually done; don’t do that.

Continuous Delivery != Continuous Deployment. The former is push a button to deploy, the latter is deploy automatically.
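The distinction can be sketched as a single deployment gate. This is only an illustration: the function name and the mode strings are made up here, not taken from the talk or from any CI tool.

```python
def should_deploy(build_passed: bool, mode: str, approved: bool = False) -> bool:
    """Decide whether a build goes to production (hypothetical gate)."""
    if not build_passed:
        # A failed build is never deployed in either mode.
        return False
    if mode == "continuous_deployment":
        # Every passing build is deployed automatically.
        return True
    if mode == "continuous_delivery":
        # Every passing build is deployable, but a human pushes the button.
        return approved
    raise ValueError(f"unknown mode: {mode}")


print(should_deploy(True, "continuous_delivery"))
print(should_deploy(True, "continuous_delivery", approved=True))
print(should_deploy(True, "continuous_deployment"))
```

In both modes the pipeline before the gate is the same; only the last step differs.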

Products over projects

Projects have a start & end date, afterwards they’re done and the team is dismantled.

Product teams own it all the way. Product team doesn’t split once the first version ships.

Platform approach

Platform, compliance & security teams owning respective parts of the product.

Product teams can earn the right to have someone else run the code for them - SRE.

Resources

Classic from 2006, the post that introduced the idea of You build it, you run it to the masses - A Conversation with Werner Vogels, CTO of Amazon at the time.
What DevOps Means to Me
It’s not CI, It’s Just CI Theatre
Continuous Integration Certification
Evolutionary Architectures book
Building Microservices book
Continuous Delivery book

Simulating Incidents in Production

Incident responders:

not ready - don’t have the complete knowledge
not practiced - no prior experience, don’t know how to do it till you do it a bunch of times

Objective - inject controlled failure
Goal - build more resilient systems

Focus on incidents with highest frequency and impact.

Model impact as a decision (binary) tree - does the page load? yes / no; does the page list inventory? yes / no. Each “no” points to dashboard & runbooks.
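A minimal sketch of such a tree, assuming each question is answered yes / no and each “no” carries a dashboard and a runbook. The class names, dashboard paths and runbook paths are hypothetical, for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Leaf:
    """Where a responder should look after a "no" answer."""
    dashboard: str
    runbook: str


@dataclass
class Check:
    """A yes/no question about user-visible behaviour."""
    question: str
    on_no: Leaf
    on_yes: Optional["Check"] = None  # None: nothing further to check


def triage(node: Optional[Check], answers: dict[str, bool]) -> Optional[Leaf]:
    """Walk the tree and return the leaf behind the first "no", if any."""
    while node is not None:
        if not answers[node.question]:
            return node.on_no
        node = node.on_yes
    return None


# Hypothetical dashboards and runbooks.
tree = Check(
    question="Does the page load?",
    on_no=Leaf("dashboards/frontend", "runbooks/frontend-down"),
    on_yes=Check(
        question="Does the page list inventory?",
        on_no=Leaf("dashboards/inventory", "runbooks/inventory-empty"),
    ),
)

result = triage(
    tree,
    {"Does the page load?": True, "Does the page list inventory?": False},
)
print(result)
```

Ordering the questions by user-visible impact keeps the first “no” pointing at the most relevant runbook.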

The Evolution of SRE at LinkedIn

SRE is what happens when you ask a software engineer to design an operations team.

Engineering discipline
Design to remove failure
Work to automate toil away

LinkedIn Generations / Journey

The Firefighter

Incident management
- 1500 minutes MTTR
- all outages lined up would add up to a year
Purely reactive
Keeps the company going one more day

The Gatekeeper

Change control
Reactive towards dev plans
Protect “our” site from “them”

Basically Dev vs Ops

The Advocate

Site up culture - work cross-teams to keep the site up and running
Rebuild trusted relationships - Dev + Ops
Invested in building more tools

The Partner

Empowers intelligent risk
Proactive, joint planning with Dev
Collaborating to magnify impact
Hybrid on-call rotation

The Engineer

Reliability throughout software lifecycle
Proactive, one plan for SRE + Dev
Everyone has the same job: what is your job? - we're all engineers

Resources

Every Day Is Monday In Operations
Seeking SRE book. Focusing on how to adapt or adopt SRE principles to any organisation.

Consul Connect

Run
Secure
Provision

No longer host-based networking - connect to this host here.

Service networking - dynamic infrastructure, now here, now there.

Service discovery
Service segmentation - identity-based security
Service configuration

Load balancing - distributes traffic to healthy instances only.

Healthchecks - actually runs scripts on hosts!!! These looked awfully like Nagios, let’s not go there.
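The idea of sending traffic to healthy instances only can be sketched as a round-robin over whatever currently passes its health check. This is a toy model with made-up names and addresses, not how Consul implements it.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    address: str
    healthy: bool  # in a real system this would be updated by health checks


class HealthAwareBalancer:
    """Round-robin over instances that are currently marked healthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._counter = count()

    def pick(self) -> Instance:
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instances available")
        return healthy[next(self._counter) % len(healthy)]


# Hypothetical addresses.
balancer = HealthAwareBalancer([
    Instance("10.0.0.1", healthy=True),
    Instance("10.0.0.2", healthy=False),
    Instance("10.0.0.3", healthy=True),
])
picked = [balancer.pick().address for _ in range(4)]
print(picked)
```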

Nice webUI with overview.

Services identify with certs. Services authorise with intentions:

consul intention create --allow web db
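Conceptually, an intention is a rule keyed on a (source, destination) pair of service identities. The sketch below is a toy model of that idea only: the class is made up, the default-deny behaviour is an assumption, and it does not reflect Consul’s actual implementation or its precedence rules.

```python
class IntentionTable:
    """Toy allow/deny table keyed on (source service, destination service)."""

    def __init__(self, default_allow: bool = False):
        # Assumption: connections without a matching rule are denied.
        self.default_allow = default_allow
        self._rules: dict[tuple[str, str], bool] = {}

    def create(self, source: str, destination: str, allow: bool) -> None:
        self._rules[(source, destination)] = allow

    def is_allowed(self, source: str, destination: str) -> bool:
        return self._rules.get((source, destination), self.default_allow)


intentions = IntentionTable()
intentions.create("web", "db", allow=True)  # mirrors: web may talk to db

print(intentions.is_allowed("web", "db"))
print(intentions.is_allowed("db", "web"))
```

Rules are directional here, so allowing web to reach db says nothing about db reaching web.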

InfluxDB downsampling - a DevOps tale

55k metrics per second.

High memory usage on InfluxDB cluster in TSM model.

Continuous query - basically a recording rule. Retention policy - 2 weeks for all data, 4 weeks for downsampled data.

Basically save memory by keeping fewer metrics.
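Downsampling plus retention can be sketched in a few lines: raw points are averaged into fixed windows, and anything older than the retention period is dropped. This is a plain-Python illustration, not InfluxDB’s query language; the function names, the 60-second window and the sample data are assumptions.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, window_s):
    """Average (timestamp, value) points into fixed windows of window_s seconds."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


def apply_retention(points, now, max_age_s):
    """Keep only the points that are at most max_age_s seconds old."""
    return [(ts, value) for ts, value in points if now - ts <= max_age_s]


raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0)]
downsampled = downsample(raw, window_s=60)
print(downsampled)  # [(0, 2.0), (60, 6.0)]

two_weeks = 14 * 24 * 3600
kept = apply_retention(raw, now=two_weeks + 30, max_age_s=two_weeks)
print(kept)  # [(60, 5.0), (70, 7.0)]
```

Four raw points become two stored points, which is where the memory saving comes from.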

The State of DevOps 2018

Learning organisation - there’s something to learn from a failure. Move away from blaming anything (person, software) and to making the software more reliable.

Resources

Web Operations: Keeping the Data on Time book, focusing on ideas & culture, rather than tools
The Human Side of Postmortems book
The Art of Monitoring
State of DevOps Report 2018
Effective DevOps book - more about culture
DevOps Handbook book - basically a cookbook
Kubernetes Up & Running book
Accelerate - prescriptive book about learning organisations and how to create them; learn together about how to deliver business value through operations faster
Simon Wardley and his writings on Medium

Observability 3 ways: Logging, Metrics and Distributed Tracing

Metrics - number at timestamp. Aggregatable. Can identify trends.
Logs - event at timestamp. Easy to grep, readable / searchable.
Tracing - request-scoped. Track span from start to finish. Identify cause across services.
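The three signals differ mainly in the shape of the data. A rough sketch, with made-up record types, service names and values (timestamps are in milliseconds):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricPoint:
    name: str
    value: float
    timestamp: int


@dataclass
class LogEvent:
    message: str
    timestamp: int


@dataclass
class Span:
    trace_id: str  # shared by every span belonging to one request
    service: str
    parent: Optional[str]  # calling service, None for the root span
    start: int
    end: int

    @property
    def duration(self) -> int:
        return self.end - self.start


# Metrics: numbers at timestamps, cheap to aggregate.
metrics = [MetricPoint("http_requests", 1, 1000), MetricPoint("http_requests", 1, 2000)]
total_requests = sum(m.value for m in metrics if m.name == "http_requests")

# Logs: events at timestamps, easy to search.
logs = [LogEvent("GET /cart 200", 1000), LogEvent("GET /cart 500 upstream timeout", 2000)]
errors = [e for e in logs if " 500 " in e.message]

# Tracing: spans scoped to one request, showing where the time went.
spans = [
    Span("trace-1", "web", None, 2000, 2900),
    Span("trace-1", "inventory", "web", 2100, 2800),
    Span("trace-1", "db", "inventory", 2200, 2300),
]
downstream = [s for s in spans if s.parent is not None]
slowest = max(downstream, key=lambda s: s.duration)

print(total_requests, len(errors), slowest.service)
```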

Resources

Data Driven DevOps

Most software is 80% done 80% of the time.

Excuses / complaints at standup - verify by measuring the claim, then fix it if true. E.g. not enough test envs available - people check them out, don’t check them back in - implement automatic check-in after 3 days.
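The automatic check-in could look roughly like this. The data structure and the environment names are hypothetical; only the 3-day limit comes from the talk.

```python
from datetime import datetime, timedelta

MAX_CHECKOUT = timedelta(days=3)


def auto_check_in(checkouts: dict[str, datetime], now: datetime) -> list[str]:
    """Release environments held longer than MAX_CHECKOUT; return their names."""
    expired = [env for env, since in checkouts.items() if now - since > MAX_CHECKOUT]
    for env in expired:
        del checkouts[env]
    return sorted(expired)


now = datetime(2018, 10, 11, 9, 0)
checkouts = {
    "test-env-1": now - timedelta(days=5),   # forgotten
    "test-env-2": now - timedelta(hours=4),  # in active use
}
released = auto_check_in(checkouts, now)
print(released)
```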

Lots of high-severity tickets - poor product? Look at ticket resolution - turns out 50% of P1 & P2 tickets are closed with “works as designed”, i.e. users had incorrect information / expectations.
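Measuring that kind of claim takes only a few lines once ticket data is exported. A sketch, assuming each ticket is a dict with `priority` and `resolution` fields; the field names and the sample tickets are made up.

```python
def share_closed_as(tickets, priorities, resolution):
    """Fraction of tickets with the given priorities closed with this resolution."""
    selected = [t for t in tickets if t["priority"] in priorities]
    if not selected:
        return 0.0
    matching = sum(1 for t in selected if t["resolution"] == resolution)
    return matching / len(selected)


tickets = [
    {"priority": "P1", "resolution": "works as designed"},
    {"priority": "P1", "resolution": "fixed"},
    {"priority": "P2", "resolution": "works as designed"},
    {"priority": "P2", "resolution": "fixed"},
    {"priority": "P3", "resolution": "fixed"},
]
share = share_closed_as(tickets, {"P1", "P2"}, "works as designed")
print(f"{share:.0%}")
```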

Resources

Burn Down Chart

Less Yak Shaving with Dev

Not an expert, a practitioner - I practice and learn all the time.

Automation is happiness. Have everything as code.

Build Communities of Practice.

Hallway

Talking to Adrian Cole (who presented about Observability earlier) about observability(!) and tracing. Learned about Expedia’s Adaptive Alerting and Haystack projects. Also extremely interesting notes on Zipkin Wiki under Designs and Workshops, with notes about what observability questions people want to answer. (e.g. from Netflix: What percentage of edge requests have at least one mid-tier retry?).

Resources

Designing Data-Intensive Applications - CAP, algorithms, data layout etc, all really interesting things
Liquid Software book - evolution from Continuous Delivery to Continuous Updates
Spinnaker CD system from Netflix with really interesting features, like integration with monitoring platforms for metrics, custom Deployment Strategies and Chaos Monkey Integration (of course!)