This year I was lucky enough to attend this event again. Overall, I thoroughly enjoyed it. It retained the single-track style, which made it easy to catch all the talks and contributed to the cozy feel of the event; it wasn’t too busy. As always, I found the hallway conversations one of the best parts of the event - the smart people gathered together and the setting & context provided by the conference ensured there were plenty of interesting topics to discuss.
Additionally, it was great to catch up with the friends made at last year’s conference, who were visiting Singapore for this event again. It’s become a real regional gathering of DevOps / SRE practitioners.
Below are my rough notes from some of the talks, mostly things that stood out to me for one reason or another (nice idea or something to look into later etc).
The talks were recorded by Engineers.sg of course, so the videos will be up soon. I suggest you review them, as there are likely topics that you’ll find interesting.
Opening Keynote
How to do DevOps: only need to change one thing - everything.
“You build it, you run it” - the point isn’t to have no Ops, it’s to improve the feedback loop. Engineers get to hear from customers and hear their problems directly.
DevOps isn’t a tool / role / team; this message seems similar to Pragmatic Dave’s talk from the GOTO 2015 conference.
It’s verbs, not nouns - not Devs and Ops (and Security etc), it’s developing and operating.
- Culture (organisational structure, communication methods, learning organisation)
- Automation
- Lean (how long from idea till it’s making / saving you money)
- Measurement
- Sharing
DevOps - culture where people, regardless of title or background, work together to imagine, build & operate things.
CI is a prerequisite for moving fast. Don’t just move fast & break things. The presence of a CI server doesn’t mean CI is actually being practised; don’t settle for that.
Continuous Delivery != Continuous Deployment. The former is push a button to deploy, the latter is deploy automatically.
Products over projects
Projects have a start & end date, afterwards they’re done and the team is dismantled.
Product teams own it all the way. Product team doesn’t split once the first version ships.
Platform approach
Platform, compliance & security teams owning respective parts of the product.
Product teams can earn the right to have someone else run the code for them - SRE.
Resources
- Classic from 2006, the post that introduced the idea of You build it, you run it to the masses - A Conversation with Werner Vogels, CTO of Amazon at the time.
- What DevOps Means to Me
- It’s not CI, It’s Just CI Theatre
- Continuous Integration Certification
- Evolutionary Architectures book
- Building Microservices book
- Continuous Delivery book
Simulating Incidents in Production
Incident responders:
- not ready - don’t have the complete knowledge
- not practiced - no prior experience, don’t know how to do it till you do it a bunch of times
Objective - inject controlled failure
Goal - build more resilient systems
Focus on incidents with highest frequency and impact.
Model impact as a decision (binary) tree - does the page load? yes / no; does the page list inventory? yes / no. Each “no” points to dashboard & runbooks.
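A rough sketch of how such a tree could be modelled, assuming each node is a yes/no probe and each “no” carries a dashboard and a runbook. The class names, probe functions and paths are all made up for illustration; they are not from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Action:
    """Leaf: where responders should look when a check answers "no"."""
    dashboard: str
    runbook: str


@dataclass
class Check:
    """Inner node: a yes/no question about user-visible impact."""
    question: str
    probe: Callable[[], bool]         # returns True for "yes"
    on_no: Action                     # each "no" points to a dashboard & runbook
    on_yes: Optional["Check"] = None  # next question, or None if nothing is left to ask


def triage(node: Optional[Check]) -> Optional[Action]:
    """Walk the tree until a check answers "no" and return its action."""
    while node is not None:
        if not node.probe():
            return node.on_no
        node = node.on_yes
    return None  # every question was answered "yes"


# Hypothetical probes: real ones would query the site itself.
lists_inventory = Check(
    question="Does the page list inventory?",
    probe=lambda: False,
    on_no=Action("dashboards/inventory", "runbooks/inventory-empty"),
)
page_loads = Check(
    question="Does the page load?",
    probe=lambda: True,
    on_no=Action("dashboards/frontend", "runbooks/page-down"),
    on_yes=lists_inventory,
)

result = triage(page_loads)
print(result)
```

Keeping the tree as data means the same structure can drive both the incident simulation and the responder’s checklist.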
The Evolution of SRE at LinkedIn
SRE is what happens when you ask a software engineer to design an operations team.
- Engineering discipline
- Design to remove failure
- Work to automate toil away
LinkedIn Generations / Journey
The Firefighter
- Incident management
- 1500 minutes MTTR
- all outages lined up would add up to a year
- Purely reactive
- Keeps the company going one more day
The Gatekeeper
- Change control
- Reactive towards dev plans
- Protect “our” site from “them”
Basically Dev vs Ops
The Advocate
- Site up culture - work cross-teams to keep the site up and running
- Rebuild trusted relationships - Dev + Ops
- Invested in building more tools
The Partner
- Empowers intelligent risk
- Proactive, joint planning with Dev
- Collaborating to magnify impact
- Hybrid on-call rotation
The Engineer
- Reliability throughout software lifecycle
- Proactive, one plan for SRE + Dev
- Everyone has the same job: “What is your job?” - “We’re all engineers.”
Resources
- Every Day Is Monday In Operations
- Seeking SRE book. Focusing on how to adapt or adopt SRE principles to any organisation.
Consul Connect
- Run
- Secure
- Provision
No longer host-based networking - connect to this host here.
Service networking - dynamic infrastructure, now here, now there.
- Service discovery
- Service segmentation - identity-based security
- Service configuration
Load balancing - distributes traffic to healthy instances only.
Register services via API or config file.
Healthchecks - these actually run scripts on hosts!!! They looked awfully like Nagios, let’s not go there.
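For the API route, a minimal sketch of what a registration could look like against a local agent’s `/v1/agent/service/register` endpoint, using an HTTP check rather than a script. The service name `web` is borrowed from the intention example in these notes; the port, health URL and agent address are my assumptions.

```python
import json
import urllib.request


def service_definition(name: str, port: int, health_url: str) -> dict:
    """Build a service registration with an HTTP health check."""
    return {
        "Name": name,
        "Port": port,
        "Check": {
            "HTTP": health_url,  # the agent polls this URL instead of running a script
            "Interval": "10s",
        },
    }


def register(definition: dict, agent: str = "http://127.0.0.1:8500") -> int:
    """PUT the definition to the local agent and return the HTTP status."""
    req = urllib.request.Request(
        f"{agent}/v1/agent/service/register",
        data=json.dumps(definition).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status


# Assumed values, for illustration only.
web = service_definition("web", 8080, "http://localhost:8080/health")
print(json.dumps(web, indent=2))
```

Calling `register(web)` needs a running agent, so the sketch only prints the payload.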
Nice web UI with overview.
Services identify with certs. Services authorise with intentions:
consul intention create --allow web db
InfluxDB downsampling - a DevOps tale
55k metrics per second.
High memory usage on InfluxDB cluster in TSM model.
Continuous query - basically a recording rule. Retention policy - 2 weeks for all data, 4 weeks for downsampled data.
Basically save memory by keeping fewer metrics.
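To make the downsampling idea concrete, here is a plain-Python sketch of averaging raw points into fixed time buckets. This is not how InfluxDB implements continuous queries, and the 5-minute bucket width is my assumption rather than a figure from the talk.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds=300):
    """Average (timestamp, value) points into fixed-width time buckets."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - (ts % bucket_seconds)  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


# Five raw points collapse into two stored points.
raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 10.0), (540, 20.0)]
print(downsample(raw))  # [(0, 2.0), (300, 15.0)]
```

The raw series can then expire under the shorter retention policy while the averaged one is kept longer.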
The State of DevOps 2018
Learning organisation - there’s something to learn from a failure. Move away from blaming anything (person, software) and towards making the software more reliable.
Resources
- Web Operations: Keeping the Data on Time book, focusing on ideas & culture, rather than tools
- The Human Side of Postmortems book
- The Art of Monitoring
- State of DevOps Report 2018
- Effective DevOps book - more about culture
- DevOps Handbook book - basically a cookbook
- Kubernetes Up & Running book
- Accelerate - prescriptive book about learning organisations and how to create them; learn together about how to deliver business value through operations faster
- Simon Wardley and his writings on Medium
Observability 3 ways: Logging, Metrics and Distributed Tracing
- Metrics - number at timestamp. Aggregatable. Can identify trends.
- Logs - event at timestamp. Easy to grep, readable / searchable.
- Tracing - request-scoped. Track span from start to finish. Identify cause across services.
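One way to picture the difference between the three signals is by the shape of the data and the question each answers. The sketch below is my own simplification, with made-up field names and sample values, not the speaker’s model or any library’s format.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    """A number at a timestamp; cheap to aggregate."""
    name: str
    timestamp: float
    value: float


@dataclass
class LogEvent:
    """An event at a timestamp; easy to search."""
    timestamp: float
    message: str


@dataclass
class Span:
    """One service's share of a request, tied to the request by trace_id."""
    trace_id: str
    service: str
    start: float
    end: float
    error: bool = False


def total(metrics, name):
    """Metrics: aggregate numbers to see a trend."""
    return sum(m.value for m in metrics if m.name == name)


def grep(logs, needle):
    """Logs: search events for a string."""
    return [e.message for e in logs if needle in e.message]


def failing_services(spans, trace_id):
    """Tracing: find which services failed within one request."""
    return [s.service for s in spans if s.trace_id == trace_id and s.error]


# Made-up sample data for a single failing request.
metrics = [Metric("http_errors", 1.0, 2), Metric("http_errors", 2.0, 3)]
logs = [LogEvent(1.5, "GET /cart 500 trace=abc"), LogEvent(1.6, "GET /home 200 trace=def")]
spans = [Span("abc", "web", 1.0, 1.5), Span("abc", "db", 1.1, 1.4, error=True)]

print(total(metrics, "http_errors"))
print(grep(logs, " 500 "))
print(failing_services(spans, "abc"))
```

The metric says something is wrong, the log says which request, and the trace says which service.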
Data Driven DevOps
Most software is 80% done 80% of the time.
Excuses / complaints at standup - verify by measuring the claim, then fix it if true. E.g. not enough test envs available - people check them out, don’t check them back in - implement automatic check-in after 3 days.
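The automatic check-in could be as simple as a scheduled job along these lines. This is a sketch under my own assumptions: a dict mapping environment name to checkout time, with invented names and dates.

```python
from datetime import datetime, timedelta

MAX_CHECKOUT = timedelta(days=3)  # the 3-day limit from the talk


def release_stale(checkouts: dict, now: datetime) -> list:
    """Check in environments held longer than MAX_CHECKOUT; return their names."""
    stale = [env for env, since in checkouts.items() if now - since > MAX_CHECKOUT]
    for env in stale:
        del checkouts[env]  # the environment goes back to the pool
    return sorted(stale)


# Hypothetical state: one environment held for 5 days, one for 6 hours.
now = datetime(2018, 10, 12, 9, 0)
checkouts = {
    "test-1": now - timedelta(days=5),
    "test-2": now - timedelta(hours=6),
}
released = release_stale(checkouts, now)
print(released)  # ['test-1']
```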
Lots of high-severity tickets - poor product? Look at ticket resolution - turns out 50% of P1 & P2 tickets are closed with “works as designed”, i.e. users had incorrect information / expectations.
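Checking a claim like this comes down to counting resolutions per severity. A small sketch, assuming tickets are dicts with `severity` and `resolution` keys; the sample tickets are invented and only mirror the 50% figure from the talk.

```python
from collections import Counter


def resolution_share(tickets, severities=("P1", "P2")):
    """Fraction of tickets per resolution among the given severities."""
    selected = [t["resolution"] for t in tickets if t["severity"] in severities]
    if not selected:
        return {}
    counts = Counter(selected)
    return {resolution: n / len(selected) for resolution, n in counts.items()}


# Made-up tickets; the P3 one is ignored by the default filter.
tickets = [
    {"severity": "P1", "resolution": "works as designed"},
    {"severity": "P2", "resolution": "fixed"},
    {"severity": "P2", "resolution": "works as designed"},
    {"severity": "P1", "resolution": "fixed"},
    {"severity": "P3", "resolution": "fixed"},
]
shares = resolution_share(tickets)
print(shares)
```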
Less Yak Shaving with Dev
Not an expert, a practitioner - I practice and learn all the time.
Automation is happiness. Have everything as code.
Build Communities of Practice.
Hallway
Talking to Adrian Cole (who presented about Observability earlier) about observability(!) and tracing. Learned about Expedia’s Adaptive Alerting and Haystack projects. Also extremely interesting notes on the Zipkin Wiki under Designs and Workshops, with notes about what observability questions people want to answer (e.g. from Netflix: What percentage of edge requests have at least one mid-tier retry?).
Resources
- Designing Data-Intensive Applications - CAP, algorithms, data layout etc, all really interesting things
- Liquid Software book - evolution from Continuous Delivery to Continuous Updates
- Spinnaker CD system from Netflix with really interesting features, like integration with monitoring platforms for metrics, custom Deployment Strategies and Chaos Monkey Integration (of course!)