DevOpsDays Singapore 2018

DevOpsDays Singapore 2018 was here

This year I was lucky enough to attend this event again, and I thoroughly enjoyed it. It retained the single-track style, which made it easy to catch all the talks and contributed to the cozy feel of the event; it wasn’t too busy. As always, the hallway conversations were one of the best parts: the smart people gathered together, plus the setting and context provided by the conference, ensured there were plenty of interesting topics to discuss.

Additionally, it was great to catch up with friends made at last year’s conference who were visiting Singapore for this event again. It’s become a real regional gathering of DevOps / SRE practitioners.

Below are my rough notes from some of the talks, mostly things that stood out to me for one reason or another (a nice idea, something to look into later, etc.).

The talks were recorded by Engineers.sg, of course, so the videos will be up soon. I suggest you review them, as there are likely topics that you’ll find interesting.

Opening Keynote

How to do DevOps: only need to change one thing - everything.

“You build it, you run it” - the point isn’t to have no Ops, it’s to improve the feedback loop. Engineers get to hear from customers and hear their problems directly.

DevOps isn’t a tool / role / team; this message seems similar to Pragmatic Dave’s talk from the GOTO 2015 conference.

It’s verbs, not nouns - not Devs and Ops (and Security etc), it’s developing and operating.

  • Culture (organisational structure, communication methods, learning organisation)
  • Automation
  • Lean (how long from idea till it’s making / saving you money)
  • Measurement
  • Sharing

DevOps - a culture where people, regardless of title or background, work together to imagine, build & operate things.

CI is a prerequisite for moving fast. Don’t just move fast & break things. Presence of a CI server doesn’t mean CI is actually done; don’t do that.

Continuous Delivery != Continuous Deployment. The former is push a button to deploy, the latter is deploy automatically.
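
The distinction can be sketched in a few lines of Python. This is a toy model rather than any real CI tool's API; the function `release` and its parameters `auto_deploy` and `approved` are hypothetical names chosen for illustration.

```python
def release(build_passed: bool, auto_deploy: bool, approved: bool = False) -> str:
    """Decide what happens to a build once the pipeline has finished."""
    if not build_passed:
        return "rejected"   # a failing build is never released
    if auto_deploy:
        return "deployed"   # continuous deployment: no human in the loop
    # continuous delivery: the build is releasable but waits for the button
    return "deployed" if approved else "ready"


print(release(build_passed=True, auto_deploy=False))                 # delivery, button not pushed
print(release(build_passed=True, auto_deploy=False, approved=True))  # delivery, button pushed
print(release(build_passed=True, auto_deploy=True))                  # continuous deployment
```

In both modes every passing build is deployable; the only difference is whether a person makes the final call.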

Products over projects

Projects have a start & end date, afterwards they’re done and the team is dismantled.

Product teams own it all the way. Product team doesn’t split once the first version ships.

Platform approach

Platform, compliance & security teams owning respective parts of the product.

Product teams can earn the right to have someone else run the code for them - SRE.

Simulating Incidents in Production

Incident responders:

  • not ready - don’t have the complete knowledge
  • not practiced - no prior experience, don’t know how to do it till you do it a bunch of times

Objective - inject controlled failure

Goal - build more resilient systems

Focus on incidents with highest frequency and impact.

Model impact as a decision (binary) tree - does the page load? yes / no; does the page list inventory? yes / no. Each “no” points to dashboard & runbooks.
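
A rough sketch of such a tree in Python, assuming each question can be answered by a probe function. The class names, the `triage` helper, and the dashboard / runbook labels are all hypothetical; only the two questions come from the talk.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    action: str  # where the responder should look next


@dataclass
class Check:
    question: str
    probe: Callable[[dict], bool]  # answers the question from observed state
    yes: Union["Check", Leaf]
    no: Union["Check", Leaf]


def triage(node, state):
    """Walk the tree, recording each answer, until a leaf is reached."""
    path = []
    while isinstance(node, Check):
        answer = node.probe(state)
        path.append((node.question, answer))
        node = node.yes if answer else node.no
    return path, node.action


tree = Check(
    "does the page load?",
    lambda s: s["page_loads"],
    yes=Check(
        "does the page list inventory?",
        lambda s: s["lists_inventory"],
        yes=Leaf("no impact detected"),
        no=Leaf("open inventory dashboard and runbook"),
    ),
    no=Leaf("open frontend dashboard and runbook"),
)

path, action = triage(tree, {"page_loads": True, "lists_inventory": False})
print(path)
print(action)
```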

The Evolution of SRE at LinkedIn

SRE is what happens when you ask a software engineer to design an operations team.

  • Engineering discipline
  • Design to remove failure
  • Work to automate toil away

LinkedIn Generations / Journey

The Firefighter

  • Incident management
    • 1500 minutes MTTR
    • all outages lined up would add up to a year
  • Purely reactive
  • Keeps the company going one more day
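
To get a feel for those two numbers, here is some back-of-the-envelope arithmetic. It assumes the 1500-minute MTTR is the mean duration of each outage and that the outages "add up to a year" of 365 days; neither the time period nor the incident count was stated in my notes, so the result is only an implied figure.

```python
MTTR_MINUTES = 1500               # figure from the talk
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: "a year" means 365 days

hours_per_incident = MTTR_MINUTES / 60
implied_incidents = MINUTES_PER_YEAR / MTTR_MINUTES

print(f"{hours_per_incident:.0f} hours to recover per incident")
print(f"roughly {implied_incidents:.0f} incidents to fill a year")
```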

The Gatekeeper

  • Change control
  • Reactive towards dev plans
  • Protect “our” site from “them”

Basically Dev vs Ops

The Advocate

  • Site up culture - work cross-teams to keep the site up and running
  • Rebuild trusted relationships - Dev + Ops
  • Invested in building more tools

The Partner

  • Empowers intelligent risk
  • Proactive, joint planning with Dev
  • Collaborating to magnify impact
  • Hybrid on-call rotation

The Engineer

  • Reliability throughout software lifecycle
  • Proactive, one plan for SRE + Dev
  • Everyone has the same job: what is your job? - we're all engineers

Consul Connect

  • Run
  • Secure
  • Provision

No longer host-based networking - connect to this host here.

Service networking - dynamic infrastructure, now here, now there.

  • Service discovery
  • Service segmentation - identity-based security
  • Service configuration

Load balancing - distributes traffic to healthy instances only.

Register services via API or config file.
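
As a minimal sketch of the two routes, the Python below builds a service definition for a config file and the equivalent body for the agent's HTTP registration endpoint (`PUT /v1/agent/service/register`). The service name `web` matches the intention example further down; the port number and the helper function names are assumptions for illustration.

```python
import json


def service_definition(name: str, port: int) -> dict:
    """Config-file form: the definition is nested under a "service" key."""
    return {"service": {"name": name, "port": port}}


def register_payload(name: str, port: int) -> dict:
    """HTTP API form: body for PUT /v1/agent/service/register."""
    return {"Name": name, "Port": port}


# Contents for a file such as web.json in the agent's config directory.
print(json.dumps(service_definition("web", 8080), indent=2))

# Body to send to the local agent's registration endpoint.
print(json.dumps(register_payload("web", 8080)))
```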

Healthchecks - actually runs scripts on hosts!!! These looked awfully like Nagios, let’s not go there.

Nice webUI with overview.

Services identify with certs. Services authorise with intentions:

consul intention create --allow web db

InfluxDB downsampling - a DevOps tale

55k metrics per second.

High memory usage on InfluxDB cluster in TSM model.

Continuous query - basically a recording rule. Retention policy - 2 weeks for all data, 4 weeks for downsampled data.

Basically save memory by keeping fewer metrics.
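
The idea behind downsampling can be sketched in plain Python. This is not InfluxDB's continuous query engine, just an illustration of collapsing raw points into one mean per time bucket; the function name, the 60-second bucket, and the sample points are made up.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Collapse (timestamp, value) points into one mean per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1.0), (10, 3.0), (60, 5.0), (70, 7.0), (110, 9.0)]
print(downsample(raw, 60))  # five raw points become two: [(0, 2.0), (60, 7.0)]
```

The raw series can then be dropped under one retention policy while the much smaller downsampled series is kept under another.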

The State of DevOps 2018

Learning organisation - there’s something to learn from a failure. Move away from blaming anything (person, software) and to making the software more reliable.

Observability 3 ways: Logging, Metrics and Distributed Tracing

  • Metrics - number at timestamp. Aggregatable. Can identify trends.
  • Logs - event at timestamp. Easy to grep, readable / searchable.
  • Tracing - request-scoped. Track span from start to finish. Identify cause across services.
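
The three signals differ mainly in the shape of the data, which the sketch below tries to show. The class and function names, field layout, and sample values are all assumptions for illustration and do not follow any particular tool's data model.

```python
from dataclasses import dataclass


@dataclass
class Metric:      # a number at a timestamp
    name: str
    timestamp: float
    value: float


@dataclass
class LogEvent:    # an event at a timestamp
    timestamp: float
    message: str


@dataclass
class Span:        # one timed step of a request, linked to the others by trace_id
    trace_id: str
    service: str
    start: float
    end: float


def average(metrics):
    """Metrics aggregate: many points reduce to one number."""
    return sum(m.value for m in metrics) / len(metrics)


def grep(logs, needle):
    """Logs are searched: keep the events whose text matches."""
    return [e for e in logs if needle in e.message]


def slowest_span(spans, trace_id):
    """Traces are request-scoped: compare the steps of a single request."""
    in_trace = [s for s in spans if s.trace_id == trace_id]
    return max(in_trace, key=lambda s: s.end - s.start)


metrics = [Metric("latency_ms", 1.0, 120.0), Metric("latency_ms", 2.0, 180.0)]
logs = [LogEvent(1.0, "GET /cart 200"), LogEvent(2.0, "GET /cart 500 upstream timeout")]
spans = [
    Span("t1", "auth", 0.0, 0.1),
    Span("t1", "inventory", 0.1, 0.8),
    Span("t1", "cart", 0.8, 0.9),
    Span("t2", "auth", 0.0, 0.1),
]

print(average(metrics))
print(grep(logs, "500"))
print(slowest_span(spans, "t1").service)
```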

Data Driven DevOps

Most software is 80% done 80% of the time.

Excuses / complaints at standup - verify by measuring the claim, then fix it if true. E.g. not enough test envs available - people check them out, don’t check them back in - implement automatic check-in after 3 days.
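
The automatic check-in could look something like the sketch below. Only the 3-day limit comes from the talk; the function names, the environment names, and the dates are arbitrary and assumed for illustration.

```python
from datetime import datetime, timedelta

MAX_CHECKOUT = timedelta(days=3)  # limit mentioned in the talk


def overdue(checkouts, now):
    """Names of environments held for longer than MAX_CHECKOUT."""
    return sorted(env for env, taken_at in checkouts.items()
                  if now - taken_at > MAX_CHECKOUT)


def auto_check_in(checkouts, now):
    """Release every overdue environment and report which ones were freed."""
    released = overdue(checkouts, now)
    for env in released:
        del checkouts[env]
    return released


now = datetime(2018, 1, 10, 9, 0)
checkouts = {
    "test-1": datetime(2018, 1, 6, 9, 0),  # held for 4 days
    "test-2": datetime(2018, 1, 9, 9, 0),  # held for 1 day
}
released = auto_check_in(checkouts, now)
print(released)         # environments returned to the pool
print(list(checkouts))  # environments still checked out
```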

Lots of high-severity tickets - poor product? Look at ticket resolution - turns out 50% of P1 & P2 tickets are closed with “works as designed”, i.e. users had incorrect information / expectations.
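
Measuring this only needs a count of resolutions per priority. The sketch below uses a handful of made-up tickets and hypothetical field names (`priority`, `resolution`); the sample is contrived so that the result matches the 50% figure from the talk.

```python
from collections import Counter


def resolution_share(tickets, priorities=("P1", "P2")):
    """Fraction of tickets per resolution, limited to the given priorities."""
    selected = [t["resolution"] for t in tickets if t["priority"] in priorities]
    if not selected:
        return {}
    counts = Counter(selected)
    return {resolution: n / len(selected) for resolution, n in counts.items()}


tickets = [
    {"priority": "P1", "resolution": "works as designed"},
    {"priority": "P1", "resolution": "fixed"},
    {"priority": "P2", "resolution": "works as designed"},
    {"priority": "P2", "resolution": "duplicate"},
    {"priority": "P3", "resolution": "fixed"},  # ignored: not high severity
]

shares = resolution_share(tickets)
print(shares["works as designed"])
```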

Less Yak Shaving with Dev

Not an expert, a practitioner - I practice and learn all the time.

Automation is happiness. Have everything as code.

Build Communities of Practice.

Hallway

Talked to Adrian Cole (who presented about Observability earlier) about observability(!) and tracing. Learned about Expedia’s Adaptive Alerting and Haystack projects. There are also extremely interesting notes on the Zipkin wiki under Designs and Workshops about what observability questions people want to answer (e.g. from Netflix: what percentage of edge requests have at least one mid-tier retry?).

Resources

  • Designing Data-Intensive Applications - CAP, algorithms, data layout etc, all really interesting things
  • Liquid Software book - evolution from Continuous Delivery to Continuous Updates
  • Spinnaker CD system from Netflix with really interesting features, like integration with monitoring platforms for metrics, custom Deployment Strategies and Chaos Monkey Integration (of course!)