SREcon17 Asia/Singapore


Report from SREcon17 Asia/Australia in Singapore

This is my first attendance at SREcon (or any other USENIX conference for that matter), so I decide to arrive early to mingle over breakfast. Somewhat surprisingly, there are only a dozen or so people here (the rest arrive later, I subsequently find). Breakfast is pretty good and even the pre-brewed coffee is drinkable, so that’s a good start.

I notice that Facebook already have a desk with people present; maybe they’re still jetlagged, as most of them are from US or European offices. There’s time for a brief chat and even time to take their technical challenge (everyone is hiring). It’s 13 multiple-choice (disappointed) questions to be completed in 5 minutes. That’s a fairly fast pace, 23 seconds per question, but I finish in half the time and find out they don’t tell you your score; it’s just a sort of “we’ll contact you if you’ve won” competition. So much for instant gratification. I was sure I got at least some (most?) of the questions correct; the last couple of questions required some thinking though, which was great.

By this time the LinkedIn folks have also arrived at their desk and, guess what, they’re also hiring^Whave a technical test with promises of prizes. And not just one test but two, for that matter: one is a Linux / systems test and one is a coding test. This is perfect timing as my adrenaline is still peaking following the sprint of the Facebook test, so I dive into the Linux quiz right away. This one is also a multiple-choice test, but much shorter, and the questions are very easy; I blitz it in 3 minutes (they allow up to 10) and feel like I could keep going. The coding challenge apparently takes 30 minutes. I ask some LinkedIn SREs standing around how long this test takes them and they say 20-30 minutes. OK, fine, this seems like a more serious test; I decide to leave it for later when I have at least 20 minutes of uninterrupted time, and it’s almost time for the first talk, Reliable Launches at Scale by Sebastian Kirsch from Google.

Talk 1 - Reliable Launches at Scale

I enjoyed this talk quite a bit, even though it wasn’t technical at all; it was about the process Google uses to ensure launches are successful.

Here’s a summary of key points:

  • A launch is not a release. Releases happen often and regularly, like clockwork. Launches happen once. Think about launching a new service or a new user-facing feature of an existing service.

  • Launches are difficult because:

    1. There are hard deadlines in place - marketing announces the launch date well in advance
    2. Large audience - if you’re launching a new service, you expect it to be popular
    3. Steep ramp-up in visitors - once live, a thundering herd of users arrives to try it out
    4. Unpredictable user behaviour - you just can’t predict all the things users will do on your system during the design phase, even if you have Google-level prior experience
  • Google has dedicated people, titled Launch Lead Engineers, whose responsibility is to act as auditors, gatekeepers, intra-team coordinators and educators to ensure the product architecture is sound, all sharp edges are smoothed and generally everything is in place to have a successful launch.

  • Achieving a rate of 100 launches per week comes down to having a lightweight launch process. Simplicity is key, as it ensures people aren’t tempted to get creative and circumvent the process. This balances against thoroughness - the process must cover all areas.

  • Google wouldn’t be Google if they hadn’t spent any time actually engineering a solution to help with launches. The result is a web app, Production Advisor, that starts as a questionnaire for the Launch Lead Engineers to describe the service being launched and generates actions to consider and check, as well as suggestions for reading material (like the SRE book) to help you make a better decision. For example, architectural questions ask about request flow & latency requirements and might suggest isolating user and batch requests. For capacity, it asks about the compute resources needed and the amount of press or promotion the launch is tied to, and might suggest performing a load test. The rest of the sections are:

    • Reliability (failure modes, SPOF)
    • Monitoring (useful metrics, alert conditions, SLO)
    • Automation (any manual or special processes involved)
    • Growth (expected growth of traffic)
    • Dependencies (third-party code, contractual obligations)
    • Rollouts (hard deadlines, marketing efforts)
  • Additional tools and techniques that help with reliable launches:

    • Have feature flag frameworks so that features can be toggled without doing a full release. This allows a rapid change of feature state - enabled or disabled (see the sketch after this list)
    • Gradual rollouts - these can create tension with marketing as they like a big-impact launch, but it’s wise to try a 5-10% launch ahead of the marketing launch and only then go to 100%
    • Dark launches - perform all actions except exposing the result in the UI. This is basically a load test with real production traffic
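
To make the feature flag and gradual rollout ideas concrete, here’s a minimal sketch of what a percentage-based flag check might look like. The flag store, flag name and rollout percentage are made up for illustration - this isn’t Google’s actual framework.

```python
# Minimal feature-flag sketch with a percentage-based (gradual) rollout.
# The flag store and names are hypothetical, purely for illustration.
import hashlib

FLAGS = {
    "new_search_ui": {"enabled": True, "rollout_percent": 10},  # 10% of users
}

def is_enabled(flag_name: str, user_id: str) -> bool:
    flag = FLAGS.get(flag_name)
    if not flag or not flag["enabled"]:
        return False
    # Hash the user id so each user lands in a stable bucket from 0-99.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < flag["rollout_percent"]

# Flipping "enabled" or raising "rollout_percent" changes behaviour
# without shipping a new release.
if is_enabled("new_search_ui", user_id="user-42"):
    print("serving the new feature")
else:
    print("serving the existing behaviour")
```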

This talk left me reflecting on how we do launches at $WORK and thinking hard about which of these ideas are applicable and can help to improve our launch prowess and reliability.

Talk 2 - Event Correlation: A Fresh Approach towards Reducing MTTR

Talk brief and slides are available here.

This talk is presented by two engineers from LinkedIn. The problem is presented thus: LinkedIn runs hundreds of microservices, many of which depend on each other. When you receive an alert for a service, how do you quickly and reliably establish which microservice is the root cause of the alert? Or, stated more verbosely: microservices lead to complexity, which leads to loss of reliability and a steep learning curve, which leads to delays in identifying the root cause and a lack of overall visibility, which results in false and frequent escalations.

LinkedIn’s solution to this is quite an impressive amount of engineering - they’ve spent months building a system (composed of microservices of course :) that analyses metrics, detects correlations, performs service discovery and builds a call graph, produces recommendations with scores, and feeds the result into a remediation service. We’re told the overall system has a high degree of accuracy and has reduced MTTR for production issues considerably.

For making recommendations, they use a k-means clustering method, which groups observations into k clusters while minimising the variance within each cluster. In other words, k-means helps them group similar observations together and make decisions based on the resulting clusters.
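
For a rough idea of what that grouping looks like, here’s a toy k-means example. The features (per-service changes in error rate and latency around the time of an alert) are my own guess at the kind of signals involved, not LinkedIn’s actual feature set.

```python
# Toy example: cluster per-service metric observations to surface
# root-cause candidates. Feature choice is illustrative only.
import numpy as np
from sklearn.cluster import KMeans

# rows: services; columns: [change in error rate, change in p99 latency (ms)]
observations = np.array([
    [0.02,   5],   # service A: barely moved
    [0.01,   3],   # service B: barely moved
    [0.85, 420],   # service C: large spike around the alert
    [0.80, 390],   # service D: large spike around the alert
    [0.03,   8],   # service E: barely moved
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(observations)
print(labels)  # services in the "spiking" cluster become root-cause candidates
```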

There’s way more awesomeness to cover here, but I suggest you wait for USENIX to eventually release the video recording for everyone to see and watch it yourself. Alternatively, you can watch this same presentation given by another LinkedIn engineer at SREcon17 Americas here.

Talk 3 - Automated Troubleshooting of Live Site Issues

The brief for this talk sounded great - “Auto Troubleshooting Platform that aggregates the data from all the underlying data sources, troubleshoots and records the results. The Platform is built in a way that anyone can post any type of ticket and get it troubleshooted automatically” - what’s not to like? Unfortunately this presentation didn’t go that well and was difficult to follow, mostly due to the quiet and at times hard to understand speech of the presenter. Hopefully he’ll watch the recording when it’s made available to see it through the eyes of the audience, and not be discouraged by this.

Summary

I enjoyed my brief stay at SREcon17; I met several interesting people and heard plenty of interesting ideas and stories. One thing that felt lacking, though, was the presence of visitors from more local companies - all the big names were there (Facebook, LinkedIn, Microsoft, Baidu), but hardly anyone from the local tech startups (Carousell, Honestbee, Gojek, Grab, to name the most well-known ones). Perhaps it was also the presence of big tech companies that made the atmosphere feel slightly corporate at times; you could sort of feel they’re mostly (only?) out here to hire people and move them away from Singapore to their huge office blocks. Subsequently I found out that most of the local production engineers in Singapore didn’t even know SREcon17 was taking place and were disappointed they missed it - sounds like event promotion has room for improvement, but it’s also perhaps not surprising: this is the first USENIX conference taking place in Singapore, and I suspect people aren’t used to seeing these conferences happen outside of North America & Europe.

The following day I’m surprised to receive an email from a technical recruiter at LinkedIn, telling me that I’ve won one of their challenges. This is unexpected. Even more surprisingly, another email follows several minutes later stating that I’ve won their coding challenge as well. Now this was quite hard to believe - I thought I did OK on both challenges, but simply didn’t expect to have my name pulled out of the hat, and also suspected someone would definitely do a better job than me on the coding challenge. I mean, my solutions were correct and relatively concise, but the thing with coding is that you can always spend some more time on it and craft it further. It could’ve been my honest & direct feedback on the coding challenge, inspired by a small amount of ice-cold beer I was having at the time, that influenced the decision - I found the challenge too easy for the given amount of time (30 minutes) and openly said so, noting that I had time to make a snack, open a beer, and still solve the challenge in half the time.

I thank LinkedIn for the prizes, and the beer for its amazing ability to improve one’s programming. If you have a chance to attend SREcon in the future, do it - you’ll enjoy it and leave inspired.