At the heart of any relationship – business, personal, transactional – lies trust. It’s a key component for any organisation, but perhaps especially pertinent within the world of eCommerce. A dependable service is of paramount importance when selecting a platform: (up)time really is money.
Also at stake is the ongoing trust and loyalty of your customer base; retention is something we fight hard to establish, and not something anyone wants to lose due to situations beyond their control. A platform’s clients are rightly won and lost based on the way that tricky, high-traffic situations are managed.
With this in mind, it’s good to know that your platform of choice has a dependable response strategy in place for times of varying crisis. In today’s article, we’ll walk through some of the measures bluCommerce puts in place to ensure efficient and dependable service for our many enterprise clients.
An incident response plan is more than a “good to have” – as of May 2018 and the dawn of GDPR, organisations must report a serious personal data breach to the relevant supervisory authority within 72 hours of becoming aware of it. To be PCI compliant, organisations must also implement an incident response plan that meets specific minimum requirements and demonstrates their ability to respond immediately to a data security incident.
Incident response processes should be continually adjusted and fine-tuned based on experience and evolving risks. It’s crucial to ensure that they work for all stakeholders, not just those at the business end of finding a solution. Resolution is of course paramount, but effective communication is just as vital in ensuring that everyone affected by the situation feels confident in swift solutions being found.
Another issue can arise when the development of a response process swings too far in the opposite direction. Overcommunication of issues that don’t require an urgent fix or rapid reaction can quickly lead to alert fatigue and misplaced alarm. It’s vital to ensure that the right triggers are tripping the switch of your systems. Clear definition of what constitutes an incident is crucial.
Improving Our Internal Process
The incident response process for bluCommerce was historically lacking in cohesion. Large incidents were dealt with, but there was room for improvement, particularly around communication. Similarly, we wanted to streamline the process linking support tickets flagging an issue to investigation and subsequent resolution.
There was also a pressing need to refine the definition of “urgent” when it came to alerts; our on-call team were frequently getting woken up about issues that didn't need a rapid response, and in many cases didn't require any action at all!
A few well-timed conferences, with talks from Google, Shopify, Facebook, Fastly, and PagerDuty, sparked our imagination and helped show us a route towards improving our situation. We started by reducing the number of alerts to a manageable level, turning off those that didn’t require an urgent response. As well as reducing the burden on our team, this also helped to emphasise the calls that did require immediate attention.
Internally, our focus moved from ensuring servers were all working all the time, to the actual impact on the end user: our clients’ customers. For example, a single server in a cluster that breaks and gets automatically replaced without any impact to customers doesn’t require an alert, but being unable to use a site does, especially if it impacts conversion by preventing successful checkouts.
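As a rough illustration of that shift, the sketch below pages the on-call team only when customer-facing signals degrade, and stays quiet when a server fails but shoppers are unaffected. It is not bluCommerce's actual alerting code: the `SiteHealth` fields, the `should_page_on_call` function and the 95% threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SiteHealth:
    """Snapshot of customer-facing signals for one site (hypothetical fields)."""
    site_reachable: bool          # can a shopper load the storefront?
    checkout_success_rate: float  # successful / attempted checkouts, 0.0 to 1.0
    unhealthy_servers: int        # servers failing health checks right now


# Assumed threshold, chosen purely for illustration.
MIN_CHECKOUT_SUCCESS_RATE = 0.95


def should_page_on_call(health: SiteHealth) -> bool:
    """Page only when shoppers are affected, not when a server merely fails."""
    if not health.site_reachable:
        return True
    if health.checkout_success_rate < MIN_CHECKOUT_SUCCESS_RATE:
        return True
    # A failed server with no customer impact is left to automatic replacement,
    # so unhealthy_servers is deliberately ignored here.
    return False


# One server is being replaced, but checkouts are fine: nobody is woken up.
print(should_page_on_call(SiteHealth(True, 0.99, unhealthy_servers=1)))
# Every server is healthy, but checkouts are failing: page immediately.
print(should_page_on_call(SiteHealth(True, 0.60, unhealthy_servers=0)))
```

The server count is still recorded, so it can feed a dashboard or a low-priority ticket without triggering a page.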
Once we could see the wood for the trees, we were able to look further ahead, and started thinking about how we should properly respond to incidents.
"As the Support Team, we handle a variety of issues affecting our platform on a daily basis. Occasionally, we may be faced with a more severe issue that affects the platform more than others.
Our incident response plan allows us to swiftly put in motion a resolution for these kind of incidents. It allows us to rapidly notify members of blubolt, who are key in different areas of expertise, to assess and investigate the underlying cause as a well-knit team."
- Robert Rademaker (Head of Support)
The bluCommerce Way
Our current incident response approach is based largely on PagerDuty’s process, which is in turn based on FEMA’s “National Incident Management System.” If it’s good enough for the USA’s Department of Homeland Security, we figured we were on to a good thing. This set-up is rapidly becoming something of an industry standard.
For more of a deep dive into the process, this video gives a great overview of the system we’ve built our own plans around.
A key factor in the success of our system is having a clear leader. Each incident has an Incident Commander who is responsible for making decisions during the incident, and who acts as the single source of truth.
The Incident Commander's initial aim is to reduce the impact on customers by mitigating the issue. This may include anything from adding more servers to disabling plugins or content - anything to make sure customers can use the site, buying a bit of headroom to investigate and resolve the root cause. They may do this themselves or call on members of the team to assist, but crucially, they are the one calling the shots.
We have implemented several innovative tools to improve response times and assist in following the incident response plan. Our incident management Slack bot is a great example of this, helping to remove barriers to communication across all internal stakeholders and improve coordination of the incident response.
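To give a flavour of how a bot can keep stakeholders aligned, here is a minimal sketch that turns an incident's current state into a consistent status message. It is not our bot's real code: the `IncidentUpdate` fields and the `format_update` helper are hypothetical, and the payload is simply a dictionary with a single `text` field, the kind of shape a chat webhook typically accepts.

```python
from dataclasses import dataclass


@dataclass
class IncidentUpdate:
    """Current state of an incident (all field names are hypothetical)."""
    incident_id: str
    commander: str
    status: str
    customer_impact: str
    next_update_minutes: int


def format_update(update: IncidentUpdate) -> dict:
    """Build a chat message payload with the same layout for every update."""
    lines = [
        f"Incident {update.incident_id}: {update.status}",
        f"Incident Commander: {update.commander}",
        f"Customer impact: {update.customer_impact}",
        f"Next update in {update.next_update_minutes} minutes",
    ]
    return {"text": "\n".join(lines)}


# Example values are invented for the demonstration.
payload = format_update(
    IncidentUpdate(
        incident_id="INC-001",
        commander="Example Commander",
        status="Mitigating",
        customer_impact="Checkout is slow for some shoppers",
        next_update_minutes=30,
    )
)
print(payload["text"])
```

Using one fixed layout means that anyone joining the channel mid-incident can see who is in charge, what customers are experiencing and when to expect the next update.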
Inevitably things break, particularly with the growing complexity of technology. While we, unfortunately, can't go back in time, we can learn from any incident that occurs to try to prevent further problems, and improve the overall reliability of the system. The way we do this is by carrying out blameless postmortems where we aim to work out what actually happened during an incident, and what steps could be taken so that we don't experience more problems in the future.
“Of course, building great sites is important. But making sure those sites remain transacting 24 hours a day, seven days a week is equally important, especially during peak periods such as the run-up to Christmas, and sales like Black Friday.
This is why, at blubolt, we have a very robust incident response plan, to ensure that even at their busiest times, our clients' sites stay up and running. Our incident response plan allows us to manage incidents as efficiently and with as little impact to on-site transactions as possible.”
- Warren Amphlett (Head Of Platform)
Incident Handling Best Practice Tips
Based on experience and the improvements made internally here with the bluCommerce platform, here are our top recommendations for ensuring a solid incident response process.
Ensure clarity regarding ownership of urgent issues; an incident commander should be assigned to lead each incident.
Use a tiering system for categorising incident severity, to help set internal expectations for responding to the incident.
Mitigate: prioritise an immediate reduction in the urgency of the issue, ahead of a final fix.
Consider all stakeholders (internal and external) when planning communication strategy.
Automate processes where possible to simplify incident response and cut down on the potential for human error.
Learn from experience by carrying out blameless postmortems. What needs to be done differently in the future to prevent this from occurring again?
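To make the tiering recommendation concrete, the sketch below maps customer impact to a severity tier, and each tier to a set of response expectations. The tier names, the `classify` rules, and the paging and update intervals are illustrative assumptions rather than bluCommerce's actual policy; every tier still has an incident commander, so that is not modelled as a per-tier setting.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical three-tier scheme; a lower number is more severe."""
    SEV1 = 1  # customers cannot use the site or complete checkout
    SEV2 = 2  # site works, but customers see degraded behaviour
    SEV3 = 3  # no visible customer impact


@dataclass(frozen=True)
class ResponseExpectation:
    page_on_call: bool            # wake someone up, or wait for working hours?
    update_interval_minutes: int  # how often stakeholders hear from us


# Assumed values, for illustration only.
EXPECTATIONS = {
    Severity.SEV1: ResponseExpectation(page_on_call=True, update_interval_minutes=30),
    Severity.SEV2: ResponseExpectation(page_on_call=True, update_interval_minutes=60),
    Severity.SEV3: ResponseExpectation(page_on_call=False, update_interval_minutes=240),
}


def classify(customers_blocked: bool, customers_degraded: bool) -> Severity:
    """Pick a tier from customer impact alone."""
    if customers_blocked:
        return Severity.SEV1
    if customers_degraded:
        return Severity.SEV2
    return Severity.SEV3


tier = classify(customers_blocked=False, customers_degraded=True)
print(tier.name, EXPECTATIONS[tier])
```

Writing the expectations down as data keeps them easy to review and adjust after each postmortem.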
There’s a lot to gain from nailing your own internal process – and from working with a platform that has mastered theirs. Since reviewing and refining our incident handling, bluCommerce has benefited from clearer ownership of issues and much faster fix times (with less pressure on those tasked with the fixes!). We’ve enabled easier and more valuable postmortems, which in turn help to guard against future issues. SLAs are upheld and we earn deeper levels of trust from happy, reassured clients.
If you’d like to learn more about bluCommerce, the blubolt team will be exhibiting on Stand E62 at Ecommerce Expo next month. Get in touch to book your meeting today.