BCDR Series: Retail Resilience – Retailers Suffer Major Outages Ahead of Holiday Season

This year, major retailers will undoubtedly find themselves in the crosshairs of cyber-attacks as online shopping continues to rise. The National Retail Federation (NRF) expects online sales in November and December to increase by 6% to 8% over last year, with projections as high as $105 billion.[1] On top of that, 46% of consumers say they intend to do their holiday browsing and buying online.[2] This is great news for businesses, so long as online shoppers can access their websites. In just the past three months, Netflix, eBay, and PayPal suffered lengthy outages. While these outages have not yet been linked to cyber-attacks, they have highlighted the importance of keeping incident response, disaster recovery, business continuity planning, and strategic communication at the forefront of business decisions, particularly during the holiday shopping season. Let’s take a look at some of these outages in more detail:

On Sunday, September 20th, Netflix’s cloud service provider, Amazon Web Services, suffered a brief network disruption that impacted a portion of its DynamoDB storage servers. Normally, this type of network disruption is handled seamlessly and without any change in performance. These servers handle requests for membership lists and metadata, process updates to those data sets, and reconfirm the server’s availability to accept additional requests. If the storage servers are not able to retrieve this data within a specific time period, they retry the request and temporarily disqualify themselves from accepting requests. On this occasion, Amazon’s metadata storage servers became so overloaded with retry requests that Amazon technicians were not able to inject administrative requests to add capacity.
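The retry behavior described above is what turns a brief disruption into a sustained overload: when many servers retry at the same moment, the retries themselves become the load. One widely used client-side mitigation is capped exponential backoff with random jitter. The Python sketch below is an illustration only and does not describe how Amazon resolved this incident; the function name backoff_delays and its base and cap values are hypothetical.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one randomized wait time (in seconds) per retry attempt.

    Hypothetical helper: 'base' and 'cap' are illustrative values,
    not settings taken from any real service.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles with each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: choose a wait uniformly between 0 and the ceiling,
        # so that many clients do not all retry at the same instant.
        delays.append(rng.uniform(0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(backoff_delays(6), start=1):
        print(f"retry {i}: wait {delay:.2f}s")
```

Because each wait is drawn at random, the exact values differ from run to run; the point is that retries spread out over time instead of arriving in synchronized waves.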

The outage was obviously noticeable to Netflix, but countless other customers also suffered dramatic service degradation or found their websites completely unavailable for approximately five hours that day. Unfortunately, this was not Netflix’s only outage in recent memory. In mid-October, Netflix and Expedia were affected by a technical malfunction in UltraDNS’ cloud-based content delivery service. This time, UltraDNS was able to resolve the issue in about 90 minutes.[3] The likelihood of a repeat of either event is exceedingly low, but if these events were to occur on Black Friday or Cyber Monday, the impact would be catastrophic.

Netflix was not alone in suffering an outage in October. Later that month, for about two hours on the night of October 30th, a power outage took down PayPal services worldwide. With 173 million users and a per-day average of $644 million in transaction volume, the economic impact of this event could easily be in the millions[4] for just a few hours of downtime. This projection, by and large, includes only the direct financial impact of the outage and does not account for longer-term effects such as changes in consumer confidence or degradation of brand image.
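To put the figures above in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes that transaction volume is spread evenly across the day, which is a simplification, and the function name volume_during_outage is purely illustrative.

```python
DAILY_VOLUME_USD = 644_000_000  # per-day average transaction volume cited above
OUTAGE_HOURS = 2                # approximate outage length cited above


def volume_during_outage(daily_volume, outage_hours):
    """Estimate the transaction volume that would normally flow during an outage.

    Assumes volume is spread evenly over 24 hours (a simplification;
    real traffic varies by time of day and by region).
    """
    return daily_volume * outage_hours / 24


if __name__ == "__main__":
    estimate = volume_during_outage(DAILY_VOLUME_USD, OUTAGE_HOURS)
    print(f"Estimated volume affected: ${estimate:,.0f}")
```

Under that assumption, roughly $53.7 million in transactions would normally have passed through the service during a two-hour window. Transaction volume is not the same as lost revenue, and some purchases may simply have been delayed, so this is best read as an indication of scale, not a loss figure.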

To make matters worse, eBay, hosted at the same datacenter as PayPal, not only went down in October, but went down again on November 14th. This second outage was caused by a network connectivity issue in one of its datacenters, and it impacted eBay sales and customer experience for more than eight hours. Throughout these incidents, communication was scarce, and although assurances were made, follow-through could have been better. During the outage, eBay said it would remove the transaction defects it could identify, but in many cases these outages had a real monetary cost for buyers and sellers using the service. In response, buyers and sellers took to the help forums for clarification on which sales were valid and how the outages would impact seller ratings. Because of this lack of effective communication from the service provider, the impact of these outages may extend well beyond the period when the systems were down.

That’s three major online retailers and service providers, accounting for significant portions of global network traffic and online payments, all suffering extended outages in the past three months. At this point, we don’t know the full extent of the financial impact, and given how (understandably) tight-lipped many of these organizations are about the root cause, we may never fully know. There are, however, some important steps businesses can take to better prepare for these unavoidable situations.

Strategic Communication

It is critical to incorporate continuity and crisis communications (public affairs) when these events occur. While both Netflix and PayPal worked to restore service, netizens took to social media to complain about the outages. Tweets and blog posts were roundly critical of both companies for their delays in letting consumers know about the outages and for the lack of information provided. During any crisis, it is difficult to balance crisis response and public affairs. One way to mitigate this problem and keep customers informed during an outage is to incorporate communications planning into incident response and business continuity activities. In the world of online retail, reputation and image matter; smaller companies in particular can be crippled by the negative impact of outages such as these. Two major issues that can impact reputation are service outages and poor communication with customers. Effective strategic crisis communication, however, can help to ensure a short-term outage doesn’t have a long-lasting impact on your business.

Effective Resilience

Multinational organizations like the ones mentioned in this article no doubt have Business Continuity / Disaster Recovery (BC/DR) plans in place, but events such as these show that poor coordination between response, recovery, and communication can make a bad situation worse. At their heart, organizational resilience and business continuity planning are focused on ensuring the continuance of the business’s essential services and functions, while strategic communications can help ensure a short-term outage doesn’t have a long-term impact on the business. Effective resilience or continuity planning must include all of these aspects – cyber, physical, communications, and customer relations. This type of comprehensive planning ensures that risks are pre-emptively identified and tracked, acceptable metrics are set for outages and service disruptions, and mitigations are developed.

How Resilient Is Your Organization?

Knowing the potential impact outages and service disruptions can have, what would your business do in a similar situation? Have you checked your generators since October 30th? If you are hosted in another organization’s datacenter – as eBay was alongside PayPal – what protections or assurances do you have in the event your host goes down? Who would you call? What message would you send, and how?

In our increasingly interconnected society, we are seeing more and more examples of the need to be prepared. Some downtime and outages may be unavoidable; the key is to put plans in place, practice them, and make sure staff are sufficiently trained in their roles to manage the incident and communicate about it quickly and effectively.


[1] https://nrf.com/media/press-releases/national-retail-federation-forecasts-holiday-sales-increase-37

[2] https://nrf.com/media/press-releases/retailers-very-digital-holiday-season-according-nrf-survey

[3] http://www.nytimes.com/2015/10/16/technology/ultradns-server-problem-pulls-down-websites-including-netflix-for-90-minutes.html?_r=0

[4] http://fortune.com/2015/10/30/paypal-outage-sales/