This morning, Harvest experienced its worst outage in many years, and we are embarrassed. Our customers expect the best from Harvest and there is no excuse for failing in this way. Here’s what happened and how we are proceeding.
In summary, a sudden spike in traffic overwhelmed our load balancers, then our firewalls, and finally our clustering tools. The effects lasted for about 2 hours, and it took us some time to find the core problems and put an emergency resolution in place. Read on for a more technical description.
To give you some context: the application features and third-party integrations we have launched recently have changed the average Harvest customer’s traffic pattern. Request volume is now almost 300% greater than last week’s level, and it continues to grow. Although we are overprovisioned in servers and network bandwidth for precisely this eventuality, a few unfortunate artificial thresholds were suddenly exceeded this morning, causing cascading problems.
This morning around 10:16am EDT, our various alert systems started to let us know that Harvest was running slowly and that some requests were being greeted with an ugly error page.
The first threshold that was exceeded was the number of files that our Linux load balancers could keep open at any one time in order to serve and track customer requests. We fixed this rapidly. The second problem, which surfaced almost instantaneously, was that the huge traffic spike began to overwhelm our firewalls. The third problem appeared soon after, when the failing firewalls began to wreak havoc with our clustering tools.
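For those curious about that first threshold: on Linux, every connection a load balancer handles consumes a file descriptor, and each process has a limit on how many it may hold open at once. The sketch below is a minimal illustration (not our actual configuration) of how a process can inspect and raise its own descriptor limit using Python’s standard resource module; the 65536 value is purely hypothetical and would be chosen based on expected concurrent connections.

```python
import resource

# Inspect the current per-process open-file limits (soft and hard).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# A process may raise its soft limit up to the hard limit without extra
# privileges; raising the hard limit itself requires root, and system-wide
# defaults live in places like /etc/security/limits.conf.
desired = 65536  # hypothetical target for illustration only
new_soft = desired if hard == resource.RLIM_INFINITY else min(desired, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
```

When that soft limit is reached, new connections fail even though CPU, memory, and bandwidth are all fine, which is exactly the kind of artificial ceiling we described above.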
For a period of roughly 45 minutes, failing network clustering operations meant that customer requests were often not reaching Harvest servers at all. This proved difficult to troubleshoot, and we ultimately removed our clustering logic altogether to get Harvest back online as quickly as possible. We were also forced to put Harvest into maintenance mode briefly so that the giant backlog of requests would not overwhelm the system as it came back online.
Resolving these three problems and getting Harvest back online took around 2 hours in total.
The above is not an excuse for being down for two hours during the time of day when many customers use Harvest the most. We have stabilized the known issues and are taking extensive measures to ensure that these basic problems never occur again. We have more than sufficient capacity to handle sudden growth of orders of magnitude; this morning, the failure came down to poor server configuration.
Thank you for being patient while we brought Harvest back online. As a reminder, we maintain transparent system status updates at HarvestStatus.com to keep you informed during any issues.