Yesterday Harvest and Forecast experienced an extended service outage. We are very sorry for this interruption and would like to explain the situation surrounding the incident. Our systems were unavailable to customers from 10:50am EDT until 12:38pm EDT. Just to be clear, at no time was the safety of customer data at risk. There are multiple layers of backups in place which keep customer data safe in the event of system issues.

The timeline of events is as follows:

At 10:36am EDT (14:36 UTC) our monitoring systems alerted us that the master database server for Harvest had crashed due to a segmentation fault in the database server software. At this time the database monitoring system moved traffic over to the hot-standby database. This type of failure happens, and generally does not lead to any service issues — our systems are designed to tolerate this level of failure. Some customers might have momentarily seen errors on Harvest and Forecast as traffic moved over.
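For readers curious about the mechanism, automated failover of this kind can be sketched as a simple health check that promotes the hot standby when the primary stops responding. The classes and names below are purely illustrative — this is a generic, hypothetical sketch, not Harvest's actual infrastructure:

```python
# Hypothetical sketch of primary/hot-standby failover (illustrative only).

class DatabaseServer:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def ping(self):
        # Stand-in for a real health check (e.g. a TCP or SQL probe).
        return self.healthy


class FailoverMonitor:
    """Routes traffic to the primary while it is healthy,
    and promotes the hot standby when it is not."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby
        self.active = primary

    def check(self):
        # If the active server stops responding, fail over to the standby.
        if not self.active.ping():
            if self.active is self.primary and self.standby.ping():
                self.active = self.standby
        return self.active


primary = DatabaseServer("db-primary")
standby = DatabaseServer("db-standby")
monitor = FailoverMonitor(primary, standby)

primary.healthy = False           # simulate the primary crashing
print(monitor.check().name)       # traffic now flows to the standby
```

The weakness this outage exposed is visible even in a toy model like this: if the standby then crashes for the same reason, there is nothing left to promote, which is why both servers failing at once took the service offline.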

At 10:50am EDT (14:50 UTC) the hot-standby database server failed due to the same segmentation fault that had crashed the primary database server earlier. At this point Harvest and Forecast were offline for all customers. We attempted to bring the system back online by restarting the hot-standby database server, and Harvest and Forecast were temporarily back online within a few minutes.

At 10:59am EDT (14:59 UTC) the segmentation fault reoccurred and the hot-standby database server failed once again, taking Harvest and Forecast offline.

At 11:15am EDT (15:15 UTC) we placed Harvest and Forecast in maintenance mode to allow the team time to research what was causing these software crashes, and to bring the systems back online in a controlled manner.

The Harvest team worked to keep our customers up to date on the situation while the engineering team resolved the issue. Because both the primary and the hot-standby database servers had crashed, a second hot-standby database server was brought into production in order to rule out any data corruption arising from the software crashes. Unfortunately, reconfiguring our systems to use a new set of database hardware took longer than we would have liked. We are working to make this a much quicker process in the future.

At 12:38pm EDT (16:38 UTC) we removed maintenance mode from Harvest and Forecast and made them available once again for all customers, while the team monitored every aspect of the system health.

We are truly sorry for the extended interruption this caused for all of our customers. The outage, and the forced maintenance mode that followed, lasted far longer than any interruption in service we are used to.

At this time we believe the cause of these crashes to be an interaction between code deployed to production around the time of the first crash of the primary database server and a software bug in our database server. The code in question has been removed from our product, and we will shortly be performing an upgrade of our database software.

One of the reasons the investigation took so long is that there is no precedent for this kind of software crash on both the primary and hot-standby database servers in the history of Harvest. It took some time to simply understand what could possibly be going on.

A few key takeaways from this experience for us:

  • Our maintenance page misled some customers into thinking we were down for scheduled maintenance. We’ve created an outage page to better communicate these types of situations to customers in the future. We apologize for this miscommunication during a service outage.
  • We will be working on a quicker solution to bring additional hot-standby database hardware into production in the future.
  • We will be upgrading our database software very soon, since we believe we encountered a bug in this software yesterday which led to the crashes that caused this outage.

Thank you to all our customers for your support while we resolved this outage.