Posts by Warwick Poole:

Harvest and Forecast Outage on July 26th

Yesterday Harvest and Forecast experienced an extended service outage. We are very sorry for this interruption and would like to explain the situation surrounding the incident. Our systems were unavailable to customers from 10:50am EDT until 12:38pm EDT. Just to be clear, at no time was the safety of customer data at risk. There are multiple layers of backups in place which keep customer data safe in the event of system issues.

The timeline of events is as follows:

At 10:36am EDT (14:36 UTC) our monitoring systems alerted us about our master database server for Harvest having crashed due to a segmentation fault of the database server software. At this time the database monitoring system moved traffic over to the hot-standby database. This type of failure happens, and generally does not lead to any service issues. Our systems are designed to tolerate this level of failure. Some customers might have seen errors on Harvest and Forecast momentarily as traffic moved over at this time.

At 10:50am EDT (14:50 UTC) the hot-standby database server failed with the same segmentation fault that had crashed the primary database server earlier. At this point Harvest and Forecast were offline for all customers. We attempted to bring the system back online by restarting the hot-standby database server, and Harvest and Forecast were temporarily back online within a few minutes.

At 10:59am EDT (14:59 UTC) the segmentation fault reoccurred and the hot-standby database server failed once again, taking Harvest and Forecast offline.

At 11:15am EDT (15:15 UTC) we placed Harvest and Forecast in maintenance mode to allow the team time to research what was causing these software crashes, and to bring the systems back online in a controlled manner.

The Harvest team worked to keep our customers up to date with the current situation as the engineering team resolved the issue. Because both the primary and the hot-standby database servers had crashed, a second hot-standby database server was brought into production in order to rule out any data corruption arising from the software crashes. Unfortunately it took longer than we would have liked to reconfigure our systems to use a new set of database hardware. We are working to make this a quicker process in the future.

At 12:38pm EDT (16:38 UTC) we removed maintenance mode from Harvest and Forecast and made them available once again for all customers, while the team monitored every aspect of the system health.
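The failover behavior in this timeline — traffic following the primary while it is healthy, falling back to a standby, and entering maintenance mode when none remain — can be sketched in a few lines of Ruby. The class and method names are illustrative, not Harvest's actual tooling:

```ruby
# Minimal sketch of database failover routing. A monitor tracks one
# primary and a list of hot standbys; traffic goes to the primary
# while it is healthy, otherwise to the first healthy standby.
Server = Struct.new(:name, :healthy) do
  def healthy?
    healthy
  end
end

class FailoverMonitor
  def initialize(primary, standbys)
    @primary  = primary
    @standbys = standbys
  end

  # The server that should receive traffic, or nil when no healthy
  # database remains and the application must enter maintenance mode.
  def active_server
    return @primary if @primary.healthy?
    @standbys.find(&:healthy?)
  end
end
```

The outage above corresponds to the case where `active_server` returns nil: both the primary and the standby had hit the same crash.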

We are truly sorry for the extended interruption this caused for all of our customers. The duration of the outage, and the subsequent forced maintenance mode, was certainly a much longer interruption in service than we are used to.

At this time we believe the cause of these crashes to be an interaction between code deployed to production around the time of the first crash of the primary database server, and a software bug in our database server. The code in question has been removed from our product, and we will shortly be performing an upgrade in our database software.

One of the reasons the investigation took so long is that there is no precedent for this kind of software crash on both the primary and hot-standby database servers in the history of Harvest. It took some time to simply understand what could possibly be going on.

A few key takeaways from this experience for us:

  • Our maintenance page misled some customers into thinking we were down for scheduled maintenance. We’ve created an outage page to better communicate these types of situations to customers in the future. We apologize for this miscommunication during a service outage.
  • We will be working on a quicker solution to bring additional hot-standby database hardware into production in the future.
  • We will be upgrading our database software very soon, since we believe we encountered a bug in this software yesterday which led to the crashes that caused this outage.

Thank you to all our customers for your support while we resolved this outage.

A New HarvestStatus.com

Nobody is quite sure how the internet really works and things sometimes go wrong. If Harvest experiences any systems issues, we try to keep our customers apprised of the situation as things develop, using a combination of Twitter and HarvestStatus.com.

Today we deployed an improved version of HarvestStatus.com. We hope this new tool will be very useful.

Customers can now subscribe to updates that we post on HarvestStatus.com. Just click the ‘Subscribe’ button on the new site, then add your email address or phone number. The system will then notify you of any new issues so you can stay informed. There’s also an RSS feed to make things even easier.

This new system allows folks on various Harvest teams to communicate with you during any service-impacting events, allowing those of us fixing any issues to focus on the task at hand.

For any customers who were relying on the now-deprecated status API on the previous version of HarvestStatus.com, please let me know in the comments below and I’ll be glad to work with you on a new solution.

I will now head back down to the Harvest engine room where we spend our time working to keep Harvest running smoothly!

Site Availability Issues on September 23rd and September 24th

Over the past two days, Harvest has had two very short outages. On both Monday September 23rd at 5:30am EDT and Tuesday September 24th at 7:40am EDT, Harvest was unresponsive for around 3 minutes. Both outages were caused by the same problem, and we are working to resolve it as fast as possible. At 10pm EDT on Tuesday September 24th (what time is that for you?), we’ll be performing a brief database maintenance to resolve the issue. We don’t expect any service impact from this maintenance. Let me get into some of the technical issues behind these outages.

Over time our main database has grown steadily in size, and at a certain point it becomes larger than the memory allocated to the database software on the servers running it. There is an ancient art to getting this memory allocation just right, and we’ve found that allocating too much memory to the database can actually hurt performance. Gradually increasing the allocated memory as the database grows works well. Recently, though, our databases have grown large enough that restarting a database server to increase its memory allocation, and putting that server directly back into action, has become more involved. A database server with cold caches no longer performs well when put straight back into production. We need to warm the server’s cache gradually before the server can become a master server in our database cluster.
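Cache warming can be as simple as reading through the hottest tables in batches before the server takes production traffic, so the buffer pool fills with the data customers actually touch. A hypothetical sketch — the table names, client object, and batch size are all illustrative, not Harvest's actual procedure:

```ruby
# Hypothetical cache-warming pass: scan the hottest tables in batches
# so the database's in-memory caches fill before the server is
# promoted. The pacing sleep keeps the reads from competing with
# replication or other traffic.
HOT_TABLES = %w[time_entries projects users].freeze

def warm_cache(db, tables: HOT_TABLES, batch: 10_000)
  tables.each do |table|
    offset = 0
    loop do
      rows = db.query("SELECT * FROM #{table} LIMIT #{batch} OFFSET #{offset}")
      break if rows.empty?
      offset += batch
      sleep 0.1 # pace the reads so normal traffic and replication keep up
    end
  end
end
```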

So we are left with a slightly more challenging situation than we had previously, and have had to adapt. Increasing the database memory allocation now requires a new procedure, carried out in a staggered fashion, and the recent availability issues have been the result.

I apologize for the two issues with Harvest yesterday and this morning. We are taking the final steps to resolve this issue in a brief database maintenance tonight, Tuesday September 24th at 10pm EDT. We don’t expect the Harvest service to be impacted by this maintenance. Thanks for your patience, folks!

Harvest Is Moving to a New Data Center on Sunday, April 28th (Completed Successfully)

UPDATE: This migration has been successful. Harvest is back online in our new data center. If you are seeing any issues in your account, or are having trouble accessing Harvest, please email support@harvestapp.com with details. Thank you all for your support as we made this large migration.

Summary: 3 hour maintenance window on Sunday April 28th, 9am – 12pm EDT. (Your local time)

In September of 2011, we moved Harvest to a new data center. That turned out to be a great move for our customers, solving a few reliability issues. Since 2011, Harvest has grown. A lot. We now have more resources online to serve your data, and every week brings record traffic volume. We’ve also been severely impacted by natural disasters and other challenges. We’ve been looking for a data center we can really stretch out in, a facility with an impeccable track record, and a vendor with an excellent reputation. We believe we have found all three with ServerCentral.

We have deployed a new set of servers with ServerCentral in the Chicago area and are getting ready to turn them on. The facility we have deployed our servers in is one of the highest quality data centers out there. The engineering behind the power and infrastructure systems in the building is some of the best in the industry. Besides the facility itself, a lot of work has gone into making this new server cluster more tolerant of everyday failures. Building this base cluster we have high trust in is only the first step in our global availability plan.

In order to perform the final sync of customer data to the new facility in the safest possible way, we need to take Harvest offline for up to 3 hours. This is going to take place Sunday, April 28th, between 9am – 12pm EDT. What time is that for you?

During this window, you’ll see a maintenance notice if you access your Harvest account. We will work as fast as humanly possible to get rid of that maintenance notice and get you back to your important data. During the work we’ll keep you updated via Twitter and on HarvestStatus.com.

Thanks for your support.

Scheduled Maintenance, Sunday March 3rd, 11am – 4pm EST (Completed)

UPDATE: This software update was successfully deployed with less than a minute or two of service interruption. Thanks for your patience as we rolled out this significant upgrade.

Original Post:

We deploy new software to production multiple times in the average work day, but some software releases contain so much new code that we need to be a little extra careful when we deploy them.  Over the past few weeks the Harvest team has been upgrading much of the Harvest code base and the time has come to deploy this to production. This upgrade will allow us to make better software by leveraging new features of our software libraries and will make future software upgrades easier.

Harvest will be in scheduled maintenance mode on Sunday March 3rd between 11am – 4pm EST. What time is that for you?  We are not planning to take Harvest offline during this maintenance window, but there could be temporary performance or availability issues during this window as we roll out this large software upgrade.

As always, we appreciate your support! We will update the progress of this upgrade via @harvest and HarvestStatus.com

Harvest Availability Issues October 4th

This morning was the worst outage Harvest has experienced in many years and we are embarrassed. Our customers expect the best from Harvest and there is no excuse for failing in this way. Here’s what happened and how we are proceeding.

The summary of the issue is that sudden high traffic volume started to overwhelm our load balancers, then our firewalls, and then our clustering tools. The effects lasted for 2 hours. It took us some time to find the core problems and put an emergency resolution in place. Read on for a more technical description.

Continue reading…

Details On Unexpected Outage on March 5th

This morning around 8:50am EST Harvest began to perform slowly and was unavailable for short periods of time. We averted the immediate issue by doubling the number of application processes available to serve customer requests while we examined the underlying issue.

The core issue behind this morning’s incident is the tremendous adoption rate of the newly released Harvest for Mac application. This application has a new server resource profile, and Harvest has had to scale rapidly to accommodate it. To make sure Harvest for Mac is always using fresh data, the application polls Harvest servers frequently for current timesheet data. Adoption of Harvest for Mac has been much faster than we scaled our resources to accommodate, and its immediate popularity has almost doubled Harvest traffic levels within a few days.

We are making extensive changes to Harvest to allow for this new growth. Firstly, we are adding more servers and upgrading certain servers to increase their capacity. Additionally we are reworking the caching system which handles the Harvest for Mac data refresh process to make it more efficient. An update to Harvest for Mac will be released later today and we encourage all customers to install the update when prompted to do so.
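The post doesn't detail the caching rework, but the general technique for absorbing frequent polling is a short-lived cache: serve a cached payload until it expires, regenerating it at most once per time-to-live no matter how many clients ask. A minimal, hypothetical sketch of the idea (not Harvest's actual implementation):

```ruby
# Illustrative TTL cache: the first fetch for a key runs the block and
# stores the result; subsequent fetches within the TTL return the
# stored value without recomputing, so N polling clients cost one
# regeneration per TTL window instead of N.
class TTLCache
  def initialize(ttl_seconds)
    @ttl   = ttl_seconds
    @store = {}
  end

  def fetch(key)
    entry = @store[key]
    return entry[:value] if entry && Time.now - entry[:at] < @ttl
    value = yield
    @store[key] = { value: value, at: Time.now }
    value
  end
end
```

With a 60-second TTL, even a client refreshing every few seconds only triggers one expensive timesheet query per minute per key.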

We apologize if you were affected by the outage this morning. Thanks for bearing with us as we increase our capacity and make our applications more efficient, and thank you all for making Harvest for Mac so popular so quickly.

System Maintenance February 29th, planned at 7pm EST

Update: This maintenance has been completed. Thank you for your patience as we upgraded the servers affected by this critical performance bug in the Linux kernel. We understand the poor timing of this issue; this maintenance was not typical in any way, but our hand was forced by the severity of the bug.

Recently Harvest infrastructure has been affected by a Linux kernel bug which causes poor performance on servers that are otherwise operating optimally. Upgrades have already rolled out to all servers that could be updated without any impact to application traffic. The bug is now starting to affect enough of the remaining servers to introduce risk to application performance, such as our database servers and the clustering servers that keep the database cluster online.

Because of the number of servers affected by this bug, we are forced to take the unusual step of upgrading our entire database cluster in one go to return to acceptable performance. This unfortunately means we need to take Harvest offline for up to 30 minutes while the servers are upgraded.

During the day today, we’ll be keeping a close eye on server performance and plan to perform this upgrade tonight, February 29th at 7pm EST. Harvest will be offline for maintenance for 30 minutes or less at this time. What time is that for you?

Due to the nature of this issue, we may be forced to perform the upgrade sooner than 7pm EST today in order to preempt unacceptable performance levels. This issue is not affecting the integrity of our customers’ data in any way; rather, it is producing unacceptable performance on the affected servers.

We apologize in advance for this outage and assure you that Harvest performance is our top priority. Thank you for your patience while we upgrade our infrastructure.

We will keep you informed of the progress of the maintenance on HarvestStatus.com and via @harvest on Twitter.

How Harvest Is Made, Part Two

Last week I wrote about how developers at Harvest deploy code and own the responsibility of keeping our software quality high. Today I’ll touch on the tools and process we currently use to collaborate, stay in touch with customers and glean feedback from our infrastructure.

Developer collaboration

Harvest developers are seldom in the same building, let alone the same state or country. We work as a distributed team, yet we collaborate extensively. All of our code is hosted with GitHub, which makes this collaboration simple. For those familiar with Git:

  • Developers work in feature branches off the master branch, and master is always assumed to be deployable by anybody at any time.
  • Developers use GitHub Pull Requests all the time, and significant deployments are peer reviewed in this way prior to deployment.
  • A continuous integration server constantly tests our code and reports concerns to the team.
  • Development takes place locally, but we have multiple production-similar staging environments for testing and QA.

Infrastructure collaboration

We strive to have no ‘walls’ over which features or releases are thrown between team members. We share the responsibility of creating and supporting our software. As the ‘systems guy’ at Harvest, it’s important to me that every developer has the ability to manage systems configuration. It’s also important that if problems arise, the team who responds to these problems is not a siloed operations team, but includes the developers who wrote the code which is running in production.

To this end, we use Chef to transform our systems configuration into a collaborative effort. Every component of our infrastructure is controlled by Chef. This means that technical team members can view and modify production configuration and roll out systems changes. The beauty of Chef is that everything is protected by Git version control and enhanced by the power of Ruby.
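Chef cookbooks are written in plain Ruby. As a flavor of what "every component controlled by Chef" looks like in practice, here is a minimal, hypothetical recipe — the resource names, paths, and template are illustrative, not Harvest's actual configuration:

```ruby
# Hypothetical Chef recipe: declare the desired state of a web tier
# node, and Chef converges the server to match on every run.
package "nginx"

template "/etc/nginx/sites-enabled/harvest.conf" do
  source "harvest.conf.erb"
  owner  "root"
  mode   "0644"
  notifies :reload, "service[nginx]"
end

service "nginx" do
  action [:enable, :start]
end
```

Because recipes like this live in Git, a configuration change is reviewed and rolled out the same way application code is.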
Continue reading…

How Harvest Is Made

You may not realize it, but almost every day there are improvements being made to Harvest while our customers are using it. Transparency is a core value here at Harvest, and I’d like to take you through a little of how we work behind the scenes, in a series of slightly technical posts.

The new Harvest Status page

We’ve just released the beta version of a tool we will be using to promote transparency between Harvest operations and our customers: the new Harvest Status Page. Bookmark this tool to keep track of how Harvest is performing at any time.

Balancing priorities

I’ll briefly walk you through the software release process we follow, and in a subsequent post I’ll talk in more detail about the tools and methods we use. If you are familiar with DevOps and the concept of continuous deployment you’ll recognize these in our workflow.

Context determines your opinion on software deployment. Our customers naturally prioritize software stability and the addition of new features as quickly as possible. Customer acquisition, avoiding outages, using cool new technology, and striving for elegant robust code are a few other priorities held by my Harvest coworkers. A natural tension can exist between these priorities. How does Harvest balance this and retain our core focus on a good customer experience?

The simplest answer is: We take small steps quickly through collaboration.

Release cycle and deployments

What may be of most interest to customers is how we deploy new code to Harvest. Harvest changes almost every day, usually multiple times per day. In the time it took me to write this blog post, two different developers deployed five production releases of Harvest. Some might be concerned that a process like this promotes poor quality software. In reality, like many other companies, we have found that this iterative, constant change promotes high quality software, exposes and resolves unexpected issues quickly and allows a distributed team to work on different features concurrently. This means, in a nutshell, that when developers deem code ready to go to production, it goes to production. No artificial release schedule governs Harvest software rollout. There is also no manager whose job it is to ensure our software quality because that is the common responsibility of every person committing code at Harvest.

100% bug-free software is an unrealistic goal, but we strive for a bare minimum of issues by having structure in place to address problems quickly and efficiently:

  • All significant code changes are peer reviewed before deployment. In the next post, I’ll talk about how we do this.
  • Every developer, designer and sysadmin at Harvest is able to (and does) deploy production code.
  • Mondays tend to be the busiest traffic day of the week at Harvest, so we rarely release big new features on Mondays. Same goes for late on Fridays, when bugs could linger over a weekend.
  • We have an internal QA process and production-similar staging environments, where we perform extensive testing when required.
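As an illustration (not an actual Harvest tool), the habits above could be encoded as a small deploy gate in Ruby — block a release when CI is red, and flag the risky windows the team avoids for big features:

```ruby
# Hypothetical deploy gate reflecting the practices listed above:
# never deploy on red CI, and avoid big releases on Mondays (the
# busiest traffic day) or late on Fridays (bugs could linger over
# a weekend). The cutoff hour is an assumed value.
def deploy_allowed?(ci_green:, now: Time.now)
  return false unless ci_green
  return false if now.monday?                   # busiest traffic day
  return false if now.friday? && now.hour >= 16 # too close to the weekend
  true
end
```

In reality these are judgment calls rather than hard rules, but encoding them keeps the defaults honest.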

Some deployments warrant special care, such as releases involving database migrations that change large datasets. Certain database operations could produce a poor customer experience while deployments roll out. We have deployed these releases at times of lowest customer impact in the past and will continue to do so, although Harvest’s global customer base shrinks this window constantly. We also have a maintenance mode we can use to take Harvest offline briefly if we need to.

If you have seen Harvest in maintenance mode and we didn’t notify you, our customer, prior to this deployment, we made a mistake and you can be sure that the team is working on the problem with urgency. It happens, but we think Harvest’s uptime speaks to how infrequently this occurs.

Obviously, when it comes to software that has a third-party review process, or runs on customer desktops, such as our iPhone app and the upcoming Mac app, our process for rolling out change is a little different from that of the core Harvest software running on our own servers.

If this post was too technical (or not technical enough), the one thing I hope you will take away from this is: Harvest software changes all the time in small increments. This concept of continuous deployment isn’t new or revolutionary and it may not work well for every company, but it allows us to strike a balance between stability and agility and keep forward momentum as we build a fairly complex suite of software.

Next week I’ll touch on the tools we use to review code, communicate as a team and keep on top of our infrastructure performance. If there is something you’d like me to specifically discuss, let me know in the comments or directly at warwick@getharvest.com.