On 28th August 2019 at 8:24am, Pabau experienced major performance issues.
Firstly, we’d like to deeply apologise for the interruption over the past few days.
We understand that you, as one of our clients, rely on our service for your business and livelihood. We take issues like these very seriously. We strongly believe in transparency and that’s why we’re sharing information about this incident, how we managed it and what we are doing to prevent similar events in the future.
Before we get into it, we’d like to explain some of the terms we will be using.
Pabau stores frequently accessed data in caches. So what is a cache? A cache keeps a copy of data close at hand so that requests can be served faster, which saves us the time of querying the storage on the server on every request. Think of it as remembering a phone number, rather than looking it up in a phonebook every time you need it.
System CPU spikes
CPUs (central processing units) are the brains of a computer. They perform all the tasks required to operate the apps on our smartphones and Pabau in the cloud. These tasks can be roughly divided into two groups, system and user:
- User tasks are the things that we choose to run: browsers, apps, databases, games etc.
- System tasks are all the things that are required to run and manage the system and its resources (memory, running more tasks, saving and retrieving things from storage).
So when the system CPU spikes, the CPU is processing intensive tasks that are required to run the system.
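For the curious, the split between user and system CPU time can be observed on any computer. The sketch below uses Python's standard `os.times()` call to measure how much of each a piece of work consumes. The helper names (`cpu_split`, `user_heavy`, `system_heavy`) are invented for this example, and the exact numbers will vary by machine and operating system.

```python
import os


def cpu_split(work):
    """Run work() and return (user_seconds, system_seconds) it consumed."""
    before = os.times()
    work()
    after = os.times()
    return after.user - before.user, after.system - before.system


def user_heavy():
    # Pure computation: the CPU spends its time in our own code (user).
    sum(i * i for i in range(200_000))


def system_heavy():
    # Many small requests to the operating system (system).
    for _ in range(20_000):
        os.stat(".")


user_time, system_time = cpu_split(user_heavy)
print(f"user_heavy:   user={user_time:.3f}s system={system_time:.3f}s")

user_time, system_time = cpu_split(system_heavy)
print(f"system_heavy: user={user_time:.3f}s system={system_time:.3f}s")
```

On most machines the second task reports a noticeably larger share of system time than the first, which is the kind of load described above.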
What went wrong?
On Monday, August 26th between 08:25 and 18:00, we noticed some slower-than-normal response times on Pabau. This was a result of intermittent system CPU spikes on the primary database server, which delayed some requests. We reduced the system CPU load on the database server and raised a support query with our service provider for more information. In addition, our DevOps team began investigating the issue, whilst keeping clients up to date via our status page.
On Tuesday, August 27th at 09:00AM, we noticed similar issues to those experienced on the previous day and closely monitored the primary database server’s system CPU usage and Pabau’s response times. The frequency of system CPU spikes was increasing throughout the day, so we escalated the support query with our service provider. We reevaluated the findings and speculated that the intermittent system CPU spikes seen on the database server could be narrowed down to specific modules such as Connect & the Financial module. We continued to work through the night to optimize these modules, and although speed increased slightly, we did not take the status page out of ‘Service Disruption’.
On Wednesday, August 28th, the latency followed a similar pattern. With our suspicions pointing to our service provider’s physical hardware, we made the decision to take the servers offline at 9:30pm GMT to perform scheduled maintenance.
The upgrade was a success, and performance returned to normal.
We investigated various solutions, and the final fix was to move the physical host underlying our database. With help from our service provider we were able to perform this maintenance successfully and transfer thousands of users in 60 countries to new infrastructure. Monitoring yesterday and today has confirmed the improvement.
– [Day 1 AM] We added 2 extra people to support to deal with the extreme volume of requests coming in and brought 1 person back off holiday to help with the phone lines.
– [Day 1 PM] We took a backup on Tuesday and upgraded our 2 backup servers and 3 web servers; this had no noticeable impact.
– [Day 2 PM] Immediately after failing to resolve the issue on Day 1 with our in-house team, we brought in an external team to help with the issue.
– [Day 2 AM] The sheer volume of requests meant that the only effective way to get back to every client was to disable the phone lines, play a pre-recorded message saying that we were aware of the problem, and switch to email responses. Many people have criticized this. To put things into perspective, below is a graph of the last 10 days: over the 3 days we received almost 100x the number of requests we would receive on an average day, across all channels (mainly voice). Disabling the phone lines allowed us to improve our response times and deliver updates to a wider audience across other channels such as the status page and email/web.
– [Day 3 AM] After no luck with the external consultants, we brought in another team, bringing the total number of DevOps professionals to 6.
– [Day 3 PM] We had posted a total of 13 updates to the Pabau status page www.pabau.com/status
– [Day 3 PM] Support worked through the night to clear all tickets and get back to clients who had requested a callback.
And for the most important change: we upgraded ALL clients, across our entire infrastructure, to 64GB of memory and 32 CPU cores. In non-techy terms this is fast, very fast. Although it may not be noticeable in day-to-day activities, Pabau is actually running 5x faster, which is most noticeable when performing tasks such as running a report with a large amount of data across a long period of time.
Where do we go from here?
- Understand the impact of changes before executing them.
- Investigate other potential workarounds for intermittent system CPU spikes.
- We are set to further expand our DevOps team to assist in faster resolution.
- Migrate our newsletter system away from our main servers to ensure that we can also keep people notified in bulk via newsletters.
- We are BETA testing select customers on dedicated hardware (sharing the load).
- We will be forming a larger incident response team to handle major outages such as the one we have experienced.
Again, we would like to apologize for the interruption caused. We always use problems like this as a chance to improve and learn, in order to reduce the chance of these incidents occurring again.