From 02:38am on Tuesday, 28 January until the 29th, Pabau experienced a major problem with the appointments table. All times are in GMT.
First up, we’d like to apologise for the interruption. We take issues like these very seriously. We strongly believe in transparency and that’s why we’re sharing information about this incident, how we managed it and what we are doing to prevent similar events in the future.
What went wrong?
Shortly after 02:38am on 28th January we were alerted to a problem with our appointment calendar. Our tech team addressed the issue in line with our standard process, but in this instance the problem could not be resolved instantly.
The issue was caused by an oversight in code review of a release from October last year, which eventually led to a query corrupting the calendar table on our production server. It had not surfaced sooner because the fault was in the Month View reschedule feature and was only triggered by typing specific wording into the NOTES field.
(for the techies)
At 02:43am we woke key personnel in our incident response team after being alerted by our DevOps team.
The decision was made at 02:46am to place Pabau into a partial maintenance mode whilst our team restored the appointments table.
- Some of our clients were asking why we were unable to restore a backup. Currently, as per our backup policy, Pabau backs up almost a terabyte of data nightly.
- Unfortunately, the issue was spotted at 02:43am, by which time the nightly backup had already been made. Because this was a data corruption issue (not data loss), the bookings table had been backed up with the corrupted appointment data in it, meaning the most recent clean backup we could go back to was the 27th.
- As an extra precaution, Pabau writes all appointments to a log file instantly, meaning that in a disaster-type event such as the one we have faced, we can restore appointments made in the period not yet covered by a backup.
- Whilst restoring the database took only around 1 hour, the transfer of the appointments that were created / amended on the 29th took a considerable amount of time. This led to us having to migrate them to a separate, heavily optimised table, allowing us to then port them back to the bookings table.
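For the techies, the restore-then-replay approach described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not our actual tooling: the JSON-lines log format, the `bookings` column names, the `restore_from_log` function and the sample data are all hypothetical, and SQLite stands in for the production database.

```python
import json
import sqlite3
from datetime import datetime, timezone


def restore_from_log(conn, log_lines, backup_time):
    """Replay logged appointment changes made after backup_time.

    Assumes (hypothetically) one JSON object per log line with the keys
    logged_at, id, client, starts_at and notes.
    Returns the number of log entries applied.
    """
    entries = [json.loads(line) for line in log_lines if line.strip()]
    # Apply oldest first so a later amendment overwrites an earlier one.
    entries.sort(key=lambda e: datetime.fromisoformat(e["logged_at"]))

    applied = 0
    for entry in entries:
        if datetime.fromisoformat(entry["logged_at"]) <= backup_time:
            continue  # this change is already contained in the backup
        # Parameterised statement: values are never spliced into the SQL text.
        conn.execute(
            "INSERT OR REPLACE INTO bookings (id, client, starts_at, notes) "
            "VALUES (?, ?, ?, ?)",
            (entry["id"], entry["client"], entry["starts_at"], entry["notes"]),
        )
        applied += 1
    conn.commit()
    return applied


if __name__ == "__main__":
    # Stand-in for the restored backup: an in-memory table with one booking.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE bookings "
        "(id INTEGER PRIMARY KEY, client TEXT, starts_at TEXT, notes TEXT)"
    )
    db.execute("INSERT INTO bookings VALUES (1, 'Client A', '2020-01-28T09:00', '')")

    # Made-up log entry: an amendment logged after the backup was taken.
    sample_log = [
        json.dumps({
            "logged_at": "2020-01-28T01:30:00+00:00",
            "id": 1,
            "client": "Client A",
            "starts_at": "2020-01-28T10:00",
            "notes": "moved",
        })
    ]
    clean_backup_time = datetime(2020, 1, 27, 23, 0, tzinfo=timezone.utc)
    print(restore_from_log(db, sample_log, clean_backup_time))
```

Replaying into a small, separate table first and then copying the rows across, as we did, follows the same idea; the sketch writes straight to `bookings` only to keep it short.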
Where do we go from here / some points learnt
- In January, I was appointed as Head of IT & Infrastructure, and have been working on a number of changes in the background of Pabau (mainly around data logging). Everything my team and I will be working on is in the interest of safety & stability.
- We found the newsletter to be effective and the status page to work well in comparison to historical events.
- We have added a 2nd layer of code review (this was a very technical issue, and a 2nd pair of eyes will reduce the chance of it happening again).
- Investigate the possibility of splitting clients among multiple servers to reduce the impact and load on support (whilst we understand it is frustrating to not be able to call our support lines in incidents like this, we received over 130 calls per 5 minutes, and we found the most effective route was via Facebook, Email & the status page). Reducing the number of clients impacted will allow for more
Again, we would like to sincerely apologise for the interruption caused. We always use problems like this as a chance for us to improve and learn, in order to reduce the likelihood of these incidents occurring again.
Our priority now is to complete work on our technology infrastructure that’s currently underway, which will allow us to permanently auto-scale for our growing demand.
We’re sorry this has happened. Everything we do is focused on making life easier for businesses who use our service, so it’s disappointing to have not met those standards today. Our team are continually working on improving our platform and putting preventative measures in place to avoid this in the future.