On Wednesday May 31st, 2023, at 12:55 GMT we deployed a minor version to one of our services. Shortly after at 12:56 GMT, Frontegg’s US monitoring system started sending alerts for an authentication service which was not performing as expected, and the team immediately began investigating the issue.
At 13:01 GMT we started getting alerts from Frontegg’s EU monitoring as well regarding the same service, shortly after, we started to get complaints from customers.
At 13:04 GMT, 8 min after we started getting the alerts the team concluded that it was sourced by a recent change that was deployed. As part of the change, there was a database migration for one of our primary services. However, the migration job didn't run due to an edge race condition in our CD infrastructure, causing the service to remain in a schema mismatch state.
At this point we immediately started a rollback process for both EU & US regions that was completed by 13:16 GMT. Once the rollback completed, we noticed that our services are working as expected again and customers also reported that they were no longer experiencing issues.
Most requests to customers’ custom Frontegg domains resulted in 401/404 responses or inability to authenticate.
For the EU region - between 12:59 to 13:16 GMT time.For the US region - between 12:56 to 13:14 GMT time
Following the monitoring alerts the incident response team immediately identified the potential corrupted service and started rollback procedure with the previous successful deployment.