On Wednesday May 31st, 2023, at 12:55 GMT we deployed a minor version to one of our services. Shortly after at 12:56 GMT, Frontegg’s US monitoring system started sending alerts for an authentication service which was not performing as expected, and the team immediately began investigating the issue.
At 13:01 GMT we started getting alerts from Frontegg’s EU monitoring as well regarding the same service, shortly after, we started to get complaints from customers.
At 13:04 GMT, 8 min after we started getting the alerts the team concluded that it was sourced by a recent change that was deployed. As part of the change, there was a database migration for one of our primary services. However, the migration job didn't run due to an edge race condition in our CD infrastructure, causing the service to remain in a schema mismatch state.
At this point we immediately started a rollback process for both EU & US regions that was completed by 13:16 GMT. Once the rollback completed, we noticed that our services are working as expected again and customers also reported that they were no longer experiencing issues.
Affect:
Most requests to customers’ custom Frontegg domains resulted in 401/404 responses or inability to authenticate.
For the EU region - between 12:59 to 13:16 GMT time.For the US region - between 12:56 to 13:14 GMT time
Following the monitoring alerts the incident response team immediately identified the potential corrupted service and started rollback procedure with the previous successful deployment.