Frontegg Services are showing Degraded Performance in EU & US

Incident Report for Frontegg

Postmortem

Executive summary:

On Wednesday, May 31st, 2023, at 12:55 GMT, we deployed a minor version of one of our services. Shortly after, at 12:56 GMT, Frontegg’s US monitoring system started sending alerts for an authentication service that was not performing as expected, and the team immediately began investigating the issue.

At 13:01 GMT we started getting alerts from Frontegg’s EU monitoring system for the same service, and shortly afterwards we started receiving complaints from customers.

At 13:04 GMT, 8 minutes after the first alerts, the team concluded that the issue was caused by a recently deployed change. The change included a database migration for one of our primary services. However, the migration job did not run because of a race condition edge case in our CD infrastructure, leaving the service in a schema mismatch state.

At this point we immediately started a rollback for both the EU and US regions, which was completed by 13:16 GMT. Once the rollback completed, we confirmed that our services were working as expected again, and customers also reported that they were no longer experiencing issues.

Impact:

Most requests to customers’ custom Frontegg domains resulted in 401/404 responses or an inability to authenticate.

For the EU region: between 12:59 and 13:16 GMT.
For the US region: between 12:56 and 13:14 GMT.

Mitigation and resolution:

Following the monitoring alerts, the incident response team immediately identified the suspected faulty service and started a rollback to the previous successful deployment.

Preventive steps:

We defined a gated process for deploying DB migration changes.
We added schema validation on service init to prevent schema mismatch cases.
We will add deployment validation that fails the deployment if a migration did not run.
We will remove the heavy dependency on that specific service, which is a single point of failure for the main system flows.
Reduce service rollback time by running only the relevant part of the CD pipeline.
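
For illustration only, the schema validation on service init and the deployment validation listed above could share a single check. The sketch below is not Frontegg's implementation. It assumes migrations are recorded in a bookkeeping table, and uses SQLite as a stand-in for the real database. The names schema_migrations, EXPECTED_SCHEMA_VERSION, validate_schema and deployment_gate are all hypothetical.

```python
import sqlite3

# Hypothetical: the latest migration version this build of the service requires.
EXPECTED_SCHEMA_VERSION = 42


class SchemaMismatchError(RuntimeError):
    """The database schema version differs from the one the service expects."""


def applied_schema_version(conn: sqlite3.Connection) -> int:
    """Return the highest recorded migration version, or 0 if none is recorded."""
    try:
        row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    except sqlite3.OperationalError:
        # Most likely the bookkeeping table is missing; treat that as
        # "no migrations applied" so the check fails closed.
        return 0
    return row[0] if row[0] is not None else 0


def validate_schema(conn: sqlite3.Connection,
                    expected: int = EXPECTED_SCHEMA_VERSION) -> None:
    """Raise SchemaMismatchError unless the database is at the expected version.

    Intended to be called once during service init, before serving traffic.
    """
    actual = applied_schema_version(conn)
    if actual != expected:
        raise SchemaMismatchError(
            f"expected schema version {expected}, found {actual}"
        )


def deployment_gate(conn: sqlite3.Connection,
                    expected: int = EXPECTED_SCHEMA_VERSION) -> int:
    """Return a process exit code: 0 if the migration ran, 1 otherwise.

    A CD pipeline step could run this after the migration job and stop the
    rollout on a non-zero exit code.
    """
    try:
        validate_schema(conn, expected)
    except SchemaMismatchError as exc:
        print(f"deployment blocked: {exc}")
        return 1
    return 0
```

With a check like this, a migration job that was silently skipped would leave the recorded version behind the expected one. The new service version would then refuse to start, or the pipeline would stop, instead of serving requests against a mismatched schema.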

Posted Jun 01, 2023 - 14:28 UTC

Resolved

This incident has been resolved.

Posted May 31, 2023 - 16:13 UTC

Monitoring

A fix has been implemented and we are monitoring the results.

Posted May 31, 2023 - 13:18 UTC

Investigating

We are currently investigating this issue.

Posted May 31, 2023 - 13:08 UTC