Frontegg Services are showing Degraded Performance in EU & US
Incident Report for Frontegg
Postmortem

Executive summary:

On Wednesday, May 31st, 2023, at 12:55 GMT, we deployed a minor version of one of our services. Shortly after, at 12:56 GMT, Frontegg’s US monitoring system started sending alerts for an authentication service that was not performing as expected, and the team immediately began investigating the issue.

At 13:01 GMT, we started getting alerts from Frontegg’s EU monitoring as well regarding the same service. Shortly after, we started receiving complaints from customers.

At 13:04 GMT, 8 minutes after the first alerts, the team concluded that the issue was caused by a recently deployed change. The change included a database migration for one of our primary services. However, the migration job did not run because of a race condition edge case in our CD infrastructure, leaving the service in a schema mismatch state.

At this point, we immediately started a rollback for both the EU and US regions, which was completed by 13:16 GMT. Once the rollback completed, we confirmed that our services were working as expected again, and customers also reported that they were no longer experiencing issues.

Impact:

Most requests to customers’ custom Frontegg domains resulted in 401/404 responses or an inability to authenticate.

For the EU region: between 12:59 and 13:16 GMT.

For the US region: between 12:56 and 13:14 GMT.

Mitigation and resolution:

Following the monitoring alerts, the incident response team immediately identified the service that was likely at fault and started the rollback procedure to the previous successful deployment.

Preventive steps:

  • We defined a gated process for deploying DB migration changes
  • We added schema validation on service init to prevent schema mismatch cases
  • We will add deployment validation that fails the deployment if a migration didn’t run
  • We will remove the heavy dependency on that specific service, which is a single point of failure for the main system flows
  • Reduce service rollback time by running only the relevant part of the CD pipeline
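
To illustrate the schema validation step listed above, here is a minimal sketch of what a fail-fast check on service init could look like. It is not Frontegg's actual implementation: the `schema_migrations` table, the `EXPECTED_SCHEMA_VERSION` constant, and the function names are all hypothetical, and SQLite is used only so that the example is self-contained.

```python
import sqlite3

# Hypothetical: the schema version this build of the service expects.
EXPECTED_SCHEMA_VERSION = 42


class SchemaMismatchError(RuntimeError):
    """Raised when the database schema does not match what the code expects."""


def applied_schema_version(conn: sqlite3.Connection) -> int:
    """Return the highest migration version recorded in the database.

    Assumes a hypothetical `schema_migrations(version INTEGER)` table
    that the migration job appends to after each successful run.
    """
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    return row[0] if row[0] is not None else 0


def validate_schema(
    conn: sqlite3.Connection, expected: int = EXPECTED_SCHEMA_VERSION
) -> None:
    """Fail fast if the migration for this release has not been applied."""
    actual = applied_schema_version(conn)
    if actual != expected:
        raise SchemaMismatchError(
            f"schema version {actual} does not match expected version {expected}"
        )
```

Calling something like `validate_schema(conn)` before the service starts accepting traffic would make a skipped migration surface as a failed startup rather than as failed requests. The same comparison could, in principle, be run as a post-deploy step so that the deployment fails when the migration did not run.
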
Posted Jun 01, 2023 - 14:28 UTC

Resolved
This incident has been resolved.
Posted May 31, 2023 - 16:13 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 31, 2023 - 13:18 UTC
Investigating
We are currently investigating this issue.
Posted May 31, 2023 - 13:08 UTC
This incident affected: User authentication, Machine to machine authentication, and SSO & SAML authentication.