US region services partial outage
Incident Report for Frontegg
Postmortem

Executive summary:

On June 3rd at 12:06 GMT, the Frontegg team received an indication from our monitoring system of increased latency for refresh token requests (average greater than 750 ms) in our US region. At 12:12 GMT, the first customer reached out to Frontegg to report request timeouts. At 12:13 GMT, we updated our status page and officially began the investigation.
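
For context on the alert condition, a rolling-average latency check of this general kind is sketched below. It is purely illustrative; the window size, class name, and sample values are assumptions, as this report does not describe our monitoring stack.

# Illustrative only: fires when the rolling average of refresh-token request
# latencies exceeds 750 ms. Window size and names are assumptions.
from collections import deque


class LatencyAlert:
    """Tracks recent latencies and flags when the rolling average is too high."""

    def __init__(self, threshold_ms: float = 750.0, window: int = 100) -> None:
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)

    def record(self, latency_ms: float) -> bool:
        """Record one request latency; return True if the alert should fire."""
        self.samples.append(latency_ms)
        return sum(self.samples) / len(self.samples) > self.threshold_ms


# A burst of slow refresh requests pushes the average past the threshold.
alert = LatencyAlert()
for latency_ms in [120, 130, 900, 1100, 1400, 1250]:
    if alert.record(latency_ms):
        print(f"ALERT: average refresh latency above {alert.threshold_ms} ms")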

As a preliminary measure, the team began several mitigation actions in an attempt to remedy the situation as quickly as possible. After seeing no improvement, at 12:30 GMT the team began a full cross-regional disaster recovery protocol. At 12:40 GMT, we also began a same-region disaster recovery protocol (starting a new same-region cluster) as part of the escalation to ensure a successful recovery.

At 13:25 GMT we began diverting traffic to the new same-region cluster, and by 13:30 GMT we saw traffic to Frontegg stabilize. Upon further investigation, we discovered the root cause to be a networking issue inside our main cluster, which triggered a chain reaction that affected the overall latency of the cluster. Additionally, we are working with our cloud provider to gather further details on the event from their side.

Effect:

From 12:06 GMT to 13:30 GMT on June 3rd, Frontegg accounts hosted in our US region experienced substantially increased latency on a significant portion of identity-based requests to Frontegg. As a result, many requests timed out, leaving users unable to log in or refresh their tokens. Additionally, access to the Frontegg Portal was partially blocked due to this issue.

Mitigation and resolution:

Once the Frontegg team received the initial alert on refresh token latency, we began an investigation into our traffic, request latency, workload, hanging requests, and database latency. With the results inconclusive, the team initiated several mitigation efforts, including:

  • At 12:14 GMT, we scaled up our cluster workloads to add capacity.
  • At 12:30 GMT, the team began a full cross-regional disaster recovery protocol.
  • At 12:40 GMT, we also began a same-region disaster recovery protocol (starting a new same-region cluster) as part of the escalation.
  • By 13:00 GMT, we had increased the number of Kafka brokers as an additional mitigation measure (a sketch of this kind of scaling step follows this list).
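
The scaling steps above can be carried out in several ways, and this report does not describe our orchestration tooling. As a minimal sketch, assuming the Kafka brokers run as a Kubernetes StatefulSet, the snippet below patches the scale subresource to add brokers. The StatefulSet name, namespace, and replica count are hypothetical.

# Illustrative only: scale out a Kafka broker StatefulSet on Kubernetes.
from kubernetes import client, config


def scale_kafka_brokers(replicas: int, name: str = "kafka", namespace: str = "messaging") -> None:
    """Request a new broker count by patching the StatefulSet scale subresource."""
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    apps = client.AppsV1Api()

    current = apps.read_namespaced_stateful_set_scale(name, namespace)
    print(f"current brokers: {current.spec.replicas}")

    apps.patch_namespaced_stateful_set_scale(
        name=name,
        namespace=namespace,
        body={"spec": {"replicas": replicas}},
    )
    print(f"requested scale-out to {replicas} brokers")


if __name__ == "__main__":
    scale_kafka_brokers(replicas=5)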

After a preliminary check on the new same-region cluster, we began diverting traffic to it. By 13:30 GMT, traffic to this cluster had stabilized and we moved the incident to monitoring. We continued to monitor traffic for the next hour before resolving the incident.
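
This report does not specify the mechanism used to divert traffic. As an illustration of the general idea, the weighted routing sketch below gradually shifts requests from the old cluster to the new one; the cluster URLs, weights, and class name are hypothetical.

# Illustrative only: weighted traffic diversion between two clusters.
import random


class WeightedRouter:
    """Sends each request to one of two upstream clusters according to a weight."""

    def __init__(self, old_cluster: str, new_cluster: str, new_weight: float = 0.0) -> None:
        self.old_cluster = old_cluster
        self.new_cluster = new_cluster
        self.new_weight = new_weight  # fraction of traffic sent to the new cluster

    def shift(self, new_weight: float) -> None:
        """Move more (or all) traffic to the new cluster; clamped to [0, 1]."""
        self.new_weight = max(0.0, min(1.0, new_weight))

    def pick_upstream(self) -> str:
        return self.new_cluster if random.random() < self.new_weight else self.old_cluster


# Gradual cutover: 10%, then 50%, then 100% of traffic to the new same-region cluster.
router = WeightedRouter("https://old-cluster.internal", "https://new-cluster.internal")
for step in (0.1, 0.5, 1.0):
    router.shift(step)
    print(f"{int(step * 100)}% weight -> {router.pick_upstream()}")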

Preventive steps:

  • We are adding a same-region hot failover cluster for quick mitigation of P0 issues
  • We are introducing finer-grained rate limits on all routes within the system to further protect our cluster health (sketched after this list)
  • We are working closely with our cloud provider to gather additional information on the event so that we can better anticipate and prevent similar events in the future
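
As a rough illustration of the finer-grained, per-route rate limiting mentioned above, the token-bucket sketch below gives different routes different budgets. The route paths, rates, and burst sizes are hypothetical and are not Frontegg's actual limits.

# Illustrative only: per-route rate limiting with a token bucket.
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self) -> bool:
        """Refill based on elapsed time, then try to spend one token."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Hypothetical per-route budgets: token refresh is high volume, login is stricter.
ROUTE_LIMITS = {
    "/auth/token/refresh": TokenBucket(rate=200, capacity=400),
    "/auth/login": TokenBucket(rate=50, capacity=100),
}


def is_allowed(route: str) -> bool:
    bucket = ROUTE_LIMITS.get(route)
    return bucket.allow() if bucket else True  # routes without an explicit limit pass through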

At Frontegg, we take any downtime incident very seriously. We understand that Frontegg is an essential service, and when we are down, our customers are down. To prevent further incidents, Frontegg is focusing all efforts on a zero-downtime delivery model. We apologize for any issues caused by this incident.

Posted Jun 04, 2024 - 19:20 IDT

Resolved
This incident has been resolved.
Posted Jun 03, 2024 - 17:58 IDT
Update
We are continuing to monitor for any further issues.
Posted Jun 03, 2024 - 17:01 IDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jun 03, 2024 - 16:40 IDT
Update
We are continuing to investigate this issue.
Posted Jun 03, 2024 - 16:01 IDT
Investigating
We are currently investigating this issue.
Posted Jun 03, 2024 - 15:13 IDT
This incident affected: User authentication, Machine to machine authentication, SSO & SAML authentication, and Management portal.