Resolved
Major outage
Started 3 months ago
Lasted about 2 hours
Affected
Agents
Degraded performance from 4:30 PM to 6:25 PM
Updates
- Postmortem
Starting around 9:30 AM PT on August 22, jobs began hanging because our service crashed with illegal instruction exceptions on AMD EPYC processors. The recent upgrade to v1.89 of the Rust toolchain enabled AVX-512 code paths in crc-fast. These are based on Intel Xeon instructions that are not supported on the AMD EPYC processors. AWS Fargate schedules our service across both CPU types, so the crashes were intermittent, occurring when the service ran on an AMD EPYC host. We have deployed a fix that rolls back crc-fast. To better detect and mitigate these issues going forward, we now send core dumps to CloudWatch, and our scheduler recovers automatically from crashed instances.
- Resolved
This incident has been resolved.
- Investigating
We are currently investigating this incident.
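
For readers curious about the failure mode in the postmortem: a binary that runs on mixed CPU fleets can guard optional instruction sets with a runtime check and fall back to portable code. The following is a minimal sketch of that general pattern, not our production code and not the crc-fast API. The names `CodePath`, `select_code_path`, `crc32_portable`, and `checksum` are hypothetical, and the accelerated branch is only a placeholder that reuses the portable routine.

```rust
/// Which implementation the dispatcher picked (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Clone, Copy)]
enum CodePath {
    Avx512,
    Portable,
}

/// Ask the CPU at runtime whether AVX-512 Foundation is available.
/// On non-x86_64 targets the portable path is always chosen.
fn select_code_path() -> CodePath {
    #[cfg(target_arch = "x86_64")]
    {
        if std::is_x86_feature_detected!("avx512f") {
            return CodePath::Avx512;
        }
    }
    CodePath::Portable
}

/// Portable bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
fn crc32_portable(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            // All ones when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Dispatch on the detected code path.
fn checksum(data: &[u8]) -> u32 {
    match select_code_path() {
        // Placeholder: a real accelerated routine would live behind
        // #[target_feature(enable = "avx512f")] and be called only here.
        CodePath::Avx512 => crc32_portable(data),
        CodePath::Portable => crc32_portable(data),
    }
}
```

Calling `checksum(b"123456789")` returns the standard CRC-32 check value `0xCBF43926` on either path. The design point is that the decision is made on the host where the code actually runs, so the same binary avoids unsupported instructions regardless of which CPU type the scheduler places it on.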