Problem Statement
On Wednesday, April 18, 2023, the IP address of a database instance used by a production environment of the cloud monitoring platform became unreachable. As a result, parts of the cloud monitoring console malfunctioned from 17:00 to 17:43 UTC+8. We sincerely apologize for the disruption to the cloud monitoring service and its impact on your experience.
Incident Background
In a production environment of the cloud monitoring platform, a database instance was mistakenly detached from its migration identifier before its migration was complete. Without that identifier, the instance could no longer be reached through its old IP address once a high-availability (HA) switch occurred under high load. Services in the production environment that still used the old address then saw abnormal database connections.
What was the specific reason for the unsuccessful switch?
High load triggered a routine high-availability (HA) master-slave switch for the CDB instance. The instance had not completed its migration but had been mistakenly labelled as migrated, so the switch invalidated the old virtual IP (VIP) and left only the new VIP reachable. Any traffic that had not yet moved to the new VIP could no longer reach the database, which caused connection failures and service unavailability.
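The failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: the class `DbInstance`, its fields, and the rule inside `ha_switch` are hypothetical and simply restate the behaviour described above. They are not the CDB implementation.

```python
from dataclasses import dataclass, field


@dataclass
class DbInstance:
    """Toy stand-in for a database instance that is mid-migration."""
    old_vip: str
    new_vip: str
    marked_migrated: bool  # the label, which may not match reality
    reachable_vips: set = field(init=False)

    def __post_init__(self):
        # While migration is in progress, both addresses serve traffic.
        self.reachable_vips = {self.old_vip, self.new_vip}

    def ha_switch(self):
        """Simulate a master-slave switch.

        Assumption: after the switch, the old VIP is kept only for
        instances that are still labelled as migrating.
        """
        if self.marked_migrated:
            self.reachable_vips = {self.new_vip}
        else:
            self.reachable_vips = {self.old_vip, self.new_vip}

    def can_connect(self, vip):
        return vip in self.reachable_vips
```

In this model, an instance wrongly created with `marked_migrated=True` accepts connections on its old VIP until `ha_switch()` runs and rejects them afterwards, while a correctly labelled instance keeps both addresses.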
Was there any data loss?
We want to assure you that no data loss occurred during the incident.
What happened during the incident?
1. Start time of the incident: At 16:38 UTC, a high-load alert was triggered in the cluster, indicating that CPU utilization exceeded 95%.
2. Trigger of the incident: Also at 16:38 UTC, the number of slow queries began to rise. At 16:58 UTC, the success rate of the business layer began to decline, at the same time as the number of slow queries reached its peak.
3. Troubleshooting process: The service logs of the business layer showed that access to one specific database IP was failing. The database instance behind that IP had been under high load and had undergone a high-availability failover, which invalidated the old IP. Further investigation revealed that the instance had been incorrectly labelled as migrated.
4. Steps Taken:
- All configurations in the configuration center that used the old IP were scanned, and the database access address in each was updated to an available IP address.
- The database operations team was contacted to manually initiate a high-availability (HA) switch and restore the availability of the old address.
5. Incident Recovery: After restoring the database access, the success rate of the business layer service interface improved, and the functionality of the cloud monitoring console returned to normal.
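The configuration scan in step 4 could look roughly like the following. This is a minimal sketch, assuming the configuration center can be exported as a mapping from a configuration key to its text value. The function names and the sample addresses are hypothetical.

```python
import re


def _ip_pattern(ip):
    # Match the address only as a whole token, so "10.0.0.5" does not
    # also match inside "10.0.0.50" or "110.0.0.5".
    return re.compile(r"(?<![0-9.])" + re.escape(ip) + r"(?![0-9])(?!\.[0-9])")


def find_configs_using(configs, old_ip):
    """Return the sorted keys of all configurations that mention old_ip."""
    pattern = _ip_pattern(old_ip)
    return sorted(key for key, value in configs.items() if pattern.search(value))


def repoint_configs(configs, old_ip, new_ip):
    """Return a copy of configs with old_ip replaced by new_ip."""
    pattern = _ip_pattern(old_ip)
    return {
        key: pattern.sub(lambda _match: new_ip, value)
        for key, value in configs.items()
    }
```

Matching on whole addresses rather than plain substrings is a deliberate choice here: a naive string replace would also rewrite unrelated addresses that merely share a prefix with the old IP.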
Impact
1. The alarm console became non-functional: alarm history could not be displayed and users could not perform regular console operations.
2. Alarm notification message content could not be retrieved, so alarm notifications failed to send as intended.
3. The Dashboard console became inaccessible: users could not access monitoring data and saw error messages indicating operation failures.
Next Steps and Action Plan
The following measures will be implemented to prevent a recurrence of the incident.
1. The migration status of every database instance that requires migration will be thoroughly reviewed. An instance will be marked as migrated only when no access records are associated with its old IP.
2. Migration will be accelerated so that all pending database instance migrations are completed within the first half of 2023.
3. Database usage will be standardized, monitoring capabilities will be enhanced, capacity will be proactively expanded for high-load instances, and slow query logic will be optimized.
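The completion check in measure 1 might be expressed as follows. This is a sketch only: the record format of `(timestamp, target_ip)` tuples, the function name, and the seven-day quiet period are assumptions made for illustration and are not part of the plan above.

```python
from datetime import datetime, timedelta


def can_mark_migrated(access_records, old_ip, now, quiet_period=timedelta(days=7)):
    """Return True only if old_ip received no access within the quiet period.

    access_records: iterable of (timestamp, target_ip) tuples.
    """
    cutoff = now - quiet_period
    return not any(
        target_ip == old_ip and timestamp >= cutoff
        for timestamp, target_ip in access_records
    )
```

Requiring a quiet period, instead of checking a single point in time, reduces the chance that a client which connects only occasionally is missed.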