Tencent Cloud Reveals Root Cause of April 8 Service Disruption: Cloud API Fault Lasted Nearly 87 Minutes

Thanks to Gamingdeputy reader Doraemon Maruko-chan for the tip!

Gamingdeputy reported on April 14 that Tencent Cloud’s official public account published a statement today disclosing the causes and details of the widespread service failure on April 8.

Officials stated that, after locating the fault, they found that customers' inability to log in to the console was caused by cloud API exceptions. The cloud API is a unified set of open interfaces on the cloud, through which customers can programmatically manage and control cloud resources. The cloud console provides its interactive web functions by combining cloud APIs.

After the failure occurred, some public cloud services that rely on the cloud API to provide product capabilities also became unavailable, including cloud functions, text recognition, microservice platforms, audio content security and verification codes. The fault lasted for nearly 87 minutes, during which a total of 1,957 customers reported the fault.

Tencent Cloud said that if cloud services are compared to a “hotel”, the console is the “front desk”, a unified service entrance. “A malfunction at the hotel front desk makes check-in, stay extensions and other management functions unavailable. However, occupied rooms are not affected.” In this failure, IaaS resources that customers had already provisioned, such as servers, along with businesses already deployed and running, were not affected by the cloud API anomaly.

Officials disclosed the root cause of the failure and improvement measures as follows:

After a comprehensive review of this failure, the most fundamental cause was that sandbox verification and contingency plan drills were not effectively performed during the version change process, which exposed shortcomings in change management. Next, we will make rapid improvements in the following areas to reduce the impact scope and duration of failures.

First, improve system resilience

1. Regularly run simulation drills of the established change strategies so that, when a real failure occurs, services can quickly switch to recovery mode, minimizing service interruption time.

2. Optimize the service deployment architecture and avoid potential circular dependency problems in API services through layered architecture, code review and monitoring.

3. Provide an escape channel for the API service so that callers can switch over quickly when a failure occurs.
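
As an illustration of the escape channel idea in item 3, here is a minimal caller-side sketch. Tencent Cloud did not publish implementation details, so the function name `call_with_escape` and the idea of passing channels as plain callables are assumptions made purely for this example.

```python
from typing import Any, Callable, Sequence

def call_with_escape(channels: Sequence[Callable[[Any], Any]], request: Any) -> Any:
    """Try each channel in order and return the first successful response.

    `channels` is a hypothetical ordered list: the normal API entry first,
    then one or more escape (fallback) channels.
    """
    last_error = None
    for channel in channels:
        try:
            return channel(request)
        except Exception as exc:  # a real client would catch narrower errors
            last_error = exc
    # Every channel failed: surface the last error to the caller.
    raise RuntimeError("all API channels failed") from last_error
```

The caller only sees an error if the primary entry and every escape channel fail for the same request.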

Second, strengthen change management and protection measures

1. Improve the automated test case library and strictly verify the changes through the sandbox environment before system changes.

2. Implement a grayscale release strategy that rolls out new features or configuration changes gradually, taking effect step by step across clusters, availability zones and regions, so that changes can be quickly rolled back when problems are discovered.

3. Introduce an automatic circuit breaker mechanism that immediately interrupts the change process when a system abnormality is detected.
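
The grayscale release and circuit breaker measures above can be sketched together as follows. This is an illustrative sketch only, not Tencent Cloud's actual tooling: the stage names and the `apply_change`, `is_healthy` and `roll_back` callbacks are hypothetical placeholders.

```python
from typing import Callable, List, Sequence

def staged_rollout(
    stages: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> bool:
    """Apply a change stage by stage; stop and roll back on the first anomaly."""
    applied: List[str] = []
    for stage in stages:
        apply_change(stage)
        applied.append(stage)
        if not is_healthy(stage):
            # Circuit breaker: interrupt the change and undo it,
            # most recently changed stage first.
            for done in reversed(applied):
                roll_back(done)
            return False
    return True
```

Ordering the stages from smallest to largest scope (for example one cluster, then an availability zone, then a region) means an anomaly is caught while few users are affected.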

Third, enhance fault response and communication capabilities

1. Comprehensively upgrade the fault handling process to ensure real-time updates of fault handling progress and estimated recovery time points, and improve the efficiency of fault report release.

2. In the external failure notice, clearly explain the affected business scope, root cause of the failure and estimated repair time, and maintain transparency.

3. Optimize the information display logic of the Tencent Cloud Health Status Dashboard (StatusPage), eliminate its dependence on cloud services such as the cloud API, and introduce caching and disaster recovery mechanisms to ensure that fault information can be delivered accurately and promptly even when cloud services fail.
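
The caching measure for the status dashboard in item 3 could look roughly like the sketch below, which serves the last known status when the upstream source is unreachable. The class name `CachedStatus` and its interface are assumptions for illustration and do not describe the real StatusPage.

```python
import time
from typing import Callable, Optional, Tuple

class CachedStatus:
    """Serve live status when possible, otherwise the last cached value."""

    def __init__(self, fetch: Callable[[], str],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch      # hypothetical upstream status source
        self._clock = clock
        self._last: Optional[Tuple[str, float]] = None  # (status, fetched_at)

    def get(self) -> Tuple[str, Optional[float], bool]:
        """Return (status, age_in_seconds, is_live)."""
        try:
            status = self._fetch()
        except Exception:
            if self._last is None:
                return ("unknown", None, False)
            cached, fetched_at = self._last
            # Upstream is down: fall back to the cached value and report its age.
            return (cached, self._clock() - fetched_at, False)
        self._last = (status, self._clock())
        return (status, 0.0, True)
```

Returning the age and a live flag lets the page tell readers that the information shown may be stale.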

According to Gamingdeputy’s report on April 8, Tencent Cloud experienced a service failure that afternoon: interfaces responded with errors, including internal service errors, and web pages displayed a 504 error. Netizens also reported service failures on Tencent Cloud’s official Weibo account, with the posters' IP locations spanning many parts of the country.
