1) AIOps practice of China’s multi-service platform Meituan
At the 2019 Summit of Global Leadership Technology Committee, Bin Song, a senior tech expert from Meituan, described how the company's quasi-real-time logistics business poses severe challenges for system stability: high peak traffic, large instantaneous spikes, long service links, complex online business logic, and high sensitivity to failures. These problems disrupted order fulfillment and led to compensation payouts and customer complaints. After more than a year of work, Meituan had gradually shifted from manual Ops to automated Ops and had begun experimenting with machine learning to improve efficiency. Song's talk traced how Meituan step by step built a comprehensive and reliable automated Ops system covering capacity assessment, flexibility design, failure diagnosis, and risk prevention.
Diagram 2: Overview of Meituan’s delivery practices
Failure diagnosis comprises failure detection and root cause analysis (RCA). It establishes what has failed, what the failure looks like, and what its root cause is. Meituan presents failure information from both the developer's and the manager's perspective. The core functions of failure diagnosis include failure convergence, link monitoring, failure topology display, and alert escalation. The module also labels, operates on, manages, and tracks failures.
Diagram 3: Meituan’s failure detection module
Meituan's RCA uses not only conventional vertical analysis, that is, root cause mining, but also an innovative horizontal analysis that follows the recursive relation of a call link.
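As a rough illustration of what horizontal analysis along a call link could look like, the sketch below walks a service call graph recursively and reports the deepest unhealthy services, since an upstream failure is often only a symptom of a downstream one. It is a minimal sketch under stated assumptions, not Meituan's implementation: the service names, the `CALLS` graph, the `UNHEALTHY` set, and the `find_root_cause` function are all hypothetical.

```python
# Hypothetical call graph: each service maps to the services it calls.
CALLS = {
    "order-api": ["dispatch", "payment"],
    "dispatch": ["rider-location", "routing"],
    "routing": ["map-db"],
}

# Hypothetical set of services currently reporting failures.
UNHEALTHY = {"order-api", "dispatch", "routing", "map-db"}


def find_root_cause(service, calls, unhealthy, seen=None):
    """Follow failing downstream calls recursively.

    Returns the unhealthy services that have no unhealthy callee,
    i.e. the deepest failing points reachable from `service`.
    """
    if seen is None:
        seen = set()
    if service in seen:  # already explored (shared dependency or cycle)
        return []
    seen.add(service)

    failing_callees = [c for c in calls.get(service, []) if c in unhealthy]
    if not failing_callees:
        # Nothing downstream is failing, so the trail ends here.
        return [service]

    causes = []
    for callee in failing_callees:
        for cause in find_root_cause(callee, calls, unhealthy, seen):
            if cause not in causes:
                causes.append(cause)
    return causes


root_causes = find_root_cause("order-api", CALLS, UNHEALTHY)
print(root_causes)
```

In this toy graph the failure observed at `order-api` is traced through `dispatch` and `routing` down to `map-db`. A production system would presumably also weigh timing, error types, and the vertical analysis described above before naming a root cause.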
Diagram 4: Meituan’s RCA of failure
Judging by the results, shorter fault-location time and broader detection coverage have massively reduced operation costs. The IT Ops support system as a whole applies machine learning to end-to-end link monitoring and to dynamic thresholds for failure rate monitoring, using anomaly detection based on the LOF (local outlier factor) algorithm. Hand-written rules supplement the machine learning models.
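To make the LOF idea concrete, the following is a minimal pure-Python sketch of the standard local outlier factor computation applied to a one-dimensional series of failure rates. It only illustrates the algorithm named above and does not describe Meituan's actual pipeline: the sample data, the neighborhood size `k=3`, and the cutoff `LOF_THRESHOLD` are assumptions chosen for the example.

```python
def k_neighbors(points, i, k):
    """Return (k-distance, neighbor indices) of point i, ties included."""
    dists = sorted(
        (abs(points[i] - points[j]), j) for j in range(len(points)) if j != i
    )
    k_dist = dists[k - 1][0]
    return k_dist, [j for d, j in dists if d <= k_dist]


def lof_scores(points, k=3):
    """Compute the local outlier factor of every point in a 1-D series."""
    n = len(points)
    info = [k_neighbors(points, i, k) for i in range(n)]

    def reach_dist(a, b):
        # Reachability distance of a from b: at least b's k-distance.
        return max(info[b][0], abs(points[a] - points[b]))

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        neighbors = info[i][1]
        mean_reach = sum(reach_dist(i, j) for j in neighbors) / len(neighbors)
        lrd.append(1.0 / max(mean_reach, 1e-12))  # guard against duplicates

    # LOF: average neighbor density relative to the point's own density.
    return [
        sum(lrd[j] for j in info[i][1]) / (len(info[i][1]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical per-minute failure rates; index 7 is an injected spike.
failure_rates = [0.010, 0.012, 0.011, 0.013, 0.009,
                 0.014, 0.011, 0.120, 0.012, 0.010]

LOF_THRESHOLD = 5.0  # assumed cutoff; scores near 1 mean "as dense as neighbors"

scores = lof_scores(failure_rates, k=3)
flagged = [i for i, s in enumerate(scores) if s > LOF_THRESHOLD]
print(flagged)
```

Because LOF compares each point's density with that of its neighbors rather than with a fixed limit, the effective threshold adapts to the recent data, which is one way to read "dynamic thresholds" here. A rule such as a hard upper bound on the failure rate could run alongside it as the supplementary check the text mentions.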