Invariant Searching and Metric Ranking for Fault Diagnosis of Distributed Systems

Yong Ge, Computer Science Department
March 20, 2014 - 12:30 PM
130 Woodward
  The management of large-scale distributed information systems relies on effectively modeling and using the monitoring data collected at various points in those systems. A traditional approach to modeling monitoring data is to discover invariant relationships among the monitored metrics. Indeed, we can discover the invariant relationships among all pairs of metrics and generate an invariant network, in which a node is a monitoring data source (a metric) and a link indicates an invariant relationship between two metrics. Because system faults usually propagate among the metrics and eventually break some invariant relationships, such a network can help system experts localize and diagnose faults by examining the broken invariants and the metrics they connect. However, searching for the complete set of invariants in a large-scale system is very time-consuming. We have developed effective pruning techniques based on upper bounds that we identify, and we propose two efficient algorithms that use these techniques to search for the complete set of invariants. Furthermore, at any given time there are usually many broken links (invariant relationships) within an invariant network, and without proper guidance it is difficult for system experts to inspect such a large number of broken links manually. To this end, we propose two types of algorithms that rank metrics by anomaly using link analysis in invariant networks.
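
  To make the idea of ranking metrics through broken links concrete, here is a minimal illustrative sketch. It is not one of the algorithms from the talk: it assumes the invariant network is given as a list of metric pairs, and it scores each metric by the fraction of its invariant links that are currently broken. The function name rank_metrics and the example metric names are hypothetical.

```python
from collections import defaultdict


def rank_metrics(invariants, broken):
    """Rank metrics by the fraction of their invariant links that are broken.

    invariants: iterable of (metric_a, metric_b) pairs, the links of the network.
    broken:     iterable of pairs from `invariants` that currently do not hold.
    Returns a list of (metric, score) sorted by descending score, then by name.
    NOTE: this simple ratio is an illustrative stand-in, not the talk's method.
    """
    # Links are undirected, so compare pairs as unordered sets.
    broken_links = {frozenset(pair) for pair in broken}
    total = defaultdict(int)   # number of invariant links per metric
    bad = defaultdict(int)     # number of broken invariant links per metric

    for a, b in invariants:
        is_broken = frozenset((a, b)) in broken_links
        for metric in (a, b):
            total[metric] += 1
            if is_broken:
                bad[metric] += 1

    scores = {metric: bad[metric] / total[metric] for metric in total}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical network of four metrics in which every link touching "cpu" broke.
invariants = [
    ("cpu", "net"), ("cpu", "disk"), ("cpu", "mem"),
    ("net", "disk"), ("mem", "disk"), ("mem", "net"),
]
broken = [("cpu", "net"), ("cpu", "disk"), ("cpu", "mem")]

ranked = rank_metrics(invariants, broken)
for metric, score in ranked:
    print(f"{metric}: {score:.2f}")
```

  In this toy network the metric whose links are all broken is ranked first, which is the kind of guidance an expert would want before inspecting broken links one by one. A plain ratio ignores how anomalous the neighboring metrics are, which is one reason a link-analysis approach may be preferable.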