Many of the tools I have focused on recently help developers understand traces and memory dumps, but they generally assist a developer working on a single diagnostic artifact in isolation. There is, however, a class of problems that requires you to look at artifacts coming from thousands or even millions of services, apps and devices. For problems at this scale, manual review is impractical.
Facebook operates a family of services used by over two billion people daily on a huge variety of mobile devices. In this scenario developers are interested in identifying classes and groups of problems and how they are connected (or not) to other groups of problems. For example, does one group of issues occur more frequently at a particular time of day? Or when your service experiences high load? Put another way, developers are looking to understand cause and effect at scale.
I read the paper on their approach a couple of years ago but lost track of it, so here is the PDF for my (and your) future reference.
Here is a snippet from the post on the Engineering site:
Facebook implemented continuous contrast set mining (CCSM), an anomaly-detection framework that uses contrast set mining (CSM) techniques to locate statistically “interesting” (defined by several statistical properties) sets of features in groups.
…
Resolving these crashes and other reliability issues in a timely manner is a top priority. To help us respond as quickly as possible, we have been creating a collection of services that use machine learning (ML) to aid engineers in diagnosing and resolving software reliability and performance issues.
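To make the idea of contrast set mining a little more concrete, here is a minimal sketch in Python. It is not Facebook's CCSM implementation: it only handles categorical features and exactly two groups, and it skips the pruning and multiple-comparison corrections a real system would need. The field names (`os`, `build`, `crashed`), the thresholds and the sample data are all made up for illustration. The sketch enumerates small sets of feature values, compares how often each set appears in the two groups, and keeps the sets whose support differs by a large margin and passes a chi-square test.

```python
from itertools import combinations

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRIT_DF1_ALPHA_05 = 3.841


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def contrast_sets(records, group_key, max_size=2, min_diff=0.2):
    """Find itemsets whose support differs between two groups.

    Returns a list of (itemset, support_by_group, chi2), strongest first.
    Simplified sketch: categorical features, exactly two groups.
    """
    groups = sorted({r[group_key] for r in records})
    if len(groups) != 2:
        raise ValueError("this sketch handles exactly two groups")
    g0, g1 = groups
    by_group = {g: [r for r in records if r[group_key] == g] for g in groups}
    items = sorted({(k, v) for r in records for k, v in r.items() if k != group_key})

    results = []
    for size in range(1, max_size + 1):
        for itemset in combinations(items, size):
            # Skip itemsets that assign two different values to one feature.
            if len({k for k, _ in itemset}) < size:
                continue
            # Count the records in each group that match every item in the set.
            counts = {
                g: sum(all(r.get(k) == v for k, v in itemset) for r in rows)
                for g, rows in by_group.items()
            }
            support = {g: counts[g] / len(by_group[g]) for g in groups}
            chi2 = chi_square_2x2(
                counts[g0], len(by_group[g0]) - counts[g0],
                counts[g1], len(by_group[g1]) - counts[g1],
            )
            large = abs(support[g0] - support[g1]) >= min_diff
            significant = chi2 > CHI2_CRIT_DF1_ALPHA_05
            if large and significant:
                results.append((itemset, support, chi2))
    return sorted(results, key=lambda r: -r[2])


# Hypothetical session records: 10 that crashed and 10 that did not.
records = (
    [{"os": "11", "build": "beta", "crashed": "yes"}] * 8
    + [{"os": "11", "build": "stable", "crashed": "yes"}] * 2
    + [{"os": "11", "build": "stable", "crashed": "no"}] * 5
    + [{"os": "12", "build": "stable", "crashed": "no"}] * 4
    + [{"os": "12", "build": "beta", "crashed": "no"}] * 1
)

results = contrast_sets(records, group_key="crashed")
for itemset, support, chi2 in results:
    print(itemset, support, round(chi2, 2))
```

With this made-up data, the strongest contrast set is the combination of the beta build and OS version 11, which appears in 80% of the crashed sessions and none of the others. That is the kind of "interesting" feature set the snippet describes, although the real framework also has to deal with continuous features and far larger search spaces.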