MonitoringOperators

From RAD Lab

Monitoring operators, suggesting actions

Students

  • Peter Bodík

Summary

Operators of Internet services rely on their experience to quickly localize and fix site problems. However, operators cannot remember everything, and they often search through emails or ask more experienced colleagues how to fix a given problem. Capturing operator knowledge about how to resolve problems is especially important when new operators need to be trained or when a new resolution for a problem is found and must be communicated to other operators.

One way to think about operator knowledge capture is to ask the question: Which graphs, dashboards, or documentation were most useful in the past in diagnosing a particular problem, and what actions (restarting processes, changing configuration, rebooting machines) resulted in a successful fix?

We propose to gradually develop a "recommendations" system for operators. The first step is to observe the actions operators perform when resolving problems, such as: analyze graph X, look at alarms for host Y, read documentation for Z, reboot host W. When operators use web-based tools such as dashboards, we can parse the corresponding access logs on the internal web servers to capture the operators' actions. When operators use command-line tools, we can use typescript files (for example, terminal sessions recorded with the script utility), shell history, or 'sudo' logs to capture their actions. Finally, using a trouble ticket database, we can discover who worked on a given problem and when (the resolution timeline). By combining these sources of information, we can automatically learn how operators resolved problems in the past.
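As an illustration of combining two of these sources, the following minimal sketch parses web server access logs and keeps only the requests a ticket's assignee made while the ticket was open. It assumes the internal web servers write Apache Common Log Format with an authenticated user name; the Action and Ticket records, their field names, and the sample paths are hypothetical and do not describe any particular ticketing system.

```python
import re
from datetime import datetime
from typing import NamedTuple

# Assumed log layout (Apache Common Log Format):
# host ident user [time] "METHOD path PROTOCOL" status size
LOG_RE = re.compile(
    r'^(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) \S+'
)
TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


class Action(NamedTuple):
    """One observed operator action (hypothetical record)."""
    user: str
    when: datetime
    path: str


class Ticket(NamedTuple):
    """Minimal view of a trouble ticket (hypothetical record)."""
    ticket_id: str
    assignee: str
    opened: datetime
    resolved: datetime


def parse_access_log(lines):
    """Turn access log lines into Action records.

    Lines that do not match the assumed format, or that carry no
    authenticated user ("-"), are skipped.
    """
    actions = []
    for line in lines:
        match = LOG_RE.match(line)
        if match is None or match.group("user") == "-":
            continue
        when = datetime.strptime(match.group("time"), TIME_FORMAT)
        actions.append(Action(match.group("user"), when, match.group("path")))
    return actions


def actions_for_ticket(ticket, actions):
    """Return the assignee's actions inside the ticket's resolution window,
    ordered by time."""
    in_window = (
        a for a in actions
        if a.user == ticket.assignee and ticket.opened <= a.when <= ticket.resolved
    )
    return sorted(in_window, key=lambda a: a.when)


if __name__ == "__main__":
    sample_log = [
        '10.0.0.5 - alice [12/Mar/2008:10:02:11 -0700] "GET /graphs/cpu?host=web12 HTTP/1.1" 200 5120',
        '10.0.0.5 - alice [12/Mar/2008:10:05:40 -0700] "GET /docs/restart-apache HTTP/1.1" 200 2048',
        '10.0.0.9 - bob [12/Mar/2008:10:06:00 -0700] "GET /graphs/disk?host=db3 HTTP/1.1" 200 4096',
    ]
    ticket = Ticket(
        "T-1", "alice",
        datetime.strptime("12/Mar/2008:10:00:00 -0700", TIME_FORMAT),
        datetime.strptime("12/Mar/2008:10:30:00 -0700", TIME_FORMAT),
    )
    for action in actions_for_ticket(ticket, parse_access_log(sample_log)):
        print(action.when.isoformat(), action.path)
```

The same join could be extended to command-line sources by emitting Action records from shell history or 'sudo' logs, provided those sources carry a user name and a timestamp.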

The information about how operators resolve problems could be used in many ways. The most straightforward application is to create a “Page you/your team made”: a collection of graphs, documentation and other “actions” that are most often used by the operator or by people on their team. The next step is to create a recommendation system that suggests actions based on the type of problem that the operator is currently working on. Finally, by combining this approach with statistical machine learning, we should be able to automatically resolve certain problems by performing the recovery actions that resolved such problems in the past.
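A minimal sketch of the recommendation step, assuming past incidents have already been labeled with a problem type and a list of the actions taken. It is a plain frequency baseline (suggest the actions seen in the most past incidents of the same type), not the learning method this project would ultimately use; the problem types and action names are hypothetical.

```python
from collections import Counter, defaultdict


def build_action_counts(history):
    """Count, per problem type, in how many past incidents each action appeared.

    `history` is an iterable of (problem_type, actions) pairs, where
    `actions` lists the actions taken while resolving one incident.
    """
    counts = defaultdict(Counter)
    for problem_type, actions in history:
        # Count each action once per incident so that a graph reloaded
        # many times in one incident does not dominate the ranking.
        counts[problem_type].update(set(actions))
    return counts


def recommend(counts, problem_type, k=3):
    """Return up to k actions most often used for this problem type.

    Ties are broken alphabetically so the output is deterministic.
    An unseen problem type yields an empty list.
    """
    ranked = sorted(
        counts.get(problem_type, Counter()).items(),
        key=lambda item: (-item[1], item[0]),
    )
    return [action for action, _ in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical resolved incidents.
    history = [
        ("high_latency", ["view graph: cpu", "read doc: restart-apache", "restart apache"]),
        ("high_latency", ["view graph: cpu", "restart apache"]),
        ("high_latency", ["view graph: db-connections", "restart apache"]),
        ("disk_full", ["view graph: disk", "rotate logs"]),
    ]
    counts = build_action_counts(history)
    print(recommend(counts, "high_latency", k=2))
```

Restricting `history` to one operator or one team before counting would give the contents of a “Page you/your team made” under the same assumptions.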