Anicetus is an operational telemetry system for generating streams of information that are useful in monitoring, analyzing, and triaging software systems. It differs from traditional logging components by focusing on capturing parametric information as well as the topology of information flow within the system. This information becomes critical as system complexity grows, and no currently available logging package eases its assimilation. The project wiki describes the general architecture and protocol for Anicetus. Language-specific implementations provide the details of each binding.
Anicetus is a software telemetry framework. Telemetry differs from logging in both goals and content. Logging attempts to provide information for both operational monitoring and software debugging. In doing so, it must serve two masters and often serves neither particularly well. The software developer constantly faces the problem of what to log, and the result is often that too much and too little are logged at the same time. Telemetry focuses solely on providing information that is useful for understanding the software from an operational perspective. Operations is primarily interested in the following questions:
- How healthy is the application?
- What errors is the application encountering that block proper operation?
- How well is the application performing?
- What external dependencies does the application have that might impact its operation?
- Are the business metrics in the appropriate ranges (e.g. sales, sessions, etc.)?
Developers are more focused on the internal flows of the application, while operations is more focused on ensuring that the system meets its business goals. Developers need to understand why the application fails in a particular way, while operations needs to understand which components can impact the application and how those components are behaving. The challenge to date has been that logging frameworks, which are focused on the application developer, have been pressed into the secondary duty of providing operational information as well.
One other key challenge for large systems is understanding the relationships between components. Static analysis cannot provide an accurate picture of these relationships for a few reasons. First, logic branches make it difficult for static tools to know for certain which of several dependent components are involved in any given flow. Second, abstractions hide component dependencies from developers, making it difficult for them to articulate those dependencies. Finally, components may be used to differing degrees depending upon the use case. An accurate topology of components can only be derived from the running system. This information is readily captured by a properly designed telemetry framework.
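As a rough illustration of deriving topology from a running system, the sketch below aggregates call events into a weighted dependency graph. The event shape (`source` and `target` fields), the component names, and the `build_topology` helper are all hypothetical and chosen for illustration. They are not the Anicetus protocol, which is described in the project wiki.

```python
from collections import defaultdict

# Hypothetical telemetry events: each one records that `source` called `target`
# at runtime. The field names are illustrative, not the Anicetus wire format.
events = [
    {"source": "web", "target": "cart"},
    {"source": "web", "target": "catalog"},
    {"source": "cart", "target": "db"},
    {"source": "web", "target": "cart"},
]


def build_topology(events):
    """Count observed calls for each (source, target) edge."""
    edges = defaultdict(lambda: defaultdict(int))
    for event in events:
        edges[event["source"]][event["target"]] += 1
    # Convert the nested defaultdicts to plain dicts for easy inspection.
    return {source: dict(targets) for source, targets in edges.items()}


topology = build_topology(events)
```

Because the edges come from observed calls rather than from source code, the graph reflects only the dependencies that were actually exercised, and the counts give a rough sense of how heavily each one is used.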