Swimming in Sensors and Drowning in Data
Modern sensor networks and communication networks provide large sets of operational data, including information per system, subsystem, feature, port, or even per packet. A network device, like a router or a switch, offers millions of multivariate time series that portray the state of the device. But the fact that data is available does not mean that it is easily consumable.
Indeed, the reality is that network operations produce a lot of data and yet are starving for insights. Telemetry data in the form of multivariate time series is produced at an unprecedented rate, but typical telemetry streams contain metrics that are high dimensional, noisy, and often riddled with missing values, and thus offer only incomplete information. Moreover, the generation of continuous telemetry data at high frequency and volume poses serious bandwidth and storage problems for data centers.
This is challenging for network administrators, who need to store, interpret, and reason about the data in a holistic and comprehensive way. One common practice is to hand-pick a small subset of the available time series based on experience. Another is to apply down-sampling strategies that aggregate common features, which de facto limits the prediction accuracy of telemetry analytics. In doing so, network administrators are confronted with two key challenges:
- Visibility: within a device, alarms/notifications from different components are reported independently. Across devices, there is little systematic telemetry exchange, even though a network event often gives rise to alarms/notifications on multiple devices in the network. A device-centric view therefore prevents network administrators from keeping complete control of the data center infrastructure and from producing a correct network diagnosis.
- Filtering and aggregation: the deluge of data generated by multiple sensors affects every industry, owing to the abundance of complex systems and the proliferation of sensors. A single event is often present in a multitude of data sources in heterogeneous formats, such as syslog, Model Driven Telemetry (MDT), SNMP, etc. None of these data sources are correlated, nor is there any identifier that ties the data to an application or service. When the large majority of events is collected and processed online, the amount of data created often exceeds the storage capacity and processing power of backend systems and controllers.
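To make the down-sampling practice mentioned above concrete, here is a minimal sketch of the kind of aggregation an operator might apply to an MDT-style counter stream. It is illustrative only: the counter names, sampling rate, and 5-minute bin size are assumptions for the example, not a description of any product pipeline.

```python
import numpy as np
import pandas as pd

# Illustrative only: synthetic interface counters sampled once per second,
# standing in for a high-frequency MDT-style telemetry stream.
idx = pd.date_range("2021-01-01", periods=3600, freq="1s")
raw = pd.DataFrame(
    {
        "in_octets": np.random.poisson(5000, size=len(idx)).cumsum(),
        "out_octets": np.random.poisson(4200, size=len(idx)).cumsum(),
    },
    index=idx,
)

# Typical down-sampling step: convert cumulative counters to per-second rates
# and aggregate 1-second samples into 5-minute bins. Volume drops ~300x,
# but short-lived spikes are averaged away -- the loss of fidelity
# (and prediction accuracy) discussed above.
downsampled = raw.diff().resample("5min").mean()
print(downsampled.head())
```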
The traditional approaches to solving these challenges are to:
- Create highly scalable centralized controllers with a network-wide view for data mining. This approach is limited by the CAPEX investment required for hardware (e.g., backend systems, HPC facilities, storage systems) and software (e.g., licenses, development of new algorithms).
- Limit the scope of data collection to a subset of counters and devices selected with the assistance of domain experts (SMEs) or rule-based systems. This approach is limited by the background knowledge of the domain expert or by the static knowledge base of the expert system, i.e., you only see what you were looking for. Due to CPU and memory limitations on routers, on-box expert systems are typically based on manually crafted and maintained rules (rather than learning-based approaches), which lack flexibility and require frequent updates to remain current. Although expert systems perform well in domains where the rules are followed, they tend to perform poorly on anything outside the pre-specified rules. Quite commonly, the thresholds of these rule engines need to be adjusted on a per-deployment basis, incurring significant deployment and maintenance costs.
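As a toy illustration of the per-deployment threshold problem, the sketch below contrasts a fixed-threshold rule with a simple data-driven alternative. The rule, threshold value, and traffic profiles are all made up for the example; this is not a description of any shipping rule engine.

```python
import numpy as np

# Hypothetical rule from a static, hand-maintained knowledge base:
# alarm whenever CPU utilization exceeds a fixed threshold.
STATIC_CPU_THRESHOLD = 80.0  # percent; made-up value that must be re-tuned per deployment

def static_rule(cpu: np.ndarray) -> np.ndarray:
    """Fire an alarm wherever the fixed threshold is crossed."""
    return cpu > STATIC_CPU_THRESHOLD

def adaptive_rule(cpu: np.ndarray, quantile: float = 0.99) -> np.ndarray:
    """Data-driven alternative: flag samples above the device's own
    historical 99th percentile, so no per-deployment tuning is needed."""
    return cpu > np.quantile(cpu, quantile)

# A busy core router sits near 85% CPU all day, so the static rule alarms
# almost constantly; a lightly loaded edge box never trips it, even during
# a genuine anomaly where its load more than triples.
core = np.random.normal(85, 2, size=1000)
edge = np.random.normal(10, 1, size=1000)
edge[500:510] = 35  # brief anomaly, invisible to the static rule
print("static alarms   (core, edge):", static_rule(core).sum(), static_rule(edge).sum())
print("adaptive alarms (core, edge):", adaptive_rule(core).sum(), adaptive_rule(edge).sum())
```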
So how can we take advantage of this flood of data and turn it into actionable information? Traditional techniques for data mining and knowledge discovery are unsuited to uncovering the global data structure from local observations. When there is too much data, it is either aggregated globally (e.g., static semantic ontologies) or aggregated locally (e.g., deep learning architectures) to reduce data storage and data transport requirements. Either way, with simple aggregation methods we risk losing precisely the insights we are ultimately looking for.

To avoid this, our team has been exploring methodologies that enable data mining services in complex environments and allow us to extract useful insights directly from the data. We’re putting ourselves into the shoes of a consumer of telemetry data, treating the delivered data as a product: What are the business insights and associated values that the telemetry data offers? Which of the “dark data” offered by a router or switch, but typically left unexplored, provides interesting insights to the operator? We exploit topological methods to generate rich signatures of the input data, reducing large datasets to a compressed representation in lower dimensions, and use them to find unexpected relationships and rich structures.
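To give a flavor of what such a topological signature might look like in code, here is a minimal sketch that treats a window of multivariate telemetry as a point cloud and summarizes its shape with persistent homology. The use of the open-source ripser package, the synthetic data, and the windowing scheme are all assumptions made for illustration; the underlying concepts are introduced in the next episode.

```python
import numpy as np
from ripser import ripser  # pip install ripser -- one open-source persistent homology library

# Illustrative only: a sliding window of multivariate telemetry viewed as a
# point cloud, with one row per timestamp and one column per metric.
rng = np.random.default_rng(0)
window = rng.normal(size=(200, 8))                              # 200 samples x 8 metrics (synthetic)
window = (window - window.mean(axis=0)) / window.std(axis=0)    # normalize each metric

# Persistent homology condenses the "shape" of the point cloud into a small
# set of (birth, death) pairs -- a compact signature that is far smaller than
# the raw window yet captures clusters (H0) and loops (H1).
diagrams = ripser(window, maxdim=1)["dgms"]
print("H0 features:", len(diagrams[0]), "H1 features:", len(diagrams[1]))
```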
In the next episode, we will provide some background on topology and introduce the concepts behind Topological Data Analysis.