Big data analytics, like many new technologies, can suffer from hype and from the trough of disillusionment and scepticism that follows it.
If network data is big data, we have no option but to climb the slope of enlightenment. But is the problem we face in network data analytics really a big data problem?
To answer that, we first have to define what we mean by big data. In my words:
- Small data: traditional RDBMS data processing applications with siloed data sets, i.e. data held on a non-distributed file system
- Big data: more data than can be stored on a single node, and hence requiring a distributed file system
The Google File System (GFS) was the origin of the big data distributed file systems we use today. The challenge with a distributed file system is how to achieve performant processing at scale. MapReduce addressed this by taking the processing to the data, rather than vice versa. GFS and MapReduce are the genesis of the big data technologies we use today; subsequent work open sourced by Doug Cutting et al. at Yahoo gave rise to Hadoop and the broader big data analytics ecosystem. For a great write-up on the history, see: The history of Hadoop: From 4 nodes to the future of data.
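The "take the processing to the data" idea can be illustrated with a minimal, single-machine MapReduce-style word count in Python (a sketch only; in a real framework such as Hadoop, the map phase runs on the nodes where each data block is stored, and only the small intermediate pairs cross the network for the reduce phase):

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each mapper emits (word, 1) pairs for its local chunk of data
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # The reducer sums the counts for each key after the shuffle step
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# These chunks stand in for blocks stored on different nodes;
# mappers would run where the data lives, so only the compact
# (word, 1) pairs move across the network to the reducers.
chunks = ["big data big", "data big analytics"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(mapped))  # {'big': 3, 'data': 2, 'analytics': 1}
```

The design point is that the expensive part (scanning raw data) stays local to each node, while only the small aggregated results travel.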
With that groundwork out of the way, let’s get back to the question of whether network data is big data. We already see network operators producing hundreds of gigabytes of data per day. In addition, a number of factors are driving the need for higher-scale network data processing:
- Breaking down data siloes. In networking we have traditionally stored data in different systems, which form siloes both by type (e.g. metric data separate from event data) and by domain (e.g. data centre data separate from IP WAN data separate from optical data). If we’re really to understand what is going on in our networks and bring to bear the power of analytics, we need to break down these siloes and bring together the network dataset.
- Huge growth in traffic and devices. The Cisco VNI Global traffic forecast predicts a 3-fold increase in global IP traffic by 2020 and a >60% increase in devices and connections. This will drive a significant increase in the amount of data that network operators need to process.
- Transition to telemetry. In network data analysis we’ve been constrained by SNMP for too long. With the transition to model-driven telemetry, data is streamed from devices rather than needing to be polled, and that data is accompanied by a model which enables it to be dynamically interpreted. Model-driven telemetry removes the shackles of SNMP and enables a broader set of data to be produced more frequently.
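To make the telemetry point concrete, here is a simplified sketch of what a consumer of a streamed, model-keyed update might look like. The message shape, path, and counter names are illustrative assumptions, not a real protocol encoding; actual model-driven telemetry (e.g. gNMI over gRPC) streams updates keyed by YANG model paths:

```python
import json

# Hypothetical telemetry update pushed by a device; the "path" keys the
# values to a data model, so the consumer can interpret them dynamically
# instead of polling fixed SNMP OIDs on a timer.
message = json.dumps({
    "path": "interfaces/interface[name=eth0]/state/counters",
    "timestamp": 1609459200,
    "values": {"in-octets": 123456789, "out-octets": 987654321},
})

def handle_update(raw):
    # Flatten one streamed update into (full-path, timestamp, value) rows,
    # ready to be written to a time-series store.
    update = json.loads(raw)
    return [
        (f"{update['path']}/{counter}", update["timestamp"], value)
        for counter, value in update["values"].items()
    ]

for row in handle_update(message):
    print(row)
```

Because the device pushes updates rather than waiting to be polled, the same consumer scales to far more counters at far higher frequency than an SNMP poller could.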
As these factors compound, it’s inevitable that network data is big data.