Sometimes friends who are not techies, or even those who are but work in a different area, ask me what I do. I'll often simply say 'I build stuff that monitors and analyses the performance of IT environments'. I'll go on to explain that we're not talking about a handful of computers or servers in a local small business; we're talking about a lot of equipment and huge networks. To relate it to something they might know, I'll point to a large financial institution, or a multinational that would have an environment of the scale we're dealing with. Then they sometimes get a sense of it. Even for those of us in the business, it's worth pausing occasionally for an updated reality check on the volumes of data we are talking about in the typical large IT Operations center. Understanding this is key, since simply being able to handle the volumes comes before analysing them.
This year some colleagues carried out an internal study to refresh that understanding, as they do regularly. The numbers are quite impressive. For example, a typical large enterprise might run around 5,000 servers (physical or virtual). When the various infrastructure aspects that go along with these are considered, the networking, the storage, and the middleware and applications that run on top, they estimated that the IT Operations folks might expect to see over 1.3TB per day of data flowing out of such a system. Remember, this is just the data related to managing that overall environment, not the application data itself. It broke down to roughly 1TB per day of unstructured data in the form of logs, events, alarms and the like, with the remaining 0.3TB being performance data.
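To make those figures a little more concrete, here is a quick back-of-envelope calculation of what they imply per server. The 5,000-server and 1.3TB/day numbers come from the study above; the decimal TB-to-MB conversion is an assumption for simplicity.

```python
# Per-server share of the daily management-data volume (illustrative only).
SERVERS = 5_000
TOTAL_TB_PER_DAY = 1.3      # ~1TB logs/events/alarms + ~0.3TB performance data
UNSTRUCTURED_TB = 1.0
PERFORMANCE_TB = TOTAL_TB_PER_DAY - UNSTRUCTURED_TB

# Assuming decimal units: 1 TB = 1,000,000 MB.
mb_per_server_per_day = TOTAL_TB_PER_DAY * 1_000_000 / SERVERS
print(f"{mb_per_server_per_day:.0f} MB of management data per server per day")
```

That works out to roughly 260MB of monitoring and log data per server per day, before a single byte of application data is counted.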
These figures are just averages for a generic 'large' enterprise. If the enterprise's business is particularly information-centric, as in the case of a financial trading institution, the numbers can rise dramatically. One such institution in the study had close to 30,000 servers generating almost 60 million alerts and messages a day, along with 500GB of corresponding performance metric data. Some institutions have server counts running into the hundreds of thousands. That's a lot of potential data.
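It's worth translating that daily alert count into a sustained rate, since that is what an ingestion pipeline actually has to keep up with. A quick sanity check on the 60-million-a-day figure:

```python
# Sustained alert rate implied by 60 million alerts/messages per day.
ALERTS_PER_DAY = 60_000_000
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400

rate = ALERTS_PER_DAY / SECONDS_PER_DAY
print(f"~{rate:.0f} alerts/messages per second, sustained around the clock")
```

Roughly 700 events arriving every second, around the clock, and that is the average rather than the peak.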
Simply handling these volumes of data, as in receiving, moving, and archiving them reliably 24x7, is challenge enough, but it is not sufficient. Those capabilities are very much table-stakes at this stage of the game. The primary purpose of the data and the monitoring systems is to keep the target systems operational. We must know at the earliest possible time that problems are emerging, ideally before they become service-impacting. There's little thanks for confirming hours or days later that the data indicated problems existed, when the users have already experienced them. So real-time analytics is the order of the day: significant portions of those massive volumes of data must be analysed in real time if we want to get ahead of the problems and prevent, or at least limit, their impact.