Data protection in the era of massive data

Benjamin Franklin said that in this world nothing can be said to be certain, except death and taxes. In the world of technology, the need to preserve and protect information with backup policies is not quite as certain, but it comes close. That has not changed, and it is still a thankless task: we only remember it when there is a disaster, yet avoiding that disaster requires daily cost and effort.

What has changed is the nature and especially the volume of that information (and of the associated data services), which makes the task increasingly complex and, at the same time, makes it less certain that we can actually recover. This leads us to look for new data protection models and paradigms that make the task more effective, efficient, predictable … and cheaper.

Strategic, organizational and process-related limitations

In our experience, the main limitations companies face in protecting information, as far as strategy, organization and process are concerned, are the following:

Lack of a global view of the nature and criticality of information. Companies usually have a reasonably complete and up-to-date (not always!) blueprint of their IT systems, but very rarely maps of repositories and data flows that record their nature (e.g. source or derivative, conceptual model) and their criticality (based on business value).
Protection policies defined vertically by system, not horizontally by process logic. Linked to the above, data protection policies are commonly defined vertically, by the logical (or physical) system where the data is located, and not by the nature of the data itself. In many cases, the backup team is completely unaware of the nature of the data to which it applies backup policies.
Inability to treat data protection and system availability policies differently, but in combination. In many cases, system recovery plans (which allow a given information service to be restarted after a failure or disaster) and data protection plans (which allow a particular repository to be recovered after loss, accidental deletion or data corruption) are managed independently, with no connection between them.
Lack of risk analysis models and associated economic impact. Most organizations lack reasonably complete risk models for the two aspects mentioned above. Above all, they have not modeled risk with the metric that everyone understands: MONEY. As a result, data protection policies (and the return on the investment made) rest on subjective criteria, in some cases spurious ones.
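
As an illustration of what modeling risk in money can look like, here is a minimal Python sketch. It assumes the common simplification that expected yearly loss is the frequency of an event multiplied by the loss per event. The class, the field names and every figure are hypothetical and chosen only for the example; a real model would need business input for each of them.

```python
from dataclasses import dataclass


@dataclass
class RiskScenario:
    """One loss scenario for a repository or service (all figures are assumptions)."""
    name: str
    events_per_year: float   # expected frequency, e.g. 0.25 = once every four years
    downtime_hours: float    # time needed to restore service after the event
    cost_per_hour: float     # business cost of one hour of unavailability
    data_loss_cost: float    # cost of the data that cannot be recovered at all

    def loss_per_event(self) -> float:
        return self.downtime_hours * self.cost_per_hour + self.data_loss_cost

    def annual_expected_loss(self) -> float:
        return self.events_per_year * self.loss_per_event()


def annual_saving(before: RiskScenario, after: RiskScenario) -> float:
    """Reduction in expected yearly loss obtained by a protection measure."""
    return before.annual_expected_loss() - after.annual_expected_loss()


# Hypothetical figures for a single critical database.
unprotected = RiskScenario("orders DB, weekly copy only", 0.25, 48, 5_000, 200_000)
protected = RiskScenario("orders DB, replicated and versioned", 0.25, 4, 5_000, 0)

saving = annual_saving(unprotected, protected)
print(f"Expected loss avoided per year: {saving:,.0f}")
```

With these made-up figures, the protection measure pays for itself as long as it costs less than 105,000 per year. The point is not the numbers but that the decision becomes a comparison of two amounts of money instead of a subjective judgment.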

Technology limitations

On the technology side, the limitations stem from a rather unglamorous market (that of backup solutions) and from increasingly complex requirements:

Predictability of recovery capabilities and metrics. In general, backup tools do a more or less decent job of running backups. But ask the backup manager in the IT ops team of a large company whether they would put their hand in the fire (and stake their position) on being able to recover the information of a critical system, after a disaster, within the agreed service levels. If the answer is yes, ask whether they regularly run simulations and take metrics. Backing up is reasonably easy; restoring after a failure is the hard part of the story.
Heterogeneity of environments and backup solutions. Any medium or large company ends up with multiple systems and backup solutions, which complicates processes and increases overall risk. The adoption of cloud models has only added a further layer of complexity that many ‘traditional’ solutions do not manage efficiently.
Scalability with the volume of information to be protected. In big data we use the phrase ‘data has gravity’: that is, data weighs. Moving data or copying it for protection or recovery becomes technically more complex with every (incessant) increase in the volume of information to be managed.
Inability to manage according to the specific nature of the information to be protected. In many cases, for technical reasons, we apply a single fixed backup policy to repositories that hold data of very different natures, which should not share the same protection requirements (for example, highly volatile information alongside voluminous information with a low or zero change rate).
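
To make the last point concrete, the following Python sketch assigns a backup policy per dataset from its change rate, instead of one fixed policy per repository. It is only an illustration: the policy names, the 1% threshold and the sample datasets are assumptions, not features of any particular backup product.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    size_gb: float
    daily_change_rate: float  # fraction of the data that changes per day (0.0 to 1.0)


def choose_policy(ds: Dataset) -> str:
    """Pick a policy from the nature of the data (hypothetical tiers and threshold)."""
    if ds.daily_change_rate == 0.0:
        return "archive-once"        # immutable content: copy once, then only verify
    if ds.daily_change_rate < 0.01:
        return "weekly-incremental"  # large but nearly static content
    return "daily-incremental"       # volatile, transactional content


# Hypothetical contents of one enterprise database.
datasets = [
    Dataset("orders", 200, 0.05),
    Dataset("scanned_invoices", 5_000, 0.0),
    Dataset("product_catalog", 50, 0.002),
]

policies = {ds.name: choose_policy(ds) for ds in datasets}
for name, policy in policies.items():
    print(f"{name}: {policy}")
```

In this example the bulk of the volume (the scanned invoices) drops out of the recurring backup cycle altogether, which is exactly what a single fixed policy cannot express.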

Trends in the era of massive data

The unstoppable growth of the information to be processed and backed up, together with the advent of massive data analytics projects, adds pressure that cannot be relieved by scaling linearly. We need radical changes in the technological approach. Some lines of innovation we would highlight are the following:

Intrinsic protection. These solutions aim to substantially eliminate the need to make a backup copy with a conventional backup tool. Instead, the repository protects itself and can handle the different information recovery scenarios we may face (full recovery after a physical disaster, accidental deletion, recovery of an earlier version, guaranteed immutability of specific content for legal or functional reasons, …). There are many aspects to take into account, but in general these models rest on viewing a data repository as a succession of changes over time, with several geographically dispersed replicas.
Reduction of technological assets to be protected. As we noted among the limitations above, backup policies have traditionally been used to recover a specific service or repository, not only to protect its contents. So we back up the whole server, with all its elements, to ensure that we can restore it if there are problems (along with all the information it stores and serves). This is profoundly inefficient. The trend in software engineering and IT ops, with innovations such as virtualization long ago and containers more recently, is for services to be recreatable from a descriptor of their components, not from an exact copy saved ‘just in case’. It is as if, to protect our ‘mobility service’ by car, we kept an identical second car in the garage in case the first one breaks down or has an accident, and then also had to reproduce on that second car every small bump, scratch and sign of wear on the original so that the two stay exactly the same. If we could describe the process of building a car in an automated way, it would be much more efficient to simply keep that process up to date and, when our car fails, launch it to create a new one. In our example, naturally, ‘making’ something physical like a car is complex, expensive and time-consuming, but ‘making’ a virtual server or a container (especially the latter) is not. Therefore, if we design our software services properly, we only need to back up the data, not the systems that process it.
At Tecknolab, in our DBcloudbin service, we have reached the milestone that not a single physical or virtual server is backed up; we back up only service descriptors (the instructions for automatically building the car if necessary) and data repositories, which are physically decoupled from the servers that manage them (by two completely independent and heterogeneous means, one physical and one logical, each adding robustness to the recovery capabilities).
Combine capabilities in the same technical process (protection, availability and security). We have to keep running data protection, we must continue to ensure that our services are available after a disaster, and we must also ensure that the data we handle is not corrupted accidentally or intentionally (e.g. by ransomware). It therefore seems reasonable to use a single process to meet all three needs, simplifying operations and increasing efficiency. Some modern backup solutions can start a new instance of a service directly from the backup image of a system if disaster strikes the primary environment (thereby reusing the backup for a service availability scenario). In addition, since we have a whole sequence of backups for each repository, we can identify unexpected changes (for example, an abnormal rate of change in a repository’s data, based on a historical trend analysis) and flag them as a potential malware attack. There are already vendors on the market that provide this combination of functionality.
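
The change-rate check mentioned in the last item can be sketched in a few lines of Python. This is a deliberately simple version that assumes we record, for each backup, the fraction of the repository that changed, and that flags a backup whose change rate sits several standard deviations above the historical mean. The function name, the three-sigma threshold and the sample history are assumptions; commercial products will use their own, more elaborate trend analysis.

```python
from statistics import mean, stdev


def is_abnormal_change(history, latest, threshold_sigmas=3.0):
    """Return True if `latest` is far above the historical change rates.

    history: change rates of previous backups (fraction of data changed).
    latest:  change rate of the backup just taken.
    """
    if len(history) < 2:
        return False  # not enough history to estimate a trend
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return latest > mu  # perfectly flat history: any increase is unusual
    return (latest - mu) / sigma > threshold_sigmas


# Hypothetical daily change rates for one repository (about 1% per day).
history = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.011]

normal_day = is_abnormal_change(history, 0.012)
suspicious_day = is_abnormal_change(history, 0.45)  # 45% of the data rewritten
print(normal_day, suspicious_day)
```

A sudden rewrite of almost half the repository, as in the second call, is the kind of pattern mass encryption by ransomware would produce, so it is worth raising an alert before older, clean backups are rotated out.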

Conclusions

In summary, we cannot sustain a ‘business as usual’ model when the volume of information to be managed scales geometrically. This is especially true in the field of data protection. We must change the processes, the way of doing things and the solutions used to protect what is (increasingly) the most important asset of our companies: information.

At Tecknolab, we have taken advantage of our youth and lack of technical legacy to adopt highly efficient processes internally. Additionally, we have designed DBcloudbin, a solution for reducing the size of enterprise databases, which makes it easy to segment information that requires very different backup policies, simplifying the process with intrinsic backup techniques and dramatically reducing overall backup costs. We encourage you to try it and to contact us with any questions about it.
