Data protection in the era of massive data

Benjamin Franklin said that in this world nothing can be said to be certain, except death and taxes. Not quite as certain, but close: the world of technology still needs to preserve and protect information with backup policies. That has not changed, and it is still a thankless task (we only remember it when there is a disaster, yet avoiding one requires daily cost and effort).

What has changed is the nature and especially the volume of that information (and of the associated data services), which makes the task increasingly complex and, at the same time, makes it less certain that we can actually recover. This leads us to look for new data protection models and paradigms that make the task more effective, efficient, predictable … and cheaper.

Strategic, organizational and process-related limitations

In our experience, companies face the following main limitations in protecting their information, related to strategic, organizational or process aspects:

  • Lack of a global view of the nature and criticality of the information. Companies usually have a reasonably complete and up-to-date (not always!) blueprint of their IT systems, but very rarely maps of repositories and data flows that record their nature (e.g. source or derivative, conceptual model) and their criticality (based on business value).
  • Protection policies defined vertically by system, not horizontally by process logic. Linked to the above, data protection policies are commonly defined by the logical (or physical) system where the data is located, not by the nature of the data. In many cases, the backup team is completely unaware of the nature of the data to which it applies backup policies.
  • Inability to treat data protection and system availability policies separately but in combination. In many cases, system recovery plans (which allow a given information service to be restarted after a failure or disaster) and data protection plans (which allow a particular repository to be recovered after loss, accidental deletion or data corruption) are managed independently, with no connection between them.
  • Lack of risk analysis models and of their associated economic impact. Most organizations lack even moderately complete risk models for the two aspects above. Above all, they have not modeled the risk in the one metric that everyone understands: MONEY. As a result, data protection policies (and the return on the investment made) rest on subjective criteria; in some cases, spurious ones.
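
To make the last point concrete, here is a minimal sketch of what a risk model expressed in money could look like. It uses the common "expected annual loss" idea (probability of an event per year multiplied by its economic impact). The class, the field names and every figure are illustrative assumptions, not data from a real company.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RiskScenario:
    """One loss scenario for a repository. All figures are hypothetical."""
    name: str
    annual_probability: float       # chance the event happens in a given year
    downtime_hours: float           # expected time until the service is back
    data_loss_hours: float          # expected window of data that is lost
    cost_per_downtime_hour: float   # business cost of one hour without service
    cost_per_lost_data_hour: float  # business cost of re-creating one hour of data

    def impact(self) -> float:
        """Economic impact if the event happens once."""
        return (self.downtime_hours * self.cost_per_downtime_hour
                + self.data_loss_hours * self.cost_per_lost_data_hour)

    def expected_annual_loss(self) -> float:
        return self.annual_probability * self.impact()


def total_expected_annual_loss(scenarios: Iterable[RiskScenario]) -> float:
    return sum(s.expected_annual_loss() for s in scenarios)


def net_annual_benefit(loss_before: float, loss_after: float,
                       protection_cost: float) -> float:
    """Positive when the protection policy pays for itself."""
    return loss_before - loss_after - protection_cost


# Illustrative figures only.
current = [
    RiskScenario("site disaster", 0.02, 48, 24, 10_000, 5_000),
    RiskScenario("accidental deletion", 0.5, 4, 8, 2_000, 1_000),
]
print(round(total_expected_annual_loss(current)))
```

With a model like this, two candidate protection policies can be compared by the loss they avoid against what they cost per year, instead of by subjective criteria.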

Technology limitations

On the technology side, the limitations stem from an unglamorous market (that of backup solutions) and from increasingly complex requirements:

  • Predictability of recovery capabilities and metrics. In general, backup tools do a more or less decent job of running backups. But ask a backup manager in the IT ops team of a large company whether they would put their hand in the fire (and stake their position) on being able to recover, in case of disaster, the information from a critical system within the agreed service levels. If the answer is yes, ask whether they regularly run simulations and take metrics. Backing up is reasonably easy; restoring after a failure is the hard part of the story.
  • Heterogeneity of environments and backup solutions. Any medium or large company ends up with multiple systems and backup solutions, which complicates processes and increases overall risk. The adoption of Cloud models has only added a further factor of complexity that many ‘traditional’ solutions do not manage efficiently.
  • Scalability with the volume of information to be protected. In bigdata we use the phrase ‘data has gravity’: data weighs. Moving data, or copying it in order to protect or recover it, becomes technically more complex as the volume of information to be managed (incessantly) grows.
  • Inability to manage according to the specific nature of the information to be protected. In many cases, for technical reasons, we apply a single fixed backup policy to a repository that holds data of different natures, which should not share the same protection requirements (for example, highly volatile information alongside voluminous information with a low or zero change rate).
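
The first point in the list above asks whether restore simulations are run regularly and measured. As a rough sketch of the kind of metric that could be taken, the snippet below summarizes a set of restore drills against an agreed recovery time objective (RTO). The record structure, the field names and the sample values are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RestoreDrill:
    """Result of one restore simulation (hypothetical record layout)."""
    system: str
    restore_minutes: float
    data_verified: bool  # restored data passed an integrity check


def drill_summary(drills: List[RestoreDrill], rto_minutes: float) -> Dict[str, float]:
    """Share of drills that restored verified data within the RTO."""
    if not drills:
        raise ValueError("no drills recorded")
    met = [d for d in drills
           if d.data_verified and d.restore_minutes <= rto_minutes]
    return {
        "drills": len(drills),
        "met_rto_ratio": len(met) / len(drills),
        "worst_restore_minutes": max(d.restore_minutes for d in drills),
    }


# Illustrative drill history for one critical system, with a 120-minute RTO.
history = [
    RestoreDrill("erp", 90, True),
    RestoreDrill("erp", 150, True),   # too slow
    RestoreDrill("erp", 100, False),  # fast, but the data failed verification
    RestoreDrill("erp", 110, True),
]
print(drill_summary(history, rto_minutes=120))
```

A drill counts as successful only if the data is verified and the restore finishes in time, which is the commitment the service level actually makes.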

Trends in the era of massive data

The unstoppable growth of the information to be processed and backed up, together with the advent of massive data analytics projects, adds pressure that cannot be absorbed by scaling linearly. We need radical transformations in the technological approach. Some lines of innovation that we highlight are the following:

  • Intrinsic protection. These solutions seek to substantially eliminate the need to make a backup copy with a conventional backup tool. Instead, the repository protects itself and can handle the different information recovery scenarios we may face (full recovery after a physical disaster, accidental deletion, recovery of a previous version in time, assurance of the inalterability of specific contents for legal or functional reasons, …). There are many aspects to take into account, but in general these models rest on viewing a data repository as a succession of changes over time, with several geographically dispersed replicas.
  • Reduction of technological assets to be protected. As we mentioned when discussing the limitations, backup policies were traditionally used to recover a specific service or repository, not only to protect its contents. So we back up the whole server, with all its elements, to ensure that we can restore it in case of problems (with all the information it stores and serves). This is profoundly inefficient. The trend in software engineering and IT ops, with innovations such as virtualization long ago and containers more recently, is for services to be recreatable from a descriptor of their components, not from an exact copy saved ‘just in case’. It is as if, to protect our ‘mobility service’ by car, we kept a second, identical car in the garage in case the first one fails or has an accident, and then also had to reproduce on that second vehicle every small bump, scratch and sign of wear of the original so that it stays exactly the same. If we could describe the process of building a car in an automated way, it would be much more efficient to keep that process up to date and, when our car fails, launch it to create a new one. Naturally, ‘making’ something physical like a car is complex, expensive and slow; but ‘making’ a virtual server or a container (especially the latter) is not. Therefore, if we design our SW services properly, we only need to back up the data, not the systems that process it.
At Tecknolab, in our DBcloudbin service, we have reached the milestone that not a single physical or virtual server is backed up; only service descriptors (the instructions for automatically building the car if necessary) and data repositories, which are physically decoupled from the servers that manage them (by two completely independent and heterogeneous means, one physical and one logical, each adding robustness to the recovery capabilities).
  • Combining capabilities in the same technical process (protection, availability and security). We have to keep running data protection; we must continue to ensure that our services are available in case of disaster; and we must also ensure that the data we handle is not corrupted accidentally or intentionally (e.g. ransomware). It therefore seems reasonable to use a single process to meet all of those needs, simplifying operations and increasing efficiency. Some modern backup solutions can start a new instance of a service from the backup image of a system when disaster strikes the primary environment (thus reusing the backup for a service availability scenario). In addition, since we keep a whole sequence of backups of each repository, we can identify unexpected changes (for example, an abnormal rate of change in the data of a repository, based on a historical trend analysis) and flag them as a potential virus attack. There are already manufacturers in the market that provide this combination of functionalities.
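
The change-rate check mentioned in the last point can be sketched very simply. The snippet below flags a backup whose change rate is far above the historical trend, using a plain z-score; this is one possible technique chosen for illustration, not a description of how any specific product works, and the function name, threshold and sample values are assumptions.

```python
from statistics import mean, stdev
from typing import Sequence


def is_abnormal_change(history: Sequence[float], latest: float,
                       threshold: float = 3.0) -> bool:
    """Flag `latest` if it sits more than `threshold` standard deviations
    above the mean of the historical change rates."""
    if len(history) < 2:
        raise ValueError("need at least two historical values")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any increase is unexpected.
        return latest > mu
    return (latest - mu) / sigma > threshold


# Percentage of data changed between consecutive backups (illustrative).
past_change_rates = [1.0, 1.2, 0.9, 1.1, 1.0]

print(is_abnormal_change(past_change_rates, 1.15))  # within the usual range
print(is_abnormal_change(past_change_rates, 35.0))  # e.g. mass encryption
```

A sudden jump in changed data is only a warning signal; it still needs a human or a second check to distinguish an attack from a legitimate bulk load.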

Conclusions

In summary, we cannot sustain a ‘business as usual’ model while the volume of information to be managed scales geometrically. This is especially true in the field of data protection. We must change the processes, the way of doing things and the solutions in order to protect what is (increasingly) the most important asset of our companies: information.

At Tecknolab, we have taken advantage of our youth and lack of technical legacy to adopt highly efficient processes internally. Additionally, we have designed DBcloudbin, a solution for size reduction in enterprise databases. It makes it easy to segment information that requires very different backup policies, simplifying the process with intrinsic backup techniques and dramatically reducing overall backup costs. We encourage you to try it and to contact us with any questions about it.

 

Growth of non-relational data. Traditional databases are exploding.

Business applications evolve very fast. Functional requirements are more sophisticated, and we need to manage more non-relational data (photos, documents, images, videos …).
This need increases the volume of information to be handled by several orders of magnitude, and with it the complexity of software development and, above all, of systems operations.

Data growth

Traditionally, business applications consisted mainly of some type of user interface (with a more or less sophisticated forms technology) that allowed different users to enter and query data. In one way or another, that data ended up in a traditional relational database (Oracle, SQL Server, DB2, Informix, …). The complexity came from the fact that, depending on the application and the company, some tables could have millions of rows, and querying them by very varied criteria made query optimization (the famous ‘query plan’) a delicate matter. On the operational side, the headache was the ability to recover in critical situations (backup, replication, disaster recovery procedures, …).

That remained the case until relatively recently (a few years ago), when those applications became more sophisticated and had to cover other business demands. It was no longer enough to save all of a client’s data in the client record of the application; we also had to save, for example, the contract document signed between the parties and make it accessible from the application itself, or the mail messages, with all their attachments, exchanged with the client or supplier in a given business process. Software development departments have typically solved this requirement in one of three alternative ways:

  • We store this data in a file service and associate it with the database record through a link. This is reasonably simple, but it causes quite a few management problems, and technical ones too. We have two repositories to manage (whose backups must be synchronized, for example). If the data is sensitive, we need specific security policies in two media (database and operating system) that are not particularly simple to integrate. We also complicate transactional consistency: rolling back a complete database transaction after a failure is simple, but if we also have data in a filesystem, things get more complicated.
  • We store this data in a document management system or similar. In many respects this is quite close to the previous scenario, with the advantage that a document management system provides more and better management services, but also with greater complexity in operations and administration (we must manage, patch and upgrade two complex systems).
  • We store this non-relational data in the database. From a software engineering standpoint this is the simplest option: a single interface (SQL), transactional consistency, and a data type (BLOB) that can hold objects of any nature. It is the option many customers choose, particularly where the decision is made or influenced by the software engineering team.
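
The transactional advantage of the third option can be illustrated with a small sketch. It uses SQLite from the Python standard library only because it is self-contained; the table and function names are invented for the example, and an enterprise database would behave analogously with its own BLOB type.

```python
import sqlite3


def open_database() -> sqlite3.Connection:
    """Create an in-memory database with a customer and a contract table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE contract ("
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " document BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def save_customer_with_contract(conn, customer_id, name, pdf_bytes):
    """Insert the relational row and the binary document atomically."""
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "INSERT INTO customer (id, name) VALUES (?, ?)",
            (customer_id, name),
        )
        conn.execute(
            "INSERT INTO contract (customer_id, document) VALUES (?, ?)",
            (customer_id, pdf_bytes),
        )


conn = open_database()
save_customer_with_contract(conn, 1, "ACME", b"%PDF-1.4 signed contract")

try:
    # The document is missing, so the second INSERT violates NOT NULL
    # and the customer row inserted just before it is rolled back too.
    save_customer_with_contract(conn, 2, "Globex", None)
except sqlite3.IntegrityError:
    pass

print(conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0])
```

With the document in a separate file service, the equivalent rollback would require extra application code to delete the orphaned file.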

This last scenario means that these databases no longer handle only relational data; they must also manage very high volumes of binary content. And although the most advanced enterprise database technologies are capable of doing this, we quickly discover that the cost of infrastructure and operation skyrockets. These critical environments require infrastructure of the highest quality and speed, and that comes at a price.

Object store. New storage paradigm

This leads us to look for alternatives. Lately, with the explosion in the volume of data handled, Object Stores have become a very popular alternative for storing files and binary content, and the Amazon AWS S3 service has become a de facto standard. It seems sensible to move that binary content to S3 or a similar object store, with several clear and immediate benefits:

  • Unlimited storage with virtually no management required.
  • Much lower costs.
  • Possibility of exploiting these contents in alternative scenarios (for example, advanced analytics, machine learning) without having a direct impact on the databases of the transactional systems.
  • Simplification or elimination of conventional backup needs.
  • Possibility (depending on the technology) to apply information retention policies that facilitate compliance with data retention regulations, so that the repository itself guarantees that content cannot be altered or deleted during the defined period.

The advantages are multiple, but there are also drawbacks. The main one is that these systems are accessed through an API which, without being very complex, forces us to change the entire data access and persistence layer of our business applications in order to save and retrieve information in this repository. That can be tedious, error-prone work with a degree of risk proportional to the complexity and obsolescence of our applications. And, almost more importantly, it diverts our (always scarce) software engineering and development resources to an internal IT problem that provides no direct functional value to the business user.
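
To give an idea of the kind of change this implies in the persistence layer, the sketch below stores only a key in the database and the content in an object store. The `InMemoryObjectStore` class is a hypothetical stand-in for a real object store client (it is not the S3 API), and the table and function names are invented for the example.

```python
import sqlite3
from typing import Dict, Optional


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store client (put/get by key)."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def open_database() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE contract_ref ("
        " customer_id INTEGER PRIMARY KEY,"
        " object_key TEXT NOT NULL)"
    )
    return conn


def save_contract(conn, store, customer_id: int, pdf_bytes: bytes) -> str:
    key = f"contracts/{customer_id}.pdf"
    store.put(key, pdf_bytes)  # step 1: the content goes to the object store
    with conn:                 # step 2: only the key goes to the database
        conn.execute(
            "INSERT INTO contract_ref (customer_id, object_key) VALUES (?, ?)",
            (customer_id, key),
        )
    return key


def load_contract(conn, store, customer_id: int) -> Optional[bytes]:
    row = conn.execute(
        "SELECT object_key FROM contract_ref WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return store.get(row[0]) if row else None


conn = open_database()
store = InMemoryObjectStore()
save_contract(conn, store, 7, b"%PDF-1.4 demo")
print(load_contract(conn, store, 7))
```

Every read and write of binary content in the application has to be rewritten along these lines, and the two steps in `save_contract` are no longer covered by a single database transaction.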

Databases meet object stores

In this context, at Tecknolab we set out to provide a solution that moves binary content to the different object store options, both in public Cloud services (Amazon AWS, Microsoft Azure, Google Cloud Platform) and, for an ‘on-premise’ deployment in a local datacenter, on the main object storage manufacturers’ alternatives (Hitachi HCP, Dell-EMC ECS, IBM COS, among others). With this service, called DBcloudbin, configuration at the database is immediate and, more importantly, transparent to the application: with the same software, the application continues to access the data through the database using SQL as before, while in reality the system reads the data that has been moved to the object storage and provides it to the application as if it were in the database. This gives us all the benefits of having the data centralized in the database (single access, transactional consistency) with the savings of a much cheaper infrastructure for data that does not need the access speed of a relational database. For more details of the solution, visit https://www.dbcloudbin.com/solution

BigData strategy. How to start?

A bigdata initiative must start with a bigdata strategy. Here we outline the approach we recommend.

Today, a large number of companies of all sectors and sizes (although especially the largest ones) are launching bigdata initiatives, partly because of competitive pressure and new business challenges and partly, why hide it, because of a certain ‘fashion’ or peer pressure (if my competitors are getting into this, so will I; I will not be left behind …).
The reality, surprising as it may seem, is that, as recent IDC studies show, most companies are starting or planning to start bigdata initiatives in the short term but do not know where to begin.
In this context, the results are often tragic, because companies make big mistakes that cost a lot of money. One of the most common is to start with the technological component: “We set up a DataLake”. The result, as I say, is tragic, partly because the sales wizards of the sector have created a series of myths and legends that have been internalized.

Myth 1: New technologies based on Hadoop are cheap and are deployed in a jiffy.

This is substantially false (or at best a partially correct oversimplification), and it leads us into an enthusiastic technical race without a well-built strategy and, of course, without a business case or a moderately robust economic model. Basically we apply a false mental model, curiously common among many highly paid top executives: if the technology is much cheaper than the one I usually use (typically a DWH in this case), we should save money, no matter how we go about it. In this context Murphy’s Law always comes out triumphant, and in most cases the outcome is a lot of budget consumed with no ‘real’ result to show for it.

Initial Recommendation: Strategy and economic model first. Processes and organization later. Technology, at the end. Experiment, prioritize and monitor (adapt and adjust your model).

These are concepts that I have been using for many years in transformation consultancy in every technological area, but they are more applicable than ever to this field.
A great weakness, particularly in Latin countries, is our proverbial aversion to strategy. We are people of action. Planning is for cowards. The adaptability and improvisation of the Latin character is a great asset in my opinion, but only when properly integrated into robust strategic planning.
In this sense, a first mistake is to confuse a ‘data-driven’ business with bigdata technology. One does not necessarily imply the other. We must identify our data-driven business scenarios and how we are going to execute them. We will adopt new bigdata technologies only if we have a clear justification for doing so, and we will model it (I do not mean technically, but with a business case; I will talk about it in more detail in later posts of this blog).
Our approach to the bigdata strategy is a bidirectional model (top-down and bottom-up). The top-down business model identifies the ‘data-driven’ business scenarios and models them economically (business value) and in terms of data requirements (what data and potential analytical models I need to implement them). The bottom-up model catalogs the data sources (potential or real) that we have in our business processes or that we need in order to solve our business use cases. The intersection of the two gives us the feasibility analysis and a first cost model. At this point we can make decisions and translate them into a strategic plan; that is, essentially, which scenarios we tackle first (the cheapest among those with the most impact), how we are going to monitor progress (what are my KPIs?) and how we are going to feed our model with reality as we execute it, so that we can adjust our expectations. And have fun!
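
As a rough illustration of that intersection, the sketch below takes scenarios from the top-down exercise (value, cost, required data) and the catalog of data sources from the bottom-up exercise, keeps the feasible scenarios and ranks them by value per unit of cost. The ranking rule, the names and all figures are assumptions for the example; a real business case would be richer.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Scenario:
    """A data-driven business scenario from the top-down analysis."""
    name: str
    business_value: float       # estimated annual value (hypothetical units)
    estimated_cost: float       # estimated cost to implement
    required_sources: Set[str]  # data sources the scenario depends on


def prioritize(scenarios: List[Scenario],
               available_sources: Set[str]) -> List[Scenario]:
    """Keep scenarios whose data exists, best value/cost ratio first."""
    feasible = [s for s in scenarios
                if s.required_sources <= available_sources]
    return sorted(feasible,
                  key=lambda s: s.business_value / s.estimated_cost,
                  reverse=True)


# Bottom-up: data sources we actually have (illustrative).
catalog = {"crm", "billing", "transactions"}

# Top-down: candidate scenarios (illustrative figures).
candidates = [
    Scenario("churn prediction", 300, 100, {"crm", "billing"}),
    Scenario("fraud detection", 500, 400, {"transactions", "device_logs"}),
    Scenario("upsell targeting", 120, 30, {"crm"}),
]

for scenario in prioritize(candidates, catalog):
    print(scenario.name)
```

Scenarios that drop out for lack of data are not discarded; they show which data sources would be worth acquiring next.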