Data protection in the era of massive data

Data protection in the era of massive data

Benjamin Franklin said that in this world nothing can be said to be certain, except death and taxes. We could say that although not so deterministic but quite close, the world of technology still needs to preserve and protect information with backup policies. That does not change and it’s still a very unpleasant task by the way (we only remember when there is a disaster, but to avoid it, it requires a daily cost and effort).

What has changed is the nature and especially the volume of that information (and associated data services) making it increasingly complex to perform that task. At the same time, a greater uncertainty in ensuring the capacity for recovery. This leads us to look for new models and data protection paradigms that make the task more effective, efficient, predictable … and cheaper.

Strategic, organizational and process-related limitations

From our experience, we detect in companies the following main limitations in the protection of information, related to strategic, organizational or process aspects:

  • Lack of a global view of the information nature and criticality. Companies usually have reasonably complete and updated (not always!) IT system blueprint but very infrequently maps of repositories and data flows, with their nature (e.g. source or derivative, conceptual model) and their criticality (based on business value).
  • Protection policies defined vertically by systems, not horizontally by process logic. Linked to the above, it is common to find that data protection policies are made vertically by the logical (or physical) system where is located, and not by the nature of the source. In many cases, the backup team is completely unaware of the nature of the data to which they are applying backup policies.
  • Inability to treat differently, but in combination, data protection policies and system availability. In many cases, systems recovery plan (which allow a given information service to be restarted in case of failure or disaster) and data protection plans (which allows recovering a particular repository in case of loss, accidental deletion or data corruption) are managed independently and disconnected.
  • Lack of risk analysis models and associated economic impact. Most organizations lack moderately complete risk models related to the two previously mentioned aspects. And above all they do not have it modeled with the metric that everyone understands: MONEY. This means that the data protection policies (and the return of investment made) are based on subjective criteria; in some cases, spurious.

Technology limitations

On the other side, technology limitations derived from a market of little glamour (that of backup solutions) and a growing complexity in the requirements are:

  • Predictability of recovery capabilities and metrics. In general, backup tools do a more or less decent job in running backups. But ask a backup manager in the IT ops team of a large company if he/she put his/her hand in the fire (and his/her position) for being able to recover in case of disaster, the information from a critical system with the service levels agreed upon. If the answer is yes, ask if they regularly run simulations and take metrics. Backing up is reasonably easy; restoring on failure is the hard part of the story.
  • Heterogeneity of environments and backup solutions. Any medium / large company ends up having multiple systems and backup solutions, complicating the processes and increasing the overall risk. The adoption of Cloud models has only added an additional factor of complexity that many ‘traditional’ solutions do not manage efficiently.
  • Scalability with the volume of information to be protected. In bigdata we use the phrase ‘data has gravity’. That is, they weigh. Moving data or making copies for protecting or recovering them does nothing but increase the technical complexity of the process with the (incessant) increase in the volume of information to be managed.
  • Inability to manage depending on the specific nature of the information to be protected. In many cases, our backup policies, for technical reasons, use fixed policies when we find in the repositories different data with different nature and that should not have the same protection requirements (for example , information of high volatility with voluminous information, but with low or no change rate).

Trends in the era of massive data

The unstoppable information growth to be processed and backed-up and the advent of massive data analytics projects lead to additional pressure, which can not be solved in a linear fashion. We need radical transformations in the technological approach. Some lines of innovation that we highlight are the following:

  • Intrinsic protection. What this solutions look for is substantially eliminating the need to make a backup copy with a conventional backup tool. Instead, the repository self-protects and is able to provide a solution to the different scenarios of information recovery that we can have (full recovery due to physical disaster, accidental deletion, recovery of a previous version in time, assurance of the inalterability of specific contents for legal or functional reasons, …). There are many aspects to take into account, but in general the vision of a data repository as a succession of changes over time with several geographically dispersed replicas, is the basis for addressing these models.
  • Reduction of technological assets to be protected. As we mentioned in the section on technology limitations, backup policies were traditionally used to recover a specific service or repository, not only to protect the contents. So, we make a backup of our whole server, with all its elements, to ensure that we can restore it in case of problems (with all the information it stores and serves). This is profoundly inefficient. The trend in software engineering and IT ops with innovations such as virtualization a long time ago and containers more recently is that services are recreatable with a descriptor of the components, not because we have saved an exact copy ‘just in case’. It is as if to protect our ‘mobility service’ by car, we have a car that is what we use and we have another exactly the same in the garage in case the first one fails or has an accident. But we also have to pass the work of doing in that second vehicle all the small bumps, scratches, wear that the original car has, so that it is exactly the same. If we were able to describe in an automated way the process of building a car, it is much more efficient to simply have that process updated and, when our car fails, to launch the process of creating a new one; naturally in our example, ‘making’ something physical like a car is complex and expensive over time; but ‘making’ a virtual server or container (especially the latter) is not. Therefore, if we design our SW services properly, we can only back-up the data, not the systems that execute them. At Tecknolab, in our DBcloudbin service, we have reached the milestone that there is not a single physical or virtual server that is backed up; only service descriptors (instructions for automatically building the car if necessary) and data repositories, which are physically decoupled from the servers that manage them (by two completely independent and heterogeneous means, one physical and one logical providing each one a robustness added to the recovery capabilities).
  • Combine capabilities in the same technical process (protection, availability and security). If we have to keep running data protection; we must continue to ensure that our services are available in case of disaster; we must also ensure that the data we handle is not corrupted accidentally or intentionally (e.g. ransomware), it seems reasonable to take advantage of a single process to meet those needs, simplifying and increasing efficiency. Some modern backup solutions allow that on the backup image we have made of a system, we can start a new version of the service in case of disaster in our primary environment (therefore taking advantage of this backup for a scenario of service availability); In addition, as we have a whole sequence of backups in each of our repositories, we can identify unexpected changes (for example, abnormal rate of change in the data in that repository based on a historical trend analysis) and warn of this as a potential virus attack. There are already manufacturers that provide this combination of functionalities in the market.

Conclusions

In summary, we can not sustain the model of ‘business as usual’ with the scenario of geometric scaling of information volume to be managed. This is especially true in the field of data protection. We must change the processes, the way of doing things and the solutions to protect what is (and increasingly) the most important asset of our companies: information.

At Tecknolab, we have taken advantage of our youth and lack of technical legacy to internally adopt highly efficient processes. Additionally, we have designed a solution for size reduction in enterprise databases, DBcloudbin, which allows to easily segment information to which we must apply very different information backup policies, simplifying the process with intrinsic backup techniques, dramatically reducing the overall backup costs.We encourage you to try it and contact us for any questions about it.

 

Growth of non-relational data. Traditional databases are exploding.

Growth of non-relational data. Traditional databases are exploding.

Business applications evolve very fast. The functional requirements are more sophisticated and we need to manage more non-relational data (photos, documents, images, videos …).
This need increases by several orders of magnitude the volume of information to be handled, as well as the complexity at software development and, above all, systems operations.

Data growth

Traditionally, business applications consisted mainly of some type of user interface (with some kind of more or less sophisticated forms technology) that allowed different users to interact with the environment to enter and consult data. In one way or another, data ended up in a traditional relational database (Oracle, SQL Server, DB2, Informix, …). The complexity came from the fact that, depending on the application and the company, some of these tables could have millions of rows and the query of data by very varied criteria resulted in very sensitive criteria for optimizing the queries (the famous ‘query plan’ ). In the most operational aspects, the headache was the ability to recover in critical situations (backup, replication, disaster recovery procedures, …).

That has remained so until relatively recently (a few years), when those applications were becoming more sophisticated and needed to cover other business demands. It was not enough to save all the data of our client in his client record at the application, we also had to save, for example, the contract document signed between the parties and had to be accessible from the application itself; or the mail messages, with all their attachments, that we have exchanged with the client or supplier in a certain business process. This requirement has been solved from the SW development department of the companies, typically by one of three alternative ways:

  • We store this data in a file service and associate it in the database through a link. This is reasonably simple but it gives quite a few management problems, and also technical ones. We have two repositories to manage (which synchronize backups, for example). If the data is sensitive, we will need specific security policies in two media (database and operating system) that are not integrated in a particularly simple way; we also complicate the transactional consistency (canceling a complete transaction in database due to a failure is simple, if we also have data in a filesystem, things get more complicated).
  • We store this data in a document management system or similar. In many aspects it is quite close to the previous scenario, with the advantage that a document manager system provides more and better management services, but also a greater complexity in operations and administration (we must manage, patch, upgrade  two complex systems).
  • We store this non-relational data in the database. This is, at the level of software engineering, the simplest; unique interface (SQL), transactional consistency, a data-type (BLOB) that allows to handle objects of any nature. In many cases it is the option chosen by many customers. In particular, where the decision is directed or influenced by the software engineering team.

This last scenario leads us to the fact that these databases no longer only handle relational data, but that they must manage very high volumes of binary content. And, although the most advanced enterprise database technologies are capable of doing this, we quickly discovered that the cost of infrastructure and operation is skyrocketing. These critical environments require infrastructure of the highest quality and speed and that is paid for.

Object store. New storage paradigm

This leads us to look for alternatives and lately, with the explosion of the volume of data handled, the alternative of Object Stores for storing files and binary content has become very popular, where the Amazon AWS S3 service has become a de-facto standard. It seems sensible to move those binary contents to S3 or similar object storage with several clear and immediate benefits:

  • Unlimited storage with virtually no management required.
  • Much lower costs.
  • Possibility of exploiting these contents in alternative scenarios (for example, advanced analytics, machine learning) without having a direct impact on the databases of the transactional systems.
  • Simplification or elimination of conventional backup needs.
  • Possibility (depending on the technology) to apply information retention policies that facilitate compliance with data retention regulations, so that the repository itself ensures the inalterability and prevents its deletion during the defined period.

The advantages are multiple; but there are also drawbacks. The main one is the fact that these systems have access through an API that, without being very complex, forces us to change the entire data access and persistence layer in our business applications to save and access the information in this repository. And that can be a tedious job, subject to errors and with a certain risk, proportional to the level of complexity and obsolescence of our applications. And almost more important: it involves diverting the resources of engineering and software development in our company (always scarce) to solve an internal problem of IT, which does not provide a direct functional value to the business user.

Databases meet object stores

In this context, at Tecknolab we have proposed to provide a solution that allows to move binary content to the different options of object store repositories, both in public Cloud services (Amazon AWS, Microsoft Azure, Google Cloud Platform) and , for an ‘on-premise’ deployment in a local datacenter, with the main object storage manufacturer alternatives (Hitachi HCP, Dell-EMC ECS, IBM COS, among others). With this service, called DBcloudbin, the configuration at the database is immediate and, what is more important, transparent for the application; with the same software, the application continues to access the data through the database using SQL as before, but in reality the system is responsible for reading the data that has been moved to the object storage and providing it to the application as if were in the database. This gives us all the benefits of having the data centralized in the database (single access, transactional consistency) but with the savings of using a much cheaper infrastructure for those data that do not need the access speed of a relational database . For more details of the solution, visit https://www.dbcloudbin.com/solution

5 limitations that surprise in Public Cloud (and it’s not security)

5 limitations that surprise in Public Cloud (and it’s not security)

The adoption of public Cloud in companies can lead to identify some surprising limitations. I propose five (and none of them is security).

There is a clear trend in companies of all sizes to move loads to the different IaaS services of Public Cloud that we find in the market, being Amazon AWS, Microsoft Azure and Google Cloud Platform the main global offers (with a clear leadership as of today from AWS).
When a company makes that decision, it will usually have done some pilots and proof of concept with the solution and, we suppose and wish, made some numbers to define the business case.

My experience with this process is that there are at least five things that surprise us because we can believe that it should be resolved or be simple, when we embark on this trip. And no, none of them is the security, everlasting ‘sin’ that has suffered historically and in my opinion little founded (in general it is safer than many alternatives of private service). These are five things that you will foreseeably clash when adopting public cloud in  a company of a significant size, not an individual user or a small organization or work group:

1.- Define consumption quotas or resources to limit the use to a specific budget
In a private environment, resources are explicitly delimited. They are what they are. A great advantage of the public cloud is that the capacity is (almost) unlimited and we can grow as much as we want. But in a large company, this is a double-edged sword and it is not uncommon for our organizational model to set limits for organizations or groups that, in fact, map the budgets assigned to each of these groups. Well, that which seems obvious, is not so easy to do because the main public Cloud services do not allow to define quotas, only alarms (and in many cases can not be managed easily by the resources of each group or organization of our company). These alarms do not prevent the additional consumption of resources, they only warn.

2.- Predictability of the monthly cost
We know it, or we guess it when we embark on the Cloud; this is pay-per-use in the strictest sense of the term. In principle, once again, it’s good, we only pay for what we use. But in most companies, especially in the finance&control department, the unpredictability of a cost generates many nerves. The first thing that is required to the responsible for a certain service is a figure for forecasting the expenditure, often in the long term. If we combine this with the previous point, we see that this predictability is much lower than what we could have considered at the beginning and will possibly require economic models that are much more thoughtful and sophisticated than we would have thought at first.

3.- Transparency and ease in the allocation of costs
One of the principles of the Cloud model, is cost transparency. Yes it is. Since you are billed for what you consume it is obviously easy to receive an invoice with all the items billed and the cost that it entails. And maybe get scared with the minutes. But in a medium / large company that is not enough, one has to charge-back these costs to the different organizations and cost centers that compose it. And there the thing gets more difficult, with little intrinsic help from the tools of the service provider. It can become a real hell and a significant cost in effort and time to map those costs internally.
I remember a large long-established Spanish multinational company that came, a week later, to ask us to rescue from a scrap warehouse a server that had been decommissioned and that was actually used by ‘someone’. An organization with that level of internal control, imagine how it can suffer in a model like this.
The proof that this is a real problem is that there has been a niche market of companies that have generated SaaS solutions whose main purpose is to help customers control and manage their public cloud costs, integrating more easily the different interfaces to obtain information and report it appropriately for the client and its organizational model. And that has an additional cost, of course.

4.- Comparison of costs between Clouds
A virtual machine is a virtual machine. It is invented, it does not have big differences and it works more or less the same in the basics, no matter if we run it in a virtual infrastructure based on vSphere of vmware in our private environment or in an instance of AWS in Ohio or in an Azure node in Ireland. In addition, the typology of the service (as a IaaS service) is also very similar at least in what refers to the extent to where the provider responsibility ends and where starts ours.
All public cloud services have public and reasonably transparent pricing models (this is the Cloud …) so we can potentially do fair comparisons (supposed assured the level of performance or any other characteristic that is relevant to us) instances are similar). Price comparisons seem easy: if option A is 10% cheaper, then the global service for our company will be 10% cheaper (more or less). It’s that simple.
Nothing is further from reality. There are multiple factors that knock down this simplistic assertion but I will detail only two: (1) on the one hand the cost model is much more complex than the one of ‘virtual machine price per hour’. There are dozens of billable concepts (from IP addresses to network traffic or monitoring concepts) that are not homogeneous and can vary costs significantly (like 10 or 15%); (2) on the other hand, there are discounts for sustained use, which have three radically different models among the three main players (AWS discount by the a priori reserve of instances that is also managed by type of instance further complicating the model; Google uses a automatic discount model for permanent instances and Azure integrates it into the management of the global licensing contract that 90% of large and medium-sized companies have with Microsoft, the Enterprise Agreement).

5.- The public cloud is the cheapest option
A massive service such as AWS, Azure or Google is the one that, in principle, has all the ingredients to make the most of the economy of scale. This added to the fierce competition in this market leads to the logical scenario that the cost of a Public Cloud service must necessarily be lower than other alternatives (which will have their advantages for other factors but can hardly compete on price).
Well … not always. My own experience and others that I have had the opportunity to listen to, after reasonably sophisticated comparisons, have come to the conclusion that it may be more expensive. This point would give for an encyclopedia and in general every organization is a world and its situation and scenarios are too specific to be able to assert in a resounding and generalized way if it is cheaper or more expensive. My reflection and summary is that (1) the more mature, flexible and advanced is the application architecture and the systems operations & management in a company, the more likely it is to make a public Cloud service profitable and (2) do model reasonably and as complete as possible your scenarios and costs so that you can assess correctly if it is an advantageous option at the level of costs for your company. And yes, it is something that is complex enough to analyze so that some of us earn our living with it.