Friday, July 25, 2014

What you know about the ETL process is wrong

The ETL process 
The title you have just read is deliberately provocative. Of course, not everything you know is false. My intention is to look at things from another point of view: not to take anything for granted, and to read critically some axioms that are typical of the Data Warehouse world.
I will try to provide a different view of reality by questioning the individual letters of the ETL paradigm. It is therefore necessary to investigate the meaning of the ETL process in more detail.
We can find many definitions of the ETL process. In general, the expression refers to the process of Extraction, Transformation and Load of input data into a Synthesis System (Data Warehouse, Data Mart ...) used by the end users.
This is a very general definition, which does not help us understand the work we face. A simple diagram can help to understand the process.

The data held in the structures of the Operational Systems (OLTP) are extracted, transformed and loaded into the Data Warehouse structures.
In recent years, another definition of the loading process has also emerged: ELT. It differs in the inversion of the "L" and the "T", that is, the transformation phase is implemented AFTER the loading phase.
This trend is related to the need to load increasingly large amounts of data, and to the ability of the ETL tools to handle this data. Many data transformations, perhaps performed "on the fly", i.e. in memory or with the help of temporary tables, can be problematic.
It is less problematic to load the data file as it is into a staging table, and then apply the transformations to it.

The ambiguity of the ETL and ELT processes
As part of my considerations on the Micro ETL Foundation, the ELT approach is more in line with its philosophy. There must be a close relationship between the input data file and the Staging table. This relationship must be 1:1 and the flow must be as complete as possible.
Although the ELT approach is better, this does not mean that it is the correct one. Of course, on the Internet you can find various articles and comments on the pros and cons of the two approaches.
In my view, however, the reality is different. The problem is not deciding whether to make the transformations before or after loading. The problem is that both processes need to be revised. This is because, if we look carefully:

  1. The extraction step does not exist.
  2. The configuration and acquisition step is missing.
  3. The transformation phase is not convenient.
  4. It is not clear how to do the loading, and where to do it.

Thus, although we can continue to speak generically of ETL (or ELT), because the acronym has been universally known for years, we must be aware that the name is misleading if you want to set a baseline with the three phases in a project Gantt chart, with the associated estimates.
Let us then justify the four previous points.

1 - The extraction step does not exist
Is there really an extraction activity that is the responsibility of the DWH development team? I think not. In most cases, the feeding systems are external systems that reside on mainframes, perhaps with different operating systems, databases and programming languages.
The extraction phase of the data, the "E" of the word Extract, is always the responsibility of the feeding system, which knows how to produce the flow. The Data Warehouse team must instead deal with two activities.

  • The activity of Acquisition or Transfer, namely the placement and storage of the input data files in well-defined folders on the DWH server, all with a pre-established naming convention.
  • The analysis of the contents of the data file, that is, what the feeding system must produce. This is if we are lucky. Otherwise, because generating new data files costs money, you will have to reuse or integrate already existing data files.

Most Data Warehouse projects interface with the external systems through the transfer of data files. CDC (Change Data Capture) situations are less frequent, and do not cover the whole loading phase.
There are also rare cases in which the DWH team builds the extraction statements and runs them directly, using a database link.
This should not be done, for security reasons, for performance reasons (who knows the indexing structure of the external systems?) and for reasons of liability (if the data are not loaded, where is the problem?).
It should also be avoided for scalability reasons. In times of budget cuts, it is increasingly common for the "IT people" to change the transactional systems or part of the source systems.
Having a stable source configuration, to which the external systems must adapt, is definitely a choice that preserves stability.

2 - The configuration and acquisition step is missing
The first step to be taken into account (and it is not simple) is the definition of the data files and their configuration in the metadata tables. The feeding system will provide us with the definitions, using Word documents, Excel, PDF or other formats.
We must also give each data file a unique, non-numeric identifier, valid for all feeding systems. It is important that the name is unique.
If we have a data file of financial operations, let's call it, for example, TMOV. If we have multiple data files, such as daily, monthly, quarterly, etc., let's call them DTMOV, MTMOV, QTMOV. If we have two systems that provide the daily financial operations, let's call them XDTMOV and YDTMOV to distinguish them, but we must always have a unique name as a reference. On it we will build a primary key.
In this phase, we will have to configure all of the characteristics of the data files, not only their columnar structure.
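
As an illustration of this configuration step, here is a minimal sketch of two metadata tables built with Python's standard `sqlite3` module. The table names `file_config` and `file_column`, and all of their columns, are hypothetical; the only point taken from the text is that the unique file name (XDTMOV, YDTMOV, ...) is the primary key.

```python
import sqlite3


def create_metadata(conn: sqlite3.Connection) -> None:
    """Create the (hypothetical) metadata tables describing the data files."""
    conn.executescript(
        """
        CREATE TABLE file_config (
            file_id       TEXT PRIMARY KEY,   -- unique, non-numeric name
            source_system TEXT NOT NULL,      -- feeding system
            frequency     TEXT NOT NULL CHECK (frequency IN ('D', 'M', 'Q')),
            delimiter     TEXT NOT NULL,
            header_rows   INTEGER NOT NULL
        );
        CREATE TABLE file_column (
            file_id   TEXT NOT NULL REFERENCES file_config (file_id),
            position  INTEGER NOT NULL,
            name      TEXT NOT NULL,
            data_type TEXT NOT NULL,
            PRIMARY KEY (file_id, position)
        );
        """
    )


def register_file(conn, file_id, source_system, frequency,
                  delimiter, header_rows, columns):
    """Configure one data file: its general features and its column list."""
    with conn:  # commit on success, roll back if any insert fails
        conn.execute(
            "INSERT INTO file_config VALUES (?, ?, ?, ?, ?)",
            (file_id, source_system, frequency, delimiter, header_rows),
        )
        conn.executemany(
            "INSERT INTO file_column VALUES (?, ?, ?, ?)",
            [
                (file_id, position, name, data_type)
                for position, (name, data_type) in enumerate(columns, start=1)
            ],
        )
```

Registering the same `file_id` twice raises `sqlite3.IntegrityError`, which is exactly the guarantee we want from the unique name.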

3 - The transformation phase is not convenient
We now analyze the letter "T", that is, the "Transform" component of the process. My opinion is that we should not talk about transformation, but about enrichment of the data.
To transform the data means to make them different from the original: as a consequence, the data become difficult to check.
We must always be able to demonstrate that the data we received as input are identical to what we loaded into the Data Warehouse. Immediately after the deployment into production, we will certainly have to answer several check requests.
If the original data have been transformed, we will have to spend a lot of time restoring the original data files (maybe already archived on tape) and redoing the tests. If we preserve the original data and enrich them with the result of the transformation, we will be able to respond faster and more efficiently. So my suggestions are:

  • Keep the original data in the Staging Area tables (and, if possible, even afterwards).
  • Do not make changes to the existing data, but add columns that contain the result of the transformation.
  • Enrichment is the right word. I perform the enrichment by transforming or aggregating different data, according to the requirements.
  • Implement the enrichment step not as a staging phase, but as a post-staging phase, i.e. only at the end of the whole loading of the Staging Area. This is because the enrichment often involves the use of data from other staging tables. To avoid implementing precedence rules or supervising arrivals, it is certainly preferable to wait for the completion of the entire staging process.
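
A minimal sketch of enrichment as a post-staging step, again with `sqlite3`. The tables `stg_dtmov` and `stg_rates` and the columns `currency`, `amount`, `rate_to_eur` and `amount_eur` are invented for the example; what matters is that the original columns are never updated and the result goes into a new column.

```python
import sqlite3


def enrich_staging(conn: sqlite3.Connection) -> None:
    """Add an enrichment column to stg_dtmov without touching original data.

    Assumes it runs post-staging, when both (hypothetical) staging tables
    stg_dtmov and stg_rates have been completely loaded.
    """
    with conn:
        # New column for the result; 'amount' and 'currency' stay as received.
        conn.execute("ALTER TABLE stg_dtmov ADD COLUMN amount_eur REAL")
        # The enrichment uses data from another staging table.
        conn.execute(
            """
            UPDATE stg_dtmov
               SET amount_eur = amount * (
                       SELECT rate_to_eur
                         FROM stg_rates
                        WHERE stg_rates.currency = stg_dtmov.currency
                   )
            """
        )
```

Because `amount` is left untouched, a later check request can compare the staging table directly with the input data file, with no need to restore anything.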

4 - How and where to load
The loading phase is very generic, since it does not say where to load the data. We should decide immediately where to load, because this choice determines which of the two fundamental approaches in the Data Warehouse field will be adopted.
Many years have passed, but this choice continues to divide the international community. Inmon approach or Kimball approach?
Do we want a comprehensive architecture with an ODS (Operational Data Store) that retains the more detailed data and a dimensional architecture for the synthesis data, or do we prefer a single dimensional structure for both? Everyone can decide according to their own experience, timing and budget.
However, regardless of the method used, the first structure to be loaded is surely the Staging Area, which initially receives the input data files. The Staging Area is a very vast topic. Here are just a few suggestions.
The loading of the staging tables should be as simple as possible: a single direct insertion, possibly filtered by some logical structure, from the data file into the final table. Some small "syntactic" transformation can be done, but it must concern formatting, not semantics.
The loading must always be preceded by the cleaning of the staging table. Do not load into a staging table multiple data files of the same type (several days, for example) that, for some reason, have not been loaded and have accumulated. If you can, always process them one at a time.
If necessary, you can aggregate them, by hand or with an automatic mechanism, into a single data file. Do not forget that we have to perform very accurate checks on these flows.
Even a trivial check on the consistency between the number of rows loaded and the number present in the data file will be much more difficult if the staging table contains the rows of several input flows.
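
These suggestions can be put together in a short sketch: clean the staging table, load a single data file with a direct insertion, then compare the row counts. The function name `load_staging`, the semicolon delimiter and the absence of a header row are assumptions made for the example.

```python
import csv
import sqlite3


def load_staging(conn, table, data_file, n_columns):
    """Empty the staging table, load ONE data file, and check the row count.

    'table' is assumed to come from trusted configuration, never from user
    input, because it is interpolated into the SQL text.
    'data_file' is any iterable of text lines (for example an open file).
    """
    # Rows present in the data file (empty lines are skipped).
    rows = [row for row in csv.reader(data_file, delimiter=";") if row]
    placeholders = ", ".join("?" * n_columns)
    with conn:  # one transaction: cleaning and loading succeed or fail together
        conn.execute(f"DELETE FROM {table}")
        conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    # Trivial but essential check: rows in the table vs rows in the file.
    loaded = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    if loaded != len(rows):
        raise RuntimeError(
            f"{table}: {loaded} rows loaded, {len(rows)} rows in the data file"
        )
    return loaded
```

Since the table is emptied before each load, the count comparison always refers to one input flow only, which is what keeps the check trivial.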

So, in conclusion, keep in mind that, in practice, ETL hides a different acronym, which can be summarized as: CALEL

  • Configuration
  • Acquisition
  • Load (Staging Area)
  • Enrichment
  • Load (Data Warehouse)

But, as CALEL is just horrible, we can continue to call it the ETL process. We can represent all this graphically in this way: