Introduction
In a previous presentation [1], I highlighted the importance of the control between the reference day of a data file received from the host system and the expected one. Given the importance of the topic, I would like to deepen both the meaning of this control and the identification of the reference day of the data.
This topic may seem strange, or too technical, especially to those who are not very experienced with Data Warehouses, or who are at their first project. I will try to simplify the concept as much as possible, because it is a very useful example for understanding the pitfalls inside the ETL process.
Example
To understand the problem, let us take a similar example from practical life. Suppose that every morning you receive from your bank, via email, the list of the operations of your checking account. So, every day you receive an Excel spreadsheet with the movements of the previous day. The reference day of the operations is located in the subject of the email, for example: "Operations of the 03/12/2015".
Obviously you do not check your email from the bank every morning: you have created an automatic mechanism that opens the email and loads the daily operations, with the day indicated in the subject of the email, into the database of your household expenses. Only at the start of the month do you begin your analysis of the previous month.
The questions are:
- What guarantee do you have that the email arrives every day?
- What guarantee do you have that the day in the subject of the email always refers to the day before?
The importance of the control
If we are not able to answer the questions above, we have no guarantee about the validity of our analysis. This is the reason why the control is important. That is why we have to be absolutely sure that the day we are loading is exactly the one we expected. And, before that, we have to verify that what we expected has actually arrived. We should not assume that everything runs smoothly, because an anomaly can always occur, both in the source systems and in our own.
Do not forget that, when a problem arises, we, the Data Warehouse team, are always the first interlocutors, the first people asked for explanations. In practice, in the eyes of the end users, we are the cause of the problem: those who were supposed to control and did not. Here are two examples that may occur (and in fact they have occurred to me in the past).
Problem 1
In the source system there has been some revision or change in the extraction plans (no one is obliged to alert you), so an error arises in the data filter. Consequence: the data file is correctly generated every day, but the reference day inside the flow is wrong. We do not know the cause: the reference day may remain fixed, lag one day behind, be equal to the system date, or refer to a field different from the one used in the previous data file. Only by checking the reference day against the expected day do we realize the error.
Problem 2
The ETL process copies the file, via ftp, from the host system folder to a local folder in order to process it. Suppose that the host folder, which is also used by other systems, is no longer used starting from a certain day, for reasons of maintenance or space, and is replaced by another folder. Nobody warns you, the old folder is still visible, and the Data Warehouse, from the day after the switch, continues to feed itself from the old folder. In this case too, an immediate control of the reference day is crucial.
Moreover, if the data you are erroneously loading, always for the same day, are mixed with other data that instead change daily, discovering this kind of error is not immediate and may take several days.
The solution
The proposed solution to control the congruence between the reference day and the expected one was very minimal, according to the rules of an "agile" vision of the construction of a Data Warehouse [2]. A metadata table and a log table were sufficient. We can graphically represent its logic in the following figure. In practice, we can use a configuration table of "calendar" type, in which each calendar day is associated with the expected day of the daily data source and with the reference day actually loaded. A log table shows, at the end of the daily load, the congruence of the two days. The physical names are constructed according to my naming convention standard.
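As a purely illustrative sketch (the physical names below are hypothetical, not the ones of my naming convention), the two tables could look like this:

CREATE TABLE dwh_calendar_cfg (
  cal_day_dat   DATE         NOT NULL,  -- the calendar day
  io_cod        VARCHAR2(10) NOT NULL,  -- code of the daily data source
  expected_flg  CHAR(1)      NOT NULL,  -- 'Y' if a file is expected on this day
  expected_dat  DATE,                   -- the expected reference day
  loaded_dat    DATE,                   -- the reference day actually loaded
  CONSTRAINT pk_dwh_calendar_cfg PRIMARY KEY (cal_day_dat, io_cod)
);

CREATE TABLE dwh_load_log (
  cal_day_dat   DATE         NOT NULL,
  io_cod        VARCHAR2(10) NOT NULL,
  expected_dat  DATE,
  loaded_dat    DATE,
  congruent_flg CHAR(1)                 -- 'Y' when the two days match
);

At the end of the daily load, the ETL writes a row in the log table with the outcome of the comparison between the two days.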
The basic elements
The elements involved in this control are therefore three. The first is the calendar day. The second is the expected day inside the daily flow. We must configure it on the basis of the flow analysis: we know its logic, so we can easily configure the expected day. For example, if the data file is not expected at the weekend, the Saturdays and the Sundays of the calendar will not be set as expected days. The third element in the game is the reference day inside the data file, that is, the day to be compared with the expected one for the verification of congruence.
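For instance, sticking to the hypothetical dwh_calendar_cfg table sketched above, a year of expected days for a flow that does not arrive at the weekend could be configured like this:

-- Configure one year of expected days, skipping Saturdays and Sundays
-- (hypothetical names; the real expected-day rule comes from the flow analysis).
INSERT INTO dwh_calendar_cfg (cal_day_dat, io_cod, expected_flg, expected_dat)
SELECT d.cal_day,
       'IO01',
       CASE WHEN TO_CHAR(d.cal_day, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH')
                 IN ('SAT', 'SUN') THEN 'N' ELSE 'Y' END,
       CASE WHEN TO_CHAR(d.cal_day, 'DY', 'NLS_DATE_LANGUAGE=ENGLISH')
                 IN ('SAT', 'SUN') THEN NULL
            ELSE d.cal_day - 1 END      -- here the flow carries the previous day
FROM  (SELECT DATE '2015-01-01' + LEVEL - 1 AS cal_day
       FROM dual CONNECT BY LEVEL <= 365) d;

Here the expected day is simply the previous calendar day; a real configuration (for example, Monday expecting Friday's data) depends on the logic of each flow.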
The reference day of the data
Now let us focus our attention on the third element, the reference day of the data. Where is this information? It would be nice to think of a rule by which the day is always present directly as a column of the file. This should be a rule, an underlying condition. Unfortunately, in a Data Warehouse, as usual, nothing is simple, and the reference day, most of the time, is not a column of the data file. The rule we have just described can be applied only if we are lucky enough to be able to decide the structure of the source files ourselves. Most often, though, we will have to re-use (to save money, of course) data files that already exist, created with different rules. Let's see the situations that we will surely face:
- The day is a column of the data file
- The day is in the header of the data file
- The day is in the tail of the data file
- The day is in the data file name
- The day is not there at all.
The day is a column of the data file
This is the simplest and most easily manageable case: the day is a column of the data file. For example, a recording date, the execution date of an order, a fiscal date. Typically you find this situation in the data files that will become, at the end of the ETL process, the fact tables.
The day is in the header/tail of the data file
Typically you find this situation in standardized flows, already in place, generated on the mainframe by programs written in the Cobol language. These files can have several header/tail lines carrying a lot of information, including the reference day. In fact, the day found there is usually the generation date of the data file. However, if the data are master data (like the attributes of the customers), we can assume that it coincides with the reference day.
Think of a customer data file. If the data extraction starts at 22:00 on 2015/03/31, we may assume that it freezes the daily situation of the customers at 2015/03/31; therefore the extraction day is the same as the reference day. Beware, however, of the cases in which the extraction starts after midnight: in that case, the reference day is equal to the extraction day minus 1 day.
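In SQL terms, the rule could be sketched like this (a hypothetical extraction_ts timestamp, a hypothetical extraction_log table and a 22:00 cutoff are assumed purely for illustration):

-- Hypothetical illustration of the after-midnight rule: an extraction
-- launched in the evening describes the current day, one launched
-- after midnight still describes the previous day.
SELECT CASE
         WHEN TO_CHAR(extraction_ts, 'HH24') >= '22'
           THEN TRUNC(extraction_ts)        -- started at 22:00: same day
         ELSE TRUNC(extraction_ts) - 1      -- started after midnight: day - 1
       END AS reference_day
FROM   extraction_log;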
The reference day is inside the file name
This is a fairly common case, typical above all of data files in csv format (think of a name like operations_20150312.csv).
The reference day is missing
We must also consider this case. Sometimes the data file is always generated with the same name (which therefore excludes the case above); it may have a header/tail, but without the information that interests us. We can only rely on the system date of the Data Warehouse server.
The importance of the metadata
Let us now see how to provide a method to handle these situations. It is easily implemented using only a bit of the language of the database (for example, Oracle's PL/SQL, but these techniques can easily be adapted to any database) and a bit of metadata.
Metadata, a very sophisticated and intriguing logical term that hides what we trivially call configuration tables, are essential to manage the cases we have described. The metadata needed to identify and extract the reference day are the following (a sketch of the corresponding table is given after the list):
- ref_day_cod = here goes, if present, the name of the column that contains the reference day.
- row_num = the offset of the line, from the beginning of the data file, in which we find the reference day when it is in the header. The number will be negative if the reference day is in the tail.
- start_num = the position, within the line indicated by row_num, of the starting character of the reference day.
- size_num = the size of the day: 8 in the YYYYMMDD format, 10 in the DD-MM-YYYY format. Avoid using months in text format.
- for_txt = the date format of the reference day, for example 'yyyy-mm-dd'.
- off_num = the offset (in days) of the reference date. This information is useful when the generation of the flow and its processing are performed on different days. Let's take an example. Today is day 14 and the processing of the data file starts. The flow was generated yesterday at 22:00, day 13 is in the header, and we assume that this is the reference day of the data: in this case off_num must be equal to zero. Suppose now that the generation of the data file takes place after midnight, so that we find day 14 in the header: in this case off_num must be set to -1, because even if the generation and processing of the flow occurred after midnight, the master data certainly refer to the situation of day 13. This, in brief, is the reason for this metadata.
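A minimal sketch of such a table, with hypothetical physical names and two sample configurations:

CREATE TABLE dwh_io_metadata (
  io_cod       VARCHAR2(10) NOT NULL,  -- input file code
  ref_day_cod  VARCHAR2(30),           -- column holding the day, if any
  row_num      NUMBER,                 -- header line (negative = tail)
  start_num    NUMBER,                 -- first character of the day
  size_num     NUMBER,                 -- length of the day
  for_txt      VARCHAR2(20),           -- date format of the day
  off_num      NUMBER DEFAULT 0,       -- offset in days
  CONSTRAINT pk_dwh_io_metadata PRIMARY KEY (io_cod)
);

-- Example rows: a file with the day in a column, and one with the day
-- at the start of the first header line, format yyyymmdd, generated
-- after midnight (hence the -1 offset).
INSERT INTO dwh_io_metadata VALUES ('IO01', 'ORDER_DAT', NULL, NULL, NULL, NULL, 0);
INSERT INTO dwh_io_metadata VALUES ('IO02', NULL, 1, 1, 8, 'yyyymmdd', -1);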
Now let us complete the previous figure with an example of the corresponding data files.
How to use the metadata
Defining the metadata, placing them in a configuration table of all the data files and filling in their content is not a "passive" activity. That is, it does not only serve to document the structure of the data files: it is also "active" information, because it can be used in the ETL programs.
Let's take a very simple concrete example. A data file is only a sequence of lines structured in data columns. We can represent it as follows:
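For instance, a purely hypothetical customer file, with the reference day in a one-line header, might look like this:

20150331                      <- header line containing the reference day
C0001;ROSSI;MARIO;MILANO      <- data rows structured in columns
C0002;BIANCHI;LUCIA;ROMA
C0003;VERDI;ANNA;TORINO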
The first thing to do in order to load the data file into the Staging Area is to build an external table that "sees" the data file, with the same columns plus the technical field <IO>_row_cnt, which simply assigns a number to each row. <IO> is the input file code.
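A sketch of such an external table in Oracle, assuming the hypothetical customer file above, the input code IO02 and a directory object DWH_DIR pointing to the landing folder:

CREATE TABLE ext_io02_customers (
  cust_cod       VARCHAR2(10),
  last_name_des  VARCHAR2(50),
  first_name_des VARCHAR2(50),
  city_des       VARCHAR2(50),
  io02_row_cnt   NUMBER          -- technical field: numbers each row
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dwh_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ';'
    MISSING FIELD VALUES ARE NULL
    ( cust_cod, last_name_des, first_name_des, city_des,
      io02_row_cnt RECNUM )      -- RECNUM assigns the row number
  )
  LOCATION ('customers.dat')
);

Note that the header line is read like any other row (with row number 1 and most columns NULL), which is exactly what allows the metadata to locate it.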
The Staging Area table must obviously have the same columns as the external table, plus the reference day DAY_COD. If the day is already in the data file, it does not matter: always insert an additional DAY_COD column. At this point, you can create a function that, having the IO_COD as input, uses the metadata table and extracts the value of the reference day to be loaded into the DAY_COD of the Staging Area table.
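A minimal sketch of such a function, assuming the hypothetical dwh_io_metadata table and DWH_DIR directory defined above (the tail case, with a negative row_num, is omitted for brevity):

CREATE OR REPLACE FUNCTION get_ref_day (
  p_io_cod    IN VARCHAR2,  -- input file code
  p_file_name IN VARCHAR2   -- physical name of the data file
) RETURN DATE
IS
  v_meta dwh_io_metadata%ROWTYPE;
  v_file UTL_FILE.FILE_TYPE;
  v_line VARCHAR2(4000);
BEGIN
  -- read the extraction rules configured for this input file
  SELECT * INTO v_meta
  FROM   dwh_io_metadata
  WHERE  io_cod = p_io_cod;

  -- the day is a regular column: the Staging load reads it directly
  IF v_meta.ref_day_cod IS NOT NULL THEN
    RETURN NULL;
  END IF;

  -- the day is missing from the file: fall back on the system date
  IF v_meta.row_num IS NULL THEN
    RETURN TRUNC(SYSDATE) + v_meta.off_num;
  END IF;

  -- the day is in the header: read down to the configured line
  v_file := UTL_FILE.FOPEN('DWH_DIR', p_file_name, 'r');
  FOR i IN 1 .. v_meta.row_num LOOP
    UTL_FILE.GET_LINE(v_file, v_line);
  END LOOP;
  UTL_FILE.FCLOSE(v_file);

  -- cut out the day, convert it and apply the offset in days
  RETURN TO_DATE(SUBSTR(v_line, v_meta.start_num, v_meta.size_num),
                 v_meta.for_txt)
         + v_meta.off_num;
END get_ref_day;
/

The returned value can then be used by the ETL both to fill DAY_COD in the Staging Area table and to compare the loaded day with the expected one in the calendar table.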
Conclusion
I have already had occasion to emphasize the importance of the reference day of the data. A mistake in its identification may jeopardize the validity of the whole Data Warehouse. Do not forget that:
- the versioning policies of the dimensional tables (slowly changing dimensions of type 2) are based on the reference day;
- the fact tables are partitioned using the reference day.
An error in the initial phase can be very costly in the following stages of the project. So, beyond the technical solution, the identification and control of the reference day is a very important issue in a Data Warehouse project.
The management of the reference day, just like the management of the "NULL" [3] and the description of the codes [4], of which I have already spoken at length, may seem an unimportant topic of little interest. But ignoring these issues is very dangerous, and may be the cause of many failures in the worst case, or of heavy delays in the deployment in the best case.
References
[2] http://www.slideshare.net/jackbim/recipe-11-agile-data-warehouse-and-business-intelligence