Wednesday, December 10, 2014

The Descriptions Management

Introduction

The descriptions management, is a subject little discussed within the Data Warehouse and Business Intelligence community. It speaks little in books and articles, although it is a crucial issue because it affects the way we see the information.
So, we'll talk about descriptions. In particular, the descriptions of the codes that we can found in the dimension tables of a Data Warehouse.
I think it's important to show the complexity that is hidden behind what seem trivial. And it is necessary for all, (end users, project managers, data architects and the other stakeolder) know that the descriptions management is a particularly tricky topic.

A real life example

The best way to introduce this topic, is to start with a real example. Suppose we observe an end-user who is using a Business Intelligence interface.
The user needs to extract numerical data, for example, the movements of an account, and he wants to filter only those customers who have a particular profession, for example, only employees.
In order to meet this requirement, the Data Warehouse Team has just prepared a menu with a list of choices of values, (that is, descriptions) on which the end user can choose the line "Employee". The following figure shows this case. 




This figure shows the visual world. Now we enter in the world of descriptions management, with a number of considerations that describe the logical and the physical world that  hides behind the little word "just".

The vision of the end-user

The description "Employee" associated with the code "10" is a description wanted by the end user. It is very important to show a description that is in tune with the daily nomenclature of the end users. This need has a very important consequence.
It is very unlikely that the code description, which comes from the source systems (we call it "external"), it is the same description of the end user, in fact, usually, the description will be different.
This is because the description is present in the OLTP systems, not in the synthesis systems. So you need to always have in the Data Warehouse two descriptions. That of OLTP systems, let's call it "External description" and the final one, let's call it "Business description". To have both is important because it is the only link between two systems that have different purposes.

The extended description

As seen in the example, in the list of choices, must not appear only the description, but also the code. Often this does not happen, and it is shown only the description. My suggestion is to show them both.
We define this new description as "extended description", and it must be obtained by concatenating the value of the code with a "-" (minus) and the description of the value. It is useful to do this, because end users are not all equal.
Some user think using descriptions, but others think using codes. Often the end users have so many years working in a certain environment, that it is easier for them to find customers with profession "10" rather than those with the profession "Employee". Moreover, the descriptions, in time, may vary, so the presence of the code is undoubtedly safer. The best thing is to show them both, code and description, to meet all types of users. So we have "10 - Employee"

Because we have an external description and a business description, we have to decide which of the two use to build the extended description.
In general, it is appropriate to use a single description, namely the business description, but it can be an acceptable choice also have two extended descriptions.
Let's now look at the physical world behind the list of values displayed to the end user. Because as we know, the logical considerations, always have physical consequences.

How to see the extended descriptions

When a user requests a choice of values on the Business Intelligence interface, the display of the list is almost always the result of the generation of a query on the database of the Data Warehouse. Generally, it is very likely that starts a "select distinct" of the values of the extended description, on the dimension table.
The extended description can be implemented without taking up space, with the characteristic of the virtual columns. The virtual columns are defined on Oracle, but you can always create views that create those columns. We will show later how to define it.
Here's an example, very simplified, of the columns of a possible customers dimension table, which will be the base of the list of values:

  • CUST_KEY - Artificial key associated to the customer
  • CUST_COD - Natural key (customer code)
  • PROF_COD - Profession code
  • PROF_HDS - External Description (OLTP description)
  • PROF_BDS - Business description
  • PROF_XDS - Extended description. Obtained by concatenating the code with the business description

Where to find the extended descriptions

We have seen that a simple "click" on a list of choices, hides various considerations. But the most surprising thing is that the above is just the tip of the iceberg. In this case, the similarity is very apt. The tip is what the end user sees. But what happens below the surface? It 'so trivial to get descriptions that come (maybe) from source systems, and insert them in the dimension tables? As always happens in a Data Warehouse, nothing is trivial and everything is more complicated than it seems. Indeed, code descriptions can be located, in my experience, in seven different places.
  1. The code description is already in the staging table (which will become the dimension table)
  2. The code description is present in a table of descriptions, exclusively for that code.
  3. The code description is present in a table of descriptions for all of the codes. (or nearly all)
  4. The code description is present in a sheet excel / csv
  5. The code description is present in a Word document or pdf or xml
  6. The code description is in an email.
  7. The code description is hard-coded in the code of the OLTP forms , so, after a phone call to some referent of OLTP systems, we return to one of the 4 or 5 or 6 point.

Descriptions as metadata

Without thinking where are the descriptions and the way in which we will acquire, how can we manage the descriptions in our Data Warehouse?
My advice is to always have a metadata table as that of the point 3 and reduce all other cases to it. But to do so, we need to introduce the concept of domain of values. The concept is very simple. Just give a name to the set of values associated with a certain code. For example, if we know that the set of possible values of the code profession is:

10 = Employee
20 = Free Lance
30 = Tradesman

Associate the "PROF" name to this domain of values  and insert this data into a metadata table that we call MEF_DOM_CFT (MEF Domain Configuration). The table can have a very simple structure with the following information:

  • DOM_COD  - Domain code
  • COD_TXT  - Value of the code
  • HDS_TXT  - External description
  • BDS_TXT  - Business description

The Naming Convention of the descriptions

We adopt, as always, a naming convention. On this issue I have already written a lot on my blog, in this case we assume that all the fields of descriptive codes, ie codes that makes sense to have a description, must be called like  * _COD, the external descriptions  like * _HDS, the business  descriptions  like *_BDS *, the extended descriptions like *_XDS.

Who has experience of Data Warehouse knows that, unfortunately, if we wait for the end of the analysis of the domains, we lose a lot of time, so we must act preventively. My advice is to fill in the automatic way the domains table with the values of the codes that we found in the staging tables, giving them a unique name, and a default description like "nd" (not defined). In a second step we can take action on generic domains by changing their default name . The default domain name that we will adopt will be <source >. <Column name>.

Other considerations

  • Descriptions are names that rarely vary. It would be a waste of time to set a daily loading/update of the domains table. Do an initial massive load and manage, when it is necessary, any changes and additions.

  • Often you get, in the source files, short and long descriptions. Obviously, if they are present in the source you should load them, but you should always use the short description as "external description". Now the trend is to shorten the information (see the language of SMS messages) so, in my opinion, you can ignore long descriptions that can create problems of sizing on graphical interfaces (PC,Tablet and smartphones)
  • The business  description is necessarily manual. It will be the end user who will decide which description prefers to see. The initial loading, however, must initialize the business description equal to the external description, if,and only if, the business description is empty. Then, from time to time, we will proceed to the changes.
  • Another point that should not be overlooked is the following: how much are reliable, codes and descriptions that we find in the points 2-6? As often happens, the documentation is not always up to date, so we risk to have not the full domain of the code. On this point, we have to be proactive in the sense that we can not ask generically an update of the documentation of the source systems. We must be the ones to ask precisely the descriptions of the codes that we do not find in the documentation.

The importance of the domains

Finally, the main reason for the existence of the domains table. The reuse of the description. In the tables of a Data Warehouse the same domain can be applied to so many columns. 
In fact they have different names, but refer to the same domain of values. For example, a currency code, can be present in a table with name CURTRAT (currency of treatment) and CURTRA (currency of trading), but both must have the same domain, let's call it "CUR".

The descriptions management - Stage 1

At this point, we explore the descriptions management with a real simulation, obviously very simplified, after creating the simulation environment. The work plan is as follows.
  1. Create the environment
  2. Load the domains table
  3. Load the staging table
  4. Update the domains table on the basis of the codes that we find in the table of Staging Area
  5. Load the dimension table
  6. Update the dimension table with descriptions

Create the environment

We prepare a simulation environment, creating the necessary tables with only the information they we need, which is the table of the staging area, the dimension table and the domains table. As we can see, the dimension table contains both the external description and the business description. The extended description is built using the syntax of the virtual columns.

DROP TABLE CUST_STA CASCADE CONSTRAINTS;
CREATE TABLE CUST_STA (
  CUST_COD VARCHAR2(9),
  PROF_COD VARCHAR2(30)
);

DROP TABLE MEF_DOM_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60)
  ,HDS_TXT  VARCHAR2(255)
  ,BDS_TXT  VARCHAR2(255)
);

DROP TABLE CUST_DIT CASCADE CONSTRAINTS;
CREATE TABLE CUST_DIT (
  CUST_KEY NUMBER
  ,CUST_COD VARCHAR2(9)
  ,PROF_COD VARCHAR2(30)
  ,PROF_HDS VARCHAR2(60)
  ,PROF_BDS VARCHAR2(60)
  ,PROF_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (PROF_COD||' - '||PROF_BDS) VIRTUAL
);

Load the domains table

The loading of this table will necessarily be a mix of automatic loading and manual loading. We refer to the 7 situations seen before. If the domains are already present in a specific table or in a common table, we will use just a SQL statement to load it. If you are in an Excel spreadsheet, convert it to a CSV file, point it with an Oracle external table and then load them with a SQL statement. Otherwise, you have to load them manually one by one. Suppose to be in the latter case and load the domain of the profession code. We note the difference between external description and business description. If, as often happens, the feeding system is a mainframe, the descriptions will come all in uppercase and they will not be very user-friendly for a directional reporting

INSERT INTO MEF_DOM_CFT VALUES ('PROF','10','EMPL.'    ,'Employee');
INSERT INTO MEF_DOM_CFT VALUES ('PROF','20','FREE L.'  ,'Free Lance');
INSERT INTO MEF_DOM_CFT VALUES ('PROF','30','TRADESMAN','Tradesman');
COMMIT;
 

After the insertion, the contents of the domains table will therefore be:



Obviously, if we have not yet considered the analysis of the domains or we are still waiting for documentation, this step can be ignored. The table remains empty and will be loaded automatically in a second step.

Load the staging table

The staging table loading is a job that will be part of the ETL process. For simplicity we simulate its load with the following SQL statements. The '?' as value code, substitutes NULL. From the methodological point of view, I consider important to avoid NULL values in the Data Warehouse, and replace them immediately in the Staging Area with a default value. Usually I use the question mark. [for a detailed discussion refer to [http://www.slideshare.net/jackbim/recipes-5-of-data-warehouse-the-null-values-management-in-the-etl-process]

INSERT INTO CUST_STA VALUES ('Larry','10');
INSERT INTO CUST_STA VALUES ('Tom'  ,'15');
INSERT INTO CUST_STA VALUES ('Brown','20');
INSERT INTO CUST_STA VALUES ('Steve','40');
INSERT INTO CUST_STA VALUES ('Jack' ,'?');
COMMIT;


After the insertion, the content of the staging table will be:




Update the domains table

As you can see from the staging area contents, I purposely simulated a situation of domain not updated, in fact are not present all the values that could arrive (missing the "15", "40" and a "?" ). In order to keep the domains table updated, we perform a procedure that we can run as post-processing in the Staging Area loading. The procedure is the following.

create or replace procedure p_dom_prof(p_dom_cod varchar2) is
begin
   for r in (select distinct p_dom_cod dom_cod
             ,a.prof_cod
             ,b.cod_txt
             ,nvl(b.hds_txt,'nd') hds_txt
             ,nvl(b.bds_txt,'nd') bds_txt
             from cust_sta a
             left outer join mef_dom_cft b
             on (a.prof_cod = b.cod_txt
             and b.dom_cod = p_dom_cod)) loop
      if (r.cod_txt is null) then
         insert into mef_dom_cft (dom_cod,cod_txt,hds_txt,bds_txt)
         values(p_dom_cod,r.prof_cod,r.hds_txt,r.bds_txt);
      end if;
   end loop;
   commit;
end;
/
sho errors
exec p_dom_prof('PROF');


The procedure performs a simple loop on the values of the profession code of the staging table. There is the link with the domains table,using domain name that receives as input. As we can see, the procedure will fill the domain table even if the table is initially empty (by means of the outer join). Also always acts as an insert, then does not alter or delete descriptions that have been entered manually.
The procedure also inserts the missing codes with the default description "nd". If now we query the domains table, we will get:




Load the dimension table

The loading of the dimension table, is one of the most challenging and delicate component of the entire ETL process. From my experience, almost all dimension tables should take account of the "Slowly Changing Dimension" of type 2, on which I will not go into detail now. Precisely because of this complexity, I preferred not to add complexity to complexity. I prefer manage the descriptions as post-processing of the loading of the dimension tables. So, first I load the dimension table by taking the data from the staging table, then I update the descriptive components. We load the dimension table in a very simply way with the following SQL statement:

INSERT INTO CUST_DIT(CUST_KEY,CUST_COD,PROF_COD)
SELECT ROWNUM,CUST_COD,PROF_COD
FROM CUST_STA;
COMMIT;


Situation after loading:




Update the dimension table

To update the dimension table, we create and execute the following procedure.

create or replace procedure p_dit_prof(p_dom_cod varchar2) as
begin
     for r in (select distinct prof_cod from cust_dit) loop
      update cust_dit d
      set (prof_hds,prof_bds) = (select hds_txt,bds_txt
         from mef_dom_cft s
         where d.prof_cod = s.cod_txt
         and s.dom_cod = p_dom_cod)
      where prof_cod = r.prof_cod
      and exists (select 1
         from mef_dom_cft s
         where d.prof_cod = s.cod_txt
         and s.dom_cod = p_dom_cod);
     end loop;
   commit;
end;
/
sho errors
exec p_dit_prof('PROF');


The logic of the procedure is very simple. For each value of the profession code, it performs the update of the description taken from the domains table. If now we query the dimension table we will get: 





After the nightly loading of the Data Warehouse, we can query the domain table (or the dimension table), and all situations, in which there is the description "nd", must be the object of attention for all the people who are concerned with the quality control of the data . In practice, for each code that is "nd", we must send a request for a description to the feeding systems.

The descriptions management - Stage 2

The Phase 1 is helpful to us, to be aware of the problems and difficulties associated with descriptions management . The solution shown, although functional, is practically unusable in a real Data Warehouse.
The procedures that update the domains table and the dimension table, are specific for the profession code, and such procedure should be repeated, with appropriate modifications, for all other codes that require a description. In a Data Warehouse, we can have dozens of analytical dimensions, each with dozens of descriptive codes. It is not thinkable write hundreds of procedures for the descriptions management.
So let's see how to implement a single procedure that handles all automatically. To achieve the result indicated, we must necessarily get help from some configuration tables. In particular, a configuration table of the input stream, which indicate me the name of the staging table and of the dimension table. And we need a configuration table of all columns that need a description.

Environment creation

We enrich the staging table assuming the presence of three codes. The first, already seen in the previous phase is the profession code.The second is the status code of the customer. The third is the type code of the customer. We adapt also the dimension table with the new codes.

DROP TABLE CUST2_STA CASCADE CONSTRAINTS;
CREATE TABLE CUST2_STA (
  CUST2_COD VARCHAR2(9)
  ,PROF_COD VARCHAR2(30)
  ,STATUS_COD VARCHAR2(1)
  ,TYPE_COD VARCHAR2(1)
);

DROP TABLE CUST2_DIT CASCADE CONSTRAINTS;
CREATE TABLE CUST2_DIT (
  CUST2_KEY NUMBER
  ,CUST2_COD VARCHAR2(9)
  ,PROF_COD VARCHAR2(30)
  ,PROF_BDS VARCHAR2(60)
  ,PROF_XDS VARCHAR2(93) GENERATED ALWAYS
      AS (PROF_COD||' - '||PROF_BDS) VIRTUAL
  ,STATUS_COD VARCHAR2(1)
  ,STATUS_BDS VARCHAR2(15)
  ,STATUS_XDS VARCHAR2(19) GENERATED ALWAYS
     AS (STATUS_COD||' - '||STATUS_BDS) VIRTUAL
  ,TYPE_COD VARCHAR2(1)
  ,TYPE_BDS VARCHAR2(15)
  ,TYPE_XDS VARCHAR2(19) GENERATED ALWAYS
     AS (TYPE_COD||' - '||TYPE_BDS) VIRTUAL
);


We now introduce two new configuration tables. A configuration table of the input streams (MEF_IO2_CFT) and a configuration table of the columns of the staging tables (MEF_STA2_CFT),which will become columns of the dimension table. The MEF_IO2_CFT table contains, for each identifier of a stream (IO_COD), the name of the staging table (STA_TABLE_COD) and the name of the dimension table (DIM_TABLE_COD) . We insert a configuration line .
The MEF_STA2_CFT table  contains, for each identifier of a stream (IO_COD), the name of the column code (COLUMN_COD) and the domain name associated (DOM_COD). This configuration table is very important, because it is the only place where we define for each code, the name of its domain. This name will allow you to access to the domains table to find the descriptions. For our example, we insert three lines, one for each code, assuming that the customer type has not yet been taken into account in the analysis of its domain. So, the domain code remains NULL.

DROP TABLE MEF_IO2_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_IO2_CFT (
  IO_COD        VARCHAR2(9)
  ,STT_COD VARCHAR2(30)
  ,DIT_COD VARCHAR2(30)
);
INSERT INTO MEF_IO2_CFT VALUES ('CUST2','CUST2_STA','CUST2_DIT');

DROP TABLE MEF_STA2_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_STA2_CFT (
  IO_COD      VARCHAR2(9)
  ,COLUMN_COD VARCHAR2(30)
  ,DOM_COD    VARCHAR2(30)
);
INSERT INTO MEF_STA2_CFT VALUES ('CUST2','PROF_COD','PROF');
INSERT INTO MEF_STA2_CFT VALUES ('CUST2','STATUS_COD','STATUS');
INSERT INTO MEF_STA2_CFT VALUES ('CUST2','TYPE_COD',null);

Load the domains table

Recreate and load this table by adding to the domain already seen, that of the state of the customer:

DROP TABLE MEF_DOM2_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM2_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60)
  ,HDS_TXT  VARCHAR2(255) NOT NULL
  ,BDS_TXT  VARCHAR2(255) NOT NULL
);
INSERT INTO MEF_DOM2_CFT VALUES ('PROF','10','EMPL.'    ,'Employee');
INSERT INTO MEF_DOM2_CFT VALUES ('PROF','20','FREE L.'  ,'Free Lance');
INSERT INTO MEF_DOM2_CFT VALUES ('PROF','30','TRADESMAN','Tradesman');
INSERT INTO MEF_DOM2_CFT VALUES ('STATUS','0','NO','Not active');
INSERT INTO MEF_DOM2_CFT VALUES ('STATUS','1','YES'  ,'Active');
COMMIT;



The table content will be the following:




Load the staging table

The loading of this table will be similar to what has been seen.

INSERT INTO CUST2_STA VALUES ('Larry','10','0','P');
INSERT INTO CUST2_STA VALUES ('Tom'  ,'15','1','P');
INSERT INTO CUST2_STA VALUES ('ACME' ,'20','1','L');
INSERT INTO CUST2_STA VALUES ('Steve','40','?','P');
INSERT INTO CUST2_STA VALUES ('Jack' ,'?','1','P');
COMMIT;

Update the domains table

We will use a single procedure. It, using the information of the configuration tables, dynamically creates the SQL statement to do the update. Furthermore, in case that some domains are not defined , associates automatically the default domain code as already described. Because sometimes it can be useful to call a field "* _COD" although it is not sensible to associate to it a description, we assume that, if we set the code domain equal to the word "NULL", this field will not be taken into consideration in the treatment of the domains.

create or replace procedure p_dom2(p_io_cod varchar2 default '%'
   ,p_column_cod varchar2 default '%') is
v_sql varchar2(4000);
v_dom_cod varchar2(61);
v_hds_txt varchar2(200);
v_bds_txt varchar2(200);
v_cod_txt varchar2(200);
v_column_cod varchar2(200);
type t_rc is ref cursor;
v_cur t_rc;
cr char(1) := chr(10);
begin
     for r in (select a.STT_COD
               ,b.column_cod
             ,nvl(b.dom_cod,a.io_cod||'.'||b.column_cod) dom_cod
               from mef_io2_cft a
               inner join mef_sta2_cft b
               on (a.io_cod = b.io_cod
               and b.column_cod like p_column_cod
               and b.column_cod like '%_COD'
               and nvl(b.dom_cod,'*') <> 'NULL')
               where a.io_cod like p_io_cod) loop
         v_sql := 'select distinct '||''''||r.dom_cod||''''||cr||
                  '   ,a.'||r.column_cod||cr||
                  '   ,b.cod_txt'||cr||
                  '   ,nvl(b.hds_txt,''nd'') hds_txt '||cr||
                  '   ,nvl(b.bds_txt,''nd'') bds_txt '||cr||
                  'from '||r.STT_COD||' a '||cr||
                  'left outer join MEF_DOM2_CFT b '||cr||
                  'on (a.'||r.column_cod||' = b.cod_txt '||cr||
                  '    and dom_cod='||''''||r.dom_cod||''''||')';
         dbms_output.put_line(v_sql);
         open v_cur for v_sql;
         loop
            fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt,v_hds_txt,v_bds_txt;
            exit when v_cur%notfound;
            if (v_cod_txt is null) then
               insert into MEF_DOM2_CFT (dom_cod,cod_txt,hds_txt,bds_txt)
               values(v_dom_cod,v_column_cod,v_hds_txt,v_bds_txt);
            end if;
         end loop;
         close v_cur;
     end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dom2('CUST2');


Through the command "set serveroutput on" we can see an example of the SQL statement that is generated automatically for each source / column.





After the procedure execution, the content of the table will be the following. Note the power of this procedure. Although we have not yet thought about the domains management, if you call it without parameters, it will be able to initialize the domains table of for all code fields (that have been configured) throughout the Data Warehouse.



Load the dimension table

The loading is similar to that of stage 1

INSERT INTO CUST2_DIT(CUST2_KEY,CUST2_COD,PROF_COD,STATUS_COD,TYPE_COD)
SELECT ROWNUM,CUST2_COD,PROF_COD,STATUS_COD,TYPE_COD
FROM CUST2_STA;
COMMIT;


We see now the content of the dimension table before the next update.


Update the dimension table

Also in this case we have to use a dynamic procedure.

create or replace procedure p_dit2(
   p_DIT_COD varchar2 default '%'
   ,p_col varchar2 default '%'
   ,p_force_cod number default 0) as
type t_rc is ref cursor;
v_cur t_rc;
v_sql varchar2(2000);
v_value_cod varchar2(60);
v_where varchar2(80);
v_count number;
cr char(1) := chr(10);
begin
   for s in (select io_cod,DIT_COD
             from MEF_IO2_CFT
             where DIT_COD like p_DIT_COD) loop
      for r in (select b.column_cod
                ,replace(b.column_cod,'_COD','_BDS') column_bds
                ,nvl(dom_cod,s.io_cod||'.'||column_cod) dom_cod
                from cols a
                inner join MEF_STA2_CFT b
                on (a.column_name = b.column_cod
                    and b.io_cod = s.io_cod)
                where a.table_name = s.DIT_COD
                and a.column_name like p_col
                and a.column_name like '%_COD'
                and nvl(b.dom_cod,'*') <> 'NULL') loop
         if (p_force_cod = 0) then
            v_where := ' where '||r.column_bds||' is null';
         else
            v_where := null;
         end if;
         v_sql := 'select distinct '||r.column_cod||' value_cod'||cr||
                  ' from '||s.DIT_COD||v_where;
         dbms_output.put_line(v_sql);
         open v_cur for v_sql;
         loop
            fetch v_cur into v_value_cod;
            exit when v_cur%notfound;
            -- usare using perchĆØ se il valore contiene degli apici, l'sql fallisce
            v_sql :=
            'update '||s.DIT_COD||' d'||cr||
            'set ('||r.column_bds||') = ('||cr||'select bds_txt '||cr||
            'from MEF_DOM2_CFT s where d.'||r.column_cod||' = s.cod_txt '||
            'and s.dom_cod = '||''''||r.dom_cod||''''||')'||cr||
            'where '||r.column_cod||' = :1 '||
            ' and exists ('||cr||
            'select 1 from MEF_DOM2_CFT s where d.'||r.column_cod||' = s.cod_txt '||
            'and s.dom_cod = '||''''||r.dom_cod||''''||
            ')';
            dbms_output.put_line(v_sql);
            execute immediate v_sql using v_value_cod;
         end loop;
         close v_cur;
      end loop;
   end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dit2('CUST2_DIT');


Some notes on this update procedure.

  • The first loop is used to identify the name of all the dimension tables, if you do not provide a specific input.
  • The second loop is used to identify values and descriptions associated with the domains. They will be the basis for the subsequent updates.
  • As we know, the UPDATE operations are, in general, slow operations at the Database level. But we perform the update only for the rows that have not the description valued (see, for example, the clause "where '|| || r.column_bds' is null"). This greatly reduces the weight of the update operation. In fact, in the case of the Slowly Changing Dimensions of type 2, we expect that new keys or new keys versioning, will involve only between the 5 and 10 percent of the total lines (this is why we call slowly dimensions). Then, with appropriate indexing, the weight of the update should not be excessive.
  • We note again that the condition (on NULL) seen before, is conditioned by an input parameter to the procedure. So, in the face of a massive update of the descriptions in the domains table, it is always possible to force a massive update of all dimension tables (off-line, of course).
  • We used the clause "using" in the dynamic execution, because if the value contains quotes, it could cause problems.
The dynamic SQL used in the procedure, will produce statements of type:

select distinct STATUS_COD value_cod
from CUST2_DIT where STATUS_BDS is null;

update CUST2_DIT d
set (STATUS_BDS) = (
select bds_txt
from MEF_DOM2_CFT s where d.STATUS_COD = s.cod_txt and s.dom_cod = 'STATUS')
where STATUS_COD = :1  and exists (
select 1 from MEF_DOM2_CFT s where d.STATUS_COD = s.cod_txt and s.dom_cod = 'STATUS')


At the end of the update, the dimension table will be:   





The descriptions management - Stage 3

The previous phases have produced a fairly accurate descriptions management. Phase two has made it dynamic and reusable for all the dimension tables of a Data Warehouse. Unfortunately, we are still quite far from reality. We have not taken into account a very important factor: the dependence between codes.
In a medium / large Data Warehouse and especially in an Enterprise Data Warehouse, where the dependence between codes exists, the phase 2 does not work anymore because we have to deal with the following situations.

Example 1

Suppose to be in a bank that has acquired other banks. The data warehouse will contain the data of the whole group, so that even the customer table must be unique, with a company code that differentiates the customers. We apply now conceptually this fact to our example of CUST_DIT. We are quite sure that the meaning "emplyee" (like description) of the profession code "10" for the B1 bank , has the same meaning for the B2 bank ?
The question is obviously rhetorical. The answer is NO. We cannot have this guarantee, nor can we hope that the phase of data migration from the bank acquired at the acquiring bank has normalized all tables and meanings. This is the problem of the dependence between codes. A code no longer has a unique description, but has a different description depending on the value of another code.

Example 2

It must be clear that this problem does not occur only in the case of company mergers, but also on tables  of a single company. Another example can be the causal of a banking operation. A causal code "01" may have a meaning "Descr XXX" if the movement takes place by the bank desk, but it can have a different description "Descr YYY" with the same code "01" if the transaction takes place via the Web. In this case the causal code is dependent from the transmission channel code.

Esempio 3

When I speak of dependence between codes, does not mean between two codes, but also among many  codes. An example is the table of the financial instruments. Because the instruments are very numerous, it appears almost always a hierarchy that classifies financial instruments associating them to categories, groups and subgroups.
As we can see from the example below, the subgroup with code "1" has three different descriptions depending on the group to which it belongs. In turn, the group code "2" has two different descriptions depending on the category to which it belongs.


So, if we have dependence between codes, the management of the descriptions shown in step 2 does not work. You have to manage it in a different way.

Change the configuration table

The first thing to do is to change the configuration table of the columns by adding the information on the dependencies between codes. We assume no more than 5 dependencies. The new example table will be the dimensionale table of the financial instruments.

DROP TABLE MEF_STA3_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_STA3_CFT (
  IO_COD      VARCHAR2(9)
  ,COLUMN_COD VARCHAR2(30)
  ,DOM_COD    VARCHAR2(30)
  ,DC1_COD VARCHAR2(30)
  ,DC2_COD VARCHAR2(30)
  ,DC3_COD VARCHAR2(30)
  ,DC4_COD VARCHAR2(30)
  ,DC5_COD VARCHAR2(30)  
);
INSERT INTO MEF_STA3_CFT VALUES ('FIN','CAT_COD','CAT','','','','','');
INSERT INTO MEF_STA3_CFT VALUES ('FIN','GRP_COD','GRP','CAT_COD','','','','');
INSERT INTO MEF_STA3_CFT VALUES ('FIN','SUBGRP_COD','SUBGRP','GRP_COD','CAT_COD','','','');


If we see its contents, we can clearly see the dependencies between codes that we described earlier. The sub-group depends on the group and category, the group depends on the category, the category has no dependencies. 



Also configure the new source/dimension

DELETE MEF_IO2_CFT WHERE IO_COD='FIN';
INSERT INTO MEF_IO2_CFT VALUES ('FIN','FIN_STA','FIN_DIT');

Create and load the domains table

The domains table must contain the value of the dependence codes. These dependencies must be studied and analyzed. As in the previous stage, even if they are not known, all combinations of dependencies will be initialized automatically after the loading of the staging area.
The DCLIST_TXT field is only informative. It is useful to know what are the names of the columns to which the DV * codes refer themselves.

DROP TABLE MEF_DOM3_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM3_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60) NOT NULL
  ,HDS_TXT  VARCHAR2(255)
  ,BDS_TXT  VARCHAR2(255)
  ,DV1_COD    VARCHAR2(30 BYTE)
  ,DV2_COD    VARCHAR2(30 BYTE)
  ,DV3_COD    VARCHAR2(30 BYTE)
  ,DV4_COD    VARCHAR2(30 BYTE)
  ,DV5_COD    VARCHAR2(30 BYTE)
  ,DCLIST_TXT   VARCHAR2(160 BYTE)
);
INSERT INTO MEF_DOM3_CFT VALUES ('SUBGRP','1','LUSS. MOB.','LUSS. MOB.','3','4','','','','GRP_COD,CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('SUBGRP','1','EQUITIES','EQUITIES','2','7','','','','GRP_COD,CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('SUBGRP','1','ORD. STOCK  BUY','ORD. STOCK  BUY','2','3','','','','GRP_COD,CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('GRP','3','FOR.INV.FOUNDS','FOR.INV.FOUNDS','4','','','','','CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('GRP','2','PUT  ITALY','PUT  ITALY','7','','','','','CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('GRP','2','FOREIGN WARRANTS','FOREIGN WARRANTS','3','','','','','CAT_COD');
INSERT INTO MEF_DOM3_CFT VALUES ('CAT','4','COMMON FOUNDS','COMMON FOUNDS','','','','','','');
INSERT INTO MEF_DOM3_CFT VALUES ('CAT','7','OPTIONS','OPTIONS','','','','','','');
INSERT INTO MEF_DOM3_CFT VALUES ('CAT','3','WARRANTS','WARRANTS','','','','','','');


See the domains table:




Load the staging table

The new staging table of financial instruments will be:

DROP TABLE FIN_STA CASCADE CONSTRAINTS;
CREATE TABLE FIN_STA
(
  FIN_COD VARCHAR2(12)
  ,CAT_COD VARCHAR2(3)
  ,GRP_COD VARCHAR2(3) 
  ,SUBGRP_COD VARCHAR2(3)   
);
INSERT INTO FIN_STA VALUES ('LU0108803940','4','3','1');
INSERT INTO FIN_STA VALUES ('IT0002693197','7','2','1');
INSERT INTO FIN_STA VALUES ('CH0014852120','3','2','1');
INSERT INTO FIN_STA VALUES ('IT0001119848','3','1','1');
INSERT INTO FIN_STA VALUES ('IT0000566056','1','1','1');
COMMIT;


As we can see, there's two combinations (1,1,1 and 3,1,1) not present in the domain.

Update the domain table

The update procedure of the domains is certainly more complicated than the previous case. The code is implemented only for demonstration, and surely can be implemented more efficiently.

create or replace procedure p_dom3(p_io_cod varchar2 default '%'
,p_column_cod varchar2 default '%') is
v_sql varchar2(4000);
v_dom_cod varchar2(61);
v_hds_txt varchar2(200);
v_bds_txt varchar2(200);
v_cod_txt varchar2(200);
v_column_cod varchar2(200);
type t_rc is ref cursor;
v_cur t_rc;
cr char(1) := chr(10);
v_list varchar2(200);
v_list2 varchar2(200);
v_c1 varchar2(200);
v_c2 varchar2(200);
v_c3 varchar2(200);
v_c4 varchar2(200);
v_c5 varchar2(200);
v_count number;
begin
     for r in (select a.STT_COD
               ,b.column_cod
             ,nvl(b.dom_cod,a.io_cod||'.'||b.column_cod) dom_cod
             ,b.dc1_cod,b.dc2_cod,b.dc3_cod,b.dc4_cod,b.dc5_cod
               from mef_io2_cft a
               inner join mef_sta3_cft b
               on (a.io_cod = b.io_cod
               and column_cod like p_column_cod
               and b.column_cod like '%_COD'
               and nvl(b.dom_cod,'*') <> 'NULL')
               where a.io_cod like p_io_cod) loop
         v_list := null;
         v_list2 := null;
         v_count := 0;
         if (r.dc1_cod is not null) then
            v_list := v_list||','||r.dc1_cod;
            v_list2 := v_list2||' and '||r.dc1_cod||' = dv1_cod';
            v_count := v_count+1;
         end if;
         if (r.dc2_cod is not null) then
            v_list := v_list||','||r.dc2_cod;
            v_list2 := v_list2||' and '||r.dc2_cod||' = dv2_cod';
            v_count := v_count+1;
         end if;
         if (r.dc3_cod is not null) then
            v_list := v_list||','||r.dc3_cod;
            v_list2 := v_list2||' and '||r.dc3_cod||' = dv3_cod';
            v_count := v_count+1;
         end if;
         if (r.dc4_cod is not null) then
            v_list := v_list||','||r.dc4_cod;
            v_list2 := v_list2||' and '||r.dc4_cod||' = dv4_cod';
            v_count := v_count+1;
         end if;
         if (r.dc5_cod is not null) then
            v_list := v_list||','||r.dc5_cod;
            v_list2 := v_list2||' and '||r.dc5_cod||' = dv5_cod';
            v_count := v_count+1;
         end if;              
         v_sql := 'select distinct '||''''||r.dom_cod||''''||cr||
                  '   ,a.'||r.column_cod||cr||
                  '   ,b.cod_txt'||cr||
                  '   ,nvl(b.hds_txt,''nd'') hds_txt '||cr||
                  '   ,nvl(b.bds_txt,''nd'') bds_txt '||cr||
                  v_list||cr||
                  'from '||r.STT_COD||' a '||cr||
                  'left outer join mef_dom3_cft b '||cr||
                  'on (a.'||r.column_cod||' = b.cod_txt '||cr||
                  '    and dom_cod='||''''||r.dom_cod||''''||cr||
                  v_list2||cr||
                  ')';
         dbms_output.put_line(v_sql);
         open v_cur for v_sql;
         loop
            if (v_count=0) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt;
            elsif (v_count=1) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1;
            elsif (v_count=2) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2;
            elsif (v_count=3) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2,v_c3;
            elsif (v_count=4) then
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2,v_c3,v_c4;
            else
               fetch v_cur into v_dom_cod,v_column_cod,v_cod_txt
               ,v_hds_txt,v_bds_txt,v_c1,v_c2,v_c3,v_c4,v_c5;
            end if;        
            exit when v_cur%notfound;
            if (v_cod_txt is null) then
               insert into mef_dom3_cft (dom_cod,cod_txt,hds_txt,bds_txt
               ,dv1_cod, dv2_cod,dv3_cod, dv4_cod, dv5_cod,dclist_txt)
               values(v_dom_cod,v_column_cod,v_hds_txt,v_bds_txt
               ,v_c1,v_c2,v_c3,v_c4,v_c5,substr(v_list,2));
            end if;
         end loop;
         close v_cur;
     end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dom3('FIN');


Compared to the version of stage 2, there is the need to acquire the name of the columns of the dependence codes. After the execution of the procedure, which, remember, is to be performed as post-processing of the Staging Area loading, the table of the domains will have the following values.




Creation and load of the dimension table

We create the new dimension table and load it  from the staging table.

DROP TABLE FIN_DIT CASCADE CONSTRAINTS;
CREATE TABLE FIN_DIT (
  FIN_KEY NUMBER
  ,FIN_COD VARCHAR2(12)
  ,CAT_COD VARCHAR2(30)
  ,CAT_BDS VARCHAR2(60)
  ,CAT_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (CAT_COD||' - '||CAT_BDS) VIRTUAL
  ,GRP_COD VARCHAR2(30)
  ,GRP_BDS VARCHAR2(60)
  ,GRP_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (GRP_COD||' - '||GRP_BDS) VIRTUAL
  ,SUBGRP_COD VARCHAR2(30)
  ,SUBGRP_BDS VARCHAR2(60)
  ,SUBGRP_XDS VARCHAR2(93) GENERATED ALWAYS
     AS (SUBGRP_COD||' - '||SUBGRP_BDS) VIRTUAL        
);
INSERT INTO FIN_DIT(FIN_KEY,FIN_COD,CAT_COD,GRP_COD,SUBGRP_COD)
SELECT ROWNUM,FIN_COD,CAT_COD,GRP_COD,SUBGRP_COD
FROM FIN_STA;
COMMIT;


We see the dimension table before the upgrade of the descriptions.



Update the dimension table

Even the update procedure will be more complex than the previous, in order to manage the dependencies.

create or replace procedure p_dit3(
   p_DIT_COD varchar2 default '%'
   ,p_col varchar2 default '%'
   ,p_force_cod number default 0) as
type t_rc is ref cursor;
v_cur t_rc;
v_sql varchar2(2000);  
v_sql2 varchar2(2000);    
v_value_cod varchar2(60);
v_column_cod varchar2(30);
v_dom_cod varchar2(61);
v_exec_cnt number;
v_where varchar2(80);
v_count number;
cr char(1) := chr(10);
v_c1 varchar2(200);
v_c2 varchar2(200);
v_c3 varchar2(200);
v_c4 varchar2(200);
v_c5 varchar2(200);
v_ditc1 varchar2(200);
v_ditc2 varchar2(200);
v_ditc3 varchar2(200);
v_ditc4 varchar2(200);
v_ditc5 varchar2(200);
v_list varchar2(2000);
v_list2 varchar2(2000);
begin
   for s in (select io_cod,DIT_COD
             from MEF_IO2_CFT
             where DIT_COD like p_DIT_COD) loop
      for r in (select b.column_cod
                ,replace(b.column_cod,'_COD','_BDS') column_bds
                ,nvl(dom_cod,s.io_cod||'.'||column_cod) dom_cod
                ,dc1_cod
                ,dc2_cod
                ,dc3_cod
                ,dc4_cod
                ,dc5_cod
                from cols a
                inner join MEF_STA3_CFT b
                on (a.column_name = b.column_cod
                    and b.io_cod = s.io_cod
                    and b.column_cod like '%_COD'
                    and nvl(b.dom_cod,'*') <> 'NULL')
                where a.table_name = s.DIT_COD
                and a.column_name like p_col
                ) loop
            v_list := null;
            v_list2 := null;
            v_count := 0;
            dbms_output.put_line(r.column_bds);        
            if (r.dc1_cod is not null) then
               v_list := v_list||','||r.dc1_cod;
               v_count := v_count+1;
            end if;
            if (r.dc2_cod is not null) then
               v_list := v_list||','||r.dc2_cod;
               v_count := v_count+1;
            end if;
            if (r.dc3_cod is not null) then
               v_list := v_list||','||r.dc3_cod;
               v_count := v_count+1;
            end if;
            if (r.dc4_cod is not null) then
               v_list := v_list||','||r.dc4_cod;
               v_count := v_count+1;
            end if;
            if (r.dc5_cod is not null) then
               v_list := v_list||','||r.dc5_cod;
               v_count := v_count+1;
            end if;                 
            if (p_force_cod = 0) then
               v_where := ' where '||r.column_bds||' is null';
            else
               v_where := null;
            end if;
            dbms_output.put_line(v_count);   
            v_sql := 'select distinct '||r.column_cod||' value_cod'||cr||
                     v_list||' from '||s.DIT_COD||v_where;
            dbms_output.put_line(v_sql);
            open v_cur for v_sql;
            loop
               if (v_count=0) then
                  fetch v_cur into v_value_cod;
                  v_list2 := null;
               elsif (v_count=1) then
                  fetch v_cur into v_value_cod,v_ditc1;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod';
               elsif (v_count=2) then
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod';
               elsif (v_count=3) then
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2,v_ditc3;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod'||
                             ' and d.'||r.dc3_cod||' = s.dv3_cod';
               elsif (v_count=4) then
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2,v_ditc3,v_ditc4;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod'||
                             ' and d.'||r.dc3_cod||' = s.dv3_cod'||
                             ' and d.'||r.dc4_cod||' = s.dv4_cod';
               else
                  fetch v_cur into v_value_cod,v_ditc1,v_ditc2,v_ditc3,v_ditc4,v_ditc5;
                  v_list2 := ' and d.'||r.dc1_cod||' = s.dv1_cod'||
                             ' and d.'||r.dc2_cod||' = s.dv2_cod'||
                             ' and d.'||r.dc3_cod||' = s.dv3_cod'||
                             ' and d.'||r.dc4_cod||' = s.dv4_cod'||
                             ' and d.'||r.dc5_cod||' = s.dv5_cod';
               end if;
               exit when v_cur%notfound;   
               v_sql2 :=
               'update '||s.DIT_COD||' d'||cr||
               'set ('||r.column_bds||') = ('||cr||'select bds_txt '||cr||
               'from mef_dom3_cft s where s.dom_cod = '||''''||r.dom_cod||''''||
               ' and s.cod_txt = :1 '||
               v_list2||cr||
               ')'||cr||
               'where '||r.column_cod||' = :2 '||cr||
               ' and exists (select 1 from MEF_DOM3_CFT s '||
               'where s.dom_cod = '||''''||r.dom_cod||''''||' and s.cod_txt = :3 '||
               v_list2||')';
               dbms_output.put_line(v_sql2);
               execute immediate v_sql2 using v_value_cod,v_value_cod,v_value_cod;
            end loop;
            close v_cur;
      end loop;
   end loop;
   commit;
end;
/
sho errors
set serveroutput on
exec p_dit3('FIN_DIT');


After the execution of the procedure, the table of financial instruments will be: 




The first three lines have combinations of codes that are present in the table of the domains, so they have been updated the corresponding descriptions for all codes.
The fourth line has only updated the description of the category, because the category has no other dependencies.
The fifth line has all descriptions indefinite because it is a combination of unknown dependencies.

The descriptions management - Stage 4

Many of you will be amazed to see represented a further stage 4. It seems incredible, but we have not again said all about the descriptions management. We summarize the analysis that we have done so far.
In Phase 1 we highlighted the problem, we have introduced the domains table, a configuration table necessary to operate, we have defined a naming convention.
In Phase 2, we have automated the process, we have introduced other configuration tables in support of a dynamic management of the descriptions. We have shown how to pre-load the domains table, before to receive that information from the source systems.
In Phase 3, we have handled the dependence between codes.
What is still missing?
The answer depends on the international feature of your Data Warehouse. If you are sure that the Data Warehouse is internal and used only by users of your native language, you can stop at step 3, otherwise you have to take into account the multilanguage management.
I will not make a complete example as the other phases. The action is a change to the domains table, adding the language code field. (and some technical columns that is always useful to have). So you have to add language code as a key field in the procedures that have been implemented. With the multilanguage management , the domains table will increase volume as many times as the languages that you add.

The new domains table

The domains table, that we see below, can be considered complete, and I advise you to create it immediately with the code language even though at the time will not be used.

DROP TABLE MEF_DOM4_CFT CASCADE CONSTRAINTS;
CREATE TABLE MEF_DOM4_CFT (
  DOM_COD  VARCHAR2(61) NOT NULL
  ,COD_TXT  VARCHAR2(60) NOT NULL
  ,HDS_TXT  VARCHAR2(255)
  ,BDS_TXT  VARCHAR2(255)
  ,DV1_COD    VARCHAR2(30 BYTE)
  ,DV2_COD    VARCHAR2(30 BYTE)
  ,DV3_COD    VARCHAR2(30 BYTE)
  ,DV4_COD    VARCHAR2(30 BYTE)
  ,DV5_COD    VARCHAR2(30 BYTE)
  ,DCLIST_TXT   VARCHAR2(160 BYTE)
  ,LANG_COD VARCHAR2(30 BYTE)
  ,INS_DTS     DATE
  ,STAMP_DTS   DATE  DEFAULT SYSDATE
  ,CTRL_TXT    VARCHAR2(30)
);

Conclusion

As we have seen, the descriptions management is not a simple topic, has many implications, employ many people and it is extremely important because it is not a technical detail, but involves the way the end user sees the data. Determining factor for the success of the Data Warehouse.
I hope that what I have written can serve, for all the people involved in the design of a Data Warehouse, to become aware of the problem. The "descriptions management" activity must become a fixed point in the Gantt of the project.