Saturday, April 20, 2013

Data Warehouse: Naming convention techniques


The naming convention is a key component of any IT project. The purpose of this article is to suggest a standard for a practical and effective Data Warehouse design.

 

1 - Introduction

Usually, the naming convention is the method by which you decide how to name the various entities of the OLTP (Online Transaction Processing) or DSS (Decision Support System) system that you are designing.  In this article the term "entity" will be synonymous with "structure": a generic term identifying the data structures typical of a relational Data Warehouse, such as tables, views, materialized views, synonyms, etc.
The application of the method should not be treated as a separate issue, restricted to the particular entity being analyzed and detached from the global system architecture.  On the contrary, it should produce a name able to immediately recall all the semantic entities involved, whether they belong to a Data Warehouse or to an operational system.
This is the keystone of the method, because defining a naming convention for an entity does not simply give it a name: it defines an architectural context.  The names selected will be used in discussions, meetings, documents, deliverables and relational schemas, and will accompany our working life for several months.
Very soon the relationships between the various entities, and between entities and programs, will prevent any change to the naming convention, and bad choices, or no choice at all, will have unpleasant consequences for the economy of the entire project.
The examples shown in this article focus mainly on the data tables, because practically all the work that takes place within the data loading process is aimed at filling these entities.
In these cases the effectiveness of the naming convention becomes clear: just think of the time needed to acquire knowledge of a system in which all the entities, perhaps hundreds, do not follow a standard, or follow it only in part or in an ambiguous way. The purpose of these techniques is to avoid such situations, for us and for our partners.

2 - Guidelines

The naming convention must be established early in the life-cycle of a project. You cannot start the physical design and development phases without a naming convention for the entities that constitute the system.  No "create ..." will be possible without these decisions.
The guidelines that form the basis of the techniques shown here are just three:

 1 – The logical partitioning
 2 – The coding rule
 3 – The classification

The logical partitioning allows you to define a hierarchy among the entities involved, the classification allows you to associate a physical connotation with the entities, and the coding rule standardizes the names derived from partitioning and classification.  Since the best way to learn these concepts is to see them applied in practice, the three guidelines are detailed in the next sections, where we will see how they can be applied to a real project.  Let's now focus our attention on the entities.

3 – Naming convention for the entities

The paradigm that underlies the naming convention for the entities can be summarized in the following formula:

<entity name> = <project code>_<area code>_<section code>_<logical name>_<type code>

To understand the formula, let's use the first guideline, the logical partitioning. Calling it a guideline is simplistic, to say the least: it is actually a moment of reflection that can only be faced after forming a global vision of the system.
The term "logical partitioning" means the decomposition of the system into its logical components, called areas and sections. It is an approach very similar to a drill-down: from an overview towards a particular view.  It starts with the project, which is the root of our logic tree; the project is divided into areas, and each area is divided into sections. Figure 1 shows an example of logical partitioning.


Figure 1


Since we are talking about a Data Warehouse, the areas, the sections and the way the logical partitioning is conceived will be based on the Ralph Kimball philosophy, an approach that I have used successfully for several years.  Let's look at all five components of the <entity name>.


3.1 - The project code

The first decision to be addressed in a project is to assign it a code.  It may seem an obvious statement, but the identification of a project code is a precondition for leading the project in the right direction.  The choice of the project code has immediate and important consequences that impact at different levels.  Let's see.

Architectural level

The first level impacted is the logical/physical architecture of the Data Warehouse.  After choosing the project code, each entity will be named with that project code followed by an underscore "_".
This means that each table, view, materialized view, synonym, program or, more generally, any physical structure created must start with <project code>.
Wherever the physical structures are allocated, we have established a unique way to identify all the objects of our Data Warehouse (obviously there must not be another project with the same code).  A selection from the database dictionary will allow us to identify exactly the objects that concern us.
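For example, assuming an Oracle database and its standard dictionary views, a minimal sketch of such a selection might be the following (the ESCAPE clause makes the underscore a literal character rather than a wildcard):

    -- List every object of the EDW project visible in the current schema
    SELECT object_name, object_type
      FROM user_objects
     WHERE object_name LIKE 'EDW\_%' ESCAPE '\'
     ORDER BY object_type, object_name;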

File system level

The second level impacted is the physical file system.  All the project directories, into which the flat files coming from the source systems arrive, should also start with "<project code>_".  Other examples might be the directories of the database files, of the creation scripts, etc.

Documentation level

The third level impacted concerns documentation and deliverables.  Using the same technique for them, we can be sure that our searches will always return only the documents associated with our project.

The rule of three

At this point we need a coding rule.  If, for example, we have named our project "Intelligent Enterprise Data Warehouse", it is unthinkable to use 38 characters as the project code.  My experience designing several Data Warehouses has given me a very clear indication: it can be argued that the best rule for identifying the project code is the "rule of three".
This rule states that the project code must be three characters long. In our example, a good acronym, and therefore the project code, will be "EDW". If we are designing a Data Warehouse for the Exploration and Production division of an oil company, a good project code will be "EPD".
As you know, three is the perfect number, and in my experience this perfection has clearly manifested itself in all its aspects: two characters cause ambiguity and misunderstanding, four or more characters are too many.  Since three is the perfect number, the rule will also apply to the other codes.  So let's start with the project code "EDW".

3.2 - The area code

The areas are a way to make a first logical partitioning of the objects of our Data Warehouse. Breaking all the objects into areas of membership allows us to group into a single scope all the items with a similar logical/temporal affinity.
To understand the areas, we must recall some principles of data warehousing, in particular some aspects of the ETL (Extract, Transform and Load) process.  These principles will be simplified as much as possible to make them understandable to everyone.  Obviously there is a very extensive literature on this subject, which I suggest for further investigation.

3.2.1 - The ETL process

The loading process of the tables of a Data Warehouse is typically a workflow like the following.
External systems produce source data files. Such systems may be of any type (OLTP, other Data Warehouses, Operational Data Stores, etc.) and these files can be of any format (ASCII flat files of fixed length or with terminators, XML, etc.).  The data in these flat files are loaded into work tables, called staging tables. These tables are almost always structured as an identical picture of the flat file.
After the loading of the staging tables, the process continues with the loading of the other tables, the dimensional tables and the basic fact tables, which I like to call "level 0". This step is responsible for performing all the cleaning and checking work on the data of the staging tables, defining the artificial keys for the fact tables, and then loading the data according to the "dimensional model". By this term we mean that the data structure is composed of a fact table and the dimensional tables that make up what is called the "star schema" (excuse me, Ralph, for the simplification...).
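To make the idea concrete, here is a minimal, hypothetical star-schema sketch in Oracle SQL (all table and column names are illustrative; the naming convention for these entities is developed in the following sections):

    -- A dimension table with an artificial (surrogate) key...
    CREATE TABLE TIME_DIM (
      time_key   NUMBER       PRIMARY KEY,  -- artificial key defined by the ETL process
      day_date   DATE         NOT NULL,
      month_code VARCHAR2(6)  NOT NULL
    );

    -- ...and the fact table referencing it, at transactional granularity
    CREATE TABLE SALES_FACT (
      time_key    NUMBER NOT NULL REFERENCES TIME_DIM,
      store_key   NUMBER NOT NULL,  -- FK to a store dimension, omitted here
      product_key NUMBER NOT NULL,  -- FK to a product dimension, omitted here
      quantity    NUMBER,
      amount      NUMBER(12,2)
    );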
After loading the fact tables of level 0, aggregate tables may be loaded (to improve the performance of data reads from the Business Intelligence tools), as well as higher-level fact tables (1, 2, ... N). It depends on the complexity and needs of the Data Warehouse.
A typical example in the large retail market are the level 0 fact tables that store the sales data of the stores and the costs associated with the items sold.  From these fact tables you can load a level 1 fact table where some dimensions of analysis are lost, but sales and costs can be compared to get the net or gross margin.
The whole process that loads the data arriving from the external systems, up to the loading of all the other structures, is called the ETL process.  There may be variations on this workflow, but it can be said that nearly all Data Warehouse projects adapt to this process.

First step - the area partitioning

After this brief interlude, let's return to the open discussion, that is, the connection of the ETL process with our naming convention.  The logical partitioning of the entities is in fact precisely focused on the process described above.  We can identify, for example, the following logical areas:
  • Staging Area: in this area we have all the entities that receive the flat data files coming from the external systems.
  • Level 0 Data Mart Area: in this area there are all the entities which contain the data cleaned and dimensionalized by the load process.
  • Level 1 Data Mart Area (and subsequent levels): in this area there are the higher-level entities that are derived from the entities of level 0.
  • Common Area: in this area there are all the dimension tables and all those entities that are common to the whole Data Warehouse, such as configuration tables, service tables, etc.
Of course you can also identify other areas.  For example, we can identify an area called "Operational Data Store", where all the data structures are normalized and not organized according to the dimensional model, or an area of infrastructural information, linked to the logging of programs and schedules.  In short, there are several possibilities, and we must think about them carefully.

Second step – the coding rule

The next step is then to codify, by the "rule of three", all the areas that we have identified.  Using very intuitive acronyms, like STA, DM0, DM1 and COM, we obtain what is indicated in Figure 1.
As you can see, the effects of the naming convention are starting to take shape, and already at this level, still a high one, you can easily identify all the entities belonging to a common logic.
If you want to know the names of all the entities that collect data from external flat files, it is enough to extract, from the RDBMS data dictionary, all the objects whose name begins with "EDW_STA_".  If you want to identify all the entities of the level 1 data marts, search for all the objects whose name begins with "EDW_DM1_".
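Again assuming Oracle, the two searches might be sketched as follows:

    -- All the Staging Area entities, fed by the external flat files
    SELECT object_name, object_type
      FROM user_objects
     WHERE object_name LIKE 'EDW\_STA\_%' ESCAPE '\';

    -- All the level 1 data mart entities
    SELECT object_name, object_type
      FROM user_objects
     WHERE object_name LIKE 'EDW\_DM1\_%' ESCAPE '\';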
In reality, this initial partitioning operates at too high a level.  In projects of some complexity, where the number of entities is high, the areas should be further partitioned into sections, or sub-areas.  Each area will then be decomposed into smaller elements that deepen and enrich the knowledge of the Data Warehouse structures.

3.3 - The section code (sub-area)

The sections allow you to logically group all the entities that belong to the different areas. Each area has specific characteristics, so the process of partitioning an area into sections is specific to each area.
Obviously, every designer can define the sections as they see fit, based on their experience and creativity.  The sections that will now be described, for the areas defined above, are based on my personal experience and are only a suggestion.  Let's look at them in more detail.

3.3.1 - The sections of the Staging Area

Taking the example of the flat data files feeding the Data Warehouse, the producing subsystems are often different, each generating different data.  If, for example, we have three subsystems, such as sales, accounting and logistical movements, the approach can be the following:
codify, with the rule of three, the external subsystems: for example, the selling stores with STO (Stores), the logistics movements with MOV (Movements) and the accounts with PRL (Profit and Loss). Each entity linked to these subsystems will carry the corresponding code within its name.
If you want to know the names of all the Staging Area tables loaded from the logistics subsystem, simply extract from the database data dictionary all the entities that begin with "EDW_STA_MOV_".

3.3.2 - The sections of the level 0 Data Mart area

This is a key area because it hosts all the basic fact tables, those with the most detailed granularity of information.  The best way to partition this area into sections is obviously to identify, with the rule of three, the different data marts that comprise it.
For example, if a data mart contains all the sales data, a good code would be SLS or SAL (short for Sales); if it contains cost data, CST or COS (short for Costs).
Since the concept of the data mart is essential in a Data Warehouse project, I prefer to open another parenthesis about the principles of data warehousing and give a more detailed definition, for ease of understanding of the naming convention that uses it.

The Data Mart

The data mart is a "logical subset of the complete Data Warehouse", as stated by R. Kimball in his book [1].  Each data mart is usually organized around a single business process, and the union of all the data marts is the Data Warehouse.
The sections of the DM0 area, then, will be the specific business processes that you are deploying. Someone may think that the fact table and the data mart are conceptually the same thing.  The data mart is a logical concept; the fact table is only a part of the physical implementation of that logical concept.  In reality, the data mart consists of a fact table and all its aggregate tables (materialized views in Oracle) that are created to increase the performance of the reporting associated with that business area.  Obviously, the aggregate structures are not mandatory, but they become a necessity for a VLDW (Very Large Data Warehouse), where the amount of data processed may exceed terabytes.
A data mart that contains only the fact table, perhaps with several gigabytes of data, is not a data mart: it is a problem.
Because in a Data Warehouse the number of aggregate tables tends to be high, the naming convention helps to immediately identify all the data structures associated with a specific business process.

3.3.3 - The sections of the level 1 Data Mart area

In this area there are the data marts whose content is not immediately derivable from a single level 0 data mart, but consists of data taken from several lower-level data marts, or obtained by processes that do more than a trivial aggregate calculation. Here are two examples to clarify the concept.
As mentioned above, in the large retail market it is typical to have a Profitability data mart, which combines in a single data structure the sales information and the various cost types, allowing you to obtain the net or gross margin.
We can think of Profitability as a level 1 data mart, loaded from the two level 0 data marts indicated above, i.e. EDW_DM0_SLS and EDW_DM0_CST.
You can then encode this Profitability data mart as EDW_DM1_PRF or EDW_DM1_PRO.
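Purely as an illustration, and assuming hypothetical, simplified structures in which both level 0 data marts already carry a month code and a product key, the level 1 load might be a statement of this kind:

    -- Hypothetical load of the level 1 Profitability data mart
    -- from the two level 0 data marts (all columns are illustrative)
    INSERT INTO EDW_DM1_PRF (month_code, product_key, sales_amount, cost_amount)
    SELECT s.month_code, s.product_key, s.sales_amount, c.cost_amount
      FROM EDW_DM0_SLS s
      JOIN EDW_DM0_CST c
        ON c.month_code  = s.month_code
       AND c.product_key = s.product_key;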
Another example is related to aggregate data that cannot be obtained with simple sums or counts (in that case I would use materialized views).  If a user requirement is to see the monthly data together with the same month of the previous year, both current and progressive, a design choice may be to load, with the ETL process, a level 1 data mart with these aggregate data already calculated, rather than entrusting this task to the reporting tools.
You can encode this data mart as EDW_DM1_MPY (M for Monthly, P for Previous year data, Y for Year-to-date progressive).

3.3.4 - The sections of the Common Area

In this area we place all the entities that are common to the entire project and not connected to one of the areas described above.  Typical examples of sections are the following:

CDI section (Conformed Dimension).

The dimensions of analysis present in the Data Warehouse are not specific to a single data mart; they are shared by many data marts. Just think of the time dimension, present in virtually every data mart of any level.  They are therefore common structures that can be grouped into one section.

CFA section (Conformed Fact).

A concept similar to the previous one, but applied to fact tables. An example is the currency conversion table, which is a typical fact table but is shared and usable by the whole Data Warehouse.

LOG section (Logging). 

All the structures associated with the logging of the ETL process.

MTD section (metadata). 

All the metadata structures of the Data Warehouse.

Figure 2 shows an example of sections, completing our design with what has just been described.


Figure 2


At this point our partitioning process can be considered complete.  Additional sub-sections would add little knowledge value, while further lengthening the final names of the various entities.  Indeed, in very simple projects, partitioning at the area level may be enough. Let's now explore the fourth component of the <entity name>: the logical name.

3.4 - The logical name

After the codes just described, we have to identify a logical name, preferably a short one.  Returning to the source data files that load a Staging Area table: if the Profit and Loss subsystem provides us with the customer information, its logical name will be "CUST" and its complete name will be:

EDW_STA_PRL_CUST

You can see the completeness of the information that comes from the name of this entity.  In a few characters we know that these are customer data, provided by the Profit and Loss subsystem, loaded into the Staging Area of the Intelligent Enterprise Data Warehouse.
It is not necessary to submit the logical name to the rule of three, and in some cases it may even be omitted.  For example, in the data mart areas the prefix is already sufficiently explanatory, and the addition of a logical name would not add any value to the understanding of the content of the entity.  Indeed, long before the deployment to production, the Profitability data mart exemplified above will be "affectionately" called "PRO", and everyone will recognize the Profitability data mart by the three characters of the logical name alone.

3.5 - The type code

Now you might think we have reached our initial goal.  The name obtained by following the guidelines of the naming convention is semantically significant, and I believe that if we had simply called the entity CUSTOMER, we would have lost knowledge and context.
Unfortunately, it is not quite so. The entity EDW_STA_PRL_CUST still does not tell us everything about its own nature.
If we carefully review the semantics that the naming convention has given to the name EDW_STA_PRL_CUST, we find that there are still some cognitive deficiencies.
It is quite clear that these customer data are provided by the Profit and Loss subsystem, but nowhere does the name show that this is a table of data. The technique of classification can help us fill this gap.
Among the various entities present in the section of an area there may be tables, but also views (a view is a selection of certain rows and/or certain columns of a table), synonyms (i.e. table aliases), sequences, etc.
This information is important because it saves us from having to access the database data dictionary.  This fact should not be underestimated, because in many situations this information, which seems strongly tied to the "physical" side of the entity, also has a substantial logical content.
Consider, for example, the case of fact tables and materialized views (the Oracle summary tables).  Fact tables are always present in a Data Warehouse project.  Materialized views are aggregate tables that are typically generated from fact tables, aggregating the data according to certain dimensions of analysis.
It is important to distinguish the two types of entities immediately, because their content is different, and their feeding processes can also be different and performed at different times.
Inside the level 0 data mart area there may be a section that identifies the sales data mart, but within this section there may be dozens of entities, and how many and which ones are the "aggregates", i.e. the materialized views, is important information for us and for the database administrator.
Through classification this information becomes immediately apparent, not only for a specific section or area, but across all the areas.  If we decide to typify all the materialized views derived from fact tables with the code FMV (Fact Materialized View), all the entities whose name ends with FMV will surely be "aggregate" tables.
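For example, assuming hypothetical sales structures named with the _FAT and _DIT type codes defined in the list below, a monthly aggregate and the dictionary search that finds all such aggregates might look like this:

    -- Hypothetical monthly aggregate built on a sales fact table
    CREATE MATERIALIZED VIEW EDW_DM0_SLS_MTH_FMV
      ENABLE QUERY REWRITE
    AS
    SELECT t.month_code, f.store_key, SUM(f.amount) AS amount
      FROM EDW_DM0_SLS_FAT f
      JOIN EDW_COM_CDI_TIME_DIT t ON t.time_key = f.time_key
     GROUP BY t.month_code, f.store_key;

    -- Find every "aggregate" table across the whole project
    SELECT object_name
      FROM user_objects
     WHERE object_name LIKE 'EDW\_%\_FMV' ESCAPE '\';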
What follows is a fairly exhaustive list of codes and descriptions that will certainly help us in the classification process. 
  • DAT - Data Table.  A general data table, typical of staging areas.  These are basically tables that contain metrics, rates or other measures, with their natural source-system codes not yet transformed into artificial keys.
  • DST - Descriptive Table.  Like the previous one, it refers to tables typical of staging areas: for example, the customer codes with their natural attributes, the product codes, etc. In a word, all the staging tables that will turn into dimensional tables.
  • EDT - External Data Table.  Typical of the staging areas. The "external table" is a very useful Oracle feature by which a flat file is seen, in effect, as a table, and can then be easily managed with the SQL language (a sketch follows this list).
  • EXT - Exception Table.  A table that contains all the rows that have failed the integrity checks on another table.
  • FAT - Fact Table.  Typically contains facts at the maximum transactional detail, such as the individual movements of a current account or the individual lines of the receipts issued by a supermarket checkout.
  • FFT - Factless Fact Table.  A fact table without numeric information.  In these tables, the mere presence of a row certifies the occurrence of an event, such as the presence of a student in a university course.
  • PSF - Periodic Snapshot Fact table.  This type of fact table summarizes many facts in one row describing the situation at the end of a period of time.  A typical example is the bank account: it is possible to construct monthly summaries fed by the basic fact table that holds the details of each movement performed on the account.
  • ASF - Accumulating Snapshot Fact table.  This type of table describes processes that have a beginning and an end not too deferred in time, and where a dimension is repeated several times.  A typical example is the story of an order or an invoice, all summarized in one row; or a list of dates (linked to the time dimension) representing all the state changes of an object or a process.
  • FAV - Fact View.  A partial view of the data in one or more fact tables.
  • FMV - Fact Materialized View.  A materialized view built on one or more fact tables.
  • DIT - Dimensional Table.  A table used for dimensional analysis.
  • DIV - Dimensional View.  A view associated with a dimensional table.
  • DIM - Dimension.  For Oracle, the dimension is not the dimension table but the logical structure that describes its content in terms of hierarchical levels and of the attributes associated with the various levels.
  • TMT - Temporary Table.  Tables that contain temporary data used in complex ETL processes.
  • CFT - Configuration Table.  A generic table containing configuration data.
  • DTE - Data Entry Table.  A table that can be updated by the user through a graphical interface.
  • AWS - Analytic Workspace.  An OLAP cube data structure, typical of multidimensional databases.
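As an illustration of the EDT type mentioned above, here is a hypothetical Oracle external table for the customer flat file of the Profit and Loss subsystem (the directory object EDW_STA_DIR and the file layout are assumptions):

    -- Hypothetical external table over the customer flat file;
    -- EDW_STA_DIR is an assumed Oracle directory object pointing to the landing path
    CREATE TABLE EDW_STA_PRL_CUST_EDT (
      cust_code VARCHAR2(10),
      cust_name VARCHAR2(80)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_LOADER
      DEFAULT DIRECTORY EDW_STA_DIR
      ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        FIELDS TERMINATED BY ';'
      )
      LOCATION ('edw_sta_prl_cust.dat')
    )
    REJECT LIMIT UNLIMITED;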

Conclusions

We have reached the end of this brief journey through naming convention techniques. What has been presented certainly does not exhaust the possible applications of such techniques, nor does it intend to be a dogma to be followed.  Each of us, based on our own experience, can build our own naming convention according to our needs and intuition.  In fact, it does not matter which choices you make to partition the system, or how its components are classified: what matters is applying a standardization method.  An effective naming convention provides all the tools needed to control the Data Warehouse, in terms of knowledge, management and maintenance.



Bibliography
[1] R. Kimball, "The Data Warehouse Lifecycle Toolkit", Wiley, 1998