Sunday, July 13, 2014

Data Warehouse: Naming convention techniques (part 2)

Introduction

In part 1, the Naming Convention techniques were applied to tables, which are certainly the basic entities of an information system. We defined names for these entities in their simplest form (table), in their aggregate form (materialized view or summary table) and in their logical form (view).
It was emphasized that these techniques can be applied to any logical or physical entity of a Data Warehouse. So I wish to complete these thoughts with three targets in mind:
 
Completeness: Tables are basic but, alone, they do not constitute a Data Warehouse. There must be access rights to view them, programs to load them, indexes to speed up access to them, and constraints to ensure data integrity. Programs, rights, indexes and constraints must also be created according to the Naming Convention. Tables are made of attributes, and attributes have names too, so we will discuss them as well.
Pragmatism: Only by seeing the techniques applied to a real case can we recognize their usefulness, so we will examine and name all the other entities involved, using the sample Data Warehouse.
Knowledge: Some of the entities to be named are specific to Oracle, and this is a good opportunity to describe them briefly.

The choices made for the Naming Convention are only guidelines. They are not dogma. The convention can be discussed and changed according to our needs and to our particular view of the system. The main objective is to draw attention to the importance and usefulness of the Naming Convention.

Another point I wish to emphasize is that the convention is "Database Administrator oriented" and not "Business oriented". This means that the names chosen, for example for the tables, are physical names, and they are the names that only the DBA sees. The "rest of the world" should not see those names, but rather the "logical" names exposed through synonyms and/or views.

The users of the Naming Convention

Based on some useful questions I received, I want to clarify this point. Take for example the EDW_COM_CDI_CUST_DIT entity. This entity represents the customers (CUST) dimension table (DIT) in the conformed dimensions section (CDI) of the area common to all entities (COM) of our Data Warehouse (EDW). With the Naming Convention, the content of this entity is clear to any DBA, even to someone who inherits the management of a Data Warehouse they do not know (try to imagine the table had been named A01DWCST).
The EDW_COM_CDI_CUST_DIT entity is seen and handled only by the DBA. In my view, the only other users who may see the entity (and only what is needed, which usually means facts and dimensions) are the business-area builders, by means of an administration module that is part of the front-end tool (e.g. Oracle Business Intelligence).
These users do not need to see the physical name EDW_COM_CDI_CUST_DIT, but a view or synonym (logical name) such as CUSTOMER_DIT. If we had the foresight to make the last two components of the name unique, the rest of the world will see an entity name that is much shorter, simpler and closer to its business logic.
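
The mapping from physical to logical name described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logical name is taken to be exactly the last two components of the physical name (so it yields CUST_DIT rather than the friendlier CUSTOMER_DIT), and the owner schema EDW_OWNER is a hypothetical name.

```python
def logical_name(physical_name: str) -> str:
    """Keep only the last two components, e.g. EDW_COM_CDI_CUST_DIT -> CUST_DIT."""
    parts = physical_name.split("_")
    if len(parts) < 3:
        raise ValueError(f"not a structured physical name: {physical_name!r}")
    return "_".join(parts[-2:])


def synonym_ddl(physical_name: str, owner: str = "EDW_OWNER") -> str:
    """Build the statement that exposes the logical name.

    EDW_OWNER is a hypothetical schema name used only for illustration.
    """
    return (
        f"CREATE OR REPLACE SYNONYM {logical_name(physical_name)} "
        f"FOR {owner}.{physical_name}"
    )


if __name__ == "__main__":
    print(synonym_ddl("EDW_COM_CDI_CUST_DIT"))
```

The approach only works if the last two components are unique across the whole Data Warehouse, which is exactly the foresight mentioned above.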

The Naming Convention of the table attributes

As we know from relational database theory, table attributes are the specific characteristics of the various entities that define the logical model. Since we have defined a Naming Convention for the entities, it is necessary to define one for the attributes as well.
The paradigm that underlies the Naming Convention of table attributes can be summarized in the following formula:

       <attribute name> = <logical name>_<type code>

For attributes the name is much simpler, because their logical context is already structurally defined by the table they belong to. In the data dictionary of an RDBMS such as Oracle, you can locate all the attributes together with their tables. An effective Naming Convention is therefore very useful when searching for all the attributes that share certain characteristics. Here are some examples from personal experience.
 
In a Data Warehouse for a bank, the need arose to change the size of the currency amount fields from two to six decimal places. The need was clearly linked to rounding problems. Hundreds of tables with different columns were involved in the change. Having adopted the Naming Convention of typing all currency amount columns as <logical name>_AMT was decisive. It allowed us to generate a script that, by reading the data dictionary tables, dynamically changed the structure of all the affected columns, and only those.
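
A script of this kind could look roughly like the Python sketch below. It is not the script used in that project: it assumes the rows have already been fetched from a dictionary view such as USER_TAB_COLUMNS, it assumes the precision should grow by the same number of digits as the scale so that the integer part is preserved, and it only generates the statements. Whether Oracle accepts such a change on populated columns should be verified on your release.

```python
# Each row is (table_name, column_name, data_precision, data_scale), for example
# the result of a query like:
#   SELECT table_name, column_name, data_precision, data_scale
#   FROM user_tab_columns
#   WHERE data_type = 'NUMBER'

def amount_column_ddl(columns, new_scale=6):
    """Generate ALTER TABLE statements for every *_AMT column below new_scale."""
    statements = []
    for table, column, precision, scale in columns:
        if not column.endswith("_AMT"):
            continue  # the type code tells us this is not a currency amount
        if precision is None or scale is None or scale >= new_scale:
            continue  # unconstrained NUMBER, or already wide enough
        # Assumption: widen the precision too, so integer digits are kept.
        new_precision = min(38, precision + (new_scale - scale))
        statements.append(
            f"ALTER TABLE {table} MODIFY ({column} NUMBER({new_precision},{new_scale}))"
        )
    return statements


if __name__ == "__main__":
    # Hypothetical sample rows, for illustration only.
    sample = [
        ("EDW_COM_CDI_CUST_DIT", "CREDIT_LIMIT_AMT", 12, 2),
        ("EDW_COM_CDI_CUST_DIT", "CUST_DSC", None, None),
        ("SOME_FACT_TABLE", "SALES_AMT", 15, 2),
        ("SOME_FACT_TABLE", "SALES_QTY", 10, 0),
    ]
    for statement in amount_column_ddl(sample):
        print(statement)
```

The point is that the filter is a single test on the column suffix; without the convention, the list of affected columns would have to be compiled by hand.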
 
Some ETL and reporting tools can automatically identify all the descriptive columns of alphanumeric codes that will be displayed in the output: the tool's interface uses a "like" clause to locate the fields. If your standard is to name all the code description columns "*_DSC", you can take advantage of this feature and will not need to specify the fields one by one. Here are some examples of <type code>:

COD - Code - Alphanumeric code: the classic code that is associated with a description and a domain. It may be a customer number, an order type, an account status, etc. I suggest treating all numeric codes as alphanumeric codes.
DSC - Description - The description associated with a code, used in the reporting tools and in the front end. It is a design choice whether a single description is sufficient or whether to define a short description (SDS) and a long description (LDS). Users often ask for the concatenation of the code with the short description (CSD, code plus short description).
AMT - Amount - Always indicates an amount.
QTY - Quantity - Indicates a quantity in pieces, weight or some other unit of measure.
KEY - Key - Always indicates an artificial key. A column of this type must exist in every dimension table, with a corresponding column in the fact tables.
DTS - Date Stamp - Indicates a date in the Oracle format, which includes the time (hours, minutes, seconds).
FLG - Flag - Always a binary field, i.e. it can only be 0 or 1.
TXT - Text - A generic text field.
YMD - Year Month Day - A day in the YYYYMMDD format.
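
As a small illustration of how the type codes can be checked mechanically, the following Python sketch splits an attribute name into its logical name and type code and rejects unknown codes. The column names in the usage example are hypothetical.

```python
# Type codes listed above, including the description variants SDS, LDS and CSD.
TYPE_CODES = {
    "COD", "DSC", "SDS", "LDS", "CSD", "AMT",
    "QTY", "KEY", "DTS", "FLG", "TXT", "YMD",
}


def split_attribute(name: str) -> tuple:
    """Split <logical name>_<type code> and validate the type code."""
    logical, separator, type_code = name.rpartition("_")
    if not separator or not logical:
        raise ValueError(f"{name!r} has no <logical name>_<type code> structure")
    if type_code not in TYPE_CODES:
        raise ValueError(f"{name!r} ends with unknown type code {type_code!r}")
    return logical, type_code


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    for column in ("CUST_COD", "ORDER_START_DTS", "SALES_AMT"):
        print(column, "->", split_attribute(column))
```

A check like this can be run against the data dictionary after every release to catch columns that drift away from the convention.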

The other entities of a Data Warehouse

Identifying all the main entities or structures of a Data Warehouse is not an easy job, bearing in mind that Oracle alone has over 30 different types of structures.
Each RDBMS has its own requirements and peculiarities, and it would be long-winded and pointless to try to give a Naming Convention for all of them. So we will focus on the main entities, which are almost always present, and leave it to the reader to apply the same techniques to the remaining ones. Here is the list of entities covered by the next guidelines.

• Index
• Tablespace
• Datafile
• Integrity Constraint
• Role
• Package

The Naming Convention of the indexes

As the Naming Convention is linked to the type of index, I will give a brief overview of the most common types of indexes. They typically cover 90% of the needs of a Data Warehouse.
As everyone knows, indexes are data structures created on one or more columns of a table to optimize data access; the goal of an index is to provide immediate physical access to the rows of the table that contain the indexed values. In Oracle, as in other RDBMSs, the most used indexes are the classic B-tree indexes, the local or global bitmap indexes, and the function-based index. The paradigm that underlies the Naming Convention of the indexes can be summarized in the following formula:

<index name> = <project code>_<area code>_<section code>_<logical name>_<index type>
 
The Naming Convention of the indexes therefore has the same syntax as that of the entities; only the type changes. In practice, the name of an index is identical to that of the table on which it is created, except for the suffix. What follows is a list of the indexes applicable to a sales fact table. The x indicates a progressive number.

  • EDW_DM0_SLS_LBx: Represents a local bitmap index.
  • EDW_DM0_SLS_GBx: Represents a global bitmap index.
  • EDW_DM0_SLS_NUx: Represents a generic non-unique B-tree index.
  • EDW_DM0_SLS_UIx: Represents a generic unique B-tree index.
  • EDW_DM0_SLS_FUx: Represents a function-based index.
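
The list above can be turned into a small name builder. The Python sketch below is only an illustration: it treats everything before the suffix as a single prefix, and it assumes a 30-character identifier limit, which applies to Oracle releases before 12.2 and should be adjusted for later ones.

```python
INDEX_TYPES = {
    "LB": "local bitmap",
    "GB": "global bitmap",
    "NU": "non-unique B-tree",
    "UI": "unique B-tree",
    "FU": "function-based",
}

# Assumption: identifier limit of Oracle releases before 12.2.
MAX_IDENTIFIER_LENGTH = 30


def index_name(table_prefix: str, index_type: str, sequence: int) -> str:
    """Build <table prefix>_<index type><progressive number>."""
    if index_type not in INDEX_TYPES:
        raise ValueError(f"unknown index type {index_type!r}")
    if sequence < 1:
        raise ValueError("the progressive number starts at 1")
    name = f"{table_prefix}_{index_type}{sequence}"
    if len(name) > MAX_IDENTIFIER_LENGTH:
        raise ValueError(f"{name!r} exceeds {MAX_IDENTIFIER_LENGTH} characters")
    return name


if __name__ == "__main__":
    for code in INDEX_TYPES:
        print(index_name("EDW_DM0_SLS", code, 1), "-", INDEX_TYPES[code])
```

Checking the length at naming time is worthwhile because structured names grow quickly as components are added.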

 

The Naming Convention of integrity constraints

Integrity constraints allow us to attach rules to the Data Warehouse tables, in order to prevent the introduction of outliers or non-compliant values.
There is no need to stress how important these rules are in the design of the system. Let us immediately dispel a myth we often hear: that constraints on tables encumber data manipulation operations. Nothing could be further from the truth. Let's make things clear.
  1. Obviously, introducing an integrity constraint slows down the processes that manipulate the table, but the overhead is minimal and, as a percentage, its weight in the loading process is negligible. Remember, however, that constraints can be disabled before the data load and re-enabled immediately afterwards.
  2. If implemented programmatically in your application, the constraints will never be as complete, secure and manageable as those enforced automatically by the RDBMS.
  3. Always define the integrity constraints, even if the source systems are themselves RDBMSs with active constraints. It is better not to trust: think of what it means to discover duplicate keys after loading a few months of data, while already in production.
  4. Integrity constraints are needed to get the most out of query rewrite in Oracle, i.e. the internal feature that can rewrite a query on the fact table and redirect it to a materialized view. Without integrity constraints between the fact table and its dimension tables, rewrites that depend on those relationships will not take place.

The paradigm that is the basis of the Naming Convention of the integrity constraints can be summarized in the following formula:

<constr. name> = <project code>_<area code>_<section code>_<logical name>_<constr. type>
 
The following is a list of integrity constraints applicable to a sales fact table. The x indicates a progressive number.
  • EDW_DM0_SLS_Nxx: Indicates that a field must always have a non-null value. The xx is a sequential number for each column in the table that requires the constraint.
  • EDW_DM0_SLS_PK1: Indicates the primary key.
  • EDW_DM0_SLS_UKx: Indicates a unique key.
  • EDW_DM0_SLS_FKx: Indicates a foreign key. If you think the number of foreign keys may be higher than 9, use the convention Fxx.
  • EDW_DM0_SLS_CKx: Indicates a more complex constraint based on some condition (for example, a start date must always be earlier than the end date).
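
The suffix rules above, together with the advice to disable constraints during the load, can be sketched in Python as follows. The table name passed to the second function is a placeholder, and the sketch only produces the statements; it does not connect to a database.

```python
def constraint_name(prefix: str, kind: str, sequence: int = 1,
                    many_foreign_keys: bool = False) -> str:
    """Build a constraint name following the suffix rules listed above."""
    if sequence < 1:
        raise ValueError("the progressive number starts at 1")
    if kind == "N":                        # not null: two-digit column counter
        suffix = f"N{sequence:02d}"
    elif kind == "PK":                     # a table has only one primary key
        suffix = "PK1"
    elif kind == "FK" and many_foreign_keys:
        suffix = f"F{sequence:02d}"        # Fxx when more than 9 are expected
    elif kind in ("UK", "FK", "CK"):
        suffix = f"{kind}{sequence}"
    else:
        raise ValueError(f"unknown constraint type {kind!r}")
    return f"{prefix}_{suffix}"


def toggle_constraint_ddl(table: str, name: str, enable: bool) -> str:
    """Statement to switch a constraint off before a load and on after it."""
    action = "ENABLE" if enable else "DISABLE"
    return f"ALTER TABLE {table} {action} CONSTRAINT {name}"


if __name__ == "__main__":
    fk = constraint_name("EDW_DM0_SLS", "FK", 2)
    # SOME_FACT_TABLE is a placeholder table name.
    print(toggle_constraint_ddl("SOME_FACT_TABLE", fk, enable=False))
    print(toggle_constraint_ddl("SOME_FACT_TABLE", fk, enable=True))
```

Because every foreign key of the data mart shares the same prefix, the list of constraints to disable before a load can also be selected from the data dictionary with a single pattern.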

 

The Naming Convention of the tablespaces

Tablespaces are logical storage units that group objects with common logical characteristics. Every table, materialized view or index always has a tablespace that contains it, either stated explicitly in the creation script or implied, in which case it is the default tablespace of the user who created the object.
Each tablespace is in turn associated with one or more datafiles. The paradigm that underlies the Naming Convention of the tablespaces can be summarized in the following formula:

<tbs name> = <project code>_<area code>_<section code>_<tbs type>

where the section code and the type code are optional. In fact, the technique to apply here is not unique, but depends on the size of the objects that make up the tablespace. Referring to our sales example, we have the following:
  • EDW_COM: Tablespace of the common entities. The area we have called COM certainly contains tables and indexes that are small compared to the data in other areas, so the project code plus the area code is sufficient.
  • EDW_STA: Tablespace for temporary objects. In this case too, the staging tables, which are only transient and small, can stay in a single tablespace.
  • EDW_DM0_SLS: Tablespace for the objects of the sales data mart. If the total space occupied by these objects is limited (for me, limited means under 8 GB), a single tablespace may be sufficient. If the volumes are higher, the type codes DFT, IFT, DMT and IMT can be used to separate, respectively, fact table data, fact table indexes, materialized view data and materialized view indexes.
In the case of a VLDW (Very Large Data Warehouse), it is conceivable to have one tablespace for the data and one for the indexes of each table.
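
The decision described above (area only, area plus section, or one tablespace per type code) can be sketched in Python. The 8 GB threshold comes from the text; everything else, including the function name and parameters, is an assumption made for illustration.

```python
# Type codes for large data marts: fact table data, fact table indexes,
# materialized view data, materialized view indexes.
LARGE_TYPE_CODES = ("DFT", "IFT", "DMT", "IMT")


def tablespace_names(project, area, section=None, size_gb=0.0, limit_gb=8.0):
    """Return the tablespace names for an area or for a data mart section."""
    base = f"{project}_{area}"
    if section is None:
        return [base]                  # small areas such as COM or STA
    base = f"{base}_{section}"
    if size_gb < limit_gb:
        return [base]                  # a single tablespace is enough
    return [f"{base}_{code}" for code in LARGE_TYPE_CODES]


if __name__ == "__main__":
    print(tablespace_names("EDW", "COM"))
    print(tablespace_names("EDW", "DM0", "SLS", size_gb=3))
    print(tablespace_names("EDW", "DM0", "SLS", size_gb=50))
```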

The Naming Convention of the datafiles

The following considerations are valid if you are not using the Automatic Storage Management feature of Oracle.
As stated in the previous section, a tablespace is made up of datafiles. When you create the tablespace, you must already know roughly the total space occupied by the objects that will live in it, because you will be asked to allocate physical space.
Let's forget about steering the placement of the datafiles onto specific disks of the database server: storage virtualization techniques now let us see a single disk. My advice is to divide the space occupied by the objects of the tablespace among several files of moderate size, so that they are easier to manage. The paradigm that underlies the Naming Convention of the datafiles can be summarized in the following formula:

<datafile name> = <tablespace name>_XX.<file type>

In this case, XX is a progressive number of the form 01, 02, and so on, while the file type is usually fixed to DBF (Data Base File). Of course you can use another acronym instead of DBF; what matters is that all the datafiles follow the same logic.
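
As an illustration of splitting a tablespace into several files of moderate size, the Python sketch below computes how many datafiles are needed and names them. The default file size of 2048 MB is an arbitrary assumption, not a recommendation from the text.

```python
import math


def datafile_names(tablespace, total_mb, file_mb=2048, file_type="DBF"):
    """Name the datafiles needed to hold total_mb in files of file_mb each."""
    if total_mb <= 0 or file_mb <= 0:
        raise ValueError("sizes must be positive")
    count = math.ceil(total_mb / file_mb)
    if count > 99:
        raise ValueError("the XX counter allows at most 99 datafiles")
    return [f"{tablespace}_{n:02d}.{file_type}" for n in range(1, count + 1)]


if __name__ == "__main__":
    for name in datafile_names("EDW_DM0_SLS", total_mb=5000):
        print(name)
```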

The Naming Convention of the roles

In a Data Warehouse, tables and their structures must be grouped together to be made accessible to users for data selection. I spoke at the beginning about the users of the Data Warehouse. I am aware that reality is often more complicated, and there will always be users who access, or wish to access, the data directly. This is why I talk about roles.
Providing access means granting privileges on the entities. Because users generally have access to one or more data marts, the best way to simplify the management of access rights is to group all the accesses to a data mart using roles. (By Data Mart I mean a logical one: the fact table and its related dimension tables.)
So a grant does not associate a user with a structure, but a user with a role. It is immediately clear that the Naming Convention of the roles is closely connected to the data marts, i.e. to the logical partitioning at the section level. The paradigm that underlies the Naming Convention of the roles can be summarized in the following formula:

<role name> = <project code>_<area code>_<section code>_<type code>

The type code may be optional, as users of the Data Warehouse will always access it with "SELECT" queries (I hope!); this does not mean that we cannot use "_SEL" to indicate the read-only role and "_UPD" to indicate the role that allows insert, update and delete. The next figure shows a summary of the techniques applied so far.
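
The role mechanism can be sketched in Python as a generator of the corresponding statements: one role per data mart, privileges granted to the role, and the role granted to users. The fact table name and the user name in the usage example are hypothetical.

```python
PRIVILEGES = {
    "SEL": "SELECT",                  # read-only role
    "UPD": "INSERT, UPDATE, DELETE",  # role that can modify data
}


def role_statements(project, area, section, tables, type_code="SEL", users=()):
    """Create the data mart role, grant it on the tables, then grant it to users."""
    if type_code not in PRIVILEGES:
        raise ValueError(f"unknown role type code {type_code!r}")
    role = f"{project}_{area}_{section}_{type_code}"
    statements = [f"CREATE ROLE {role}"]
    statements += [f"GRANT {PRIVILEGES[type_code]} ON {t} TO {role}" for t in tables]
    statements += [f"GRANT {role} TO {user}" for user in users]
    return statements


if __name__ == "__main__":
    # SOME_FACT_TABLE and ANNA are placeholders, for illustration only.
    tables = ["SOME_FACT_TABLE", "EDW_COM_CDI_CUST_DIT"]
    for statement in role_statements("EDW", "DM0", "SLS", tables, users=["ANNA"]):
        print(statement)
```

When a new dimension is added to the data mart, only one grant to the role is needed, regardless of how many users hold that role.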


The Naming Convention of the packages

Packages are libraries of PL/SQL code. In Oracle, PL/SQL (Procedural Language/SQL) is the internal database language, although you can also write programs in Java, C or other programming languages and call them from PL/SQL modules.
These modules may be procedures or functions. In Oracle, using packages is crucial: I highly recommend that all the modules needed for the loading process be contained in packages. The advantages are numerous, and I will mention only two:
  • Modularity: organizing your programs in an orderly manner, according to the context in which they operate, is essential for anyone who works, or will work, on the project.
  • Performance: when you call a module of a package for the first time, the entire package is loaded into memory. Subsequent calls to other modules of the package do not require disk access.
Returning to the Naming Convention, this means that the procedure used to load the sales fact table, or its monthly aggregate, must be contained in a package that has (if possible) the same name as the target table.
If this procedure uses functions or procedures that are generic to the sales Data Mart, those modules should be contained in a package that has the same name as the corresponding section. The same logic continues up to the procedures common to the entire Data Warehouse (for example, a function that returns the difference between two dates in order to calculate a delta). The next figure shows an example of such encapsulation.


The Naming Convention to be adopted for packages is very flexible and may range from its most extensive form:

<pkg name> = <project code>_<area code>_<section code>_<logical name>_<type code>
to its simplest form:
<pkg name> = <project code>

The logical name and the type code can be useful in complex systems where the number of packages tends to be very high.
Do not forget that the type code must add value to the semantics of the name. Adding _PKG as a type code does not add value by itself, as this information can be obtained from the Oracle catalog with a simple select statement.
If you decide that all the modules that call Java procedures in a certain section belong in a specific package, then "_PKG" and "_JPK" will definitely be effective choices. Where, as in Oracle, a package and a table cannot have the same name, the use of "_PKG" becomes mandatory.

Conclusions

We have now reached the end of this short journey through the Naming Convention techniques. What has been covered is certainly not exhaustive of the many possible applications of these techniques. Each of us, on the basis of our own experience, can partition and codify according to our own needs and intuition.
Indeed, what matters is not the particular choice by which you partition or codify the system, but that you follow a method of standardization as rigorously as possible. An effective Naming Convention provides the tools needed to bring the system quickly under control in terms of knowledge, management and maintenance.

(You can download this article from SlideShare: http://www.slideshare.net/jackbim/recipes-8-the-naming-convention-part-2)
