Thursday, September 10, 2015

Data Modeling


Data Model
A data model is a collection of concepts used to describe the structure of a database; in other words, it is a representation of a set of business requirements in a standard framework that business users can understand.
A data model can also be defined as an integrated collection of concepts for describing and manipulating data, relationships between data, and constraints on the data in an organization.
A data model comprises three components:
• A structural part, consisting of a set of rules according to which databases can be constructed.
• A manipulative part, defining the types of operation that are allowed on the data (this includes the operations that are used for updating or retrieving data from the database and for changing the structure of the database).
• Possibly a set of integrity rules, which ensures that the data is accurate.
While designing a data model, there are some points we need to keep in mind:
      1.  Identify Entities: each entity should be classified by its properties and characteristics and given a business definition.
      2.  Identify Attributes: the characteristics and properties of entities; names should be unique and self-explanatory.
      3.  Identify Relationships: identify the relationships between entities and the attributes used to implement them.
      4.  Naming Convention: names should be short; avoid special characters and overly generic terms, and follow the same standards for all entities and attributes.
      5.  Define Keys: identify the key attributes of each entity. (A small sketch of these steps follows below.)
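As an illustration only (the Customer and Order entities, their fields, and the sample values below are hypothetical, not taken from this post), here is a minimal Python sketch of how these five steps might be captured:

from dataclasses import dataclass
from datetime import date

@dataclass
class Customer:                  # 1. Entity
    customer_id: int             # 5. Key attribute (unique identifier)
    customer_name: str           # 2. Attributes with short, self-explanatory names
    customer_city: str           # 4. Naming convention: no special characters

@dataclass
class Order:                     # 1. Entity
    order_id: int                # 5. Key attribute
    order_date: date
    customer_id: int             # 3. Relationship: an Order refers to a Customer

c = Customer(1, "Acme Traders", "Pune")
o = Order(1001, date.today(), customer_id=c.customer_id)   # relationship through the key
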
Three categories of data models
1. Object Based Data Models: Object based data models use concepts such as entities, attributes, and relationships. An entity is a distinct object (a person, place, concept, or event) in the organization that is to be represented in the database. An attribute is a property that describes some aspect of the object that we wish to record, and a relationship is an association between entities.
 Types of object based data models
• Entity-Relationship
• Object Oriented
• Semantic
• Functional 
The Entity-Relationship model has emerged as one of the main techniques for modeling database design and forms the basis for the database design methodology.
The object oriented data model extends the definition of an entity to include, not only the attributes that describe the state of the object but also the actions that are associated with the object, that is, its behavior. The object is said to encapsulate both state and behavior.
Entities in semantic systems represent the equivalent of a record in a relational system or an object in an object-oriented system, but they do not include behavior (methods). They are abstractions used to represent real-world (e.g. customer) or conceptual (e.g. bank account) objects.
The functional data model is now almost twenty years old. The original idea was to view the database as a collection of extensionally defined functions and to use a functional language for querying the database.
2. Physical Data Models:
Physical data models describe how data is stored in the computer, representing information such as record structures, record ordering, and access paths. There are not as many physical data models as logical data models, the most common one being the Unifying Model.
3. Record Based Logical Models:
Record based logical models are used in describing data at the logical and view levels. In contrast to object based data models, they are used to specify the overall logical structure of the database and to provide a higher-level description of the implementation. Record based models are so named because the database is structured in fixed format records of several types. Each record type defines a fixed number of fields, or attributes, and each field is usually of a fixed length.
The three most widely accepted record based data models are: 
• Hierarchical Model
• Network Model
• Relational Model 
The relational model has gained favor over the other two in recent years. The network and hierarchical models are still used in a large number of older databases.
Conceptual Data Modeling
Conceptual data modeling is a map of concepts and their relationships. It describes the semantics of an organization, i.e. how the organization does business and which components are involved.
The information that business users need for decision making is also considered in conceptual data modeling, and the solution is designed around it.
Components of Conceptual Data Modeling
      1.  Entities or objects, for example Product, Customer, Service.
      2.  Relationships between entities.
      3.  Identifiers that distinguish entity instances.
      4.  The attributes required to describe each entity.
ER Model: The ER model is a conceptual data model in which data is described as entities, relationships, and attributes. The table below summarizes the types of attributes.
Types of Attributes | Description | Example
Single | Not divisible | Aadhar no., Age, Sex
Composite | Divisible into smaller parts | Name (FName, MName, LName)
Single-Valued | A single value per entity | Age
Multi-Valued | Multiple values per entity | Academic marks of a person
Derived | Holds a calculated value | Number of customers per region
Complex | Composite and multi-valued | Shipping addresses
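The following is a small, hedged Python sketch of these attribute types using a hypothetical Person entity; the field names and sample values are illustrative only:

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Name:                       # Composite attribute: divisible into parts
    first: str
    middle: Optional[str]
    last: str

@dataclass
class Person:
    aadhar_no: str                # Single (not divisible), single-valued attribute
    age: int                      # Single-valued
    name: Name                    # Composite
    academic_marks: List[int] = field(default_factory=list)   # Multi-valued

    @property
    def is_adult(self) -> bool:   # Derived attribute: calculated, not stored
        return self.age >= 18

p = Person("XXXX-XXXX-XXXX", 30, Name("Asha", None, "Rao"), [78, 85, 91])
print(p.is_adult)                 # True: derived from the stored age
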

Enhanced ER Modeling:
In addition to the basic ER model concepts, the enhanced ER model includes:
   ·  Subclasses and superclasses
   ·  Specialization (top-down)
   ·  Generalization (bottom-up)
   ·  Categories (union types)
   ·  Attribute and relationship inheritance

 Logical Data Modeling
Logical data modeling refines the conceptual data model into a detailed model of the data, still independent of any particular database. The logical data model is the version of the data model that represents the business requirements of an organization.
Once the conceptual data model is approved by the design architect and the functional team, the development team starts designing the logical data model. A good logical data model is built with future business requirements in mind. It streamlines the data structures and the relationships between entities and attributes.
The logical data model includes all required entities, attributes, key groups, and relationships that represent the business information and business rules.
Characteristics of a logical data model
·         The logical model is developed in an iterative manner.
·         The design is independent of any specific database.
·         All entities and the relationships among them are identified.
·         All attributes for each entity are specified.
·         The primary key for each entity is specified.
·         Foreign keys between entities are specified. (A small sketch follows below.)
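As a rough, database-independent illustration (the Customer and Order entities and every name below are assumptions, not part of any real model), a logical data model can be summarized like this:

logical_model = {
    "Customer": {
        "attributes": ["customer_id", "customer_name", "customer_city"],
        "primary_key": "customer_id",
    },
    "Order": {
        "attributes": ["order_id", "order_date", "order_amount", "customer_id"],
        "primary_key": "order_id",
        # the foreign key implements the one-to-many Customer-Order relationship
        "foreign_keys": {"customer_id": "Customer.customer_id"},
    },
}
print(list(logical_model))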



Dimensional Data Model:
A dimensional data model comprises one or more dimension tables and a fact table.
A modeling technique is essentially a way of organizing stored data; the dimensional model is organized so that data can be retrieved fast.
In a dimensional model, everything is divided into two distinct categories - dimensions and measures. Anything we try to model must fit into one of these two categories.
Some benefits of the dimensional data model:
  1. Faster data retrieval
  2. Better understandability
  3. Extensibility


Steps to design a dimensional data model (a small star-schema sketch follows this list):
      1.  Identify dimensions: dimensions are the objects or context, that is, the 'things' about which something is being said.
      2.  Identify measures: measures are the quantifiable subjects; they are usually numeric and can be aggregated.
      3.  Identify the attributes or properties of dimensions: in practice a context has many attributes, so select only the relevant ones.
      4.  Identify the granularity of measures: granularity refers to the depth of information stored in the data model, i.e. how detailed the data stored in the system is.
      5.  Decide how history is stored: make sure the relevant data is stored and history is maintained; slowly changing dimensions are used to achieve this.
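Below is a minimal star-schema sketch that reflects these steps; the retail tables, columns, and grain are hypothetical examples, not taken from any specific project:

dimensional_model = {
    "dimensions": {                                   # step 1: dimensions (context)
        "DIM_PRODUCT":  ["product_key", "product_name", "category"],    # step 3: attributes
        "DIM_CUSTOMER": ["customer_key", "customer_name", "region"],
        "DIM_DATE":     ["date_key", "calendar_date", "month", "year"],
    },
    "fact": {
        "FACT_SALES": {
            "grain": "one row per product per customer per day",        # step 4: granularity
            "measures": ["quantity_sold", "sales_amount"],              # step 2: measures
            "foreign_keys": ["product_key", "customer_key", "date_key"],
        }
    },
    # step 5: history could be kept by making DIM_CUSTOMER a slowly changing dimension
}
print(dimensional_model["fact"]["FACT_SALES"]["grain"])
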
Physical Data Modeling
Physical data modeling is the set of processes that develop the data structures for the data warehouse in the selected database.
Important points while developing the physical data model:
      1.  Convert entities into physical tables.
      2.  Convert relationships into foreign keys.
      3.  Convert attributes into physical columns.
      4.  Convert unique identifiers into constraints defined on the physical table columns.
      5.  Create indexes, views (materialized views), and dimensions in the appropriate tablespaces.
      6.  Maintain data types based on the logical and conceptual data models.
Apart from these, we also need to consider how data will be inserted, updated, and retrieved from the physical tables, how unstructured data will be stored, and performance tuning and memory utilization. A small sketch of the conversion steps is shown below.
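As a sketch only, here is what the conversion steps above might look like in code. SQLite is used purely for illustration (a real warehouse database would have its own DDL), and all table and column names are assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        customer_key   INTEGER PRIMARY KEY,     -- unique identifier -> key constraint
        customer_name  TEXT NOT NULL,           -- attribute -> typed physical column
        region         TEXT
    );
    CREATE TABLE fact_sales (
        sales_key      INTEGER PRIMARY KEY,
        customer_key   INTEGER REFERENCES dim_customer(customer_key),  -- relationship -> FK
        sales_amount   REAL,
        sales_date     TEXT
    );
    -- index created to support the expected access path
    CREATE INDEX idx_fact_sales_customer ON fact_sales (customer_key);
""")
conn.close()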






Wednesday, September 9, 2015

Informatica-ETL performance Tuning


Informatica-ETL performance Tuning: 
Informatica performance tuning is an important aspect of ETL and data warehouse development.
Nowadays business users want data faster and systems to be more responsive, so we have to look for ways to improve the systems we have designed.
While doing performance tuning we have to look at every possible aspect: the database, the ETL tool, and the hardware resources.
Below are some points that will help improve data warehouse performance:

1. Push sorting, grouping, and aggregation into the database (Oracle, SQL Server, Sybase, DB2) and keep the data in a staging environment.
2. Avoid synonyms, DB links, and remote database access; try to keep all tables in one instance.
3. Localize all target tables, stored procedures, functions, views, and sequences in the source database.
4. Informatica works well against an RDBMS engine but not as well against Java, OLAP, and reporting engines.
5. Turn off VERBOSE tracing in the Informatica session; with it on, Informatica spends extra time writing to the log file.
6. Turn off 'Collect Performance Statistics' in Informatica.
7. If the source is a flat file, use staging tables to reduce the number of transformations in the mapping.
8. Avoid uncached lookups if the lookup table is growing.
9. Break complex mappings apart so that they can be processed in parallel.
10. Keep a balance between Informatica and the database, but use the database's capabilities wherever possible to avoid extra operations in Informatica.
11. Use the DBMS for reading, writing, grouping, and filtering data; for more complex logic such as outer joins, integrating various sources, and multiple source feeds, identify which operations should be done in the database and which in Informatica.
12. Tune session settings by checking 'Throttle Reader' and increasing the default buffer size.
13. Choose a well-sized server when installing the Informatica server so that PMServer has enough space to store its data.
14. Avoid database sequence generators; if you cannot, use staging tables, and if the sequence generator is shared, add the sequence ID from a flat file and call a post-target-load stored procedure to populate the column.

Tuning the Session
1. Partition the session: this creates multiple connections to the source and target and loads data in parallel pipelines. Each pipeline is independent of the others. However, session performance will not improve if the number of records is small, or if the session mainly performs updates and deletes. Session partitioning should therefore be used only when the data volume is large and the job mainly inserts data.
2. Run sessions in parallel rather than serially to save time, provided they are independent of each other.
3. Drop constraints and indexes before the session runs and rebuild them after it completes. Dropping can be done in a pre-session script and rebuilding in a post-session script. If the data volume is too large for dropping and rebuilding indexes to be practical, stage all the data, pre-create the index, use a transportable tablespace, and then load it into the database. (A sketch of this pattern follows the list.)
4. Use bulk loading, external loading, etc. Bulk loading can be used only if the table does not have an index.
5. A session has options to treat rows as 'Data Driven', 'Insert', 'Update', or 'Delete'. If update strategies are used, it has to be kept as 'Data Driven'. But when the session only inserts rows into the target table, set it to 'Insert' to improve performance.
6. Increase the database commit interval (the point at which the Informatica server commits data to the target table).
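For point 3 above, the following is a rough sketch of the drop-index, bulk-load, rebuild-index pattern. SQLite and Python are used only for illustration; in PowerCenter this would normally be done with pre- and post-session SQL or scripts, and every name here is hypothetical:

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tgt_sales (id INTEGER, amount REAL)")
conn.execute("CREATE INDEX idx_tgt_sales_id ON tgt_sales (id)")

rows = [(i, i * 10.0) for i in range(10000)]          # hypothetical batch of rows

conn.execute("DROP INDEX idx_tgt_sales_id")           # pre-session: drop the index
conn.executemany("INSERT INTO tgt_sales VALUES (?, ?)", rows)   # bulk load
conn.execute("CREATE INDEX idx_tgt_sales_id ON tgt_sales (id)") # post-session: rebuild
conn.commit()
conn.close()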

Tuning Of Mapping
1. Avoid executing major SQL queries from within mapplets or mappings.
2. When SQL queries are used, make sure they are optimized.
3. Reduce the number of transformations in the mapping. Active transformations like Rank, Joiner, Filter, and Aggregator should be used as little as possible.
4. Remove all unnecessary links between the transformations in the mapping.
5. If a single mapping contains many targets, dividing it into separate mappings can improve performance.
6. If a single source is needed more than once in a mapping, keep only one source and one Source Qualifier in the mapping, then create different data flows as required into different targets or the same target.
7. If a session joins many source tables in one Source Qualifier, an optimized query will improve performance.
8. The SQL query that Informatica generates contains an ORDER BY clause. Remove the ORDER BY clause if it is not needed, or at least reduce the number of column names in the list. For better performance it is best to order by the index field of the table.
9. Combine mappings that use the same set of source data.
10. Fields carrying the same information should be given the same type and length throughout the mapping; otherwise time is spent on field conversions.
11. Instead of doing complex calculations in the query, use an Expression transformation and do the calculations in the mapping.
12. If data passes through multiple staging areas, removing unnecessary staging areas will increase performance.
13. Try to keep the stored procedures used in mappings simple.
14. Unnecessary data type conversions should be avoided.

Tuning of Lookup Transformations
1. To improve performance, cache the lookup tables. Informatica can cache all the lookup and reference tables, which makes these operations run very fast. (The meaning of caching is explained in point 2 of this section, and the procedure for determining the optimum cache size is given at the end of this document.)
2. Even after caching, performance can be improved further by minimizing the size of the lookup cache. Reduce the number of cached rows by using a SQL override with a restriction, as in the sketch below.
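The sketch below illustrates the idea of a lookup cache in plain Python with SQLite: the reference table is read once into memory with a restriction (standing in for the SQL override), and each source row is then resolved from the in-memory dictionary instead of hitting the database again. The table, columns, and sample rows are assumptions for illustration only.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER, customer_name TEXT, active INTEGER);
    INSERT INTO dim_customer VALUES (1, 'Asha', 1);
    INSERT INTO dim_customer VALUES (2, 'Ravi', 0);
    INSERT INTO dim_customer VALUES (3, 'Meera', 1);
""")

# Restricting the cached rows (only active customers) keeps the cache small.
cache = {cid: name for cid, name in
         conn.execute("SELECT customer_id, customer_name FROM dim_customer WHERE active = 1")}

source_rows = [(101, 1), (102, 3), (103, 2)]           # hypothetical (order_id, customer_id)
for order_id, customer_id in source_rows:
    print(order_id, cache.get(customer_id, "UNKNOWN")) # resolved from the in-memory cache
conn.close()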



Tuesday, September 8, 2015

SCD Type-1



SCD Type 1: In SCD 1, new information overwrites the current information.
There is no history in an SCD 1 dimension.
It is used when it is not necessary for the data warehouse to keep track of historical changes.
It works like "Update Else Insert".
Below are a sample source table "SRC_SCD1" and the SCD1 target dimension "TGT_SCD1".

 







Now let's create the mapping "m_SCD1" and import the source and target definitions into the mapping.
Next we look up the SCD1 dimension table. For this we create a Lookup transformation and keep only the required ports (SURRKEY_01, EMPNO, CURR_SAL). The EMPNO port is used to match rows between the source and the lookup table, and CURR_SAL is used for value comparison; based on these, the Router decides whether a row is sent for INSERT or UPDATE.




Now we create a Router transformation with two groups, one for INSERT and one for UPDATE. The conditions are given below, followed by a short plain-Python sketch of the same decision logic.
For INSERT the condition is ISNULL (SURRKEY_01)
For UPDATE the condition is NOT ISNULL (SURRKEY_01) AND (SAL != CURR_SAL)
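For readers who prefer pseudocode, here is a hedged plain-Python sketch of the same "update else insert" decision. The port names (EMPNO, SAL, SURRKEY_01, CURR_SAL) follow the mapping above, while the sample employee rows are made up for illustration:

from itertools import count

dimension = {}                # EMPNO -> {"SURRKEY_01": ..., "CURR_SAL": ...}
surrkey_seq = count(1)        # stand-in for the Sequence Generator

def load_scd1(source_rows):
    """Apply one run of source (EMPNO, SAL) rows to the SCD1 dimension."""
    for empno, sal in source_rows:
        found = dimension.get(empno)             # lookup on EMPNO
        if found is None:                        # ISNULL(SURRKEY_01) -> INSERT
            dimension[empno] = {"SURRKEY_01": next(surrkey_seq), "CURR_SAL": sal}
        elif found["CURR_SAL"] != sal:           # NOT ISNULL AND SAL != CURR_SAL -> UPDATE
            found["CURR_SAL"] = sal              # overwrite; no history is kept

load_scd1([(7369, 800), (7499, 1600)])           # first run: both rows inserted
load_scd1([(7369, 900), (7521, 1250)])           # second run: one update, one insert
print(dimension)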



Next we create an Update Strategy transformation for the UPDATE flow and use DD_UPDATE in the Update Strategy expression; here we take only the ports we want to update.



Now, for the INSERT flow, we create a Sequence Generator transformation to generate the surrogate key for the dimension table, and drag and drop all ports from the Router's INSERT group to the target insert instance.


Now we create the session, where we specify the source, target, and lookup database connections; in the session properties we set 'Treat Source Rows As' to Data Driven because we are using an Update Strategy transformation.
Below is the completed SCD Type-1 mapping.



We run the session; below is the target data after the first run.


Now let's update and insert some data in the source.


We re-run the session and verify in the target that the data has been updated with the latest values.

 
With this I'm done with SCD Type-1. Please let me know if you have any questions, and leave your comments below to help me improve in the future.
I will explain SCD Type-2 in the next tutorial.

Monday, September 7, 2015

Slowly Changing Dimension-3


SCD-3 (Slowly Changing Dimension)

It is used to store partial history in the dimension table, that is, the oldest and the latest data.
Whenever an existing record changes in the source, the new value is written to the current-value column of the SCD-3 dimension in the data warehouse.
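Before walking through the mapping, here is a hedged plain-Python sketch of the SCD Type-3 idea. The EMPNO, Sal, SurrKey_001, and Curr_sal names come from the example below; the previous-value column is an assumption and is called PREV_SAL here only for illustration (the actual target table may name it differently):

from itertools import count

dimension = {}                # EMPNO -> {"SurrKey_001", "PREV_SAL", "Curr_sal"}
surrkey_seq = count(1)        # stand-in for the Sequence Generator

def load_scd3(source_rows):
    """Apply one run of source (EMPNO, Sal) rows to the SCD3 dimension."""
    for empno, sal in source_rows:
        row = dimension.get(empno)
        if row is None:                                   # new key -> insert
            dimension[empno] = {"SurrKey_001": next(surrkey_seq),
                                "PREV_SAL": None,         # assumed previous-value column
                                "Curr_sal": sal}
        elif row["Curr_sal"] != sal:                      # value changed -> update
            row["PREV_SAL"] = row["Curr_sal"]             # keep only the prior value
            row["Curr_sal"] = sal                         # overwrite the current value

load_scd3([(7369, 800)])      # first run: insert
load_scd3([(7369, 950)])      # second run: 800 moves to PREV_SAL, 950 becomes Curr_sal
print(dimension)
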
Here I'm taking a sample employee table (SRC_EMP_SCD3) as an example and loading it into the target dimension (TGT_EMP_SCD3).
After creating the source and target tables in the database, import the source and target definitions into the mapping (m_EMP_SCD3).
Let us create a Lookup transformation on the target dimension table and keep only the required ports, which in my case are (EMPNO, SurrKey_001, Curr_sal).



The looked-up Curr_sal port will be compared with the source Sal port in the Router transformation; in the Lookup we write the condition to match rows with the source, which is SRC_EMPNO = TGT_EMPNO.
Now we create a Router transformation to send data for update and insert through different pipelines.
For INSERT: ISNULL (SURRKEY_001)
For UPDATE: NOT ISNULL (SURRKEY_001) AND (SAL != CURR_SAL)



Now we create an Update Strategy transformation for the update flow and drag the required columns into it, which are SurrKey_001 and the Sal column received from the source.


Here we write DD_UPDATE in the Update Strategy expression, because these rows are being sent for update, and drag these columns to another target instance.



For insert we create a Sequence Generator transformation for unique key generation, map its NEXTVAL column to the target SurrKey_001 column, and map the remaining columns from the Router's INSERT group.

Below is the developed mapping.


Target after first run


In the source, one record is inserted and one record is updated; now let's run the session again and see the difference in the target.


After the second run the target looks like the below: one record updated and one record inserted.


With this I'm done with the SCD-3 implementation.
I hope you enjoyed reading this; please write your comments below.

Informatica PowerCenter Architecture



Informatica PowerCenter Architecture

Informatica PowerCenter has a service-oriented architecture that provides the capability to share services and resources across multiple machines.







Domain: The domain is the primary unit of administration in Informatica PowerCenter; it manages and administers the various services in PowerCenter.
Node: A node is a logical representation of a machine in the domain; one domain can have multiple nodes.
The domain is hosted on the master gateway node, and a node can be configured to run application services such as the Integration Service and the Repository Service.
Application services: These comprise Informatica services such as the PowerCenter Integration Service, the Data Integration Service, the Metadata Manager Service, and the PowerCenter Repository Service.
Because PowerCenter has a service-oriented architecture, all operations performed by its clients are managed through services.
      a)  PowerCenter Integration Service: the Integration Service works as the controller; it runs workflows created in the Workflow Manager based on the logic written in the PowerCenter Designer mappings.
The Integration Service reads workflow information from the repository. It connects to the repository through the Repository Service to fetch the metadata it needs.
It carries out the extract, transform, and load operations, in other words it reads, transforms, and writes data between the source and the target.
      b)  PowerCenter Repository Service: it manages all the metadata and stores it in the repository database.
It accepts or rejects requests to create or modify metadata in the repository database, and it also receives requests from the Integration Service for the metadata needed to run workflows.
      c)  Metadata Manager Service: it is used to analyze metadata from various metadata repositories.
It uses PowerCenter workflows to extract metadata from different systems such as applications, BI systems, data modeling tools, and RDBMS metadata sources.
PowerCenter Clients: these comprise the PowerCenter Designer, Workflow Manager, Workflow Monitor, Repository Manager, and Metadata Reporter.
PowerCenter Designer: used to create mappings (pipelines), which are combinations of sources, targets, and transformations.
PowerCenter Workflow Manager: used to run mappings; it provides various tasks to manage sources and targets and to run workflows.
PowerCenter Workflow Monitor: used to monitor the execution of workflows and sessions.
PowerCenter Repository Manager: used to manage repositories and code migration from one environment to another.