Friday, January 16, 2015

Rank Transformation, Sequence Generator & Aggregator Transformation

Rank Transformation
·         Active and connected transformation
The Rank transformation allows us to select only the top or bottom rank of data. It allows us to select a group of top or bottom values, not just one value.
During the session, the Power Center Server caches input data until it can perform the rank calculations.
Rank Transformation Properties:
·         Cache Directory: where the cache files are created.
·         Top/Bottom: rank the top or the bottom values, as needed.
·         Number of Ranks: e.g. 1, 2 or any number.
·         Case Sensitive Comparison: can be checked if needed.
·         Rank Data Cache Size: can be set.
·         Rank Index Cache Size: can be set.
Rank Index
The Designer automatically creates a RANKINDEX port for each Rank transformation. The Power Center Server uses the Rank Index port to store the ranking position for each row in a group.
For example, if we create a Rank transformation that ranks the top five salaried employees, the rank index numbers the employees from 1 to 5.
·         The RANKINDEX is an output port only.
·         We can pass the rank index to another transformation in the mapping or directly to a target.
·         We cannot delete or edit it.
Defining Groups
Rank transformation allows us to group information. For example: If we want to select the top 3 salaried employees of each Department, we can define a group for Department.
·         By defining groups, we create one set of ranked rows for each group.
·         We define a group in the Ports tab by checking Group By for the needed port.
·         We cannot Group By on the port that is also the Rank port.
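For reference, the top-3-per-department result produced by this kind of group ranking is roughly equivalent to the following SQL (the EMP table with DEPTNO and SAL columns is assumed here only for illustration):

    SELECT EMPNO, ENAME, DEPTNO, SAL, RANKINDEX
    FROM (
        SELECT EMPNO, ENAME, DEPTNO, SAL,
               RANK() OVER (PARTITION BY DEPTNO ORDER BY SAL DESC) AS RANKINDEX
        FROM EMP
    ) RANKED_EMP
    WHERE RANKINDEX <= 3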

Sequence Generator Transformation
·         Passive and Connected Transformation.
·         The Sequence Generator transformation generates numeric values.
·         Use the Sequence Generator to create unique primary key values, replace missing primary keys, or cycle through a sequential range of numbers.
We mostly use it to generate surrogate keys in a DWH environment. When we want to maintain history, we need a key other than the primary key to uniquely identify the record, so we create a sequence 1, 2, 3, 4 and so on and use that sequence as the key. Example: if EMPNO is the key, we can keep only one record per employee in the target and cannot maintain history, so we use a surrogate key as the primary key instead of EMPNO.
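As a hypothetical illustration (the SKEY, EMPNO, DEPTNO and LOAD_DATE values below are made up), a target that keeps history could look like this, where the surrogate key SKEY uniquely identifies each version of the same EMPNO:

    SKEY   EMPNO   DEPTNO   LOAD_DATE
    1      7369    10       01-JAN-2014
    2      7369    20       01-JUN-2014    (same employee, new version; EMPNO alone could not be the primary key)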
Sequence Generator Ports:
The Sequence Generator transformation provides two output ports: NEXTVAL and CURRVAL.
·         We cannot edit or delete these ports.
·         Likewise, we cannot add ports to the transformation.
NEXTVAL:
Use the NEXTVAL port to generate sequence numbers by connecting it to a Transformation or target.
CURRVAL:
CURRVAL is NEXTVAL plus the Increment By value.
·         We typically only connect the CURRVAL port when the NEXTVAL port is already connected to a downstream transformation.
·         If we connect the CURRVAL port without connecting the NEXTVAL port, the Integration Service passes a constant value for each row.
·         When we connect the CURRVAL port in a Sequence Generator Transformation, the Integration Service processes one row in each block.
·         We can optimize performance by connecting only the NEXTVAL port in a Mapping.
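For example, with a Current Value of 1 and the default Increment By of 1 (settings assumed here only for illustration), the two ports generate:

    Row 1:  NEXTVAL = 1   CURRVAL = 2
    Row 2:  NEXTVAL = 2   CURRVAL = 3
    Row 3:  NEXTVAL = 3   CURRVAL = 4

i.e. CURRVAL is always NEXTVAL plus the Increment By value, as noted above.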
Points to Ponder:
·         If the Current Value is 1 and the End Value is 10 with no Cycle option, and there are 17 records in the source, the session will fail once the sequence is exhausted.
·         If we connect only the CURRVAL port, the value will be the same for all records.
·         If the Current Value is 1 and the End Value is 10 with the Cycle option on and a Start Value of 0, then for 17 source records the sequence is 1, 2 … 10, then 0, 1, 2 … 6.
·         To make the above sequence cycle as 1-10, 1-10 instead, set the Start Value to 1. The Start Value is used only along with the Cycle option.
·         If the Current Value is 1 and the End Value is 10 with the Cycle option on and a Start Value of 1, then for 17 source records the session generates 1-10 followed by 1-7, and 7 is saved in the repository. If we run the session again, the sequence starts from 8.
·         Use the Reset option if you want the sequence to start from the configured Current Value every time.


Aggregator Transformation
·         Connected and Active Transformation
·         The Aggregator transformation allows us to perform aggregate calculations, such as averages and sums.
·         Aggregator transformation allows us to perform calculations on groups.
Components of the Aggregator Transformation
1.    Aggregate expression
2.    Group by port
3.    Sorted Input
4.    Aggregate cache
1) Aggregate Expressions
·         Entered in an output port.
·         Can include non-aggregate expressions and conditional clauses.
The transformation language includes the following aggregate functions:
·         AVG, COUNT, MAX, MIN, SUM
·         FIRST, LAST
·         MEDIAN, PERCENTILE, STDDEV, VARIANCE
Single Level Aggregate Function: MAX(SAL)
Nested Aggregate Function: MAX( COUNT( ITEM ))
Nested Aggregate Functions
·         An Aggregator transformation can contain multiple single-level functions or multiple nested functions.
·         An Aggregator transformation cannot have both single-level and nested functions together.
·         An aggregate expression can include one aggregate function nested within another aggregate function, so MAX( COUNT( ITEM )) is valid.
·         MIN( MAX( COUNT( ITEM ))) is not valid, because it nests an aggregate function more than one level deep.
Conditional Clauses
We can use conditional clauses in the aggregate expression to reduce the number of rows used in the aggregation. The conditional clause can be any clause that evaluates to TRUE or FALSE.
·         SUM( COMMISSION, COMMISSION > QUOTA )
Non-Aggregate Functions
We can also use non-aggregate functions in the aggregate expression.
·         IIF( MAX( QUANTITY ) > 0, MAX( QUANTITY ), 0 )
2) Group By Ports
·         Indicates how to create groups.
·         When grouping data, the Aggregator transformation outputs the last row of each group unless otherwise specified.
The Aggregator transformation allows us to define groups for aggregations, rather than performing the aggregation across all input data.
For example, we can find Maximum Salary for every Department.
·         In Aggregator Transformation, Open Ports tab and select Group By as needed.
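For the example above, with DEPTNO checked as Group By, a rough sketch of the output port expression and its SQL equivalent (EMP, DEPTNO and SAL are assumed names) is:

    MAX_SAL (output port expression):  MAX( SAL )

    -- roughly equivalent SQL
    SELECT DEPTNO, MAX(SAL) AS MAX_SAL
    FROM EMP
    GROUP BY DEPTNO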
3) Using Sorted Input
·         Use to improve session performance.
·         To use sorted input, we must pass data to the Aggregator transformation sorted by the group by ports, in ascending or descending order.
·         When we use this option, we tell Aggregator that data coming to it is already sorted.
·         We check the Sorted Input Option in Properties Tab of the transformation.
·         If the option is checked but we are not passing sorted data to the transformation, then the session fails.
4) Aggregator Caches
·         The Power Center Server stores data in the aggregate cache until it completes Aggregate calculations.
·         It stores group values in an index cache and row data in the data cache. If the Power Center Server requires more space, it stores overflow values in cache files.
Note: The Power Center Server uses memory to process an Aggregator transformation with sorted ports. It does not use cache memory. We do not need to configure cache memory for Aggregator transformations that use sorted ports.


Wednesday, December 17, 2014

Filter Transformation

Filter transformation is an active and connected transformation. As an active transformation, the Filter transformation may change the number of rows passed through it. A filter condition returns TRUE or FALSE for each row that passes through the transformation, depending on whether the row meets the specified condition. Only rows that return TRUE pass through this transformation; the discarded rows do not appear in the session log or reject (bad) files.
You cannot concatenate ports from more than one transformation into the Filter transformation. The input ports for the filter must come from a single transformation. The Filter transformation does not allow setting output default values.
Creating Filter Transformation:
Follow the steps below to create a Filter transformation:
In the mapping designer, open a mapping or create a new mapping.
Go to the toolbar -> click on Transformation -> Create -> select the Filter transformation.
Enter a name, click on Create and then click on Done.
You can add ports either by dragging from other transformations or manually creating the ports within the transformation.
Specifying Filter Condition:
To configure the filter condition, go to the properties tab and in the filter condition section open the expression editor. Enter the filter condition you want to apply. Click on validate button to verify the syntax and then click OK.
Components of Filter Transformation:
The filter transformation has the following components.
Transformation: You can enter the name and description of the transformation.
Ports: Create new ports and configure them
Properties: You can provide filter condition to filter the rows. You can also configure tracing levels.
Metadata Extensions: Specify the metadata details like name, data type.
Configuring Filter Transformation:
The following properties need to be configured on the Ports tab of the Filter transformation:
Port name: Enter the name of the ports created.
Port type: All the ports in filter transformation are input/output.
Data type, precision, and scale: Configure the data type and set the precision and scale for each port.
Filter Transformation examples:
Create a mapping to load into the target the details of employees who joined the organization after Dec-2012.
Filter condition: HIREDATE >= '01-01-2013'
It rejects the details of employees who did not join the organization after 31-Dec-2012.
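If HIREDATE is a Date/Time port, the condition is usually written with TO_DATE so that the comparison happens on dates rather than strings; the format string below is only an assumption about how the date literal is written:

    HIREDATE >= TO_DATE('01-01-2013', 'MM-DD-YYYY')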
Performance Tuning Tips:
    1.   Use the filter transformation as close as possible to the sources in the mapping. This will reduce the number of rows to be processed in the downstream transformations.
    2.   In case of relational sources, if possible use the source qualifier transformation to filter the rows. This will reduce the number of rows to be read from the source.
Note: The input ports to the filter transformation must come from a single transformation. You cannot connect ports from more than one transformation to the filter Transformation.


Tuesday, December 16, 2014

Source Qualifier

Source Qualifier Transformation:
This is an Active and Connected transformation. The Source Qualifier transformation represents the rows that the Power Center Server reads when it runs a session. The Source Qualifier transformation is not reusable. It is the default transformation for relational and flat file sources; XML and COBOL sources use their own source qualifiers instead.
Tasks performed by Source Qualifier:
  • Join data originating from the same source database: We can join two or more tables with primary key-foreign key relationships by linking the sources to one Source Qualifier transformation.
  • Filter rows when the Power Center Server reads source data: If we include a filter condition, the Power Center Server adds a WHERE clause to the default query.
  • Specify an outer join rather than the default inner join: If we include a user-defined join, the Power Center Server replaces the join information specified by the metadata in the SQL query.
  • Specify sorted ports: If we specify a number for sorted ports, the Power Center Server adds an ORDER BY clause to the default SQL query.
  • Select only distinct values from the source: If we choose Select Distinct, the Power Center Server adds a SELECT DISTINCT statement to the default SQL query.
  • Create a custom query to issue a special SELECT statement for the Power Center Server to read source data: For example, we might use a custom query to perform aggregate calculations. All of the above are configured on the Properties tab of the Source Qualifier transformation.
 Source Qualifier Properties Tab:
1) SOURCE FILTER:
We can use a source filter to reduce the number of rows the Power Center Server queries.
Note: When we use a source filter in the session properties, we override the customized SQL query in the Source Qualifier transformation.
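For example, assuming only EMPNO, ENAME and DEPTNO are connected onward and a source filter of EMP.DEPTNO = 10 (names and filter chosen only for illustration), the generated query would look roughly like:

    SELECT EMP.EMPNO, EMP.ENAME, EMP.DEPTNO
    FROM EMP
    WHERE EMP.DEPTNO = 10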
2) NUMBER OF SORTED PORTS:
When we use sorted ports, the Power Center Server adds the ports to the ORDER BY clause in the default query. By default it is 0. If we change it to 1, the data will be sorted by the column that is at the top in the Source Qualifier. If we change it to 2, the data will be sorted by the top two columns.
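Continuing the same illustrative EMP example, setting Number Of Sorted Ports to 2 would make the default query look roughly like:

    SELECT EMP.EMPNO, EMP.ENAME, EMP.DEPTNO
    FROM EMP
    ORDER BY EMP.EMPNO, EMP.ENAME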
3) SELECT DISTINCT:
If we want the Power Center Server to select unique values from a source, we can use the Select Distinct option. Just check the option in the Properties tab to enable it.
4) PRE-SESSION and POST-SESSION Commands:
  • The Power Center Server runs pre-session SQL commands against the source database before it reads the source.
  • It runs post-session SQL commands against the source database after it writes to the target.
  • Use a semi-colon (;) to separate multiple statements.
5) USER DEFINED JOINS:
Entering a user-defined join is similar to entering a custom SQL query. However, we only enter the contents of the WHERE clause, not the entire query.
We can specify an equijoin, a left outer join or a right outer join, but not a full outer join. To use a full outer join, we need to write the SQL query ourselves.
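A minimal sketch, assuming the standard EMP and DEPT sources linked to one Source Qualifier: an equijoin is entered as just the contents of the WHERE clause, for example

    EMP.DEPTNO = DEPT.DEPTNO

For a left or right outer join, the join is written in the brace-enclosed Informatica join syntax instead, along the lines of

    { EMP LEFT OUTER JOIN DEPT ON EMP.DEPTNO = DEPT.DEPTNO }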
6) SQL QUERY:
For RDBMS sources, the Power Center Server generates a query for each Source Qualifier transformation when it runs a session. The default query is a SELECT statement for each source column used in the mapping. In other words, the Power Center Server reads only the columns that are connected to another transformation.
Note: If we do not cancel the SQL query, the Power Center Server overrides the default query with the custom SQL query. We can enter any SQL statement supported by our source database. Before entering the query, connect all the input and output ports we want to use in the mapping.
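As a sketch only (EMP, DEPTNO and SAL are assumed names), a custom query that performs an aggregate calculation could look like the following, with the Source Qualifier exposing two ports, DEPTNO and MAX_SAL, in this same order:

    SELECT EMP.DEPTNO, MAX(EMP.SAL) AS MAX_SAL
    FROM EMP
    GROUP BY EMP.DEPTNO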
Important Points:
  • When creating a custom SQL query, the SELECT statement must list the port names in the order in which they appear in the transformation.
  • Make sure to test the query in the database first before using it in the SQL Query. If the query does not run in the database, it will not work in Informatica either.
  • Also always connect to the database and validate the SQL in SQL query editor.




Expression Transformation

Expression Transformation is a connected, passive transformation (i.e. the number of input and output rows is the same). You can modify individual ports of a single row, or add or suppress ports. It helps implement complicated data transforms, applies business logic and performs checks and validations. You can use various functions according to your business logic.

Built-in Functions:
Data manipulation: concatenation (CONCAT or ||), case change (UPPER, LOWER), truncation, InitCap (INITCAP)
Data type conversion: TO_DECIMAL, TO_CHAR, TO_DATE, TO_FLOAT, and TO_NUMBER
Data cleansing: GREATEST, IN, INSTR, IS_DATE, IS_NUMBER, IS_SPACES, ISNULL, LEAST, LTRIM, RTRIM
Manipulate dates: GET_DATE_PART, IS_DATE, and DATE_DIFF
Scientific calculations and numerical operations: exponential, power, log, modulus, square root (EXP, POWER, LOG, MOD, SQRT)
Informatica functions: IIF, DECODE
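As a small sketch combining a few of these functions (the port names in_FIRST_NAME, in_LAST_NAME and in_HIREDATE_STR are assumed here), output ports could hold expressions such as:

    out_FULL_NAME:
        INITCAP( LTRIM( RTRIM( in_FIRST_NAME )) || ' ' || LTRIM( RTRIM( in_LAST_NAME )))

    out_HIRE_DATE:
        IIF( IS_DATE( in_HIREDATE_STR, 'MM/DD/YYYY' ), TO_DATE( in_HIREDATE_STR, 'MM/DD/YYYY' ), NULL )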

Properties of Expression Transformation:
Active/Passive: Expression Transformation is a passive transformation, as it only modifies the incoming port data; it does not affect the number of rows processed.
Connected/Unconnected Transformation: Expression Transformation is a connected transformation.
Types of ports in Expression Transformation: Input, Output, Variable port.
Expression transformation in Informatica:
Steps to create an Expression transformation:
  1. In the Mapping Designer, open a mapping.
  2. Click Transformation > Create. Select Expression transformation.
  3. You can also select the transformation by clicking the transformation button on the Informatica Designer toolbar.
  4. Enter a name and click done (ex: EXP_DIM_BUN_EMEA).
  5. Select and drag the ports from the source qualifier or other transformations to add to the Expression transformation. You can also open the transformation and create ports manually.
  6. Double-click on the title bar and click on the Ports tab. You can create output and variable ports within the transformation.
  7. In the Expression section of an output or variable port, open the Expression Editor.
  8. Enter an expression. Click Validate to verify the expression syntax.
  9. Click OK.
  10. Assign the port data type, precision, and scale to match the expression return value.
  11. To make it reusable, check the reusable option in the edit properties.
  12. Configure the tracing level on the Properties tab.
  13. Click OK.
  14. Connect the output ports to a downstream transformation or target.


Expression transformation Tabs:
  • Transformation: Enter the name and description of the transformation. The naming convention for an Expression transformation is EXP_meaningful_name. You can also make the transformation reusable.
  • Ports: Create port and configure them.
  • Properties: Configure the tracing level to determine the amount of transaction detail reported in the session log file.
  • Metadata Extensions:  Specify the extension name, data type, precision, and value. You can also create reusable metadata extensions.
Configuring Ports: You can configure the following components on the Ports tab:
  • Port name:  Add the name of the port.
  • Data type, precision, and scale:  Configure the data type and set the precision and scale for each port.
  • Port type: A port can be input, output, input/output, or variable. The input ports receive data and output ports pass data. The input/output ports pass data unchanged. Variable ports store data temporarily and can store values across rows.
  • Expression:  Use the Expression Editor to enter expressions. Expressions use the transformation language, which includes SQL-like functions, to perform calculations.
  • Default values and description:  Set default value for ports and add description.

Purpose of Expression Transformation: The Expression transformation is used to perform non-aggregate calculations on each row of data. Data can be modified using logical and numeric operators or built-in functions.

Performance Tuning on Expression Transformation:
  • Try to use numeric operations instead of string operations.
  • Operators are faster than functions (i.e. || vs. CONCAT).
  • Use transformation variables to break down complex transformation logic into smaller parts (see the sketch after this list). You can define a variable port by clicking the V check box while defining the port. In its expression, you can use all input ports and other variable ports for the calculation.
  • It is highly recommended to define a naming convention for the ports of an Expression transformation, for example an in_ prefix for input ports, out_ for output ports and var_ for variable ports.
  • Ports are evaluated in the following order: input ports first, then variable ports, and then output ports.
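A minimal sketch of the variable-port idea (port names and logic assumed purely for illustration), which also shows the input -> variable -> output evaluation order:

    in_SAL          (input)
    in_COMM         (input)
    var_COMM        (variable):  IIF( ISNULL( in_COMM ), 0, in_COMM )
    var_TOTAL       (variable):  in_SAL + var_COMM
    out_ANNUAL_PAY  (output):    var_TOTAL * 12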