Data Store Types :
Database Types ;
Flat Files :
^^General------ Type……………………………..Delimited or Fixed Width or R/3.
Name……………………………..Any Name
Custom Transfer Program………..Yes/No
Parallel Process Threads…………Any Number(2,5,10…)
Delimited — Select Delimited if the file uses a character sequence
to separate columns
Fixed width — Select Fixed width if the file uses specified widths
for each column.
^^ Data File -- Location …………….Local or JobServer
Root Directory……………C:\DI or C:\ or….
File Name …………………A.txt or *.txt or A.txt, B.txt, C.txt
Delete File………………….YES/NO. Yes If you don’t want
to append any data. If u want
to append the data you can set it to NO.
This option will be visible only if you use the flat file as target.
^^ Delimiters - Column …………Comma, Tab, Semicolon,Space or any special
character like | or ~ or # what ever it is….
Row…………….. {new line}
Text…………….. & or .anything
^^ Default Format- The file format editor provides a way to ignore rows containing a specified marker (or markers) when reading files.
For example, you might want to ignore comment line markers such as # and //.
Use the semicolon to delimit each marker, and use the backslash to indicate special characters as markers (such as the backslash and the semicolon). The default marker value is an empty string. When you specify the default value, no rows are ignored.
Eg : 1) Abc 2) abc;def;ghi 3) abc;\; 4) abc;\\;\;
We can even change the date, time or datetime formats also here.
^^Blank Padding The most important thing is the Blank Padding property will
occur only when you are selecting Fixed Width property for
Type in General.Then the Options are Trailing and Leading.
If we are using Trailing we will be getting right padding of spaces for all the rows of data If we are using Leading then we are getting Left padding of spaces for all the rows.
^^Input/Output-- Style ……………………..Headers,BOF/EOF
Skipped Rows…………….a NUMBER(like 2,5…)
Skip Row Header………….Yes/No
Write Row Header…………Yes/No
Skipped Rows is No: of rows u want to skip at beginning.
If u use Headers then only the other 3 properties work.
If u use BOF/EOF then the properties that appears are ….
BOF Marker………….{None}
EOF Marker………….{None}
^^Custom Transfer In order to activate this we got to check YES in General tab at
the beginning.
Then this tab will have the following :
Program Executable……… Any .Bat or .Exe File
User Name………………
Password…………….
Arguments…………..
The above can be justified by their names itself.
^^Locale------….-- Language………….. English,canada…….etc.
Code Page……………Various Code pages will b there thru
which the data from sources will be loaded into targets.
^^Error Handling Write Errors to File………………..
Error File Directory Location……….
Error File Name…………………
Capture row format error……………
The above can be justified by their names itself.
^^ Create File Format...
Creates a file format from a relational table schema. All data types match those of the original table schema.
To create a file format from an existing flat table schema
From the Query editor, right-click a schema and select Create File format.
#@) If the data in the source flat file is like
"CARE AMBULANCE OF ALABAMA, INC.","01-01252-","RISK"
Here the delimiter is , (Comma). Even the Comma is embedded in the first column data.
In this situation we got to do like this.
In the Delimiter section we got to put data in the following manner.
Column , (Comma)
Text “ (Double Quotes)
So we are explicitly telling data integrator to take text if it is embedded in “ (double quotes). In this way we can solve this issue.
Escape char
A special character sequence that causes Data Integrator to ignore the normal column delimiter. Characters following the escape character sequence are never used as column delimiters. For example, suppose you specify a forward slash as the escape character and a comma as the column delimiter. Then, you must have a forward slash to have a comma appear inside a field.
It’s nothing but to concatenate two words by removing the escape character.
Ignore row marker(s) :
Character sequence, which when found at the beginning of rows, cause Data Integrator to ignore the row when reading the file or automatically creating metadata. To enter multiple character sequences, separate each with a semi-colon. To include a semi-colon or backslash as a marking character, precede with a backslash.
Allowing the data even they r having some escape character.
*@) If you are using MAP-OP (operation) transform then before the Query Transform it should contain the KEY COLUMNS (Primary Key). We should not put all the columns in the Query Transform , We should have ONLY THE COLUMNS that are changing.
*@) If a DataFlow contains more than one Insert Flow Paths to the target, or if you are using more than one Source Table in your DataFlow then
Make Sure that you got to create the Source Paths named DF_SourcePath,DF_SourcePath1 etc. declared in the Frame Works at Workflow Level . The number of Source Paths will depend upon the No: of Insert Flow Paths in your DataFlow.
*@) While developing a Dataflow the Select Statement that is given in the Core Logic of Mapping Logic.Xls document must be put in the Where Condition Of the DataFlow.
*@) While working with flat files, if u have used *.txt make sure of the following :
• Don’t create Output file in the Same directory in which *.txt are there
• Be sure are you in need of all the TXT files in the directory or not
*@ We cannot import a table from a user, if there are indexes on that table.
The only way to import that table is
• Take a Backup of the Script of indexes
• Then Drop the indexes
• Now import the tables.
• Again Create the indexes.
*@ If the target column is of Data type CHAR(20) then while comparing with source column put Rpad() to the Source Column in order to make them equal.
*@ Got to see the Partitioned Columns if there are performanance Issues.
*@ Before asking an issue make a clear study of it (It might be a very silly question sometimes, that’s why be prepared before asking any issue)
*@ Unit Testing of Direct Loading is …..
Record count thru Optimized Sql comparing with Record count in the target table.
Taking the First row thru Optimized Sql and comparing with the First row in the target table. The data should be same.
*@ If we want to check for NULL values in DI, In Oracle or in Toad, we will delete the value of the particular column simply.Dont put any thing like NULL or spaces over there .Then we can compare the column to NULL in DI.(That means If there is no data in a column of a table in oracle , that will be treated as NULL in DataIntegrator)
*@ In Lookups while comparing the columns, we need the expression as source Column names. These expressions wont be validated. When you are running the job then only these expressions are validated.
The most important thing and the error gets regarding this is Some particular column i.e not selected from the table name.Actually this error comes if the expression is not found in that particular flow.
*@If the size of the column is not large enough with the source columns data then the source columns data will be truncated . The remedy is to increase the size of the column in the target table.
*@ To Avoid Initial Duplicate Records we got use Table Comparison, Two MAP Operations one for Insert Normal and other for Update Normal. In the Table Comparison
we will compare the key columns which can be noted from the indexed columns in Toad.
The Source and target columns are dragged from the source transform only. In the target side of Table comparison we will be placing an extra column which is changing.
*@Clean Load is nothing but the first time insertion into the target i.e target will not be having any records, since this is the first time load.
*@if you want to use ’N’ in an SQL( ) then you got to write it as \’N\’
For eg : sql('oracle_ds', ('SELECT "start_timestamp"
FROM "status_table"
WHERE "extract_name" = \'DF_RecoverDim\'
AND "stop_timestamp" = NULL')
Returns the timestamp
Returns the timestamp from a status table for the completed data flow. An Oracle datastore requires double quotation marks around the lowercase column and table names.
*@)While working with Lookup_ext and Lookup_Seq, we are having a special property called CUSTOM SQL, In order to work it out we got to follow the same procedure as we do with the normal Lookup_ext. But in the Lookup Condition we can put the where condition of SQL Stt.(One condition is enough). While returning Columns we got to use the same no: of columns with the same data type as we are using in the Select Stt. In CUSTOM SQL.
*@) In a single DATAFLOW We can have two flows which are having different Sources and different Targets.(There would be no connection between these 2 flows.)
*@) Generates the next value in a series, after determining the last value in the
series.
In Key_generation() function we got to pass 3 parameters.
1) DataStorename.AliasName.TableName (Of UserName, If alias is Created.)
DataStorename.UserName.TableName (If alias is not Created)
2)Column Name
3) Incremental Value.
Eg : DataStore Name……Scott_DS
Alias For User…..Scott_A
Table Name………Emp
Column Name ………Empno
Key_Generation(‘Scott_DS.Scott_A.Emp’,’Empno’,1)
If Alias is not Created for User Scott. Then
Key_Generation(‘Scott_DS.Scott.Emp’,’Empno’,1)
*@)if a Select Statement is returning Multiple Values and you want to check a column value whether it presents in the multiple values then
Deptno_Src in( sql(‘scott’,’select empno from emp where deptno > 20’))
Deptno_Src is the column that is coming from the source.
For Suppose the Deptno should not present in those Multiple values then
Not Deptno_Src in( sql(‘scott’,’select empno from emp where deptno > 20’))
If u want to check this in where Condition u can directly put it in Where condition or u can write the select stt. In Custom Function and call it in Where Condition.
*@) The square brackets ([]) indicate that the value of the expression should be substituted.
# The curly braces ({}) indicate that the value of the expression should be quoted with single quotation marks.
$Gbl_No=10;
$Gbl_Name='BBB';
Sql('SOURCE','Insert Into TST Values(10,\'AAA\')');
Sql('SOURCE','Insert Into TST Values(' || $Gbl_No || ',\'' || $Gbl_Name || '\')');
Sql('SOURCE','Insert Into TST Values([$Gbl_No],{$Gbl_Name})');
*@) If the DF contains a group by, then all the query transforms after the group by will not be pushed down to the DB? And in this case, we have to use a Data_Transfer component to enable the full push-down.
DI can push down some nesting. For example, in DI if you have in a mapping:
sum ( ifthenelse (a=1, 1, 0) )
This gets pushed down to a CASE within a SUM in SQL Server properly, and the group by is done in the database. Complex nesting, it cannot do.
*@) In order to replicate a job, simply take a atl of that job and rename the job,workflows and Dataflows. Then import the atl of the job u just exported.
*@) if a condition is like Column is NULL, if we put that one in Query Transform named Query1 that is put after the Source table, if 2 flows are coming from the Query1 then the data won’t go in 2 flows, the data comes in one flow only , the data doesn’t go thru another flow.
Eg :
*@ If the Column Mapping is like Is Number(Columnname) or Columnname is Numeric.
Then we can do the following.
Is_Valid_Int(Columnname,’9’)
@$)
XML SCHEMA AS SOURCE :
Ingredients : DTD(Document Type Definition)
XSD(XML(Extensible Markup Language) Type Definition)
In which the structure of the table will be available.
XML…in which the data related to the table will be available.
Need to fill :
Format Name…. Any valid name as per the rules
Filename/URL…. Locate the Valid filename /URL.
Root Element Name …. Fill this by seeing the XML file/XSD file. It is the element
that comes in the beginning of the file and at the end of the file.
We can create the New XSD, XML files in XML Schemas itself.
We can create the New DTD,XML files in DTD Schemas itself.
TERMS TO REMEMBER Make Current, Nested, Delete, Element, Tree,Field.
FIELD It is the last object which can’t be further expanded. Also contains, data.
TREE It is the parent of a object. Can have sub trees (Also called Elements). As Called till it can be expanded.
ELEMENT Can be called as Subset of Tree. On the whole we can sayTree as Element.
MAKE CURRENT In order to DELETE a field in a Element we got to make the Element as MAKE CURRENT. It will be more useful if we can do MAKE CURRENT to the parent ELEMENT itself so that we can delete any of the child ELEMENTS that comes under
that PARENT.
NESTED In order to get the fields that are needed to your target table, just make all the ELEMENT’s as NESTED and DELETE the fields that are not needed and keep the fields that are needed.
After making NESTED to all the the ELEMENTS and DELETING all the unnecessary FIELDS, then you can drag the FIELDS into the next Transform.
*@) In the Lookup table almost in all the cases we will be finding the multiple “AND” conditions. But rarely you will be finding multiple “AND”s along with a single/Mutiple “OR” condition. So at this time we are having 2 options.
1) Applying the “AND” conditions first and returning the columns that are needed for “OR” condition. Place another Query Transform and in that apply the “OR” Condition by using the return Columns.
2) Implementing it in two flows and then merging.
*&)In the case of normal LOOKUP(Not the Modify Function Call) We can use Global variables in order to assign the return columns data.
*&) A stored procedure completed with in 4 minutes in db engine is running more than 20 minutes to finish through DI job. How to tune the stored procedure calls for better performance?
print('starting SQL call');
sql('CIP_Staging', 'EXEC SP_GENERATE_GOLDENRECORD');
print('finished SQL call');
DI puts a transaction around the DB calls,
If this is SQL, either put into the SP as the first item (before it does anything to manipulate data)...
SET IMPLICIT_TRANSACTIONS OFF
...or in the script within the sql function.
*&) An ATL has been mailed to u. At which level has it been exported
how can we know?
In Order to find out the content of ATL remember the following and search for them in the following order in the A.T.L :
AL_PROJECT ------- PROJECT
CREATE SESSION -------- JOB
CREATE PLAN ------- WORKFLOW
CREATE DATAFLOW -------- DATAFLOW
*&) Which is best way to use lookups in query transform? Use multiple query transform or single query transform? Shortest way or big data flow?
I have five lookup for target dimension data.
Emp_ID
Department_id
Location_id
City_id
Country_id
Which is best way to use these lookups in data flows?
1: (Query1 + lookupEmpMast) + (Query1 + lookupDepartMast) + (Query1 + lookupLocMast) +… Multiple query transform with each lookup
Or
2 (Query 1 + looupupEmpMast + lookupDepartMast + lookupLocMast …)… Single query transform and all lookups in query.
Q1:How the performance is affected in both ways?
Q2:First method requires more cache memory (temporary space) on server?
A)
having multiple queries does not cost you more memory (almost true) as only pointers are moved in case the column value does not change. So performance wise, multiple queries will not be fast nor slower.
On the other hand, one query is one CPU-thread only, so if you have an unblievable numbers of lookups they would saturate one CPU only leaving the others idle. Splitting those lookups into multiple queries would avoid that. But actually, you can dived one query onto multiple CPUs by using the DegreeOfParallelism Parameter of the dataflow. So again, no reason.
To sum up, I would use one query for handling reasons.
Go to the DataFlow of that query, rightclick it an select properties. There is one field called "Degree of Parallelism"
0....same as Global Setting = 1
1....no partitioning
n....split all (possible) transforms into n parts.
*&)
I have that unbelievable 'Lookups' scenario. Have about 30-40 dimension lookups from a fact table. It is taking a long time to run the Job. 20 Minutes to run a 10000 row load. With the Dim Lookups not enabled, it is running in a minute. We have 4 CPUs on the Windows Job Server.
Any suggestions here for us? I don't want to write a SQl outer join on from Facts to Dims as that'll be not fully using DI's functionality. But this delay is causing us issues. We also have the Business Objects Reporting server on the same machine as the Job Server. I know that could be causing some issues too. There is 3GB RAM on the machine.
Should I use logical partitions along with DoP? We're on a DB2 DB. The fact doesn't have any DB partitions
A)
First, you should never have to do outer-joins from facts to dimensions. If you do, then it's philosophically a design problem. If you have missing data in your facts (i.e. dimension not present), then have an entry in your dimension called "NONE".
That is, if sometimes you have a Customer and sometimes not, in the facts where there is no customer, put a foreign key of "-1". In your Customer dimension, put an entry for "NO CUSTOMER" with a key of -1. Then all joins can be inner joins.
As far as using joins vs. lookups... our rule of thumb has been this:
- If you only need a single column of data from the table being looked up, then lookup() is fine, particularly if you can cache the results. (By the way, be sure to investigate doing a DEMAND_LOAD cache or a PRE_LOAD cache. It can make a huge difference if the table will fit in memory.)
- If you need more than one column of data from the table being looked up, it's probably more efficient to join it in, because then you can reference multiple columns without doing multiple lookups.
Be sure that your dimension tables are properly indexed. Not just the artitrary primary key of the dimensions, but whatever foreign keys you might be using if you're doing "reverse" lookups (meaning customer ID "ABCDE" from your source system is really customer ID 12345 in your dimension table.)
10000 rows -- even with 30-40 lookups -- should be do-able in 5 minutes or less. However if you have users doing reporting on the fact table in the middle of this, you're sure to have locking issues (along with data consistency problems).
3Gb on one machine to include both DI and BO seems a bit light, especially if you're going to try caching everything in sight.
-------------------------------------------------------
I assume you are doing the lookups to get the surrogate keys of all or dimensions. Otherwise I fully agree with the above one. Do you know now why I favor natural keys if possible?
Nevertheless you can limit the impact of using surrogate keys in dimension by using a trick in the data model.
Let's assume you have three source tables, customer, accounts and revenues. Obviously, customers and accounts will be loaded in the DIM_CUSTOMER and DIM_ACCOUNTS, both with surrogate keys as primary key. And when loading the revenues into the fact you need two lookups...... No you don't!
It would not be a big deal to have an extra column in the DIM_ACCOUNTS called ACCOUNT_CUSTOMER (each account belongs to customer) and you lookup the surrogate key of the customer while loading this DIM_ACCOUNTS. And if there are other dimension tables as well, like CUSTOMER_REGION, you load that surrogate key into the DIM_CUSTOMER and copy it to DIM_ACCOUNTS as well.
So at the end, when loading the fact, you have to lookup the account only and you get the other dimension keys for free.
So if your 20 tables are all independant - you do not benefit from that technique. But often you can use it.
There is another positive side effect, you will reduce not only the numbers of lookups but also the number of lookup calls. As there are far less customers than fact rows, you call the lookup for the region just customer-times, not as often as there are fact rows.
*&)
Source table has 6 million rows. Some of them are duplicates that cannot be fixed at the source side. Therefore I have to take care in ETL which I am doing through "AUTOCORRECT"
It's taking 2 days to load data from source to target.
Is there any other way to make it faster? I think Lookup / Table Comparison will not help because of the source size.
What other ways do I have?
A)
When dealing with autocorrect load, make sure the target table has a primary key and DI does actually know about, meaning in Designer you can see it.
Let's assume I have a table called CUSTOMER ( CUST_ID number, gender varchar(1));
Depending on the primary key as know within DI, autocorrect load defines the where clause: "update ....where CUST_ID = :Parameter". In case no PK is found we assume all columns are primary keys.
So it is a very good idea to make sure CUST_DI is a primary key otherwise a wrong where clause is generated. Second, it would be benefitial to have a physical primary key as well, otherwise there is no index the database can use to find this one CUST_ID efficently.
&*) LINKED DATASTORES :
First of all we got to create a datastore naming dst1 connecting with Oracle database and give the alias name as DS_OWNER1 and make this as the default configuration,. We got to create another configuration for the same datastore naming config2 in which we will be connected to SQL SERVER database with the alias name as DS_OWNER2.
The crutial thing comes now, we got to create a database link in the oracle database by using the following syntax for a particular user.
create database link lnk connect to scott identified by tiger using 'oradb9';
where oradb9 is the connection string.
After successfully completing all the above steps, edit the datastore dst1,
Click on the advanced button, then click on the Edit button of Configurations,
now go to the alias config1, there click on the LINKED DATASTORES option i.e available over there. After selecting the Datastore dst1 over there u got to click on the browse button over there, then check the use the database link and select the link u want over there.
*@) COMPACT REPOSITORY :
Data Integrator saves a version of each object every time you save the object.
Repeatedly modified object definitions can consume a substantial amount of
space. If you notice your repository performance degrading, consider
compacting the repository.
To access the Compact Repository command, select Project > Compact
Repository from the menu bar. This command removes previous object
versions maintained by Data Integrator.
You can also compact your repository manually. If you have never compacted
the repository, the majority of space in the repository could be occupied by old
versions of Data Integrator objects. In this case, the Compact Repository
Export/import
Backing up repositories 4
Data Integrator Advanced Development and Migration Guide
command might be too slow and tedious. Instead, you can export the latest
versions of the repository object definitions to a file, clear the repository
database by creating a new repository, then reimport the object definitions.
�� To compact your repository by creating a new repository
1. Export the repository to a file.
The file type can be either XML or ATL. The latest version of each object
is exported.
2. Choose Repository Manager from the Start > Programs > Data
Integrator menu.
3. From the Repository Manager, add the database connection information
for the repository.
4. Click Create.
Data Integrator warns that a valid repository already exists.
5. Click Yes to overwrite the old repository.
The Repository Manager creates a new repository, removing all of the old
objects.
6. Import the previously exported repository.
*@#) What are the different REPLICATABLE and NON-REPLICATABLE objects in D I?
REPLICATABLE OBJECTS NON-REPLICATABLE OBJECTS
Workflow Database Function
Dataflow XML Schema
Batch Job Data Store
Custom Functions Tables, Template tables
Flat Files
In Order to replicate the Non-Replicatable Objects, we got to take the A.T.L of that object and replace the existing name of it to the desired new name and then save it, import that A.T.L. Automatically a new object with the new name will be visible in the Designer.
Select this check box to use auto correct loading.
Auto correct loading ensures that the same row is not duplicated in
a target table. This is particularly useful for data recovery
operations.
Note: This option is not available for targets in real time jobs
or target tables that contain LONG column(s).
When you select this check box, Data Integrator:
1. Reads a row from the source.
2. Checks if a row exists in the target table with the same
values in the primary key. If the Use input keys check
box is selected, Data Integrator uses the primary key of
the source table. Otherwise, Data Integrator uses the
primary key of the target table; if the target table has no
primary key, Data Integrator considers the primary key to
be all the columns in the target.
• If a matching row does not exist, inserts the new row
regardless of other options.
• If a matching row exists, updates the row depending
on the values of Ignore columns with value and
Ignore columns with null:
• When the column data from the source matches
the value in Ignore columns with value, the
corresponding column in the target table is not
updated. The value may be spaces. Otherwise,
the corresponding column in the target is
updated with the source data.
• When the Ignore columns with null check box is
selected and the column data from the source is
NULL, then the corresponding column in the
target table is not updated. Otherwise, the
corresponding target column is updated as
NULL since the source column is NULL.
Data Integrator can optimize data flows such that an Oracle
source database completes the auto correct load operation.
When all other operations in the data flow can be pushed
down to the source database, the auto-correct loading
operation is also pushed down. The generated SQL
implements the Ignore columns with value and Ignore
columns with null options.
Tab Option Description
STAGING TABLES AUTOMATED
Occasionally, one dataflow became so complex or required so much memory that we split the dataflow into parts with staging tables inbetween.
The very same thing can be done with the new transform "Data_Transfer". Wherever you place that transform inside your complex dataflow, the engine will actually split the dataflow into two (or more) parts just as we did manually before. The major advantage is, such a Data_Transfer object can be reconnected or even disabled. Much easier to change the split points. Try to move one transform from one dataflow into the other of your manually broken up dataflow.
*&$) If we are having the TimeStamp Columns in the source tables and that to if we are using them in the SQL TRANSFORM then
1)Make sure to convert them to varchar using TO_CHAR() in the SQL TRANSFORM itself.
2) In the next QUERY TRANFORM convert the same to date by using TO_DATE().
3) Change the datatype in the QUERY TRANFORM for the particular column with DATE.
*&) If the column in the target table is having NUMBER Datatype with no Size and Precision
in the database then
1) By Default the datatype will be DECIMAL(28,7).
Change the datatype of that particular column in the QUERY TRANSFORM just before the target table with DECIMAL(28,0)
ORACLE
*&) TRUNC(DBMS_RANDOM.VALUE(1,100))
DBMS_RANDOM.VALUE(range) is used to get a random value from the specified range mentioned in it. Since it returns a decimal value according to the needs we can use TRUNC() to make it an integer.
Eg :
select b.*,trunc(dbms_random.value(1,100)) from emp b order by trunc(dbms_random.value(1,100));
@#) While calling a procedure/Function thru EXECUTE IMMEDIATE we got to pass a parameter in a procedure/Function and it can be of any as below.One more thing we got to write USING IN OUT
declare
k varchar2(30):='ramesh';
a varchar2(10):='prc';
begin
execute immediate ' begin '|| a || '(:1); end;'
using IN OUT k;
dbms_output.put_line(k);
end;
*$) In order to get the date after the difference on the basis of month,year we can use
NUMTOYMINTERVAL()
We can use this for both year and month.
If we want the difference with days , we can use the no: of days directly without using any function.
Eg : select to_char((sysdate - NUMTOymINTERVAL(3, 'month')),'dd-mon-yyyy') from dual;
select to_char((sysdate - NUMTOymINTERVAL(3, 'year')),'dd-mon-yyyy') from dual;
select sysdate – 20 from dual;
*$) In order to get the date after the difference on the basis of day,hour,minute,second we can use
NUMTODSINTERVAL()
We can use this for day,hour,minute,second.
eg: select to_char((sysdate - NUMTOdsINTERVAL(3, 'hour')),'dd-mon-yyyy hh:mi:ss') from dual;
*@$)Oracle
#@) DATABASE LINK SYNTAX
create database link lnk connect to scott identified by tiger using 'oradb9';