DataStage Certification preparation – Section 1

I had a glimpse at the topics you need to know for Section 1 (Installation and Configuration) of the DataStage certification. Installation carries a 5% weighting in the exam.

A sample question from this section:

1. What access does the UNIX user need to run a DataStage job?

                 a. Read access to APT_ORCHHOME

                 b. Read/write access to scratch space

                 c. Execute access to local UNIX scripts

The attachment contains the important points needed for Section 1 of the certification.


Is DataStage a user-friendly ETL tool? Do we need it?

I am getting close to completing one year of using DataStage, and like everyone else I have interacted with, I feel the same way.

DATASTAGE IS HARD TO LEARN, and the learning path is time consuming. The most frustrating part, for me, is the lack of an easily available demo version and the mediocre documentation. So I started thinking about what DataStage really offers and why everyone is so crazy about ETL tools.

I guess there is so much of a market for ETL tools because the world is losing genuine software developers every day. Err… yeah, it's true. I guess no one would go for an ETL tool if they could get a first-rate PL/SQL developer who could deliver excellent loading times using a fully loaded parallel Oracle with pipelined transformations.

Let's take the performance aspect. As I read in one of the DBA books, databases are evolving too fast for ETL tools to catch up. I just wonder how many years it will take for DataStage to implement Oracle 10g Data Pump export/import or do a partition exchange load. And by the time it does, I guess Data Pump will have given way to a more distributed or more parallel version. It is probably the same story with most other databases.

So what is it that makes ETL tools so popular? Reusability? Development time? In the end I can come up with only one reason: they make the life of software developers easy… or is it the other way around… to kill the computing interest in software developers and produce a wave of lame drag-and-drop specialists… :)


DataStage: Email notification activity & job parameters

Yet again, some things that could be improved in DataStage.

We had a requirement to mail the link statistics after a job completes. A simple task, but the difficulty here is that the end users (read: DataStage support personnel) are not comfortable with before/after routines, so we had to build it with the Email Notification activity available in job sequences; the support team can then update the email addresses in the activity whenever the recipient list changes.

The problem is that the body of the Email Notification activity is static, except for environment parameters (we will come back to this). So if we want any run-time content published in the mail, we cannot do it. I think accepting job parameters in the body is a minimum requirement. My idea is to have the link counts collected (either in the after-job subroutine of the DataStage job or using the dsjob command, as sketched below) and passed as a job parameter to the Email Notification activity.
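To sketch the dsjob half of that idea from a wrapper shell script: the project, job, stage, link and parameter names below are made up, the exact wording of the dsjob report differs between versions, and of course the substitution of that parameter into the email body is the very part that is missing today.

    # Report the row count for one output link of the load job
    dsjob -linkinfo MyProject LoadCustomers xfmValidate lnkLoaded > /tmp/linkinfo.txt
    # Pull the numeric count out of the report (adjust the pattern to your dsjob output)
    ROWCOUNT=`grep -i "row count" /tmp/linkinfo.txt | awk '{print $NF}'`
    # Run the mailing sequence, passing the count as the (hypothetical) job parameter pLinkCount
    dsjob -run -param pLinkCount=$ROWCOUNT MyProject seqMailStats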

The Email Notification activity actually generates a UNIX script and executes it (probably using mailx and uuencode; see the sketch below). The script is generated and destroyed within the sequencer, so the execution details stay hidden from the developer. I strongly believe that environment parameters are dynamic in the email body because they map directly to the UNIX variables exported in that same shell. So you never actually discover that environment parameters get substituted in the email body unless you type them in manually, since the email body has no "Insert job parameter" button.
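I am only guessing that mailx and uuencode are involved; for reference, the classic UNIX idiom for mailing a plain body, or a body plus a uuencoded attachment, looks like this (file names and addresses made up):

    # Plain text body only
    echo "Job LoadCustomers finished OK" | mailx -s "DataStage job report" support@example.com
    # Body followed by a uuencoded attachment
    ( cat body.txt ; uuencode stats.txt stats.txt ) | mailx -s "DataStage job report" support@example.com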

Wow!! A small and insignificant discovery.


DataStage: Playing with date format masks

Last week we got a requirement to validate a date field.

The dates were in mm/dd/yyyy format.

1. 7/7/2007 format for some records

2. 07/7/2007 format for some records

3. 7/07/2007 format for some records

4. 07/07/2007 format for some records.

and in between them there were some invalid dates like 09/31/2007 and 09/040/200.

Having been an Oracle developer before I started using DataStage, the first thing I went for was TO_DATE() in the Oracle Enterprise stage. Damn! It was not so easy: the stage aborted on invalid dates instead of sending them down the reject link. I searched online for a way to capture TO_DATE() errors into a reject link; after a couple of hours, nothing concrete had come up.

OK, so I decided to do the validation in DataStage. I already had a Transformer in the job, so I added a constraint of the form IsValid("date", StringToDate(TrimLeadingTrailing(<inp_Col>), "%mm/%dd/%yyyy")) and passed the invalid dates down the reject link. I compiled and ran the job. It ran successfully and I thought everything was fine, until I recognised that the format mask is not intelligent enough to handle single-digit day and month fields.

Hmmm… I was back to square one. Then I tried some innovation with format masks like %[mm]/%[dd]/%yyyy, %m[m]/%d[d]/%yyyy, etc., but nothing worked.

Anyway, at last I got the task done for the day with some troublesome logic inside the Transformer (identifying the single-digit day and month fields and concatenating a zero in front of them with the help of the Field() function; a sketch of the idea follows below), and made my boss happy. But for such a simple requirement, I could not understand why the DataStage date format mask is not modelled intelligently enough.
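Roughly, the zero-padding idea looks like this as Transformer stage variables (the link and column names are made up; this is just an illustration of the approach, not the exact derivations from the job):

    Stage variable svMonth:  If Len(Field(lnkIn.SRC_DT, "/", 1)) = 1 Then "0" : Field(lnkIn.SRC_DT, "/", 1) Else Field(lnkIn.SRC_DT, "/", 1)
    Stage variable svDay:    If Len(Field(lnkIn.SRC_DT, "/", 2)) = 1 Then "0" : Field(lnkIn.SRC_DT, "/", 2) Else Field(lnkIn.SRC_DT, "/", 2)
    Stage variable svDate:   svMonth : "/" : svDay : "/" : Field(lnkIn.SRC_DT, "/", 3)
    Constraint (valid rows): IsValid("date", StringToDate(svDate, "%mm/%dd/%yyyy"))
    Output derivation:       StringToDate(svDate, "%mm/%dd/%yyyy")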

In Oracle, TO_DATE() is intelligent enough to recognise such data when the format mask is specified as MM/DD/YYYY.

We also had a timestamp field coming in from the source with the format mask 'MM/DD/YYYY HH:MI:SS AM' (Oracle style). Even this I was not able to validate properly in a DataStage timestamp field, since DataStage is not intelligent enough to recognise AM/PM. I guess I need to learn how regular expressions are specified in DataStage.

These are day-to-day requirements in ETL which I think DataStage must handle. Either I need to do more research, or DataStage should offer more flexible format masks in the next release and make the life of developers easier.


DataStage: Warning removals

Here I am collecting the warnings developers most often encounter when coding DataStage jobs, along with ways to resolve them.

1. Warning: Possible truncation of input string when converting from a longer string to a shorter string in a Modify stage.

Resolution: In the Modify stage, explicitly specify the length of the output column.

Ex: CARD_ACCOUNT:string[max=16] = string_trim[" ", end, begin](CARD_ACCOUNT); instead of just CARD_ACCOUNT = string_trim[" ", end, begin](CARD_ACCOUNT);

2. Warning: A null result can appear in a non-nullable field. Mostly thrown by DataStage when aggregate functions are used in an Oracle DB stage.

Resolution: Use a Modify or Transformer stage between the lookup and the Oracle stage. When using a Modify stage, use the handle_null clause.

EX: CARD_ACCOUNT:string[max=19] = handle_null(CARD_ACCOUNT,-128);

-128 will be substituted into CARD_ACCOUNT wherever CARD_ACCOUNT is null.

3. Warning: A decimal precision error when converting from decimal[p,s] to decimal[x,y].

Resolution: Specify the exact scale and precision of the output column in the Modify stage specification and use trunc_zero (the default r_type for the decimal_from_decimal conversion).

Ex: CREDIT_LIMIT:decimal[10,2] = decimal_from_decimal[trunc_zero](CREDIT_LIMIT); instead of just CREDIT_LIMIT = decimal_from_decimal[trunc_zero](CREDIT_LIMIT);

For further information on where you need to specify the output column type explicitly and where it is not necessary, refer to the data type default/manual conversion guide.

For a default data type conversion ('d') a size specification is not required; for a manual conversion ('m') an explicit size specification is required. The table is available in the Parallel Job Developer's Guide.

4. Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0

Resolution: Set the preserve partitioning flag to Clear on the stage feeding the Sequential File stage.

5. Warning: A user defined sort operator does not satisfy the requirements.

Resolution: In the job flow, check the columns on which the sort is happening. The order of the columns must also match: if you sort on columns in the order X, Y in the Sort stage but join on the columns in the order Y, X, the Join stage throws this warning, since it cannot recognise the change in order. DataStage also seems to throw this warning at compile time, so renaming a column between stages triggers it too. Say I have sorted on column X in the Sort stage but the column name is changed to Y at the output interface; the warning is still thrown. Revert the output interface column to X and the warning disappears.


DataStage: Join or Lookup or Merge or CDC

Many times this question pops up in the mind of Datastage developers.

All of the above stages can be used for the same task: match one set of data (say, the primary) with another set (the reference) and see the results. DataStage normally uses different execution plans (hmm… I should ignore my Oracle legacy when posting on DataStage). Since DataStage is not as nice as Oracle about showing its execution plan, we need to stand in for the optimiser and analyse our requirements ourselves. I have come up with a table:

Most importantly, it is the primary-to-reference volume ratio that needs to be considered, not the actual counts.

Primary Source Volume       Reference Volume            Preferred Method
Little (< 5 million)        Very huge (> 50 million)    Sparse Lookup
Little (< 5 million)        Little (< 5 million)        Normal Lookup
Huge (> 10 million)         Little (< 5 million)        Normal Lookup
Little (< 5 million)        Huge (> 10 million)         Normal Lookup
Huge (> 10 million)         Huge (> 10 million)         Join
Huge (> 10 million)         Huge (> 10 million)         Merge, if you want to handle rejects on the reference links


DataStage: Job design tips

I am collecting general design tips that help developers build clean and effective jobs.

1. Turn off Runtime Column propagation wherever it’s not required.

2. Make use of Modify, Filter, Aggregator, Column Generator, etc. stages instead of the Transformer stage only if the anticipated volumes are high and performance becomes a problem. Otherwise use the Transformer; it is much easier to code a Transformer than a Modify stage.

3. Avoid propagating unnecessary metadata between stages. Use a Modify stage to drop unwanted columns; note that Modify drops them only when explicitly told to via the DROP clause.

4. One of the most common mistakes developers make is not doing a volumetric analysis before deciding between the Join, Lookup, and Merge stages. Estimate the volumes first and then decide which stage to use.

5. Add reject links wherever you need to reprocess rejected records or where considerable data loss may happen. At a minimum, keep rejects on Sequential File stages and on writes to database stages.

6. Make use of an ORDER BY clause when a DB stage feeds a join; the intention is to use the database's power for sorting instead of DataStage resources. Keep the join partitioning as Auto, and when relying on the ORDER BY, insert a Sort stage set to the "don't sort (previously sorted)" option between the DB stage and the Join stage.

7. When doing outer joins, you can use a dummy variable just for null checking instead of fetching an explicit column from the table.

8. Use Sort stages instead of Remove Duplicates stages; the Sort stage has more grouping options and sort indicator options.

9. One of the most frequent problems developers face is lookup failures caused by not taking care of the string pad character that DataStage appends when converting shorter strings to longer fixed-length strings. Try to decide on the APT_STRING_PADCHAR and APT_CONFIG_FILE parameters from the beginning. Ideally APT_STRING_PADCHAR should be set to 0x00 (the C/C++ end-of-string character) and the configuration file to the maximum number of nodes available.

10. Data partitioning is a very important part of parallel job design. It is always advisable to keep the partitioning as 'Auto' unless you are comfortable with partitioning, since all DataStage stages are designed to behave as required with Auto partitioning.

11. Do remember that Modify drops metadata only when it is explicitly asked to do so using the KEEP/DROP clauses; a small sketch follows below.
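As a reminder of points 3 and 11, a minimal sketch of a Modify stage specification (the column names are made up):

    DROP AUDIT_TS, FILLER_1

or, coming at it from the other direction,

    KEEP CUST_ID, CUST_NAME, CREDIT_LIMIT

DROP removes just the named columns, KEEP passes through only the named columns, and without one of these clauses Modify propagates everything it receives, which is exactly the behaviour points 3 and 11 warn about.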

Comments are welcome.
