Extraction Dates

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

Post Reply
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Extraction Dates

Post by vsi »

I posted this question early on, but to prevent exhaustive reading will continue from where I left last time.

Have patience for my lengthy question/discussion :twisted:

SCENARIO :-

Lets say I have got three sources SorA, SorB, SorC.


SorA -> has a field in its schema, to indicate that the data is valid prior to the date specified in this field (which means we can pull data only prior to specified date).

SorB -> The availability of data is known after a manual process of verification.

SorC-> Extracts should be on a particular date of Every Month.



I need to have extract dates for each source to be incorporated in target DW.

As far as i know till now is to have a table called "Audit Table" to store the extract dates (later on these values are populated to DW).

Question 1:

What are the fields that I can/should have for Audit Table?

My Comments

-> is it Sur_KEY, SorA_ExtDate, SorB_ExtDate, SorC_ExtDate,ProcessDate (anything else??)

Question 2 :

How/When should they be populated?

My Comments

-> Should I write the system date to audit table when ever the respective sequences are triggered, like eg. if SorA was triggered on 1st of Sept 2006, shall i take the date generated by datastage or the field which was specified in the field (as mentioned above). My guess from the source
-> In other two cases, my guess System Date

Question 3:
When it comes to next month, how will the extraction start

My Comments
-> should i first take extract date from audit table, add +1 to it and read data from there on???? to the current extract date (which is from Source for SorA and system date for SorB and SorC).
-> How can we be sure that the latest record ? we took from Audit Table is the last extract date????

Question 4 :

-> To trigger these three different sources at three different times, three Unix Scripts are to be writtien seperately, isnt it??
NBALA
Participant
Posts: 48
Joined: Tue Jul 11, 2006 11:52 am
Location: IL, USA

Re: Extraction Dates

Post by NBALA »

Hi,

Let me try to answer from my view point to your question.
[b]Q1:[/b]
you can have the audit table as below

Audit_key,
Source_name,
Process_Batch_number,
Batch_Extract_Date,
Processed_Row_Qty

Initialize the table with Source Names like below

1, SorA, 1, NULL, 0
2, SorB, 1, NULL, 0
3, SorC, 1, NULL, 0

[b]Q2:[/b]
Then for the first load update the table with extract date (system date) and row qty , from then for each load increment one row/source/load using a procedure (SQL).

[b]Q3:[/b]
Edited with following.
-----------------------
Each time you load increment the process_batch_number not the date.The date should be date of extraction for each source(if you use data stage to load take the @DATE for batch_extract_date.

I Hope this helps ! (IHTH ?)

-NB
Last edited by NBALA on Thu Sep 07, 2006 11:51 am, edited 2 times in total.
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

I wasn't able to make much out of what you said (Conceptually)? Still stuck at the same place?

Looks like the rows/columns have flipped. To accomodate that you have got Process_Batch_number.
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

:roll:
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

could someone help???


I am planning to go with Audit table structure as mentioned in earlier post.

Would help if you can point out the pitfalls as well
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Q1 has been answered.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.
Mamu Kim
NBALA
Participant
Posts: 48
Joined: Tue Jul 11, 2006 11:52 am
Location: IL, USA

Post by NBALA »

[quote="kduke"]Q1 has been answered.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.[/quote]

for Q1: You can add one more column as "Extract_Finish_Date" with default NULL

and update this when each extract is completed. This can be used for Q4 as Kim mentioned.

-NB
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

In Addition to fields in Audit table, I am getting lots of doubts regarding implementation in datastage. Pls bear with me and answer my questions.
Point a
This is how I plan to design

Code: Select all

Master_Seq_Stage_A -->
Master_Seq_Stage_B -->  Sequencer -> Master_Seq_Common -->  Seq_End
Master_Seq_Stage_C -->

Figure 1
Point b
In Sequencer stage the mode is selected as all

Point c
In all Master_Seq_Stage_A (B,C). All I do is extract, standardise, validate and load to staging tables i.e. flow as shown below

Code: Select all

Seq_Extract -> Seq_Transform -> Seq_Load
Point d
Now if I have the audit table as given below :

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	NULL		    |	NULL		|
|2			| B			|	1				|	NULL			|	NULL		|
|3			| C			|	1				|	NULL			|	NULL		|	
------------------------------------------------------------------------
Q1) Do I need to modify the above sequence (of extract,transform, load) as :

Code: Select all

Routine_Write_Extract_Date->Seq_Extract->Routine_Write_Extract_Finish_Date -> Seq_Transform->Seq_Load
or

Code: Select all

Routine_Write_Extract_Date->Seq_Extract-> Seq_Transform->Seq_Load ->Routine_Write_Extract_Finish_Date 
The result after execution of any one of the source will be

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	01-01-2006		|	01-01-2006	|
|2			| B			|	1				|	NULL			|	NULL		|
|3			| C			|	1				|	NULL			|	NULL		|	
------------------------------------------------------------------------

Q2) How do I skip the extraction step (upon failures)??? Where can I insert that check in the Master_Seq_Stage_A (B,C)

Q3) In fig1, in Master_seq_Common. I need to populate the datawarehouse with extraction dates from each of the source.

shall I take it for granted that the data in audit table is up to date i.e. or do I need to make checks

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	01-01-2006		|	01-01-2006	|
|2			| B			|	1				|	01-05-2006		|	01-05-2006	|
|3			| C			|	1				|	01-18-2006		|	01-19-2006	|	
------------------------------------------------------------------------

Q4) In the last stand alone sequence (fig 1) i.e. Seq_Load ( I will write new entries into the audit table, ready for next load)

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	01-01-2006		|	01-01-2006	|
|2			| B			|	1				|	01-05-2006		|	01-05-2006	|
|3			| C			|	1				|	01-18-2006		|	01-19-2006	|	
|4			| A			|	2				|	NULL			|	NULL		|
|5			| B			|	2				|	NULL			|	NULL		|
|6			| C			|	2				|	NULL			|	NULL
------------------------------------------------------------------------
Q5) Last question, The sources (A,B,C) all occur at different dates, so do who should be kicking off the extract jobs using dsjob???? the people who provide the source???
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Q1 First one
Q2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.

I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.

Q3 If this is a table then use it. Consider it to be accurate. If this is a hashed file then rebuild it every time.

Q4 Not a question

Q5 Personal preferrence. Whatever is easier. Sometimes other factors require that they notify operations then another step has to happen before the load can run. If you can do it clean then let them kick off a shell script to load it.
Mamu Kim
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

thanks for your quick and clear response

Point (a)
kduke
Q2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.

I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.
If I abort the sequence then as I have Master_seq_Common following up the Sequencer (with mode selected as ALL), then wont the Master_Seq-Common NEVER get executed

Code: Select all

Master_Seq_Stage_A --> 
Master_Seq_Stage_B -->  Sequencer -> Master_Seq_Common -->  Seq_End 
Master_Seq_Stage_C -->
or even if I abort within Master_Seq_Stage_A (B,C)

Code: Select all

Seq_Extract -> Seq_Transform -> Seq_Load
How can I specify to jump to next Sequence, lets say to Seq_Transform above??? or to Sequencer in Main Sequence??

Point (b)

kduke
Q4 Not a question
-> Wanted to ask, whether the approach I took is right??, by having the last sequence to load additional entries in the audit table (with null values i.e. batch_nbr=2), for next run of sequence.
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

Abort was a poor choice of words. It should skip to the end of that sequence. So each sequence should check the audit table to see if it has been run today. If it has then it exits to the last step. If it has not been run today then run all of the jobs normally. So do this check first in the sequence. Next have 2 branches. One to the last step in this sequence which will be a sequencer. The other to the first real job.

Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.
Mamu Kim
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

(1)

kduke
It should skip to the end of that sequence. So each sequence should check the audit table to see if it has been run today. If it has then it exits to the last step. If it has not been run today then run all of the jobs normally. So do this check first in the sequence. Next have 2 branches. One to the last step in this sequence which will be a sequencer. The other to the first real job.
a) So Shall I have like this within MasterSequence (given below)??
b) Shall I check the day, month, year of the audit table to that of system date?
b) How can we specify in Check_Audit_Table (a sequence) which path to take???

Code: Select all

Check_Audit_Table->Routine_Wr_Extract_Date->Seq_Extract->Routine_Wr_Extract_Finish_Date -> Seq_Transform->Seq_Load
     \                                                                                                          |
      ------------------------------------------------------------------------------------------------>Sequencer 
2)

kduke
Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.
a) Master_SeqA, Master_SeqB, Master_SeqC are triggered at different points of times(which is not known). So shall I write into the table only when they have triggered ?
Let me say the order of trigger is B,A,C

So after B triggers, I will have

Code: Select all

----------------------------------------------------------------------- 
|Audit_Key   | Source   | Batch_Nbr      | Batch_Ext_Dt   | Batch_Ext_Fin  | 
|---------------------------------------------------------------------- 
|1         | B         |   1            |   01-01-2006          |   NULL      | 
|------------------------------------------------------------------------
when A gets triggered then I will have

Code: Select all

----------------------------------------------------------------------- 
|Audit_Key   | Source   | Batch_Nbr      | Batch_Ext_Dt   | Batch_Ext_Fin  | 
|---------------------------------------------------------------------- 
|1         | B         |   1            |   01-01-2006          |   NULL      | 
|2         | A         |   1            |   01-10-2006         |   NULL      | 
------------------------------------------------------------------------
and so on...
(b) The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first.

3)

Just now realisted that I will have, one more addition Master_Seq_Start,
i) the only requirement for it is -> to run BEFORE ANY Master_Seq_Stage_A (B,C) starts.
ii) It wont write anything to the audit table.
iii) But how should I do this??? How do I create the sequence order and how do i trigger them through unix.

Code: Select all

                     Master_Seq_Stage_A --> 
Master_Seq_Start --> Master_Seq_Stage_B -->  Sequencer -> Master_Seq_Common -->  Seq_End 
                    Master_Seq_Stage_C -->
NBALA
Participant
Posts: 48
Joined: Tue Jul 11, 2006 11:52 am
Location: IL, USA

Post by NBALA »

Hi,

The sequence looks ok.
In check audit table sequence - first increase the batch number for the Src (A or B or C depends on the your source) insert new row with incremented batch number and system date as extract date.

Yes, you have to write into table when they triggered.

//// The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first. ////

You need to write a procedure or required an used defined SQL for this. Never bother about the trigger order of the source, all you required is each source needs to incremented to next count with the ext date.

Use the director scheduler for the Master you have or try with cron job.

-NB
kduke
Charter Member
Charter Member
Posts: 5227
Joined: Thu May 29, 2003 9:47 am
Location: Dallas, TX
Contact:

Post by kduke »

I think you need to quit asking questions and try something. Let us know if you get it working. You can talk this to death.
Mamu Kim
vsi
Premium Member
Premium Member
Posts: 507
Joined: Wed Mar 15, 2006 1:44 pm

Post by vsi »

Sorry for the trouble. I guess this would be last of my questions. Hope to get going after this. I was unable to figure out especially (Q3) and then (Q2). Because for Q2, I planned to write entries to the audit table at the end of all sequence with null, but proved wrong. So was confused how to approach.

But pls comment on (Q3), thats way out of my league.
Post Reply