Extraction Dates

vsi · Post by **vsi** » Wed Sep 06, 2006 12:23 pm

I posted this question early on, but to prevent exhaustive reading will continue from where I left last time.

Have patience for my lengthy question/discussion

SCENARIO :-

Lets say I have got three sources SorA, SorB, SorC.

SorA -> has a field in its schema, to indicate that the data is valid prior to the date specified in this field (which means we can pull data only prior to specified date).

SorB -> The availability of data is known after a manual process of verification.

SorC-> Extracts should be on a particular date of Every Month.

I need to have extract dates for each source to be incorporated in target DW.

As far as i know till now is to have a table called "Audit Table" to store the extract dates (later on these values are populated to DW).

Question 1:

What are the fields that I can/should have for Audit Table?

My Comments

-> is it Sur_KEY, SorA_ExtDate, SorB_ExtDate, SorC_ExtDate,ProcessDate (anything else??)

Question 2 :

How/When should they be populated?

My Comments

-> Should I write the system date to audit table when ever the respective sequences are triggered, like eg. if SorA was triggered on 1st of Sept 2006, shall i take the date generated by datastage or the field which was specified in the field (as mentioned above). My guess from the source
-> In other two cases, my guess System Date

Question 3:
When it comes to next month, how will the extraction start

My Comments
-> should i first take extract date from audit table, add +1 to it and read data from there on???? to the current extract date (which is from Source for SorA and system date for SorB and SorC).
-> How can we be sure that the latest record ? we took from Audit Table is the last extract date????

Question 4 :

-> To trigger these three different sources at three different times, three Unix Scripts are to be writtien seperately, isnt it??

NBALA · Post by **NBALA** » Wed Sep 06, 2006 2:17 pm

Hi,

Let me try to answer from my view point to your question.
[b]Q1:[/b]
you can have the audit table as below

Audit_key,
Source_name,
Process_Batch_number,
Batch_Extract_Date,
Processed_Row_Qty

Initialize the table with Source Names like below

1, SorA, 1, NULL, 0
2, SorB, 1, NULL, 0
3, SorC, 1, NULL, 0

[b]Q2:[/b]
Then for the first load update the table with extract date (system date) and row qty , from then for each load increment one row/source/load using a procedure (SQL).

[b]Q3:[/b]
Edited with following.
-----------------------
Each time you load increment the process_batch_number not the date.The date should be date of extraction for each source(if you use data stage to load take the @DATE for batch_extract_date.

I Hope this helps ! (IHTH ?)

-NB

vsi · Post by **vsi** » Thu Sep 07, 2006 5:30 am

I wasn't able to make much out of what you said (Conceptually)? Still stuck at the same place?

Looks like the rows/columns have flipped. To accomodate that you have got Process_Batch_number.

vsi · Post by **vsi** » Thu Sep 07, 2006 9:04 am

vsi · Post by **vsi** » Wed Sep 13, 2006 1:40 pm

could someone help???

I am planning to go with Audit table structure as mentioned in earlier post.

Would help if you can point out the pitfalls as well

kduke · Post by **kduke** » Wed Sep 13, 2006 3:04 pm

Q1 has been answered.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.

NBALA · Post by **NBALA** » Fri Sep 15, 2006 12:10 pm

[quote="kduke"]Q1 has been answered.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.[/quote]

for Q1: You can add one more column as "Extract_Finish_Date" with default NULL

and update this when each extract is completed. This can be used for Q4 as Kim mentioned.

-NB

vsi · Post by **vsi** » Tue Sep 19, 2006 12:21 pm

In Addition to fields in Audit table, I am getting lots of doubts regarding implementation in datastage. Pls bear with me and answer my questions.
Point a
This is how I plan to design

Code: Select all

Master_Seq_Stage_A -->
Master_Seq_Stage_B -->  Sequencer -> Master_Seq_Common -->  Seq_End
Master_Seq_Stage_C -->

Figure 1
Point b
In Sequencer stage the mode is selected as all

Point c
In all Master_Seq_Stage_A (B,C). All I do is extract, standardise, validate and load to staging tables i.e. flow as shown below

Code: Select all

Seq_Extract -> Seq_Transform -> Seq_Load

Point d
Now if I have the audit table as given below :

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	NULL		    |	NULL		|
|2			| B			|	1				|	NULL			|	NULL		|
|3			| C			|	1				|	NULL			|	NULL		|	
------------------------------------------------------------------------

Q1) Do I need to modify the above sequence (of extract,transform, load) as :

Code: Select all

Routine_Write_Extract_Date->Seq_Extract->Routine_Write_Extract_Finish_Date -> Seq_Transform->Seq_Load

or

Code: Select all

Routine_Write_Extract_Date->Seq_Extract-> Seq_Transform->Seq_Load ->Routine_Write_Extract_Finish_Date

The result after execution of any one of the source will be

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	01-01-2006		|	01-01-2006	|
|2			| B			|	1				|	NULL			|	NULL		|
|3			| C			|	1				|	NULL			|	NULL		|	
------------------------------------------------------------------------

Q2) How do I skip the extraction step (upon failures)??? Where can I insert that check in the Master_Seq_Stage_A (B,C)

Q3) In fig1, in Master_seq_Common. I need to populate the datawarehouse with extraction dates from each of the source.

shall I take it for granted that the data in audit table is up to date i.e. or do I need to make checks

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	01-01-2006		|	01-01-2006	|
|2			| B			|	1				|	01-05-2006		|	01-05-2006	|
|3			| C			|	1				|	01-18-2006		|	01-19-2006	|	
------------------------------------------------------------------------

Q4) In the last stand alone sequence (fig 1) i.e. Seq_Load ( I will write new entries into the audit table, ready for next load)

Code: Select all

-----------------------------------------------------------------------
|Audit_Key	| Source	| Batch_Nbr		| Batch_Ext_Dt	| Batch_Ext_Fin  |
|----------------------------------------------------------------------
|1			| A			|	1				|	01-01-2006		|	01-01-2006	|
|2			| B			|	1				|	01-05-2006		|	01-05-2006	|
|3			| C			|	1				|	01-18-2006		|	01-19-2006	|	
|4			| A			|	2				|	NULL			|	NULL		|
|5			| B			|	2				|	NULL			|	NULL		|
|6			| C			|	2				|	NULL			|	NULL
------------------------------------------------------------------------

Q5) Last question, The sources (A,B,C) all occur at different dates, so do who should be kicking off the extract jobs using dsjob???? the people who provide the source???

kduke · Post by **kduke** » Tue Sep 19, 2006 1:26 pm

Q1 First one
Q2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.

I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.

Q3 If this is a table then use it. Consider it to be accurate. If this is a hashed file then rebuild it every time.

Q4 Not a question

Q5 Personal preferrence. Whatever is easier. Sometimes other factors require that they notify operations then another step has to happen before the load can run. If you can do it clean then let them kick off a shell script to load it.

vsi · Post by **vsi** » Tue Sep 19, 2006 7:43 pm

thanks for your quick and clear response

Point (a)
kduke

Q2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.

I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.

If I abort the sequence then as I have Master_seq_Common following up the Sequencer (with mode selected as ALL), then wont the Master_Seq-Common NEVER get executed

Code: Select all

Master_Seq_Stage_A --> 
Master_Seq_Stage_B -->  Sequencer -> Master_Seq_Common -->  Seq_End 
Master_Seq_Stage_C -->

or even if I abort within Master_Seq_Stage_A (B,C)

Code: Select all

Seq_Extract -> Seq_Transform -> Seq_Load

How can I specify to jump to next Sequence, lets say to Seq_Transform above??? or to Sequencer in Main Sequence??

Point (b)

kduke

Q4 Not a question

-> Wanted to ask, whether the approach I took is right??, by having the last sequence to load additional entries in the audit table (with null values i.e. batch_nbr=2), for next run of sequence.

kduke · Post by **kduke** » Tue Sep 19, 2006 10:02 pm

Abort was a poor choice of words. It should skip to the end of that sequence. So each sequence should check the audit table to see if it has been run today. If it has then it exits to the last step. If it has not been run today then run all of the jobs normally. So do this check first in the sequence. Next have 2 branches. One to the last step in this sequence which will be a sequencer. The other to the first real job.

Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.

vsi · Post by **vsi** » Fri Sep 22, 2006 6:42 am

(1)

kduke

It should skip to the end of that sequence. So each sequence should check the audit table to see if it has been run today. If it has then it exits to the last step. If it has not been run today then run all of the jobs normally. So do this check first in the sequence. Next have 2 branches. One to the last step in this sequence which will be a sequencer. The other to the first real job.

a) So Shall I have like this within MasterSequence (given below)??
b) Shall I check the day, month, year of the audit table to that of system date?
b) How can we specify in Check_Audit_Table (a sequence) which path to take???

Code: Select all

Check_Audit_Table->Routine_Wr_Extract_Date->Seq_Extract->Routine_Wr_Extract_Finish_Date -> Seq_Transform->Seq_Load
     \                                                                                                          |
      ------------------------------------------------------------------------------------------------>Sequencer

2)

kduke

Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.

a) Master_SeqA, Master_SeqB, Master_SeqC are triggered at different points of times(which is not known). So shall I write into the table only when they have triggered ?
Let me say the order of trigger is B,A,C

So after B triggers, I will have

Code: Select all

----------------------------------------------------------------------- 
|Audit_Key   | Source   | Batch_Nbr      | Batch_Ext_Dt   | Batch_Ext_Fin  | 
|---------------------------------------------------------------------- 
|1         | B         |   1            |   01-01-2006          |   NULL      | 
|------------------------------------------------------------------------

when A gets triggered then I will have

Code: Select all

----------------------------------------------------------------------- 
|Audit_Key   | Source   | Batch_Nbr      | Batch_Ext_Dt   | Batch_Ext_Fin  | 
|---------------------------------------------------------------------- 
|1         | B         |   1            |   01-01-2006          |   NULL      | 
|2         | A         |   1            |   01-10-2006         |   NULL      | 
------------------------------------------------------------------------

and so on...
(b) The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first.

3)

Just now realisted that I will have, one more addition Master_Seq_Start,
i) the only requirement for it is -> to run BEFORE ANY Master_Seq_Stage_A (B,C) starts.
ii) It wont write anything to the audit table.
iii) But how should I do this??? How do I create the sequence order and how do i trigger them through unix.

Code: Select all

                     Master_Seq_Stage_A --> 
Master_Seq_Start --> Master_Seq_Stage_B -->  Sequencer -> Master_Seq_Common -->  Seq_End 
                    Master_Seq_Stage_C -->

NBALA · Post by **NBALA** » Fri Sep 22, 2006 7:45 am

Hi,

The sequence looks ok.
In check audit table sequence - first increase the batch number for the Src (A or B or C depends on the your source) insert new row with incremented batch number and system date as extract date.

Yes, you have to write into table when they triggered.

//// The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first. ////

You need to write a procedure or required an used defined SQL for this. Never bother about the trigger order of the source, all you required is each source needs to incremented to next count with the ext date.

Use the director scheduler for the Master you have or try with cron job.

-NB

kduke · Post by **kduke** » Fri Sep 22, 2006 7:55 am

I think you need to quit asking questions and try something. Let us know if you get it working. You can talk this to death.

vsi · Post by **vsi** » Fri Sep 22, 2006 8:02 am

Sorry for the trouble. I guess this would be last of my questions. Hope to get going after this. I was unable to figure out especially (Q3) and then (Q2). Because for Q2, I planned to write entries to the audit table at the end of all sequence with null, but proved wrong. So was confused how to approach.

But pls comment on (Q3), thats way out of my league.

DSXchange

Extraction Dates

Extraction Dates

Re: Extraction Dates