Extraction Dates
Moderators: chulett, rschirm, roy
Extraction Dates
I posted this question early on, but to prevent exhaustive reading will continue from where I left last time.
Have patience for my lengthy question/discussion
SCENARIO :-
Lets say I have got three sources SorA, SorB, SorC.
SorA -> has a field in its schema, to indicate that the data is valid prior to the date specified in this field (which means we can pull data only prior to specified date).
SorB -> The availability of data is known after a manual process of verification.
SorC-> Extracts should be on a particular date of Every Month.
I need to have extract dates for each source to be incorporated in target DW.
As far as i know till now is to have a table called "Audit Table" to store the extract dates (later on these values are populated to DW).
Question 1:
What are the fields that I can/should have for Audit Table?
My Comments
-> is it Sur_KEY, SorA_ExtDate, SorB_ExtDate, SorC_ExtDate,ProcessDate (anything else??)
Question 2 :
How/When should they be populated?
My Comments
-> Should I write the system date to audit table when ever the respective sequences are triggered, like eg. if SorA was triggered on 1st of Sept 2006, shall i take the date generated by datastage or the field which was specified in the field (as mentioned above). My guess from the source
-> In other two cases, my guess System Date
Question 3:
When it comes to next month, how will the extraction start
My Comments
-> should i first take extract date from audit table, add +1 to it and read data from there on???? to the current extract date (which is from Source for SorA and system date for SorB and SorC).
-> How can we be sure that the latest record ? we took from Audit Table is the last extract date????
Question 4 :
-> To trigger these three different sources at three different times, three Unix Scripts are to be writtien seperately, isnt it??
Have patience for my lengthy question/discussion
SCENARIO :-
Lets say I have got three sources SorA, SorB, SorC.
SorA -> has a field in its schema, to indicate that the data is valid prior to the date specified in this field (which means we can pull data only prior to specified date).
SorB -> The availability of data is known after a manual process of verification.
SorC-> Extracts should be on a particular date of Every Month.
I need to have extract dates for each source to be incorporated in target DW.
As far as i know till now is to have a table called "Audit Table" to store the extract dates (later on these values are populated to DW).
Question 1:
What are the fields that I can/should have for Audit Table?
My Comments
-> is it Sur_KEY, SorA_ExtDate, SorB_ExtDate, SorC_ExtDate,ProcessDate (anything else??)
Question 2 :
How/When should they be populated?
My Comments
-> Should I write the system date to audit table when ever the respective sequences are triggered, like eg. if SorA was triggered on 1st of Sept 2006, shall i take the date generated by datastage or the field which was specified in the field (as mentioned above). My guess from the source
-> In other two cases, my guess System Date
Question 3:
When it comes to next month, how will the extraction start
My Comments
-> should i first take extract date from audit table, add +1 to it and read data from there on???? to the current extract date (which is from Source for SorA and system date for SorB and SorC).
-> How can we be sure that the latest record ? we took from Audit Table is the last extract date????
Question 4 :
-> To trigger these three different sources at three different times, three Unix Scripts are to be writtien seperately, isnt it??
Re: Extraction Dates
Hi,
Let me try to answer from my view point to your question.
[b]Q1:[/b]
you can have the audit table as below
Audit_key,
Source_name,
Process_Batch_number,
Batch_Extract_Date,
Processed_Row_Qty
Initialize the table with Source Names like below
1, SorA, 1, NULL, 0
2, SorB, 1, NULL, 0
3, SorC, 1, NULL, 0
[b]Q2:[/b]
Then for the first load update the table with extract date (system date) and row qty , from then for each load increment one row/source/load using a procedure (SQL).
[b]Q3:[/b]
Edited with following.
-----------------------
Each time you load increment the process_batch_number not the date.The date should be date of extraction for each source(if you use data stage to load take the @DATE for batch_extract_date.
I Hope this helps ! (IHTH ?)
-NB
Let me try to answer from my view point to your question.
[b]Q1:[/b]
you can have the audit table as below
Audit_key,
Source_name,
Process_Batch_number,
Batch_Extract_Date,
Processed_Row_Qty
Initialize the table with Source Names like below
1, SorA, 1, NULL, 0
2, SorB, 1, NULL, 0
3, SorC, 1, NULL, 0
[b]Q2:[/b]
Then for the first load update the table with extract date (system date) and row qty , from then for each load increment one row/source/load using a procedure (SQL).
[b]Q3:[/b]
Edited with following.
-----------------------
Each time you load increment the process_batch_number not the date.The date should be date of extraction for each source(if you use data stage to load take the @DATE for batch_extract_date.
I Hope this helps ! (IHTH ?)
-NB
Last edited by NBALA on Thu Sep 07, 2006 11:51 am, edited 2 times in total.
Q1 has been answered.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.
Mamu Kim
[quote="kduke"]Q1 has been answered.
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.[/quote]
for Q1: You can add one more column as "Extract_Finish_Date" with default NULL
and update this when each extract is completed. This can be used for Q4 as Kim mentioned.
-NB
Q2 use both todays date and the max date in your audit table.
Q3 use max date + 1 till today -1 because today is probably not complete.
Q4 You need a record in your audit table for each source. Insert this record only after today is done. A lot of people have a record like this for each phase of the ETL so the ETL is restartable. If all your extracts completed successfully then why extract again if you have to restart your ETL. You can have a STARTED record in the ETL_AUDIT table and a FINISHED record when it is done. At the first of a sequence check to see if there is a FINISHED record for today. If there is then exit this sequence gracefully.[/quote]
for Q1: You can add one more column as "Extract_Finish_Date" with default NULL
and update this when each extract is completed. This can be used for Q4 as Kim mentioned.
-NB
In Addition to fields in Audit table, I am getting lots of doubts regarding implementation in datastage. Pls bear with me and answer my questions.
Point a
This is how I plan to design
Figure 1
Point b
In Sequencer stage the mode is selected as all
Point c
In all Master_Seq_Stage_A (B,C). All I do is extract, standardise, validate and load to staging tables i.e. flow as shown below
Point d
Now if I have the audit table as given below :
Q1) Do I need to modify the above sequence (of extract,transform, load) as :
or
The result after execution of any one of the source will be
Q2) How do I skip the extraction step (upon failures)??? Where can I insert that check in the Master_Seq_Stage_A (B,C)
Q3) In fig1, in Master_seq_Common. I need to populate the datawarehouse with extraction dates from each of the source.
shall I take it for granted that the data in audit table is up to date i.e. or do I need to make checks
Q4) In the last stand alone sequence (fig 1) i.e. Seq_Load ( I will write new entries into the audit table, ready for next load)
Q5) Last question, The sources (A,B,C) all occur at different dates, so do who should be kicking off the extract jobs using dsjob???? the people who provide the source???
Point a
This is how I plan to design
Code: Select all
Master_Seq_Stage_A -->
Master_Seq_Stage_B --> Sequencer -> Master_Seq_Common --> Seq_End
Master_Seq_Stage_C -->
Figure 1
Point b
In Sequencer stage the mode is selected as all
Point c
In all Master_Seq_Stage_A (B,C). All I do is extract, standardise, validate and load to staging tables i.e. flow as shown below
Code: Select all
Seq_Extract -> Seq_Transform -> Seq_Load
Now if I have the audit table as given below :
Code: Select all
-----------------------------------------------------------------------
|Audit_Key | Source | Batch_Nbr | Batch_Ext_Dt | Batch_Ext_Fin |
|----------------------------------------------------------------------
|1 | A | 1 | NULL | NULL |
|2 | B | 1 | NULL | NULL |
|3 | C | 1 | NULL | NULL |
------------------------------------------------------------------------
Code: Select all
Routine_Write_Extract_Date->Seq_Extract->Routine_Write_Extract_Finish_Date -> Seq_Transform->Seq_Load
Code: Select all
Routine_Write_Extract_Date->Seq_Extract-> Seq_Transform->Seq_Load ->Routine_Write_Extract_Finish_Date
Code: Select all
-----------------------------------------------------------------------
|Audit_Key | Source | Batch_Nbr | Batch_Ext_Dt | Batch_Ext_Fin |
|----------------------------------------------------------------------
|1 | A | 1 | 01-01-2006 | 01-01-2006 |
|2 | B | 1 | NULL | NULL |
|3 | C | 1 | NULL | NULL |
------------------------------------------------------------------------
Q2) How do I skip the extraction step (upon failures)??? Where can I insert that check in the Master_Seq_Stage_A (B,C)
Q3) In fig1, in Master_seq_Common. I need to populate the datawarehouse with extraction dates from each of the source.
shall I take it for granted that the data in audit table is up to date i.e. or do I need to make checks
Code: Select all
-----------------------------------------------------------------------
|Audit_Key | Source | Batch_Nbr | Batch_Ext_Dt | Batch_Ext_Fin |
|----------------------------------------------------------------------
|1 | A | 1 | 01-01-2006 | 01-01-2006 |
|2 | B | 1 | 01-05-2006 | 01-05-2006 |
|3 | C | 1 | 01-18-2006 | 01-19-2006 |
------------------------------------------------------------------------
Q4) In the last stand alone sequence (fig 1) i.e. Seq_Load ( I will write new entries into the audit table, ready for next load)
Code: Select all
-----------------------------------------------------------------------
|Audit_Key | Source | Batch_Nbr | Batch_Ext_Dt | Batch_Ext_Fin |
|----------------------------------------------------------------------
|1 | A | 1 | 01-01-2006 | 01-01-2006 |
|2 | B | 1 | 01-05-2006 | 01-05-2006 |
|3 | C | 1 | 01-18-2006 | 01-19-2006 |
|4 | A | 2 | NULL | NULL |
|5 | B | 2 | NULL | NULL |
|6 | C | 2 | NULL | NULL
------------------------------------------------------------------------
Q1 First one
Q2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.
I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.
Q3 If this is a table then use it. Consider it to be accurate. If this is a hashed file then rebuild it every time.
Q4 Not a question
Q5 Personal preferrence. Whatever is easier. Sometimes other factors require that they notify operations then another step has to happen before the load can run. If you can do it clean then let them kick off a shell script to load it.
Q2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.
I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.
Q3 If this is a table then use it. Consider it to be accurate. If this is a hashed file then rebuild it every time.
Q4 Not a question
Q5 Personal preferrence. Whatever is easier. Sometimes other factors require that they notify operations then another step has to happen before the load can run. If you can do it clean then let them kick off a shell script to load it.
Mamu Kim
thanks for your quick and clear response
Point (a)
kduke
or even if I abort within Master_Seq_Stage_A (B,C)
How can I specify to jump to next Sequence, lets say to Seq_Transform above??? or to Sequencer in Main Sequence??
Point (b)
kduke
Point (a)
kduke
If I abort the sequence then as I have Master_seq_Common following up the Sequencer (with mode selected as ALL), then wont the Master_Seq-Common NEVER get executedQ2 You need a job or a routine to check for a finished date. If there is a finished date then the job or routine needs some way to signal the sequence that this step is finished.
I would put this in as the first step of the extract sequence. I would either abort the job or return a @FALSE. I would then go to the end of the sequence gracefully.
Code: Select all
Master_Seq_Stage_A -->
Master_Seq_Stage_B --> Sequencer -> Master_Seq_Common --> Seq_End
Master_Seq_Stage_C -->
Code: Select all
Seq_Extract -> Seq_Transform -> Seq_Load
Point (b)
kduke
-> Wanted to ask, whether the approach I took is right??, by having the last sequence to load additional entries in the audit table (with null values i.e. batch_nbr=2), for next run of sequence.Q4 Not a question
Abort was a poor choice of words. It should skip to the end of that sequence. So each sequence should check the audit table to see if it has been run today. If it has then it exits to the last step. If it has not been run today then run all of the jobs normally. So do this check first in the sequence. Next have 2 branches. One to the last step in this sequence which will be a sequencer. The other to the first real job.
Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.
Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.
Mamu Kim
(1)
kduke
b) Shall I check the day, month, year of the audit table to that of system date?
b) How can we specify in Check_Audit_Table (a sequence) which path to take???
2)
kduke
Let me say the order of trigger is B,A,C
So after B triggers, I will have
when A gets triggered then I will have
and so on...
(b) The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first.
3)
Just now realisted that I will have, one more addition Master_Seq_Start,
i) the only requirement for it is -> to run BEFORE ANY Master_Seq_Stage_A (B,C) starts.
ii) It wont write anything to the audit table.
iii) But how should I do this??? How do I create the sequence order and how do i trigger them through unix.
kduke
a) So Shall I have like this within MasterSequence (given below)??It should skip to the end of that sequence. So each sequence should check the audit table to see if it has been run today. If it has then it exits to the last step. If it has not been run today then run all of the jobs normally. So do this check first in the sequence. Next have 2 branches. One to the last step in this sequence which will be a sequencer. The other to the first real job.
b) Shall I check the day, month, year of the audit table to that of system date?
b) How can we specify in Check_Audit_Table (a sequence) which path to take???
Code: Select all
Check_Audit_Table->Routine_Wr_Extract_Date->Seq_Extract->Routine_Wr_Extract_Finish_Date -> Seq_Transform->Seq_Load
\ |
------------------------------------------------------------------------------------------------>Sequencer
kduke
a) Master_SeqA, Master_SeqB, Master_SeqC are triggered at different points of times(which is not known). So shall I write into the table only when they have triggered ?Point B. Never create the next step ahead of time. When you start today's load then put a start time in only. Never 2 nulls. Null only the end time. Do not try to anticipate your next step. Let your next step take care of all it needs to function.
Let me say the order of trigger is B,A,C
So after B triggers, I will have
Code: Select all
-----------------------------------------------------------------------
|Audit_Key | Source | Batch_Nbr | Batch_Ext_Dt | Batch_Ext_Fin |
|----------------------------------------------------------------------
|1 | B | 1 | 01-01-2006 | NULL |
|------------------------------------------------------------------------
Code: Select all
-----------------------------------------------------------------------
|Audit_Key | Source | Batch_Nbr | Batch_Ext_Dt | Batch_Ext_Fin |
|----------------------------------------------------------------------
|1 | B | 1 | 01-01-2006 | NULL |
|2 | A | 1 | 01-10-2006 | NULL |
------------------------------------------------------------------------
(b) The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first.
3)
Just now realisted that I will have, one more addition Master_Seq_Start,
i) the only requirement for it is -> to run BEFORE ANY Master_Seq_Stage_A (B,C) starts.
ii) It wont write anything to the audit table.
iii) But how should I do this??? How do I create the sequence order and how do i trigger them through unix.
Code: Select all
Master_Seq_Stage_A -->
Master_Seq_Start --> Master_Seq_Stage_B --> Sequencer -> Master_Seq_Common --> Seq_End
Master_Seq_Stage_C -->
Hi,
The sequence looks ok.
In check audit table sequence - first increase the batch number for the Src (A or B or C depends on the your source) insert new row with incremented batch number and system date as extract date.
Yes, you have to write into table when they triggered.
//// The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first. ////
You need to write a procedure or required an used defined SQL for this. Never bother about the trigger order of the source, all you required is each source needs to incremented to next count with the ext date.
Use the director scheduler for the Master you have or try with cron job.
-NB
The sequence looks ok.
In check audit table sequence - first increase the batch number for the Src (A or B or C depends on the your source) insert new row with incremented batch number and system date as extract date.
Yes, you have to write into table when they triggered.
//// The doubt here I have is, how will I know the Batch_Nbr of A will be 1 as well??? (i cant PL/SQL) could u tell me datastage approach. Mostly because I dont know which source gets triggered first. ////
You need to write a procedure or required an used defined SQL for this. Never bother about the trigger order of the source, all you required is each source needs to incremented to next count with the ext date.
Use the director scheduler for the Master you have or try with cron job.
-NB
Sorry for the trouble. I guess this would be last of my questions. Hope to get going after this. I was unable to figure out especially (Q3) and then (Q2). Because for Q2, I planned to write entries to the audit table at the end of all sequence with null, but proved wrong. So was confused how to approach.
But pls comment on (Q3), thats way out of my league.
But pls comment on (Q3), thats way out of my league.