Best way to Read N number of tables and load into 1 table



SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney


Post by SURA »

Hi there

I am developing a generic DataStage job to load data from 100 tables into a single table.

I have a control table that lists the tables to be picked up for the day. In general I may get around 100 tables daily, with a total volume of about 2 million records.

I would like to know the best approach for this work.

My plan is a generic RCP (runtime column propagation) job; here is my design.

SQL Server --> TFM --> SEQ File stage

This will be a multiple-instance job; a loop in a Sequence job will trigger it on each iteration, with each instance loading into its own sequential file.

A final job will read all the sequential files using a file pattern and load them into the target table.

Is there another approach that would give better performance than this one?

Suggestions welcome.

DS User
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

First and obvious question - do all of these tables and the target table have the same record structure?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

All the source tables have the same record structure. In the target I am also adding metadata columns for internal tracking.

chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Since that's the case, you might want to consider an alternative to the looping construct - perhaps you can take your table list and construct a single source query where 'union all' is used. For example:

SELECT * FROM TABLE_A
UNION ALL
SELECT * FROM TABLE_B
UNION ALL
SELECT * FROM TABLE_C

Etc, etc. Write that out to a file and then have the job leverage the 'SQL File' option, get it all done in "one" select. Assuming they can all be sourced from the same credentials / connection, that is.
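[Editorial sketch, not part of Craig's post: the file of dynamic SQL he describes could be generated from the control-table list along these lines. The table names and output filename here are illustrative, not from the thread.]

```python
# Build a single UNION ALL query from a list of table names and
# write it to a file for the job's 'SQL File' option to pick up.
table_names = ["TABLE_A", "TABLE_B", "TABLE_C"]  # would come from the control table

query = "\nUNION ALL\n".join(f"SELECT * FROM {t}" for t in table_names)

with open("union_all.sql", "w") as f:
    f.write(query + "\n")
```

The same logic works whether the list holds 50 tables or 500, which is what makes the query "dynamic".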
-craig

"You can never have too many knives" -- Logan Nine Fingers
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Hi Craig

Thanks for your reply.

The table names change from day to day, and the number of tables varies: today I may pick 50, tomorrow 500, then 100.

How do I construct the SELECT for this scenario?

ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Edit the file of names to build SELECT statements around them. Or:

Execute Command activity:

    cat #filepath#

User Variables activity:

    Ereplace(Cats(Reuse("SELECT * FROM "), Command.$Output), @FM, " UNION ALL ", -1, 0)

That will get you close. You may have to trim the trailing @FM from Command.$Output before (or as part of) constructing the SQL.
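[Editorial aside: for readers unfamiliar with DataStage BASIC, the transformation Ray describes can be mimicked in Python. @FM is the field mark, CHAR(254), which separates the lines of the command output; the sample table names are illustrative.]

```python
FM = "\xfe"  # DataStage's @FM field mark is CHAR(254)

# Simulated Execute Command output: one table name per field,
# with a trailing field mark (as Ray notes, it must be trimmed).
command_output = "TABLE_A" + FM + "TABLE_B" + FM + "TABLE_C" + FM

# Cats(Reuse("SELECT * FROM "), ...) prefixes every field;
# Ereplace(..., @FM, " UNION ALL ", -1, 0) then swaps each
# remaining field mark for " UNION ALL ".
fields = command_output.rstrip(FM).split(FM)
sql = " UNION ALL ".join("SELECT * FROM " + t for t in fields)
```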
Last edited by ray.wurlod on Tue Nov 08, 2011 10:00 pm, edited 1 time in total.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Seems to me you would just select the list of table names and write them out to a flat file. Nothing in what I posted requires you to know anything about the table names or how many there will be on any given day, as you are (in essence) building dynamic SQL. Write out the first record with just the SELECT, then prepend the "UNION ALL" to all of the rest.

However, you may run into a problem with the size of the file if your number of tables can go that high. But then I wonder whether, in a Sequence job, the string which holds the "delimited list of things to loop through" can be that long. :?
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Ray

How do I execute this query?

The point of my question is: if the query is big, I can't pass it as a parameter.

SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

Sorry Craig,

I am not clear on your comments. I have a control table in which they give me the list of table names that need to be picked up for the day.

Aruna Gutti
Premium Member
Posts: 145
Joined: Fri Sep 21, 2007 9:35 am
Location: Boston

Post by Aruna Gutti »

How about passing the table name as a parameter within a sequence job?

1. Job 1 -- read the control table and create a delimited sequential file listing all the tables to be loaded.
2. Begin the loop.
3. Parse the file created in step 1 for the (next) source table name with a User Variables activity stage.
4. Using the source table name passed by the User Variables activity stage, load the target table.
5. End the loop.

Aruna.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

The point we've been trying to make is that a sequence job with a loop means that you process one table at a time. A UNION can (theoretically anyway) stream all rows from all tables at the same time.

You could construct the SQL statement within the sequence (Routine activity) rather than in an external command.
SURA
Premium Member
Posts: 1229
Joined: Sat Jul 14, 2007 5:16 am
Location: Sydney

Post by SURA »

I got it. Let me try it and get back to you.

Meanwhile, just for clarification:

If for some reason it has to run 800 tables, is that still not a problem with the UNION ALL approach?


ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Provided your database server can cope with an SQL statement that long, it should be OK.

You could always create a hybrid approach, such as a sequence with multiple loops each handling a subset of lines from the list of table names.
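[Editorial sketch, not part of Ray's post: the hybrid approach he suggests amounts to batching the table list and emitting one UNION ALL statement per batch. The batch size of 50 and the generated table names are arbitrary examples.]

```python
# Split a long table list into fixed-size batches, producing one
# UNION ALL statement per batch so no single statement gets too long.
def batched(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

tables = [f"T{n}" for n in range(1, 801)]  # e.g. the 800-table case

statements = [
    " UNION ALL ".join(f"SELECT * FROM {t}" for t in batch)
    for batch in batched(tables, 50)
]
```

Each element of `statements` could then drive one iteration of the sequence loop, trading one enormous statement for a handful of manageable ones.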
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Pretty simple to test your "800 table" question: just put the same table name in your control table 800 times.