LOOK UP PERFORMANCE TUNING WITH A BIG SQL SERVER TABLE

Post questions here relative to DataStage Server Edition for such areas as Server job design, DS Basic, Routines, Job Sequences, etc.

Moderators: chulett, rschirm, roy

sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

LOOK UP PERFORMANCE TUNING WITH A BIG SQL SERVER TABLE

Post by sql.alp »

Hi guys,

Thank you in advance for your responses!

I am fairly new to DataStage, and I am having an issue with a job that checks existing data through a three-table join (A, B, C) in SQL Server. Only one of the tables is large (A), and there is a very small table in Netezza (nz). I need to insert/update a field in A from nz: every value in nz has to be updated or inserted into A, and to do that I need to check whether each nz.date value already exists in A.date. I thought a lookup would be a good choice for that. I'm sure you all know what kind of misery I had matching date values between Netezza and SQL Server.

My problem is this: because of A's size, plus the joins, the date conversion and SQL Server performance :), the lookup query takes so long that SQL Server times out, even though nz is a very small table (at most 300 rows, usually fewer). In the end I get an error saying "socket unplugged to ODBC server".

So I need to find a way to make my lookup more efficient! I am open to any ideas, and thank you all!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

Are the joins supported by indexes? Is the lookup field indexed?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

Post by sql.alp »

Thanks for the quick response.

They all have primary keys. The reason for the three-way join is that nz has to reach A's primary key via C and then B. I know it is set up inefficiently, but I have no control over the A, B and C tables, which are in SQL Server. In the end I have to update A with all the values from nz. The lookup fills one field from nz and three fields from A, and I have to match on the date field that exists in both tables. If the nz.Date value is not in A.Date, then I have to insert nz.Date and the other field, along with the matching values from A (such as the week-of-the-year field on A).

Btw, I have tried a flat file destination instead of the ODBC destination and the job finishes fine, though not quickly (about 5 minutes to update/insert the 324 rows from nz).

My best bets would be:
1) use another method, such as a join or two lookups instead of one; or
2) (a little more detailed work) capture the minimum value of nz.date, put it into a variable, and use that variable in the WHERE clause of the lookup SQL query, roughly as sketched below. That would cut a huge chunk of unnecessary data out of A; after all, nz is basically an update table for A, and its dates are usually very close to the current date.
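For illustration only (the table and column names are placeholders, and MIN_NZ_DATE is a hypothetical job parameter that would be set to the smallest nz.date before this job runs), the parameterised reference SQL might look roughly like this:

Code: Select all

SELECT a.date_col,
       a.week_of_year
FROM   A a                             -- plus the same B and C joins as in the current lookup
WHERE  a.date_col >= '#MIN_NZ_DATE#'   -- job parameter trims the old rows out of A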

Can I do the second option? Or is there a better approach than a lookup that matches the data faster?
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

:idea:

Why not try it?
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Have you considered using a hashed file to cache the lookup data rather than reading directly from the database? They are the heart and soul of Server jobs, after all. The 'trick' is to load into them only what you need for each job run, which typically involves joining the incoming keys to that data set.
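For example, as a rough sketch only with made-up names (and assuming the incoming nz dates had first been landed in a small work table, here called NZ_DATES_STG, on the SQL Server side), the hashed file build query could be joined to just those keys:

Code: Select all

SELECT a.date_col,
       a.week_of_year,
       a.other_col
FROM   A a
       JOIN NZ_DATES_STG s
         ON s.date_col = a.date_col   -- keep only the dates this run can actually match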

Now, this may not be worth it if your runs are as small as you noted. Will they be larger at some point?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

Post by sql.alp »

My ETL background is SSIS and SQL Server based (believe me, I went through quite a transition to become an IBM developer :) ), and I used to stage data to a file rather than looking directly into the DB. Besides, the direct lookup goes back and forth to the database for every single row.
As I mentioned earlier I'm new to DS and hadn't heard anything about hashed files; I guess my colleague does not use them much, so he just showed me the lookup the way he uses it.

I believe what you are saying is the missing link!

Correct me if I'm mistaken:
1. I run the join query first and load the result into the hashed file (I'll call it H1).
2. H1 then becomes the reference link on the Transformer stage lookup (since it is the bigger side), and nz is the stream link. A sketch of the kind of build query I have in mind is below.
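Something like this is what I mean for the H1 build SQL (the table and column names here are only placeholders for the real ones):

Code: Select all

SELECT a.date_col,          -- the key the lookup will match on
       a.week_of_year,
       a.other_col_1,
       a.other_col_2
FROM   A a
       JOIN B b ON b.a_key = a.a_key
       JOIN C c ON c.b_key = b.b_key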

Q1: Any corrections or other suggestions on this strategy?

Q2: Is it also better to use another hashed file (H2) just before the load into A?

Q3: I have to take into account that A is a table of a few hundred thousand rows. Will a hashed file be able to handle all that data, or should I still eliminate some rows the way I mentioned in my earlier post?

thank you all!
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

You are NOT mistaken. :wink:

Q2: Not quite sure... what would you use another hashed file for?

Q3: A hashed file will easily handle that load. Best to set the initial size (Minimum Modulus) to something large so it doesn't immediately start to extend and fragment. If it takes 'too long' to build or becomes 'too large' (2.2GB max) then look into constraining the source SQL for the hashed build.
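As a purely illustrative sizing exercise (the figures are guesses, not from this job): a few hundred thousand rows, say 500,000 at roughly 100 bytes each, is about 50 MB of data, which with the default 2 KB group size works out to a Minimum Modulus in the region of 25,000, still nowhere near the 2.2 GB limit.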
-craig

"You can never have too many knives" -- Logan Nine Fingers
sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

Post by sql.alp »

Q2: Is it more efficient to use another hashed file that captures the output of the lookup and then feeds the final ODBC insert/update stage?

Q4: What would be the best Parameter array size and Rows per transaction values on the upsert stage for SQL Server?
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

A2: No.

A4: Too many variables for a generic answer, and most of them depend on your data and hardware. You might want to discuss that with your SQL Server DBA and see if they can help. Do you understand what the two settings control?
-craig

"You can never have too many knives" -- Logan Nine Fingers
sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

Post by sql.alp »

Q4: Well, it was set to 0 by default, and the 324-row upsert to SQL Server was getting the timeout error. Then I set array size = 1 and rows per transaction = 1, and it executed, but it still took about 5 minutes!

Data type = smallint
Total rows = 300-500 max (usually fewer)
Table to upsert = only a PK index, no triggers!
Source table = Netezza
Destination table = SQL Server 2012
ETL = DataStage 9.1, job type = server job
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

Array size = how many records are sent over the network to the database at a time.
Transaction size = controls commit in the target database based on row count.

Setting both to 1 would guarantee the slowest load possible. :wink:

When you pick new numbers, make sure the transaction size is an even multiple of the array size. I'm assuming you meant the transaction size was zero; that tells it to do only a single commit at 'end of data'. Zero is an invalid array size.
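For example (illustrative numbers only, not a recommendation): with an array size of 50 and a transaction size of 0, your 324 rows would cross the network in about seven batches and be committed once at end of data, instead of 324 individual round trips, each followed by its own commit.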
-craig

"You can never have too many knives" -- Logan Nine Fingers
sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

Post by sql.alp »

Wow, what a difference! It executes much faster now and I don't even have to add a WHERE clause to eliminate the unnecessary rows. Even selecting all the columns, it is five times faster! Unbelievable :D

Thank you all, guys! Your expertise on DS solved another issue! Without you, I guess IBM's limited online resources would leave developers devastated!
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

I'll have to disagree with your comment about IBM support - in my experience they are a highly skilled and highly motivated group, whose unenviable task it is to solve even those problems that we are unable to.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
sql.alp
Premium Member
Posts: 7
Joined: Tue Jun 10, 2014 5:20 pm

Post by sql.alp »

I'm sure they are expert and experienced, and I gather things are better within the IBM developer community itself. My comment was based on a comparison with Microsoft's online resources. Everyone says, and I have personally experienced, that it is very easy to find an answer to any question you have about Microsoft products. In a way I assume that makes it better for IBM developers, because they will have (or at least I hopefully will soon :D ) exceptional skills that are not as easy to pick up as they are for a Microsoft developer!
chulett
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I'll say one thing, we can certainly be... quicker... at times. Been there, done that, got the t-shirt kind of stuff. :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers