Apologies in advance to those who don't like new threads for old problems.
First a disclaimer: Benchmarks are notoriously system dependent; what is true for me may not be true for you. I have tried to avoid this in my conclusions, but ultimately it is inescapable.
My environment:
Datastage Server: DS 7.5.1a on SuSE Linux Enterprise Server 9 (i586)
Database Server: Oracle Enterprise Edition Release 10.1.0.4.0 on SuSE SLES 8 (Linux).
With my release of DS, there are 3 Oracle stages available:
- Oracle OCI (I'll call it OCI from here in) - can select/insert/update/delete from Oracle tables
- Oracle OCI Load (I'll call it OCI Load from here in) - can only insert (i.e. load) into Oracle tables
- Oracle 7 Load (I'll call it OraBulk from here in) - can only insert (i.e. load) into Oracle tables.
[28Sep05] Pursuant to responses made by ian_bennett below, I have edited the paragraphs below replacing references to "array size" with "transaction size", as the original text was misleading. Edits in blue.
When writing back to Oracle from DataStage, there are five important concepts to understand: Transaction Size, Direct/Conventional Path, Update Action, Load Mode, Local/Remote sessions.
Transaction Size
Transaction Size refers to the number of rows you group together to post to the database in one hit. It relates primarily to INSERTs on the OCI stage. A transaction size of 1 means that you are sending one row at a time to the database; 250 means 250 rows at a time.
Clearly, a larger transaction size means fewer round trips and a faster job; you are really only restricted by memory. Performance benefits tend to tail off at higher numbers, and there are some tricky issues with values over 255 relating to the MAXTRANS setting on Oracle tables and indexes. I have found 255 to be a fairly efficient setting.
Direct/Conventional Path
Direct Path load was introduced in Oracle 7 as a faster way to perform bulk inserts. Rather than looking for free space in blocks that already contain rows (or once contained rows), Oracle just goes straight to the first never-used block and starts inserting. There are lots of restrictions and considerations though - it is very important to read the Oracle Utilities manual so that you understand the consequences of Direct Path loads before you start using them. Common consequences are tables being left in an unusable state, slower (yes! slower) load in some cases, and huge increases in disk usage if you don't know what you are doing.
The alternative to Direct Path is Conventional Path - it looks for free space and uses it. The OCI stage always uses Conventional Path. It is very safe compared to Direct Path.
There has been some discussion over whether OCI Load uses Direct or Conventional path. I am here to tell you that the OCI Load stage on DS 7.5 DOES use Direct path. I have monitored the database whilst OCI Load was running and confirmed it as true (which is what the manual says, so that's encouraging).
I am not using the OraBulk stage, but from what I gather you have to write your own after-stage-subroutine to kick off SQL*Loader, which gives you the power to select direct or conventional path as you see fit. I can't see how Conventional Path would be any better than an OCI stage, but you might want to defer your load to avoid Snapshot Too Old errors and the like (see Oracle Forums).
Update Action
Update Action is an OCI stage concept. The Update Action tells DS what statements to send to Oracle, e.g. Insert Rows Without Clearing sends only INSERT statements; Insert New Rows or Update Existing sends an INSERT and, if that fails, an UPDATE. You get the idea.
Now this is really important: if you set the Transaction Size to (say) 255 and the Update Action to Insert Rows Without Clearing, then DS sends only one INSERT statement to Oracle for every 255 rows. Efficient, huh? However, if you set the Update Action to Insert New Rows or Update Existing, then DS has to try every single row individually: send an INSERT, and if that fails because the row already exists, send an UPDATE. The result is that your Transaction Size setting is completely ignored! If 10% of incoming data already exists, then you will send 255+25=280 statements to the database compared to just one for Insert Rows Without Clearing.
As a rule, NEVER use "Insert New Rows or Update Existing" or "Update Existing Rows or Insert New" for data volumes exceeding a few hundred. Every one of my Jobs that uses an OCI stage to insert/update compares the incoming row to the existing rows on the database. New rows go to an Insert Rows Without Clearing OCI, updated rows go to an Update Existing Rows Only OCI, and unchanged data goes to a discard file (you could simply not output unchanged rows, but I like completeness).
The discard file is actually important. You could skip the comparison and just update every row anyway, right? I load a file that is a full dump of Product data every day - it has 50000 rows, of which only a handful are changes/inserts. Updating every row processes at up to 150 rows per second. Filtering out unchanged rows processes over 2000 rows per second.
Load Mode
Load Mode is a setting on the OCI Load stage. Set to Automatic, it uses Direct Path Load to load the data whilst it is being transformed. Set to Manual, it writes to a file and you have to load it yourself with SQL*Loader (or DataPump if you want to roll your own). Manual is the equivalent of the OraBulk stage without the convenience of an after-stage-subroutine - I don't know why you'd use it.
The Automatic setting is the big selling point for OCI Load. I mean, why wouldn't you want to load it asynchronously? Asynchronous loading has to be faster than synchronous, right?
Not necessarily. More below.
Local/Remote Sessions
When your database lives on your DataStage server, DataStage is able to connect Locally to the database. When Oracle is on another machine, DS must make a Remote connection across Oracle Net (formerly SQL*Net). This is most noticeable in bulk load. Local Direct Path loads are the absolute fastest way to get data into Oracle - no correspondence will be entered into. The performance difference between local and remote connections depends on many factors, but orders of magnitude are not uncommon. Inexplicably, the time difference tends to be greater than the time it takes to FTP the same data file between the machines.
Clearly, if you use a Load Mode of Automatic and your Oracle server is remote, then you will get a remote session. The good news is that if DS and Oracle are on the same box then Automatic will use a Local session to asynchronously load data as it is transformed. Chuck in some Inter-Process stages and you couldn't wipe the smile off my face with a shotgun.
Not all is lost, though, if you have a remote database. You can use Manual load (or OraBulk) to drop the output to a file, then a Routine to copy it over to the database server and issue a remote command to start the load with a Local connection. If you're really clever about it you could get this working asynchronously rather than waiting for the transformation to complete, but that would be much harder for the next guy (who's not quite so clever) to support.
OK on to the results.
I took a standard transformation from my system and adapted it to run in a variety of different ways. It is active from source to target, so there are no intermediate files to mess up the averages, and it is not so functionally complex that it is CPU bound (i.e. it keeps up with the reads and writes reasonably well).
I fed it with approximately 133000 rows, of which only a handful were UPDATEs; the rest were all new. Since I didn't want to use an Update Action of Insert New Rows or Update Existing, I transformed the data to an empty copy of the real target table, and then issued an Oracle MERGE /*+ APPEND */ statement, which inserts new rows Direct Path and updates existing rows Conventional Path in a single statement. Incidentally, this is the standard for high volume upserts in this warehouse.
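For those who haven't seen the pattern, the MERGE looks roughly like this - a sketch only, with hypothetical table and column names (STG_SALES standing in for the empty staging copy, FACT_SALES for the real target):

```sql
-- Hypothetical illustration: stg_sales is the empty staging copy that
-- DataStage loads, fact_sales is the real target. Column names are made up.
MERGE /*+ APPEND */ INTO fact_sales t
USING stg_sales s
ON (t.sale_id = s.sale_id)
WHEN MATCHED THEN
  UPDATE SET t.qty = s.qty, t.amount = s.amount
WHEN NOT MATCHED THEN
  INSERT (sale_id, qty, amount)
  VALUES (s.sale_id, s.qty, s.amount);

COMMIT;
```

The APPEND hint applies to the insert branch only - new rows go in Direct Path, while matched rows are updated Conventional Path, all in one statement.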
Going from worst to best:
- OCI stage with Update Action of Update Else Insert processed 14 rows per second. I was so disgusted I didn't let it finish. Clearly I wouldn't have needed the MERGE at the end with this method, but it never got far enough to matter.
- OCI stage with Update Action of Insert Else Update processed 38 rows per second. Faster because it does an INSERT first, and only does an UPDATE if the row exists. Most of my data was INSERTs.
- OCI stage with Insert New Rows Only and a Transaction Size of 255 processed 100 rows per second. The MERGE takes 433 seconds, giving an average speed of 75 rows per second.
- OCI Load with Automatic Load Mode and Max Record Number (equiv of Transaction Size for Loads) of 255 processed 127 rows/sec to a remote database. With the MERGE the average came down to 90 rows per second.
- OCI Load with Manual Load Mode processed 195 rows per second. The final file transferred to the database server in 8 seconds, loaded (Direct Path Local) in 83 seconds, and MERGEd in 433 seconds. Average: 110 rows per second.
My recommendations are as follows:
- 1. For dimension (reference) data, always join the source to the target, discard unchanged rows, and separate INSERTs and UPDATEs into two OCI stages. Direct Path load is unsuitable for dimension data, because the percentage of the table that changes is usually so small (see the Oracle manuals for more).
You can compare the source to target by dropping the target table down to a Hash table, or join it with SQL by using Oracle Externally Organised Tables to access your incoming file.
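If you take the SQL route, an external table makes the incoming file queryable so the source-to-target join is plain SQL. A sketch only - the directory path, file name and columns are all assumptions for illustration:

```sql
-- Hypothetical sketch: expose the incoming flat file as an external table.
-- Requires the CREATE ANY DIRECTORY privilege (or a DBA to do it for you).
CREATE DIRECTORY stage_dir AS '/data/stage';

CREATE TABLE ext_product (
  product_id   NUMBER,
  description  VARCHAR2(100)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY stage_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY ','
  )
  LOCATION ('product.dat')
);

-- New rows: in the file but not yet in the target table
SELECT e.product_id, e.description
FROM   ext_product e
WHERE  NOT EXISTS (SELECT 1 FROM product p
                   WHERE  p.product_id = e.product_id);
```

A matching query with EXISTS (and a comparison of the non-key columns) picks out the updates, and everything else is unchanged.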
Never use an OCI stage as a Lookup Input to a Transformer - it performs a round trip to the database for each input row. PX users may ignore this tip, as they have a special stage that caches the lookup.
- 2. For very low volume fact data (<100 rows), you can use an OCI stage with Insert Else Update. Direct Load is not appropriate for such low volumes, and it's just not worth the bother to do a MERGE (see above) or join source to target.
- 3. For medium volume fact data (100-10000 rows), use an Insert Only OCI stage to achieve Conventional Path load. OCI Load will use Direct Path, which is inappropriate when the incoming volume is such a small percentage of the existing data. INSERT into an empty table, and MERGE (without APPEND) into the real target table.
If you know your data to be INSERTs only, then ignore the intermediate table.
- 4. For large volume Fact data on a local database, use OCI Load in Automatic mode to load an empty table, and MERGE /*+APPEND*/ into the real target.
If you know your data to be INSERTs only, then ignore the intermediate table.
- 5. For large volume Fact data on a remote database, use OCI Load in Manual mode or OraBulk to drop the data to a file and generate a SQL*Loader control file; move the control file and data to the Oracle server, then Direct Path load the file to an empty table and MERGE /*+ APPEND */ into the real target. Again, if you know your data to be INSERTs only, then ignore the intermediate table.
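The control file for that local Direct Path load would look something like this - table name, file path and columns are placeholders, not anything my jobs actually generate:

```text
-- Hypothetical SQL*Loader control file for recommendation 5.
OPTIONS (DIRECT=TRUE)
LOAD DATA
INFILE '/data/in/sales.dat'
APPEND
INTO TABLE stg_sales
FIELDS TERMINATED BY ','
(sale_id, qty, amount)
```

You'd kick it off on the database server itself (e.g. via a remote shell) with something like sqlldr userid=etl/password control=sales.ctl, so the session is Local and Direct Path runs at full speed.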
Perhaps this sounds like too much effort? How concerned about performance are you?
Personally, I have implemented a compromise.
* I always follow #1, #2, and #4
* I use solution #4 in place of #3, because I am prepared for small Fact transformations (#3) to take a little longer with Direct Path, and I don't want to create too many options for programmers (the more options, the more likely the wrong choice).
* I use solution #4 in place of #5, because I am lazy and don't want to go to the extra effort of copying and remote shells - Automatic Load is just sooooo easy. Having said that, I don't have any data volumes that would really punish me for this decision. If I was working with 500,000+ row files then I'd get with the program.
A better compromise would be to put in the infrastructure to automate #5, and use it in place of #3 and #4. With some smarts, you could get it to use Conventional Path for low volumes and Direct Path for high volumes. Then you have one solution that fits all Fact data.
Happy tuning :D