No OCI stage in v8 Parallel job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

No OCI stage in v8 Parallel job

Post by rleishman »

I'm a former 7.5.1A Server developer. We've just installed an evaluation version of DSv8 for a new project and I'm sussing out the differences.

We can only see the Oracle OCI stage in the palette for Server jobs, and the Oracle Enterprise stage for Parallel jobs.

The doco doesn't say that OCI is server-only, and some of the threads on DSX seem to suggest Parallel users have OCI; but maybe it's different in v7 parallel jobs!!!??? Is it true you cannot use the OCI stage in v8 Parallel jobs? Or have I botched the install?
Ross Leishman
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Palette is fully customizable; anything you find in the Stage Types branch of the Repository can be emplaced in the Palette. I realize that the Repository category structure is not quite so straightforward, but you can find things in the "everything" pane of the Palette customization too. This should include the Oracle API (OCI) stage type.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

In my parallel stage types (under Stage Types, not under shortcuts) I can see Oracle Enterprise and Oracle OCI Load. I can access native Oracle connectivity through the Dynamic RDBMS stage. For DB2 I see three stages: Enterprise, API, and Load.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

DS simply won't let you drop the Oracle OCI stage onto a parallel job, in either V7 or V8.
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

Ray, we found OCI in the repository tree - under SERVER stages. We cannot drag it into a Parallel Job, though. Good tip to know, thanks.

Vincent, I found the OCI Load stage under Parallel Stages (same as yours), but I doubt it will give us anything the Oracle Enterprise stage cannot - probably why it was excluded from the palette.

@lstsaur - that is exactly my experience. Nice to know it also happened on V7.

Vincent, I found the Dynamic RDBMS Stage and am now curled up in my warm blanket of familiarity. It's (almost) exactly the same as the server OCI stage! Now I just need to work out whether to use it instead of the Enterprise stage. Time for some benchmarking I think.
Ross Leishman
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

IBM benchmarked the DB2 Enterprise stage at 250% faster than the OCI stage for extracts, see DataStage Tip: Extracting database data 250% faster. Would love to see how your benchmark goes with Oracle.
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

It's all starting to come together! Enterprise stage requires SELECT privs on a bunch of data dictionary views, including DBA_EXTENTS. My bet is that - when you use the "Table" SQL Type - it looks up the starting and ending block address of each extent in DBA_EXTENTS and farms out ROWID RANGE queries to each parallel thread.
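
For reference, if you need to ask a DBA for those privileges, something like the following should cover the dictionary views involved (ds_user is a made-up account name - check the install docs for the definitive list):

  Code: Select all

  -- Hypothetical grants only; ds_user is a placeholder account name.
  GRANT SELECT ON sys.dba_extents        TO ds_user;
  GRANT SELECT ON sys.dba_objects        TO ds_user;
  GRANT SELECT ON sys.dba_tab_partitions TO ds_user;
  -- or, more bluntly:
  GRANT SELECT_CATALOG_ROLE TO ds_user;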

I'll try to verify this with a SQL Trace. Stay tuned.
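
For anyone who wants to watch the same thing, this is roughly how I'd switch tracing on for the running DS session from another SQL*Plus window (the sid and serial# values below are placeholders from v$session):

  Code: Select all

  -- Sketch only: enable SQL trace for another session (Oracle 10g+).
  -- 123 / 456 are placeholder sid and serial# values.
  EXEC dbms_monitor.session_trace_enable(session_id => 123, serial_num => 456, waits => TRUE, binds => FALSE);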
Ross Leishman
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

OCI Load vs. Dynamic RDBMS vs. Oracle Enterprise

Post by rleishman »

Here's an interim update on Oracle Enterprise stage. I haven't done a heap of benchmarking, but I have some interesting points:

General
  • As Vincent pointed out, Dynamic RDBMS gives you all of the Oracle OCI stuff you know and love. I've run some tests and it appears to behave exactly the same as well.
  • Dynamic RDBMS and OCI Load are implemented behind the scenes using OCI. I suspect Enterprise is the same for upsert, update, and user-defined SQL, but for Load it uses SQL*Loader (a minimal control-file sketch follows this list). Oracle puts a bit more effort into SQL*Loader than into OCI, which makes Enterprise a bit more functional than OCI Load.
  • Oracle Enterprise does not have Before SQL and After SQL. It has equivalent Open and Close commands, but these are only available with some retrieval and load methods. If you experiment (OK, fiddle) with the settings you can LOSE what you have entered there.
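Just to make the SQL*Loader point concrete, here is a hand-rolled minimal control file of the kind sqlldr consumes (table and columns are invented - the file Enterprise generates will obviously differ):

  Code: Select all

  -- Minimal hand-written control-file sketch; names are invented.
  -- Note the explicit datatypes: without them SQL*Loader defaults
  -- everything to CHAR, which is relevant to the DIRECT=FALSE result
  -- further down.
  LOAD DATA
  INFILE 'expns.dat'
  APPEND
  INTO TABLE stage_expns
  FIELDS TERMINATED BY ','
  ( expns_id   INTEGER EXTERNAL,
    expns_dt   DATE "YYYY-MM-DD",
    expns_amt  DECIMAL EXTERNAL )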
Parallel
  • Dynamic RDBMS does NOT support parallel. Parallel inputs are collected and run in a single session. Oracle Enterprise DOES support parallel.
  • OCI Load (bulk load) is also available in DS Parallel, but it too loads serially. Oracle Enterprise performs parallel bulk loads.
  • Oracle Enterprise has some funky-looking options for sourcing data from Oracle. When you specify a method of Table or Auto-Generated for a partitioned table, you can tell DS to parallelise across the partitions of the table. If you use user-defined SQL or don't select this option, it runs serially (there might be a run-on-every-node option for MPP systems - I haven't looked).
Upserts
  • Something I wasn't aware of: in OCI and Dynamic RDBMS, Insert-Else-Update and Update-Else-Insert COMPLETELY IGNORE your Array Size setting. They run with an array size of 1 no matter what. Array Size is even ignored for Update Only - I had assumed that updates were array-applied, but I was wrong! The only time the array size is used is with Insert-Only (a rough per-row sketch follows this list).
  • Oracle Enterprise does not have an Insert-Only option. The only equivalent is Load with the DIRECT=FALSE option, and that uses SQL*Loader. If you know every row will insert correctly, you can use an Upsert method with the Insert command run first - it's the same, but kind of scary too.
  • Now this is REALLY interesting. When you run an Upsert with Enterprise, inserts are array-inserted and updates are applied one at a time. Wha-huh? How does that work? Say you have an array size of 500 and it contains three rows, scattered through the array, that already exist. Enterprise tries to insert all 500 at once and gets a failure - no rows are inserted. But it cleverly knows WHICH rows failed. It splits the array, array-inserts the 497 new rows, and updates the other 3 row by row. This is MUCH better than OCI, which insert/updates all 500 row by row. I'd still recommend splitting your Inserts and Updates though.
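For anyone curious, here is roughly what that array-size-1 insert-else-update amounts to per row on the Oracle side (the table and columns are made up for illustration):

  Code: Select all

  -- Sketch of the per-row insert-else-update logic (hypothetical table).
  -- An array-insert-first approach like Enterprise's avoids paying this
  -- round trip for every single row.
  BEGIN
    INSERT INTO tgt (pk_col, val_col) VALUES (:pk, :val);
  EXCEPTION
    WHEN DUP_VAL_ON_INDEX THEN
      UPDATE tgt SET val_col = :val WHERE pk_col = :pk;
  END;
  /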
Benchmarking
  • My results are not conclusive. My client has a freeze on new hardware at the moment, so we are prototyping the project with a single-CPU Celeron Windows PC with 1GB RAM as our database server, and a dual-core Linux PC with 3GB RAM as our DS server. Because the database server is the bottleneck, parallel runs are only marginally faster than serial.
  • Even so, Oracle Enterprise was Upserting (Inserts only) at 3500/sec with 4 threads, whereas Dynamic RDBMS was running insert-only on the same data into the same table serially at 2400/sec.
  • This was unexpected! Oracle Enterprise in DIRECT=FALSE PARALLEL=TRUE mode was SLOWER than Enterprise Upserts! I took a look at the SQL*Loader control file generated by Enterprise, and it did not datatype the columns. Upsert, on the other hand, binds already-typed C structures. I suspect this is the difference.
  • I haven't mucked about with Enterprise loading in DIRECT=TRUE PARALLEL=TRUE. In this mode, all indexes must be disabled and FULLY REBUILT afterwards (see the rebuild sketch after this list) - inappropriate for our project because we are loading a LIVE database.
  • DIRECT=TRUE PARALLEL=FALSE is similarly out for us because - although it can defer maintenance of the indexes rather than requiring a full rebuild - it makes the table unusable during the load.
  • I haven't benchmarked any of the parallel data retrieval features of Oracle Enterprise outlined above, as we are sourcing from files.
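For completeness, this is the sort of clean-up a DIRECT=TRUE parallel load leaves you with - find whatever was marked UNUSABLE and rebuild it. Index and partition names below are placeholders:

  Code: Select all

  -- Find index partitions left UNUSABLE by a direct parallel load...
  SELECT index_name, partition_name
  FROM   dba_ind_partitions
  WHERE  status = 'UNUSABLE';

  -- ...then rebuild each one (placeholder names).
  ALTER INDEX expns_local_idx REBUILD PARTITION p_200701;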
Conclusion
  • DIRECT=TRUE PARALLEL=TRUE loading is a pipe-dream for delta loads. The cost of index rebuilds just gets too high unless you have a VERY aggressive archiving strategy. It's great for full rebuilds though.
  • If DS is serious about supporting Oracle, it needs to invest some serious money in developing partition-wise parallel loads and partition-exchange loading (see the exchange sketch after this list).
  • OCI Load: just don't.
  • Oracle Enterprise: use it! DS really needs to add before/after SQL for every load/select option, and it also needs an insert-only option. Use Upsert freely if updates are very rare; otherwise stick with splitting updates into a separate link.
  • Dynamic RDBMS: I wouldn't bother unless you REALLY need to store your SQL in a file (not supported by Enterprise) or you need a before/after SQL where one is not offered. Never use Upsert.
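For anyone who hasn't met partition-exchange loading, the idea (in plain SQL, outside DS) is to load and index a standalone staging table and then swap it into the target partition as a near-instant dictionary operation. Names below are invented:

  Code: Select all

  -- Swap a fully loaded, matching staging table into the target partition.
  -- fact_expns, p_200701 and stage_expns are placeholder names.
  ALTER TABLE fact_expns
    EXCHANGE PARTITION p_200701 WITH TABLE stage_expns
    INCLUDING INDEXES WITHOUT VALIDATION;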
Ross Leishman
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

Extremely impressive. Thank you very much for taking the time to make this excellent post. This is very valuable information. It is much appreciated!
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

Did some very quick 'n' nasty testing using Oracle Enterprise to perform parallel SELECT.

Sadly my 1-proc windoze Oracle server is not really up to the task, so the results are - again - inconclusive. Once again though, some interesting things to note.

Parallel Unload with Oracle Enterprise Stage
  • I set up a straight unload of a partitioned table from the Oracle Enterprise stage to a Seq File. In parallel, it took 11+ sec to initiate and then rocketed through, completing 1M rows at an average of 50,000/sec. With the parallel unload switched off, it initiated in 1 sec and completed at an average only marginally slower. I suspect the gap would blow out if I threw more data at it.
  • Repeated the test copying the 1M rows to another database, with similar results. Serial was only a bit slower: 8,000/sec vs. 6,000/sec.
  • The way you get Oracle Enterprise to run a parallel extraction is to provide the table name of a partitioned table that appears in the SELECT statement. You would THINK (at least, I would) that DS would perform a partition-wise extraction using SELECT ... FROM tname PARTITION (pname). But you (and I) would be wrong.

    Looking at the trace file, I have reconstructed a plausible account of what DS is doing.
    • It works out how many data blocks there are to read.

      Code: Select all

      select sum(blocks)
      from
       sys.dba_extents where owner = upper('XXXXX') and segment_name =
        upper('XXXXXXXXXXXXXXXXXX')
    • Then it gets the number of partitions and sub-partitions

      Code: Select all

      select count(*)
      from
       dba_tab_partitions where table_owner = upper('XXXXXXX') and table_name =
        upper('XXXXXXXXXXXXXXXXX')
      
      select count(*)
      from
       all_tab_subpartitions where table_owner = upper('C985675') and table_name =
        upper('EF_ACTL_EXPNS')
      Presumably this is to help work out the ideal parallel strategy. I can't see why it needs it myself.
    • Now this is the important bit: it divides the table up into ROWID ranges by picking up all of the components required to build a ROWID from dba_extents:

      Code: Select all

      select a.relative_fno, a.block_id, a.blocks, c.data_object_id
      from   sys.dba_extents a, sys.dba_objects c
      where  a.owner = upper('XXXXXXXXXX')
      and    a.segment_name = upper('XXXXXXXXXXXXX')
      and    ( c.object_type = 'TABLE' or c.object_type = 'TABLE PARTITION'
            or c.object_type = 'TABLE SUBPARTITION' or c.object_type = 'VIEW' )
      and    NVL(a.partition_name,' ') = NVL(c.subobject_name,' ')
      and    c.subobject_name = a.partition_name
      and    c.owner = a.owner
      and    c.object_name = a.segment_name
      order by data_object_id, relative_fno, block_id
      Note that this SQL will ONLY work for partitioned tables. This is strange because the technique would work equally well for a non-partitioned table with multiple extents. The reference to object type of VIEW is a mystery to me.

      At this point, it knows how many blocks it has to read in total, how many parallel threads it is going to throw at the job, and a list of extents with the number of blocks in each extent. Say it had 1000 blocks to read with 4 threads: it would allocate the first n extents totalling as close to 250 blocks as possible to the first thread, the next n extents to the second thread, and so on.
    • Then it constructs the SQL

      Code: Select all

      SELECT /*+ rowid(XXXXXXXXXXXXX)*/  ....
      FROM
       XXXXXXXXXXXXX WHERE (XXXXXXXXXXXXX.rowid between 'AAAO7LAAEAAAERJAAA' and
        'AAAO7NAAEAAAAZYH//' )
    Pretty cool, huh?
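    If you want to play with the same trick yourself, the boundary ROWIDs can be built from those dictionary columns with DBMS_ROWID - I'm assuming DS does something equivalent internally:

      Code: Select all

      -- Sketch: build the low bound of a ROWID range from the dictionary
      -- columns in the query above (1 = extended ROWID format, row 0).
      SELECT dbms_rowid.rowid_create(1, :data_object_id, :relative_fno,
                                     :block_id, 0) AS range_start_rowid
      FROM   dual;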
Conclusion
  • Parallel unload in Oracle Enterprise is great, but not necessarily useful. If a serial unload can supply rows at (say) 10,000/sec, and you have (say) 4 parallel threads consuming data at (say) 2,000/sec each, then parallelising the unload is not going to speed things up - you need to speed up the consumption rate first. It is often the case that the SELECT outperforms the transformation.
  • By all means use parallel extraction, but don't bother unless you have oodles of data to extract; the initiation time makes it uneconomical for smaller jobs.
  • Don't trick yourself into thinking that the DS partitions will align with the Oracle partitions - they won't! I had grand schemes of pre-partitioning load data by joining unpartitioned data to a junk table with one row per partition. Foiled! If there were some neat trick to pre-partitioning load data, you could perform parallel direct-path loads without invalidating locally partitioned indexes. But perhaps I'm the only one who gets excited about that.... Hmmm.
Ross Leishman