No OCI stage in v8 Parallel job

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

No OCI stage in v8 Parallel job

Post by rleishman »

I'm a former 7.5.1A Server developer. We've just installed an evaluation version of DSv8 for a new project and I'm sussing out the differences.

We can only see the Oracle OCI stage in the palette for Server jobs, and the Oracle Enterprise stage for Parallel jobs.

The doco doesn't say that OCI is server-only, and some of the threads on DSX seem to suggest Parallel users have OCI; but maybe it's different in v7 parallel jobs!!!??? Is it true you cannot use the OCI stage in v8 Parallel jobs? Or have I botched the install?
Ross Leishman
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

The Palette is fully customizable; anything you find in the Stage Types branch of the Repository can be emplaced in the Palette. I realize that the Repository category structure is not quite so straightforward, but you can find things in the "everything" pane of the Palette customization too. This should include the Oracle API (OCI) stage type.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

In my parallel stage types (under Stage Types, not under shortcuts) I can see Oracle Enterprise and Oracle OCI Load. I can access native Oracle connectivity through the Dynamic RDBMS stage. For DB2 I see three stages: Enterprise, API, and Load.
lstsaur
Participant
Posts: 1139
Joined: Thu Oct 21, 2004 9:59 pm

Post by lstsaur »

DS simply won't let you drop the Oracle OCI stage onto a parallel job, in either V7 or V8.
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

Ray, we found OCI in the repository tree - under SERVER stages. We cannot drag it into a Parallel Job, though. Good tip to know, thanks.

Vincent, I found the OCI Load stage under Parallel Stages (same as yours), but I doubt it will give us anything the Oracle Enterprise stage cannot - probably why it was excluded from the palette.

@lstsaur - that is exactly my experience. Nice to know it also happened on V7.

Vincent, I found the Dynamic RDBMS Stage and am now curled up in my warm blanket of familiarity. It's (almost) exactly the same as the server OCI stage! Now I just need to work out whether to use it instead of the Enterprise stage. Time for some benchmarking I think.
Ross Leishman
vmcburney
Participant
Posts: 3593
Joined: Thu Jan 23, 2003 5:25 pm
Location: Australia, Melbourne

Post by vmcburney »

IBM benchmarked the DB2 Enterprise stage at 250% faster than the OCI stage for extracts, see DataStage Tip: Extracting database data 250% faster. Would love to see how your benchmark goes with Oracle.
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

It's all starting to come together! Enterprise stage requires SELECT privs on a bunch of data dictionary views, including DBA_EXTENTS. My bet is that - when you use the "Table" SQL Type - it looks up the starting and ending block address of each extent in DBA_EXTENTS and farms out ROWID RANGE queries to each parallel thread.
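
For reference, if you need to ask a DBA for those privileges, something like the following should cover the dictionary views involved (ds_user is a made-up account name - check the install docs for the definitive list):

  Code: Select all

  -- Hypothetical grants only; ds_user is a placeholder account name.
  GRANT SELECT ON sys.dba_extents        TO ds_user;
  GRANT SELECT ON sys.dba_objects        TO ds_user;
  GRANT SELECT ON sys.dba_tab_partitions TO ds_user;
  -- or, more bluntly:
  GRANT SELECT_CATALOG_ROLE TO ds_user;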

I'll try to verify this with a SQL Trace. Stay tuned.
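
For anyone who wants to watch the same thing, this is roughly how I'd switch tracing on for the running DS session from another SQL*Plus window (the sid and serial# values below are placeholders from v$session):

  Code: Select all

  -- Sketch only: enable SQL trace for another session (Oracle 10g+).
  -- 123 / 456 are placeholder sid and serial# values.
  EXEC dbms_monitor.session_trace_enable(session_id => 123, serial_num => 456, waits => TRUE, binds => FALSE);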
Ross Leishman
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

OCI Load vs. Dynamic RDBMS vs. Oracle Enterprise

Post by rleishman »

Here's an interim update on Oracle Enterprise stage. I haven't done a heap of benchmarking, but I have some interesting points:

General
  • As Vincent pointed out, Dynamic RDBMS gives you all of the Oracle OCI stuff you know and love. I've run some tests and it appears to behave exactly the same as well.
  • Dynamic RDBMS and OCI Load are implemented behind the scenes using OCI. I suspect Enterprise is the same for upsert, update, and user-defined SQL, but for Load it uses SQL*Loader (a minimal control-file sketch follows this list). Oracle puts a bit more effort into SQL*Loader than into OCI, which makes Enterprise a bit more functional than OCI Load.
  • Oracle Enterprise does not have Before SQL and After SQL. It has equivalent Open and Close commands, but these are only available with some retrieval and load methods. If you experiment (OK, fiddle) with the settings you can LOSE what you have entered there.
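Just to make the SQL*Loader point concrete, here is a hand-rolled minimal control file of the kind sqlldr consumes (table and columns are invented - the file Enterprise generates will obviously differ):

  Code: Select all

  -- Minimal hand-written control-file sketch; names are invented.
  -- Note the explicit datatypes: without them SQL*Loader defaults
  -- everything to CHAR, which is relevant to the DIRECT=FALSE result
  -- further down.
  LOAD DATA
  INFILE 'expns.dat'
  APPEND
  INTO TABLE stage_expns
  FIELDS TERMINATED BY ','
  ( expns_id   INTEGER EXTERNAL,
    expns_dt   DATE "YYYY-MM-DD",
    expns_amt  DECIMAL EXTERNAL )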
Parallel
  • Dynamic RDBMS does NOT support parallel. Parallel inputs are collected and run in a single session. Oracle Enterprise DOES support parallel.
  • OCI Load (bulk load) is also available in DS Parallel, but it too loads serially. Oracle Enterprise performs parallel bulk loads.
  • Oracle Enterprise has some funky-looking options for sourcing data from Oracle. When you specify a method of Table or Auto-Generated for a partitioned table, you can tell DS to parallelise across the partitions of the table. If you use user-defined SQL or don't select this option, it runs serially (there might be a run-on-every-node option for MPP systems - I haven't looked).
Upserts
  • Something I wasn't aware of: in OCI and Dynamic RDBMS, Insert-Else-Update and Update-Else-Insert COMPLETELY IGNORE your Array Size setting. They run with an array size of 1 no matter what. Array Size is even ignored for Update Only - I had assumed that updates were array-applied, but I was wrong! The only time the array size is used is with Insert-Only (a rough per-row sketch follows this list).
  • Oracle Enterprise does not have an Insert-Only option. The only equivalent is Load with the DIRECT=FALSE option, and that uses SQL*Loader. If you know every row will insert correctly, you can use an Upsert method with the Insert command run first - it's the same, but kind of scary too.
  • Now this is REALLY interesting. When you run an Upsert with Enterprise, inserts are array-inserted and updates are applied one at a time. Wha-huh? How does that work? Say you have an array size of 500 and it contains three rows, scattered through the array, that already exist. Enterprise tries to insert all 500 at once and gets a failure - no rows are inserted. But it cleverly knows WHICH rows failed. It splits the array, array-inserts the 497 new rows, and updates the other 3 row by row. This is MUCH better than OCI, which insert/updates all 500 row by row. I'd still recommend splitting your Inserts and Updates though.
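For anyone curious, here is roughly what that array-size-1 insert-else-update amounts to per row on the Oracle side (the table and columns are made up for illustration):

  Code: Select all

  -- Sketch of the per-row insert-else-update logic (hypothetical table).
  -- An array-insert-first approach like Enterprise's avoids paying this
  -- round trip for every single row.
  BEGIN
    INSERT INTO tgt (pk_col, val_col) VALUES (:pk, :val);
  EXCEPTION
    WHEN DUP_VAL_ON_INDEX THEN
      UPDATE tgt SET val_col = :val WHERE pk_col = :pk;
  END;
  /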
Benchmarking
  • My results are not conclusive. My client has a freeze on new hardware at the moment, so we are prototyping the project with a single-CPU Celeron Windows PC with 1GB RAM as our database server, and a dual-core Linux PC with 3GB RAM as our DS server. Because the database server is the bottleneck, parallel runs are only marginally faster than serial.
  • Even so, Oracle Enterprise was Upserting (Inserts only) at 3500/sec with 4 threads, whereas Dynamic RDBMS was running insert-only on the same data into the same table serially at 2400/sec.
  • This was unexpected! Oracle Enterprise in DIRECT=FALSE PARALLEL=TRUE mode was SLOWER than Enterprise Upserts! I took a look at the SQL*Loader control file generated by Enterprise, and it did not datatype the columns. Upsert, on the other hand, binds already-typed C structures. I suspect this is the difference.
  • I haven't mucked about with Enterprise loading in DIRECT=TRUE PARALLEL=TRUE. In this mode, all indexes must be disabled and FULLY REBUILT afterwards (see the rebuild sketch after this list) - inappropriate for our project because we are loading a LIVE database.
  • DIRECT=TRUE PARALLEL=FALSE is similarly out for us because - although it can defer maintenance of the indexes rather than requiring a full rebuild - it makes the table unusable during the load.
  • I haven't benchmarked any of the parallel data retrieval features of Oracle Enterprise outlined above, as we are sourcing from files.
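For completeness, this is the sort of clean-up a DIRECT=TRUE parallel load leaves you with - find whatever was marked UNUSABLE and rebuild it. Index and partition names below are placeholders:

  Code: Select all

  -- Find index partitions left UNUSABLE by a direct parallel load...
  SELECT index_name, partition_name
  FROM   dba_ind_partitions
  WHERE  status = 'UNUSABLE';

  -- ...then rebuild each one (placeholder names).
  ALTER INDEX expns_local_idx REBUILD PARTITION p_200701;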
Conclusion
  • DIRECT=TRUE PARALLEL=TRUE loading is a pipe-dream for delta loads. The cost of index rebuilds just gets too high unless you have a VERY aggressive archiving strategy. It's great for full rebuilds though.
  • If DS is serious about supporting Oracle, it needs to invest some serious money in developing partition-wise parallel loads and partition-exchange loading (see the exchange sketch after this list).
  • OCI Load: just don't.
  • Oracle Enterprise: use it! DS really needs to add before/after SQL for every load/select option, and it also needs an insert-only option. Use Upsert freely if updates are very rare; otherwise stick with splitting updates into a separate link.
  • Dynamic RDBMS: I wouldn't bother unless you REALLY need to store your SQL in a file (not supported by Enterprise) or you need a before/after SQL where one is not offered. Never use Upsert.
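For anyone who hasn't met partition-exchange loading, the idea (in plain SQL, outside DS) is to load and index a standalone staging table and then swap it into the target partition as a near-instant dictionary operation. Names below are invented:

  Code: Select all

  -- Swap a fully loaded, matching staging table into the target partition.
  -- fact_expns, p_200701 and stage_expns are placeholder names.
  ALTER TABLE fact_expns
    EXCHANGE PARTITION p_200701 WITH TABLE stage_expns
    INCLUDING INDEXES WITHOUT VALIDATION;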
Ross Leishman
throbinson
Charter Member
Posts: 299
Joined: Wed Nov 13, 2002 5:38 pm
Location: USA

Post by throbinson »

Extremely impressive. Thank you very much for taking the time to make this excellent post. This is very valuable information. It is much appreciated!
rleishman
Premium Member
Posts: 252
Joined: Mon Sep 19, 2005 10:28 pm
Location: Melbourne, Australia

Post by rleishman »

Did some very quick 'n' nasty testing using Oracle Enterprise to perform parallel SELECT.

Sadly my 1-proc windoze Oracle server is not really up to the task, so the results are - again - inconclusive. Once again though, some interesting things to note.

Parallel Unload with Oracle Enterprise Stage
  • I set up a straight unload of a partitioned table from the Oracle Enterprise stage to a Seq File. In parallel, it took 11+ sec to initiate and then rocketed through, completing 1M rows at an average of 50,000/sec. With the parallel unload switched off, it initiated in 1 sec and completed at an average only marginally slower. I suspect the gap would blow out if I threw more data at it.
  • Repeated the test copying the 1M rows to another database, with similar results. Serial was only a bit slower: 8,000/sec vs. 6,000/sec.
  • The way you get Oracle Enterprise to run a parallel extraction is to provide the table name of a partitioned table that appears in the SELECT statement. You would THINK (at least, I would) that DS would perform a partition-wise extraction using SELECT ... FROM tname PARTITION (pname). But you (and I) would be wrong.

    Looking at the trace file, I have reconstructed a plausible account of what DS is doing.
    • It works out how many data blocks there are to read.

      Code: Select all

      select sum(blocks)
      from
       sys.dba_extents where owner = upper('XXXXX') and segment_name =
        upper('XXXXXXXXXXXXXXXXXX')
    • Then it gets the number of partitions and sub-partitions

      Code: Select all

      select count(*)
      from
       dba_tab_partitions where table_owner = upper('XXXXXXX') and table_name =
        upper('XXXXXXXXXXXXXXXXX')
      
      select count(*)
      from
       all_tab_subpartitions where table_owner = upper('C985675') and table_name =
        upper('EF_ACTL_EXPNS')
      Presumably this is to help work out the ideal parallel strategy. I can't see why it needs it myself.
    • Now this is the important bit: it divides the table up into ROWID ranges by picking up all of the components required to build a ROWID from dba_extents:

      Code: Select all

      select a.relative_fno, a.block_id, a.blocks, c.data_object_id
      from   sys.dba_extents a, sys.dba_objects c
      where  a.owner = upper('XXXXXXXXXX')
      and    a.segment_name = upper('XXXXXXXXXXXXX')
      and    ( c.object_type = 'TABLE' or c.object_type = 'TABLE PARTITION'
            or c.object_type = 'TABLE SUBPARTITION' or c.object_type = 'VIEW' )
      and    NVL(a.partition_name,' ') = NVL(c.subobject_name,' ')
      and    c.subobject_name = a.partition_name
      and    c.owner = a.owner
      and    c.object_name = a.segment_name
      order by data_object_id, relative_fno, block_id
      Note that this SQL will ONLY work for partitioned tables. This is strange because the technique would work equally well for a non-partitioned table with multiple extents. The reference to object type of VIEW is a mystery to me.

      At this point, it knows how many blocks it has to read in total, how many parallel threads it is going to throw at the job, and a list of extents with the number of blocks in each extent. Say it had 1000 blocks to read with 4 threads: it would allocate the first n extents totalling as close to 250 blocks as possible to the first thread, the next n extents to the second thread, and so on.
    • Then it constructs the SQL

      Code: Select all

      SELECT /*+ rowid(XXXXXXXXXXXXX)*/  ....
      FROM
       XXXXXXXXXXXXX WHERE (XXXXXXXXXXXXX.rowid between 'AAAO7LAAEAAAERJAAA' and
        'AAAO7NAAEAAAAZYH//' )
    Pretty cool, huh?
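    If you want to play with the same trick yourself, the boundary ROWIDs can be built from those dictionary columns with DBMS_ROWID - I'm assuming DS does something equivalent internally:

      Code: Select all

      -- Sketch: build the low bound of a ROWID range from the dictionary
      -- columns in the query above (1 = extended ROWID format, row 0).
      SELECT dbms_rowid.rowid_create(1, :data_object_id, :relative_fno,
                                     :block_id, 0) AS range_start_rowid
      FROM   dual;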
Conclusion
  • Parallel unload in Oracle Enterprise is great, but not necessarily useful. If a serial unload can supply rows at (say) 10,000/sec, and you have (say) 4 parallel threads consuming data at (say) 2,000/sec each, then parallelising the unload is not going to speed things up - you need to speed up the consumption rate first. It is often the case that the SELECT outperforms the transformation.
  • By all means use parallel extraction, but don't bother unless you have oodles of data to extract; the initiation time makes it uneconomical for smaller jobs.
  • Don't trick yourself into thinking that the DS partitions will align with the Oracle partitions - they won't! I had grand schemes of pre-partitioning load data by joining unpartitioned data to a junk table with one row per partition. Foiled! If there were some neat trick to pre-partitioning load data, you could perform parallel direct-path loads without invalidating locally partitioned indexes. But perhaps I'm the only one who gets excited about that.... Hmmm.
Ross Leishman