DataStage Analysis spreadsheet

PhilHibbs · Post by **PhilHibbs** » Wed Mar 24, 2010 8:29 am

This thread is for discussion of DSX analysis, particularly in the context of pulling out and collating information on job designs.

Currently I'm working on adding Stage Variable Descriptions, I think I should also pick up the Description text from the General tab of a Transformer, which means I should add a new tab to store Transformer details. Maybe it should be a generic Stage tab that stores details of all the stages in a job, as any stage can have a Description.

*Update*: Anyone reading the code should bear in mind: this has been thrown together in a hap-hazard manner, the code is poorly structured and largely undocumented (although I've been adding comments since I started sharing it), and it contains some hard-coded bits that are specific to our standards. The project name prefix is now a Const at the start of the VBA code, though, as is the "Change History" line that it looks for in an Annotation (this should work if it is in the Job Description as well though). Version numbers are pulled from the lines following the "Change History", in one of two formats, "[1.0]" or "[Vno=1.0]".

*Update*: If anyone gets a "Programmatic access to Visual Basic Project is not trusted", then you are using an older version, as this is fixed in 1.8g. If you still want to use that version, you should be able to fix this by going to Tools->Macro->Security, click on Trusted Publishers tab, and tick the "Trust access to Visual Basic Project" box.

*Update*: Latest Version. See end of this thread for details.

PhilHibbs · Post by **PhilHibbs** » Wed Mar 24, 2010 1:00 pm

DataStage Analysis Version 1.2 has a new tab "Stages", stores all stages, with Description and NLS.

PhilHibbs · Post by **PhilHibbs** » Thu Mar 25, 2010 10:45 am

DataStage Analysis Version 1.3 now picks up FTP file details, and some more details about Sort stages in the Stages tab.

DSguru2B · Post by **DSguru2B** » Thu Mar 25, 2010 11:17 am

I admire your passion. I am game for a copy to play with.

PhilHibbs · Post by **PhilHibbs** » Thu Mar 25, 2010 12:34 pm

One of the features that we use most is the File List tab. This collects the details of all Sequential File and Hashed File stage links, as well as Merge Stage input files. Files are labelled as "Input" or "Output", but the terminology is reversed from the DSX file, as an input to a Sequential File stage is an output from the Job's perspective.

All parameters are filled in with their job default values, which in our case nearly always results in a complete file name. This may not be true for you, though, if your job does not have default parameters or if it uses $PROJDEF which we never use because it didn't work in 7.0.1 when we started.

The particular use of this tab is for tracking where a file is used, or where it is created. If you are looking at a job, just open the latest analysis sheet and search for the job name, then find the Stage and Link name for the file that you are looking at. Move the cursor into the Decode File Name column, which should be the default file name with default job parameter values filled in, and click the "Filter" button at the top. This is just a simple shortcut to setting a filter on the column that restricts it to only the file name that is on the cursor, so you can see all references to the same file name in the entire dsx file.

Another use of the File List tab is checking that file schemas are the same. The sheet shows you the number of columns in the file, and also a concatenated list of all the column names. The "Check" button does an analysis of this information and highlights any schema mismatches, where a file is created with one schema but read with another. For Sequential Files, this would probably cause your job to fail, but Hashed Files just quietly do the wrong thing so this check can be very useful.

DSguru2B · Post by **DSguru2B** » Thu Mar 25, 2010 12:41 pm

I opened your spreadsheet, pointed it to a dsx for an entire project and I was pleasantly surprised. It was very quick for 400 jobs.
I agree with you that the file names tab is going to help me a lot.
Phil, very well put together. Cheerz.

PhilHibbs · Post by **PhilHibbs** » Fri Mar 26, 2010 3:03 am

A handy time-saver when processing a number of dsx files is that on opening the spreadsheet, it looks for a file called autodsx.txt in the same directory as the spreadsheet, and processes all the dsx files that are listed in that file and then quits. It saves itself after each one as the same name as the dsx file but with an xls extension, which is also what the "Save" button does.

PhilHibbs · Post by **PhilHibbs** » Fri Mar 26, 2010 6:19 am

One of the major challenges in building my DSX analysis was handling output file schemas. Here's the problem, and how I solved it.

Input files are not a problem, because the DSRECORD that holds the File Name and other file properties also contains the DSSUBRECORDs that define the fields, so I can collect all the data about an input file in one place and write it to the File List tab.

Output files, however, have this information separated. What appears to be a single line in Designer is split into two parts - the schema is in one DSRECORD that is associated with the stage that it is coming from, but the file name and other properties are in another DSRECORD that is connected to the Sequential File stage. There is a connection between these DSRECORDs, however, and this connection is defined through the "Identifier" and "Partner" properties - the Identifier property of the Stage that the link comes out of is also the Partner of the half of the link definition that connects to the Sequential File. Here's a diagram.

TX ---> L1 ---> L2 ---> SF

TX is a transformer, L1 is the first half of the link that contains the schema and derivations, L2 is the second half of the link that contains the file name and other properties, and SF is the Sequential File stage. The relationships between the four parts is not simple like the diagram implies, though! L1 is associated with TX by their Identifiers, L1's Identifier is TX's Identifier with a suffix, similarly L2's Identifier is SF's Identifier with a suffix. Here's where it gets complicated, and I may be misinterpreting it, but I think that there is no direct connection between L1 and L2. L2's Partner property is TX's Identifier, and L1's Partner property is SF's Identifier. Don't ask me what a link's Identifier and Partner are when it is not connected to a stage!

How do I navigate this maze of twisty little passages, all alike? I shove it in a tab and do a bunch of VLOOKUPs. The Output Links tab exists purely to store link details for use in connecting the first halves of links (which I have to keep all of because they might connect to a file) with the second halves which store the file details. When I find a second half of a link that connects to a Sequential or Hashed File stage, I write a VLOOKUP function into the two columns that hold schema information (number of columns and a concatenated list of all column names) that looks for the Job+StageID+Link in the File Links tab. That way I don't need to worry about whether the information is there yet, it might start out being #N/A but it'll get filled in eventually. I also write a VLOOKUP that goes the other way from the File Links tab back to the File List. When I reach the end of the job, I go through the File Links tab deleting any links that didn't end up connecting to a file, because otherwise the sheet would overflow as it would contain the details of every link in every job. This behaviour is controlled by the "Trim Links" setting on the first tab, but in practice this should never be turned off. It also converts the VLOOKUP on the File Links tab into a fixed text value, which speeds things up significantly.

I just realised that the column variables are all the wrong ones when I write the File Links tab. I'll fix that in the next version, and I'll also explain what a "column variable" is. It's all about the "metrics" system.

*Update*: I've found out what happens when you detach a link. It creates a "virtual" stage of OLEType "CUnattachedSrc", and L1's Identifier and L2's Partner connect to this invisible stage.

chulett · Post by **chulett** » Fri Mar 26, 2010 6:30 am

PhilHibbs wrote:How do I navigate this maze of twisty little passages, all alike?

Plugh!

PhilHibbs · Post by **PhilHibbs** » Fri Mar 26, 2010 7:25 am

Someone asked me how they could distinguish between two Oracle source stages, one of which is the Stream input to the Transformer, and the other of which is a Reference Link. This is tricky, and is related to my last post about associating output file schemas with the "other half" of the link that holds the file name. The "LinkType" attribute that determines whether a link is a Stream or a Reference is in the second half of the link. The schema for the link is in the first half, and I don't process the second half at all if it connects to a Transformer. In order to pick up the Stream/Reference attribute, you would need to implement the same system as for output file links, and store all the CTrxInput link details in another tab just to hold the LinkType attribute and do a VLOOKUP from the File List tab into it. I initially wrote that sentence as "...I would need to...", but it's not on my to-do list right now.

ray.wurlod · Post by **ray.wurlod** » Fri Mar 26, 2010 2:08 pm

chulett wrote:
PhilHibbs wrote:How do I navigate this maze of twisty little passages, all alike?
Plugh!

Eat the mushrooms.

PhilHibbs · Post by **PhilHibbs** » Thu Apr 01, 2010 11:04 am

Version 1.3a now available, fixed a couple of bugs (version numbers, file duplication)
*Update*: 1.3b fixed the botch I made when trying to fix duplication
Download

chulett · Post by **chulett** » Fri Apr 02, 2010 6:27 am

Grabbed it, hopefully someday be able to use it.

PhilHibbs · Post by **PhilHibbs** » Fri Apr 02, 2010 10:08 am

I just thought of another caveat - I don't know what it would do with a Hashed File that is created under an "Account name" rather than "Directory path". We only ever use directory paths. What should I do with the Account element? Just pick it up instead of the directory and use it as a prefix, so you'd get account/hashedfile instead of /root/path/hashedfile? You could then distinguish "account" hashed files because they don't begin with a '/' for Unix or that they don't contain a ':' for Windows. I know that's not perfect since you don't actually need to specify a directory, but no-one creates files in the project directory, do they? (If you do, why not try creating an output file called DSParams - actually, don't do that, I'm just joking - nooooo - stop - dooooon't - damn too late, you did it, didn't you? You foolish fool!)
Version 1.3c now picks up the Account, and I've fixed it so that it doesn't put a '/' in front of Hashed File names if the Directory/Account is blank. *Edit:* Here's the link:
DataStage_Analysis_1.3c.zip

ray.wurlod · Post by **ray.wurlod** » Fri Apr 02, 2010 2:20 pm

SDKSequences is always an account-based hashed file, so this edit will be beneficial.

DSXchange

DataStage Analysis spreadsheet

DataStage Analysis spreadsheet

Re: DataStage Analysis spreadsheet

Re: DataStage Analysis spreadsheet