parallel datasets

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

PeteM2
Premium Member
Posts: 44
Joined: Thu Dec 15, 2011 9:17 am
Location: uk

parallel datasets

Post by PeteM2 »

Consider the scenario of a dataset defined as 'overwrite' in a job using 2 nodes, where each node has a separate resource disk file system.

If the configuration were increased to 4 nodes, each with its own separate resource disk file system, would the dataset be seamlessly re-created across the 4 file systems when the job is run with the new version of the config file? Or would a separate task be required to spread the dataset across the 4 file systems before the job runs with the new config file?
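For illustration, a four-node configuration file of the kind described, with a separate resource disk file system per node, might look something like this (the host name and paths are made up):

    {
        node "node1" { fastname "etlserver" pools "" resource disk "/dsdata/fs1" {pools ""} resource scratchdisk "/scratch/fs1" {pools ""} }
        node "node2" { fastname "etlserver" pools "" resource disk "/dsdata/fs2" {pools ""} resource scratchdisk "/scratch/fs2" {pools ""} }
        node "node3" { fastname "etlserver" pools "" resource disk "/dsdata/fs3" {pools ""} resource scratchdisk "/scratch/fs3" {pools ""} }
        node "node4" { fastname "etlserver" pools "" resource disk "/dsdata/fs4" {pools ""} resource scratchdisk "/scratch/fs4" {pools ""} }
    }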
thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

Seamless.

But you may need a manual process to clean up the old 2-node data files.
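One way to avoid leftovers altogether, assuming orchadmin is available on the engine tier (the path below is hypothetical), is to delete the dataset before the first run under the new configuration:

    # orchadmin reads the descriptor, which records where the segment
    # files were written, and removes the segments and the descriptor
    # together (orchadmin still needs a valid APT_CONFIG_FILE to start)
    orchadmin rm /dsdata/my_dataset.ds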
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PeteM2
Premium Member
Premium Member
Posts: 44
Joined: Thu Dec 15, 2011 9:17 am
Location: uk

Post by PeteM2 »

I take it that after the job has run with the new config file, the header (descriptor) file will point to the new 4 partitions of data and will no longer reference the original 2 partitions.

Therefore, can the dataset utility easily identify these orphaned data partitions?
thanks
jwiles
Premium Member
Premium Member
Posts: 1274
Joined: Sun Nov 14, 2004 8:50 pm

Post by jwiles »

Why do you think the data segments would be orphaned? You have overwritten the dataset, meaning the dataset has been replaced...you haven't just rewritten the descriptor file.

The most common cause of orphaned data segments is users deleting the descriptor files using standard O/S commands (rm, delete, etc.) rather than the appropriate tools (orchadmin, dataset management, dataset stage).
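As a sketch of the difference (the dataset path is hypothetical):

    # Wrong: removes only the descriptor file; the segment files on the
    # resource disks are left behind as orphans
    rm /dsdata/my_dataset.ds

    # Right: reads the descriptor first, then removes the descriptor and
    # every segment file it references
    orchadmin rm /dsdata/my_dataset.ds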

Regards,
- james wiles


All generalizations are false, including this one - Mark Twain.
priyadarshikunal
Premium Member
Posts: 1735
Joined: Thu Mar 01, 2007 5:44 am
Location: Troy, MI

Post by priyadarshikunal »

"you may need a manual process" the comment does not say that you will need it.

The 2 node dataset will be overwritten and there won't be any orphaned files as the configuration file copy in descriptor will be used to delete/overwrite the data files
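To see what a descriptor records, orchadmin's describe command prints the partition and segment information it holds (exact flags and output vary by version; the path is hypothetical):

    orchadmin describe /dsdata/my_dataset.ds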
Priyadarshi Kunal

Genius may have its limitations, but stupidity is not thus handicapped. :wink:
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

It all depends on whether the original two nodes' resource disk settings are included in the new four-node configuration file or not.
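For example, continuing with hypothetical paths: if the four-node file spreads its nodes across

    resource disk "/dsdata/fs1" {pools ""}
    resource disk "/dsdata/fs2" {pools ""}
    resource disk "/dsdata/fs3" {pools ""}
    resource disk "/dsdata/fs4" {pools ""}

and fs1 and fs2 are the two file systems the two-node file used, the overwrite can still reach the old segment files and remove them. If the four-node file names four entirely new file systems, the old segments are stranded.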
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
PeteM2
Premium Member
Posts: 44
Joined: Thu Dec 15, 2011 9:17 am
Location: uk

Post by PeteM2 »

Is it the case that, if the resource disks allocated to the 4 nodes are not the same as the original disks allocated to the 2-node configuration, then there will be orphaned data files? And if so, can the dataset utility identify these files for deletion?
thanks
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia

Post by ray.wurlod »

There will be orphaned segment files in this scenario, and the Data Set utility (by which I assume you mean the Data Set Management utility) cannot identify them.
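A manual search is the usual fallback. Something along these lines can surface candidates, on the assumption that segment file names embed the dataset name; the paths and pattern are illustrative, so review the list by hand before deleting anything:

    # Look on the OLD resource disk file systems for segment files
    # belonging to the dataset; -ls only prints details, nothing is removed
    find /dsdata/fs1 /dsdata/fs2 -name 'my_dataset.ds.*' -ls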
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.