
Column Analysis failing due to java heap space

Posted: Sun Oct 07, 2018 5:12 am
by Novak
Hi experts,

Running on a Windows system with 32 GB RAM, 8 CPUs, and a 4-node configuration file.
We have a flat file with 96 columns that needs to be analyzed.
When running on a sample of 20 records it is incredibly slow: only a few columns are analyzed before it eventually fails with a "java.lang.OutOfMemoryError: Java heap space" error message.

We have also noticed that in Director's log there are 10 jobs being run for 1 column analysis. We're guessing that is what makes the process so heavy.

Does anyone know how we can fix this?

Regards,

Novak

Posted: Sun Oct 07, 2018 6:51 am
by chulett
Is your DataStage 32-bit or 64-bit?

Posted: Sun Oct 07, 2018 7:34 pm
by Novak
Hi Craig,

It is 32-bit.

Posted: Mon Oct 08, 2018 2:55 am
by ray.wurlod
You can increase the size of Java heap space. Search here and/or IBM Information Center for "Xmx".
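As a concrete illustration: the heap ceiling is controlled by the standard JVM options below. The exact property names and the place to set them for Information Analyzer's JVM are covered in the IBM technotes, so treat these values as placeholder examples rather than recommended settings.

```
-Xms512m -Xmx2048m
```

Note that on a 32-bit JVM (as in this case) the maximum usable heap is capped well below 2 GB regardless of what -Xmx requests, which is why the 32-bit/64-bit question matters.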

If you run your column analyses with "preserve scripts" enabled, you will be able to look at the jobs in DataStage Director, including the logs, to ascertain what some of these processes do.

Note, too, that Information Analyzer will break up a column analysis request into multiple requests each of which doesn't process too many columns. So, to process your 96 columns, it's no real surprise that the workload was split into ten units each processing 10 (or 9) columns.
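The arithmetic behind that split can be sketched in a few lines. This is not IA's actual batching code (which isn't documented here), just an illustration, under the assumption of balanced units of at most 10 columns, of why 96 columns produces ten units of 10 or 9 columns each:

```python
import math

def split_columns(n_columns, max_per_unit=10):
    """Split n_columns into the fewest units of at most max_per_unit
    columns each, balanced so unit sizes differ by at most one."""
    units = math.ceil(n_columns / max_per_unit)
    base, extra = divmod(n_columns, units)
    # 'extra' units get one extra column; the rest get 'base' columns
    return [base + 1] * extra + [base] * (units - extra)

print(split_columns(96))  # ten units: six of 10 columns, four of 9
```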

Posted: Mon Oct 08, 2018 5:23 am
by chulett
I only ask because on a 64-bit system you have more room to increase the heap size than on a 32-bit one, with the memory limits it brings. There are several Technotes out there on that subject; here is one example.

Posted: Wed Oct 31, 2018 1:09 am
by Novak
Thanks a lot guys.

We will almost certainly upgrade to 64-bit on Linux, hopefully within a couple of months. This is the second time I have run IA on Windows and it is painful, to say the least. Not just because of this failure, but also because of the overall end-user response times.

Until then, and on advice from IBM support, we have continued our data profiling on 2 nodes rather than 8. There have been hardly any failures since, and the run times are not that different, so we can live with it.
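For anyone wanting the same workaround: the node count comes from the parallel engine's configuration file (pointed to by APT_CONFIG_FILE). A minimal 2-node configuration looks roughly like this; the hostname and resource paths below are placeholders, not values from our setup:

```
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/opt/IBM/InformationServer/Server/Datasets" {pools ""}
    resource scratchdisk "/opt/IBM/InformationServer/Server/Scratch" {pools ""}
  }
}
```

Fewer nodes means fewer concurrent processes per job, so the peak memory demand drops, at the cost of some parallelism.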

Cheers,

Novak