Page 1 of 1

Big integrate - joins

Posted: Fri Jul 13, 2018 4:00 am
by rajalaa
Hello,
I have datastage job which reads data from BDFS stage. The underlying HQL is currently reading data from single table (Hive Hql). All the other lookup / joins are done in Datastage job.
What would be the advantage if I perform the joins with multiple tables in Hive hql and have the data read from BDFS file stage instead of performing the joins in Datastage after reading data from single table?

Thanks.

Posted: Fri Jul 13, 2018 8:04 am
by chulett
I'm going to give some general advice here on the topic, meaning nothing specific to BDFS or Hive. The general rule of thumb is you do joins within the job flow when you need to join either disparate data sources or sources like database tables that come from separate systems. As a corollary to that, if you've got multiple tables in the same relational source, do the joins there - with a caveat. I've been at clients where the source system is the weakest link in the chain and we weren't "allowed" to do joins there for fear of impacting performance. So in those cases, we simply streamed data out as gently as we could, loaded them into our processing or target database and did the joins there if possible. Otherwise, they were done in job.

Hope some of that helps. IMHO, well worth the experiment to give it a shot and see if / how much it helps in your specific case.