Coding Tips and Best Practice in Hive and Oozie…

bestpracticehiveMany time during the code review found some common mistakes done by the developer. Here are few of them…

Workflow mandatory item :

Add this property in all workflows that have a Hive action. This property will make sure that the hive job runs with the necessary number of reducers instead of just 1.

<name> mapreduce.job.reduces </name>  <value>-1</value>

HQL items : Setting properties:

Keep the set properties in the HQL to a minimum. Let it take the default values. Add only what is absolutely necessary for that script. If you are using older code as template do not carry over all the set properties.

HQL items : Container properties:

Do not add any memory settings for the hql unless, it’s absolutely needed and has been verified by the Leads. This is very important, as assigning large amounts of memory will cause cluster imbalance. Specifically

SET hive.tez.container.size

Not to Turn of Autoconvert:
Do not turn off autoconvert joins ( set hive.tez.autoconvert  property is defaulted to TRUE)
HQL items : Table joins
When joining 2 tables just use the table name or a data restricting query for the table Example

Good practice

Select a,b….. From external_table  inner JOIN orc_table On key1= key1 and key2 = key2

 Good practice

Select a, b….. From external_table inner JOIN (select col1, col2 from orc_table Where key1 = key1 and key2 = key2) On key1= key1 and key2 = key2

Do not use this:

Select a, b….. From external_table inner JOIN (select col1, col2 from orc_table) On key1= key1 and key2 = key2

This will cause all the data of orc_table to be loaded.
For left outer joins

For left outer joins , restrict the search criteria for the last 30 days ( or whatever is the business need ),if not it might throw outOfMemory
Multiple left outer joins

When there are multiple left outer joins, do not put them in one query. Create temporary tables progressively


If the query has left joins with 4 tables a, b, c, d then don’t write one query with left joins

Select … From a left join b on … Left join c on … Left join d on …

Instead create temp tables progressively

Create temp table1 as Select …from a Left join b on …

Create temp table2 as Select … from table1 Left join c on …

Crate temp table3 as Select … from table2 Left join d on ….

A query with  group by

When doing a query with  group by , between  a lower level and a higher level table, first create a temp table for the group by and then left join with the higher level table
Change unix_timestamp

Change unix_timestamp to to_unix_timestamp (current_timestamp ()) in the Where clause. It’s ok to use unix_timestamp in the select clause.

Please add your code review comments or best practice in comment section!!!


One thought on “Coding Tips and Best Practice in Hive and Oozie…

  1. It’s a pity you don’t have a donate button! I’d most certainly donate to this outstanding blog! I suppose for now i’ll settle for book-marking and adding your RSS feed to my Google account. I look forward to new updates and will share this blog with my Facebook group. Chat soon!

Leave a Reply

Your email address will not be published. Required fields are marked *