Interested in some Hive best practices? Last year, I wrapped up a gigantic campaign-analytics project that leveraged Apache Hive. There is a well-known quote about Hive in the developer/architect community:
"Hive assumes either you know nothing about it, or you know everything about it."
Either way, you'll get the job done, but there are two ways a job gets done: slowly, or fast and faster. This post focuses on the fast and faster option.
Use Case: Background
What is campaign analytics in simple terms?
Campaign analytics is typically driven by the marketing team in a company. They want to understand the impact a campaign had on customers for the chosen products.
- How many customers bought products that were a part of a campaign?
- How many were exposed to a campaign?
- How much impact could a campaign have under varied conditions (this is predictive campaign analytics)? E.g., if crispy wasabi ginger chips were promoted as part of a campaign, would they sell more during Thanksgiving football games or on Super Bowl Sunday?
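As a rough sketch, the first question above maps naturally to a Hive query. The table and column names here (`transactions`, `campaign_products`, and so on) are hypothetical illustrations, not the project's actual schema:

```sql
-- Hypothetical schema: transactions(customer_id, product_id, txn_date)
-- and campaign_products(campaign_id, product_id).
-- Counts distinct customers who bought any product in a given campaign.
SELECT cp.campaign_id,
       COUNT(DISTINCT t.customer_id) AS buying_customers
FROM transactions t
JOIN campaign_products cp
  ON t.product_id = cp.product_id
WHERE cp.campaign_id = 'FALL_SNACKS_2014'
GROUP BY cp.campaign_id;
```

At campaign-analytics scale, queries like this (joins plus `COUNT(DISTINCT ...)`) are exactly where Hive tuning pays off.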
Hadoop Cluster Environment Details
| Cluster attribute | Value |
|---|---|
| Total # of nodes | 64 |
| Total cluster memory | 16.38 TB |
| Hadoop distribution | MapR MRv2 |
First off, this project ran on the MapReduce execution framework, running loads of Hive jobs under fair scheduling (good and bad!). Fights (fortunately no cursing!) broke out at times when one department submitted a large-scale job and another department waited a long time for YARN to release resources/containers so their queued job could be submitted. Lots of joins, lots of data-type inconsistencies.
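That kind of cross-department contention is what the YARN Fair Scheduler's per-queue allocations are meant to arbitrate. A minimal sketch of a `fair-scheduler.xml` allocation file follows; the queue names and resource numbers are illustrative, not the project's actual configuration:

```xml
<?xml version="1.0"?>
<allocations>
  <!-- One queue per department; weights control each queue's
       share of the cluster under contention. -->
  <queue name="marketing">
    <weight>2.0</weight>
    <!-- Guaranteed floor so queued jobs are not starved -->
    <minResources>8192 mb, 4 vcores</minResources>
  </queue>
  <queue name="reporting">
    <weight>1.0</weight>
    <minResources>4096 mb, 2 vcores</minResources>
  </queue>
</allocations>
```

With `minResources` floors in place, a large job in one queue cannot hold every container indefinitely while another department's job sits waiting.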