Big data engineering and analytics

Foundational topics in data and computer engineering and sciences

Stratified sampling in Hive

Hadoop can sample data very easily using as described in my earlier article Hive performance optimization. But more often than not a more sophisticated sampling scheme is required. One of the more popular sampling scheme that can sample data based on the distribution of data is known as stratified random sampling.

To achieve stratified random sampling in Hive on multiple columns is not that hard. Here is how you could achieve that.

First of I created some sample data for the article:

IL,1,1
IL,1,2
IL,2,3
IL,2,4
IL,2,5
IL,2,6
IL,3,7
IL,3,8
IL,3,9
WI,1,10
WI,1,11
WI,1,12
WI,1,13
WI,2,14
WI,2,15
WI,2,16
WI,2,17
WI,3,18
WI,3,19
WI,3,20
WI,3,21
WI,4,22
WI,4,23
WI,4,24
WI,4,25
WI,5,26
WI,5,27

I created a hive table on it:

create external table stratified (STATE CHAR(2), ZIP int, id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION "/user/cloudera/sampling";

To sample 50% of records based on STATE and ZIP code together here is the query you need to run.

select ID, STATE, ZIP, rank1, st_count
from (
  select ID, STATE, ZIP, 
  count(*) over (partition by hash(STATE, ZIP)) as st_count, 
  rank() over (partition by hash(STATE, ZIP) order by rand()) as rank1
  from stratified) A 
where rank1 <= 0.5 * st_count;

For one of the runs the result came out as

id,state,zip,rank1,st_count 
1,IL,1,1,2 
6,IL,2,1,4 
4,IL,2,2,4 
7,IL,3,1,3 
12,WI,1,1,4 
11,WI,1,2,4 
16,WI,2,1,4 
15,WI,2,2,4 
18,WI,3,1,4 
19,WI,3,2,4 
24,WI,4,1,4 
23,WI,4,2,4 
27,WI,5,1,2

As you can see Hive sampled the data correctly. I will explain the logic line by line in another post. But let me know if you have questions.

Leave a comment