Big data engineering and analytics
Foundational topics in data and computer engineering and sciences
Stratified sampling in Hive
June 13, 2016
Posted by on Hadoop can sample data very easily using as described in my earlier article Hive performance optimization. But more often than not a more sophisticated sampling scheme is required. One of the more popular sampling scheme that can sample data based on the distribution of data is known as stratified random sampling.
To achieve stratified random sampling in Hive on multiple columns is not that hard. Here is how you could achieve that.
First of I created some sample data for the article:
IL,1,1 IL,1,2 IL,2,3 IL,2,4 IL,2,5 IL,2,6 IL,3,7 IL,3,8 IL,3,9 WI,1,10 WI,1,11 WI,1,12 WI,1,13 WI,2,14 WI,2,15 WI,2,16 WI,2,17 WI,3,18 WI,3,19 WI,3,20 WI,3,21 WI,4,22 WI,4,23 WI,4,24 WI,4,25 WI,5,26 WI,5,27
I created a hive table on it:
create external table stratified (STATE CHAR(2), ZIP int, id int) ROW FORMAT DELIMITED FIELDS TERMINATED BY "," LOCATION "/user/cloudera/sampling";
To sample 50% of records based on STATE and ZIP code together here is the query you need to run.
select ID, STATE, ZIP, rank1, st_count from ( select ID, STATE, ZIP, count(*) over (partition by hash(STATE, ZIP)) as st_count, rank() over (partition by hash(STATE, ZIP) order by rand()) as rank1 from stratified) A where rank1 <= 0.5 * st_count;
For one of the runs the result came out as
id,state,zip,rank1,st_count 1,IL,1,1,2 6,IL,2,1,4 4,IL,2,2,4 7,IL,3,1,3 12,WI,1,1,4 11,WI,1,2,4 16,WI,2,1,4 15,WI,2,2,4 18,WI,3,1,4 19,WI,3,2,4 24,WI,4,1,4 23,WI,4,2,4 27,WI,5,1,2
As you can see Hive sampled the data correctly. I will explain the logic line by line in another post. But let me know if you have questions.
Recent Comments