Big data engineering and analytics

Foundational topics in data and computer engineering and sciences

Stratified sampling in Hive

Leave a comment Posted by anandj123 on June 13, 2016

Hadoop can sample data very easily using as described in my earlier article Hive performance optimization. But more often than not a more sophisticated sampling scheme is required. One of the more popular sampling scheme that can sample data based on the distribution of data is known as stratified random sampling.

To achieve stratified random sampling in Hive on multiple columns is not that hard. Here is how you could achieve that.

First of I created some sample data for the article:

IL,1,1
IL,1,2
IL,2,3
IL,2,4
IL,2,5
IL,2,6
IL,3,7
IL,3,8
IL,3,9
WI,1,10
WI,1,11
WI,1,12
WI,1,13
WI,2,14
WI,2,15
WI,2,16
WI,2,17
WI,3,18
WI,3,19
WI,3,20
WI,3,21
WI,4,22
WI,4,23
WI,4,24
WI,4,25
WI,5,26
WI,5,27

I created a hive table on it:

create external table stratified (STATE CHAR(2), ZIP int, id int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ","
LOCATION "/user/cloudera/sampling";

To sample 50% of records based on STATE and ZIP code together here is the query you need to run.

select ID, STATE, ZIP, rank1, st_count
from (
  select ID, STATE, ZIP, 
  count(*) over (partition by hash(STATE, ZIP)) as st_count, 
  rank() over (partition by hash(STATE, ZIP) order by rand()) as rank1
  from stratified) A 
where rank1 <= 0.5 * st_count;

For one of the runs the result came out as

id,state,zip,rank1,st_count 
1,IL,1,1,2 
6,IL,2,1,4 
4,IL,2,2,4 
7,IL,3,1,3 
12,WI,1,1,4 
11,WI,1,2,4 
16,WI,2,1,4 
15,WI,2,2,4 
18,WI,3,1,4 
19,WI,3,2,4 
24,WI,4,1,4 
23,WI,4,2,4 
27,WI,5,1,2

As you can see Hive sampled the data correctly. I will explain the logic line by line in another post. But let me know if you have questions.

Hadoop

← Hadoop cluster configuration – Java tuning Few tips on ingesting data from SQL server using SQOOP →

	How to write a Pig A… on How to write a Pig EvalFunc (m…
	Stratified sampling… on Hive performance optimization
	Hadoop cluster confi… on How to test the read and write…
	Hadoop cluster confi… on Hadoop cluster planning…
	Hive performance opt… on SIMD instruction – Vecto…

Big data engineering and analytics

Stratified sampling in Hive

Leave a comment Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta

Big data engineering and analytics

Stratified sampling in Hive

Share this:

Related

Leave a comment Cancel reply

Recent Posts

Recent Comments

Archives

Categories

Meta