hadoop-mapreduce-user mailing list archives

From David Ginzburg <ginz...@hotmail.com>
Subject Inconsistent result when fairscheduler preemption is on
Date Thu, 12 Jan 2012 12:36:01 GMT


I am running a 70-node CDH3u2 cluster.

A week ago the analyst ran a Hive aggregation query over a year of data and compared the results
to what we have in our relational data warehouse. The results deviated by about 2.5% from
what was in the data warehouse.
The data in the data warehouse was generated daily, so those jobs are much smaller.

I ran the same query over the same data several times and got slightly different results each time!

After further investigation I found that I had submitted the job to a low-priority pool
with preemption enabled.

I found a correlation between the deviation and the number of killed reduce tasks.

The only time I got the correct results was when I turned off preemption and submitted the

This issue is difficult to reproduce. At first we thought it was a Hive issue:
http://mail-archives.apache.org/mod_mbox/hive-user/201201.mbox/%3CSNT135-W29D26007E9692D5BE07458B7990@phx.gbl%3E
but now we suspect it is a MapReduce issue.

The query produced a 3-stage MR job, the largest stage having 678 reducers.

That is as far as we have been able to narrow it down so far. I suspect this issue has never come
up before, since it is rare to have a reference to compare large data processing results against,
and the phenomenon doesn't occur for small data jobs.