hadoop-common-user mailing list archives

From Georgi Ivanov <iva...@vesseltracker.com>
Subject Re-sampling time data with MR job. Ideas
Date Fri, 19 Sep 2014 08:17:15 GMT
I have time-related data like this:
entity_id, timestamp, data

The data has a resolution of roughly 5 seconds.
I want to extract it at a 10-minute resolution.

So what I can do is:
Emit everything in the mapper, since the data is not sorted there.
Emit only one record every 10 minutes from the reducer. The reducer receives
the data sorted by the (entity_id, timestamp) pair (secondary sorting).
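
The reducer-side step described above can be sketched in plain Java, without the Hadoop API. This is only an illustration under assumptions that are not in the original message: timestamps are epoch milliseconds, "every 10 minutes" is taken to mean the first record in each epoch-aligned 10-minute window, and the names TenMinuteResampler, Sample, and resample are hypothetical. The input list stands in for the records a reducer would see after the secondary sort.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Hadoop or of the original message.
final class TenMinuteResampler {
    // Assumed target resolution: 10 minutes, expressed in milliseconds.
    static final long BUCKET_MS = 10 * 60 * 1000L;

    // One input record: entity_id, timestamp (assumed epoch millis), data.
    static final class Sample {
        final String entityId;
        final long timestampMs;
        final String data;

        Sample(String entityId, long timestampMs, String data) {
            this.entityId = entityId;
            this.timestampMs = timestampMs;
            this.data = data;
        }
    }

    // Expects input already sorted by (entityId, timestampMs), as the
    // secondary sort would deliver it. Keeps the first sample of each
    // 10-minute window per entity and drops the rest.
    static List<Sample> resample(List<Sample> sorted) {
        List<Sample> out = new ArrayList<>();
        String currentEntity = null;
        long currentBucket = Long.MIN_VALUE;
        for (Sample s : sorted) {
            long bucket = Math.floorDiv(s.timestampMs, BUCKET_MS);
            boolean newEntity = !s.entityId.equals(currentEntity);
            if (newEntity || bucket != currentBucket) {
                out.add(s);
                currentEntity = s.entityId;
                currentBucket = bucket;
            }
        }
        return out;
    }
}
```

Because the loop only remembers the current entity and window, it needs constant memory per reducer, but it relies entirely on the sort order holding.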

This will work, but it will take forever, since I have to process
terabytes of data.
Also, the amount of data sent to the reducers will be huge (as I am not
filtering in the map phase at all), and the number of reducers is much
smaller than the number of mappers.

Are there any better ideas for how to do this?

