hadoop-user mailing list archives

From Mirko Kämpf <mirko.kae...@gmail.com>
Subject Re: Re-sampling time data with MR job. Ideas
Date Fri, 19 Sep 2014 08:34:36 GMT
Hi Georgi,

I would emit the new timestamp (at 10-minute resolution) already in the
mapper. That lets you (pre)aggregate the data in the mapper, so much less
data has to go through the shuffle & sort stage. One question, though: does
changing the resolution mean you want to aggregate the individual entities,
or do you still need all individual entities and just want to translate the
timestamp to another resolution (5 s => 10 min)?
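
As a rough sketch of the idea, here is the bucketing and in-mapper
pre-aggregation in plain Python, deliberately not tied to the Hadoop API.
It assumes timestamps are epoch seconds and that "aggregate" means keeping
the latest sample per entity and 10-minute bucket; both are assumptions, and
the names bucket_start and map_with_preaggregation are made up for
illustration.

```python
BUCKET_SECONDS = 600  # 10 minutes; assumes timestamps are epoch seconds


def bucket_start(ts, width=BUCKET_SECONDS):
    """Round a timestamp down to the start of its bucket."""
    return ts - ts % width


def map_with_preaggregation(records, width=BUCKET_SECONDS):
    """Simulate one mapper over one input split.

    records: iterable of (entity_id, timestamp, data) in any order.
    Keeps only the latest sample per (entity_id, bucket) in memory and
    emits it at the end, so at most one record per entity and bucket
    leaves the mapper (hypothetical aggregation rule: "latest wins").
    """
    latest = {}
    for entity_id, ts, data in records:
        key = (entity_id, bucket_start(ts, width))
        if key not in latest or ts > latest[key][0]:
            latest[key] = (ts, data)
    # Emit ((entity_id, bucket), (timestamp, data)) pairs.
    yield from sorted(latest.items())


def reduce_latest(values):
    """Reducer side: the same rule, applied across all mappers' outputs."""
    return max(values, key=lambda v: v[0])
```

With 5-second input this cuts the mapper output by up to a factor of 120
per entity. The dictionary grows with the number of distinct (entity, bucket)
pairs in a split, so a real job may need to flush it periodically; a
combiner applying the same rule is another way to get a similar reduction.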


2014-09-19 9:17 GMT+01:00 Georgi Ivanov <ivanov@vesseltracker.com>:

> Hello,
> I have time-related data like this:
> entity_id, timestamp, data
> The resolution of the data is about 5 seconds.
> I want to extract the data at 10-minute resolution.
> So what I can do is:
> Just emit everything in the mapper, as the data is not sorted there.
> Emit only every 10 minutes from the reducer. The reducer receives the data
> sorted by the (entity_id, timestamp) pair (secondary sort).
> This will work fine, but it will take forever, since I have to process
> TBs of data.
> Also, the data sent to the reducers will be huge (as I am not filtering
> in the map phase at all), and the number of reducers is much smaller than
> the number of mappers.
> Are there any better ideas for how to do this?
> Georgi
