hadoop-common-user mailing list archives

From Georgi Ivanov <iva...@vesseltracker.com>
Subject Re: Re-sampling time data with MR job. Ideas
Date Fri, 19 Sep 2014 09:06:58 GMT
Hi Mirko,
Thanks for the reply.

Let's assume I have a record every second for every given entity.

entity_id | timestamp | data

1 , 2014-01-01 12:13:01 - I want this
... some more rows for different entities ...
1 , 2014-01-01 12:13:02
1 , 2014-01-01 12:13:03
1 , 2014-01-01 12:13:04
1 , 2014-01-01 12:13:05
........
1 , 2014-01-01 12:23:01 - I want this
1 , 2014-01-01 12:23:02

The problem is that in reality the data does not arrive sorted by
(entity_id, timestamp), so I can't filter in the mapper.
Each mapper gets a mix of entity_ids, depending on its input split.
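
If I understand your suggestion correctly, the mapper would key every record
by (entity_id, 10-minute bucket) and a combiner/reducer would then keep only
the earliest record per bucket, so no sort order on the input is needed.
A rough sketch of what I have in mind (class and field names are placeholders,
I assume the timestamp is already an epoch value in milliseconds, and aligned
10-minute windows are only an approximation of "one record every 10 minutes"):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class Resample10Min {

    private static final long BUCKET_MS = 10L * 60L * 1000L;

    // Emits key = "entityId|bucketStart", value = "timestamp,data".
    public static class BucketMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text outValue = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // Assumed input: entity_id,epoch_millis,data (one record per line).
            String[] parts = line.toString().split(",", 3);
            long ts = Long.parseLong(parts[1].trim());
            long bucket = (ts / BUCKET_MS) * BUCKET_MS;  // truncate to the 10-minute window
            outKey.set(parts[0].trim() + "|" + bucket);
            outValue.set(ts + "," + parts[2]);
            ctx.write(outKey, outValue);
        }
    }

    // Usable as both combiner and reducer: keep the earliest record per window.
    public static class EarliestReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String earliest = null;
            long earliestTs = Long.MAX_VALUE;
            for (Text v : values) {
                long ts = Long.parseLong(v.toString().split(",", 2)[0]);
                if (ts < earliestTs) {
                    earliestTs = ts;
                    earliest = v.toString();
                }
            }
            if (earliest != null) {
                ctx.write(key, new Text(earliest));
            }
        }
    }
}

With the same class set via job.setCombinerClass() and job.setReducerClass(),
most of the per-window duplicates would already be dropped on the map side.
Is that roughly what you mean?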



Georgi

On 19.09.2014 10:34, Mirko Kämpf wrote:
> Hi Georgi,
>
> I would already emit the new timestamp (with a resolution of 10 min) in 
> the mapper. This allows you to (pre)aggregate the data already in the 
> mapper, and you have less traffic during the shuffle & sort stage. 
> Changing the resolution means you have to aggregate the individual 
> entities. Or do you still need all individual entities and just want to 
> translate the timestamp to another resolution (5 s => 10 min)?
>
> Cheers,
> Mirko
>
>
>
>
> 2014-09-19 9:17 GMT+01:00 Georgi Ivanov <ivanov@vesseltracker.com>:
>
>     Hello,
>     I have time-related data like this:
>     entity_id, timestamp , data
>
>     The resolution of the data is something like 5 seconds.
>     I want to extract the data with 10 minutes resolution.
>
>     So what I can do is:
>     Just emit everything in the mapper, as the data is not sorted there.
>     Emit only every 10 minutes from the reducer. The reducer receives the
>     data sorted by the (entity_id, timestamp) pair (secondary sorting).
>
>     This will work fine, but it will take forever, since I have to
>     process TBs of data.
>     Also, the data emitted to the reducer will be huge (as I am not
>     filtering in the map phase at all), and the number of reducers is much
>     smaller than the number of mappers.
>
>     Are there any better ideas on how to do this?
>
>     Georgi
>
>
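
For reference, by "secondary sorting" in the quoted mail I mean the usual
composite-key setup, roughly the following (just a sketch, class names are
placeholders):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key: sort by (entity_id, timestamp), partition and group by entity_id.
public class EntityTimeKey implements WritableComparable<EntityTimeKey> {
    public String entityId = "";
    public long timestamp;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(entityId);
        out.writeLong(timestamp);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        entityId = in.readUTF();
        timestamp = in.readLong();
    }

    @Override
    public int compareTo(EntityTimeKey other) {
        int c = entityId.compareTo(other.entityId);
        return c != 0 ? c : Long.compare(timestamp, other.timestamp);
    }
}

// All records of one entity go to the same reducer ...
class EntityPartitioner extends Partitioner<EntityTimeKey, Text> {
    @Override
    public int getPartition(EntityTimeKey key, Text value, int numPartitions) {
        return (key.entityId.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// ... and end up in a single reduce() call, sorted by timestamp within it.
class EntityGroupingComparator extends WritableComparator {
    protected EntityGroupingComparator() {
        super(EntityTimeKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((EntityTimeKey) a).entityId.compareTo(((EntityTimeKey) b).entityId);
    }
}

Wired up in the driver with job.setPartitionerClass(EntityPartitioner.class) and
job.setGroupingComparatorClass(EntityGroupingComparator.class). That is the part
that gets expensive, since everything has to go through the shuffle.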

