From: Georgi Ivanov
Date: Fri, 19 Sep 2014 11:06:58 +0200
To: user@hadoop.apache.org
Subject: Re: Re-sampling time data with MR job. Ideas
Hi Mirko,
Thanks for the reply.

Let's assume I have a record every second for each entity.

entity_id | timestamp | data

1 , 2014-01-01 12:13:01   <- I want this
... (records for other entities interleaved)
1 , 2014-01-01 12:13:02
1 , 2014-01-01 12:13:03
1 , 2014-01-01 12:13:04
1 , 2014-01-01 12:13:05
........
1 , 2014-01-01 12:23:01   <- I want this
1 , 2014-01-01 12:23:02

The problem is that in reality the data does not arrive sorted by (entity_id, timestamp),
so I can't filter in the mapper.
Each mapper sees records for many different entity_ids, depending on its input split.
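One thing worth noting, though: the 10-minute bucket of a record depends only on that record's own timestamp, not on input order, so each mapper can still compute a bucket key per record even on unsorted splits. A small sketch (class and method names are my own, not from the thread):

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TimeBucket {
    static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Round a timestamp down to the start of its 10-minute window.
    // Needs no ordering: one record in, one bucket out.
    static LocalDateTime bucket10min(String ts) {
        LocalDateTime t = LocalDateTime.parse(ts, FMT);
        return t.withMinute(t.getMinute() / 10 * 10).withSecond(0);
    }
}
```

A mapper could emit (entity_id, bucket) as the key regardless of how the input is sorted; the sorting problem only matters for deciding *which* record in a bucket to keep.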



Georgi

On 19.09.2014 10:34, Mirko Kämpf wrote:
Hi Georgi,

I would emit the new timestamp (with 10-minute resolution) already in the mapper. That lets you (pre)aggregate the data in the mapper, so there is less traffic during the shuffle & sort stage. Does changing the resolution mean you have to aggregate the individual entities, or do you still need all individual records and just want to translate the timestamp to another resolution (5 s => 10 min)?
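A minimal in-memory sketch of this idea (illustrative names, plain Java rather than the Hadoop Mapper/Combiner API): key each record by (entity_id, 10-minute bucket) and keep only one record per key on the map side, the way a Combiner would, so far fewer records cross the shuffle.

```java
import java.util.HashMap;
import java.util.Map;

public class PreAggregate {
    // Keeps the first value seen per (entity, bucket) key; a real job
    // would do the same in Mapper.map() plus a Combiner.
    static Map<String, String> mapSide(String[][] records) {
        Map<String, String> out = new HashMap<>();
        for (String[] r : records) {                 // r = {entity, ts, data}
            String minute = r[1].substring(0, 15);   // "yyyy-MM-dd HH:m"
            String key = r[0] + "|" + minute + "0";  // truncate to 10-min bucket
            out.putIfAbsent(key, r[2]);              // keep first record in bucket
        }
        return out;
    }
}
```

With this, the shuffle carries at most one record per entity per 10-minute window instead of one per second.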

Cheers,
Mirko




2014-09-19 9:17 GMT+01:00 Georgi Ivanov <ivanov@vesseltracker.com>:
Hello,
I have time-related data like this:
entity_id, timestamp, data

The resolution of the data is roughly 5 seconds.
I want to extract the data at 10-minute resolution.

So what I can do is:
Just emit everything in the mapper, as the data is not sorted there.
Emit only one record per 10 minutes from the reducer. The reducer receives data sorted by the (entity_id, timestamp) pair (secondary sort).
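The reducer step described here could look roughly like this, as plain Java rather than the Reducer API (a sketch that assumes clock-aligned 10-minute windows, which may differ from what is wanted):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowFilter {
    // Timestamps for one entity, sorted ascending via secondary sort,
    // format "yyyy-MM-dd HH:mm:ss". Emit the first record of each
    // clock-aligned 10-minute window.
    static List<String> firstPerWindow(List<String> sortedTs) {
        List<String> kept = new ArrayList<>();
        String lastBucket = null;
        for (String ts : sortedTs) {
            String bucket = ts.substring(0, 15);  // 10-min bucket prefix
            if (!bucket.equals(lastBucket)) {
                kept.add(ts);
                lastBucket = bucket;
            }
        }
        return kept;
    }
}
```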

This will work, but it will take forever, since I have to process TBs of data.
Also, the data emitted to the reducers will be huge (as I am not filtering in the map phase at all), and the number of reducers is much smaller than the number of mappers.

Are there any better ideas for how to do this?

Georgi

