From: Georgi Ivanov
Date: Fri, 19 Sep 2014 10:17:15 +0200
To: "user@hadoop.apache.org"
Subject: Re-sampling time data with MR job. Ideas

Hello,

I have time-related data like this:

entity_id, timestamp, data

The resolution of the data is about 5 seconds. I want to extract the data at 10-minute resolution.

What I can do is:

Emit everything in the mapper, since the data is not sorted there.
Emit only one record every 10 minutes from the reducer. The reducer receives the data sorted by the (entity_id, timestamp) pair (secondary sort).

This will work, but it will take forever, since I have to process terabytes of data. The data sent to the reducers will also be huge (I am not filtering in the map phase at all), and the number of reducers is much smaller than the number of mappers.

Are there any better ideas for how to do this?

Georgi
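
For reference, the reducer-side step described above could be sketched roughly as follows. This is plain Python rather than actual Hadoop code, and it rests on assumptions not stated in the message: timestamps are integer seconds, "every 10 minutes" means keeping the first record that falls into each fixed 600-second bucket per entity, and the names `downsample_sorted` and `INTERVAL_SECONDS` are hypothetical.

```python
from typing import Iterable, Iterator, Tuple

# (entity_id, timestamp in seconds, data) -- field types are assumed
Record = Tuple[str, int, str]

INTERVAL_SECONDS = 600  # 10 minutes


def downsample_sorted(records: Iterable[Record],
                      interval: int = INTERVAL_SECONDS) -> Iterator[Record]:
    """Yield the first record of each (entity_id, time bucket).

    Assumes the records arrive sorted by (entity_id, timestamp),
    as they would after the secondary sort mentioned above.
    """
    last_key = None
    for entity_id, timestamp, data in records:
        # Records in the same fixed-width bucket share the same key.
        key = (entity_id, timestamp // interval)
        if key != last_key:
            yield (entity_id, timestamp, data)
            last_key = key


if __name__ == "__main__":
    # One hypothetical entity sampled every 5 seconds for 30 minutes.
    sample = [("entity-1", t, f"data@{t}") for t in range(0, 1800, 5)]
    for record in downsample_sorted(sample):
        print(record)
```

With the sample input, 360 records are reduced to three, one per 10-minute bucket. The sketch only shows the thinning logic. It does not address the shuffle volume raised in the message.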