From: Saptarshi Guha
Reply-To: saptarshi.guha@gmail.com
Date: Fri, 22 Feb 2013 12:17:44 -0800
Subject: Single JVM, many tasks: How do I know when I'm on the last map task
To: Hadoop user

Hello,

In my Java Hadoop job, I have set the reuse variable to -1, hence a JVM
will process multiple tasks.

I have also seen to it that, instead of being written to the job context,
the keys and values are accumulated in a hashtable. When the bytes written
to this table reach BUFSIZE (e.g. 150MB), I call my reducer (or what some
call a combiner) inside the map task.

However, if BUFSIZE is never reached, my reducer is never called, so I
have to flush the table. I could flush it in the map class's 'cleanup'
method. In that case, the data would be rewritten to the same hashtable.

But at some point this hashtable must be written to the job context and
on to the Hadoop reduce stage. The way I see it, if I intend to share this
hashtable across map tasks (within the same JVM), I need to know when the
JVM has reached its final map task. When that task is complete, I know I
*must* flush the table to the job context.

Hopefully I've been somewhat clear. Does Hadoop 0.20.2 have an API that
tells the child JVM if it's on the last map task?

Cheers
Saptarshi
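
To make the setup concrete, here is a minimal, self-contained sketch of the
buffering described above. It is an illustration, not the actual job code:
the class name CombiningBuffer, the sink callback (standing in for a write
to the job context), and the use of a running sum as the combine step are
all assumptions, and it needs Java 8 or later.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

/** Hypothetical in-mapper combining buffer; not part of the Hadoop API. */
final class CombiningBuffer {
    private final Map<String, Long> table = new HashMap<>();
    private final long bufSize;                   // BUFSIZE threshold, in bytes
    private final BiConsumer<String, Long> sink;  // stands in for a job-context write
    private long bytesWritten = 0;                // bytes added since the last flush

    CombiningBuffer(long bufSize, BiConsumer<String, Long> sink) {
        this.bufSize = bufSize;
        this.sink = sink;
    }

    /** Called where map() would otherwise write (key, value) to the context. */
    void add(String key, long value) {
        // Rough size estimate: UTF-8 key bytes plus 8 bytes for the long value.
        bytesWritten += key.getBytes(StandardCharsets.UTF_8).length + Long.BYTES;
        // Assumed combine step: sum the values per key.
        table.merge(key, value, Long::sum);
        if (bytesWritten >= bufSize) {
            flush();
        }
    }

    /** Emits every buffered pair to the sink and empties the table. */
    void flush() {
        table.forEach(sink);
        table.clear();
        bytesWritten = 0;
    }

    /** Number of distinct keys currently buffered. */
    int size() {
        return table.size();
    }

    public static void main(String[] args) {
        CombiningBuffer buf = new CombiningBuffer(
                30, (k, v) -> System.out.println(k + "\t" + v));
        buf.add("a", 1L);
        buf.add("a", 2L);
        buf.add("b", 5L);
        // BUFSIZE was not reached, so nothing has been emitted yet;
        // this explicit flush is the step the question is about.
        buf.flush();
    }
}
```

In a real job the buffer could be held in a static field so that map tasks
sharing a JVM see the same table. The sketch does not answer the question
above: it only shows why a final flush() is needed when BUFSIZE is never
reached.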