Date: Mon, 14 Nov 2011 17:13:47 +0900
Subject: Re: #Task setting and IO
From: "Edward J. Yoon"
To: hama-dev@incubator.apache.org

> set the #bsptasks to what the split calculated. *What if this exceeds the
> cluster capacity?*

I think there are two options:

1) Fix the computeSplitSize() method to return the max split length, so that the number of splits does not exceed the cluster capacity.
2) Or assign an array of splits (one or more splits) to each task.
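To make the two options concrete, here is a rough sketch in Java. It is only an illustration under assumptions: the class and method names (SplitAssignmentSketch, cappedSplitSize, numSplits, assignSplits) are hypothetical and not taken from the Hama code base, the real computeSplitSize() may take different parameters, and "cluster capacity" is modeled as a plain maxTasks integer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names exist in Hama.
final class SplitAssignmentSketch {

    // Option 1: enlarge the split size until the number of splits
    // fits into the cluster capacity (maxTasks).
    static long cappedSplitSize(long totalBytes, long defaultSplitSize, int maxTasks) {
        if (maxTasks <= 0) {
            throw new IllegalArgumentException("maxTasks must be positive");
        }
        // Smallest split size that yields at most maxTasks splits: ceil(totalBytes / maxTasks).
        long minSizeForCapacity = (totalBytes + maxTasks - 1) / maxTasks;
        return Math.max(defaultSplitSize, Math.max(1L, minSizeForCapacity));
    }

    // Number of splits produced for a given split size (ceiling division).
    static long numSplits(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize;
    }

    // Option 2: keep the splits as they are and hand each task a list of
    // split indices, distributed round-robin over at most maxTasks tasks.
    static List<List<Integer>> assignSplits(int splitCount, int maxTasks) {
        if (maxTasks <= 0) {
            throw new IllegalArgumentException("maxTasks must be positive");
        }
        int taskCount = Math.min(splitCount, maxTasks);
        List<List<Integer>> tasks = new ArrayList<>();
        for (int t = 0; t < taskCount; t++) {
            tasks.add(new ArrayList<>());
        }
        for (int s = 0; s < splitCount; s++) {
            tasks.get(s % taskCount).add(s);
        }
        return tasks;
    }
}
```

With option 1 each task still reads exactly one split, but the splits get larger. With option 2 the split size stays unchanged and some tasks read several splits, so the task's input reader would have to iterate over more than one split.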

On Mon, Nov 14, 2011 at 3:38 PM, Thomas Jungblut wrote:
> Hey,
>
> I am unclear on several points about how the number of tasks is set, and I
> don't think it currently works correctly.
>
> Let's go through some scenarios:
>
> 1. User defines no input, only the number of tasks: "vanilla" Hama
> behaviour -> check whether the number of tasks fits in the cluster and
> then run.
>
> 2. User defines input, no number of tasks and no partitioner -> this should
> set the #bsptasks to what the split calculated. *What if this exceeds the
> cluster capacity?*
>
> 3. User defines input, number of tasks and a partitioner -> this should
> partition the dataset via the partitioner into >number of tasks< files and
> let the file input split assign the files to the tasks.
>
> 4. User defines already-partitioned input (e.g. the output of an M/R
> job), and nothing else -> What do you think this should do?
>
> Scenario 4 is the most important, I guess, because a MapReduce job
> partitions the data faster than our partitioner, especially for large inputs.
> And I don't actually know whether all these steps are the way we want it.
> What do you think?
>
> --
> Thomas Jungblut
> Berlin
>

--
Best Regards, Edward J. Yoon
@eddieyoon