incubator-hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: #Task setting and IO
Date Mon, 14 Nov 2011 08:13:47 GMT
> set the #bsptasks to what the split calculated. *What if this exceeds
> the cluster capacity?*
I think there are two options.

1) Fix the computeSplitSize() method to return a split size large
enough that the resulting number of splits stays below the cluster
capacity.

2) Or assign an array of splits (one or more) to each task.
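A minimal sketch of option 1, assuming a simplified signature; the method
and parameter names (computeSplitSize, maxTasks) are illustrative and not
the actual Hama API:

```java
// Hypothetical sketch: cap the number of splits at the cluster's task
// capacity by growing the split size. Not the real Hama implementation.
public class SplitSizing {

  /**
   * Returns a split size large enough that the number of splits
   * (totalBytes / result) does not exceed maxTasks.
   */
  static long computeSplitSize(long totalBytes, long blockSize,
                               long minSize, int maxTasks) {
    long size = Math.max(minSize, blockSize);
    // Smallest split size that yields at most maxTasks splits (ceil division).
    long needed = (totalBytes + maxTasks - 1) / maxTasks;
    return Math.max(size, needed);
  }

  public static void main(String[] args) {
    long total = 10L * 1024 * 1024 * 1024;  // 10 GB of input
    long split = computeSplitSize(total, 64L * 1024 * 1024, 1, 20);
    System.out.println(total / split);       // prints 20 (capped at capacity)
  }
}
```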

On Mon, Nov 14, 2011 at 3:38 PM, Thomas Jungblut
<thomas.jungblut@googlemail.com> wrote:
> Hey,
>
> I have several unclear points about how the number of tasks is set, and I
> don't think it currently works correctly.
>
> Let's make some scenarios:
>
> 1. User defines no input and number of tasks: "vanilla"-hama behaviour ->
> Check if the number of tasks fits in the cluster and then run.
>
> 2. User defines input, no number of tasks and no partitioner -> this should
> set the #bsptasks to what the split calculated. *What if this exceeds the
> cluster capacity?*
>
> 3. User defines input, number of tasks and a partitioner -> this should
> partition the dataset via the partitioner to >number of tasks< files and
> let the fileinput split assign the files to the tasks.
>
> 4. User defines already-partitioned input (e.g. the output of a M/R
> job), and nothing else -> What do you think this should do?
>
> Part 4 is the most important I guess, because a mapreduce job partitions
> the data faster than our partitioner, especially for large inputs.
> And I don't actually know if all these steps are the way we want it.
> What do you think?
>
> --
> Thomas Jungblut
> Berlin <thomas.jungblut@gmail.com>
>
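The four scenarios above could be sketched as a single task-count decision;
this is illustrative only, with hypothetical names (decideTasks and its
parameters are not Hama's API), and the capacity cap in scenarios 2 and 4 is
exactly the open question of this thread:

```java
// Hypothetical sketch of the scenario dispatch discussed above.
public class TaskCountDecision {

  static int decideTasks(boolean hasInput, int requestedTasks,
                         boolean hasPartitioner, int splitCount,
                         int clusterCapacity) {
    if (!hasInput) {
      // Scenario 1: "vanilla" Hama - honor the user's request if it fits.
      if (requestedTasks > clusterCapacity) {
        throw new IllegalArgumentException("tasks exceed cluster capacity");
      }
      return requestedTasks;
    }
    if (hasPartitioner && requestedTasks > 0) {
      // Scenario 3: partition the dataset into exactly requestedTasks files.
      return requestedTasks;
    }
    // Scenarios 2 and 4: task count follows the splits, capped at the
    // cluster capacity (one possible answer to the open question).
    return Math.min(splitCount, clusterCapacity);
  }

  public static void main(String[] args) {
    // Scenario 2: 50 splits on a 20-slot cluster -> capped at 20.
    System.out.println(decideTasks(true, 0, false, 50, 20)); // prints 20
  }
}
```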



-- 
Best Regards, Edward J. Yoon
@eddieyoon
