hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: Task Priorities
Date Sat, 08 Dec 2012 12:26:51 GMT
The 'GraphJobRunner' BSP program has already shown why a disk-based queue is
important: users can always run into memory issues.
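Roughly, the idea is something like this (just a hypothetical sketch with
made-up names, not Hama's actual queue code): once an in-memory limit is hit,
further messages are appended to a local file and read back later, so the
superstep survives instead of throwing OutOfMemoryError.

import java.io.*;
import java.util.ArrayDeque;
import java.util.Queue;

/**
 * Hypothetical sketch of a disk-spilling message queue (not Hama's real API):
 * messages beyond an in-memory limit are appended to a local file so the
 * superstep does not exhaust the heap.
 */
public class DiskSpillingQueue {

  private final Queue<String> inMemory = new ArrayDeque<String>();
  private final int memoryLimit;       // max number of messages kept on the heap
  private final File spillFile;        // overflow goes here
  private DataOutputStream spillOut;
  private DataInputStream spillIn;
  private long spilledCount = 0;

  public DiskSpillingQueue(int memoryLimit, File spillFile) {
    this.memoryLimit = memoryLimit;
    this.spillFile = spillFile;
  }

  public void add(String msg) throws IOException {
    if (inMemory.size() < memoryLimit) {
      inMemory.add(msg);
      return;
    }
    if (spillOut == null) {
      spillOut = new DataOutputStream(
          new BufferedOutputStream(new FileOutputStream(spillFile)));
    }
    spillOut.writeUTF(msg);            // spill to disk instead of growing the heap
    spilledCount++;
  }

  /** Returns the next message, or null when the queue is exhausted. */
  public String poll() throws IOException {
    if (!inMemory.isEmpty()) {
      return inMemory.poll();
    }
    if (spilledCount == 0) {
      return null;
    }
    if (spillIn == null) {             // switch from writing to reading back
      spillOut.close();
      spillIn = new DataInputStream(
          new BufferedInputStream(new FileInputStream(spillFile)));
    }
    spilledCount--;
    return spillIn.readUTF();
  }
}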

But I'm talking about our task priorities. High-performance computers
and their parts are cheap and getting cheaper. And I'm sure
message-passing and in-memory technologies are receiving attention as
a near-future trend.

In my case, I have 40 GB of memory per node. I want to confirm (ASAP)
whether Hama is a good candidate. Hama can't process large data yet, but
the Hama team is currently working on YARN, fault tolerance, and the
disk-based queue.

On Sat, Dec 8, 2012 at 6:28 PM, Thomas Jungblut
<thomas.jungblut@gmail.com> wrote:
> Yes, that's nothing new; my rule of thumb is 10x the input size.
> Which is bad, but scalability must be addressed on multiple levels.
> Spilling the graph to disk is just one part, because it consumes at least
> half of the memory for really sparse graphs.
> The other part is messaging; removing the bundling and the compression will not
> save you much space.
> We are writing messages to disk for fault tolerance anyway, so why not
> write them to disk directly and then bundle/compress on the fly while sending
> (e.g. in 32 MB chunks)?
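
A rough, hypothetical sketch of that idea (not Hama's actual sender code):
serialize outgoing messages into a fixed-size buffer and compress/flush each
chunk to the peer as soon as it fills, instead of bundling everything in
memory first.

import java.io.*;
import java.util.Iterator;
import java.util.zip.GZIPOutputStream;

/**
 * Hypothetical sketch: stream outgoing messages in ~32 MB chunks,
 * compressing each chunk as it is sent rather than bundling all
 * messages in memory before transfer.
 */
public class ChunkedMessageSender {

  private static final int CHUNK_SIZE = 32 * 1024 * 1024;  // ~32 MB per chunk

  public void send(Iterator<byte[]> messages, OutputStream peer) throws IOException {
    ByteArrayOutputStream chunk = new ByteArrayOutputStream(CHUNK_SIZE);
    DataOutputStream out = new DataOutputStream(chunk);
    while (messages.hasNext()) {
      byte[] msg = messages.next();
      out.writeInt(msg.length);            // length-prefix each message
      out.write(msg);
      if (chunk.size() >= CHUNK_SIZE) {    // chunk is full: compress and ship it
        flushChunk(chunk, peer);
      }
    }
    if (chunk.size() > 0) {                // ship the last partial chunk
      flushChunk(chunk, peer);
    }
  }

  private void flushChunk(ByteArrayOutputStream chunk, OutputStream peer) throws IOException {
    GZIPOutputStream gzip = new GZIPOutputStream(peer);
    chunk.writeTo(gzip);                   // compress only this chunk
    gzip.finish();                         // finish the gzip member, keep the peer stream open
    peer.flush();
    chunk.reset();                         // reuse the buffer for the next chunk
  }
}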
>
> 2012/12/8 Edward J. Yoon <edwardyoon@apache.org>
>
>> A task is created per input split, and input splits are created one per
>> block of each input file by default. If the block size is 60~200 MB,
>> 1~3 GB of memory per task is enough.
>>
>> Yeah, there's still a queueing/messaging scalability issue, as you
>> know. However, in my experience the message bundler and
>> compressor are mainly responsible for poor scalability and consume
>> huge amounts of memory. This is more urgent than the "queue".
>>
>> On Sat, Dec 8, 2012 at 2:05 AM, Thomas Jungblut
>> <thomas.jungblut@gmail.com> wrote:
>> >>
>> >>  not disk-based.
>> >
>> >
>> > So how do you want to achieve scalability without that?
>> > In order to process tasks independently of each other (not in parallel, but
>> > e.g. in small mini-batches), you have to save the state. RAM is limited
>> and
>> > can't store huge states (persistently, in case of crashes).
>> >
>> > 2012/12/7 Suraj Menon <surajsmenon@apache.org>
>> >
>> >> On Thu, Dec 6, 2012 at 8:27 PM, Edward J. Yoon <edwardyoon@apache.org
>> >> >wrote:
>> >>
>> >> > I think large data processing capability is more important than fault
>> >> > tolerance at the moment.
>> >> >
>> >>
>> >> +1
>> >>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> @eddieyoon
>>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
