incubator-hama-dev mailing list archives

From "Edward J. Yoon" <edwardy...@apache.org>
Subject Re: [DISCUSS] "Fault Tolerant" system design and consideration of compatibility for the hadoop nextgen or mesos platform.
Date Fri, 01 Jul 2011 07:31:34 GMT
I wonder whether a Hama job really couldn't be executed on MR2 once we
have our own BSP computing engine.

If it could, what's the reason we can't support both an MR2 version and
our own BSP cluster version?

On Tue, Mar 29, 2011 at 1:33 PM, Chia-Hung Lin <clin4j@googlemail.com> wrote:
> Failure detection was introduced to work around the FLP impossibility
> result for consensus, but it is also useful in asynchronous distributed
> systems in general. For instance, [1] presents an accrual failure
> detector and argues that such a service is valuable for system
> management, replication, etc. [2] proposes that failure detection
> should be a basic service for distributed systems, supporting the
> scenario where failures occur.
>
> Regarding worker failure, I think Hadoop metrics can be applied to
> collect the internal statistics of the groomserver/JVM. But for
> quickly identifying when a failure occurs (whether a network or host
> failure) without knowing the internal state of a process, phi accrual
> failure detection seems the better fit. Failure detection would also
> be required when considering the scenario of a master failure.
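The phi accrual detection mentioned above can be sketched roughly as follows. This is a minimal toy sketch, not Hama code; the class name is hypothetical, and the exponential approximation of the heartbeat inter-arrival distribution is a simplifying assumption (the original phi accrual paper models intervals with a normal distribution). Phi grows as the silence since the last heartbeat becomes improbable given the observed history, so the consumer picks a suspicion threshold instead of a fixed timeout:

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Minimal sketch of a phi accrual failure detector (hypothetical class,
 *  not Hama code). Phi rises as the time since the last heartbeat becomes
 *  improbable given the observed inter-arrival intervals. */
public class PhiAccrualSketch {
    private final Deque<Double> intervals = new ArrayDeque<>();
    private final int window;       // how many recent intervals to keep
    private double lastHeartbeat = -1;

    public PhiAccrualSketch(int window) { this.window = window; }

    /** Record a heartbeat arrival time (seconds). */
    public void heartbeat(double now) {
        if (lastHeartbeat >= 0) {
            intervals.addLast(now - lastHeartbeat);
            if (intervals.size() > window) intervals.removeFirst();
        }
        lastHeartbeat = now;
    }

    /** phi = -log10(P(a heartbeat may still arrive)), here approximated
     *  with an exponential inter-arrival distribution. */
    public double phi(double now) {
        if (intervals.isEmpty()) return 0.0;
        double mean = intervals.stream()
                .mapToDouble(Double::doubleValue).average().orElse(1.0);
        double pLater = Math.exp(-(now - lastHeartbeat) / mean);
        return -Math.log10(pLater);
    }

    public static void main(String[] args) {
        PhiAccrualSketch d = new PhiAccrualSketch(100);
        for (int t = 0; t <= 10; t++) d.heartbeat(t);  // steady 1s heartbeats
        System.out.println(d.phi(11.0));  // one interval late: low suspicion
        System.out.println(d.phi(30.0));  // 20s of silence: strong suspicion
    }
}
```

With steady 1-second heartbeats, being one interval late yields phi well below 1, while 20 seconds of silence pushes phi above 8, which illustrates Edward's point below: the threshold chosen on phi decides how aggressive the detector is.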
>
> The overall design, I think, can basically mimic what has been done in
> hadoop, while taking further steps to prevent the issues that have
> previously occurred in hadoop, if any.
>
> [1] A gossip-style failure detection service.
> http://portal.acm.org/citation.cfm?id=866975
> [2] A Fault Detection Service for Wide Area Distributed Computations.
> http://portal.acm.org/citation.cfm?id=823194
>
>
>
> 2011/3/29 Edward J. Yoon <edwardyoon@apache.org>:
>> I'm reading about the ϕ failure detector. It seems widely used for
>> distributed databases; I guess the reason is the real-time nature of
>> database operations. If I am wrong, please correct me.
>>
>> In our case, it's batch job processing, and I'm not sure we really
>> need to adopt the ϕ failure detector. During job processing, overly
>> sensitive detection would not be a help, but rather a hindrance.
>>
>> What do you think?
>>
>> On Mon, Mar 28, 2011 at 3:14 PM, Edward J. Yoon <edwardyoon@apache.org> wrote:
>>> I'm attaching the IRC chat log here.
>>>
>>> In this thread, we'll talk about HAMA-370 "Fault Tolerant" system
>>> design and future architecture. :)
>>>
>>> ----
>>> [14:14] <edyoon_korea> I'm heading out to lunch. CU~
>>> [14:25] <chl5011> Sorry, I cannot see the difference. I think that's
>>> because I view adapting to e.g. mapreduce 2.0 the same as standalone
>>> mode; both of them have fault tolerance, etc. features. Why would
>>> users want to run hama without those features?
>>> [14:29] <chl5011> Just curious. I am not keen on porting anything to
>>> a new arch (e.g. mesos) immediately, before the issues become clear.
>>> It is just that when thinking about the fault tolerance issue, we may
>>> also need to take the communication, nexus (master/workers), etc.
>>> issues into account.
>>> [14:38] <edyoon_korea> Oh, OK. I think it's a miscommunication. 1)
>>> Basically, the hama cluster should be able to handle its jobs without
>>> outside help. 2) At the same time, we should consider compatibility
>>> with hadoop or mesos. Right?
>>> [14:46] <chl5011> Regarding the first issue, it looks like mesos or
>>> mapreduce 2.0 is not suitable for hama, because they separate
>>> scheduling from the original function of the master server (in our
>>> case, the bspmaster).
>>> [14:48] <chl5011> Then we might take the original approach, which
>>> simply makes the bspmaster fault tolerant (zookeeper + multiple
>>> masters) and makes tasks fault tolerant with e.g. checkpointing +
>>> re-executing failed tasks.
>>> [14:54] <edyoon_korea> yes.
>>> [14:56] <edyoon_korea> so i'm still not sure, that HAMA-370 is really
>>> necessary for us.
>>> [14:59] <chl5011> I think HAMA-370 can be seen as part of HAMA-363,
>>> as monitoring is a broader issue which should cover probing for
>>> process failures.
>>> [15:00] <chl5011> the master can deterministically identify whether a
>>> process has failed, without needing to know the usage of the network,
>>> etc., because the worker does not send any report back (using udp).
>>> [15:02] <chl5011> But I think we should implement HAMA-363 because it
>>> covers more issues, such as which groomserver the master should
>>> assign a task to.
>>> [15:07] <edyoon_korea> if you are OK, let's move this discussion to
>>> our mailing list.
>>> [15:08] <chl5011> np.
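The "checkpoint + re-executing failed tasks" idea discussed in the log can be sketched roughly as follows. This is a hypothetical toy, not Hama's actual API: in a real cluster the checkpoint would be written to durable storage (e.g. HDFS) and the master would reschedule the task on another groomserver, but the resume-from-the-last-completed-superstep logic is the same idea:

```java
import java.util.function.IntUnaryOperator;

/** Toy sketch of superstep checkpointing + re-execution (hypothetical
 *  code, not Hama's API). State is checkpointed after each superstep;
 *  a retried run resumes from the last completed superstep. */
public class CheckpointSketch {
    private int lastSuperstep = -1; // last superstep whose state was saved
    private int savedState;         // in-memory stand-in for a durable checkpoint

    /** Runs supersteps from lastSuperstep+1 up to `supersteps`,
     *  simulating a crash at superstep failAt (pass -1 for no failure). */
    public int run(int initial, int supersteps, IntUnaryOperator step, int failAt) {
        int state = lastSuperstep < 0 ? initial : savedState;
        for (int s = lastSuperstep + 1; s < supersteps; s++) {
            if (s == failAt) throw new RuntimeException("failed at superstep " + s);
            state = step.applyAsInt(state);
            savedState = state;   // checkpoint after the superstep barrier
            lastSuperstep = s;
        }
        return state;
    }

    public static void main(String[] args) {
        CheckpointSketch task = new CheckpointSketch();
        IntUnaryOperator doubler = x -> 2 * x;
        try {
            task.run(1, 5, doubler, 3);       // crashes at superstep 3
        } catch (RuntimeException e) {
            // re-execute: resumes from superstep 3, not from scratch
        }
        System.out.println(task.run(1, 5, doubler, -1)); // 32 == 1 * 2^5
    }
}
```

The retry only repeats the supersteps after the last checkpoint, which is what makes re-execution cheap enough to be the task-level half of the plan, alongside zookeeper + multiple masters for the bspmaster itself.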
>>>
>>>
>>> --
>>> Best Regards, Edward J. Yoon
>>> http://blog.udanax.org
>>> http://twitter.com/eddieyoon
>>>
>>
>>
>>
>> --
>> Best Regards, Edward J. Yoon
>> http://blog.udanax.org
>> http://twitter.com/eddieyoon
>>
>
>
>
> --
> ChiaHung Lin @ nuk, tw
>



-- 
Best Regards, Edward J. Yoon
@eddieyoon
