incubator-s4-dev mailing list archives

From Arun C Murthy <...@hortonworks.com>
Subject Re: S4 & YARN
Date Sat, 21 Jul 2012 21:07:31 GMT
Thanks Daniel. Your proposal makes much more sense to me now! :)

Maybe we should continue discussions on https://issues.apache.org/jira/browse/S4-25, I'm happy
to help there.

Arun

On Jul 20, 2012, at 12:59 PM, Daniel Gómez Ferro wrote:

> Hi Arun,
> 
> I reply inline.
> 
> On Fri Jul 20 16:46:39 2012, Arun C Murthy wrote:
>> Daniel,
>> 
>>> * The user submits an S4 application using the S4 YARN client,
>>> specifying the number of nodes it needs, the s4r location, etc.
>>> * The S4 YARN client does the following:
>>> - connects to ZK and configures a cluster with the given parameters
>>> - deploys the s4r using ZK
>>> - starts the S4 ApplicationMaster
>>> * The S4 ApplicationMaster does the following:
>>> - sends a request to the ResourceManager for the given number of nodes
>>> - configures the job, which is just an S4 node, specifying the
>>> cluster it belongs to, the user supplied parameters, etc.
>>> - submits the job and monitors it
>> 
>> I'm an S4 neophyte, but the flow looks like a great start! I'm happy
>> to help with details of YARN if you would like.
> 
> That would be great. As I said, I only started playing with YARN recently, so there could be some holes in my understanding :)
> 
>> 
>> One question: What does 'configure a cluster' imply? Do you see that
>> done for every S4 application via the S4 YARN client? Would that be a
>> big overhead for every S4 application?
> 
> I think I should have explained the S4 deployment workflow beforehand:
> 
> When I talked about 'clusters' I was referring to S4 clusters; an S4 cluster is a set of nodes that run one application.
> 
> * S4 relies on ZooKeeper for coordination, failure detection and also for deployment. Configuring an S4 cluster just means setting some metadata in ZK: the S4 cluster name, the number of partitions, etc.
> 
> * Each S4 cluster runs only one application
> 
> * Each S4 node belongs to only one cluster.
> 
> * Various S4 clusters may be deployed on a given infrastructure, and their apps may communicate through a pub-sub mechanism, provided they access the same ZooKeeper ensemble.
> 
> * Deployment is a 3-step operation:
> 1. an S4R archive is made available somewhere through HTTP, a file system or some other protocol,
> 2. the user publishes an application to the S4 cluster manager (ZooKeeper),
> 3. participating nodes are notified, download the app and start running it.
> 
> * S4 applications usually don't "finish" if you use them for processing unbounded streams.
If the stream is bounded, upon end of stream, all participating S4 nodes would stop and the
resources allocated by YARN should probably be released.
> 
> 
> In summary, the current workflow is the following:
> 
> 1.- Define an S4 cluster in ZK (name, partitions, etc.)
> 2.- Start S4 nodes for that cluster (you could start more nodes than partitions for fault tolerance).
> 3.- Deploy the application (this registers a URI location for the app in ZK; S4 nodes will get notified, download the app and start running it).
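
The three steps above can be sketched in a few lines of Python. This is only an illustration of the ordering, not S4 code: `Coordinator` is a hypothetical in-memory stand-in for the ZooKeeper ensemble, and `Cluster`, `Node`, and every method name are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class Node:
    """Hypothetical stand-in for an S4 node; it belongs to exactly one cluster."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.running_app = None

    def on_app_published(self, uri):
        # A real node would download the S4R archive from `uri` before running it.
        self.running_app = uri


@dataclass
class Cluster:
    """Metadata that would be stored in ZK for one S4 cluster (assumed shape)."""
    name: str
    partitions: int
    app_uri: Optional[str] = None
    nodes: List[Node] = field(default_factory=list)


class Coordinator:
    """In-memory stand-in for ZooKeeper, used only to show the workflow order."""

    def __init__(self):
        self.clusters = {}

    def define_cluster(self, name, partitions):
        # Step 1: register the cluster metadata.
        self.clusters[name] = Cluster(name, partitions)

    def start_node(self, cluster_name, node_id):
        # Step 2: a node joins one cluster; extra nodes act as standbys.
        cluster = self.clusters[cluster_name]
        node = Node(node_id)
        cluster.nodes.append(node)
        if cluster.app_uri is not None:
            # Nodes that join after deployment still pick up the app.
            node.on_app_published(cluster.app_uri)
        return node

    def deploy(self, cluster_name, uri):
        # Step 3: record the app location and notify participating nodes.
        cluster = self.clusters[cluster_name]
        cluster.app_uri = uri
        for node in cluster.nodes:
            node.on_app_published(uri)
```

For example, defining a cluster with 2 partitions, starting 3 nodes, and then calling `deploy` leaves all 3 nodes holding the same app URI, with the third node playing the role of a standby.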
> 
> 
> Btw, Matthieu has helped me write this up; we will document this in the S4 wiki as well.
> 
> Answering your question: I don't think configuring a cluster for each S4 app is a big overhead, because S4 apps are usually long-lived and the configuration step is not costly.
> 
>> 
>> Maybe a future enhancement would be to share the same cluster across
>> multiple S4 applications by breaking apart a YARN S4-DEPLOY
>> application (client+appmaster) to first set up an S4 cluster and then
>> another YARN S4 application to just run S4 apps in the already
>> deployed cluster? Thoughts? Am I hopelessly off-base? *smile*
> 
> Currently in S4 each node can only run one application, so it wouldn't be possible to
reuse them, but it may be worth considering for the future.
> 
> On the other hand, I thought the idea of YARN was to start independent/isolated containers
(S4 nodes) for each app. How does it work with MapReduce? Do you run X map processes + Y reduce
processes and then reuse those processes for all user apps or do you start new map and reduce
processes for each app?
> 
> Thanks a lot for your feedback!
> 
> Daniel
> 
>> 
>> thanks,
>> Arun
>> 
>> On Jul 18, 2012, at 9:49 AM, Daniel Gómez Ferro wrote:
>> 
>>> Hi all,
>>> 
>>> I've been playing a bit with YARN and I think the integration with S4
>>> should be quite simple. For those unfamiliar with YARN, here's a
>>> simplification of how it works (check
>>> http://hadoop.apache.org/common/docs/r0.23.0/hadoop-yarn/hadoop-yarn-site/YARN.html
>>> for an in-depth view):
>>> 
>>> * There is one global ResourceManager (RM)
>>> * For each application the user submits, an ApplicationMaster (AM) is
>>> created
>>> * The AM requests resources (nodes) from the RM
>>> * When the AM has enough resources, it launches the job
>>> * Now the AM can monitor the job, the user can kill the AM, etc.
>>> 
>>> I modified the example application DistributedShell that comes in
>>> Hadoop and I was able to start s4 nodes through YARN. This simple
>>> experiment led me to the following ideas on how to integrate the S4
>>> workflow in YARN:
>>> 
>>> * The user submits an S4 application using the S4 YARN client,
>>> specifying the number of nodes it needs, the s4r location, etc.
>>> * The S4 YARN client does the following:
>>> - connects to ZK and configures a cluster with the given parameters
>>> - deploys the s4r using ZK
>>> - starts the S4 ApplicationMaster
>>> * The S4 ApplicationMaster does the following:
>>> - sends a request to the ResourceManager for the given number of nodes
>>> - configures the job, which is just an S4 node, specifying the
>>> cluster it belongs to, the user supplied parameters, etc.
>>> - submits the job and monitors it
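
The proposed client and ApplicationMaster flow can be sketched as follows. This is a toy model under stated assumptions, not the YARN API: `FakeResourceManager`, `S4ApplicationMaster`, and `submit` are hypothetical names, ZooKeeper is replaced by a plain dict, the partition count is assumed to equal the requested node count, and allocation is synchronous, whereas a real ResourceManager grants containers incrementally.

```python
class FakeResourceManager:
    """Hypothetical stand-in for the YARN ResourceManager."""

    def __init__(self, capacity):
        self.free = capacity
        self._next_id = 0

    def allocate(self, count):
        # Grant all requested containers or none (a simplification).
        if count > self.free:
            return []
        self.free -= count
        ids = [f"container-{self._next_id + i}" for i in range(count)]
        self._next_id += count
        return ids


class S4ApplicationMaster:
    """Hypothetical AM: asks for containers and starts one S4 node in each."""

    def __init__(self, rm, cluster_name, num_nodes):
        self.rm = rm
        self.cluster_name = cluster_name
        self.num_nodes = num_nodes
        self.launched = {}

    def run(self):
        containers = self.rm.allocate(self.num_nodes)
        if not containers:
            raise RuntimeError("not enough resources for the requested nodes")
        for container in containers:
            # The "job" is just an S4 node told which cluster it belongs to.
            self.launched[container] = {
                "cluster": self.cluster_name,
                "state": "RUNNING",
            }
        return self.launched


def submit(zk_state, rm, cluster_name, num_nodes, s4r_uri):
    """What the S4 YARN client would do, in the order listed above."""
    # 1. Configure the cluster in ZK (assumption: one partition per node).
    zk_state[cluster_name] = {"partitions": num_nodes, "app_uri": None}
    # 2. Deploy the s4r by registering its location in ZK.
    zk_state[cluster_name]["app_uri"] = s4r_uri
    # 3. Start the ApplicationMaster, which requests nodes and launches them.
    am = S4ApplicationMaster(rm, cluster_name, num_nodes)
    am.run()
    return am
```

Because the cluster and the s4r location are registered before any container starts, each S4 node finds its configuration and application in ZK as soon as it comes up.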
>>> 
>>> Of course there is more to the YARN integration than just starting an
>>> application, but I think this would be a functional enough first
>>> step. Maybe having the possibility to stop a running application
>>> would be nice as well.
>>> 
>>> Do you have any thoughts on how to improve this workflow?
>>> 
>>> Thanks!
>>> 
>>> Daniel
>>> 
>>> On Wed Jun 13 12:26:18 2012, Flavio Junqueira wrote:
>>>> I'm interested in this integration, but I need to wrap my head
>>>> around the deployment model of piper first.
>>>> 
>>>> -Flavio
>>>> 
>>>> On Jun 13, 2012, at 5:26 AM, Arun C Murthy wrote:
>>>> 
>>>>> Folks,
>>>>> 
>>>>> I'd like to start a discussion around getting S4 to run within
>>>>> Hadoop YARN in hadoop-2.
>>>>> 
>>>>> Brief background: Hadoop YARN is a generic resource management
>>>>> framework which aims to host multiple applications such as
>>>>> MapReduce, MPI etc. It would be very beneficial to get S4 running
>>>>> within YARN since it would make it much more accessible for Hadoop
>>>>> users to do real-time processing with S4 in their existing clusters
>>>>> with almost no operational overhead - thus, aiding S4 adoption.
>>>>> 
>>>>> I had a brief discussion with Flavio on this topic recently and he
>>>>> encouraged me to start a discussion on this list.
>>>>> 
>>>>> I see that https://issues.apache.org/jira/browse/S4-25 is already
>>>>> open for running S4 within YARN, but there hasn't been any activity since.
>>>>> 
>>>>> Some more details on YARN and writing applications:
>>>>> http://hadoop.apache.org/common/docs/r2.0.0-alpha/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
>>>>> 
>>>>> I'm happy to help anyone interested in working on this.
>>>>> 
>>>>> thanks,
>>>>> Arun
>>>>> 
>>>> 
>>>> flavio
>>>> junqueira
>>>> senior research scientist
>>>> 
>>>> fpj@yahoo-inc.com <mailto:fpj@yahoo-inc.com>
>>>> direct +34 93-183-8828
>>>> 
>>>> avinguda diagonal 177, 8th floor, barcelona, 08018, es
>>>> phone (408) 349 3300    fax (408) 349 3301
>>>> 
>> 
>> --
>> Arun C. Murthy
>> Hortonworks Inc.
>> http://hortonworks.com/
>> 
>> 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/


