hadoop-common-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: Distribution of native executables and data for YARN-based execution
Date Fri, 17 May 2013 17:15:27 GMT
Vinod,
Your first two items are spot on.  We don't expect to have the cluster to ourselves.  We also
expect to interop with existing HDFS data and want to schedule for data locality.
John


From: Vinod Kumar Vavilapalli [mailto:vinodkv@hortonworks.com]
Sent: Friday, May 17, 2013 11:08 AM
To: user@hadoop.apache.org
Subject: Re: Distribution of native executables and data for YARN-based execution


I have a bit of a conflict of interest given that I have worked on Hadoop YARN all this time, but..

I have worked on Torque/Condor-based resource management systems too. There are many
advantages to working on top of YARN; a couple are specifically relevant here:
 - MR and non-MR workloads on the same cluster (there are a few not-so-ready MR
implementations on existing schedulers, but with lots of limitations)
 - Data locality, which is native in Hadoop YARN and hard to simulate in other
schedulers (we have experience trying this in the past) - see the sketch below
 - Elastic resource management - jobs can grow and shrink elastically
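For the locality point, the request ends up looking roughly like the following untested
sketch against the YARN client API (the memory size and host names are illustrative, and
the exact ContainerRequest signature varies a bit by version):

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.util.Records;

public class LocalityRequest {
  // Ask the RM for a container on one of the hosts that hold the data.
  public static ContainerRequest near(String[] hostsWithData) {
    Resource capability = Records.newRecord(Resource.class);
    capability.setMemory(1024);      // 1 GB, illustrative only
    capability.setVirtualCores(1);

    Priority priority = Records.newRecord(Priority.class);
    priority.setPriority(0);

    // nodes = preferred hosts; racks left null so YARN can infer them.
    // Locality is relaxed by default, so the request can fall back to
    // other nodes if the preferred ones are busy.
    return new ContainerRequest(capability, hostsWithData, null, priority);
  }
}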

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/

On May 17, 2013, at 7:20 AM, Tim St Clair wrote:


Hi John -

If you are doing extensive non-MR, C-style batch work, you may be better served by looking
at the myriad existing schedulers (Torque, Condor, etc.), or by investigating the space
around interop (one cluster, many schedulers).

Either way, I recommend minimizing the dependency graph of your C application where possible
if you are working in a heterogeneous environment.

Cheers,
Tim


________________________________
From: "John Lilley" <john.lilley@redpoint.net<mailto:john.lilley@redpoint.net>>
To: user@hadoop.apache.org<mailto:user@hadoop.apache.org>
Sent: Friday, May 17, 2013 8:35:53 AM
Subject: RE: Distribution of native executables and data for YARN-based execution

Thanks!  This sounds exactly like what I need.  PUBLIC is right.

Do you know if this works for executables as well? For example, would there be any issue
transferring the executable bit on the file?

John

From: Vinod Kumar Vavilapalli [mailto:vinodkv@hortonworks.com]
Sent: Friday, May 17, 2013 12:56 AM
To: user@hadoop.apache.org
Subject: Re: Distribution of native executables and data for YARN-based execution


The "local resources" you mentioned is the exact solution for this. For each LocalResource,
you also mention a LocalResourceVisibility which takes one of the three values today - PUBLIC,
PRIVATE and APPLICATON.

PUBLIC resources are downloaded only once and shared by any application running on that node.

PRIVATE resources are downloaded only once and shared by any application run by the same user
on that node.

APPLICATION resources are downloaded per application and removed after the application finishes.

Seems like you want PUBLIC or PRIVATE.

Note that for PUBLIC resources to work, the corresponding files (and their parent directories)
need to be world-readable on HDFS too.

Also, if the remote files on HDFS are updated, the local copies will be downloaded afresh
on each node where your containers run.
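For reference, declaring a PUBLIC resource looks roughly like the following untested sketch
(the HDFS path and resource name are made up); the returned map is what you hand to
ContainerLaunchContext.setLocalResources():

import java.util.Collections;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.LocalResource;
import org.apache.hadoop.yarn.api.records.LocalResourceType;
import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
import org.apache.hadoop.yarn.util.ConverterUtils;
import org.apache.hadoop.yarn.util.Records;

public class PublicResourceExample {
  public static Map<String, LocalResource> publicResource(Configuration conf)
      throws Exception {
    FileSystem fs = FileSystem.get(conf);
    Path pkg = new Path("/apps/myapp/reference-data.tar.gz"); // hypothetical
    FileStatus status = fs.getFileStatus(pkg);

    LocalResource res = Records.newRecord(LocalResource.class);
    res.setResource(ConverterUtils.getYarnUrlFromPath(pkg));
    // Size and timestamp must match the HDFS copy; a changed timestamp
    // is what triggers re-localization on each node.
    res.setSize(status.getLen());
    res.setTimestamp(status.getModificationTime());
    res.setType(LocalResourceType.ARCHIVE); // unpacked on the node
    res.setVisibility(LocalResourceVisibility.PUBLIC);

    return Collections.singletonMap("refdata", res);
  }
}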

HTH

Thanks,
+Vinod Kumar Vavilapalli
Hortonworks Inc.
http://hortonworks.com/


On May 16, 2013, at 2:21 PM, John Lilley wrote:

I am attempting to distribute the execution of a C-based program onto a Hadoop cluster, without
using MapReduce.  I read that YARN can be used to schedule non-MapReduce applications by programming
to the ASM/RM interfaces.  As I understand it, eventually I get down to specifying each sub-task
via ContainerLaunchContext.setCommands().
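Something like this untested sketch is what I have in mind (the binary name is hypothetical):

import java.util.Collections;

import org.apache.hadoop.yarn.api.ApplicationConstants;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.util.Records;

public class LaunchSketch {
  public static ContainerLaunchContext nativeTask() {
    ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
    // The NodeManager runs this shell command in the container's working
    // directory; <LOG_DIR> expands to the container's log directory.
    ctx.setCommands(Collections.singletonList(
        "./myapp --part 0001"
        + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
        + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));
    return ctx;
  }
}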

However, the program and shared libraries need to be stored on each worker's local disk to
run. In addition, there is a hefty data set the application uses (say, 4 GB) that is accessed
via regular open()/read() calls from a library. I thought a decent strategy would be to push
the program+data package to a known folder in HDFS, then launch a "bootstrap" that compares
the HDFS folder version to a local folder, copying any updated files as needed before launching
the native application task.
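Roughly, the bootstrap would do something like the following untested sketch (paths are
hypothetical):

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Bootstrap {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path remote = new Path("/apps/myapp/package");     // hypothetical HDFS folder
    File local = new File("/var/cache/myapp/package"); // hypothetical local cache

    for (FileStatus s : fs.listStatus(remote)) {
      File target = new File(local, s.getPath().getName());
      // Copy only if missing locally or older than the HDFS copy.
      if (!target.exists() || target.lastModified() < s.getModificationTime()) {
        fs.copyToLocalFile(s.getPath(), new Path(target.getAbsolutePath()));
        // Restore the exec bit, which a plain copy does not carry over.
        target.setExecutable(true);
      }
    }
  }
}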

Are there better approaches? I notice that one can implicitly copy "local resources" as part
of the launch, but I don't want to copy 4 GB every time - only occasionally, when the
application or reference data is updated. Also, will my bootstrapper be allowed to set
executable-mode bits on the programs after they are copied?

Thanks
John




