hadoop-common-user mailing list archives

From Tim St Clair <tstcl...@redhat.com>
Subject Re: Distribution of native executables and data for YARN-based execution
Date Fri, 17 May 2013 14:20:22 GMT
Hi John - 

If you are doing extensive non-MapReduce C-style batch processing, you may be
better served by looking at the myriad existing schedulers (Torque, Condor,
etc.), or by investigating the interop space (one cluster, many schedulers). 

Either way, if you are working in a heterogeneous environment, I recommend
minimizing your C application's dependency graph where possible. 

Cheers, 
Tim 

----- Original Message -----

> From: "John Lilley" <john.lilley@redpoint.net>
> To: user@hadoop.apache.org
> Sent: Friday, May 17, 2013 8:35:53 AM
> Subject: RE: Distribution of native executables and data for YARN-based
> execution

> Thanks! This sounds exactly like what I need. PUBLIC is right.

> Do you know if this works for executables as well? Like, would there be any
> issue transferring the executable bit on the file?

> john

> From: Vinod Kumar Vavilapalli [mailto:vinodkv@hortonworks.com]
> Sent: Friday, May 17, 2013 12:56 AM
> To: user@hadoop.apache.org
> Subject: Re: Distribution of native executables and data for YARN-based
> execution

> The "local resources" you mentioned is the exact solution for this. For each
> LocalResource, you also mention a LocalResourceVisibility which takes one of
> the three values today - PUBLIC, PRIVATE and APPLICATON.

> PUBLIC resources are downloaded only once and shared by any application
> running on that node.

> PRIVATE resources are downloaded only once and shared by any application run
> by the same user on that node.

> APPLICATION resources are downloaded per application and removed after the
> application finishes.

> Seems like you want PUBLIC or PRIVATE.

> Note that for PUBLIC resources to work, the corresponding files need to be
> world-readable on HDFS (with the parent directories traversable) too.
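
> For illustration, here is a minimal sketch of registering a PUBLIC local
> resource against the 2.x YARN APIs. The HDFS path and the resource name
> ("refdata") are hypothetical, and error handling is omitted:

> import java.io.IOException;
> import java.util.Collections;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileStatus;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
> import org.apache.hadoop.yarn.api.records.LocalResource;
> import org.apache.hadoop.yarn.api.records.LocalResourceType;
> import org.apache.hadoop.yarn.api.records.LocalResourceVisibility;
> import org.apache.hadoop.yarn.util.ConverterUtils;
> import org.apache.hadoop.yarn.util.Records;
>
> public class PublicResourceExample {
>   public static ContainerLaunchContext makeContext(Configuration conf)
>       throws IOException {
>     FileSystem fs = FileSystem.get(conf);
>     Path hdfsPath = fs.makeQualified(new Path("/apps/myapp/refdata.tar.gz"));
>     FileStatus stat = fs.getFileStatus(hdfsPath);
>
>     LocalResource res = Records.newRecord(LocalResource.class);
>     res.setResource(ConverterUtils.getYarnUrlFromPath(hdfsPath));
>     // Size and timestamp must match the HDFS copy exactly, or
>     // localization is rejected.
>     res.setSize(stat.getLen());
>     res.setTimestamp(stat.getModificationTime());
>     res.setType(LocalResourceType.ARCHIVE);  // unpacked on the node
>     res.setVisibility(LocalResourceVisibility.PUBLIC);
>
>     ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
>     ctx.setLocalResources(Collections.singletonMap("refdata", res));
>     return ctx;
>   }
> }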

> Also, if the remote files on HDFS are updated, the local copies will be
> downloaded afresh on each node where your containers run.

> HTH

> Thanks,

> +Vinod Kumar Vavilapalli

> Hortonworks Inc.
> http://hortonworks.com/

> On May 16, 2013, at 2:21 PM, John Lilley wrote:

> I am attempting to distribute the execution of a C-based program onto a
> Hadoop cluster, without using MapReduce. I read that YARN can be used to
> schedule non-MapReduce applications by programming to the ASM/RM interfaces.
> As I understand it, eventually I get down to specifying each sub-task via
> ContainerLaunchContext.setCommands().
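
> (To make that concrete, a hedged sketch of such a launch command, assuming
> "./myapp" is a hypothetical binary already shipped to the node as a local
> resource; it uses Records and Collections as above, plus
> org.apache.hadoop.yarn.api.ApplicationConstants:

> ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
> ctx.setCommands(Collections.singletonList(
>     "./myapp --input refdata"
>     + " 1>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stdout"
>     + " 2>" + ApplicationConstants.LOG_DIR_EXPANSION_VAR + "/stderr"));

> The redirections land stdout/stderr in the container's log directory.)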

> However, the program and shared libraries need to be stored on each worker’s
> local disk to run. In addition, there is a hefty data set that the
> application uses (say, 4GB) that is accessed via regular open()/read() calls
> by a library. I thought a decent strategy would be to push the program+data
> package to a known folder in HDFS, then launch a “bootstrap” that compares
> the HDFS folder version to a local folder, copying any updated files as
> needed before launching the native application task.
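
> A sketch of the comparison I have in mind, with hypothetical paths (using
> java.io.File plus org.apache.hadoop.fs.{FileSystem, FileStatus, Path}):

> // Copy src out of HDFS only when the local cache copy is missing or
> // older than the HDFS modification time.
> static void syncIfStale(Configuration conf, Path src, File dst)
>     throws IOException {
>   FileSystem fs = FileSystem.get(conf);
>   FileStatus remote = fs.getFileStatus(src);
>   if (!dst.exists() || dst.lastModified() < remote.getModificationTime()) {
>     fs.copyToLocalFile(src, new Path(dst.getAbsolutePath()));
>   }
> }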

> Are there better approaches? I notice that one can implicitly copy “local
> resources” as part of the launch, but I don’t want to copy 4GB every time,
> only occasionally when the application or reference data is updated. Also,
> will my bootstrapper be allowed to set executable-mode bits on the programs
> after they are copied?
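
> (A plain-Java sketch of restoring the bit after the copy, since copying out
> of HDFS won’t carry it; the local path is hypothetical and no Hadoop APIs
> are needed, just java.io:

> static void restoreExecBit(File program) throws IOException {
>   // Mark the freshly copied binary executable for all users on the node.
>   if (!program.setExecutable(true, /* ownerOnly = */ false)) {
>     throw new IOException("could not set exec bit on " + program);
>   }
> }

> restoreExecBit(new File("/var/myapp/bin/myapp"));)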

> Thanks

> John
