hadoop-mapreduce-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject RE: What else can be built on top of YARN.
Date Sat, 01 Jun 2013 14:02:44 GMT

This is a very good question, and one we are currently grappling with in our application port.
I think there are a lot of legacy data-processing applications like ours that would benefit
from a port to Hadoop.  However, because we have a great deal of C++ code, our application is
not necessarily a good fit for MR.  There seem to be two main choices:

·         Run under Hadoop Streaming (“streams”)

·         Run as a custom ApplicationMaster

One of the selling points of our application is its performance and single-code efficiency.
 I have concerns about streams:

·         We will lose performance, because of the extra layers of translation and I/O and
because streams data is uncompressed

·         The streams model is limited to single-in, single-out

·         We have a very large number and size of files to make available locally, and it is
unclear whether the -files option will recursively copy and cache all of them
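
To make the translation concern concrete: under Hadoop Streaming a task is an ordinary
executable that receives its input records as text lines on standard input and emits
tab-separated key/value lines on standard output. The following is only a minimal sketch of
that convention; the function name run_mapper and the word-count logic are illustrative
assumptions and are not taken from the application discussed here.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical word-count mapper following the Hadoop Streaming convention:
// one input record per text line on the input stream, and one
// "key<TAB>value" record per line on the output stream.
void run_mapper(std::istream& in, std::ostream& out) {
    std::string line;
    while (std::getline(in, line)) {
        // Every record arrives as text and must be parsed again by the task.
        std::istringstream words(line);
        std::string word;
        while (words >> word) {
            // Emit the word as the key and a count of 1 as the value.
            out << word << '\t' << 1 << '\n';
        }
    }
}
```

A streaming executable would call run_mapper(std::cin, std::cout) from main. The single
input stream and single output stream in the signature mirror the single-in, single-out
limitation listed above, and the text parsing is one of the extra translation layers.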

In contrast, porting our application as a YARN ApplicationMaster appears to offer several
benefits (which come at the expense of extra complexity):

·         Negotiation for container resources and scheduling.  Some of our operations are
very heavy (load time and memory use), so they need larger containers and will benefit from
larger data splits.

·         Direct access to HDFS via JNI without translation layers.

·         Algorithms that are not well-suited to the MR model, such as transitive closure.
 They are more naturally expressed as MPI-like algorithms.

·         If warranted, the ability to replace MR shuffle with a C++ data partition (this
could be a discussion thread in its own right).
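
As an illustration of the transitive-closure point above, here is a naive fixpoint sketch.
Each round joins the pairs found so far with the original edges and stops only when nothing
new appears, so the number of rounds depends on the data. In plain MR each round would
typically be a separate job. The names Edge and transitive_closure are illustrative
assumptions, and this is not the algorithm used in our application.

```cpp
#include <set>
#include <utility>
#include <vector>

using Edge = std::pair<int, int>;  // (from, to)

// Naive fixpoint computation of the transitive closure of a directed graph.
// If `rounds` is non-null it receives the number of passes that were needed,
// including the final pass that discovers nothing new.
std::set<Edge> transitive_closure(const std::set<Edge>& edges, int* rounds = nullptr) {
    std::set<Edge> closure = edges;
    int passes = 0;
    bool changed = true;
    while (changed) {
        changed = false;
        ++passes;
        // Collect new pairs separately so the set is not modified while iterating.
        std::vector<Edge> found;
        for (const Edge& ab : closure) {
            for (const Edge& bc : edges) {
                if (ab.second == bc.first) {
                    Edge ac{ab.first, bc.second};
                    if (closure.count(ac) == 0) {
                        found.push_back(ac);
                    }
                }
            }
        }
        for (const Edge& e : found) {
            if (closure.insert(e).second) {
                changed = true;
            }
        }
    }
    if (rounds != nullptr) {
        *rounds = passes;
    }
    return closure;
}
```

Because every pass needs the full result of the previous one, the loop maps more naturally
onto long-lived cooperating processes than onto a chain of independent map and reduce
phases.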

Moving our processing into native Java for a more seamless MR integration is not an option
due to the size and complexity of the code base.

It may be that I am completely wrong about the limitations of the streams interface; if so,
please tell me why.


From: Rahul Bhattacharjee [mailto:rahul.rec.dgp@gmail.com]
Sent: Wednesday, May 29, 2013 8:34 AM
To: user@hadoop.apache.org
Subject: What else can be built on top of YARN.

Hi all,
I was going through the motivation behind YARN. Splitting the responsibilities of the JT is the
major concern. Ultimately, the base (YARN) was built in a generic way so that other generic
distributed applications could be built on it too.
I am not able to think of any other parallel-processing use case that would be useful to build
on top of YARN. I thought of a lot of use cases that would benefit from running in parallel,
but again, we can do those using map-only jobs in MR.
Can someone describe a scenario where an application can utilize YARN features, or can be built
on top of YARN, and at the same time cannot be done efficiently using MRv2 jobs?
