hadoop-mapreduce-user mailing list archives

From John Lilley <john.lil...@redpoint.net>
Subject Scheduling non-MR processes
Date Sat, 12 Jan 2013 16:09:11 GMT
I am trying to understand how a "side process" can cooperate with the Hadoop MapReduce
task scheduler.  Suppose that I have an application that is not integrated with MapReduce
(i.e., it is not a MapReduce job at all; there are no mappers or reducers).  This application
could access HDFS as an external client, but its throughput would be limited.  I want
to run this application in parallel on the HDFS nodes to get the benefits of parallel computation
and data locality, and I want it to cooperate with Hadoop on resource management.  However, I
don't want the *data* to be pushed through MapReduce, because the nature of the application
doesn't lend itself to MR integration.

Perhaps it will help if I explain why I think this is not suitable for regular MR jobs.  Suppose
that I have stored in HDFS a very large file in a format for which there is no Java library.  JNI
could be an option, but wrapping the complex functionality of the legacy application code in JNI
may be more work than it is worth.  The application performs some very complex processing, and
we don't necessarily want to redesign it to fit the MR paradigm.  Obviously
the data file must be "splittable", or this approach wouldn't work at all.  So perhaps it is possible
to hook into MR at the Splitter level and use that to create a series of mapper tasks in which
the mappers don't read the data themselves, but hand off the corresponding data block
to the legacy application for processing?
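
To make the hand-off idea concrete, here is a minimal sketch of the bookkeeping involved.  It is
not Hadoop code: it only divides a file of known size into byte ranges and builds the command
line that a task could use to start the legacy program on its own range.  The executable name
`legacy_app` and its `--input`, `--offset` and `--length` flags are hypothetical, and a fixed
split size stands in for whatever the real splitter would choose.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Split:
    """A byte range of one file, analogous to what a mapper task would be given."""
    path: str
    offset: int
    length: int


def compute_splits(path: str, file_size: int, split_size: int) -> List[Split]:
    """Divide a file of file_size bytes into consecutive ranges of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    offset = 0
    while offset < file_size:
        length = min(split_size, file_size - offset)
        splits.append(Split(path, offset, length))
        offset += length
    return splits


def legacy_command(split: Split, executable: str = "legacy_app") -> List[str]:
    """Build the argument list for the (hypothetical) legacy program.

    The task itself never reads the data; it only tells the legacy
    program which byte range of which file to process.
    """
    return [
        executable,
        "--input", split.path,
        "--offset", str(split.offset),
        "--length", str(split.length),
    ]
```

In an actual job, each task would receive its range from the framework rather than computing
it, and would launch the command as a child process on the node where the task was scheduled.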

Sorry if this is somewhat loosely defined; we are still trying to understand the best integration
strategy.  I hope you can see what I am trying to do and can offer some suggestions.
