hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Processing High CPU & Memory intensive tasks on Hadoop - Architecture question
Date Sat, 25 Apr 2009 08:12:08 GMT

This can be made to work with Hadoop. Basically, your mapper would do the
heavy load-in during its "configure" stage, then process your individual work
items as records during the actual "map" stage. A map task can comprise many
records, so you'll be fine here.
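A minimal sketch of that lifecycle (the class below only mimics the shape of the old org.apache.hadoop.mapred API so the example stays self-contained; it does not extend Hadoop's Mapper, and the class and helper names are illustrative):

```java
// Hypothetical sketch: the expensive load runs once per task in configure(),
// and map() then handles each record cheaply against the already-loaded state.
class ExpensiveModelMapper {
    private Object model;        // whatever the 1-1.5 minute load produces
    int recordsProcessed = 0;

    // Called once per map task, before any records are seen.
    public void configure() {
        model = loadExpensiveModel();   // the slow part runs exactly once here
    }

    // Called once per input record; the model is already resident by now.
    public String map(String record) {
        recordsProcessed++;
        return record + " -> processed with " + model;
    }

    // Stand-in for the real 1-1.5 minute load-in process.
    private Object loadExpensiveModel() {
        return "model";
    }
}
```

In a real job, configure() would come from implementing JobConfigurable (old API) and the record-handling loop would be driven by the framework.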

If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
multiple map tasks are performed serially in the same JVM instance. In this
case, the first task in the JVM would do the heavy load-in process into
static fields or other globally-accessible items; subsequent tasks could
recognize that the system state is already initialized and would not need to
repeat it.
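The guard that pattern relies on can be sketched in plain Java. JVM reuse itself is switched on with the mapred.job.reuse.jvm.num.tasks job property (-1 means unlimited reuse); the class below is a hypothetical stand-in, not Hadoop API:

```java
// Hypothetical sketch of the JVM-reuse trick: the expensive state lives in a
// static field, so only the FIRST task in a reused JVM pays the load cost.
// Later tasks in the same JVM find the field populated and skip the load.
class SharedModelHolder {
    private static Object model;   // survives across tasks within one JVM
    static int loadCount = 0;      // exposed only to make the example checkable

    static synchronized Object get() {
        if (model == null) {       // only the first caller takes this branch
            loadCount++;
            model = "model";       // stand-in for the 1-1.5 minute load
        }
        return model;
    }
}
```

Each task's configure() would simply call SharedModelHolder.get(); the synchronized guard keeps the one-time load safe even if map tasks run in parallel threads.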

The number of mapper/reducer tasks that run in parallel on a given node is
controlled by a simple setting (mapred.tasktracker.map.tasks.maximum); setting
this to 6 will work just fine. The capacity / fairshare schedulers are not what
you need here -- their main function is to ensure that multiple jobs (separate
sets of tasks) can all make progress simultaneously by sharing cluster
resources across jobs rather than running jobs in FIFO order.
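For example, assuming an 8-core TaskTracker node, a fragment like this in hadoop-site.xml (or mapred-site.xml on 0.20) caps each node at 6 concurrent map tasks; the reduce-side value shown is just an illustrative choice:

```xml
<!-- Illustrative per-node task slot limits -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>
```

These are TaskTracker settings, so they take effect per node after the TaskTracker is restarted.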

- Aaron

On Sat, Apr 25, 2009 at 2:36 PM, amit handa <amhanda@gmail.com> wrote:

> Hi,
> We are planning to use Hadoop for some very expensive, long-running
> processing tasks.
> The computing nodes we plan to use are very heavy in terms of CPU and
> memory requirements: one process instance takes almost 100% of a CPU core
> and around 300-400 MB of RAM.
> The first time the process loads it can take around 1 to 1.5 minutes, but
> after that we can feed it data and each item takes only a few seconds to
> process.
> Can I model this on Hadoop?
> Can I have my processes pre-loaded on the task processing machines and the
> data provided by Hadoop? This would save the 1 to 1.5 minutes of initial
> load time that each task would otherwise take.
> I want to run a number of these processes in parallel based on the
> machine's capacity (e.g. 6 instances on an 8-CPU box) or using the
> capacity scheduler.
> Please let me know if this is possible, or any pointers to how it can be
> done.
> Thanks,
> Amit
