hadoop-common-user mailing list archives

From: amit handa <amha...@gmail.com>
Subject: Re: Processing High CPU & Memory intensive tasks on Hadoop - Architecture question
Date: Sat, 25 Apr 2009 13:44:56 GMT
Thanks Aaron,

The processing libraries we use, which take time to load, are all C++-based
.so libraries.
Can I invoke them from the JVM during the configure stage of the mapper and
keep them loaded, as you suggested?
Can you point me to some documentation on this?
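
In case it helps clarify the question, this is roughly the JNI wrapper I
have in mind -- purely a sketch; NativeEngine, the method names, and
libengine.so are all made-up stand-ins for our real library:

    // Hypothetical JNI wrapper around our C++ .so (names are invented).
    public class NativeEngine {
      static {
        // Assumes libengine.so is on java.library.path on every task node.
        System.loadLibrary("engine");
      }

      // The expensive ~1-1.5 minute start-up happens inside the native code.
      public native void initialize();

      // A few-second unit of work against the already-initialized engine.
      public native String process(String input);

      // Releases native resources once the task is finished with the engine.
      public native void shutdown();
    }

I have also added, inline below your paragraphs, how I understood each of
your suggestions -- please correct me wherever I have it wrong.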

Regards,
Amit

On Sat, Apr 25, 2009 at 1:42 PM, Aaron Kimball <aaron@cloudera.com> wrote:

> Amit,
>
> This can be made to work with Hadoop. Basically, your mapper would do the
> heavy load-in during its "configure" stage, then process your individual
> work items as records during the actual "map" stage. A map task can
> comprise many records, so you'll be fine here.
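
If I follow, the mapper skeleton would be something like this (a rough
sketch against the old org.apache.hadoop.mapred API in 0.19; HeavyMapper
and the NativeEngine wrapper above are invented names):

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class HeavyMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private NativeEngine engine;   // hypothetical JNI wrapper (see above)

      public void configure(JobConf job) {
        // Heavy load-in happens once per map task, before any records.
        engine = new NativeEngine();
        engine.initialize();         // the 1-1.5 minute start-up cost
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // Each record is one few-second work item for the loaded engine.
        output.collect(new Text(key.toString()),
                       new Text(engine.process(value.toString())));
      }

      public void close() throws IOException {
        engine.shutdown();           // free native resources at task end
      }
    }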
>
> If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
> multiple map tasks are performed serially in the same JVM instance. In
> this case, the first task in the JVM would do the heavy load-in process
> into static fields or other globally-accessible items; subsequent tasks
> could recognize that the system state is already initialized and would
> not need to repeat it.
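
For the JVM-reuse case, is the idea a guard like the one below, so that
only the first task in a reused JVM pays the load cost? (EngineHolder is
my name; I believe the 0.19 property is mapred.job.reuse.jvm.num.tasks,
with -1 meaning no limit -- please correct me if that is wrong.)

    // Shared holder: the engine survives across tasks when the JVM is reused.
    public class EngineHolder {
      private static NativeEngine engine;

      public static synchronized NativeEngine get() {
        if (engine == null) {        // first task in this JVM: heavy load-in
          engine = new NativeEngine();
          engine.initialize();
        }
        return engine;               // later tasks skip the start-up cost
      }
    }

and at job-submission time:

    JobConf conf = new JobConf(HeavyMapper.class);
    conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);  // -1 = unlimited reuse

with configure() then calling EngineHolder.get() instead of constructing
the engine directly.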
>
> The number of mapper/reducer tasks that run in parallel on a given node
> can be configured with a simple setting; setting this to 6 will work just
> fine. The capacity / fairshare schedulers are not what you need here --
> their main function is to ensure that multiple jobs (separate sets of
> tasks) can all make progress simultaneously by sharing cluster resources
> across jobs rather than running jobs in a FIFO fashion.
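
And for the parallelism setting, I assume you mean the tasktracker-side
slot limits rather than anything in the job configuration itself --
something like this on each node (hadoop-site.xml in 0.19, mapred-site.xml
in 0.20), if I have the property names right?

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>   <!-- 6 concurrent map slots on an 8-CPU box -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>   <!-- leave headroom for the daemons -->
    </property>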
>
> - Aaron
>
> On Sat, Apr 25, 2009 at 2:36 PM, amit handa <amhanda@gmail.com> wrote:
>
> > Hi,
> >
> > We are planning to use Hadoop for some very expensive and long-running
> > processing tasks.
> > The processing we plan to run is very heavy in terms of CPU and memory
> > requirements: e.g., one process instance takes almost 100% of a CPU core
> > and around 300-400 MB of RAM.
> > The first time the process loads, it can take around 1 to 1.5 minutes,
> > but after that we can feed it data and each item takes only a few
> > seconds to process.
> > Can I model this on Hadoop?
> > Can I have my processes pre-loaded on the task-processing machines and
> > the data provided by Hadoop? This would save the 1 to 1.5 minutes of
> > initial load time that it would otherwise take for each task.
> > I want to run a number of these processes in parallel based on each
> > machine's capacity (e.g., 6 instances on an 8-CPU box), or using the
> > capacity scheduler.
> >
> > Please let me know if this is possible, or point me to how it can be
> > done.
> >
> > Thanks,
> > Amit
> >
>
