hadoop-common-user mailing list archives

From Aaron Kimball <aa...@cloudera.com>
Subject Re: Processing High CPU & Memory intensive tasks on Hadoop - Architecture question
Date Sun, 26 Apr 2009 05:11:26 GMT
I'm not aware of any documentation about this particular use case for
Hadoop. I think your best bet is to look into the JNI documentation about
loading native libraries, and go from there.
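Very roughly, the Java side would look something like this (the library name
"engine" and the native method signatures are made up -- they would mirror
whatever your C++ .so actually exposes):

  public class Engine {
    static {
      // Expects libengine.so to be visible on java.library.path
      // (e.g. installed on the nodes or shipped via the DistributedCache).
      System.loadLibrary("engine");
    }

    // Declared here, implemented inside the C++ .so through JNI.
    public static native void init();                     // the slow load-in, run once
    public static native String process(String record);   // fast per-record call
  }

Your mapper's configure() would call Engine.init() once, and map() would call
Engine.process() for each record.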
- Aaron


On Sat, Apr 25, 2009 at 10:44 PM, amit handa <amhanda@gmail.com> wrote:

> Thanks Aaron,
>
> The processing libs that we use, which take time to load, are all C++-based
> .so libraries.
> Can I invoke them from the JVM during the configure stage of the mapper and
> keep them running as you suggested?
> Can you point me to some documentation on this?
>
> Regards,
> Amit
>
> On Sat, Apr 25, 2009 at 1:42 PM, Aaron Kimball <aaron@cloudera.com> wrote:
>
> > Amit,
> >
> > This can be made to work with Hadoop. Basically, your mapper would do the
> > heavy load-in during its "configure" stage, then process your individual
> > work items as records during the actual "map" stage.
> > A map task can comprise many records, so you'll be fine here.
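> > As a rough sketch against the old (0.19/0.20) mapred API -- the EngineModel
> > class and its load()/score() methods are hypothetical stand-ins for your real
> > library:
> >
> >   import java.io.IOException;
> >   import org.apache.hadoop.io.LongWritable;
> >   import org.apache.hadoop.io.Text;
> >   import org.apache.hadoop.mapred.*;
> >
> >   public class HeavyInitMapper extends MapReduceBase
> >       implements Mapper<LongWritable, Text, Text, Text> {
> >
> >     private EngineModel model;   // hypothetical heavy object, ~1-1.5 min to load
> >
> >     public void configure(JobConf job) {
> >       model = EngineModel.load();          // heavy one-time load, once per task
> >     }
> >
> >     public void map(LongWritable key, Text value,
> >         OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
> >       // fast per-record work against the already-loaded model
> >       out.collect(new Text(key.toString()),
> >                   new Text(model.score(value.toString())));
> >     }
> >   }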
> >
> > If you use Hadoop 0.19 or 0.20, you can also enable JVM reuse, where
> > multiple map tasks are performed serially in the same JVM instance. In this
> > case, the first task in the JVM would do the heavy load-in process into
> > static fields or other globally-accessible items; subsequent tasks could
> > recognize that the system state is already initialized and would not need
> > to repeat it.
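> > A sketch of that pattern (SharedEngine and EngineModel are placeholder
> > names; the per-job reuse knob is mapred.job.reuse.jvm.num.tasks in 0.19+):
> >
> >   // Guarded one-time init in a static field; later tasks reusing this JVM skip it.
> >   public class SharedEngine {
> >     private static EngineModel model;      // hypothetical heavy object
> >
> >     public static synchronized EngineModel get() {
> >       if (model == null) {
> >         model = EngineModel.load();        // only the first task in this JVM pays the cost
> >       }
> >       return model;
> >     }
> >   }
> >
> >   // In the job driver, enable JVM reuse for the job:
> >   JobConf conf = new JobConf(MyJob.class);
> >   conf.setNumTasksToExecutePerJvm(-1);     // -1 = reuse the JVM for an unlimited number of tasks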
> >
> > The number of mapper/reducer tasks that run in parallel on a given node can
> > be configured with a simple setting; setting this to 6 will work just fine.
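> > For instance, on each tasktracker you would set something like the following
> > (property names as in 0.19/0.20; the reduce-side limit shown is just an
> > example value):
> >
> >   <!-- hadoop-site.xml (0.19) / mapred-site.xml (0.20) on each tasktracker -->
> >   <property>
> >     <name>mapred.tasktracker.map.tasks.maximum</name>
> >     <value>6</value>
> >   </property>
> >   <property>
> >     <name>mapred.tasktracker.reduce.tasks.maximum</name>
> >     <value>2</value>
> >   </property>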
> > The capacity / fairshare schedulers are not what you need here -- their main
> > function is to ensure that multiple jobs (separate sets of tasks) can all
> > make progress simultaneously by sharing cluster resources across jobs rather
> > than running jobs in a FIFO fashion.
> >
> > - Aaron
> >
> > On Sat, Apr 25, 2009 at 2:36 PM, amit handa <amhanda@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > We are planning to use Hadoop for some very expensive and long-running
> > > processing tasks.
> > > The computing nodes that we plan to use are very heavy in terms of CPU and
> > > memory requirements, e.g. one process instance takes almost 100% of a CPU
> > > (1 core) and around 300-400 MB of RAM.
> > > The first time the process loads it can take around 1-1.5 minutes, but
> > > after that we can provide the data to process and it takes a few seconds
> > > to process.
> > > Can I model this on Hadoop?
> > > Can I have my processes pre-loaded on the task processing machines and the
> > > data be provided by Hadoop? This will save the 1-1.5 minutes of initial
> > > load time that it would otherwise take for each task.
> > > I want to run a number of these processes in parallel based on the
> > > machine's capacity (e.g. 6 instances on an 8-CPU box) or using the
> > > capacity scheduler.
> > >
> > > Please let me know if this is possible, or share any pointers to how it
> > > can be done.
> > >
> > > Thanks,
> > > Amit
> > >
> >
>
