hadoop-common-user mailing list archives

From Leo Leung <lle...@ddn.com>
Subject RE: Hadoop and Hibernate
Date Fri, 02 Mar 2012 18:30:37 GMT
Geoffry,

 Hadoop's DistributedCache (as of now) is used to "cache" files specific to an M/R application.
 These files are used only by the M/R app, not by the framework (normally for side lookups).

 You can certainly try to use Hibernate to query your SQL-based back end from within the M/R code.
 But think about what happens when a few hundred or a few thousand M/R tasks do that concurrently.
 Your back end is going to cry (if it can, before it dies).

 So IMO, prepping your M/R job with DistributedCache files (pull the data down first) is a better approach.
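
 A minimal sketch of that side-lookup pattern, in plain Java so it runs without a cluster.
 The class and method names (SideLookup, load, enrich) and the tab-separated file layout are
 hypothetical, made up for illustration. In a real job, the local path would presumably come
 from DistributedCache.getLocalCacheFiles(conf) in the mapper's setup code, and the file itself
 would be an export of the SQL table made once before the job is submitted.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical side-lookup helper; the names here are illustrative, not Hadoop API. */
final class SideLookup {

    /**
     * Load a "key<TAB>value" file into memory. Intended to be called once per
     * task (e.g. from a mapper's setup step), not once per record.
     */
    static Map<String, String> load(Path localFile) throws IOException {
        Map<String, String> table = new HashMap<>();
        try (BufferedReader in = Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines without a tab separator
                }
                table.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
        return table;
    }

    /** Per-record work: an in-memory lookup instead of a database round trip. */
    static String enrich(Map<String, String> table, String key) {
        String value = table.get(key);
        return key + "\t" + (value != null ? value : "UNKNOWN");
    }
}
```

 With this shape, the database is hit once (for the export) no matter how many tasks run,
 and each task pays only the cost of reading a local file.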

 Also, MPI is pretty much out of the question (it is not baked into the framework).
 You'll likely have to roll your own (and try to trick the JobTracker into not starting the
 same task).

 Does anyone have a better solution for Geoffry?



-----Original Message-----
From: Geoffry Roberts [mailto:geoffry.roberts@gmail.com] 
Sent: Friday, March 02, 2012 9:42 AM
To: common-user@hadoop.apache.org
Subject: Re: Hadoop and Hibernate

This is a tardy response.  I'm spread pretty thinly right now.

DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
is apparently deprecated.  Is there a replacement?  I didn't see anything about this in the documentation,
but then I am still using 0.21.0. I have to for performance reasons.  1.0.1 is too slow and
the client won't have it.

Also, the DistributedCache <http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache>
approach seems to work only from within a Hadoop job, i.e. from within a Mapper or a Reducer, but
not from within a Driver.  I have libraries that I must access from both places.  I take
it that I am stuck keeping two copies of these libraries in sync. Correct?  It's either that,
or copy them into HDFS, replacing them all at the beginning of each job run.

Looking for best practices.

Thanks

On 28 February 2012 10:17, Owen O'Malley <omalley@apache.org> wrote:

> On Tue, Feb 28, 2012 at 5:15 PM, Geoffry Roberts 
> <geoffry.roberts@gmail.com> wrote:
>
> > If I create an executable jar file that contains all dependencies required
> > by the MR job do all said dependencies get distributed to all nodes?
>
> You can make a single jar and that will be distributed to all of the 
> machines that run the task, but it is better in most cases to use the 
> distributed cache.
>
> See
> http://hadoop.apache.org/common/docs/r1.0.0/mapred_tutorial.html#DistributedCache
>
> > If I specify but one reducer, which node in the cluster will the 
> > reducer run on?
>
> The scheduling is done by the JobTracker and it isn't possible to 
> control the location of the reducers.
>
> -- Owen
>



--
Geoffry Roberts
