hadoop-common-user mailing list archives

From "Dmitry Pushkarev" <u...@stanford.edu>
Subject RE: streaming split sizes
Date Wed, 21 Jan 2009 03:07:31 GMT
Well, the database is specifically designed to fit into memory, and if it
does not, it will slow things down hundreds of times. One simple hack I came
up with is to replace the map tasks with /bin/cat and then run 150 reducers
that keep the database constantly in memory. Parallelism is also not a
problem, since we're running a very small cluster (15 nodes, 120 cores)
specifically built for the

Dmitry Pushkarev
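
For readers of the archive, the hack described above might look roughly like the sketch below: the mapper is /bin/cat, and each reducer loads the database once and then streams every record it receives against it. This is only an illustration under assumptions, not the script actually used in this thread. The tab-separated database format, the `LOOKUP_DB` environment variable, and the function names are all hypothetical.

```python
#!/usr/bin/env python3
"""Hypothetical streaming reducer: load the lookup table once, then stream."""
import os
import sys


def load_database(path):
    """Read a tab-separated key/value file into a dict (assumed format)."""
    table = {}
    with open(path) as handle:
        for line in handle:
            key, _, value = line.rstrip("\n").partition("\t")
            table[key] = value
    return table


def process(lines, table):
    """Yield one output line per input record, using the in-memory table."""
    for line in lines:
        # Streaming passes records as "key<TAB>value"; only the key is used here.
        key = line.rstrip("\n").split("\t", 1)[0]
        yield f"{key}\t{table.get(key, 'MISSING')}"


def main(db_path):
    # The expensive load happens once per reducer, not once per 64 MB split.
    table = load_database(db_path)
    for out in process(sys.stdin, table):
        print(out)


if __name__ == "__main__":
    # LOOKUP_DB is a made-up variable name for this sketch.
    db_path = os.environ.get("LOOKUP_DB")
    if db_path:
        main(db_path)
```

With 150 reducers, the load cost is paid 150 times in total instead of once per input split, which is the point of the trick.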

-----Original Message-----
From: Delip Rao [mailto:deliprao@gmail.com] 
Sent: Tuesday, January 20, 2009 6:19 PM
To: core-user@hadoop.apache.org
Subject: Re: streaming split sizes

Hi Dmitry,

Not a direct answer to your question, but I think the right approach
would be to not load your database into memory during config() but
instead look up the database from map() via HBase or something similar.
That way you don't have to worry about the split sizes. In fact using
fewer splits would limit the parallelism you can achieve, given that
your maps are so fast.

- delip

On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev <umka@stanford.edu> wrote:
> Hi.
> I'm running streaming on a relatively big (2 TB) dataset, which is being
> split by Hadoop into 64 MB pieces. One of the problems I have with that is
> that my map tasks take a very long time to initialize (they need to load a
> 3 GB database into RAM) and they finish each 64 MB piece in 10 seconds.
> So I'm wondering if there is any way to make Hadoop give larger chunks of
> the dataset to map tasks? (The trivial way, of course, would be to split
> the dataset into N files and make it feed one file at a time, but is there
> any standard solution for that?)
> Thanks.
> ---
> Dmitry Pushkarev
> +1-650-644-8988
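
To put the figures quoted above in perspective, here is a small back-of-the-envelope calculation of how many splits (and therefore how many database loads) a 2 TB input produces at different split sizes. It assumes binary units (1 TB = 1024^4 bytes) and one map task per split; the 1 GB alternative is just an example value, not something proposed in the thread.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4


def split_count(dataset_bytes, split_bytes):
    """Number of splits needed to cover the dataset (ceiling division)."""
    return -(-dataset_bytes // split_bytes)


dataset = 2 * TB

# Split size from the thread: 64 MB pieces.
small = split_count(dataset, 64 * MB)

# Hypothetical larger split size, for comparison only.
large = split_count(dataset, 1 * GB)

print(small)  # 32768 map tasks, each loading the 3 GB database
print(large)  # 2048 map tasks
```

At 64 MB per split the database would be loaded tens of thousands of times, which is why the initialization cost dominates the 10 seconds of actual work per split.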
