hadoop-common-user mailing list archives

From "GOEKE, MATTHEW (AG/1000)" <matthew.go...@monsanto.com>
Subject RE: how to set the number of mappers with 0 reducers?
Date Tue, 20 Sep 2011 16:08:00 GMT
There is currently no way to disable sort/shuffle. You can do many things to alleviate any
issues you have with it, though; you mentioned one of them below. Is there a reason why you
are allowing each of your keys to be unique? If it is truly because you do not care, then
just create an even distribution of keys to assign, to allow for more aggregation.
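Matt's suggestion here (assign keys from a fixed, evenly distributed set rather than letting every record carry a unique key) can be sketched in plain Java. The bucket count of 100 and the key names are illustrative, not from the thread:

```java
public class KeyBuckets {
    // Map an arbitrary (possibly unique) record key into one of a fixed
    // number of buckets, so the shuffle groups many records per key
    // instead of carrying one group per unique key.
    static int bucketFor(String recordKey, int numBuckets) {
        // Mask off the sign bit so the result of % is non-negative.
        return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        // Illustrative keys; in the job these would be the mapper's output keys.
        for (String k : new String[] {"rec-0001", "rec-0002", "rec-0003"}) {
            System.out.println(k + " -> bucket " + bucketFor(k, 100));
        }
    }
}
```

In the mapper you would emit `bucketFor(originalKey, 100)` as the output key, so each reducer sees a bounded number of groups.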

On a side note, what is the actual stack trace you are getting when the reducers fail and
what is the reducer doing? I think for your use case using a reduce phase is the best way
to go, as long as the job time meets your SLA, so we need to figure out why the job is failing.


-----Original Message-----
From: Peng, Wei [mailto:Wei.Peng@xerox.com] 
Sent: Tuesday, September 20, 2011 10:44 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

The input is 9010 files (each 500MB), and I would estimate the output to
be around 50GB.
My Hadoop job failed with an out-of-memory error (with 66 reducers). I
guess that the keys from the mapper output are all unique, so the sorting
would be memory-intensive.
Although I can set another key to reduce the number of unique keys, I am
curious if there is a way to disable sorting/shuffling.


-----Original Message-----
From: GOEKE, MATTHEW (AG/1000) [mailto:matthew.goeke@monsanto.com] 
Sent: Tuesday, September 20, 2011 8:34 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

Amusingly, this is almost the same question that was asked the other day:

<quote from Owen O'Malley>
There isn't currently a way of getting a collated, but unsorted list of
key/value pairs. For most applications, the in memory sort is fairly
cheap relative to the shuffle and other parts of the processing.

If you know that you will be filtering out a significant amount of
information to the point where shuffle will be trivial then the impact
of a reduce phase should be minimal using an identity reducer. Otherwise,
aggregate as much data as you feel comfortable with into each split and
have one file per map.

How much data/percentage of input are you assuming will be output from
each of these maps?


-----Original Message-----
From: Peng, Wei [mailto:Wei.Peng@xerox.com] 
Sent: Tuesday, September 20, 2011 10:22 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?

Thank you all for the quick reply!!

I think I was wrong. It has nothing to do with the number of mappers
because each input file has size 500M, which is not too small in terms
of 64M per block.

The problem is that the output from each mapper is too small. Is there a
way to combine some mappers output together? Setting the number of
reducers to 1 might get a very huge file. Can I set the number of
reducers to 100, but skip sorting, shuffling...etc.?


-----Original Message-----
From: Soumya Banerjee [mailto:soumya.sbanerjee@gmail.com] 
Sent: Tuesday, September 20, 2011 2:06 AM
To: common-user@hadoop.apache.org
Subject: Re: how to set the number of mappers with 0 reducers?


If you want all your map outputs in a single file, you can use an
IdentityReducer and set the number of reducers to 1.
This ensures that all your mapper output goes to the single reducer,
which writes it into a single file.
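As a hedged sketch of the driver side (new MapReduce API; class and path names are illustrative, and `Job.getInstance` is from later Hadoop releases where older code used `new Job(conf)`): the default `Reducer` already passes pairs through unchanged, so setting the reducer count to 1 funnels all map output into one file, while setting it to 0 makes the job map-only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SingleFileDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "single-output-file");
        job.setJarByClass(SingleFileDriver.class);
        // No setReducerClass() call: the default Reducer is an identity reducer.
        job.setNumReduceTasks(1);    // one reducer -> one output file
        // job.setNumReduceTasks(0); // map-only alternative: no sort/shuffle,
        //                           // but one output file per mapper
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

This is a configuration sketch, not a complete job: the mapper class and input/output formats would come from the actual application.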


On Tue, Sep 20, 2011 at 2:04 PM, Harsh J <harsh@cloudera.com> wrote:

> Hello Wei!
> On Tue, Sep 20, 2011 at 1:25 PM, Peng, Wei <Wei.Peng@xerox.com> wrote:
> (snip)
> > However, the output from the mappers results in many small files (size
> > is ~50k; the block size however is 64M, so it wastes a lot of space).
> >
> > How can I set the number of mappers (say 100)?
> What you're looking for is to 'pack' several files per mapper, if I
> get it right.
> In that case, you need to check out the CombineFileInputFormat. It can
> pack several files per mapper (with some degree of locality).
> Alternatively, pass a list of files (as a text file) as your input,
> and have your Mapper logic read them one by one. This way, if you
> divide 50k filenames over 100 files, you will get 100 mappers as you
> want - but at the cost of losing almost all locality.
> > If there is no way to set the number of mappers, is the only way to
> > do it to "cat" some files together?
> Concatenating is an alternative, if affordable - yes. You can lower
> the file count (down from 50k) this way.
> --
> Harsh J
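Harsh's second suggestion (pass a text file of file names as input and divide the names over a fixed number of list files) can be sketched outside Hadoop in plain Java; the counts and names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;

public class FileListSplitter {
    // Split a long list of file names into a fixed number of groups,
    // one group per intended mapper (each group becomes one list file).
    static List<List<String>> split(List<String> names, int groups) {
        List<List<String>> out = new ArrayList<>();
        for (int g = 0; g < groups; g++) {
            out.add(new ArrayList<>());
        }
        // Round-robin assignment keeps the groups evenly sized.
        for (int i = 0; i < names.size(); i++) {
            out.get(i % groups).add(names.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < 500; i++) {
            names.add("part-" + i); // stand-ins for the real HDFS paths
        }
        List<List<String>> lists = split(names, 100);
        System.out.println(lists.size() + " lists of " + lists.get(0).size());
    }
}
```

Each mapper then reads its list file and opens the named files itself, at the cost of losing almost all data locality, as noted above.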