hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Too many maps?
Date Wed, 07 Sep 2011 03:06:58 GMT
You can use an input format that reads multiple files per map (say, all
files local to the same node; see CombineFileInputFormat for one
implementation that does this). That way you reduce the number of maps
without having to clump your files yourself. One record reader is still
initialized per file, so you should be free to generate a unique identity
per file/email with this approach (whenever a new record reader is
initialized).
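To make the suggestion concrete, here is a minimal, stdlib-only sketch of
the packing idea behind CombineFileInputFormat: many small files are
greedily grouped into splits capped at a maximum split size, so the number
of map tasks tracks the split count rather than the file count. This is an
illustration of the technique, not the Hadoop API itself; the class and
method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch (not the Hadoop API): pack many small files into
// fewer splits, the way CombineFileInputFormat caps splits at a maximum
// split size, so one map task can cover many files.
public class CombinePacker {

    // Greedily group file sizes into splits no larger than maxSplitSize.
    // A single file bigger than the cap still gets its own split.
    public static List<List<Long>> pack(long[] fileSizes, long maxSplitSize) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split once adding this file would overflow it.
            if (!current.isEmpty() && currentBytes + size > maxSplitSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six small files packed under a 64-byte cap yield only three splits,
        // i.e. three map tasks instead of six.
        long[] sizes = {10, 20, 30, 90, 5, 5};
        System.out.println(pack(sizes, 64)); // [[10, 20, 30], [90], [5, 5]]
    }
}
```

Hadoop's real implementation also weighs node and rack locality when
grouping files, which this sketch ignores for brevity.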

On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner <markkerzner@gmail.com> wrote:
> Hi,
> I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open source
> tool for eDiscovery, and I am using the Enron data set
> <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2> for
> that. In my processing, each email with its attachments becomes a map, and
> it is later collected by a reducer and written to the output. With (PST)
> mailboxes of around 2-5 GB, I am beginning to see email counts of about
> 50,000. I remember from Yahoo's best practices that the number of maps
> should not exceed 75,000, and I can see that I will break this barrier soon.
> I could, potentially, combine a few emails into one map, but I would be
> doing it only to circumvent the size problem, not because my processing
> requires it. Besides, my keys are the MD5 hashes of the files, and I use
> them to find duplicates. If I combine a few emails into a map, I cannot use
> the hashes as keys in a meaningful way anymore.
> So my question is: can't I have millions of maps, if that's how many
> artifacts I need to process, and if not, why not?
> Thank you. Sincerely,
> Mark
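The dedup scheme described in the quoted question, keying each email by the
MD5 of its bytes, keeps working under a combined input format, because the
key is derived from each file's contents rather than from the map task it
runs in. A minimal sketch using only java.security.MessageDigest (the class
name here is illustrative, not taken from FreeEed):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch of content-hash dedup keys: two emails with identical bytes
// produce the same MD5 key, so a reducer sees them grouped together
// regardless of which map task (or split) read each file.
public class Md5Key {

    // Hex-encoded MD5 of the given bytes.
    public static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        String a = md5Hex("same email body".getBytes(StandardCharsets.UTF_8));
        String b = md5Hex("same email body".getBytes(StandardCharsets.UTF_8));
        System.out.println(a.equals(b)); // true: duplicates share one key
    }
}
```

Since the key comes from the file contents, combining several emails into
one split changes nothing about how duplicates collapse at the reducer.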

Harsh J
