hadoop-common-user mailing list archives

From Mark Kerzner <markkerz...@gmail.com>
Subject Too many maps?
Date Wed, 07 Sep 2011 01:42:23 GMT
Hi,

I am testing my Hadoop-based FreeEed <http://freeeed.org/>, an open source
tool for eDiscovery, and I am using the Enron data set
<http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2> for
that. In my processing, each email with its attachments becomes a map,
and it is later collected by a reducer and written to the output. With the
(PST) mailboxes of around 2-5 GB, I am beginning to see email counts
of about 50,000. I remember from the Yahoo best practices that the number of
maps should not exceed 75,000, and I can see that I may cross this limit soon.

I could, potentially, combine a few emails into one map, but I would be
doing it only to circumvent the size problem, not because my processing
requires it. Besides, my keys are the MD5 hashes of the files, and I use
them to find duplicates. If I combine a few emails into a map, I cannot use
the hashes as keys in a meaningful way anymore.
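
To make the concern concrete, here is a minimal plain-Java sketch of the two steps involved. It uses no Hadoop APIs; the class BatchedDedupSketch and the methods md5Hex, mapBatch, and groupByKey are hypothetical names invented for illustration, and emails are modeled as plain strings. It assumes that one map call may receive several emails and emit one (MD5, email) pair per email, so the hash stays the key even when inputs are batched. Whether that matches how FreeEed splits its input is an assumption, not something established here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BatchedDedupSketch {

    // Hex-encoded MD5 of the raw bytes; plays the role of the map output key.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required on every Java platform, so this is not expected.
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for one map call over a batch of emails (hypothetical):
    // each email still gets its own output pair, keyed by its own hash.
    static List<Map.Entry<String, String>> mapBatch(List<String> emails) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (String email : emails) {
            String key = md5Hex(email.getBytes(StandardCharsets.UTF_8));
            out.add(Map.entry(key, email));
        }
        return out;
    }

    // Stand-in for the shuffle: collect all values that share a key,
    // which is what a reducer would see for each hash.
    static Map<String, List<String>> groupByKey(
            List<List<Map.Entry<String, String>>> mapOutputs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (List<Map.Entry<String, String>> output : mapOutputs) {
            for (Map.Entry<String, String> pair : output) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        return grouped;
    }

    public static void main(String[] args) {
        // Two batches; "email B" appears in both and should collapse to one key.
        List<Map.Entry<String, String>> first = mapBatch(List.of("email A", "email B"));
        List<Map.Entry<String, String>> second = mapBatch(List.of("email B", "email C"));
        Map<String, List<String>> grouped = groupByKey(List.of(first, second));
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " copies");
        }
    }
}
```

In this sketch the number of batches and the number of keys are independent: duplicates are detected by grouping on the hash, regardless of how many emails each map call handled.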

So my question is: can I have millions of maps, if that's how many
artifacts I need to process? And if not, why not?

Thank you. Sincerely,
Mark
