hadoop-common-user mailing list archives

From "Joydeep Sen Sarma" <jssa...@facebook.com>
Subject RE: problem with IdentityMapper
Date Thu, 10 Jan 2008 23:01:00 GMT
What are the key and value types in the SequenceFile?
It seems that the MapRunner calls createKey and createValue just once, so the same key and
value objects are reused for every record. If the value serializes out its entire allocated
memory (and not just what it last read), it would cause this problem.
(I have periodically shot myself in the foot with this bullet.)
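
To illustrate the failure mode I mean, here is a minimal sketch in plain Java with no Hadoop dependency. ReusableValue is a hypothetical class that stands in for a value type whose buffer grows but never shrinks; it is not the actual type from the job in question, and the record sizes are made up. One instance is created once and reused for every record, which is how I understand the createKey/createValue behaviour described above.

```java
import java.io.ByteArrayOutputStream;

public class ReusedValueDemo {
    // Hypothetical value type: stands in for a value object that is
    // created once and then reused for every record that is read.
    static final class ReusableValue {
        private byte[] buf = new byte[0]; // allocated storage, grows but never shrinks
        private int length = 0;           // number of valid bytes from the last read

        void read(byte[] record) {
            if (record.length > buf.length) {
                buf = new byte[record.length];
            }
            System.arraycopy(record, 0, buf, 0, record.length);
            length = record.length;
        }

        // Buggy: serializes the whole allocated buffer, stale bytes included.
        void writeBuggy(ByteArrayOutputStream out) {
            out.write(buf, 0, buf.length);
        }

        // Correct: serializes only the bytes that belong to the last record read.
        void writeCorrect(ByteArrayOutputStream out) {
            out.write(buf, 0, length);
        }
    }

    // Returns { bytes written by the buggy writer, bytes written by the correct one }.
    static int[] run(byte[][] records) {
        ReusableValue value = new ReusableValue(); // created once, reused below
        ByteArrayOutputStream buggy = new ByteArrayOutputStream();
        ByteArrayOutputStream correct = new ByteArrayOutputStream();
        for (byte[] record : records) {
            value.read(record);
            value.writeBuggy(buggy);
            value.writeCorrect(correct);
        }
        return new int[] { buggy.size(), correct.size() };
    }

    public static void main(String[] args) {
        // One large record followed by two small ones.
        byte[][] records = { new byte[100], new byte[10], new byte[10] };
        int[] sizes = run(records);
        System.out.println("buggy=" + sizes[0] + " correct=" + sizes[1]);
    }
}
```

With these made-up sizes the buggy writer emits 300 bytes and the correct one 120, because every record after the large one is padded out to the full buffer. If the value type in the job behaves like this, the usual fix is to write only the valid length rather than the backing array.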


From: Mike Forrest [mailto:mforrest@trailfire.com]
Sent: Thu 1/10/2008 2:51 PM
To: hadoop-user@lucene.apache.org
Subject: problem with IdentityMapper

I'm running into a problem where IdentityMapper seems to produce way too
much data.  For example, I have a job that reads a sequence file using
IdentityMapper and then uses IdentityReducer to write everything back
out to another sequence file.  My input is a ~60MB sequence file and
after the map phase has completed, the job tracker UI reports about 10GB
for "Map output bytes".  It seems like the output collector does not get
properly reset, and so each record that gets emitted has the correct key but
the value ends up being all the data you've encountered up to that
point.  I think this is a known issue but I can't seem to find any
discussion about it right now.  Has anyone else run into this, and if
so, is there a solution?  I'm using the latest code in the 0.15 branch.
