hadoop-common-user mailing list archives

From Harsh J <ha...@cloudera.com>
Subject Re: Mappers and Reducer not being called, but no errors indicated
Date Thu, 10 Nov 2011 16:59:57 GMT
Hey Andy,


On 10-Nov-2011, at 10:03 PM, Andy Doddington wrote:

> Thanks for your kind words - it still feels like pulling teeth at times :-(
> Following on from your comments, here are a few more questions - hope you don’t find
> them too dumb…
> 1) How does each mapper ‘know’ which file name to associate itself with?

When the client submits a job to the JobTracker (JT), it writes a special file to HDFS that,
put simply, carries an array of input splits: each entry is a filename plus the offset and
length of the range to read (this is called a 'FileSplit'). The JobTracker uses this array to
determine the number of tasks and such, and each scheduled mapper later looks up the entry at
its own index (map [0] gets file a, map [1] gets file b, etc.) and initializes its reads
accordingly. Again, this is a simplified explanation -- the truth is slightly more complicated,
but this is how the mechanism works (pull, not push).
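
To make the pull model concrete, here is a minimal sketch in plain Java. It does not use the
Hadoop API at all: `SplitTable`, `Split`, `addFile` and `splitFor` are hypothetical names
invented for illustration, and the fixed split size is an assumption. It only shows the idea
of the client building an indexed list of (file, offset, length) entries and each map task
pulling the entry at its own index.

```java
import java.util.ArrayList;
import java.util.List;

class SplitTable {
    /** Hypothetical stand-in for a file split: a path plus a byte range. */
    static final class Split {
        final String path;
        final long offset;
        final long length;

        Split(String path, long offset, long length) {
            this.path = path;
            this.offset = offset;
            this.length = length;
        }
    }

    // The "array of splits" the client would write out at submit time.
    private final List<Split> splits = new ArrayList<>();

    /** Client side: cut one file into ranges of at most splitSize bytes. */
    void addFile(String path, long fileLength, long splitSize) {
        for (long off = 0; off < fileLength; off += splitSize) {
            long len = Math.min(splitSize, fileLength - off);
            splits.add(new Split(path, off, len));
        }
    }

    /** Scheduler side: one map task per split. */
    int numTasks() {
        return splits.size();
    }

    /** Task side: map task i pulls entry i; nothing is pushed to it. */
    Split splitFor(int taskIndex) {
        return splits.get(taskIndex);
    }
}
```

With a 100-byte file and a 50-byte file added at a split size of 64, this table holds three
entries, so three map tasks would be scheduled, and task 1 would read the last 36 bytes of the
first file.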

> 2) Is it important that I name my files part<n> or will any unique name suffice?

HDFS is like any other filesystem. Filenames do not matter.

The "part" is short for "partition", and the default APIs use it for output files to indicate
that each part-XXXXX file is one partition of the whole output. It is just the terminology MR
uses by default (and, like most things, the default output name is configurable as well).

Naming with numbers does get you sorting for free, though, when you list the files in a directory.
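
A small sketch of why the numbering helps, assuming the listing is sorted as plain strings:
zero-padded names like part-XXXXX sort in numeric order, while unpadded ones do not. The
class and method names below are hypothetical and nothing here touches HDFS.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class PartNames {
    /** Five-digit, zero-padded name in the part-XXXXX style. */
    static String padded(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    /** Same name without padding, for comparison. */
    static String unpadded(int partition) {
        return "part-" + partition;
    }

    /** Build names for partitions 0..count-1 and sort them as plain strings. */
    static List<String> sortedNames(int count, boolean pad) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            names.add(pad ? padded(i) : unpadded(i));
        }
        Collections.sort(names);
        return names;
    }
}
```

For 11 partitions, the third name in the padded listing is part-00002, whereas in the unpadded
listing it is part-10, which sorts ahead of part-2.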

> 3) I’m using binary serialisation with Sequence files - are these ‘split’ across
> multiple mappers? What happens if the split occurs in the middle of a binary object?

A record will never be split between mappers. This is guaranteed. See the second paragraph of
the 'Map' section at http://wiki.apache.org/hadoop/HadoopMapReduce to understand how this is
ensured.

For sequence files, instead of 'newlines', there are 'magic' byte markers (sync markers) that
serve the same purpose: they let a record reader align itself to a proper starting point, one
that is not in the middle of a record. These markers are already placed at regular intervals
in your sequence file.
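
Here is a rough sketch of that alignment idea in plain Java. The byte layout is a toy one
made up for this example (a one-byte length followed by an ASCII payload, with a 4-byte marker
between groups of records); it is not the real SequenceFile format, and `SyncSplitSketch`,
`build` and `readSplit` are hypothetical names. The assumed rule is: a reader whose split does
not start at byte 0 skips forward to the next marker, and every reader keeps going until it
meets a marker at or past the end of its split.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SyncSplitSketch {
    // Toy marker. 0xFF never appears in the length bytes or ASCII payloads below.
    static final byte[] SYNC = {(byte) 0xFF, 'S', 'Y', 'N'};

    /** Write records, inserting a marker before every syncEvery-th record. */
    static byte[] build(List<String> records, int syncEvery) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < records.size(); i++) {
            if (i > 0 && i % syncEvery == 0) {
                out.write(SYNC, 0, SYNC.length);
            }
            byte[] payload = records.get(i).getBytes(StandardCharsets.US_ASCII);
            out.write(payload.length); // single length byte: payloads must be < 128 bytes
            out.write(payload, 0, payload.length);
        }
        return out.toByteArray();
    }

    static boolean isSyncAt(byte[] file, int pos) {
        if (pos + SYNC.length > file.length) {
            return false;
        }
        for (int i = 0; i < SYNC.length; i++) {
            if (file[pos + i] != SYNC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Read the whole records that belong to the byte range [start, end). */
    static List<String> readSplit(byte[] file, int start, int end) {
        int pos = start;
        if (start > 0) {
            // Possibly mid-record: skip ahead to the next marker.
            while (pos < file.length && !isSyncAt(file, pos)) {
                pos++;
            }
        }
        List<String> records = new ArrayList<>();
        while (pos < file.length) {
            if (isSyncAt(file, pos)) {
                if (pos >= end) {
                    break; // records after this marker belong to the next split
                }
                pos += SYNC.length;
                continue;
            }
            int len = file[pos];
            records.add(new String(file, pos + 1, len, StandardCharsets.US_ASCII));
            pos += 1 + len;
        }
        return records;
    }
}
```

If six short records are written with a marker every two records and the file is cut at a
byte offset that lands inside the third record, the first reader still returns that third
record whole (it reads on until the next marker), and the second reader starts at that same
marker, so no record is lost, duplicated or cut in half.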

> Current state of play is that the mappers are being called the correct number of times
> and are generating the correct result for the first half of the number of mappers (e.g. ~50
> out of 100 mappers, running small test), but are then generating bad results after that. The
> reducer is then correctly selecting the minimum - it just happens to be a bad value due to
> the mapper problem. Ho hum…

Unfortunately I have no clue what you are talking about here. Looks like a key/val data issue
to me by the sound of it. Perhaps bad partitioning/grouping is happening as a result of that.

P.S. If it's better that way for you, you can also contact me off-list.
