hadoop-common-user mailing list archives

From David Howell <dehow...@gmail.com>
Subject inappropriate input splits
Date Thu, 17 Dec 2009 18:38:16 GMT
I just discovered some odd behavior on my cluster. If I start a
mapreduce job with input files much smaller than my block size, each
input file is translated into two input splits containing identical
content. This, in effect, doubles every single record I try to
process. If I manually set mapred.min.split.size to my block size, I'm
back to one split per file, as I expected.

The input files are gzipped text, and I'm processing them with Hadoop Streaming.
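For reference, Hadoop's old-API FileInputFormat sizes splits as max(minSize, min(goalSize, blockSize)), where goalSize is totalSize divided by the requested number of maps. A minimal sketch of that arithmetic (the 10 MB file size, 64 MB block size, and two-map goal below are hypothetical numbers, not taken from this report) shows why a tiny default min split size can cut one small file into two splits; and since gzip is not splittable, a record reader for each split would decompress the file from the beginning, duplicating every record:

```java
public class SplitSizeDemo {

    // Same formula as FileInputFormat.computeSplitSize in the old mapred API.
    public static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // hypothetical 64 MB HDFS block
        long fileSize  = 10L * 1024 * 1024; // hypothetical 10 MB gzipped input file
        long goalSize  = fileSize / 2;      // framework aiming for 2 map tasks

        // Default mapred.min.split.size is 1: split size = 5 MB -> 2 splits per file.
        long defaultSplit = computeSplitSize(goalSize, 1L, blockSize);
        System.out.println("splits with default minSize: " + (fileSize / defaultSplit));

        // minSize raised to the block size: split size = 64 MB -> 1 split per file.
        long fixedSplit = computeSplitSize(goalSize, blockSize, blockSize);
        System.out.println("splits with minSize = blockSize: "
                + ((fileSize + fixedSplit - 1) / fixedSplit));
    }
}
```

With a non-splittable codec both of those 5 MB splits would map onto the same gzip stream, which matches the doubled records described above; raising mapred.min.split.size to the block size collapses them back into one split.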

Any debugging suggestions?

