hadoop-common-user mailing list archives

From Shrinivas Joshi <jshrini...@gmail.com>
Subject questions on map-side spills
Date Thu, 31 Mar 2011 16:29:41 GMT
I am running TeraSort on an Apache Hadoop 0.21.0 build. io.sort.mb is 360M,
map.sort.spill.percent is 0.8, and dfs.blocksize is 256M. I am having some
difficulty understanding the spill-related decisions from the log files. Here
are the relevant log lines:

2011-03-30 13:46:51,591 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR) 0
kvi 94371836(377487344)
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask:
mapreduce.task.io.sort.mb: 360
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask: soft limit at
301989888
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0;
bufvoid = 377487360
2011-03-30 13:46:51,592 INFO org.apache.hadoop.mapred.MapTask: kvstart =
94371836; length = 23592960
2011-03-30 13:47:05,528 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output
2011-03-30 13:47:05,528 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0;
bufend = 261042174; bufvoid = 377487360
2011-03-30 13:47:05,528 INFO org.apache.hadoop.mapred.MapTask: kvstart =
94371836(377487344); kvend = 84134892(336539568); length = 10236945/23592960
2011-03-30 13:47:05,529 INFO org.apache.hadoop.mapred.MapTask: (EQUATOR)
271279102 kvi 67819768(271279072)
2011-03-30 13:47:06,355 INFO org.apache.hadoop.mapred.MapTask: Starting
flush of map output
2011-03-30 13:47:20,822 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded
the native-hadoop library
2011-03-30 13:47:20,824 INFO org.apache.hadoop.io.compress.zlib.ZlibFactory:
Successfully loaded & initialized native-zlib library
2011-03-30 13:47:20,825 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new compressor
2011-03-30 13:47:54,317 INFO org.apache.hadoop.mapred.MapTask: *Finished
spill 0*
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: (RESET)
equator 271279102 kv 67819768(271279072) kvi 66442776(265771104)
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: Spilling map
output
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: bufstart =
271279102; bufend = 306392398; bufvoid = 377487360
2011-03-30 13:47:54,318 INFO org.apache.hadoop.mapred.MapTask: kvstart =
67819768(271279072); kvend = 66442780(265771120); length = 1376989/23592960
2011-03-30 13:48:00,198 INFO org.apache.hadoop.mapred.MapTask: *Finished
spill 1*
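
The relationships among the logged numbers can be checked with a short script. This is a minimal sketch, not taken from the Hadoop source: every constant is copied from the log lines above, and the function names are hypothetical helpers introduced for illustration. It checks that bufvoid equals io.sort.mb in bytes, that the soft limit is 80% of bufvoid, that each value in parentheses is 4 times the index printed before it, and that the logged maximum length equals bufvoid divided by 16.

```python
# Values copied from the log lines above.
IO_SORT_MB = 360
SPILL_PERCENT = 80      # map.sort.spill.percent = 0.8, kept as an integer percentage
BUFVOID = 377487360
SOFT_LIMIT = 301989888
MAX_LENGTH = 23592960

# (index, value in parentheses) pairs as logged for kvstart, kvend and kvi.
INDEX_PAIRS = [
    (94371836, 377487344),
    (84134892, 336539568),
    (67819768, 271279072),
    (66442776, 265771104),
    (66442780, 265771120),
]


def buffer_bytes(io_sort_mb):
    """Size in bytes of a buffer of io_sort_mb mebibytes."""
    return io_sort_mb * 1024 * 1024


def soft_limit(total_bytes, percent):
    """Integer arithmetic avoids floating-point rounding of 0.8."""
    return total_bytes * percent // 100


def parens_are_four_times_index(pairs):
    """True if every parenthesized value equals 4 * the index before it."""
    return all(paren == 4 * index for index, paren in pairs)


print("bufvoid matches io.sort.mb:", buffer_bytes(IO_SORT_MB) == BUFVOID)
print("soft limit is 80% of bufvoid:", soft_limit(BUFVOID, SPILL_PERCENT) == SOFT_LIMIT)
print("parenthesized values are 4 * index:", parens_are_four_times_index(INDEX_PAIRS))
print("length equals bufvoid / 16:", BUFVOID // 16 == MAX_LENGTH)
```

All four checks hold for the numbers above. Note that the soft limit in the log is computed from the full 360M, not from 360M minus a reserved metadata region.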

A couple of questions:

   - The log reports length = 23592960 for records. Does that mean it is
   setting aside 23592960 * 4 bytes (90M) for storing spilled-record metadata?
   Or is it 23592960/(1024*1024) = 22.5M?
   - Why is it triggering 2 spills? At the first spill, it looks like 248.94M
   (bufend = 261042174) of intermediate map output has been generated. If 90M
   is reserved for record metadata, then (360M - 90M) * 0.8 = 216M is less than
   the map output size, and the spill should have been triggered earlier. If
   22.5M is reserved for record metadata, then (360M - 22.5M) * 0.8 = 270M, so
   there is still room in the io.sort buffer. Maybe the changes in
   https://issues.apache.org/jira/browse/MAPREDUCE-64 rely on dynamic
   information, and the straightforward calculations I am using here are
   incorrect?
   - Is there any value in simplifying the spill-related debug output for
   general users who might not have insight into the Hadoop source code?
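
The arithmetic in the second question can be laid out next to the logged values. The sketch below is an observation on the numbers only, not a confirmed description of how MapTask decides to spill. It ASSUMES that record data and record metadata share the single 360M buffer and that the soft limit applies to their combined size. It also assumes that kvstart and kvend are indices in 4-byte units, which the parenthesized byte values in the log suggest. All names are hypothetical.

```python
MIB = 1024 * 1024

# Values copied from the first "Spilling map output" log lines.
SOFT_LIMIT = 301989888
BUFSTART = 0
BUFEND = 261042174
KVSTART = 94371836
KVEND = 84134892


def candidate_threshold_mib(total_mib, reserved_mib, spill_fraction):
    """Threshold if a fixed metadata region were reserved up front."""
    return (total_mib - reserved_mib) * spill_fraction


# The two candidate thresholds from the question, in MiB.
print("90M reserved:  ", candidate_threshold_mib(360, 90, 0.8))
print("22.5M reserved:", candidate_threshold_mib(360, 22.5, 0.8))

# ASSUMPTION: data and metadata occupy the same buffer, so their sizes add.
data_bytes = BUFEND - BUFSTART
meta_bytes = (KVSTART - KVEND) * 4      # index distance in 4-byte units
total_bytes = data_bytes + meta_bytes

print("data bytes:      ", data_bytes)
print("metadata bytes:  ", meta_bytes)
print("data + metadata: ", total_bytes)
print("soft limit:      ", SOFT_LIMIT)
print("over soft limit: ", total_bytes - SOFT_LIMIT)
```

Under these assumptions, the data plus metadata at the first spill comes to 301989950 bytes, which is 62 bytes above the logged soft limit of 301989888. That would be consistent with a spill triggered once the combined usage crosses 80% of the whole buffer, with neither 90M nor 22.5M reserved in advance, but the log alone does not prove it.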

Thanks,
-Shrinivas
