hadoop-common-dev mailing list archives

From "Sameer Paranjpye (JIRA)" <j...@apache.org>
Subject [jira] Updated: (HADOOP-253) we need speculative execution for reduces
Date Wed, 31 May 2006 00:39:30 GMT
     [ http://issues.apache.org/jira/browse/HADOOP-253?page=all ]

Sameer Paranjpye updated HADOOP-253:
------------------------------------

    Fix Version: 0.4

> we need speculative execution for reduces
> -----------------------------------------
>
>          Key: HADOOP-253
>          URL: http://issues.apache.org/jira/browse/HADOOP-253
>      Project: Hadoop
>         Type: Improvement
>   Components: mapred
>     Versions: 0.2.1
>     Reporter: Owen O'Malley
>     Assignee: Owen O'Malley
>      Fix For: 0.4
>
> With my new http-based shuffle (on top of the svn head including Sameer's parallel fetch),
> I just finished sorting 2010 GB on 200 nodes in 8:49 with 9 reduce failures. However, the
> amusing part is that the replacement reduces were _not_ the slow ones. 8 of the original
> reduces were the only things running for the last hour. The job timings looked like:
> Job 0001
>   Total:
>     Tasks: 16551
>     Total: 10056104 secs
>     Average: 607 secs
>     Worst: task_0001_r_000291_0
>     Worst time: 31050 secs
>     Best: task_0001_m_013597_0
>     Best time: 20 secs
>   Maps:
>     Tasks: 16151
>     Total: 2762635 secs
>     Average: 171 secs
>     Worst: task_0001_m_002290_0
>     Worst time: 2663 secs
>     Best: task_0001_m_013597_0
>     Best time: 20 secs
>   Reduces:
>     Tasks: 400
>     Total: 7293469 secs
>     Average: 18233 secs
>     Worst: task_0001_r_000291_0
>     Worst time: 31050 secs
>     Best: task_0001_r_000263_1
>     Best time: 5591 secs
> And the number of tasks run per node was very uneven:
> #tasks node
> 124 node1161
> 117 node1307
> 117 node1124
> 116 node1253
> 114 node1310
> 111 node1302
> 111 node1299
> 111 node1298
> 111 node1249
> 111 node1221
> 110 node1288
> 110 node1286
> 110 node1211
> 109 node1268
> 108 node1292
> 108 node1202
> 108 node1200
> 107 node1313
> 107 node1277
> 107 node1246
> 107 node1242
> 107 node1231
> 107 node1214
> 106 node1243
> 105 node1251
> 105 node1212
> 105 node1205
> 104 node1272
> 104 node1269
> 104 node1210
> 104 node1203
> 104 node1193
> 104 node1128
> 103 node1300
> 103 node1285
> 103 node1279
> 103 node1209
> 103 node1173
> 103 node1165
> 102 node1276
> 102 node1239
> 102 node1228
> 102 node1204
> 102 node1188
> 101 node1314
> 101 node1303
> 100 node1301
> 100 node1252
> 99 node1287
> 99 node1213
> 99 node1206
> 98 node1295
> 98 node1186
> 97 node1293
> 97 node1265
> 97 node1262
> 97 node1260
> 97 node1258
> 97 node1235
> 97 node1229
> 97 node1226
> 97 node1215
> 97 node1208
> 97 node1187
> 97 node1175
> 97 node1171
> 96 node1291
> 96 node1248
> 96 node1224
> 96 node1216
> 95 node1305
> 95 node1280
> 95 node1263
> 95 node1254
> 95 node1153
> 95 node1115
> 94 node1271
> 94 node1261
> 94 node1234
> 94 node1233
> 94 node1227
> 94 node1225
> 94 node1217
> 94 node1142
> 93 node1275
> 93 node1198
> 93 node1107
> 92 node1266
> 92 node1220
> 92 node1219
> 91 node1309
> 91 node1289
> 91 node1270
> 91 node1259
> 91 node1256
> 91 node1232
> 91 node1179
> 89 node1290
> 89 node1255
> 89 node1247
> 89 node1207
> 89 node1201
> 89 node1190
> 89 node1154
> 89 node1141
> 88 node1306
> 88 node1282
> 88 node1250
> 88 node1222
> 88 node1184
> 88 node1149
> 88 node1117
> 87 node1278
> 87 node1257
> 87 node1191
> 87 node1185
> 87 node1180
> 86 node1297
> 86 node1178
> 85 node1195
> 85 node1143
> 85 node1112
> 84 node1281
> 84 node1274
> 84 node1264
> 83 node1296
> 83 node1148
> 82 node1218
> 82 node1168
> 82 node1167
> 81 node1311
> 81 node1240
> 81 node1223
> 81 node1196
> 81 node1164
> 81 node1116
> 80 node1267
> 80 node1230
> 80 node1177
> 80 node1119
> 79 node1294
> 79 node1199
> 79 node1181
> 79 node1170
> 79 node1166
> 79 node1103
> 78 node1244
> 78 node1189
> 78 node1157
> 77 node1304
> 77 node1172
> 74 node1182
> 71 node1160
> 71 node1147
> 68 node1236
> 68 node1183
> 67 node1245
> 59 node1139
> 58 node1312
> 57 node1162
> 56 node1308
> 56 node1197
> 55 node1146
> 54 node1106
> 53 node1111
> 53 node1105
> 49 node1145
> 49 node1123
> 48 node1176
> 46 node1136
> 44 node1132
> 44 node1125
> 44 node1122
> 44 node1108
> 43 node1192
> 43 node1121
> 42 node1194
> 42 node1138
> 42 node1104
> 41 node1155
> 41 node1126
> 41 node1114
> 40 node1158
> 40 node1151
> 40 node1137
> 40 node1110
> 40 node1100
> 39 node1156
> 38 node1140
> 38 node1135
> 38 node1109
> 37 node1144
> 37 node1120
> 36 node1118
> 34 node1133
> 34 node1113
> 31 node1134
> 26 node1127
> 23 node1101
> 20 node1131
> And it should not surprise us that the last 8 reduces were running on nodes 1134, 1127,
> 1101, and 1131. This really demonstrates the need to run speculative reduces.
> I propose that we start running speculative reduces once the number of reduces still running
> drops to half the cluster size. I estimate that it would have saved around an hour on this
> run. Does that sound like a reasonable heuristic?

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

