accumulo-notifications mailing list archives

From "Keith Turner (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (ACCUMULO-2677) Single node bottle neck during map reduce
Date Wed, 16 Apr 2014 15:10:16 GMT

     [ https://issues.apache.org/jira/browse/ACCUMULO-2677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Keith Turner updated ACCUMULO-2677:
-----------------------------------

    Description: 
While running the verification MapReduce job as part of the continuous ingest test, I noticed
the map phase was taking longer than expected.  I had run 24 hours of ingest and then verification.
There were 2048 tablets and ~32B entries.  The listscans output showed that many mappers were
reading from one node.  That single tserver was thrashing and had a much lower aggregate read
rate than tservers that had only a few mappers reading (about 35K KV/s vs. 150K KV/s).

Below is the output of listscans 

{noformat}
root@test160> listscans
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
    ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;121e33;120f25 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;554b7;553c5e |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h19m |    246ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;7f1e43e;7f0f3 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.25:55164 |   56m18s |     73ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;1bf149b;1be238 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h26m |    263ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;555a83;554b7 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60869 |    1h47m |    131ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;42e206d;42d2f |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:59576 |    4m31s |     71ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;225a77;224b6be |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35342 |     3h1m |    252ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;6587a;65789e |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h31m |    131ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;41f107;41e1f |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h29m |    350ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;423c6;422d5 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h37m |    344ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;424b6f;423c6 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h17m |    253ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;1d0048;1cf14 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.20:48103 |    3h12m |    277ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;400f13;4000000000000004 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h33m |    230ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;28b4f;28a5e2 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.15:39470 |    2h57m |    269ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;2787b;2778a8 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h27m |    449ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;430f37;43002ac |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:32894 |     1h9m |     31ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;224b6be;223c5c |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36351 |   49m54s |    263ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;5eb4fc;5ea5e7a |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.20:48227 |    2h46m |    116ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5b4b7;5b3c68e |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.15:39676 |    1h57m |    262ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h15m |    245ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;545a7f;544b7 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35745 |    1h52m |    231ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;417895c;41698 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h30m |    192ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h32m |    261ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;5f004fc;5ef13f |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42506 |    2h54m |    117ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;404b67;403c5 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h29m |     34ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;6bc3f;6bb4e5d |        [] | {}
    ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;4396ab;43879c |        [] | {}
    ip-10-1-2-18:9997 |      10.1.2.26:48600 |     2m2s |      5ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;4b1e32;4b0f22 |        [] | {}
    ip-10-1-2-20:9997 |      10.1.2.21:36546 |    2m18s |  7s920ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;601e91;600f83 |        [] | {}
{noformat}

Below is the output ~20 min later.

{noformat}
root@test160> listscans
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
    ip-10-1-2-15:9997 |      10.1.2.14:36125 |    3m10s |      3ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;1c4ba5a;1c3c9 |        [] | {}
    ip-10-1-2-14:9997 |      10.1.2.16:35327 |     5m9s |      1ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;1b0f58c;1b004c |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h54m |    509ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;422d5;421e3d |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h40m |    251ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;7f1e43e;7f0f3 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.25:55164 |    1h17m |     26ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;1bf149b;1be238 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h47m |    455ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;555a83;554b7 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60869 |     2h8m |    352ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;42e206d;42d2f |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:59576 |   25m43s |    112ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;225a77;224b6be |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35342 |    3h22m |    113ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;6587a;65789e |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h52m |    299ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;41f107;41e1f |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h50m |     71ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;423c6;422d5 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h58m |    160ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;424b6f;423c6 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h39m |    426ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;1d0048;1cf14 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h54m |    184ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;28b4f;28a5e2 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h48m |    263ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;430f37;43002ac |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:32894 |    1h30m |    163ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;224b6be;223c5c |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.16:36351 |    1h11m |    180ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5eb4fc;5ea5e7a |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.20:48227 |     3h7m |    317ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5b4b7;5b3c68e |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.15:39676 |    2h19m |    160ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h36m |    238ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;545a7f;544b7 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.27:35745 |    2h13m |    162ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;417895c;41698 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h51m |     72ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h53m |     27ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;5f004fc;5ef13f |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42506 |    3h16m |    268ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;404b67;403c5 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h51m |    239ms | QUEUED |SINGLE |    root |      ci |        [] |                     |3;6bc3f;6bb4e5d |        [] | {}
    ip-10-1-2-29:9997 |      10.1.2.15:56044 |  4s505ms |      3ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;602da;601e91 |        [] | {}
    ip-10-1-2-16:9997 |      10.1.2.21:50234 | 51s534ms |      9ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;20f12;20e218f |        [] | {}
    ip-10-1-2-16:9997 |      10.1.2.26:50232 |    3m10s |      5ms |RUNNING |SINGLE |    root |      ci |        [] |                     |3;5e4b66;5e3c5 |        [] | {}
    ip-10-1-2-16:9997 |      10.1.2.21:50206 |    3m47s |    285ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;1ad31;1ac400b |        [] | {}
    ip-10-1-2-28:9997 |      10.1.2.18:38857 | 42s643ms |      6ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;2ef13;2ee229 |        [] | {}
    ip-10-1-2-20:9997 |      10.1.2.20:44062 |    1m23s |  4s928ms |   IDLE |SINGLE |    root |      ci |        [] |                     |3;6b4b7;6b3c669 |        [] | {}
{noformat}
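
The imbalance in the two snapshots above is easier to see as a per-server tally: counting rows, ip-10-1-2-23 holds 26 of the 30 listed scans in the first snapshot and 23 of 31 in the second. A minimal Python sketch for producing such a tally from pasted listscans output follows. The helper name tally_scans and the SAMPLE text (a few rows copied from the first snapshot) are illustrative only, and the sketch assumes the column order shown above, with the tablet server first and STATE fifth.

```python
from collections import Counter, defaultdict

STATES = {"RUNNING", "QUEUED", "IDLE"}

def tally_scans(listscans_text):
    """Return {tablet server: Counter of scan states} for pasted listscans output."""
    per_server = defaultdict(Counter)
    for line in listscans_text.splitlines():
        fields = [f.strip() for f in line.split("|")]
        # A data row has the tablet server in column 0 and the state in column 4.
        # The header row and wrapped continuation lines fail this check and are skipped.
        if len(fields) >= 5 and fields[0] and fields[4] in STATES:
            per_server[fields[0]][fields[4]] += 1
    return per_server

# A few rows copied from the first snapshot, including one wrapped continuation line.
SAMPLE = """\
 TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  | USER
    ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING |SINGLE |    root
|      ci |        [] |                     |3;121e33;120f25 |        [] | {}
    ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING |SINGLE |    root
    ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED |SINGLE |    root
    ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE |SINGLE |    root
"""

tally = tally_scans(SAMPLE)
# Print servers with the most scans first.
for server, states in sorted(tally.items(), key=lambda kv: -sum(kv[1].values())):
    print(server, sum(states.values()), dict(states))
```

Only the first five columns are read, so the same function works whether or not the long rows have been wrapped by a mail client.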

I am not sure what caused things to get into this situation, but I have a theory.  While the
mappers were running, a single AWS node was rebooted for some reason.  This would have caused
tablets to migrate.  AccumuloInputFormat calculates its locality information up front, so if
tablets move, mappers will still run where the tablets used to be.  So maybe a slightly higher than
average number of mappers started reading from ip-10-1-2-23 as a result of the migration.  This
caused those mappers to run slower, and over time more mappers read from ip-10-1-2-23 and things
just snowballed.
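
The stale-locality step of this theory can be illustrated with a small Python sketch. Everything in it is hypothetical: the tablet and node names are made up, and the stale_placements helper does not use the AccumuloInputFormat API. It only models the idea that tablet locations are recorded once, when the job is planned, and are not refreshed after tablets migrate.

```python
from collections import Counter

def stale_placements(planned, current):
    """Return tablets whose mapper was placed using a location that is no longer valid."""
    return sorted(t for t, host in planned.items() if current.get(t) != host)

# Tablet locations recorded when the job was planned (hypothetical data).
planned = {"t1": "node-a", "t2": "node-a", "t3": "node-b", "t4": "node-c"}
# Tablet locations after node-a was rebooted and its tablets migrated (hypothetical data).
current = {"t1": "node-b", "t2": "node-b", "t3": "node-b", "t4": "node-c"}

stale = stale_placements(planned, current)
# Number of mappers that will actually read from each node after the migration.
readers = Counter(current[t] for t in planned)

print(stale)
print(dict(readers))
```

In this toy data the mappers for t1 and t2 still run on node-a but now read remotely from node-b, which ends up serving three of the four mappers.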

Regardless of how this situation occurred, Accumulo should handle it better when it does occur.
If a single tablet server has a much higher than average number of clients attempting to read for
long periods of time, then something should be done.  In this case the decision could not be
based on the read rate, because this tserver had a much lower read rate than other tservers
that only had 1 or 2 mappers reading.
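
One way to express the condition described above is to look at the number of long-running scans per tablet server instead of the read rate. The Python sketch below is only an illustration of that idea, not something Accumulo provides: the function name overloaded_servers, the factor and min_age_s thresholds, the server names, and the input shape are all assumptions.

```python
from collections import Counter

def overloaded_servers(scans, all_servers, factor=3.0, min_age_s=600):
    """Flag servers whose long-running scan count exceeds factor x the cluster average.

    scans: iterable of (server, age_seconds) pairs, one per active scan.
    all_servers: every tablet server, including those with no scans.
    """
    if not all_servers:
        return []
    # Count only scans that have been open for at least min_age_s seconds.
    long_running = Counter(server for server, age in scans if age >= min_age_s)
    average = sum(long_running.values()) / len(all_servers)
    return sorted(s for s in all_servers
                  if average > 0 and long_running[s] > factor * average)

# Hypothetical input: one server with six old scans, the others with at most one.
servers = ["tserver-a", "tserver-b", "tserver-c", "tserver-d"]
scans = [("tserver-a", 7200)] * 6 + [("tserver-b", 900), ("tserver-c", 120)]

flagged = overloaded_servers(scans, servers)
print(flagged)
```

Here the average is 7 long-running scans over 4 servers (1.75), so only tserver-a, with 6, exceeds the 3x threshold. Because the check counts clients rather than throughput, it still fires when the overloaded server's read rate is low.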


> Single node bottle neck during map reduce
> -----------------------------------------
>
>                 Key: ACCUMULO-2677
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2677
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.4.0
>         Environment: 1.6.0-RC2, Hadoop 2.2.0, AWS 20 node cluster
>            Reporter: Keith Turner
>             Fix For: 1.7.0
>
>
> While running the verification map reduce job as part of the continuous ingest test,
I noticed the map phase was taking longer than expected.  I had run 24 hours of ingest and
then verification.   There were 2048 tablets and ~32B entries.  List scans showed that a lot
of mappers were reading from one node.  That single tserver was thrashing and had a much lower
aggregate read rate than tservers that only had a few mappers reading (like ~35KV/s vs 150KV/s).
> Below is the output of listscans 
> {noformat}
> root@test160> listscans
>  TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  |
USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
>     ip-10-1-2-15:9997 |      10.1.2.14:35838 |    2m47s |      5ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;121e33;120f25 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h33m |    248ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;422d5;421e3d |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.18:60511 |    2h53m |    193ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;554b7;553c5e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h19m |    246ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;7f1e43e;7f0f3 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.25:55164 |   56m18s |     73ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;1bf149b;1be238 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h26m |    263ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;555a83;554b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60869 |    1h47m |    131ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;42e206d;42d2f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:59576 |    4m31s |     71ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;225a77;224b6be |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35342 |     3h1m |    252ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;6587a;65789e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h31m |    131ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;41f107;41e1f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h29m |    350ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;423c6;422d5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h37m |    344ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;424b6f;423c6 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h17m |    253ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;1d0048;1cf14 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.20:48103 |    3h12m |    277ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;400f13;4000000000000004 |        []
| {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h33m |    230ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;28b4f;28a5e2 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.15:39470 |    2h57m |    269ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;2787b;2778a8 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h27m |    449ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;430f37;43002ac |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:32894 |     1h9m |     31ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;224b6be;223c5c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36351 |   49m54s |    263ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;5eb4fc;5ea5e7a |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.20:48227 |    2h46m |    116ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5b4b7;5b3c68e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.15:39676 |    1h57m |    262ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h15m |    245ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;545a7f;544b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35745 |    1h52m |    231ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;417895c;41698 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h30m |    192ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h32m |    261ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;5f004fc;5ef13f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42506 |    2h54m |    117ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;404b67;403c5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h29m |     34ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;6bc3f;6bb4e5d |        [] | {}
>     ip-10-1-2-26:9997 |      10.1.2.16:45905 | 21s291ms |  6s841ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;4396ab;43879c |        [] | {}
>     ip-10-1-2-18:9997 |      10.1.2.26:48600 |     2m2s |      5ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;4b1e32;4b0f22 |        [] | {}
>     ip-10-1-2-20:9997 |      10.1.2.21:36546 |    2m18s |  7s920ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;601e91;600f83 |        [] | {}
> {noformat}
> Below is the output ~20 min later.
> {noformat}
> root@test160> listscans
>  TABLET SERVER        | CLIENT               | AGE      | LAST     | STATE  | TYPE  |
USER    | TABLE   | COLUMNS   | AUTHORIZATIONS      | TABLET    | ITERATORS  | ITERATOR OPTIONS
>     ip-10-1-2-15:9997 |      10.1.2.14:36125 |    3m10s |      3ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;1c4ba5a;1c3c9 |        [] | {}
>     ip-10-1-2-14:9997 |      10.1.2.16:35327 |     5m9s |      1ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;1b0f58c;1b004c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42586 |    2h54m |    509ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;422d5;421e3d |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40589 |    2h40m |    251ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;7f1e43e;7f0f3 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.25:55164 |    1h17m |     26ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;1bf149b;1be238 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42618 |    2h47m |    455ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;555a83;554b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60869 |     2h8m |    352ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;42e206d;42d2f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:59576 |   25m43s |    112ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;225a77;224b6be |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35342 |    3h22m |    113ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;6587a;65789e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36073 |    2h52m |    299ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;41f107;41e1f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.13:40526 |    2h50m |     71ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;423c6;422d5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.18:60560 |    2h58m |    160ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;424b6f;423c6 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.29:45044 |    1h39m |    426ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;1d0048;1cf14 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36053 |    2h54m |    184ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;28b4f;28a5e2 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.26:53819 |    3h48m |    263ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;430f37;43002ac |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:32894 |    1h30m |    163ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;224b6be;223c5c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.16:36351 |    1h11m |    180ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5eb4fc;5ea5e7a |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.20:48227 |     3h7m |    317ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5b4b7;5b3c68e |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.15:39676 |    2h19m |    160ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.14:58104 |    2h36m |    238ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;545a7f;544b7 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.27:35745 |    2h13m |    162ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;417895c;41698 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h51m |     72ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.17:60923 |    1h53m |     27ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;5f004fc;5ef13f |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.28:42506 |    3h16m |    268ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;404b67;403c5 |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40342 |    2h51m |    239ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;6bc3f;6bb4e5d |        [] | {}
>     ip-10-1-2-29:9997 |      10.1.2.15:56044 |  4s505ms |      3ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;602da;601e91 |        [] | {}
>     ip-10-1-2-16:9997 |      10.1.2.21:50234 | 51s534ms |      9ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;20f12;20e218f |        [] | {}
>     ip-10-1-2-16:9997 |      10.1.2.26:50232 |    3m10s |      5ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5e4b66;5e3c5 |        [] | {}
>     ip-10-1-2-16:9997 |      10.1.2.21:50206 |    3m47s |    285ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;1ad31;1ac400b |        [] | {}
>     ip-10-1-2-28:9997 |      10.1.2.18:38857 | 42s643ms |      6ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;2ef13;2ee229 |        [] | {}
>     ip-10-1-2-20:9997 |      10.1.2.20:44062 |    1m23s |  4s928ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;6b4b7;6b3c669 |        [] | {}
> {noformat}
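
The concentration of scans on one tablet server in the listings above can be tallied with a short script. This is only a sketch: it assumes the wrapped, "> "-quoted layout shown in this message (server, client, age, last, state and type on the first line of each row), and `tally_scans` is a hypothetical helper name, not part of Accumulo. The sample rows are copied from the second listing.

```python
import re
from collections import Counter

# A tablet server field looks like "host:port", e.g. "ip-10-1-2-23:9997".
SERVER_RE = re.compile(r"^[\w.-]+:\d+$")

def tally_scans(lines):
    """Count scans per (tablet server, state) in listscans text.

    Header lines and wrapped continuation lines are skipped because
    their first field is not of the form host:port.
    """
    counts = Counter()
    for line in lines:
        # Drop the mail quote prefix and padding, then split the columns.
        fields = [f.strip() for f in line.lstrip("> ").split("|")]
        if len(fields) >= 5 and SERVER_RE.match(fields[0]):
            counts[(fields[0], fields[4])] += 1
    return counts

sample = """\
>     ip-10-1-2-23:9997 |      10.1.2.15:39676 |    2h19m |    160ms |RUNNING |SINGLE |
   root |      ci |        [] |                     |3;5e96d;5e87bd |        [] | {}
>     ip-10-1-2-23:9997 |      10.1.2.22:40331 |    2h51m |     72ms | QUEUED |SINGLE |
   root |      ci |        [] |                     |3;54c3d4f;54b4c |        [] | {}
>     ip-10-1-2-29:9997 |      10.1.2.15:56044 |  4s505ms |      3ms |   IDLE |SINGLE |
   root |      ci |        [] |                     |3;602da;601e91 |        [] | {}
""".splitlines()

scan_counts = tally_scans(sample)
for (server, state), n in sorted(scan_counts.items()):
    print(server, state, n)
```

Run over the full second listing, the same tally would give the per-server totals directly instead of requiring a count by eye.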
> I am not sure what caused things to get into this situation, but I have a theory.  While
the mappers were running, a single AWS node was rebooted for some reason.  This would have
caused tablets to migrate.  AccumuloInputFormat calculates its locality information up front,
so if tablets move, mappers will run where the tablets used to be.  So maybe a slightly higher
than average number of tablets started being read from ip-10-1-2-23 as a result of the migration.
This caused those mappers to run slower, and over time more mappers read from ip-10-1-2-23
and things just snowballed.
> Regardless of how this situation occurred, Accumulo should handle it better when it does
occur.  If a single tablet server has a much higher than average number of clients attempting
to read for long periods of time, then something should be done.  In this case the decision
could not be based on the read rate, because this tserver had a much lower read rate than other
tservers that only had 1 or 2 mappers reading.
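
One way to express "a much higher than average number of clients" is to compare each tablet server's scan count against a multiple of the mean. The sketch below is an illustration only and not something Accumulo implements: `overloaded_servers` and its `factor` parameter are hypothetical, and the counts are tallied by hand from the second listscans output above, which lists only servers with at least one active scan.

```python
from statistics import mean

def overloaded_servers(scans_per_server, factor=3.0):
    """Return servers whose scan count exceeds factor * the mean count.

    `factor` is an assumed tuning knob; 3.0 is an arbitrary default.
    """
    if not scans_per_server:
        return []
    threshold = factor * mean(scans_per_server.values())
    return sorted(s for s, n in scans_per_server.items() if n > threshold)

# Scans per tablet server, counted from the second listscans output.
snapshot = {
    "ip-10-1-2-14:9997": 1,
    "ip-10-1-2-15:9997": 1,
    "ip-10-1-2-16:9997": 3,
    "ip-10-1-2-20:9997": 1,
    "ip-10-1-2-23:9997": 23,
    "ip-10-1-2-28:9997": 1,
    "ip-10-1-2-29:9997": 1,
}

flagged = overloaded_servers(snapshot)
print(flagged)
```

The check deliberately uses scan counts rather than read rate, since the overloaded tserver here had the lowest read rate. To cover the "for long periods of time" condition, a server would have to stay flagged across several snapshots taken minutes apart before anything is done.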



--
This message was sent by Atlassian JIRA
(v6.2#6252)
