accumulo-notifications mailing list archives

From "Adam J Shook (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (ACCUMULO-4506) Some in-progress files for replication never replicate
Date Thu, 03 Nov 2016 22:18:59 GMT

    [ https://issues.apache.org/jira/browse/ACCUMULO-4506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15634471#comment-15634471 ]

Adam J Shook edited comment on ACCUMULO-4506 at 11/3/16 10:18 PM:
------------------------------------------------------------------

There are two znodes under the {{locks}} node, one for each file.  They belong to different
tservers, which I identified by matching each lock's {{ephemeralOwner}} against the
{{ephemeralOwner}} values on the lock nodes under {{tservers}}.

{noformat}
[zk: host:2181(CONNECTED) 9] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/replication/workqueue/locks/ae4b03ec-159b-44e8-9a88-ccf7fa849c19|peer_instance|5h|k
ephemeralOwner = 0x357d1bf618f80ad

[zk: host:2181(CONNECTED) 14] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/replication/workqueue/locks/9f038f64-4252-44a0-bfd0-99d4a316b397|peer_instance|5g|j
ephemeralOwner = 0x357d1bf618f4f72

[zk: host:2181(CONNECTED) 12] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/tservers/host:31658/zlock-0000000000
TSERV_CLIENT=host:31658
ephemeralOwner = 0x357d1bf618f80ad

[zk: host:2181(CONNECTED) 13] get /accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7/tservers/host:31368/zlock-0000000000
TSERV_CLIENT=host:31368
ephemeralOwner = 0x357d1bf618f4f72
{noformat}
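
As an aside, the same {{ephemeralOwner}} matching can be done programmatically with a small
loop over the lock znodes using the plain ZooKeeper client.  This is only a sketch -- the
connect string, instance path, and lock name are copied from the output above, and the class
name is made up for illustration:

{code:java}
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LockOwnerCheck {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("host:2181", 30000, event -> {});
        String root = "/accumulo/43d2ac5e-0df0-4727-aba9-05bae9e908e7";

        // Session that owns the work-queue lock for one of the stuck files
        Stat lockStat = new Stat();
        zk.getData(root + "/replication/workqueue/locks/"
            + "ae4b03ec-159b-44e8-9a88-ccf7fa849c19|peer_instance|5h|k", false, lockStat);

        // Walk each tserver's zlock nodes and compare session ids
        for (String tserver : zk.getChildren(root + "/tservers", false)) {
            for (String lock : zk.getChildren(root + "/tservers/" + tserver, false)) {
                Stat s = new Stat();
                zk.getData(root + "/tservers/" + tserver + "/" + lock, false, s);
                if (s.getEphemeralOwner() == lockStat.getEphemeralOwner()) {
                    System.out.println("lock held by " + tserver);
                }
            }
        }
        zk.close();
    }
}
{code}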

Unfortunately, we don't keep logs around long enough to see when these files were initially
assigned.  We only have data back to October 27th -- a Kibana search for the WAL UUIDs only
returns log entries from the Master and the GC.

For what it's worth, we've been trying out replication and are seeing some behavior we can't
really explain without digging into it a lot more (source code included).  The time between
a WAL file being closed and it actually being replicated is much longer than I would expect
-- anywhere from five minutes to a couple of hours.  I see a lot of log entries saying work
is being scheduled, but it takes a while before the work is actually done.  This particular
cluster has four tablet servers, and there are always 40-60 files pending replication, with
files rarely "in-progress" (besides these two problematic files).  Replication seems to happen
in waves, and I haven't put my finger on what moves files from pending to in-progress.  With
that said, things *are* replicating; it is just taking longer than we anticipated.  I'm not
sure if this is expected behavior or if something else is going on.


>  Some in-progress files for replication never replicate
> -------------------------------------------------------
>
>                 Key: ACCUMULO-4506
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4506
>             Project: Accumulo
>          Issue Type: Bug
>          Components: replication
>    Affects Versions: 1.7.2
>            Reporter: Adam J Shook
>
> We're seeing an issue with replication where two files have been in-progress for a long
> time and, based on the logs, are not going to be replicated.  The metadata from the
> {{accumulo.replication}} table looks a little funky, with a very large {{begin}} value
> ({{Long.MAX_VALUE}}).
> *Logs*
> {noformat}
> 2016-11-02 19:52:50,900 [replication.DistributedWorkQueueWorkAssigner] DEBUG: Not queueing work for hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 to Remote Name: peer_instance Remote identifier: 5h Source Table ID: k because [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365827] doesn't need replication
> 2016-11-02 19:53:08,900 [replication.DistributedWorkQueueWorkAssigner] DEBUG: Not queueing work for hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 to Remote Name: peer_instance Remote identifier: 5i Source Table ID: l because [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052816174] doesn't need replication
> {noformat}
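> As a side note on why the assigner says "doesn't need replication": {{begin}} in both entries
> is {{Long.MAX_VALUE}}, and with {{infiniteEnd: true}} that leaves an empty range of data left
> to send, so the file is treated as fully replicated.  A rough illustration of that check (the
> class and method names here are hypothetical, not Accumulo's actual code):
> {code:java}
> public class StatusCheck {
>     // Re-statement of the Status semantics visible in the logs above:
>     // begin is the offset already replicated; with an infinite end there
>     // is data left to send only while begin can still advance.
>     static boolean needsReplication(long begin, long end, boolean infiniteEnd) {
>         if (infiniteEnd) {
>             return begin < Long.MAX_VALUE; // begin == MAX_VALUE -> nothing left
>         }
>         return begin < end;
>     }
>
>     public static void main(String[] args) {
>         // Values taken from the two log entries above
>         System.out.println(needsReplication(9223372036854775807L, 0L, true)); // false -> skipped
>         System.out.println(needsReplication(0L, 0L, true));                   // true  -> queued
>     }
> }
> {code}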
> *Replication table*
> {noformat}
> scan -r hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 -t accumulo.replication
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 repl:j []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477314369633]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 repl:k []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365827]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 repl:l []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365707]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025g\x01\x00\x00\x00\x01j []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477314369633]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025h\x01\x00\x00\x00\x01k []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365827]
> hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025i\x01\x00\x00\x00\x01l []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477314365707]
> scan -r hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 -t accumulo.replication
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 repl:j []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052819752]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 repl:k []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477052816238]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 repl:l []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052816174]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025g\x01\x00\x00\x00\x01j []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052819752]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025h\x01\x00\x00\x00\x01k []    [begin: 0 end: 0 infiniteEnd: true closed: true createdTime: 1477052816238]
> hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19 work:\x01\x00\x00\x00\x17peer_instance\x01\x00\x00\x00\x025i\x01\x00\x00\x00\x01l []    [begin: 9223372036854775807 end: 0 infiniteEnd: true closed: true createdTime: 1477052816174]
> {noformat}
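> For anyone reproducing this, the same {{Status}} values can be pulled out programmatically
> instead of eyeballing the scan output.  A sketch against the 1.7 client API -- the instance
> name and credentials are placeholders, and it assumes the server-side protobuf class
> {{Replication.Status}} is on the classpath:
> {code:java}
> import java.util.Map.Entry;
>
> import org.apache.accumulo.core.client.Connector;
> import org.apache.accumulo.core.client.Scanner;
> import org.apache.accumulo.core.client.ZooKeeperInstance;
> import org.apache.accumulo.core.client.security.tokens.PasswordToken;
> import org.apache.accumulo.core.data.Key;
> import org.apache.accumulo.core.data.Range;
> import org.apache.accumulo.core.data.Value;
> import org.apache.accumulo.core.security.Authorizations;
> import org.apache.accumulo.server.replication.proto.Replication.Status;
>
> public class ReplStatusScan {
>     public static void main(String[] args) throws Exception {
>         Connector conn = new ZooKeeperInstance("instance", "host:2181")
>             .getConnector("root", new PasswordToken("secret"));
>         Scanner s = conn.createScanner("accumulo.replication", Authorizations.EMPTY);
>         s.setRange(Range.exact("hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397"));
>         for (Entry<Key,Value> e : s) {
>             // Each value is a protobuf-encoded Status -- the same thing the
>             // formatter prints as [begin: ... end: ... infiniteEnd: ...]
>             Status status = Status.parseFrom(e.getValue().get());
>             System.out.println(e.getKey().getColumnFamily() + " begin=" + status.getBegin()
>                 + " infiniteEnd=" + status.getInfiniteEnd() + " closed=" + status.getClosed());
>         }
>     }
> }
> {code}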
> *HDFS*
> {noformat}
> hdfs dfs -ls hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397 hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19
> -rwxr-xr-x   3 ubuntu supergroup 1117650900 2016-10-24 13:09 hdfs://host:9000/accumulo/wal/host+31032/9f038f64-4252-44a0-bfd0-99d4a316b397
> -rwxr-xr-x   3 ubuntu supergroup 1171968390 2016-10-21 12:31 hdfs://host:9000/accumulo/wal/host+31368/ae4b03ec-159b-44e8-9a88-ccf7fa849c19
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
