hadoop-hdfs-issues mailing list archives

From "Hari Sekhon (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8341) HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't move rest of other data blocks
Date Fri, 18 Sep 2015 15:59:04 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14875829#comment-14875829 ]

Hari Sekhon commented on HDFS-8341:
-----------------------------------

[~szetszwo] No, I meant the original log in the description at the top, where
{code}balancer.Dispatcher: Failed to move blk_1075156654_1438349{code} repeats over and over
in the output, which is what made me think it was looping on the same block.
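
One way to check for this pattern in captured mover output is to count the Dispatcher "Failed to move" warnings per block ID. The following is a minimal sketch, not part of Hadoop; the function name and the regular expression are assumptions based only on the log lines quoted in this issue.

```python
import re
from collections import Counter

# Matches the block ID in Dispatcher warnings such as
# "WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with"
# (pattern is an assumption derived from the log lines in this issue).
FAILED_MOVE = re.compile(r"balancer\.Dispatcher: Failed to move (blk_\d+_\d+)")


def count_failed_moves(log_text):
    """Return a Counter mapping block ID to its number of failed-move warnings."""
    return Counter(FAILED_MOVE.findall(log_text))


# Two warning lines taken from the log in the issue description.
sample = """\
15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with
15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with
"""

counts = count_failed_moves(sample)
for block, n in counts.most_common():
    print(block, n)
```

A single block ID with a steadily growing count, and no other block IDs appearing, would be consistent with the mover retrying the same block.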

There's only 1 replica for each block, so it's not iterating over locations as the code snippet
you are pointing to suggests, since there are no other locations to try. Instead it appears to
exit and then restart at the same block, which still has no uncorrupted replicas available,
exit again, restart at the same block again, and so on.
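
The behaviour described above can be illustrated with a toy model. This is only a sketch of the symptom as I understand it, not the actual Mover or Dispatcher code; the function `migrate`, the `skip_failed` flag and the block names are all hypothetical.

```python
def migrate(pending, corrupt, max_passes, skip_failed):
    """Toy model: each pass starts at the first pending block.

    Returns a tuple (moved, skipped, still_pending).
    """
    pending = list(pending)
    moved, skipped = [], []
    for _ in range(max_passes):
        if not pending:
            break
        block = pending[0]
        if block not in corrupt:
            # The move succeeds and the block leaves the pending list.
            moved.append(pending.pop(0))
        elif skip_failed:
            # Hypothetical alternative: set the failing block aside and go on.
            skipped.append(pending.pop(0))
        # Otherwise the pass ends and the next pass starts at the same block.
    return moved, skipped, pending


blocks = ["blk_A", "blk_B", "blk_C"]  # hypothetical block names
corrupt = {"blk_A"}                   # only replica is unreadable

stuck = migrate(blocks, corrupt, max_passes=5, skip_failed=False)
fixed = migrate(blocks, corrupt, max_passes=5, skip_failed=True)
print(stuck)  # ([], [], ['blk_A', 'blk_B', 'blk_C'])
print(fixed)  # (['blk_B', 'blk_C'], ['blk_A'], [])
```

In the first run nothing is ever moved because every pass fails on the same block; in the second, the failing block is set aside and the remaining blocks are migrated.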

> HDFS mover stuck in loop trying to move corrupt block with no other valid replicas, doesn't
move rest of other data blocks
> --------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8341
>                 URL: https://issues.apache.org/jira/browse/HDFS-8341
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.6.0
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Priority: Minor
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't migrate the rest
of the blocks.
> This is preventing recovery of data from a decommissioning external storage tier used
for archive. We've had problems with that proprietary "hyperscale" storage product, which is
why a couple of blocks here and there have checksum problems or premature EOF as shown below,
but this should not prevent moving all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = {hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from NN: update
interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with
size=134217728 from <ip>:1019:ARCHIVE to <ip>:1019:DISK through <ip>:1019:
block move is failed: opReplaceBlock BP-120244285-<ip>-1417023863606:blk_1075156654_1438349
received exception java.io.EOFException: Premature EOF: no length prefix available
> <NOW IT STARTS LOOPING ON SAME BLOCK>
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with
size=134217728 from <ip>:1019:ARCHIVE to <ip>:1019:DISK through <ip>:1019:
block move is failed: opReplaceBlock BP-120244285-<ip>-1417023863606:blk_1075156654_1438349
received exception java.io.EOFException: Premature EOF: no length prefix available
> ...<repeat indefinitely>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
