hadoop-hdfs-issues mailing list archives

From "Tsz Wo Nicholas Sze (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8341) HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result
Date Mon, 11 May 2015 20:51:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14538625#comment-14538625 ]

Tsz Wo Nicholas Sze commented on HDFS-8341:
-------------------------------------------

> ... If replica scheduled successfully it will return, but here it should continue for next replica.

The reason for returning is that we don't want to move multiple replicas of the same block at once.

> Now the problem is, if file have more than one block for example 10 ... 

Do you mean "a block has more than one replica, for example 10"?

> ... and some problem in moving first replica then scheduleMoves4Block() API will always schedule first replica in each iteration and it will return.

The locations are shuffled, so the first replica is not necessarily the same in each iteration. Am I missing anything?
{code}
//scheduleMoves4Block
      Collections.shuffle(locations);
{code}
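
To make the point above concrete, here is a minimal, self-contained sketch of the "shuffle, then return after the first replica that gets scheduled" pattern. It is not the actual Mover code: the class name, the method scheduleOneReplica, the string locations and the tryMove predicate are all hypothetical stand-ins chosen for illustration. Only the use of Collections.shuffle comes from the snippet above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Optional;
import java.util.Random;
import java.util.function.Predicate;

// Hypothetical illustration only; names do not come from the HDFS code base.
final class ShuffleScheduleSketch {

    /**
     * Shuffles a copy of the replica locations and returns the first one for
     * which tryMove reports that a move was scheduled. At most one replica of
     * the block is scheduled per call, which mirrors the "return after the
     * first successful schedule" behaviour discussed above.
     */
    static Optional<String> scheduleOneReplica(List<String> locations,
                                               Predicate<String> tryMove,
                                               Random rng) {
        // Work on a copy so the caller's list order is left untouched.
        List<String> shuffled = new ArrayList<>(locations);
        Collections.shuffle(shuffled, rng);
        for (String location : shuffled) {
            if (tryMove.test(location)) {
                return Optional.of(location); // stop: one replica per block at a time
            }
        }
        return Optional.empty(); // nothing could be scheduled in this iteration
    }

    public static void main(String[] args) {
        List<String> locations = Arrays.asList("dn1", "dn2", "dn3");
        Random rng = new Random();
        // Assume scheduling always succeeds, so each iteration picks whichever
        // replica the shuffle happened to put first.
        for (int i = 0; i < 5; i++) {
            Optional<String> chosen = scheduleOneReplica(locations, loc -> true, rng);
            System.out.println("iteration " + i + " scheduled " + chosen.orElse("none"));
        }
    }
}
```

Because the order is re-shuffled on every call, a replica whose move keeps failing is not guaranteed to be picked first each time. Whether that is enough to avoid the loop in the report depends on details of the real scheduling and failure handling that this sketch does not model.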


> HDFS mover stuck in loop after failing to move block, doesn't move rest of blocks, can't get data back off decommissioning external storage tier as a result
> ------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-8341
>                 URL: https://issues.apache.org/jira/browse/HDFS-8341
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: balancer & mover
>    Affects Versions: 2.6.0
>         Environment: HDP 2.2
>            Reporter: Hari Sekhon
>            Assignee: surendra singh lilhore
>            Priority: Blocker
>
> HDFS mover gets stuck looping on a block that fails to move and doesn't migrate the rest of the blocks.
> This is preventing recovery of data from a decommissioning external storage tier used for archive (we've had problems with that proprietary "hyperscale" storage product, which is why a couple of blocks here and there have checksum problems or premature EOF as shown below), but this should not prevent moving all the other blocks to recover our data:
> {code}hdfs mover -p /apps/hive/warehouse/<custom_scrubbed>
> 15/05/07 14:52:50 INFO mover.Mover: namenodes = {hdfs://nameservice1=[/apps/hive/warehouse/<custom_scrubbed>]}
> 15/05/07 14:52:51 INFO balancer.KeyManager: Block token params received from NN: update interval=10hrs, 0sec, token lifetime=10hrs, 0sec
> 15/05/07 14:52:51 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:51 INFO balancer.KeyManager: Update block keys every 2hrs, 30mins, 0sec
> 15/05/07 14:52:52 INFO block.BlockTokenSecretManager: Setting block keys
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:52:52 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception java.io.EOFException: Premature EOF: no length prefix available
> <NOW IT STARTS LOOPING ON SAME BLOCK>
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 INFO net.NetworkTopology: Adding a new node: /default-rack/<ip>:1019
> 15/05/07 14:53:31 WARN balancer.Dispatcher: Failed to move blk_1075156654_1438349 with size=134217728 from <ip>:1019:ARCHIVE to <ip>:1019:DISK through <ip>:1019: block move is failed: opReplaceBlock BP-120244285-<ip>-1417023863606:blk_1075156654_1438349 received exception java.io.EOFException: Premature EOF: no length prefix available
> ...<repeat indefinitely>...
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
