hadoop-hdfs-issues mailing list archives

From "Weiwei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HDFS-12528) Short-circuit reads getting disabled frequently in certain scenarios
Date Mon, 30 Oct 2017 02:56:03 GMT

    [ https://issues.apache.org/jira/browse/HDFS-12528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16222326#comment-16222326
] 

Weiwei Yang edited comment on HDFS-12528 at 10/30/17 2:55 AM:
--------------------------------------------------------------

I believe we hit the same issue, but triggered by a different cause. In our case an HBase RegionServer tries a short-circuit read on a block that has been moved away by the balancer; it complains that the block replica is not valid, but this is treated as an unknown error:

{noformat}
impl.BlockReaderFactory: BlockReaderFactory(fileName=xxx, block=BP-547663139-11.139.225.193-1497349178310:blk_2222386301_1148917283):
unknown response code ERROR while attempting to set up short-circuit access. Block BP-xxx:blk_2222386301_1148917283
is not valid
{noformat}

This causes SCR to be disabled for 10 minutes and hurts HBase performance a lot. [~jzhuge], how do you plan to fix this? I guess the long pause is there in case something really bad happens, but in real use cases we may keep seeing new (unknown) exceptions and would prefer not to disable SCR for that long. Can we make this a configurable length of time? Setting it to 0 would then give us a way to NOT disable SCR at all.
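
To make the ask concrete, a rough sketch of the configurability I have in mind; the property name and wiring here are hypothetical, not an existing Hadoop key:

{code}
import org.apache.hadoop.conf.Configuration;
import java.util.concurrent.TimeUnit;

final class ScrDisableInterval {
  // Hypothetical key; today the 10-minute pause is hard coded.
  static final String DISABLE_INTERVAL_KEY = "dfs.domain.socket.disable.interval.seconds";
  static final long DEFAULT_SECONDS = 600;

  static long disableIntervalMs(Configuration conf) {
    long seconds = conf.getLong(DISABLE_INTERVAL_KEY, DEFAULT_SECONDS);
    // A value of 0 would mean: never disable SCR on an unknown error.
    return TimeUnit.SECONDS.toMillis(seconds);
  }
}
{code}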

Thanks


was (Author: cheersyang):
I believe we hit the same issue, but triggered by a different cause. In our case an HBase client tries a short-circuit read on a block that has been moved away by the balancer; it complains that the block replica is not valid, but this is treated as an unknown error:

{noformat}
impl.BlockReaderFactory: BlockReaderFactory(fileName=xxx, block=BP-547663139-11.139.225.193-1497349178310:blk_2222386301_1148917283):
unknown response code ERROR while attempting to set up short-circuit access. Block BP-xxx:blk_2222386301_1148917283
is not valid
{noformat}

This causes SCR to be disabled for 10 minutes and hurts HBase performance a lot. [~jzhuge], how do you plan to fix this? I guess the long pause is there in case something really bad happens, but in real use cases we may keep seeing new (unknown) exceptions and would prefer not to disable SCR for that long. Can we make this a configurable length of time? Setting it to 0 would then give us a way to NOT disable SCR at all.

Thanks

> Short-circuit reads getting disabled frequently in certain scenarios
> --------------------------------------------------------------------
>
>                 Key: HDFS-12528
>                 URL: https://issues.apache.org/jira/browse/HDFS-12528
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client, performance
>    Affects Versions: 2.6.0
>            Reporter: Andre Araujo
>            Assignee: John Zhuge
>         Attachments: HDFS-12528.000.patch
>
>
> We have scenarios where data ingestion makes use of the -appendToFile operation to add new data to existing HDFS files. In these situations, we're frequently running into the problem described below.
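> For reference, a minimal sketch of the same append path through the Java API (the file path is hypothetical):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataOutputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
> import java.nio.charset.StandardCharsets;
>
> public class AppendExample {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     Path file = new Path("/data/events.log");  // hypothetical ingestion target
>     // Appending bumps the last block's generation stamp, so its meta file name changes.
>     try (FSDataOutputStream out = fs.append(file)) {
>       out.write("new record\n".getBytes(StandardCharsets.UTF_8));
>       out.hsync();  // flush to the datanodes
>     }
>   }
> }
> {code}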
> We're using Impala to query the HDFS data with short-circuit reads (SCR) enabled. After each file read, Impala "unbuffers" the HDFS file to reduce the memory footprint. In some cases, though, Impala still keeps the HDFS file handle open for reuse.
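> A minimal sketch of that open/read/unbuffer/re-read pattern (path and sizes are hypothetical):
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FSDataInputStream;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class UnbufferExample {
>   public static void main(String[] args) throws Exception {
>     FileSystem fs = FileSystem.get(new Configuration());
>     byte[] buf = new byte[4096];
>     try (FSDataInputStream in = fs.open(new Path("/data/part-00000"))) {
>       in.readFully(0, buf);  // with SCR enabled this goes through a ShortCircuitReplica
>       in.unbuffer();         // drop buffers and block readers, but keep the handle for reuse
>       // ... much later, the same handle is read again; by then the cached
>       // replica may have been evicted, forcing a re-open of the block files
>       in.readFully(0, buf);
>     }
>   }
> }
> {code}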
> The "unbuffer" call, however, causes the file's current block reader to be closed, which
makes the associated ShortCircuitReplica evictable from the ShortCircuitCache. When the cluster
is under load, this means that the ShortCircuitReplica can be purged off the cache pretty
fast, which closes the file descriptor to the underlying storage file.
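> As far as I know, the cache doing this eviction is bounded by two client settings, shown here with example values only:
> {code}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> public class ScrCacheTuning {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     // Larger size / longer expiry make ShortCircuitReplica eviction less aggressive.
>     conf.setInt("dfs.client.read.shortcircuit.streams.cache.size", 4096);
>     conf.setLong("dfs.client.read.shortcircuit.streams.cache.expiry.ms", 600000L);
>     FileSystem fs = FileSystem.get(conf);
>   }
> }
> {code}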
> That means that when Impala re-reads the file it has to re-open the storage files associated with the ShortCircuitReplicas that were evicted from the cache. If there were no appends to those blocks, the re-open will succeed without problems. If a block was appended to since its ShortCircuitReplica was created, the re-open will fail with the following error:
> {code}
> Meta file for BP-810388474-172.31.113.69-1499543341726:blk_1074012183_273087 not found
> {code}
> This error is handled as an "unknown response" by the BlockReaderFactory [1], which disables short-circuit reads for 10 minutes [2] for the client.
> These 10 minutes without SCR can have a big performance impact on client operations. In this particular case ("Meta file not found") it would suffice to return null without disabling SCR. This particular block read would fall back to the normal, non-short-circuited path, and other SCR requests would continue to work as expected.
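> To illustrate the proposal, a standalone sketch of that decision; the class and message matching are illustrative only, not the actual BlockReaderFactory code:
> {code}
> final class ScrErrorPolicy {
>   enum Action { FALL_BACK_THIS_READ, DISABLE_SHORT_CIRCUIT }
>
>   static Action classify(String errorMessage) {
>     // "Meta file ... not found" just means the local replica changed underneath
>     // us (e.g. it was appended to); falling back to a remote read is enough.
>     if (errorMessage != null
>         && errorMessage.contains("Meta file for")
>         && errorMessage.contains("not found")) {
>       return Action.FALL_BACK_THIS_READ;
>     }
>     // Anything genuinely unexpected keeps the conservative behaviour of
>     // disabling short-circuit access for a while.
>     return Action.DISABLE_SHORT_CIRCUIT;
>   }
> }
> {code}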
> It might also be interesting to be able to control how long SCR is disabled for in the "unknown response" case. 10 minutes seems a bit too long, and not being able to change that is a problem.
> [1] https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/client/impl/BlockReaderFactory.java#L646
> [2] https://github.com/apache/hadoop/blob/f67237cbe7bc48a1b9088e990800b37529f1db2a/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/shortcircuit/DomainSocketFactory.java#L97



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


