hadoop-hdfs-issues mailing list archives

From "Chao Sun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-14211) [Consistent Observer Reads] Allow for configurable "always msync" mode
Date Mon, 11 Mar 2019 17:47:00 GMT

    [ https://issues.apache.org/jira/browse/HDFS-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16789811#comment-16789811 ]

Chao Sun commented on HDFS-14211:

One potential downside of this approach, IMO, is that {{msync}} still has to go through
the RPC queue on the active NN. In a busy cluster this could hurt the performance of read-only
operations. For instance, in our environment the RPC queue time on observer nodes is at least
10X lower than on the active NN. This is also a major motivation for us to use observers for
Presto workloads.

> [Consistent Observer Reads] Allow for configurable "always msync" mode
> ----------------------------------------------------------------------
>                 Key: HDFS-14211
>                 URL: https://issues.apache.org/jira/browse/HDFS-14211
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-14211.000.patch
> To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a consistent
way, an {{msync}} API was introduced (HDFS-13688) to allow for a client to fetch the latest
transaction ID from the Active NN, thereby ensuring that subsequent reads from the ObserverNode
will be up-to-date with the current state of the Active.
> Using this properly, however, requires application-side changes: for example, a NodeManager
should call {{msync}} before localizing the resources for a client, since it received notification
of the existence of those resources via communication that is out-of-band to HDFS and thus
could potentially attempt to localize them prior to the availability of those resources on
the ObserverNode.
> Until such application-side changes can be made, which will be a longer-term effort,
we need to provide a mechanism for unchanged clients to utilize the ObserverNode without exposing
such a client to inconsistencies. This is essentially phase 3 of the roadmap outlined in the
[design document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
for HDFS-12943.
> The design document proposes some heuristics based on understanding of how common applications
(e.g. MR) use HDFS for resources. As an initial pass, we can simply have a flag which tells
a client to call {{msync}} before _every single_ read operation. This may seem counterintuitive,
as it turns every read operation into two RPCs: {{msync}} to the Active followed by an actual
read operation to the Observer. However, the {{msync}} operation is extremely lightweight,
as it does not acquire the {{FSNamesystemLock}}, and in experiments we have found that this
approach can easily scale to well over 100,000 {{msync}} operations per second on the Active
(while still servicing approx. 10,000 write op/s). Combined with the fast-path edit log tailing
for standby/observer nodes (HDFS-13150), this "always msync" approach should introduce only
a few ms of extra latency to each read call.
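
The "always msync" flag described above can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the HDFS client code: {{ActiveSim}}, {{ObserverSim}}, {{ClientSim}} and the {{alwaysMsync}} constructor argument are hypothetical names invented for this example, and the simulated observer throws when it is behind, whereas a real ObserverNode would hold the call until it has caught up.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for the Active NN: owns the latest transaction ID. */
final class ActiveSim {
    private final AtomicLong lastTxId = new AtomicLong();
    int msyncCalls = 0; // counts the extra RPCs that "always msync" generates

    long write() { return lastTxId.incrementAndGet(); }

    /** Simulated msync: report the latest transaction ID without doing any other work. */
    long msync() { msyncCalls++; return lastTxId.get(); }
}

/** Hypothetical stand-in for an ObserverNode that tails edits from the Active. */
final class ObserverSim {
    private long appliedTxId = 0;

    void tailEditsUpTo(long txId) { appliedTxId = Math.max(appliedTxId, txId); }

    /** Serve a read only if everything the client has seen is already applied. */
    long read(long clientSeenTxId) {
        if (appliedTxId < clientSeenTxId) {
            // Simplification: fail loudly instead of waiting for edits to arrive.
            throw new IllegalStateException("observer behind: applied "
                + appliedTxId + " < required " + clientSeenTxId);
        }
        return appliedTxId;
    }
}

/** Hypothetical client; alwaysMsync models the proposed configuration flag. */
final class ClientSim {
    private final ActiveSim active;
    private final ObserverSim observer;
    private final boolean alwaysMsync;
    private long lastSeenTxId = 0;

    ClientSim(ActiveSim active, ObserverSim observer, boolean alwaysMsync) {
        this.active = active;
        this.observer = observer;
        this.alwaysMsync = alwaysMsync;
    }

    /** One logical read: optionally msync to the Active, then read from the Observer. */
    long read() {
        if (alwaysMsync) {
            lastSeenTxId = Math.max(lastSeenTxId, active.msync());
        }
        return observer.read(lastSeenTxId);
    }
}

final class AlwaysMsyncDemo {
    public static void main(String[] args) {
        ActiveSim active = new ActiveSim();
        ObserverSim observer = new ObserverSim();
        active.write();
        observer.tailEditsUpTo(1);
        active.write();
        active.write(); // txid 3 exists on the Active but is not yet tailed

        ClientSim plain = new ClientSim(active, observer, false);
        System.out.println("plain read served at txid " + plain.read());

        observer.tailEditsUpTo(3);
        ClientSim synced = new ClientSim(active, observer, true);
        System.out.println("synced read served at txid " + synced.read());
    }
}
```

In this sketch the client without the flag is served a stale view without noticing, while the client with the flag pays one extra call to the Active per read and is never served state older than what the Active reported.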
> Below are some results collected from experiments which convert a normal
RPC workload into one in which all read operations are turned into an {{msync}}. The baseline
is a workload of 1.5k write op/s and 25k read op/s.
> ||Rate Multiplier|2|4|6|8||
> ||RPC Queue Avg Time (ms)|14|53|110|125||
> ||RPC Queue NumOps Avg (k)|51|102|147|177||
> ||RPC Queue NumOps Max (k)|148|269|306|312||
> _(numbers are approximate and should be viewed primarily for their trends)_
> Results are promising up to somewhere between 4x and 6x of the baseline workload, i.e. approx.
100-150k read op/s.
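
The 100-150k figure follows from scaling the 25k read op/s baseline by the rate multiplier. A minimal sketch of that arithmetic, pairing each multiplier with the approximate average RPC queue time from the table above (the class and method names are made up for illustration):

```java
final class WorkloadScaling {
    // Baseline read rate stated in the issue description.
    static final int BASELINE_READ_OPS = 25_000;

    /** Read (msync) operations per second at a given rate multiplier. */
    static int scaledReads(int multiplier) {
        return BASELINE_READ_OPS * multiplier;
    }

    public static void main(String[] args) {
        int[] multipliers = {2, 4, 6, 8};
        int[] queueAvgMs = {14, 53, 110, 125}; // approximate values from the table
        for (int i = 0; i < multipliers.length; i++) {
            System.out.println(multipliers[i] + "x: " + scaledReads(multipliers[i])
                + " msync op/s, avg RPC queue time ~" + queueAvgMs[i] + " ms");
        }
    }
}
```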

