hadoop-hdfs-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Erik Krogen (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HDFS-14211) [Consistent Observer Reads] Allow for configurable "always msync" mode
Date Wed, 13 Mar 2019 16:34:00 GMT

     [ https://issues.apache.org/jira/browse/HDFS-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Erik Krogen updated HDFS-14211:
    Attachment: HDFS-14211.001.patch

> [Consistent Observer Reads] Allow for configurable "always msync" mode
> ----------------------------------------------------------------------
>                 Key: HDFS-14211
>                 URL: https://issues.apache.org/jira/browse/HDFS-14211
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client
>            Reporter: Erik Krogen
>            Assignee: Erik Krogen
>            Priority: Major
>         Attachments: HDFS-14211.000.patch, HDFS-14211.001.patch
> To allow for reads to be serviced from an ObserverNode (see HDFS-12943) in a consistent
way, an {{msync}} API was introduced (HDFS-13688) to allow for a client to fetch the latest
transaction ID from the Active NN, thereby ensuring that subsequent reads from the ObserverNode
will be up-to-date with the current state of the Active.
> Using this properly, however, requires application-side changes: for examples, a NodeManager
should call {{msync}} before localizing the resources for a client, since it received notification
of the existence of those resources via communicate which is out-of-band to HDFS and thus
could potentially attempt to localize them prior to the availability of those resources on
the ObserverNode.
> Until such application-side changes can be made, which will be a longer-term effort,
we need to provide a mechanism for unchanged clients to utilize the ObserverNode without exposing
such a client to inconsistencies. This is essentially phase 3 of the roadmap outlined in the
[design document|https://issues.apache.org/jira/secure/attachment/12915990/ConsistentReadsFromStandbyNode.pdf]
for HDFS-12943.
> The design document proposes some heuristics based on understanding of how common applications
(e.g. MR) use HDFS for resources. As an initial pass, we can simply have a flag which tells
a client to call {{msync}} before _every single_ read operation. This may seem counterintuitive,
as it turns every read operation into two RPCs: {{msync}} to the Active following by an actual
read operation to the Observer. However, the {{msync}} operation is extremely lightweight,
as it does not acquire the {{FSNamesystemLock}}, and in experiments we have found that this
approach can easily scale to well over 100,000 {{msync}} operations per second on the Active
(while still servicing approx. 10,000 write op/s). Combined with the fast-path edit log tailing
for standby/observer nodes (HDFS-13150), this "always msync" approach should introduce only
a few ms of extra latency to each read call.
> Below are some experimental results collected from experiments which convert a normal
RPC workload into one in which all read operations are turned into an {{msync}}. The baseline
is a workload of 1.5k write op/s and 25k read op/s.
> ||Rate Multiplier|2|4|6|8||
> ||RPC Queue Avg Time (ms)|14|53|110|125||
> ||RPC Queue NumOps Avg (k)|51|102|147|177||
> ||RPC Queue NumOps Max (k)|148|269|306|312||
> _(numbers are approximate and should be viewed primarily for their trends)_
> Results are promising up to between 4x and 6x of the baseline workload, which is approx.
100-150k read op/s.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org

View raw message