hadoop-hdfs-issues mailing list archives

From "Bob Hansen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-8746) Reduce the latency of streaming reads by re-using DN connections
Date Thu, 09 Jul 2015 20:20:05 GMT

    [ https://issues.apache.org/jira/browse/HDFS-8746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14621187#comment-14621187 ]

Bob Hansen commented on HDFS-8746:

I propose that we extend the semantics of the InputStreamImpl so that it can optionally retain
the connection to the last DN, along with enough metadata to identify that DN.  As long as
the next read goes to the same DN, the stream keeps using the same connection, which removes
the overhead of the initial TCP and authN handshakes.  As a follow-on, when choosing a DN for
the next block we can prioritize the DN that the InputStreamImpl is currently connected to,
if it is available.
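
The reuse rule above can be sketched in a few lines.  This is a language-neutral illustration
written in Python, not libhdfspp code: ReusingInputStream, FakeConnection, the handshakes
counter and the block/DN ids are all hypothetical names invented for the example, and the
"read" just returns zero bytes.

```python
from typing import Dict, List, Optional


class FakeConnection:
    """Stand-in for a DataNode connection (hypothetical, not a libhdfspp type)."""

    def __init__(self, dn_id: str) -> None:
        self.dn_id = dn_id

    def read(self, block_id: str, offset: int, length: int) -> bytes:
        # A real client would issue a block read here; return dummy bytes.
        return bytes(length)


class ReusingInputStream:
    """Keeps the connection to the last DN and reuses it while reads stay there."""

    def __init__(self, block_locations: Dict[str, List[str]],
                 retain_connection: bool = True) -> None:
        self.block_locations = block_locations      # block id -> DNs with a replica
        self.retain_connection = retain_connection  # the "optional" part
        self.handshakes = 0                         # TCP + authN setups performed
        self._conn: Optional[FakeConnection] = None

    def _choose_dn(self, block_id: str) -> str:
        replicas = self.block_locations[block_id]
        # Follow-on idea: prefer the DN we are already connected to.
        if self._conn is not None and self._conn.dn_id in replicas:
            return self._conn.dn_id
        return replicas[0]

    def pread(self, block_id: str, offset: int, length: int) -> bytes:
        dn_id = self._choose_dn(block_id)
        if self._conn is None or self._conn.dn_id != dn_id:
            # Different DN (or no connection yet): pay the setup cost once.
            self.handshakes += 1
            self._conn = FakeConnection(dn_id)
        data = self._conn.read(block_id, offset, length)
        if not self.retain_connection:
            self._conn = None  # current behaviour: one connection per pread
        return data
```

With retain_connection=True, a run of reads against blocks that share a DN pays for one
handshake; with retain_connection=False, every pread pays for its own.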

It was proposed instead that we keep the connections short-lived and amortize the connection
cost by using read-ahead and making longer reads.  That approach requires dynamically
allocating enough buffer space per stream to amortize the connection cost.  For SSL
connections, that would need to be a sizeable amount.  For the target environment of HDFS-8707
(applications requiring thousands of concurrent connections), it could add up to many GB
of heap.  Read-ahead would, however, mitigate the degenerate case of very short reads, which
under the connection-reuse proposal would still incur a lot of per-read communication latency
(though no connection setup latency).
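
A back-of-the-envelope model shows how the read-ahead buffer cost scales.  It assumes each
read costs one connection setup plus the transfer time, so the setup share of a read of size
S is setup / (setup + S / bandwidth).  The function names and every number in the usage note
are illustrative assumptions, not measurements from this ticket.

```python
def min_read_size(setup_s: float, bandwidth_bytes_per_s: float,
                  max_overhead: float) -> float:
    """Smallest read size S (bytes) that keeps the setup share of a read at or
    below max_overhead, i.e. setup_s / (setup_s + S / bandwidth) <= max_overhead."""
    if not 0.0 < max_overhead < 1.0:
        raise ValueError("max_overhead must be strictly between 0 and 1")
    return bandwidth_bytes_per_s * setup_s * (1.0 - max_overhead) / max_overhead


def total_buffer_bytes(streams: int, setup_s: float,
                       bandwidth_bytes_per_s: float, max_overhead: float) -> float:
    """Heap needed if every concurrent stream holds one read-ahead buffer of
    the minimum size computed above."""
    return streams * min_read_size(setup_s, bandwidth_bytes_per_s, max_overhead)
```

For example, assuming a 5 ms connection setup, 100 MB/s per stream and a 10% overhead target,
min_read_size(0.005, 100e6, 0.1) comes to roughly 4.5 MB per stream, and
total_buffer_bytes(4000, 0.005, 100e6, 0.1) to roughly 18 GB for 4000 concurrent streams.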

> Reduce the latency of streaming reads by re-using DN connections
> ----------------------------------------------------------------
>                 Key: HDFS-8746
>                 URL: https://issues.apache.org/jira/browse/HDFS-8746
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: hdfs-client
>            Reporter: Bob Hansen
>            Assignee: Bob Hansen
> The current libhdfspp implementation opens a new connection for each pread.  For streaming
> reads (especially streaming short-buffer reads coming from the C API, and especially once
> we get SSL handshake overhead), our throughput will be dominated by the connection latency
> of reconnecting to the DataNodes.
> The target use case is a multi-block file that is being sequentially streamed and processed
> by the client application, which consumes the data as it comes from the DN and throws it away.
> The data is read into moderately small buffers (~64k - ~1MB) owned by the consumer, and overall
> throughput is the critical metric.

This message was sent by Atlassian JIRA
