hadoop-hdfs-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6607) DFSInputStream Seek performance improvement
Date Mon, 30 Jun 2014 11:17:26 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14047564#comment-14047564 ]

Steve Loughran commented on HDFS-6607:

Some of the object store streams (e.g. for Swift) do this too, because a seek is very
expensive there.

What might be useful is moving this to BufferedInputStream, which already has some buffered
operations; it could be enhanced to also skip forward some bytes on a read. Or factor out
the skip logic in some other way so that we stop having to replicate it everywhere.
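
As a rough illustration of what "factoring out the skip logic" could look like, here is a minimal sketch. It is not code from Hadoop or from any patch on this issue: the class name ForwardSkipper, the method trySkipForward and the maxSkipBytes threshold are all hypothetical, and it assumes the caller tracks its own stream position. It only shows the general idea of reading and discarding bytes for short forward seeks, and telling the caller to fall back to a real seek otherwise.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical shared helper; not part of Hadoop. */
final class ForwardSkipper {
    private ForwardSkipper() {}

    /**
     * Tries to advance a stream from currentPos to targetPos by reading and
     * discarding the bytes in between.
     *
     * @return true if the stream now sits at targetPos; false if the target is
     *         behind the current position or further ahead than maxSkipBytes,
     *         in which case nothing was read and the caller should do a real seek.
     */
    static boolean trySkipForward(InputStream in, long currentPos,
                                  long targetPos, long maxSkipBytes)
            throws IOException {
        long remaining = targetPos - currentPos;
        if (remaining < 0 || remaining > maxSkipBytes) {
            return false; // backwards or too far: leave the stream untouched
        }
        // Scratch buffer for the discarded bytes (size chosen arbitrarily).
        byte[] scratch = new byte[(int) Math.min(8192L, Math.max(1L, remaining))];
        while (remaining > 0) {
            int toRead = (int) Math.min(scratch.length, remaining);
            int n = in.read(scratch, 0, toRead);
            if (n < 0) {
                throw new EOFException("Reached end of stream before position " + targetPos);
            }
            remaining -= n;
        }
        return true;
    }
}
```

A stream's seek() could call something like this first and only reopen its connection or block reader when it returns false. What the threshold should be would depend on how expensive a real seek is for that particular store.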

> DFSInputStream Seek performance improvement
> -------------------------------------------
>                 Key: HDFS-6607
>                 URL: https://issues.apache.org/jira/browse/HDFS-6607
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, performance
>    Affects Versions: 2.4.1
>            Reporter: Abdullah Alamoudi
> When a DFSInputStream is open and the caller seeks to a position in the same
> block, the seek is performed efficiently if the target position is already in the
> TCP buffer: the stream simply consumes the intervening data. See line 1368 in the file:
> hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.
> However, if the position is in the same block but beyond the TCP buffer, the input stream
> closes the current block reader, locates the block again, selects a data node, and
> creates a new block reader. Many objects are created along the way, and all of this is very
> inefficient for users with random access needs (e.g. index access).
> I have conducted some experiments which showed that reading 3,000,000 records using seeks
> and reads is slower than reading 60,000,000 records using seeks and reads as well, which shows
> the need to improve the seek implementation.

This message was sent by Atlassian JIRA