hadoop-hdfs-issues mailing list archives

From "Abdullah Alamoudi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-6607) Improve DFSInputStream forward seek performance
Date Tue, 01 Jul 2014 06:49:24 GMT

    [ https://issues.apache.org/jira/browse/HDFS-6607?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048577#comment-14048577 ]

Abdullah Alamoudi commented on HDFS-6607:

I know I didn't give enough details. Here is what I've done: I built B-Tree indexes
on records stored in HDFS files, using the records' byte offsets as pointers so that I can
seek and read to fetch the records of interest. After searching the index, and before using
the resulting list of offsets to access the records, I sort the offsets so that I access the
records sequentially where possible. I then tested the performance of the indexes and found
that they perform better when the number of accessed records is 60M than when it is 3M,
which doesn't make sense.
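
The access pattern described above (sort the offsets, then seek and read each record) can be
sketched as follows. This is only an illustration under stated assumptions: it uses a plain
java.io.InputStream rather than the HDFS client, assumes fixed-length, non-overlapping records,
and the class and method names (SortedOffsetReader, readRecords) are hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hadoop: reads fixed-length records at given offsets.
final class SortedOffsetReader {

    /** Reads one record at each offset, visiting the offsets in ascending order. */
    public static List<String> readRecords(InputStream in, long[] offsets, int recordLength) {
        long[] sorted = offsets.clone();
        Arrays.sort(sorted); // after sorting, every seek is a forward seek
        List<String> records = new ArrayList<>();
        DataInputStream data = new DataInputStream(in);
        long pos = 0; // current position in the stream
        try {
            for (long target : sorted) {
                if (target < pos) {
                    // Assumption of this sketch: records never overlap or repeat.
                    throw new IllegalArgumentException("overlapping record at offset " + target);
                }
                long remaining = target - pos;
                while (remaining > 0) { // skip() may skip fewer bytes than requested
                    long skipped = data.skip(remaining);
                    if (skipped <= 0) {
                        throw new IOException("cannot skip to offset " + target);
                    }
                    remaining -= skipped;
                }
                byte[] buf = new byte[recordLength];
                data.readFully(buf); // read exactly one record
                records.add(new String(buf, StandardCharsets.US_ASCII));
                pos = target + recordLength;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return records;
    }
}
```

For example, over the 20 bytes "rec0rec1rec2rec3rec4" with offsets {16, 4, 12} and a record
length of 4, the records come back in offset order: rec1, rec3, rec4.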

So I started looking at the seek implementation and came up with this conjecture: when I
access more records sequentially, the chance that the next record is already in the buffer
is higher, so I do less of the expensive work (closing the current reader, creating new
sockets, looking up the block location, etc.) than when I access fewer records.
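
A toy model may make the conjecture concrete. It is not a measurement and not the HDFS
client's logic: it assumes evenly spaced records and a fixed number of buffered bytes ahead of
the current position, and it simply counts how many forward seeks would land beyond that
window. The names SeekCostModel, countReopens, evenlySpaced, and bufferedBytes are hypothetical.

```java
// Hypothetical toy model of the conjecture; not taken from DFSInputStream.
final class SeekCostModel {

    /** Counts seeks whose target lies beyond the buffered window (assumed to force a reopen). */
    public static long countReopens(long[] sortedOffsets, int recordLength, long bufferedBytes) {
        long reopens = 0;
        long pos = 0; // position just after the previous record
        for (long target : sortedOffsets) {
            long gap = target - pos;
            if (gap < 0 || gap > bufferedBytes) {
                reopens++; // backward seek or target past the window: expensive path
            }
            pos = target + recordLength;
        }
        return reopens;
    }

    /** Builds `count` offsets spread evenly over a file of `fileLength` bytes. */
    public static long[] evenlySpaced(int count, long fileLength) {
        long[] offsets = new long[count];
        long stride = fileLength / count;
        for (int i = 0; i < count; i++) {
            offsets[i] = i * stride;
        }
        return offsets;
    }
}
```

With a 6,000,000-byte file, 10-byte records, and a 1,000-byte window, 3,000 evenly spaced
records yield 2,999 reopens in this model, while 60,000 records yield none, because each gap
then fits inside the window.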

As for the heavy work being in the read and not the seek, you are right, but what I meant
is that the two are related, because I perform a seek followed by a read :-). What I think
should be done is this: if the next desired position is in the block currently being read,
there should be a cheaper way to get there that reuses the same block reader.
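
The decision proposed above could look roughly like the sketch below. It only models the
choice between the three paths and is not the actual DFSInputStream code; SeekPolicy, Action,
decide, and the bufferedBytes parameter are hypothetical names, and blockEnd is assumed to be
the offset of the last byte of the current block.

```java
// Hypothetical sketch of the proposed seek decision; not the real HDFS implementation.
final class SeekPolicy {

    enum Action {
        SKIP_BUFFERED, // target already buffered: consume the intervening bytes
        REUSE_READER,  // same block, beyond the buffer: keep the current block reader
        NEW_READER     // backward seek or different block: open a new block reader
    }

    /** Chooses how to reach `target` from `pos` within the block ending at `blockEnd`. */
    public static Action decide(long pos, long target, long blockEnd, long bufferedBytes) {
        boolean forwardInBlock = target >= pos && target <= blockEnd;
        if (!forwardInBlock) {
            return Action.NEW_READER;
        }
        if (target - pos <= bufferedBytes) {
            return Action.SKIP_BUFFERED;
        }
        return Action.REUSE_READER; // the cheaper path this comment asks for
    }
}
```

The middle case is the one this comment is about: today it falls through to the expensive
path, and the suggestion is to serve it with the existing block reader instead.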

> Improve DFSInputStream forward seek performance
> -----------------------------------------------
>                 Key: HDFS-6607
>                 URL: https://issues.apache.org/jira/browse/HDFS-6607
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs-client, performance
>    Affects Versions: 2.4.1
>            Reporter: Abdullah Alamoudi
> When a DFSInputStream is open and the client seeks to a position within the same
> block, the seek is performed efficiently if the target position is already in the TCP buffer,
> simply by consuming the intervening data. See line 1368 in the file: hadoop-common/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java.
> However, if the position is in the same block but beyond the TCP buffer, the input stream
> performs a series of actions: closing the current block reader, locating the block again,
> selecting a data node, and creating a new block reader. Many objects are created along the way,
> and all of this is very inefficient for users with random access needs (e.g. index access).
> I have conducted some experiments which showed that reading 3,000,000 records using seeks
> and reads is slower than reading 60,000,000 records the same way, which shows
> the need to improve the seek implementation.

This message was sent by Atlassian JIRA