hadoop-hdfs-issues mailing list archives

From "Stephen O'Donnell (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HDFS-14663) HTTPFS ListStatus_Batch does not return batches as expected
Date Tue, 23 Jul 2019 16:23:00 GMT
Stephen O'Donnell created HDFS-14663:

             Summary: HTTPFS ListStatus_Batch does not return batches as expected
                 Key: HDFS-14663
                 URL: https://issues.apache.org/jira/browse/HDFS-14663
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: httpfs
    Affects Versions: 3.3.0
            Reporter: Stephen O'Donnell

The webhdfs protocol supports a LISTSTATUS_BATCH operation, which lets a client retrieve the
file listing for a large directory in chunks rather than in a single response.
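As a rough illustration of what "in chunks" means for a client, the sketch below pages through
a listing by repeatedly asking for the entries after the last name it received. It is not Hadoop
code: the class, the Page type, and the in-memory ListingServer are hypothetical stand-ins for
HTTP requests, and the batch size is an arbitrary assumption. Only the startAfter parameter and
the remainingEntries field are taken from the WebHDFS REST API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: models the client side of a batched listing, no HTTP involved.
final class BatchedListingClientSketch {

    /** One response page: some entry names plus a count of entries still to come. */
    static final class Page {
        final List<String> names;
        final int remainingEntries;

        Page(List<String> names, int remainingEntries) {
            this.names = names;
            this.remainingEntries = remainingEntries;
        }
    }

    /** Stand-in for the server; a real client would issue an HTTP GET per call. */
    interface ListingServer {
        Page listBatch(String startAfter);
    }

    /** In-memory server over a sorted list, returning at most batchSize names per call. */
    static ListingServer inMemoryServer(List<String> sortedNames, int batchSize) {
        return startAfter -> {
            int from = 0;
            if (startAfter != null) {
                // Skip every name up to and including startAfter.
                while (from < sortedNames.size()
                        && sortedNames.get(from).compareTo(startAfter) <= 0) {
                    from++;
                }
            }
            int to = Math.min(from + batchSize, sortedNames.size());
            return new Page(new ArrayList<>(sortedNames.subList(from, to)),
                            sortedNames.size() - to);
        };
    }

    /** Fetches the whole listing batch by batch, recording the size of each batch. */
    static List<String> listAll(ListingServer server, List<Integer> batchSizesSeen) {
        List<String> all = new ArrayList<>();
        String startAfter = null;
        while (true) {
            Page page = server.listBatch(startAfter);
            batchSizesSeen.add(page.names.size());
            all.addAll(page.names);
            if (page.remainingEntries == 0 || page.names.isEmpty()) {
                break;
            }
            // Resume after the last entry of the batch just received.
            startAfter = page.names.get(page.names.size() - 1);
        }
        return all;
    }

    public static void main(String[] args) {
        List<Integer> sizes = new ArrayList<>();
        List<String> all = listAll(
                inMemoryServer(Arrays.asList("a", "b", "c", "d", "e"), 2), sizes);
        System.out.println(all + " fetched in batches of " + sizes);
    }
}
```

With a server that really batches, the client sees several small responses. A server that
ignores batching would answer the first request with every entry and remainingEntries of 0.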

When using the webhdfs service embedded in the namenode, this works as expected, but when
using HTTPFS, any call to LISTSTATUS_BATCH simply returns the entire listing rather than batches,
working effectively like LISTSTATUS instead.

This seems to be because HTTPFS falls back to the method org.apache.hadoop.fs.FileSystem#listStatusBatch,
which is intended to be overridden. The FileSystem implementation that HTTPFS uses does not
appear to override it, which leads to this limitation.
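The fallback described above can be modelled in a few lines. This is a simplified sketch, not
the Hadoop or HTTPFS source: BaseFs, PagingFs, and Entries are hypothetical names, entries are
plain strings, and the continuation token is a numeric offset chosen for brevity (an assumption,
not how HDFS encodes its token). It only shows why a caller gets one large batch when the
default listStatusBatch is not overridden.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of "default batch method returns everything unless overridden".
final class ListStatusBatchFallbackSketch {

    /** One batch of entries, a continuation token, and a flag for further batches. */
    static final class Entries {
        final List<String> names;
        final String token;
        final boolean hasMore;

        Entries(List<String> names, String token, boolean hasMore) {
            this.names = names;
            this.token = token;
            this.hasMore = hasMore;
        }
    }

    /** Base class whose batch listing defaults to a single batch with all entries. */
    static class BaseFs {
        protected final List<String> children;

        BaseFs(List<String> children) {
            this.children = children;
        }

        List<String> listStatus() {
            return new ArrayList<>(children);
        }

        // Default: ignore the token and return the full listing, with no more batches.
        Entries listStatusBatch(String token) {
            return new Entries(listStatus(), null, false);
        }
    }

    /** Subclass that overrides the default and really pages (offset token is assumed). */
    static class PagingFs extends BaseFs {
        private final int batchSize;

        PagingFs(List<String> children, int batchSize) {
            super(children);
            this.batchSize = batchSize;
        }

        @Override
        Entries listStatusBatch(String token) {
            int from = (token == null) ? 0 : Integer.parseInt(token);
            int to = Math.min(from + batchSize, children.size());
            boolean more = to < children.size();
            return new Entries(new ArrayList<>(children.subList(from, to)),
                               more ? Integer.toString(to) : null, more);
        }
    }

    /** Drives the batch API to exhaustion and reports the size of each batch. */
    static List<Integer> batchSizes(BaseFs fs) {
        List<Integer> sizes = new ArrayList<>();
        String token = null;
        Entries e;
        do {
            e = fs.listStatusBatch(token);
            sizes.add(e.names.size());
            token = e.token;
        } while (e.hasMore);
        return sizes;
    }

    public static void main(String[] args) {
        List<String> dir = Arrays.asList("a", "b", "c", "d", "e");
        System.out.println("default:    " + batchSizes(new BaseFs(dir)));
        System.out.println("overridden: " + batchSizes(new PagingFs(dir, 2)));
    }
}
```

The same driver loop produces one batch against the base class and several against the
overriding subclass, which matches the difference reported between HTTPFS and the webhdfs
service embedded in the namenode.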

This feature (LISTSTATUS_BATCH) was added to HTTPFS by HDFS-10823, but based on my testing
it does not work as intended. I suspect this is because the listStatusBatch operation was added
to WebHdfsFileSystem and HttpFSFileSystem as part of that Jira, but behind the scenes
HTTPFS seems to use DistributedFileSystem, and hence it falls back to the default implementation,
"org.apache.hadoop.fs.FileSystem#listStatusBatch", which returns all entries in a single batch.
