drill-dev mailing list archives

From "Uwe L. Korn (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4977) Reading parquet metadata cache from S3 with fadvise=random and Hadoop 3 generates a large number of requests
Date Fri, 28 Oct 2016 08:05:58 GMT
Uwe L. Korn created DRILL-4977:
----------------------------------

             Summary: Reading parquet metadata cache from S3 with fadvise=random and Hadoop 3 generates a large number of requests
                 Key: DRILL-4977
                 URL: https://issues.apache.org/jira/browse/DRILL-4977
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.8.0
         Environment: Hadoop 3.0
            Reporter: Uwe L. Korn


When using the new {{fs.s3a.experimental.input.fadvise=random}} mode to access Parquet
files stored in S3, we see a significant improvement in query performance but a slowdown
in query planning. The slowdown comes from the way the metadata cache file is read: each
8000-byte chunk generates a new GET request to S3. Calling
{{FSDataInputStream.setReadahead(metadata-filesize)}} to indicate that the whole file will
be read avoids this behaviour.
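
To illustrate why the readahead hint matters, here is a rough, self-contained sketch that counts GET requests under a deliberately simplified model. It is not the S3A implementation: the class {{ReadaheadModel}}, the method {{countGets}}, and the rule that each GET fetches max(read length, readahead) bytes starting at the current position are assumptions made only for this illustration. In Drill itself, the change would amount to calling {{setReadahead}} with the file length on the opened {{FSDataInputStream}} before reading the metadata file.

```java
// Toy model (an assumption, not S3A's real logic): a sequential reader pulls
// the file in fixed-size chunks, and a new GET is issued whenever the
// requested bytes are not covered by the range fetched by the previous GET.
final class ReadaheadModel {

    /**
     * Counts the GET requests needed to read a file front to back.
     *
     * @param fileSize  total file length in bytes
     * @param chunkSize bytes requested per read call (8000 in the report)
     * @param readahead readahead hint in bytes (hypothetical model parameter)
     */
    static int countGets(long fileSize, int chunkSize, long readahead) {
        int gets = 0;
        long pos = 0;        // current read position
        long fetchedEnd = 0; // exclusive end of the range fetched so far

        while (pos < fileSize) {
            long len = Math.min(chunkSize, fileSize - pos);
            if (pos + len > fetchedEnd) {
                // Not covered by the previous GET: open a new ranged request.
                long span = Math.max(len, readahead);
                fetchedEnd = Math.min(fileSize, pos + span);
                gets++;
            }
            pos += len;
        }
        return gets;
    }

    public static void main(String[] args) {
        long fileSize = 80_000L; // hypothetical metadata file size
        // Readahead no larger than one chunk: one GET per chunk.
        System.out.println(countGets(fileSize, 8000, 8000));     // prints 10
        // Readahead set to the whole file: a single GET.
        System.out.println(countGets(fileSize, 8000, fileSize)); // prints 1
    }
}
```

Under this model, an 80,000-byte file read in 8000-byte chunks costs ten requests when the readahead is no larger than one chunk, and a single request when the readahead equals the file size.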



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
