drill-issues mailing list archives

From "Uwe L. Korn (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4976) Querying Parquet files on S3 pulls
Date Fri, 28 Oct 2016 07:59:59 GMT
Uwe L. Korn created DRILL-4976:

             Summary: Querying Parquet files on S3 pulls 
                 Key: DRILL-4976
                 URL: https://issues.apache.org/jira/browse/DRILL-4976
             Project: Apache Drill
          Issue Type: Improvement
          Components: Storage - Parquet
    Affects Versions: 1.8.0
            Reporter: Uwe L. Korn

Currently (Drill 1.8, Hadoop 2.7.2), when queries are executed on files stored in S3, the underlying
s3a implementation requests orders of magnitude more data than needed. Given sufficiently large
seeks, the following HTTP pattern is observed:

* GET bytes=8k-100M
* GET bytes=2M-100M
* GET bytes=4M-100M

Although these HTTP requests were normally aborted before all the data was
sent by the server, the amount of data that went over the network was still
about 10-15x the size of the input files, i.e. for a file of 100M in size,
sometimes 1G of data was transferred over the network.
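To illustrate the amplification, here is a small sketch (not Drill or Hadoop code) that compares the bytes crossing the wire for a column-oriented read pattern under the two strategies. The read offsets/lengths and the 64 KiB "drain" per aborted open-ended GET are assumed figures for illustration, not measurements from this report:

```python
FILE_SIZE = 100 * 1024 * 1024   # 100 MiB Parquet file
ABORT_DRAIN = 64 * 1024         # assumed bytes still in flight before an abort takes effect

# Hypothetical column-chunk reads: (offset, length) pairs with seeks between them.
reads = [(8 * 1024, 1 * 1024 * 1024),
         (2 * 1024 * 1024, 1 * 1024 * 1024),
         (4 * 1024 * 1024, 1 * 1024 * 1024)]

def sequential_bytes(reads, file_size, drain):
    # Sequential-style behaviour: each seek reopens an open-ended GET
    # (bytes=offset-EOF); the client reads `length` bytes, then aborts,
    # with `drain` extra bytes transferred before the abort is effective.
    total = 0
    for offset, length in reads:
        requested = file_size - offset            # what the open-ended GET asks for
        total += min(requested, length + drain)   # what actually crosses the wire
    return total

def random_bytes(reads):
    # Random-style behaviour: each read issues a bounded GET
    # (bytes=offset-(offset+length-1)), so only the requested range moves.
    return sum(length for _, length in reads)

seq = sequential_bytes(reads, FILE_SIZE, ABORT_DRAIN)
rnd = random_bytes(reads)
print(f"open-ended GETs transfer: {seq} bytes")
print(f"bounded GETs transfer:    {rnd} bytes")
```

With a larger drain per abort, or reads scattered across a large file, the open-ended pattern approaches the 10-15x overhead described above, while the bounded pattern stays proportional to the data actually read.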

A fix for this is the {{fs.s3a.experimental.input.fadvise=random}} mode, which
will be introduced with Hadoop 3.
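For reference, once running on a Hadoop version that ships this option, it could be set in {{core-site.xml}} like the sketch below ({{random}} is the value relevant here; this fragment assumes the property name and file location stay as in the Hadoop S3A work):

```xml
<configuration>
  <!-- Switch S3A to bounded ranged GETs instead of open-ended sequential reads.
       Other documented values are "sequential" and "normal". -->
  <property>
    <name>fs.s3a.experimental.input.fadvise</name>
    <value>random</value>
  </property>
</configuration>
```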

This message was sent by Atlassian JIRA
