drill-issues mailing list archives

From "Aman Sinha (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-4861) Revisit the 'entries' stored as part of ParquetGroupScan
Date Wed, 24 Aug 2016 01:12:21 GMT
Aman Sinha created DRILL-4861:

             Summary: Revisit the 'entries' stored as part of ParquetGroupScan
                 Key: DRILL-4861
                 URL: https://issues.apache.org/jira/browse/DRILL-4861
             Project: Apache Drill
          Issue Type: Bug
          Components: Storage - Parquet
    Affects Versions: 1.7.0
            Reporter: Aman Sinha

ParquetGroupScan stores a list of ReadEntryWithPath objects in its 'entries' field (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L104)
and also a hash set of file names in its 'fileSet' field (https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java#L263).

The underlying data stored by both fields is essentially the same set of file names, so we
should try to consolidate them into a single entity.  This is not only a matter of code
simplification: the duplication has a real performance cost.  When a ParquetGroupScan is
serialized and sent as part of a JSON plan fragment, the overhead is quite high if the number
of files is large (tens of thousands or more).
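
A rough sketch of the concern, not Drill's actual code: ScanEntriesSketch, ReadEntry, and every method below are hypothetical stand-ins, and the sketch assumes, purely for illustration, that both collections end up in the serialized form. It compares the payload size when the same paths are written twice against a consolidated form that keeps one list and derives the set on demand.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model; this is NOT Drill's ParquetGroupScan.
public class ScanEntriesSketch {

    // Stand-in for ReadEntryWithPath: a thin wrapper around one file path.
    static final class ReadEntry {
        final String path;
        ReadEntry(String path) { this.path = path; }
    }

    // Hand-rolled JSON array of strings (no escaping), enough to compare sizes.
    static String jsonArray(Iterable<String> values) {
        StringBuilder sb = new StringBuilder("[");
        String sep = "";
        for (String v : values) {
            sb.append(sep).append('"').append(v).append('"');
            sep = ",";
        }
        return sb.append(']').toString();
    }

    static List<String> pathsOf(List<ReadEntry> entries) {
        List<String> paths = new ArrayList<>(entries.size());
        for (ReadEntry e : entries) {
            paths.add(e.path);
        }
        return paths;
    }

    // Assumed "duplicated" shape: the same paths are written out twice.
    static String serializeDuplicated(List<ReadEntry> entries, Set<String> fileSet) {
        return "{\"entries\":" + jsonArray(pathsOf(entries))
                + ",\"fileSet\":" + jsonArray(fileSet) + "}";
    }

    // Consolidated shape: a single field carries the paths.
    static String serializeConsolidated(List<ReadEntry> entries) {
        return "{\"entries\":" + jsonArray(pathsOf(entries)) + "}";
    }

    // The set view can be rebuilt from the entries whenever it is needed.
    static Set<String> fileSetOf(List<ReadEntry> entries) {
        return new LinkedHashSet<>(pathsOf(entries));
    }

    // Generates n distinct, made-up file paths.
    static List<ReadEntry> sampleEntries(int n) {
        List<ReadEntry> entries = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            entries.add(new ReadEntry("/data/part-" + i + ".parquet"));
        }
        return entries;
    }

    public static void main(String[] args) {
        List<ReadEntry> entries = sampleEntries(20000);
        Set<String> fileSet = fileSetOf(entries);
        System.out.println("duplicated:   "
                + serializeDuplicated(entries, fileSet).length() + " chars");
        System.out.println("consolidated: "
                + serializeConsolidated(entries).length() + " chars");
    }
}
```

Under these assumptions the duplicated form is roughly twice the size of the consolidated one, and the gap grows linearly with the number of files.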

This message was sent by Atlassian JIRA
