drill-dev mailing list archives

From jinfengni <...@git.apache.org>
Subject [GitHub] drill pull request: DRILL-4363: Row count based pruning for parque...
Date Thu, 11 Feb 2016 04:39:49 GMT
Github user jinfengni commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
    @@ -791,6 +799,43 @@ public FileGroupScan clone(FileSelection selection) throws IOException
    +  public GroupScan applyLimit(long maxRecords) {
    --- End diff --
    I gave some thought to this optimization as well. Then I realized that until we have
some performance measurements, it's not clear which way we want to go. For example,
I'm not sure whether 1000 small parquet files are better than 1 large parquet file. 1000
files might have a bigger metadata overhead than 1 large file (?). But 1000 small files might be
the better option if we do want to parallelize the execution.
    I'll add a comment saying that further optimization could be done in terms of how the
subset of files is chosen.
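The trade-off discussed above (few large files vs. many small ones) only affects *which* subset of files is kept; the pruning itself can be a simple greedy pass over per-file row counts until the limit is covered. A minimal sketch of that idea, using a hypothetical `FileRowCount` holder rather than Drill's actual `ParquetGroupScan` metadata classes (all names here are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class LimitPruningSketch {

    // Illustrative stand-in for per-file row-count metadata;
    // not Drill's actual API.
    static class FileRowCount {
        final String path;
        final long rowCount;

        FileRowCount(String path, long rowCount) {
            this.path = path;
            this.rowCount = rowCount;
        }
    }

    // Greedily keep files until their combined row counts reach maxRecords.
    // Which files to prefer (large vs. small) is the open question from the
    // review comment; this sketch simply takes them in the given order.
    static List<FileRowCount> applyLimit(List<FileRowCount> files, long maxRecords) {
        List<FileRowCount> selected = new ArrayList<>();
        long total = 0;
        for (FileRowCount f : files) {
            selected.add(f);
            total += f.rowCount;
            if (total >= maxRecords) {
                break;
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileRowCount> files = new ArrayList<>();
        files.add(new FileRowCount("a.parquet", 100));
        files.add(new FileRowCount("b.parquet", 250));
        files.add(new FileRowCount("c.parquet", 500));
        // 100 + 250 = 350 >= 300, so only two files are needed.
        System.out.println(applyLimit(files, 300).size()); // prints 2
    }
}
```

Whether to sort files first (e.g. largest-first to minimize metadata overhead, or smallest-first to maximize parallelism) is exactly the unresolved trade-off noted above.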

