drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4380) Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
Date Tue, 09 Feb 2016 21:31:18 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15139809#comment-15139809 ]

ASF GitHub Bot commented on DRILL-4380:

Github user jacques-n commented on a diff in the pull request:

    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FileSelection.java
    @@ -183,12 +194,16 @@ private static String buildPath(final String[] path, final int folderIndex)
       public static FileSelection create(final DrillFileSystem fs, final String parent, final String path) throws IOException {
    +    Stopwatch timer = Stopwatch.createStarted();
         final Path combined = new Path(parent, removeLeadingSlash(path));
         final FileStatus[] statuses = fs.globStatus(combined);
         if (statuses == null) {
           return null;
    -    return create(Lists.newArrayList(statuses), null, combined.toUri().toString());
    +    final FileSelection fileSel = create(Lists.newArrayList(statuses), null, combined.toUri().toString());
    +    logger.info("FileSelection.create() took {} ms ", timer.elapsed(TimeUnit.MILLISECONDS));
    --- End diff --

> Fix performance regression: in creation of FileSelection in ParquetFormatPlugin to not set files if metadata cache is available.
> --------------------------------------------------------------------------------------------------------------------------------
>                 Key: DRILL-4380
>                 URL: https://issues.apache.org/jira/browse/DRILL-4380
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Parth Chandra
> The regression was caused by the changes in 367d74a65ce2871a1452361cbd13bbd5f4a6cc95 (DRILL-2618: handle queries over empty folders consistently so that they report table not found rather than failing.)
> In ParquetFormatPlugin, the original code created a FileSelection object as follows:
> {code}
> return new FileSelection(fileNames, metaRootPath.toString(), metadata, selection.getFileStatusList(fs));
> {code}
> The selection.getFileStatusList call made an inexpensive call to FileSelection.init(). The call was inexpensive because the FileSelection.files member was not set, so the code did not need to make an expensive call to get the file statuses corresponding to the files in the FileSelection.files member.
> In the new code, this is replaced by:
> {code}
> final FileSelection newSelection = FileSelection.create(null, fileNames, metaRootPath.toString());
> return ParquetFileSelection.create(newSelection, metadata);
> {code}
> This sets the FileSelection.files member but not the FileSelection.statuses member. A subsequent call to FileSelection.getStatuses (in ParquetGroupScan()) now makes an expensive call to get all the statuses.
> It appears that there was an implicit assumption that the FileSelection.statuses member should be set before the FileSelection.files member is set. This assumption is no longer true.
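
The cost difference described in the issue can be illustrated with a minimal sketch. The `Selection` class below is a hypothetical stand-in, not Drill's actual `FileSelection`: statuses are plain strings, and the expensive per-file file system call is simulated by a counter. It only assumes the behavior the description states, namely that statuses are resolved lazily from the file list when they were not supplied up front.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SelectionSketch {

    /** Hypothetical stand-in for FileSelection; not Drill's real class. */
    public static final class Selection {
        private final List<String> files;   // path strings, may be null
        private List<String> statuses;      // stand-in for FileStatus objects, may be null
        public int lookups = 0;             // counts simulated file system calls

        public Selection(List<String> statuses, List<String> files) {
            this.statuses = statuses;
            this.files = files;
        }

        // Simulates the expensive per-file status lookup.
        private String stat(String path) {
            lookups++;
            return "status:" + path;
        }

        // Resolves statuses lazily: only touches the "file system"
        // when statuses were not supplied at construction time.
        public List<String> getStatuses() {
            if (statuses == null) {
                List<String> resolved = new ArrayList<>();
                if (files != null) {
                    for (String f : files) {
                        resolved.add(stat(f));
                    }
                }
                statuses = resolved;
            }
            return statuses;
        }
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a.parquet", "b.parquet", "c.parquet");

        // Files set, statuses not set: one lookup per file on first access.
        Selection filesOnly = new Selection(null, files);
        filesOnly.getStatuses();
        System.out.println("files only: " + filesOnly.lookups + " lookups");

        // Statuses already known: no lookups at all.
        Selection withStatuses = new Selection(Arrays.asList("s1", "s2", "s3"), null);
        withStatuses.getStatuses();
        System.out.println("statuses preset: " + withStatuses.lookups + " lookups");
    }
}
```

In this sketch the number of simulated lookups grows with the number of files whenever only the file list is supplied, which mirrors the regression described above for large tables when a metadata cache is available.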

This message was sent by Atlassian JIRA
