drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3735) Directory pruning is not happening when number of files is larger than 64k
Date Tue, 15 Sep 2015 00:57:46 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3735?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744608#comment-14744608 ]

ASF GitHub Bot commented on DRILL-3735:
---------------------------------------

Github user jinfengni commented on a diff in the pull request:

    https://github.com/apache/drill/pull/156#discussion_r39465039
  
    --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/ParquetPartitionDescriptor.java ---
    @@ -125,4 +117,16 @@ private String getBaseTableLocation() {
         final FormatSelection origSelection = (FormatSelection) scanRel.getDrillTable().getSelection();
         return origSelection.getSelection().selectionRoot;
       }
    +
    +  @Override
    +  protected void createPartitionSublists() {
    +    Set<String> fileLocations = ((ParquetGroupScan) scanRel.getGroupScan()).getFileSet();
    +    List<PartitionLocation> locations = new LinkedList<>();
    +    for (String file: fileLocations) {
    +      locations.add(new DFSPartitionLocation(MAX_NESTED_SUBDIRS, getBaseTableLocation(), file));
    --- End diff --
    
    Looks like we are still putting each file name, including the directory name, into heap memory before breaking the list into multiple sublists. In other words, this patch reduces the direct memory footprint allocated for value vectors, but it does not address the heap memory issue caused by very long file names, right?
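
    To make the concern concrete, here is a minimal sketch of splitting an already materialized list into bounded sublists. The class and method names (SublistSketch, partition) are hypothetical and are not Drill's API; the sketch only assumes that the sublists are views over one in-memory list.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not code from the Drill patch.
final class SublistSketch {

    // Splits items into consecutive sublists of at most maxSize elements.
    // Each sublist is a view over the original list, so every element
    // (for example, every full file path) is already on the heap before
    // any sublist is created.
    static <T> List<List<T>> partition(List<T> items, int maxSize) {
        if (maxSize <= 0) {
            throw new IllegalArgumentException("maxSize must be positive");
        }
        List<List<T>> sublists = new ArrayList<>();
        for (int start = 0; start < items.size(); start += maxSize) {
            int end = Math.min(start + maxSize, items.size());
            sublists.add(items.subList(start, end)); // view, no copy
        }
        return sublists;
    }
}
```

    Bounding the sublist size limits how many entries are processed per batch, but the strings themselves are held in the original list the whole time, which is the heap cost the comment asks about.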



> Directory pruning is not happening when number of files is larger than 64k
> --------------------------------------------------------------------------
>
>                 Key: DRILL-3735
>                 URL: https://issues.apache.org/jira/browse/DRILL-3735
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Query Planning & Optimization
>    Affects Versions: 1.1.0
>            Reporter: Hao Zhu
>            Assignee: Mehant Baid
>             Fix For: 1.2.0
>
>
> When the number of files exceeds the 64k limit, directory pruning does not happen.
> We need to increase this limit further to handle most use cases.
> My proposal is to separate the code for directory pruning from the code for partition pruning.
> Say a parent directory contains 100 directories and 1 million files.
> If a query only touches files in one directory, we should first read the 100 directory names and narrow them down to the matching directory, and only then read the file paths in that directory into memory and do the rest of the work.
> The current behavior is that Drill first reads the paths of all 1 million files into memory and only then does directory pruning or partition pruning. This is neither performance-efficient nor memory-efficient, and it does not scale.
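
The directory-first order proposed above can be sketched roughly as follows. This is only an illustration under stated assumptions: DirectoryFirstPruning, selectFiles, directoryFilter and listFiles are hypothetical names, and the listFiles function stands in for whatever file system listing call the planner would actually use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical illustration of the proposal, not Drill's implementation.
final class DirectoryFirstPruning {

    // Stage 1: test only the directory names against the filter.
    // Stage 2: list files only inside directories that survived stage 1,
    // so file paths of pruned directories are never loaded into memory.
    static List<String> selectFiles(List<String> directories,
                                    Predicate<String> directoryFilter,
                                    Function<String, List<String>> listFiles) {
        List<String> selected = new ArrayList<>();
        for (String dir : directories) {
            if (!directoryFilter.test(dir)) {
                continue; // pruned without listing its files
            }
            for (String file : listFiles.apply(dir)) {
                selected.add(dir + "/" + file);
            }
        }
        return selected;
    }
}
```

With 100 directories and a filter that matches one of them, listFiles is invoked once, so only that directory's file paths are held in memory instead of all 1 million.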



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
