drill-issues mailing list archives

From "Arina Ielchiieva (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4720) MINDIR() and IMINDIR() functions return no results with metadata cache
Date Thu, 29 Jun 2017 17:34:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16068670#comment-16068670
] 

Arina Ielchiieva commented on DRILL-4720:
-----------------------------------------

To retrieve the list of sub-partitions, {{FileSystemSchema}} uses the following code:
{code}
    @Override
    public Iterable<String> getSubPartitions(String table,
                                             List<String> partitionColumns,
                                             List<String> partitionValues
                                            ) throws PartitionNotFoundException {
      List<FileStatus> fileStatuses;
      try {
        fileStatuses = defaultSchema.getFS().list(false, new Path(defaultSchema.getDefaultLocation(), table));
      } catch (IOException e) {
        throw new PartitionNotFoundException("Error finding partitions for table " + table, e);
      }
      return new SubDirectoryList(fileStatuses);
    }
{code}

{{DrillFileSystem.list(boolean recursive, Path... paths)}} is used to return the list of file
statuses.
[The behavior of this method is not obvious, though|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/DrillFileSystem.java#L750].
If it is called with the recursive flag set to false, it returns all directories and files
in the given path, with no filtering.
If it is called with the recursive flag set to true, it returns only files (including nested
files) in the given path, and it also filters out all files and directories that are excluded
by the Drill file system. When reading data from a table, [Drill excludes all files and directories
that start with a dot or an underscore|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/DrillPathFilter.java].
Because {{getSubPartitions}} passes false, the returned list also contains the {{.drill.parquet_metadata}}
file, which is consistent with the filter condition {{=($1, '.drill.parquet_metadata')}} in
the reported plan.
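
The effect of the unfiltered listing can be sketched without Hadoop or Drill. This is only an illustration, not Drill's code: the class {{ListingSketch}} and its methods are hypothetical, the child names are copied from the {{hadoop fs -ls /tmp/querylogs_4}} output in the issue description, and the minimum is assumed to be picked by plain {{String}} ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; none of these names exist in Drill.
class ListingSketch {

    // Rule described in the comment: names starting with '.' or '_' are excluded.
    static boolean isVisible(String name) {
        return !name.startsWith(".") && !name.startsWith("_");
    }

    // Stand-in for a non-recursive listing: every direct child is returned as is.
    static List<String> listAll(List<String> children) {
        return new ArrayList<>(children);
    }

    // Same listing with the dot/underscore rule applied.
    static List<String> listVisible(List<String> children) {
        List<String> visible = new ArrayList<>();
        for (String child : children) {
            if (isVisible(child)) {
                visible.add(child);
            }
        }
        return visible;
    }

    // Assumption: the minimum directory is chosen by natural String ordering.
    static String minDir(List<String> names) {
        return Collections.min(names);
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
                ".drill.parquet_metadata", "1985", "1999", "2005", "2014", "2016");

        // '.' sorts before the digits, so the metadata file wins the comparison.
        System.out.println(minDir(listAll(children)));     // .drill.parquet_metadata
        // With the filter applied, the smallest real partition is returned.
        System.out.println(minDir(listVisible(children))); // 1985
    }
}
```

With the unfiltered list the minimum is the metadata cache file rather than a partition directory, so a predicate such as {{dir0 = '.drill.parquet_metadata'}} matches no rows.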


> MINDIR() and IMINDIR() functions return no results with metadata cache
> ----------------------------------------------------------------------
>
>                 Key: DRILL-4720
>                 URL: https://issues.apache.org/jira/browse/DRILL-4720
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>
> Parquet directories with meta data cache return 0 rows for MINDIR and IMINDIR functions.
> hadoop fs -ls /tmp/querylogs_4
> Found 6 items
> -rwxr-xr-x   3 mapr mapr      15406 2016-06-13 10:18 /tmp/querylogs_4/.drill.parquet_metadata
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/1985
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/1999
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/2005
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/2014
> drwxr-xr-x   - root root          6 2016-06-13 10:18 /tmp/querylogs_4/2016
> hadoop fs -ls /tmp/querylogs_4/1985
> Found 4 items
> -rwxr-xr-x   3 mapr mapr       3634 2016-06-13 10:18 /tmp/querylogs_4/1985/.drill.parquet_metadata
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/Feb
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/apr
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/jan 
> SELECT * FROM `dfs.tmp`.`querylogs_4` WHERE dir0 = MINDIR('dfs.tmp','querylogs_4');
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> | voter_id  | name  | age  | registration  | contributions  | voterzone  | date_time  | dir0  | dir1  | dir2  |
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> No rows selected (0.803 seconds)
> If the meta cache is removed, expected data is returned.
> Here is the physical plan:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {54.125 rows, 169.125 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664191
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664190
> 00-02        Project(T51¦¦*=[$0]) : rowType = RecordType(ANY T51¦¦*): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664189
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664188
> 00-04            Filter(condition=[=($1, '.drill.parquet_metadata')]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {50.0 rows, 165.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664187
> 00-05              Project(T51¦¦*=[$0], dir0=[$1]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664186
> 00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/querylogs_4/2005/May/voter25.parquet/0_0_0.parquet]], selectionRoot=/tmp/querylogs_4, numFiles=1, usedMetadataFile=true, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664185
> {code}
> Here is the plan for the same query against the same directory structure without meta data cache:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {82.5 rows, 157.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664312
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664311
> 00-02        Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664310
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/Feb/voter10.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/apr/voter65.parquet/0_0_0.parquet]], selectionRoot=maprfs:/tmp/querylogs_1, numFiles=3, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664309
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
