drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4720) MINDIR() and IMINDIR() functions return no results with metadata cache
Date Fri, 30 Jun 2017 11:45:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16069961#comment-16069961 ]

ASF GitHub Bot commented on DRILL-4720:
---------------------------------------

GitHub user arina-ielchiieva opened a pull request:

    https://github.com/apache/drill/pull/864

    DRILL-4720: Fix SchemaPartitionExplorer.getSubPartitions method implementations to return only Drill file system directories

    1. Added file system utility helper classes, with unit tests, to standardize how Drill lists directory and file statuses.
    2. Fixed SchemaPartitionExplorer.getSubPartitions method implementations to return only directories that can be partitions according to Drill file system rules (files are excluded, as are directories whose names start with a dot or an underscore).
    3. Added unit tests for the directory explorer UDFs with and without the metadata cache present.
    4. Minor refactoring.
    
    Details in Jira [DRILL-4720](https://issues.apache.org/jira/browse/DRILL-4720).
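
The filtering rule in item 2 can be illustrated with a minimal sketch. This is not the code from the pull request: the class `PartitionDirFilter`, the `Entry` type, and the method names are hypothetical, and plain Java collections stand in for the file system listing so the example runs without extra dependencies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not Drill's actual helper classes.
final class PartitionDirFilter {

    // Stand-in for a file status: a name plus a directory flag.
    static final class Entry {
        final String name;
        final boolean isDirectory;

        Entry(String name, boolean isDirectory) {
            this.name = name;
            this.isDirectory = isDirectory;
        }
    }

    // An entry can be a partition only if it is a directory whose name
    // does not start with a dot or an underscore.
    static boolean isPartitionDir(Entry e) {
        return e.isDirectory && !e.name.startsWith(".") && !e.name.startsWith("_");
    }

    // Returns the names of the entries that qualify as partitions, in input order.
    static List<String> subPartitions(List<Entry> entries) {
        List<String> result = new ArrayList<>();
        for (Entry e : entries) {
            if (isPartitionDir(e)) {
                result.add(e.name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Entry> listing = Arrays.asList(
                new Entry(".drill.parquet_metadata", false), // metadata cache file
                new Entry("1985", true),
                new Entry("1999", true),
                new Entry("_staging", true));                // hypothetical hidden directory
        System.out.println(subPartitions(listing));          // prints [1985, 1999]
    }
}
```

With a listing like the one in the issue, only the year directories survive, so the metadata cache file is no longer a candidate partition value.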

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/arina-ielchiieva/drill DRILL-4720

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #864
    
----
commit 6d592373c740fd793ed6bbb3264b97b52e4b763b
Author: Arina Ielchiieva <arina.yelchiyeva@gmail.com>
Date:   2017-06-29T13:08:33Z

    DRILL-4720: Fix SchemaPartitionExplorer.getSubPartitions method implementations to return only Drill file system directories

    1. Added file system utility helper classes, with unit tests, to standardize how Drill lists directory and file statuses.
    2. Fixed SchemaPartitionExplorer.getSubPartitions method implementations to return only directories that can be partitions according to Drill file system rules (files are excluded, as are directories whose names start with a dot or an underscore).
    3. Added unit tests for the directory explorer UDFs with and without the metadata cache present.
    4. Minor refactoring.

----


> MINDIR() and IMINDIR() functions return no results with metadata cache
> ----------------------------------------------------------------------
>
>                 Key: DRILL-4720
>                 URL: https://issues.apache.org/jira/browse/DRILL-4720
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Functions - Drill
>    Affects Versions: 1.7.0
>            Reporter: Krystal
>            Assignee: Arina Ielchiieva
>
> Queries on Parquet directories that have a metadata cache return 0 rows when filtering with the MINDIR and IMINDIR functions.
> hadoop fs -ls /tmp/querylogs_4
> Found 6 items
> -rwxr-xr-x   3 mapr mapr      15406 2016-06-13 10:18 /tmp/querylogs_4/.drill.parquet_metadata
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/1985
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/1999
> drwxr-xr-x   - root root          3 2016-06-13 10:18 /tmp/querylogs_4/2005
> drwxr-xr-x   - root root          4 2016-06-13 10:18 /tmp/querylogs_4/2014
> drwxr-xr-x   - root root          6 2016-06-13 10:18 /tmp/querylogs_4/2016
> hadoop fs -ls /tmp/querylogs_4/1985
> Found 4 items
> -rwxr-xr-x   3 mapr mapr       3634 2016-06-13 10:18 /tmp/querylogs_4/1985/.drill.parquet_metadata
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/Feb
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/apr
> drwxr-xr-x   - root root          2 2016-06-13 10:18 /tmp/querylogs_4/1985/jan 
> SELECT * FROM `dfs.tmp`.`querylogs_4` WHERE dir0 = MINDIR('dfs.tmp','querylogs_4');
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> | voter_id  | name  | age  | registration  | contributions  | voterzone  | date_time  | dir0  | dir1  | dir2  |
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> +-----------+-------+------+---------------+----------------+------------+------------+-------+-------+-------+
> No rows selected (0.803 seconds)
> If the metadata cache is removed, the expected data is returned.
> Here is the physical plan:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {54.125 rows, 169.125 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664191
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664190
> 00-02        Project(T51¦¦*=[$0]) : rowType = RecordType(ANY T51¦¦*): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664189
> 00-03          SelectionVectorRemover : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {53.75 rows, 168.75 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664188
> 00-04            Filter(condition=[=($1, '.drill.parquet_metadata')]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 3.75, cumulative cost = {50.0 rows, 165.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664187
> 00-05              Project(T51¦¦*=[$0], dir0=[$1]) : rowType = RecordType(ANY T51¦¦*, ANY dir0): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664186
> 00-06                Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=/tmp/querylogs_4/2005/May/voter25.parquet/0_0_0.parquet]], selectionRoot=/tmp/querylogs_4, numFiles=1, usedMetadataFile=true, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 25.0, cumulative cost = {25.0 rows, 50.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664185
> {code}
> Here is the plan for the same query against the same directory structure without the metadata cache:
> {code}
> 00-00    Screen : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {82.5 rows, 157.5 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664312
> 00-01      Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664311
> 00-02        Project(*=[$0]) : rowType = RecordType(ANY *): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664310
> 00-03          Scan(groupscan=[ParquetGroupScan [entries=[ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/Feb/voter10.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/jan/voter5.parquet/0_0_0.parquet], ReadEntryWithPath [path=maprfs:///tmp/querylogs_1/1985/apr/voter65.parquet/0_0_0.parquet]], selectionRoot=maprfs:/tmp/querylogs_1, numFiles=3, usedMetadataFile=false, columns=[`*`]]]) : rowType = (DrillRecordRow[*, dir0]): rowcount = 75.0, cumulative cost = {75.0 rows, 150.0 cpu, 0.0 io, 0.0 network, 0.0 memory}, id = 664309
> {code}
> {code}
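
The first plan's filter compares dir0 with '.drill.parquet_metadata', which suggests that the metadata cache file itself was picked as the minimum entry. The sketch below reproduces that effect under one assumption: that the minimum is chosen by plain, case-sensitive string comparison over all names in the directory. The class `MinDirSketch` and the method `minDir` are hypothetical and are not part of Drill.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; assumes the minimum is picked by natural String ordering.
final class MinDirSketch {

    // Returns the smallest name. When skipHidden is true, names starting
    // with a dot or an underscore are ignored before taking the minimum.
    static String minDir(List<String> names, boolean skipHidden) {
        return names.stream()
                .filter(n -> !skipHidden || !(n.startsWith(".") || n.startsWith("_")))
                .min(Comparator.naturalOrder())
                .orElseThrow(() -> new IllegalArgumentException("no candidate names"));
    }

    public static void main(String[] args) {
        // Entry names taken from the /tmp/querylogs_4 listing in the issue.
        List<String> names = Arrays.asList(
                ".drill.parquet_metadata", "1985", "1999", "2005", "2014", "2016");

        // '.' (0x2E) sorts before '1' (0x31), so the cache file wins.
        System.out.println(minDir(names, false)); // prints .drill.parquet_metadata

        // Ignoring dot and underscore names yields the expected directory.
        System.out.println(minDir(names, true));  // prints 1985
    }
}
```

Because no row has dir0 equal to the cache file name, a filter built from the unfiltered minimum matches nothing, which is consistent with the "No rows selected" result above.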



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
