drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5105) Query time increases exponentially with increasing nested levels
Date Tue, 03 Jan 2017 19:52:58 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15796019#comment-15796019 ]

ASF GitHub Bot commented on DRILL-5105:
---------------------------------------

Github user chunhui-shi commented on a diff in the pull request:

    https://github.com/apache/drill/pull/715#discussion_r94471881
  
    --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/MapVector.java ---
    @@ -134,10 +134,6 @@ public int getBufferSizeFor(final int valueCount) {
     
       @Override
       public DrillBuf[] getBuffers(boolean clear) {
    --- End diff --
    
    Thanks for suggesting assert. It is a nice way to avoid running this check in production.
    My view is that we don't need any check here, for these reasons:

    1. MapVector does not have its own 'real' data vectors; it just goes to the underlying
    vectors and sums their buffer sizes.
    2. For non-MapVectors, either there are multiple underlying vectors (offsets, bits, etc.,
    as in ListVector) or getBufferSize() simply returns the writeIndex, in which case
    super.getBufferSize() is identical to this.getBufferSize(). And if a non-MapVector did
    not calculate its buffer size correctly, we would already have seen the problem when
    using that specific vector, so we don't need to check for it in the MapVector code.

    Actually, my first thought was to rewrite the logic that recursively checks the buffer
    size. But after reading getBufferSize() on all vectors, I decided the check is not needed
    at all, since any error in getBufferSize() would immediately cause serious problems (in
    serialization and deserialization) and could not pass the functional tests of that
    specific vector.
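
    For illustration only, point 1 above can be sketched as follows. This is a minimal
    sketch under stated assumptions: Vec, LeafVec and MapVec are hypothetical stand-ins,
    not Drill's actual ValueVector or MapVector classes. It only shows a map-like container
    that owns no data buffer of its own and reports the sum of its children's sizes.

```java
import java.util.ArrayList;
import java.util.List;

class BufferSizeSketch {
    /** Hypothetical stand-in for a value vector; NOT Drill's ValueVector API. */
    interface Vec {
        int getBufferSize();
    }

    /** Leaf vector: its size is simply the number of bytes written so far. */
    static final class LeafVec implements Vec {
        private final int writtenBytes;

        LeafVec(int writtenBytes) {
            this.writtenBytes = writtenBytes;
        }

        @Override
        public int getBufferSize() {
            return writtenBytes;
        }
    }

    /** Map-like container: owns no data buffer, only child vectors. */
    static final class MapVec implements Vec {
        private final List<Vec> children = new ArrayList<>();

        MapVec add(Vec child) {
            children.add(child);
            return this;
        }

        @Override
        public int getBufferSize() {
            // The container's size is just the sum over its children.
            int total = 0;
            for (Vec child : children) {
                total += child.getBufferSize();
            }
            return total;
        }
    }

    /** Builds a two-level nested structure and returns its total size. */
    static int demo() {
        MapVec inner = new MapVec().add(new LeafVec(16)).add(new LeafVec(8));
        MapVec outer = new MapVec().add(new LeafVec(4)).add(inner);
        return outer.getBufferSize(); // 4 + (16 + 8) = 28
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

    In this toy model any miscalculation lives in a leaf's own getBufferSize(), which is
    the place the comment argues such an error would already be caught.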
    



> Query time increases exponentially with increasing nested levels
> ----------------------------------------------------------------
>
>                 Key: DRILL-5105
>                 URL: https://issues.apache.org/jira/browse/DRILL-5105
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Storage - JSON
>    Affects Versions: 1.9.0
>         Environment: 3 Node Cluster with default memory and configurations. 
>            Reporter: Abhishek Girish
>            Assignee: Chunhui Shi
>
> The time taken to query any JSON dataset depends on the number of nested levels within the
dataset. Also, increasing the complexity of the dataset further increases the execution time.

> Tabulated below are cached query execution times for a simple select * query over two
simple forms of JSON datasets:
> || No. Levels || Time (s) Dataset 1 || Time (s) Dataset 2 ||
> | 1           | 0.22               | 0.27               |
> | 2           | 0.23               | 0.25               |
> | 4           | 0.24               | 0.22               |
> | 8           | 0.22               | 0.23               |
> | 16          | 0.34               | 0.48               |
> | 24          | 25.76              | 72.51              |
> | 26          | 103.48             | 289.6              |
> | 28          | 336.12             | 1151.94            |
> | 30          | 1342.22            | 4586.79            |
> | 32          | 5360.2             | Expected: ~20k     |
> The above table lists query times for 20 different JSON files, 10 belonging to dataset
1 & 10 belonging to dataset 2. Each file has 1 record, but the number of nested levels within
it varies as listed in the "No. Levels" column.
> It appears that the query time almost doubles with the addition of each nested level (note
that the rows from level 24 onward are two levels apart, so this shows up as an almost 4x
increase between consecutive rows).
> The two representative datasets below show simple JSON structures with
nested levels.
> Structure of Dataset 1:
> {code}
> {
>   "level1": {
>     "field1": "a",
>     "level2": {
>       "field1": "b",
>       ...
>     }
>   }
> }
> {code}
> Structure of Dataset 2:
> {code}
> {
>   "level1": {
>     "field1": "a",
>     "field2": {
>       "nfield1": true,
>       "nfield2": 1.1
>     },
>     "level2": {
>       "field1": "b",
>       "field2": {
>         "nfield1": false,
>         "nfield2": 2.2
>       },
>       ...
>     }
>   }
> }
> {code}
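
As a rough check of the growth rate described in the quoted issue, the sketch below takes the Dataset 1 timings for levels 24 to 32 from the table and computes the ratio between consecutive rows, which are two nesting levels apart. The class and method names are hypothetical and exist only for this illustration. A ratio near 4 across two levels corresponds to a factor of about 2 per level, and applying a 4x step to the last Dataset 2 measurement (4586.79 s) gives roughly 18,300 s, in line with the "~20k" estimate.

```java
class NestingGrowth {
    // Dataset 1 timings in seconds for levels 24, 26, 28, 30, 32 (from the table).
    static final double[] DATASET1 = {25.76, 103.48, 336.12, 1342.22, 5360.2};

    /** Ratio between each measurement and the previous one. */
    static double[] stepRatios(double[] times) {
        double[] ratios = new double[times.length - 1];
        for (int i = 1; i < times.length; i++) {
            ratios[i - 1] = times[i] / times[i - 1];
        }
        return ratios;
    }

    /** Per-level growth factor implied by a ratio measured across `levels` levels. */
    static double perLevelFactor(double ratio, int levels) {
        return Math.pow(ratio, 1.0 / levels);
    }

    public static void main(String[] args) {
        for (double ratio : stepRatios(DATASET1)) {
            System.out.printf("two-level ratio %.2f, per-level factor %.2f%n",
                    ratio, perLevelFactor(ratio, 2));
        }
    }
}
```

The step from level 26 to 28 comes out lower than the others (a ratio of about 3.2), so the doubling per level is an approximation of the measurements, not an exact law.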



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
