drill-issues mailing list archives

From "Parth Chandra (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-6223) Drill fails on Schema changes
Date Mon, 02 Apr 2018 09:18:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422042#comment-16422042 ]

Parth Chandra commented on DRILL-6223:
--------------------------------------

{quote}To your point about compensation logic in the context of Schema Changes
 * Why do you think it is ok to dynamically include new columns?
 * Yet it is not ok to exclude them?{quote}

Usually, in real-world data with dynamically changing schemas, new columns are added and
not removed.
{quote}
 * Consider a batch of 32k rows
 * A VV with null integer values will require 32kb (bits) + 32kb * 4 = 160kb
 * Each missing column will require that much memory per mini-fragment{quote}

One of the guarantees provided by value vectors is that elements can be accessed by index
in constant time (or, in the case of nested elements, in O(m) where m is the level of nesting).
The representation is designed around this guarantee. It comes at the cost of additional
memory usage, which is a deliberate tradeoff.
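
To make the tradeoff concrete, here is a minimal Python sketch of a fixed-width nullable integer vector. It is a toy model, not Drill's implementation: the class name ToyNullableIntVector and its methods are hypothetical, and the layout of one is-set byte plus four value bytes per row is an assumption chosen to match the 160kb figure quoted above.

```python
import struct

INT_WIDTH = 4  # bytes per 32-bit integer value


class ToyNullableIntVector:
    """Toy model only: one is-set byte and one 4-byte value slot per row."""

    def __init__(self, capacity):
        # Both buffers are allocated up front for every row, null or not.
        self.is_set = bytearray(capacity)              # 0 means null
        self.values = bytearray(capacity * INT_WIDTH)  # fixed-width slots

    def set(self, index, value):
        struct.pack_into("<i", self.values, index * INT_WIDTH, value)
        self.is_set[index] = 1

    def get(self, index):
        # Constant time: the byte offset is computed directly from the index.
        if not self.is_set[index]:
            return None
        return struct.unpack_from("<i", self.values, index * INT_WIDTH)[0]

    def size_in_bytes(self):
        return len(self.is_set) + len(self.values)


batch = ToyNullableIntVector(32 * 1024)  # a batch of 32k rows
batch.set(7, 42)
print(batch.get(7))                   # 42
print(batch.get(8))                   # None
print(batch.size_in_bytes() // 1024)  # 160
```

Because every slot has a fixed width, the position of row i never depends on the contents of earlier rows, which is what keeps lookups constant time. A representation that skips nulls would save memory, but it would need extra bookkeeping to locate row i.
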
{quote}This is unless (similarly to the implicit columns) we optimize the VV storage representation
or / and push the column preservation to higher layers such as the client or foreman
{quote}
It would be wonderful to improve vectors to use much less memory while providing the same
guarantees. A proposal would be welcome. 

 

> Drill fails on Schema changes 
> ------------------------------
>
>                 Key: DRILL-6223
>                 URL: https://issues.apache.org/jira/browse/DRILL-6223
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Execution - Relational Operators
>    Affects Versions: 1.10.0, 1.12.0
>            Reporter: salim achouche
>            Assignee: salim achouche
>            Priority: Major
>             Fix For: 1.14.0
>
>
> Drill query fails when selecting all columns from a complex nested data file (Parquet) set. There are differences in schema among the files:
>  * The Parquet files exhibit differences both at the first level and within nested data types
>  * A select * will not cause an exception but using a limit clause will
>  * Note also this issue seems to happen only when multiple Drillbit minor fragments are involved (concurrency higher than one)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)