drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5830) Resolve regressions to MapR DB from DRILL-5546
Date Mon, 02 Oct 2017 06:10:02 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5830?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187677#comment-16187677
] 

ASF GitHub Bot commented on DRILL-5830:
---------------------------------------

Github user jinfengni commented on the issue:

    https://github.com/apache/drill/pull/968
  
    I'm not convinced that it's a good idea to back out the HBase-specific changes
made in DRILL-5546.
    
    You are right that project push-down in the planner ideally should do its job and push the
list of columns. However, until the planner can claim there are no issues at all (just like
what happened prior to DRILL-5546), execution may still face a column list containing "*". That's
why in HBaseGroupScan we have verifyColumnsAndConvertStar, just in case the planner's project
push-down did not work the way we want. If, as you suggested, such conversion in HBaseGroupScan
is redundant, why would it cause a regression (if the planner's rule works as expected)? Or is it
your intention to still keep "*" in HBaseRecordReader? If that's your intention, I think we
are going in the wrong direction. HBase has a unified schema at the table level.
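
The fallback described above can be pictured with a minimal sketch. This is not the actual verifyColumnsAndConvertStar code in HBaseGroupScan; the class name, the method name, the plain string column names, and the assumption that "*" expands to the row key plus every column family of the table are all hypothetical and used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for a star-conversion step; not Drill's real code.
final class StarExpansionSketch {
    static final String STAR = "*";
    static final String ROW_KEY = "row_key";

    /**
     * If the projected list still contains "*", replace the whole list with the
     * row key plus every column family known at table level (an assumption of
     * this sketch). Otherwise return a copy of the explicit projection.
     */
    static List<String> convertStar(List<String> projected, List<String> tableFamilies) {
        if (!projected.contains(STAR)) {
            return new ArrayList<>(projected);
        }
        List<String> expanded = new ArrayList<>();
        expanded.add(ROW_KEY);
        expanded.addAll(tableFamilies);
        return expanded;
    }
}
```

If the planner has already pushed an explicit column list, the method returns it unchanged, which is why such a step would be redundant but harmless when push-down works.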
    
    I agree with the analysis of empty map {} vs. {a:varbinary}. It's something we have
to deal with. As a matter of fact, such scenarios do not have to come from an empty batch.
They could happen with two regions, each with >0 rows.
    
    For instance, region 1 has 10 rows, with cf1.c1 appearing in only the first 5 rows, while region
2 has 20 rows, with cf1.c1 appearing in every row. For the following query:
    select CF1 FROM table where some_condition_on_row_key;
    
    If "some_condition_on_row_key" is pushed to HBase and prunes the first 5 rows, region 1
will return a batch with 5 rows but with cf1 as an empty map, while region 2 will return a batch
with cf1 as {c1:varbinary}.
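
The two-region scenario can be reproduced with a small, self-contained model. It is only a sketch: each row is reduced to the set of qualifiers it carries under cf1, a batch's map schema is taken to be the union of those qualifiers, and the row-key filter is modeled as dropping a leading run of rows. The class and method names are hypothetical, and none of this is HBase or Drill API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the example; rows are just sets of cf1 qualifiers.
final class RegionSchemaSketch {
    /** Builds a region of rowCount rows where only the first rowsWithC1 rows carry cf1.c1. */
    static List<Set<String>> region(int rowCount, int rowsWithC1) {
        List<Set<String>> rows = new ArrayList<>();
        for (int i = 0; i < rowCount; i++) {
            if (i < rowsWithC1) {
                rows.add(Collections.singleton("c1"));
            } else {
                rows.add(Collections.<String>emptySet());
            }
        }
        return rows;
    }

    /** Qualifiers seen under cf1 once the first `pruned` rows are filtered out. */
    static Set<String> cf1Schema(List<Set<String>> rows, int pruned) {
        Set<String> schema = new TreeSet<>();
        for (Set<String> row : rows.subList(pruned, rows.size())) {
            schema.addAll(row);
        }
        return schema;
    }
}
```

With `region(10, 5)` and 5 pruned rows the resulting schema is empty, while `region(20, 20)` with no pruned rows yields a schema containing c1, so two non-empty batches disagree on the contents of cf1.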
    
    In that sense, DRILL-5546 exposes such issues and forces us to find a solution for handling
empty map {} vs. {a:varbinary}.



> Resolve regressions to MapR DB from DRILL-5546
> ----------------------------------------------
>
>                 Key: DRILL-5830
>                 URL: https://issues.apache.org/jira/browse/DRILL-5830
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.12.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>             Fix For: 1.12.0
>
>
> DRILL-5546 added a number of fixes for empty batches. One part of the fix was for HBase.
Key changes:
> * Add code to expand wildcards in the planner. (i.e. SELECT *)
> * Remove support for wildcards in the HBase record reader.
> As noted in DRILL-5775, this change had the effect of breaking support for MapR-DB binary
(which is API compatible with HBase). DRILL-5775 fixes this by expanding wildcards in the planner
for MapR DB, as was done for HBase in DRILL-5546.
> Unfortunately, this change introduced other regressions into the code as described by
DRILL-5706.
> Investigation of those issues revealed that we should back out the original DRILL-5546
changes and go down a different route.
> As it turns out, HBase already had a project push-down rule that expanded wildcards.
However, that rule didn't work correctly some of the time. DRILL-5546 fixed that bug, ensuring
that wildcards are expanded (at least in the cases tested for this ticket.)
> The actual issue turned out to be a bug in the {{RecordBatchLoader}} class which did
not consider map contents when detecting schema change. As a result, results like (row_key,
cf{}) were treated the same as (row_key, cf{mycol}) and the actual data columns were discarded,
seemingly at random, depending on batch arrival order.
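
The difference between a shallow and a map-aware schema comparison can be sketched as follows. This is not the RecordBatchLoader implementation; the `Field` class, the type names held as strings, and both comparison methods are hypothetical and only illustrate why ignoring map contents makes cf{} and cf{mycol} look identical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical schema model; not Drill's MaterializedField or RecordBatchLoader.
final class MapSchemaCompareSketch {
    static final class Field {
        final String name;
        final String type;
        final List<Field> children;

        Field(String name, String type, Field... children) {
            this.name = name;
            this.type = type;
            this.children = Arrays.asList(children);
        }
    }

    /** Compares name and type only, so a map's members are ignored. */
    static boolean sameTopLevel(Field a, Field b) {
        return a.name.equals(b.name) && a.type.equals(b.type);
    }

    /** Also recurses into children, so maps with different members differ. */
    static boolean sameDeep(Field a, Field b) {
        if (!sameTopLevel(a, b) || a.children.size() != b.children.size()) {
            return false;
        }
        for (int i = 0; i < a.children.size(); i++) {
            if (!sameDeep(a.children.get(i), b.children.get(i))) {
                return false;
            }
        }
        return true;
    }
}
```

A loader that relies on something like `sameTopLevel` would see no schema change between a batch with cf{} and one with cf{mycol}, whereas `sameDeep` reports the change.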



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
