drill-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5514) Enhance VectorContainer to merge two row sets
Date Thu, 15 Jun 2017 19:16:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5514?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16050955#comment-16050955 ]

ASF GitHub Bot commented on DRILL-5514:
---------------------------------------

Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/837#discussion_r122287096
  
    --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/record/TestVectorContainer.java ---
    @@ -110,13 +110,16 @@ public void testContainerMerge() {
         RowSet mergedRs = left.merge(right);
         comparison.verifyAndClear(mergedRs);
     
    -    // Add a selection vector. Ensure the SV appears in the merged
    -    // result. Test as a row set since container's don't actually
    -    // carry the selection vector.
    +    // Add a selection vector. Merging is forbidden.
    --- End diff --
    
    Maybe this can be changed to "//Merging data with a selection vector is forbidden". As is, the comment implies that we are adding a selection vector.


> Enhance VectorContainer to merge two row sets
> ---------------------------------------------
>
>                 Key: DRILL-5514
>                 URL: https://issues.apache.org/jira/browse/DRILL-5514
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>             Fix For: 1.11.0
>
>
> Consider the concept of a "record batch" in Drill. On the one hand, one can envision a record batch as a stack of records:
> {code}
> | a1 | b1 | c1 |
> ----------------
> | a2 | b2 | c2 |
> {code}
> But Drill is columnar, so a record batch is really a "bundle" of vectors:
> {code}
> | a1 |    | b1 |    | c1 |
> | a2 |    | b2 |    | c2 |
> {code}
> There are times when it is handy to build up a record batch as a merge of two different vector bundles:
> {code}
> -- bundle 1 --    -- bundle 2 --
> | a1 |    | b1 |        | c1 |
> | a2 |    | b2 |        | c2 |
> {code}
> For example, consider a reader. The reader implementation might read columns (a, b) from a file, say. Then the {{ScanBatch}} might add (c) as an implicit vector (the file name, say). The merged set of vectors comprises the final schema: (a, b, c).
> This ticket asks for the code to do the merge:
> * Merge two schemas A = (a, b), B = (c) to create schema C = (a, b, c).
> * Merge two vector containers C1 and C2 to create a new container, C3, that holds the merger of the vectors from the first two.
> Clearly, the merge only makes sense if:
> * The two input containers have the same row count, and
> * The columns in each input container are distinct.
> Because this feature is also useful for tests, add the merge to the "row set" tools also.
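
The merge rules in the description above can be sketched roughly as follows. This is only an illustration, not Drill's actual code: `ColumnBundle` and its methods are hypothetical stand-ins for a vector container, columns are plain Java lists rather than value vectors, and the sketch copies column data where a real container would presumably share the existing vectors. It shows the two preconditions (equal row counts, distinct column names) and the resulting schema order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a vector container: named columns of equal length. */
final class ColumnBundle {
    // Column name -> values. LinkedHashMap keeps columns in insertion (schema) order.
    private final Map<String, List<Object>> columns = new LinkedHashMap<>();
    private final int rowCount;

    ColumnBundle(int rowCount) {
        this.rowCount = rowCount;
    }

    /** Adds a column; rejects a wrong length or a name that already exists. */
    ColumnBundle add(String name, List<?> values) {
        if (values.size() != rowCount) {
            throw new IllegalArgumentException(
                "Column " + name + " has " + values.size() + " rows, expected " + rowCount);
        }
        if (columns.containsKey(name)) {
            throw new IllegalArgumentException("Duplicate column: " + name);
        }
        columns.put(name, new ArrayList<>(values));
        return this;
    }

    /** Column names in order, e.g. (a, b, c). */
    List<String> schema() {
        return new ArrayList<>(columns.keySet());
    }

    int rowCount() {
        return rowCount;
    }

    List<Object> column(String name) {
        return columns.get(name);
    }

    /**
     * Builds a new bundle holding this bundle's columns followed by the other's.
     * Fails if the row counts differ or if any column name appears in both inputs.
     */
    ColumnBundle merge(ColumnBundle other) {
        if (rowCount != other.rowCount) {
            throw new IllegalArgumentException(
                "Row counts differ: " + rowCount + " vs " + other.rowCount);
        }
        ColumnBundle merged = new ColumnBundle(rowCount);
        for (Map.Entry<String, List<Object>> e : columns.entrySet()) {
            merged.add(e.getKey(), e.getValue());
        }
        for (Map.Entry<String, List<Object>> e : other.columns.entrySet()) {
            merged.add(e.getKey(), e.getValue()); // add() rejects duplicate names
        }
        return merged;
    }
}
```

In the reader scenario from the description, the (a, b) bundle read from the file would be the left input and the implicit file-name column (c) the right input, so `left.merge(right).schema()` yields the columns in the order a, b, c. Neither input is modified by the merge in this sketch.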



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
