drill-issues mailing list archives

From "Jinfeng Ni (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5384) Sort cannot directly access map members, causes a data copy
Date Mon, 27 Mar 2017 18:21:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943772#comment-15943772 ]

Jinfeng Ni commented on DRILL-5384:

I guess that in the first case, with the text file, you are running "select * from file
order by a, b, c"? The three additional columns are added because of how "*" is handled,
not because of the sort operator. But I did not see your example, so I cannot comment further.

As for the map vector: if the complex path does not contain an "array" segment, then my
point is that the new proposal saves no memory compared with the current approach. The
project operator, which does the vector transfer, does exactly the same work that the sort
operator would have to do to access the vectors referred to in the complex path. I'm not
clear how this would "minimize memory use and optimize performance".
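
To make the transfer-versus-copy distinction concrete, here is a minimal sketch. It does not
use Drill's vector classes: IntColumn, transfer() and copy() are hypothetical stand-ins, and a
plain int[] plays the role of the backing buffer. It only illustrates that a transfer hands
over the existing buffer, while a copy allocates a second one.

```java
import java.util.Arrays;

public class TransferVsCopy {
    /** Hypothetical stand-in for a value vector: a holder around a backing buffer. */
    static final class IntColumn {
        int[] buffer;

        IntColumn(int[] buffer) {
            this.buffer = buffer;
        }

        /** Transfer: the target takes over the same buffer; no new memory is allocated. */
        IntColumn transfer() {
            IntColumn target = new IntColumn(buffer);
            buffer = null; // the source gives up ownership
            return target;
        }

        /** Copy: allocates a second buffer with the same contents. */
        IntColumn copy() {
            return new IntColumn(Arrays.copyOf(buffer, buffer.length));
        }
    }

    public static void main(String[] args) {
        int[] data = {30, 10, 20};
        IntColumn nested = new IntColumn(data);

        IntColumn copied = nested.copy();          // second buffer exists from here on
        IntColumn transferred = nested.transfer(); // same buffer, new owner

        System.out.println("copy shares buffer:     " + (copied.buffer == data));
        System.out.println("transfer shares buffer: " + (transferred.buffer == data));
    }
}
```

If the project really does a transfer in this sense, the nested column and the top-level
column are the same memory, which is the basis of the "no saving" argument above.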

> Sort cannot directly access map members, causes a data copy
> -----------------------------------------------------------
>                 Key: DRILL-5384
>                 URL: https://issues.apache.org/jira/browse/DRILL-5384
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Priority: Minor
> Suppose we have a JSON structure for "orders" like this:
> {code}
> { "customer": { "id": 10, "name": "fred" },
>   "order": { "id": 20, "product": "Frammis 1000" } }
> {code}
> Suppose I want to sort by customer.id. Today, Drill will project customer.id up to the
> top level as a temporary, hidden field. Drill will copy the data from the customer.id vector
> to this new temporary field. Drill then sorts on the temporary column, and uses another project
> to remove the columns.
> Clearly, this works, but it has a cost:
> * Extra two project operators.
> * Extra memory copy.
> * Sort must buffer both the original and copied data. This can double memory use in the
>   worst case.
> All of this is done simply to avoid having to reference "customer.id" in the sort.
> But, as explained in DRILL-5376, maps are just nested tuples; there is no need to copy
> the data, the data is already right there in a value vector. The problem is that Drill's map
> implementation makes it hard for the generated code to get at the "customer.id" vector.
> This ticket asks to allow the sort to work directly with nested scalars to avoid the
> overhead explained above. To do this:
> 1. Fix nested scalar access to allow the generated code to easily access a nested scalar.
> 2. Allow a sort key of the form "customer.id".
> 3. Modify the planner to generate such sort keys instead of the dual projects.
> The result will be a leaner, faster sort operation when sorting on scalars within a map.
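
The idea in the quoted description, sorting on a nested scalar without first copying it to a
top-level column, can be sketched as follows. This is an illustration under stated
assumptions, not Drill's generated code: the parallel arrays stand in for the value vectors
of the "customer" map, and sortIndexes is a hypothetical helper that produces a row ordering
(in the spirit of a selection vector) by reading the nested column in place.

```java
import java.util.Arrays;

public class NestedSortSketch {
    /**
     * Returns row indexes ordered by the values in keyColumn.
     * The key column is only read; it is never copied or rearranged.
     */
    static Integer[] sortIndexes(int[] keyColumn) {
        Integer[] order = new Integer[keyColumn.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Integer.compare(keyColumn[a], keyColumn[b]));
        return order;
    }

    public static void main(String[] args) {
        // Hypothetical columnar layout for three "orders" rows.
        int[] customerId = {10, 7, 42};                     // customer.id
        String[] customerName = {"fred", "wilma", "barney"}; // customer.name

        // Sort key is the nested column itself; no temporary top-level copy is made.
        Integer[] order = sortIndexes(customerId);
        for (int row : order) {
            System.out.println(customerId[row] + " " + customerName[row]);
        }
    }
}
```

The sort key here is addressed directly as the nested column, so only the index array is new
memory; whether that is cheaper than a project that transfers the vector is the point under
discussion in the comment above.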

This message was sent by Atlassian JIRA
