drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-5376) Rationalize Drill's row metadata for simpler code, better performance
Date Fri, 24 Mar 2017 17:41:41 GMT

    [ https://issues.apache.org/jira/browse/DRILL-5376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15940797#comment-15940797
] 

Paul Rogers commented on DRILL-5376:
------------------------------------

Another asymmetry between rows and maps occurs in the "hyper vector." The simplest hyper vector
is simply two value vectors "stacked" one atop another. An SV4 indexes into the two separate
vectors.

{code}
| 10 |
| 20 |
| 30 |
  --
| 40 |
| 50 |
{code}

A hyper vector of row batches is a set of stacked vectors. Very simple to index into. An SV4
is essentially a pair (batch, offset), where batch gives the index of the vector in the stack,
the offset gives the offset of the data within the vector.

But, a map hyper vector has a completely different structure. It is a stack of maps, each
map with a list of its vectors.

{code}
| map1 --> ([10, 20, 30], ["a", "b", "c"]) |
 - - -
| map2 --> ([1, 2, 3], ["fred", "betty", "wilma"]) |
{code}

Indexing here is much harder. Assume the schema is (map( a, b)). That is a map field. In the
map are fields a and b. To get to the third a value in the second batch (2, 3), one uses the
batch to find map2. One does not index into map2 as above, instead, one then looks up field
a to find the corresponding value vector. Then one indexes into vector a.

The complexity mounts with nested maps.

Instead, a very easy alternative is simply to stack the map members:

{code}
| 20 | "a" |
| 30 | "b" |
| 40 | "c" |
  - - - -
| 1 | "fred" |
| 2 | "betty" |
| 3 | "wilma" |
{code}

Now, to get to (3,2) for a, one simply finds the "map.a" vector, and indexes into it the same
as for top-level scalar vectors. No need for two distinct code paths.

> Rationalize Drill's row metadata for simpler code, better performance
> ---------------------------------------------------------------------
>
>                 Key: DRILL-5376
>                 URL: https://issues.apache.org/jira/browse/DRILL-5376
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>
> Drill is a columnar system, but data is ultimately represented as rows (AKA records or
tuples.) The way that Drill represents rows leads to excessive code complexity and runtime
cost.
> Data in Drill is stored in vectors: one (or more) per column. Vectors do not stand alone,
however, they are "bundled" into various forms of grouping: the {{VectorContainer}}, {{RecordBatch}},
{{VectorAccessible}}, {{VectorAccessibleSerializable}}, and more. Each has slightly different
semantics, requiring large amounts of code to bridge between the representations.
> Consider only a simple row: one with only scalar columns. In classic relational theory,
such a row is a tuple:
> {code}
> R = (a, b, c, d, ...)
> {code}
> A tuple is defined as an ordered list of column values. Unlike a list or array, the column
values also have names and may have varying data types.
> In SQL, columns are referenced by either position or name. In most execution engines,
columns are referenced by position (since positions, in most systems, cannot change.) A 1:1
mapping is provided between names and positions. (See the JDBC {{RecordSet}} interface.)
> This allows code to be very fast: code references columns by index, not by name, avoiding
name lookups for each column reference.
> Drill provides a murky, hybrid approach. Some structures ({{BatchSchema}}, for example)
appear to provide a fixed column ordering, allowing indexed column access. But, other abstractions
provide only an iterator. Others (such as {{VectorContainer}}) provides name-based access
or, by clever programming, indexed access.
> As a result, it is never clear exactly how to quickly access a column: by name, by name
to multi-part index to vector?
> Of course, Drill also supports maps, which add to the complexity. First, we must understand
that a "map" in Drill is not a "map" in the classic sense: it is not a collection of (name,
value) pairs in the JSON sense: a collection in which each instance may have a different set
of pairs.
> Instead, in Drill, a "map" is really a nested tuple: a map has the same structure as
a Drill record: a collection of names and values in which all rows have the same structure.
(This is so because maps are really a collection of value vectors, and the vectors cut across
all rows.)
> Drill, however, does not reflect this symmetry: that a row and a map are both tuples.
There are no common abstractions for the two. Instead, maps are represented as a {{MapVector}}
that contains a (name, vector) map for its children.
> Because of this name-based mapping, high-speed indexed access to vectors is not provided
"out of the box." Certainly each consumer of a map can build its own indexing mechanism. But,
this leads to code complexity and redundancy.
> This ticket asks to rationalize Drill's row, map and schema abstractions around the tuple
concept. A schema is a description of a tuple and should (as in JDBC) provide both name and
index based access. That is, provide methods of the form:
> {code}
> MaterializedField getField(int index);
> MaterializedField getField(String name);
> ...
> ValueVector getVector(int index);
> ValueVector getVector(String name);
> {code}
> Provide a common abstraction for rows and maps, recognizing their structural similarity.
> There is an obvious issue with indexing columns in a row when the row contains maps.
Should indexing be multi-part (index into row, then into map) as today? A better alternative
is to provide a flattened interface:
> {code}
> 0: a, 1: b.x, 2: b.y, 3: c, ...
> {code}
> Use this change to simplify client code, over time, to use a simple indexed-based column
access.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message