drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Created] (DRILL-5525) Inconsistent, unhelpful semantics for batch, field schema comparison
Date Thu, 18 May 2017 19:49:04 GMT
Paul Rogers created DRILL-5525:
----------------------------------

             Summary: Inconsistent, unhelpful semantics for batch, field schema comparison
                 Key: DRILL-5525
                 URL: https://issues.apache.org/jira/browse/DRILL-5525
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.10.0
            Reporter: Paul Rogers
            Assignee: Paul Rogers
            Priority: Minor
             Fix For: 1.11.0


Drill provides two classes to define schema at execution time:

* {{MaterializedField}} - describes one column (or a collection of columns for a map)
* {{BatchSchema}} - describes the set of columns that make up a row

Each class provides an {{isEqual()}} method. However it seems these methods may not be used
much because the contain a number of flaws and contradictions.

* {{MaterializedField}} has a comment that says it compares only names; but then it turns
around and compares the object field-by-field using Protobuf:

{code}
    // DRILL-1872: Compute equals only on key. See also the comment
    // in MapVector$MapTransferPair

    return this.name.equalsIgnoreCase(other.name) &&
            Objects.equals(this.type, other.type);
{code}

* {{MaterializedField}} ends up doing a physical comparison of fields rather than a logical
comparison. Varchar columns are defined, at the logical level as Varchar. But, at the physical
level, a Varchar is a two-part column: it has a child that represents the {{$offsets$}} vector.
As a result, a comparison between logical and physical schema returns false, even if both
is just a Varchar.

* {{BatchSchema}} ends up being inconsistent because of the above confusion. It first (seemingly)
compares names, then tries to compare types. But, because the {{MaterializedField}} method
actually compares all fields, the type comparison does nothing.

{code}
    if (!fields.equals(other.fields)) { // Compare all fields
      return false;
    }
    for (int i = 0; i < fields.size(); i++) { // So this loop is a no-op
      MajorType t1 = fields.get(i).getType();
      MajorType t2 = other.fields.get(i).getType();
      if (t1 == null) {
        if (t2 != null) {
          return false;
        }
      } else {
        if (!majorTypeEqual(t1, t2)) {
          return false;
        }
      }
    }
{code}

The result is that one is not quite sure what the two methods are supposed to compare. Is
{{MaterializedField}} supposed to:

* Do a physical compare (existing behavior)?
* Do a name-only compare (as the comment suggests)?
* Do a logical comparison (as unit tests would want)?

And for {{BatchSchema}}, should it:

* Do a physical compare (existing behavior)?
* Do a type-aware comparison (as in the loop that does nothing)?

Further note that neither of the methods do anything special for map fields: they end up being
compared as part of the protobuf comparison. But, we shold probably apply the same rules to
map fields as we apply to top-level fields (as expressed in the second loop above.)

This ticket requests:

* Document current uses.
* Determine the semantics of the {{isEqual()}} method.
* Create additional methods as needed: {{isPhysicallyEqual()}}, {{isLogicallyEqual()}}, {{hasSameNames()}},
or whatever.

Not that this issue is classic: it seems that "equal" is well defined concept, but as the
Greek philosopher [Heraclitus|https://en.wikipedia.org/wiki/Heraclitus] suggested with his
famous quote that "you can never step into the same river twice", the concept of "sameness"
is fluid and ad-hoc. It is up to us to define the semantics for equality, and sometimes we
need more than one concept.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

Mime
View raw message