drill-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Paul Rogers (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly
Date Thu, 20 Jul 2017 06:33:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16094260#comment-16094260
] 

Paul Rogers commented on DRILL-4264:
------------------------------------

On the planner side, we represent field references with the {{FieldReference}} class you mentioned.
{{FieldReference}} extends {{SchemaPath}}. These classes break names down into one object
per name part.

Assume we have {{SELECT a.b.c, a."d.e" ...}}

Within the {{FieldReference}} itself, we hold onto the name using a {{PathSegment}} which
has two subclasses: {{ArraySegment}} and {{NameSegment}}. So, as you noted, in the planner,
we can tell the difference between the two cases (using functional notation):

{code}
a.b.c: FieldReference(NameSegment("a", NameSegment("b", NameSegment("c"))))
a."d.e": FieldReference(NameSegment("a", NameSegment("d.e")))
{code}

So far so good. Bug, {{SchemaPath}} provides the {{getAsUnescapedPath()}} method which concatenates
the parts of the name using dots. We end up with two {{FieldReference}} instances. Calling
{{getAsUnescapedPath()}} on each produces {{a.b.c}} and {{a.d.e}}. So, if anything actually
uses this unescaped path, we end up with an ambiguity: does "a.d.e" represent one field, two
fields or three fields? We cannot tell.

Now, if this method was only used for debugging (line {{toString()}}), it would be fine. But,
in fact, many operators refer to this method, especially when creating the run-time representation
of a field schema: {{MaterializedField}}:

>From {{StreamingAggBatch}}:

{code}
      final MaterializedField outputField = MaterializedField.create(
                    ne.getRef().getAsUnescapedPath(), expr.getMajorType());
{code}

In our examples, we end up with two materialized fields: one called "a.b.c", the other "a.d.e",
so the ambiguity persists.

As it turns out, each {{MaterializedField}} represents one field or structured column. So,
our map "a" is represented by a {{MaterializedField}}, "b" by another, "c" by yet another
and "d.e" by another. So, each should correspond to a single name part.

But, the code doesn't work that way, it actually builds up the full unescaped name.

Now, I suspect that the code here is old and inconsistent. It should be that creating a materialized
field pulls out only one name part. But, the code actually concatenates. My suspicion increases
when I see methods like these in {{MaterializedField}}:

{code}
  public String getPath() { return getName(); }
  public String getLastName() { return getName(); }
  public String getName() { return name; }
{code}

That first one really worries me: it is asking for the "path", which means dotted name. There
are many references to this name. Does this mean the code expects to get a string (not a {{NameSegment}})
that holds the composite name. If so, we are in trouble.

Now, as it turns out, it seems that the "modern" form of {{MaterializedSchema}} is that each
hold just one name part. So:

{code}
MaterializedField(name="a", children = (
  MaterializedField(name="b", children = (
    MaterializedField(name = c))),
  MaterializedField(name="d.e")))
{code}

I wonder, because the code appears to be written assuming that a {{MaterializedField}} had
a path name, does any code still rely on this fact, then split the name at dots to get fields?

If not, can we remove the {{getPath()}}, and {{getLastPath()}} methods to make it clear that
each {{MaterializedField}} corresponds to a single {{NameSegment}}?

And, if we do that, should we remove all calls to {{NameSegment.getAsUnescapedPath()}} to
make clear that we never (except for display) want dotted, combined path name?

By carefully looking at the above issues, we can be sure that no old code in Drill tries to
concatenate "a" and "d.e" to get the name "a.d.e" which it then splits into "a", "d" and "e".

A quick search for ".split(" found a number of places where we split names on a dot, including
in the Parquet Metadata file:

{code}
        public Object deserializeKey(String key, com.fasterxml.jackson.databind.DeserializationContext
ctxt)
            throws IOException, com.fasterxml.jackson.core.JsonProcessingException {
          return new Key(key.split("\\."));
        }
{code}

Are there others? Do these need to be fixed?

> Dots in identifier are not escaped correctly
> --------------------------------------------
>
>                 Key: DRILL-4264
>                 URL: https://issues.apache.org/jira/browse/DRILL-4264
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>            Reporter: Alex
>            Assignee: Volodymyr Vysotskyi
>
> If you have some json data like this...
> {code:javascript}
>     {
>       "0.0.1":{
>         "version":"0.0.1",
>         "date_created":"2014-03-15"
>       },
>       "0.1.2":{
>         "version":"0.1.2",
>         "date_created":"2014-05-21"
>       }
>     }
> {code}
> ... there is no way to select any of the rows since their identifiers contain dots and
when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference "0.0.1";
a field reference identifier must not have the form of a qualified name
> This must be fixed since there are many json data files containing dots in some of the
keys (e.g. when specifying version numbers etc)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message