drill-issues mailing list archives

From "Volodymyr Vysotskyi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-4264) Dots in identifier are not escaped correctly
Date Tue, 25 Jul 2017 11:52:00 GMT

    [ https://issues.apache.org/jira/browse/DRILL-4264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099918#comment-16099918 ]

Volodymyr Vysotskyi commented on DRILL-4264:

Thanks for such a detailed analysis.

I agree that deserializing {{ColumnTypeMetadata_v3.Key}} objects this way will cause
problems for fields whose names contain dots. To solve this, I propose changing the
structure of the {{ColumnTypeMetadata_v3.Key}} class: instead of an array holding the
components of the field name, it should use a {{SchemaPath}} and serialize it as the string
returned by {{SchemaPath.toExpr()}}. With this change we should also update the Parquet
metadata version.
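
To illustrate why a quoted string form is safer than name components joined with dots, here is a minimal Java sketch. It does not use Drill's classes: {{PathEncodingSketch}}, {{joinWithDots}}, {{quote}} and {{parseQuoted}} are hypothetical names, and the backtick quoting only approximates the kind of expression string {{SchemaPath.toExpr()}} is meant to produce. For simplicity it assumes that segments never contain backticks.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; these are NOT Drill's SchemaPath methods.
final class PathEncodingSketch {

  // Naive encoding: join segments with dots. Ambiguous when a segment contains a dot.
  static String joinWithDots(List<String> segments) {
    return String.join(".", segments);
  }

  // Quoted encoding: wrap every segment in backticks, then join with dots.
  // Assumption: segments contain no backticks (no escaping is implemented here).
  static String quote(List<String> segments) {
    List<String> quoted = new ArrayList<>();
    for (String segment : segments) {
      if (segment.indexOf('`') >= 0) {
        throw new IllegalArgumentException("backticks are not supported in this sketch");
      }
      quoted.add("`" + segment + "`");
    }
    return String.join(".", quoted);
  }

  // Inverse of quote(): read backtick-delimited segments separated by dots.
  static List<String> parseQuoted(String expr) {
    List<String> segments = new ArrayList<>();
    int i = 0;
    while (i < expr.length()) {
      if (expr.charAt(i) != '`') {
        throw new IllegalArgumentException("expected backtick at index " + i);
      }
      int close = expr.indexOf('`', i + 1);
      if (close < 0) {
        throw new IllegalArgumentException("unterminated segment at index " + i);
      }
      segments.add(expr.substring(i + 1, close));
      i = close + 1;
      if (i < expr.length()) {
        // A dot must separate segments and must be followed by another segment.
        if (expr.charAt(i) != '.' || i + 1 == expr.length()) {
          throw new IllegalArgumentException("malformed separator at index " + i);
        }
        i++;
      }
    }
    return segments;
  }

  public static void main(String[] args) {
    List<String> path = Arrays.asList("0.0.1", "version");
    // The dot-joined form cannot be split back into the original two segments.
    System.out.println(Arrays.asList(joinWithDots(path).split("\\.")));
    // The quoted form round-trips.
    System.out.println(parseQuoted(quote(path)));
  }
}
```

The dot-joined string splits into four parts instead of two, while the quoted string parses back into the original segments. That is the property a string-serialized key needs.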

A more complex problem concerns the {{MaterializedField}} class. {{SchemaPath}} was removed
from {{MaterializedField}} in [PR-373|https://github.com/apache/drill/pull/373]. One
of the reasons for that refactoring was the assumption that a {{MaterializedField}} should have
no knowledge of its parents. However, some code in Drill assumes that {{MaterializedField.getPath()}}
returns the field path including its parents.
For example, in [this line|https://github.com/apache/drill/blob/3e8b01d5b0d3013e3811913f0fd6028b22c1ac3f/exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet2/DrillParquetReader.java#L225]
a {{MaterializedField}} instance is created with the name {{col.getAsUnescapedPath()}}, and
in [this line|https://github.com/apache/drill/blob/874bf6296dcd1a42c7cf7f097c1a6b5458010cbb/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java#L362]
the name including the parent field names is used. Using only the field name in {{MaterializedField}}
would cause problems, since a field at the root level may have the same name as a field
nested in a map.
So the full field path should be used in {{MaterializedField}} in this case.

The {{SchemaPath.getSimplePath(field.getPath())}} code is used in many places, but it does
not return the same {{SchemaPath}} that was used to create the {{MaterializedField}} instance.

We should change the implementation of {{MaterializedField}} so that this code
returns the same {{SchemaPath}} that was used to create the {{MaterializedField}} instance.

I think we should store a separate {{String path}} field in the {{MaterializedField}} class with
the value of {{SchemaPath.toExpr()}} and replace all {{SchemaPath.getAsUnescapedPath()}} calls with
{{SchemaPath.toExpr()}}.
* When a {{MaterializedField}} instance is created from the path {{SchemaPath.toExpr()}},
its name is set to the last name segment of the {{SchemaPath}}.
* When a {{MaterializedField}} instance is created from a name, its path is that
name wrapped in backticks.
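
The two creation rules above could look roughly like the following Java sketch. It is only an illustration under assumptions: {{FieldSketch}}, {{fromPath}}, {{fromName}} and {{quote}} are hypothetical names, not the real {{MaterializedField}} API, and the backtick quoting stands in for {{SchemaPath.toExpr()}} without handling backticks inside names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for MaterializedField with a separate path; not Drill code.
final class FieldSketch {
  final String name; // last segment only
  final String path; // full path in backtick-quoted form

  private FieldSketch(String name, String path) {
    this.name = name;
    this.path = path;
  }

  // Assumed approximation of an expression-style path: `a`.`b`
  static String quote(List<String> segments) {
    return segments.stream()
        .map(segment -> "`" + segment + "`")
        .collect(Collectors.joining("."));
  }

  // Rule 1: created from a path, the name is the last segment of that path.
  static FieldSketch fromPath(List<String> segments) {
    if (segments.isEmpty()) {
      throw new IllegalArgumentException("path must have at least one segment");
    }
    return new FieldSketch(segments.get(segments.size() - 1), quote(segments));
  }

  // Rule 2: created from a name, the path is that name wrapped in backticks.
  static FieldSketch fromName(String name) {
    return new FieldSketch(name, "`" + name + "`");
  }

  public static void main(String[] args) {
    FieldSketch nested = FieldSketch.fromPath(Arrays.asList("a", "b"));
    FieldSketch root = FieldSketch.fromName("b");
    // Same name, different paths: the path tells the two fields apart.
    System.out.println(nested.name + " " + nested.path);
    System.out.println(root.name + " " + root.path);
  }
}
```

With a separate path, a root-level field {{b}} and a field {{b}} nested in map {{a}} share a name but stay distinguishable, and a name such as {{0.0.1}} keeps its dots inside a single quoted segment.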

A less preferred solution is to revert [PR-373|https://github.com/apache/drill/pull/373].
In that case dots in field names would be handled correctly, but it would make
the transition to Apache Arrow more complex (although {{MaterializedField}} was replaced
by {{Flatbuffer Field}}, so the transition is already quite complex).

> Dots in identifier are not escaped correctly
> --------------------------------------------
>                 Key: DRILL-4264
>                 URL: https://issues.apache.org/jira/browse/DRILL-4264
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Codegen
>            Reporter: Alex
>            Assignee: Volodymyr Vysotskyi
> If you have some json data like this...
> {code:javascript}
>     {
>       "0.0.1":{
>         "version":"0.0.1",
>         "date_created":"2014-03-15"
>       },
>       "0.1.2":{
>         "version":"0.1.2",
>         "date_created":"2014-05-21"
>       }
>     }
> {code}
> ... there is no way to select any of the rows since their identifiers contain dots and
> when trying to select them, Drill throws the following error:
> Error: SYSTEM ERROR: UnsupportedOperationException: Unhandled field reference "0.0.1";
> a field reference identifier must not have the form of a qualified name
> This must be fixed since there are many json data files containing dots in some of the
> keys (e.g. when specifying version numbers etc)

This message was sent by Atlassian JIRA
