spark-reviews mailing list archives

From rdblue <...@git.apache.org>
Subject [GitHub] spark issue #22573: [SPARK-25558][SQL] Pushdown predicates for nested fields...
Date Mon, 01 Oct 2018 17:31:51 GMT
Github user rdblue commented on the issue:

    https://github.com/apache/spark/pull/22573
  
    The approach we've taken in Iceberg is to allow `.` in names by building an index over the
top-level schema. The full dotted path of every field in the schema is produced and added to a
map from the full field name to the field's ID.
    
    The reason we do this is to avoid two problem areas:
    
    * Parsing the name using `.` as a delimiter
    * Traversing the schema structure
    
    For example, the schema `0: a struct<2: x int, 3: y int>, 1: a.z int` produces this
index: `Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1)`.
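
    Roughly, building that index is a single walk of the schema that records the full dotted
path of each field. Here is a minimal sketch in Scala, using a made-up `Field`/`StructType`
model rather than Iceberg's actual schema classes:

    ```scala
    // Hypothetical schema model for illustration (not Iceberg's real classes).
    sealed trait FieldType
    case object IntType extends FieldType
    case class StructType(fields: Seq[Field]) extends FieldType
    case class Field(id: Int, name: String, fieldType: FieldType)

    // Walk the schema once, mapping each full dotted name to its field ID.
    def buildIndex(fields: Seq[Field], prefix: String = ""): Map[String, Int] =
      fields.flatMap { f =>
        val fullName = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
        val nested = f.fieldType match {
          case StructType(children) => buildIndex(children, fullName)
          case _ => Map.empty[String, Int]
        }
        nested + (fullName -> f.id)
      }.toMap

    // The schema from the example: 0: a struct<2: x int, 3: y int>, 1: a.z int
    val schema = Seq(
      Field(0, "a", StructType(Seq(Field(2, "x", IntType), Field(3, "y", IntType)))),
      Field(1, "a.z", IntType))

    buildIndex(schema)
    // Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1)
    ```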
    
    Binding filters like `a.x > 3` or `a.z < 5` is done with a lookup in this index instead of
parsing the field name and traversing the schema, so you get the right result without needing
to decide whether `a.x` refers to a nested field or is the literal name of a top-level field.
The lookup is quick and correctly produces `id(2) > 3` and `id(1) < 5`. The same index is also
used for projection, because users want to be able to select nested columns by name using
dotted field names.
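
    With that index in place, binding is just a map lookup. A rough sketch (the
`BoundPredicate` shape and the `id(...)` rendering are illustrative, not Iceberg's actual
expression API):

    ```scala
    // Illustrative bound form: id(fieldId) <op> literal.
    case class BoundPredicate(fieldId: Int, op: String, literal: Any) {
      override def toString: String = s"id($fieldId) $op $literal"
    }

    // Look the full name up in the index; no splitting on '.', no schema traversal.
    def bind(index: Map[String, Int], name: String, op: String, literal: Any): BoundPredicate =
      index.get(name) match {
        case Some(fieldId) => BoundPredicate(fieldId, op, literal)
        case None => throw new IllegalArgumentException(s"Cannot find field: $name")
      }

    val index = Map("a" -> 0, "a.x" -> 2, "a.y" -> 3, "a.z" -> 1)
    bind(index, "a.x", ">", 3)  // id(2) > 3
    bind(index, "a.z", "<", 5)  // id(1) < 5
    ```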
    
    The only drawback to this approach is that you can't have duplicates in the index: each
full field name must be unique. In the example above, the top-level `a.z` field could not
be named `a.x` or else it would collide with `x` nested in `a`.
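
    One way to enforce that uniqueness is to collect the full names before collapsing them
into a map and fail on collisions. Again just a sketch of the idea, reusing the hypothetical
`Field`/`StructType` model above, not necessarily how Iceberg implements the check:

    ```scala
    // Collect (fullName -> id) pairs without collapsing, then reject duplicates.
    def entries(fields: Seq[Field], prefix: String = ""): Seq[(String, Int)] =
      fields.flatMap { f =>
        val fullName = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
        val nested = f.fieldType match {
          case StructType(children) => entries(children, fullName)
          case _ => Seq.empty[(String, Int)]
        }
        (fullName -> f.id) +: nested
      }

    def buildIndexStrict(fields: Seq[Field]): Map[String, Int] = {
      val all = entries(fields)
      val dups = all.groupBy(_._1).collect { case (name, vs) if vs.size > 1 => name }
      require(dups.isEmpty, s"Duplicate full field names: ${dups.mkString(", ")}")
      all.toMap
    }

    // Renaming the top-level `a.z` field to `a.x` would now fail instead of
    // silently shadowing the `x` nested in `a`.
    ```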


---
