drill-issues mailing list archives

From "Parth Chandra (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (DRILL-2265) Drill data exploration function for complex data types
Date Wed, 04 Mar 2015 23:58:39 GMT

     [ https://issues.apache.org/jira/browse/DRILL-2265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Parth Chandra updated DRILL-2265:
    Fix Version/s:     (was: 0.9.0)

> Drill data exploration function for complex data types
> ------------------------------------------------------
>                 Key: DRILL-2265
>                 URL: https://issues.apache.org/jira/browse/DRILL-2265
>             Project: Apache Drill
>          Issue Type: Improvement
>          Components: Functions - Drill
>            Reporter: Andries Engelbrecht
>            Assignee: Daniel Barclay (Drill)
>             Fix For: Future
> Drill data exploration function for complex data types
> When dealing with complex data in large volumes, it would be extremely useful to have a
function that collects metadata to provide a better view of the total data set.
> Taking JSON as an example, a data set can contain an extremely large number of JSON objects.
Each object can have multiple schemas and subschemas, with further nested subschemas as well
as arrays. Not all objects will have all of the schemas or subschemas. When exploring this
data in Drill, SQL dot notation is used to navigate the complex subschema structure, and
it can become very cumbersome to understand the total picture of all the data.
> The proposal is a function that can explore the JSON objects in a data set (whether a single file with
multiple objects or a single-level or multilevel directory structure) and report the total structure
of all the JSON objects, showing every schema, subschema and array that is present across
them. This way a data analyst will be able to see all the schema data that is available within
the data set. Additionally, if the function can provide statistics
showing how many of the objects actually contain each of the schemas, subschemas and arrays
(and data in each), this may indicate to an analyst how valuable or important it may be to
explore any given subschema or array.
> To speed up the collection of this data, the function may offer an option to set a
sample size, so that only a portion of the total volume is sampled and the results are projected
onto the total data set. This kind of sampling is common in prominent RDBMSs today. Additionally, for
data that changes or grows, the metadata collection function will need to be run periodically
to update the statistics.
> To make the metadata more useful, consideration should be given to placing the results in a Drill
metadata structure, similar to INFORMATION_SCHEMA, but specifically for statistics metadata
used only by analysts for data exploration. Security should also
be designed so that this metadata is accessible only to users with access to the base data.
> In addition to their use by data analysts for data exploration, the metadata and statistics
could also be used for Drill internal functions in the future, such as query optimization and
creation of views.
> This example specifically focuses on JSON data, but the same approach can be applied to other
complex data types that may require a very detailed understanding of the complex data set.
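
To make the request above concrete, here is a minimal Python sketch of the kind of profiling it describes: walk each JSON object, record every dotted path (schema, subschema, or array) it contains, count how many objects contain each path, and optionally sample only a portion of the objects. This is not a Drill function or API. The names iter_paths, profile, and sample_rate, and the "[]" marker for arrays, are hypothetical choices made for illustration only.

```python
import random
from collections import Counter


def iter_paths(value, prefix=""):
    """Yield every dotted path found in one decoded JSON value.

    Arrays are reported as "<path>[]" so they appear as their own entry.
    (The "[]" marker is an illustrative convention, not Drill syntax.)
    """
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            yield path
            yield from iter_paths(child, path)
    elif isinstance(value, list):
        array_path = f"{prefix}[]"
        yield array_path
        for child in value:
            yield from iter_paths(child, array_path)


def profile(objects, sample_rate=1.0, seed=0):
    """Count how many sampled objects contain each path.

    sample_rate is the probability of inspecting any given object
    (hypothetical parameter). Returns a dict mapping each path to the
    number of sampled objects containing it, the fraction of sampled
    objects containing it, and a naive projection to the full data set.
    """
    if not 0.0 < sample_rate <= 1.0:
        raise ValueError("sample_rate must be in (0, 1]")
    rng = random.Random(seed)
    counts = Counter()
    sampled = 0
    for obj in objects:
        if sample_rate < 1.0 and rng.random() >= sample_rate:
            continue  # object not selected for the sample
        sampled += 1
        # Use a set so a path is counted once per object, even if it
        # occurs in several elements of an array.
        counts.update(set(iter_paths(obj)))
    if sampled == 0:
        return {}
    return {
        path: {
            "sampled_objects": n,
            "fraction": n / sampled,
            "projected_objects": n / sample_rate,
        }
        for path, n in sorted(counts.items())
    }
```

Calling profile on a list of decoded JSON objects (for example, the result of json.loads on each line of a file) gives one entry per path. A path with a low "fraction" value is present in few objects, which is the signal the description suggests an analyst could use to decide whether a subschema is worth exploring. The projection is a simple scale-up by 1 / sample_rate and carries no error estimate.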

This message was sent by Atlassian JIRA
