drill-issues mailing list archives

From "Jacques Nadeau (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (DRILL-3524) Drill proper DESCRIBE support for MongoDB
Date Mon, 09 Nov 2015 16:07:11 GMT

    [ https://issues.apache.org/jira/browse/DRILL-3524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14996777#comment-14996777 ]

Jacques Nadeau commented on DRILL-3524:
---------------------------------------

To support this we need a new concept: a partially known schema. In that model we know that
some columns exist, but we still allow a dynamic schema for the other columns. This hasn't
been done before in Drill, but it should be done for exactly the scenarios you are
describing. Basically: sample with validation + dynamic schema.
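
To make "sample with validation + dynamic schema" concrete, here is a minimal sketch in
Python. It is not Drill code: the function names, the use of plain dicts as documents, and
the choice to raise on a type mismatch are all assumptions made for illustration.

```python
# Hypothetical sketch of a partially known schema; none of these names exist in Drill.

def sample_schema(docs, sample_size):
    """Infer the known columns from the first `sample_size` documents."""
    known = {}
    for doc in docs[:sample_size]:
        for name, value in doc.items():
            if value is not None:
                # First non-null type seen in the sample wins (an assumption).
                known.setdefault(name, type(value).__name__)
    return known


def validate(doc, known):
    """Split a document into known columns and dynamic columns.

    Known columns are checked against the sampled type; any column the
    sample never saw is passed through as part of the dynamic schema.
    """
    fixed, dynamic = {}, {}
    for name, value in doc.items():
        if name in known:
            actual = type(value).__name__
            if value is not None and actual != known[name]:
                raise TypeError(
                    f"column {name!r}: expected {known[name]}, got {actual}"
                )
            fixed[name] = value
        else:
            dynamic[name] = value
    return fixed, dynamic


docs = [
    {"_id": 1, "name": "a"},
    {"_id": 2, "name": "b", "age": 3},
    {"_id": 3, "name": "c", "tags": ["x"]},
]
known = sample_schema(docs, sample_size=2)
fixed, dynamic = validate(docs[2], known)
```

In this sketch DESCRIBE would report the columns in `known`, while a column such as `tags`,
which the sample never saw, still flows through as dynamic data.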

There is at least one other problem to consider: since Mongo supports heterogeneous types,
we need to think about what type we expose for validation. Our sample may find one
particular type for a field, but documents later in the collection may hold a different
type for that same field.
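
One possible way to handle this, sketched in Python below, is to merge the types seen for
each field and fall back to ANY (the type DESCRIBE already reports today) when they
conflict. The widening rule for int and float and the function names are assumptions for
illustration, not a description of how Drill resolves types.

```python
# Hypothetical sketch of merging heterogeneous field types; not Drill's behaviour.

NUMERIC = ("int", "float")


def merge_types(a, b):
    """Return a single type name that can describe values of both types."""
    if a == b:
        return a
    if a in NUMERIC and b in NUMERIC:
        return "float"  # assumed widening rule: int + float -> float
    return "ANY"        # conflicting types: expose the catch-all type


def column_types(docs):
    """Merge the observed type of every field across all given documents."""
    types = {}
    for doc in docs:
        for name, value in doc.items():
            seen = type(value).__name__
            types[name] = merge_types(types[name], seen) if name in types else seen
    return types


merged = column_types([{"x": 1}, {"x": 2.5}, {"y": "a"}, {"y": 3}])
```

Here `x` ends up as `float` and `y` as `ANY`. The catch is the one described above: the
merged result only reflects the documents that were actually scanned.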

> Drill proper DESCRIBE support for MongoDB
> -----------------------------------------
>
>                 Key: DRILL-3524
>                 URL: https://issues.apache.org/jira/browse/DRILL-3524
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata, Storage - MongoDB
>    Affects Versions: 1.1.0
>            Reporter: Hari Sekhon
>             Fix For: Future
>
>
> Request to add full DESCRIBE support for MongoDB collections.
> I understand this may be difficult / sub-optimal due to the flexible schema nature of
> Mongo docs, but if you can tabulate results when reading directly from MongoDB, for which
> you must have read the field names, then it's also possible to extract all field names to
> present for the describe command, albeit with an inefficient scan to do so.
> Currently describe returns pseudo / inaccurate / unhelpful metadata:
> {code}+--------------+------------+--------------+
> | COLUMN_NAME  | DATA_TYPE  | IS_NULLABLE  |
> +--------------+------------+--------------+
> | *            | ANY        | YES          |
> +--------------+------------+--------------+{code}
> Perhaps you could extend DESCRIBE to scan the first few dozen docs by default to create
> a merged schema, and also add an optional argument to the describe command to scan a
> user-specified number of docs from which to describe the schema, or an ALL keyword to
> scan all docs in a collection and get the complete global schema for the collection?
> In case of schema evolution it might be an interesting option to additionally read the
> newest and oldest records, for example the first and last records by ID.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
