drill-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Neeraja Rentachintala <nrentachint...@maprtech.com>
Subject Re: select from table with options
Date Wed, 21 Oct 2015 22:55:58 GMT
Another alternative to do this to specify a metadata file (.drill files)
that came up in some of the earlier discussions to solve similar use cases.
Rather than centrally defining configurations in storage plugin (which is
what Drill does today), .drill files allow more granularity , potentially
at folder or individual file which will override the central configuration.

I think the benefit of the metadata file is it can be used for other
purposes (such as to stats etc). Another benefit is that if you are using a
BI/query tool to trigger Drill SQL queries, this will work seamlessly
rather than having to rewrite the query for custom syntax.

I would like to know what others think of this approach.

-Neeraja



On Wed, Oct 21, 2015 at 3:43 PM, Julien Le Dem <julien@dremio.com> wrote:

> I like the approach of using views to add missing metadata to an existing
> raw dataset. The raw datasets stay the same and the view becomes the
> interface to the data. Nothing is mutated and we know how things are
> derived from one another.
>
> TLDR: I'm trying to summarize the options bellow and add a few thoughts:
> (please comment whether you think upside/downside elements are valid)
> Let's refer to those options with the name in *bold*
>
> - *urls style params* (my initial strawman): *select * from
> dfs.`default`.`/path/to/file/something.psv?type=text&delimiter=|`;*
> Extensibility is ensured through the storage plugin interpretation of the
> path.
> I agree with decomposing the syntax of format vs the data path/url. so this
> would conflict with HTTP query parameters.
> I will not pursue this one but I think it was a great conversation starter!
>
> - *table functions*: *select *
> from delimitedFile(dfs.`default`.`/path/to/file/something.psv`, '|')*
> It sounds like these would have to be defined in the corresponding
> StoragePlugin as they need intimate knowledge of the underlying storage.
> Extensibility would come through the storage plugin defining those?
> +1 on named parameter vs just positional.
> The possible downside is a syntax that could be a little foreign to the
> data analyst.
>
> - *fake-o parameters*. *select *
> from dfs.`default`.`/path/to/file/something.psv` where magicFieldDelimiter
> = '|';*
> I would be inclined to avoid using filters for something that changes what
> the data looks like.
> This could be unintuitive to users. (at least it feels that way to me)
> Extensibility would come through the storage plugin defining predicate push
> down rules?
>
> - *WITH USING OPTIONS*:
> In general I would feel more natural to me to put those options as part of
> a select statement. the select statement can always be used in a WITH
> clause.
> Extensibility would come through the storage plugin receiving those
> options? As the with statement applies to a full select statement with
> potentially joins, how would we know where to aply those options?
> We have two sub-options:
>   - *(type = 'text' AND lineDelimiter = '\n')*: this seems similar to
> fake-o
> parameters (above), same comment than above. *AND* seems out of place, for
> example OR would be forbidden.
>   - *{ type: "text", linedDelimiter = "\n"}*: The advantage of this is that
> you can re-use the same syntax in the configuration file. This is a plus
> for consistency. Users would figure out what works and would just have to
> put it in the central configuration once done.
>
> - *EXTEND WITH OPTIONS*: *SELECT FROM emp EXTEND [optional columns list]
> OPTIONS (type 'text', lineDelimiter);*
> What would be the column list in that case? Would it be awkward to use
> EXTEND without a column list?
> Extensibility would come through the storage plugin receiving those extend
> options.
> It sounds like they could be simply SELECT OPTIONS?
>
> - *Specific syntax*: *select * FROM mydb.mytable*
>
>
>
>
> *    TREAT AS TEXT      USING        LINE DELIMITER '\n'        AND FIELD
> DELIMITER ','        AND TREAT FIRST ROW AS HEADERS*
> Extensibility would come through the storage plugin defining a sub grammar
> for those options.
> Possibly this is harder to implement for the storage plugin implementor.
> Upside is the user has a SQL like syntax to specify this (although, I've
> never been fond of parts where SQL is trying to be plain English)
>
>
> *Appendix*: :P
> For what it's worth, here is how Pig does it: *LOAD 'data' [USING function]
> [AS schema];*
> - Just *LOAD '/path/to/my/file'* will use the default Loader (tab separated
> values)
> - adding a custom Loader* LOAD 'data' USING MyCustomLoader('param1',
> 'param2'); *is how you implement custom formats or remote locations (HBase,
> Web service, ...)
> So you can use the default loader with a different separator (*USING
> PigStorage('\t')*) in parameter or write your own.
> - The AS clause lets you set the schema. You could have a csv file without
> header and define the names and types of each column that way.
> Doc: https://pig.apache.org/docs/r0.15.0/basic.html#load
>
>
>
> On Wed, Oct 21, 2015 at 8:57 AM, Julian Hyde <jhyde@apache.org> wrote:
>
> > Whatever API is used to scan files from SQL, there will need to be a
> > corresponding way to accomplish the same thing in a user interface.
> > Probably a form with various fields, some of them with drop-boxes etc.
> >
> > And ideally a facility that samples a few hundred rows to deduce the
> > probable field names and types and which fields are unique.
> >
> > I think that the UI is the true "user friendly" interface. A usage
> > pattern might be for people to define a data source using in the UI,
> > save it as a view, then switch to the command line to write queries on
> > that view.
> >
> > There are other use cases similar to reading flies. For example you
> > would like to read data from an HTTP URL. You might want to specify
> > similar parameters for formats, compression, parsing, and parameters
> > in the file URI that describe a family of partitioned files. A URL
> > might allow push-down of filters, projects and sorts. But still you
> > would want to specify formats, compression and parsing the same way as
> > reading files.
> >
> > To me, this argues for decomposing the file scan syntax into pieces
> > that can be re-used if you get data from places other than files.
> >
> > Julian
> >
> >
> > On Wed, Oct 21, 2015 at 6:15 AM, Jacques Nadeau <jacques@dremio.com>
> > wrote:
> > >> This fourth is also least extensible and thus most disenfranchising
> for
> > >> those outside the inner group.
> > >>
> > >> Table functions (hopefully) would be something that others could
> > > implement.
> > >
> > > This is a brainstorm about a user apis...
> > >
> > > It isn't appropriate to shoot down ideas immediately in a brainstorm.
> It
> > > has a chilling effect on other people presenting new ideas.
> > >
> > > User apis should be defined first with an end-user in mind. Consistency
> > in
> > > different contexts is also very important (note my expansion of the
> > > discussion to ASCRIBE METADATA.)
> > >
> > > Your statement about extensibility has no technical backing. If you
> have
> > a
> > > concern like this, ask the proposer if they think that this could be
> done
> > > in an extensible way. In this case I see a number of ways that this
> could
> > > be done. Note the mention of the grammar switch in my initial proposal
> or
> > > go review my outstanding patch for integrating JSON literals. In many
> > ways,
> > > I think this approach could be considered the most extensible and
> > > expressive for a Drill extension developer.
> > >
> > > I hope others will continue to brainstorm on this thread.
> >
>
>
>
> --
> Julien
>

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message