drill-dev mailing list archives

From Julien Le Dem <jul...@dremio.com>
Subject Re: select from table with options
Date Thu, 22 Oct 2015 00:21:16 GMT
TL;DR: I agree that there is some overlap, but I meant to say that "Select
with Option" is needed even if we have a .drill feature.

I'm assuming the .drill file has to be collocated with the data being
analyzed.
For example in HDFS, if I have /logs/foo/2015/10/21/my.log, I add a .drill
file at /logs/foo/2015/10/21/.drill, and Drill will look in the parent
directory for a .drill file (and possibly the parent's parent, recursively).
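A minimal sketch of that parent-directory lookup, assuming plain local-filesystem semantics (the function name is hypothetical; Drill's actual HDFS behavior may differ):

```python
from pathlib import Path
from typing import Optional

def find_drill_file(data_path: str) -> Optional[Path]:
    """Walk from the data file's directory up toward the root and
    return the first .drill file found, or None if there is none."""
    for directory in Path(data_path).resolve().parents:
        candidate = directory / ".drill"
        if candidate.is_file():
            return candidate
    return None
```

The nearest .drill file wins, which matches the idea of per-directory overrides of a central configuration.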

This is why I mentioned in my previous email that the analyst often has
read-only access (the data being produced by ETL or some other process).
Also the user (probably analyst) trying to query the data may not want to:
 - have to edit a config file
 - modify configuration that applies to everyone reading the data

So I was thinking of 2 use cases:
 - try to read the data without having to change config that applies to
every user (this thread)
 - set config that configures the system for everyone (.drill file)




On Wed, Oct 21, 2015 at 5:02 PM, Neeraja Rentachintala <
nrentachintala@maprtech.com> wrote:

> Can you elaborate on what you mean by .drill being a different use case?
>
> In my mind, .drill has 2 use cases - a way to specify hints to Drill on how
> to read certain datasets (and potentially optimize the queries on those
> datasets), and a way to save the definitions of objects created via Drill
> for reuse/access from BI tools. Both of these (i.e. existing or external
> tables vs. Drill-created or internal tables) are currently not
> differentiated in Drill, hence I believe they can use the same model in
> terms of metadata handling.
>
> I would be interested in knowing your thoughts.
> -Neeraja
>
> On Wed, Oct 21, 2015 at 4:55 PM, Julien Le Dem <julien@dremio.com> wrote:
>
> > I think of .drill files as a different use case but there is potentially
> > some overlap.
> > Some things to keep in mind:
> >  - The person analyzing the data has often read-only access.
> >  - having to write a config file on one end of the system and then query
> on
> > the other end is not analyst friendly
> >
> > We should definitely keep .drill in mind while designing this.
> > Although I'm thinking we should probably discuss .drill on a separate
> > thread.
> >
> >
> >
> > On Wed, Oct 21, 2015 at 3:55 PM, Neeraja Rentachintala <
> > nrentachintala@maprtech.com> wrote:
> >
> > > Another alternative is to specify a metadata file (a .drill file), an
> > > idea that came up in earlier discussions to solve similar use cases.
> > > Rather than centrally defining configurations in the storage plugin
> > > (which is what Drill does today), .drill files allow more granularity,
> > > potentially at the folder or individual-file level, overriding the
> > > central configuration.
> > >
> > > I think the benefit of the metadata file is that it can be used for
> > > other purposes (such as storing stats). Another benefit is that if you
> > > are using a BI/query tool to trigger Drill SQL queries, this will work
> > > seamlessly rather than having to rewrite the query with custom syntax.
> > >
> > > I would like to know what others think of this approach.
> > >
> > > -Neeraja
> > >
> > >
> > >
> > > On Wed, Oct 21, 2015 at 3:43 PM, Julien Le Dem <julien@dremio.com>
> > wrote:
> > >
> > > > I like the approach of using views to add missing metadata to an
> > existing
> > > > raw dataset. The raw datasets stay the same and the view becomes the
> > > > interface to the data. Nothing is mutated and we know how things are
> > > > derived from one another.
> > > >
> > > > TL;DR: I'm trying to summarize the options below and add a few
> > > > thoughts (please comment on whether you think the upside/downside
> > > > elements are valid). Let's refer to these options by the names in
> > > > *bold*.
> > > >
> > > > - *urls style params* (my initial strawman): *select * from
> > > > dfs.`default`.`/path/to/file/something.psv?type=text&delimiter=|`;*
> > > > Extensibility is ensured through the storage plugin's interpretation
> > > > of the path.
> > > > I agree with decomposing the format syntax from the data path/URL;
> > > > also, this encoding would conflict with real HTTP query parameters.
> > > > I will not pursue this one, but I think it was a great conversation
> > > > starter!
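For illustration, the query-string half of the strawman is trivial to split apart on the plugin side. A sketch using Python's standard library (the function name is hypothetical):

```python
from urllib.parse import urlsplit, parse_qs

def split_table_path(raw: str):
    """Separate the data path from the format options encoded after '?',
    e.g. '/path/to/file/something.psv?type=text&delimiter=|'."""
    parts = urlsplit(raw)
    # parse_qs returns a list per key; keep the first value for each option
    options = {k: v[0] for k, v in parse_qs(parts.query).items()}
    return parts.path, options
```

This also shows exactly why the syntax collides with HTTP sources: a genuine query string in an HTTP URL would be swallowed as format options.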
> > > >
> > > > - *table functions*: *select *
> > > > from delimitedFile(dfs.`default`.`/path/to/file/something.psv`, '|')*
> > > > It sounds like these would have to be defined in the corresponding
> > > > StoragePlugin, as they need intimate knowledge of the underlying
> > > > storage. Extensibility would come through the storage plugin defining
> > > > those? +1 on named parameters vs. just positional ones.
> > > > The possible downside is a syntax that could feel a little foreign
> > > > to the data analyst.
> > > >
> > > > - *fake-o parameters*: *select *
> > > > from dfs.`default`.`/path/to/file/something.psv` where
> > > > magicFieldDelimiter = '|';*
> > > > I would be inclined to avoid using filters for something that
> > > > changes what the data looks like; this could be unintuitive to users
> > > > (at least it feels that way to me).
> > > > Extensibility would come through the storage plugin defining
> > > > predicate push-down rules?
> > > >
> > > > - *WITH USING OPTIONS*:
> > > > In general it would feel more natural to me to put those options as
> > > > part of a select statement; the select statement can always be used
> > > > in a WITH clause.
> > > > Extensibility would come through the storage plugin receiving those
> > > > options? Since the WITH statement applies to a full select
> > > > statement, potentially with joins, how would we know where to apply
> > > > those options?
> > > > We have two sub-options:
> > > >   - *(type = 'text' AND lineDelimiter = '\n')*: this seems similar
> > > > to fake-o parameters (above); same comment as above. *AND* seems out
> > > > of place; for example, OR would be forbidden.
> > > >   - *{ type: "text", lineDelimiter: "\n" }*: the advantage of this
> > > > is that you can re-use the same syntax in the configuration file,
> > > > which is a plus for consistency. Users would figure out what works
> > > > and would just have to put it in the central configuration once done.
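One concrete upside of the JSON-style sub-option: once normalized to strict JSON (the proposal's unquoted keys would need quoting), the same options object a user tuned in a query could be validated with any JSON parser before being pasted into the central configuration. A sketch (the function name is hypothetical; the key names are just the ones from this thread):

```python
import json

def parse_table_options(literal: str) -> dict:
    """Parse a strict-JSON options literal, such as one a user would
    later paste into the storage plugin configuration."""
    options = json.loads(literal)
    if not isinstance(options, dict):
        raise ValueError("table options must be a JSON object")
    return options
```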
> > > >
> > > > - *EXTEND WITH OPTIONS*: *SELECT FROM emp EXTEND [optional columns
> > > > list] OPTIONS (type 'text', lineDelimiter);*
> > > > What would the column list be in that case? Would it be awkward to
> > > > use EXTEND without a column list?
> > > > Extensibility would come through the storage plugin receiving those
> > > > EXTEND options.
> > > > It sounds like they could simply be SELECT OPTIONS?
> > > >
> > > > - *Specific syntax*:
> > > > *select * FROM mydb.mytable*
> > > > *    TREAT AS TEXT*
> > > > *    USING*
> > > > *        LINE DELIMITER '\n'*
> > > > *        AND FIELD DELIMITER ','*
> > > > *        AND TREAT FIRST ROW AS HEADERS*
> > > > Extensibility would come through the storage plugin defining a
> > > > sub-grammar for those options.
> > > > Possibly this is harder to implement for the storage plugin
> > > > implementor.
> > > > The upside is that the user has a SQL-like syntax to specify this
> > > > (although I've never been fond of the parts where SQL tries to be
> > > > plain English).
> > > >
> > > >
> > > > *Appendix*: :P
> > > > For what it's worth, here is how Pig does it: *LOAD 'data' [USING
> > > > function] [AS schema];*
> > > > - Just *LOAD '/path/to/my/file'* will use the default Loader
> > > > (tab-separated values).
> > > > - Adding a custom Loader, *LOAD 'data' USING MyCustomLoader('param1',
> > > > 'param2');*, is how you implement custom formats or remote locations
> > > > (HBase, a Web service, ...).
> > > > So you can use the default loader with a different separator as a
> > > > parameter (*USING PigStorage('\t')*) or write your own.
> > > > - The AS clause lets you set the schema. You could have a csv file
> > > > without a header and define the names and types of each column that
> > > > way.
> > > > Doc: https://pig.apache.org/docs/r0.15.0/basic.html#load
> > > >
> > > >
> > > >
> > > > On Wed, Oct 21, 2015 at 8:57 AM, Julian Hyde <jhyde@apache.org>
> wrote:
> > > >
> > > > > Whatever API is used to scan files from SQL, there will need to be
> a
> > > > > corresponding way to accomplish the same thing in a user interface.
> > > > > Probably a form with various fields, some of them with drop-boxes
> > etc.
> > > > >
> > > > > And ideally a facility that samples a few hundred rows to deduce
> the
> > > > > probable field names and types and which fields are unique.
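The sampling facility described here could be quite simple at its core. A sketch of per-column type inference over a small sample, assuming a headered delimited file (the function name and int/float/str lattice are my own, not from the thread):

```python
import csv
import io

def infer_column_types(sample_text: str, delimiter: str = ","):
    """Read a small delimited sample and guess int/float/str per column,
    assuming the first row holds the headers."""
    rows = list(csv.reader(io.StringIO(sample_text), delimiter=delimiter))
    headers, data = rows[0], rows[1:]
    types = {}
    for i, name in enumerate(headers):
        values = [row[i] for row in data if i < len(row)]
        # try the narrowest type first; fall back to str
        for candidate in (int, float):
            try:
                for v in values:
                    candidate(v)
                types[name] = candidate.__name__
                break
            except ValueError:
                continue
        else:
            types[name] = "str"
    return types
```

A real implementation would also track null counts and uniqueness over the sampled rows, as suggested above.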
> > > > >
> > > > > I think that the UI is the true "user friendly" interface. A usage
> > > > > pattern might be for people to define a data source in the UI,
> > > > > save it as a view, then switch to the command line to write
> > > > > queries on that view.
> > > > >
> > > > > There are other use cases similar to reading files. For example,
> > > > > you might want to read data from an HTTP URL. You might want to
> > > > > specify similar parameters for formats, compression, and parsing,
> > > > > and parameters in the file URI that describe a family of
> > > > > partitioned files. A URL might allow push-down of filters,
> > > > > projects and sorts, but you would still want to specify formats,
> > > > > compression and parsing the same way as when reading files.
> > > > >
> > > > > To me, this argues for decomposing the file scan syntax into pieces
> > > > > that can be re-used if you get data from places other than files.
> > > > >
> > > > > Julian
> > > > >
> > > > >
> > > > > On Wed, Oct 21, 2015 at 6:15 AM, Jacques Nadeau <
> jacques@dremio.com>
> > > > > wrote:
> > > > > >> This fourth is also least extensible and thus most
> > > > > >> disenfranchising for those outside the inner group.
> > > > > >>
> > > > > >> Table functions (hopefully) would be something that others
> > > > > >> could implement.
> > > > > >
> > > > > > This is a brainstorm about user APIs...
> > > > > >
> > > > > > It isn't appropriate to shoot down ideas immediately in a
> > > > > > brainstorm. It has a chilling effect on other people presenting
> > > > > > new ideas.
> > > > > >
> > > > > > User APIs should be defined first with an end-user in mind.
> > > > > > Consistency in different contexts is also very important (note
> > > > > > my expansion of the discussion to ASCRIBE METADATA).
> > > > > >
> > > > > > Your statement about extensibility has no technical backing. If
> > > > > > you have a concern like this, ask the proposer whether they
> > > > > > think it could be done in an extensible way. In this case I see
> > > > > > a number of ways that this could be done. Note the mention of
> > > > > > the grammar switch in my initial proposal, or go review my
> > > > > > outstanding patch for integrating JSON literals. In many ways, I
> > > > > > think this approach could be considered the most extensible and
> > > > > > expressive for a Drill extension developer.
> > > > > >
> > > > > > I hope others will continue to brainstorm on this thread.
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Julien
> > > >
> > >
> >
> >
> >
> > --
> > Julien
> >
>



-- 
Julien
