drill-dev mailing list archives

From Jacques Nadeau <jacq...@dremio.com>
Subject Re: Request for more feedback on "Support the Ability to Identify And Skip Records" design
Date Tue, 27 Oct 2015 15:52:21 GMT
There seem to be multiple user requirements that are being considered in
Hsuan's and Julien's proposals:

1. Drill doesn't have enough information to parse my data, I want to give
Drill help. (Examples might be: the field delimiter is "|", the proto IDL
encoding for a protobuf file is "...", provide an external Avro schema.)
2. While Drill can parse my data, the structure output is incomplete. It
may be missing field types and/or field names. I want to tell Drill how to
interpret that data since the format itself doesn't provide an adequate way
to express this (typically text files, as opposed to JSON or Parquet).
3. I've defined an expected structure to my data files. If some records
don't match that, I want to have special handling to manage those records
(e.g. drop them, warn with the number of drops, or create a separate file
with the provenance of each failing record).
4. I have an arbitrary query and I want any data-specific execution
failures to be squelched to allow the query to complete with whatever data
remains.
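
To make requirement 3 concrete, here is a minimal Python sketch of scan-level
handling of records that fail an expected structure. It is not Drill code: the
function scan_with_policy, the RejectLog class, the policy names ("drop",
"warn", "record") and the sample file name are all hypothetical, chosen only
to mirror the three behaviors listed in requirement 3.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical policy names; they are not Drill option values.
POLICIES = ("drop", "warn", "record")


@dataclass
class RejectLog:
    """Collects what the scan rejected, for reporting after the query."""
    dropped: int = 0
    # (file path, 1-based line number, error message) per failing record
    provenance: List[Tuple[str, int, str]] = field(default_factory=list)


def scan_with_policy(
    path: str,
    lines: Iterable[str],
    parse: Callable[[str], Dict],
    policy: str,
    log: RejectLog,
) -> Iterator[Dict]:
    """Yield parsed records; handle failures at the scan according to policy."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    for line_no, line in enumerate(lines, start=1):
        try:
            yield parse(line)
        except ValueError as exc:
            if policy == "drop":
                continue  # silently skip the record
            log.dropped += 1  # "warn" and "record" both count drops
            if policy == "record":
                log.provenance.append((path, line_no, str(exc)))


def parse_row(line: str) -> Dict:
    """Expected structure: name|age, with age an integer."""
    name, age = line.split("|")
    return {"name": name, "age": int(age)}
```

Because the handling lives in the scan, the same records are rejected no
matter how the rest of the query is planned.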

My recommendation is that we have four new features:

A. table with options (what Julien is working on)
B. .drill files (https://issues.apache.org/jira/browse/DRILL-3572)
C. alter table ascribe metadata (to create a .drill file through sql)
D. Support using table with options (A) to override settings in .drill (B)

I believe that A & B (and C since it is simply a derivative of B) should
provide the capability to achieve requirements 1-3 above.

When Neeraja talks of the exploration use case, feature A is probably the
most common way that people will do this. In the case of use case 3 above,
if someone wants to use a "recordPositionAndError" behavior (see
DRILL-3572), they will most likely want to do that in the context of a
query (as opposed to a view or .drill). As such, you would probably create
a .drill file that specifies warn or ignore, then layer a
recordPositionAndError over the top (via feature D) when you want that for
a particular query.
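
The layering described here can be sketched as a simple override: defaults
come from the .drill file, and per-query table options win on conflict. This
is a Python illustration under assumptions, not Drill's implementation. The
option keys (fieldDelimiter, onError) and the function name are hypothetical;
only the values warn and recordPositionAndError come from this thread and
DRILL-3572.

```python
from typing import Any, Dict, Mapping, Optional


def effective_options(
    dot_drill: Mapping[str, Any],
    query_options: Optional[Mapping[str, Any]] = None,
) -> Dict[str, Any]:
    """Merge .drill defaults (feature B) with per-query overrides (feature A)."""
    merged = dict(dot_drill)            # start from the persisted defaults
    merged.update(query_options or {})  # query-time options take precedence
    return merged


# Hypothetical contents of a .drill file for a pipe-delimited table.
dot_drill = {"fieldDelimiter": "|", "onError": "warn"}

# One query asks for provenance; other queries keep the "warn" default.
for_this_query = effective_options(
    dot_drill, {"onError": "recordPositionAndError"}
)
```

The merge returns a new dictionary, so the persisted defaults are left
unchanged for every other query.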

My main thought on Hsuan's initial proposal is that it seems to provide
only a partial resolution of #4 above. It isn't clear to me that use case #4
is a critical use case for most users. If it is, can we get some concrete
examples of it as opposed to use cases 1-3? If it is a critical use case, I
think we should solve it in a more general way (for example, I don't think
we should try to maintain file-based record provenance in that context).
Among other things, the current proposal has the odd problem that the
behavior the user experiences is not consistent: it depends on which plan
Drill decides to execute.

Note, there were some questions about how 1-3 could be solved using B so
I've provided an example in the Jira:
https://issues.apache.org/jira/browse/DRILL-3572



--
Jacques Nadeau
CTO and Co-Founder, Dremio

On Mon, Oct 26, 2015 at 4:09 PM, Zelaine Fong <zfong@maprtech.com> wrote:

> My understanding of Jacques' proposal is that he suggests we use .drill
> instead of requiring the user to do an explicit cast in their select
> query.  That way, the changes for enhancement would be restricted to the
> scanner.
>
> Did I interpret the alternative approach correctly?
>
> -- Zelaine
>
> On Mon, Oct 26, 2015 at 4:05 PM, Hsuan Yi Chu <hyichu@maprtech.com> wrote:
>
> > Hi,
> >
> > Luckily, we will have a hangout tomorrow.
> >
> > Maybe we could have an example to elaborate how .drill can be used in a
> > cast-query?
> >
> > Thanks.
> >
> >
> > On Mon, Oct 26, 2015 at 3:31 PM, Neeraja Rentachintala <
> > nrentachintala@maprtech.com> wrote:
> >
> > > Jacques
> > > I have responded to one of your comments on the doc.
> > > Can you please review and comment? I am not clear on the approach you
> > > are suggesting using .drill and what that would mean for the user
> > > experience. It would be great if you could add an example.
> > >
> > > Similar to the other thread (initiated by Julien) about providing file
> > > parsing hints from the query itself for self-service data exploration,
> > > we need this feature to be fairly lightweight from a user experience
> > > point of view. That is, if I as a business user get hold of some
> > > external data and want to take a look by running ad hoc queries on
> > > Drill, I should be able to do so without going through the whole setup
> > > of .drill etc., which will come later as the data is 'operationalized'.
> > >
> > > thanks
> > > -Neeraja
> > >
> > > On Mon, Oct 26, 2015 at 2:49 PM, Jacques Nadeau <jacques@dremio.com>
> > > wrote:
> > >
> > > > Hsuan was kind enough to put together a provocative discussion on the
> > > > mailing list about skipping records. I've started a way-too-long
> > > > thread in the comments discussion but would like to get other
> > > > feedback from the community. The main point of contention I have is
> > > > that the big goal of this design is to provide "data import" like
> > > > capabilities for Drill. In that context, I suggested a scan-based
> > > > approach to schema enforcement (and bad record capture/storage). I
> > > > think it is a simpler approach and solves the vast majority of user
> > > > needs. Hsuan's initial proposal was a much broader-reaching proposal
> > > > that supports an arbitrary number of expression types within project
> > > > and filter (assuming they are proximate to the scan).
> > > >
> > > > Would love to get others' feedback and thoughts on the doc as to what
> > > > the MVP for this feature really is.
> > > >
> > > >
> > > > https://docs.google.com/document/d/1jCeYW924_SFwf-nOqtXrO68eixmAitM-tLngezzXw3Y/edit
> > > >
> > > > --
> > > > Jacques Nadeau
> > > > CTO and Co-Founder, Dremio
> > > >
> > >
> >
>
