drill-dev mailing list archives

From Julien Le Dem <jul...@dremio.com>
Subject Re: dot drill file
Date Fri, 06 Nov 2015 22:54:23 GMT
Sorry, I missed Ted's email because of email problems we had Sunday night
to Monday morning.
Fortunately I can look it up in the archive:
http://mail-archives.apache.org/mod_mbox/drill-dev/201511.mbox/%3CCAJwFCa0ASViUqq8Anhm18AADGq%2BX%3DLqrsxkOz0BTojtyepenLA%40mail.gmail.com%3E

I think we can separate views and raw datasets as they will be configured
very differently.
In this case I'm talking about the requirements for the raw datasets where
we collocate .drill files with data. (.data.drill ?)
The format configuration is the first use case I have in mind but we could
expand this to everything related to reading data (adding a schema for TSV
files, a schema locator for Thrift/protobuf, error handling, etc.).

I would include external ETL processes in the list of things that generate
.drill files (with insert into). I agree that this is typically a machine
generated file. Having it human readable makes it easier to
understand/debug. I can see how a different format would be better if it
grows big but I would think that a separate metadata cache would be the
solution to this. We would generally cache the metadata in a format that is
efficient to access for query planning (predicate push down, etc).

I'm happy with making the .drill file related only to the files in the same
directory.
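
As a rough illustration of the same-directory rule, here is a minimal Python
sketch. It is not Drill code: the file name ".drill", the helper name
load_dot_drill, and the JSON keys are assumptions made for the example. The
only point it shows is that the lookup stops at the directory holding the
data file and never walks up to parent directories.

```python
import json
from pathlib import Path

# Hypothetical file name; the thread also floats ".data.drill".
DOT_DRILL_NAME = ".drill"


def load_dot_drill(data_file):
    """Return the parsed .drill config sitting next to data_file, or None.

    Only the directory containing data_file is checked. Parent directories
    are deliberately not searched, so there are no inheritance rules.
    """
    candidate = Path(data_file).parent / DOT_DRILL_NAME
    if not candidate.is_file():
        return None
    with candidate.open() as f:
        return json.load(f)
```

With this rule a .drill file in directory A says nothing about files in A/B,
which keeps the behavior easy to reason about at the cost of repeating the
file in each directory.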

Regarding error handling directives, we can keep it a separate discussion
as we can start implementing .drill files without it and possibly add it
later if we have a consensus.
My experience is that it is very common that the systems generating the
files will produce a small number of errors but won't validate or fix their
own output. Typically you have hundreds or thousands of machines generating
log files and every now and then a daemon dies and truncates the file or
writes a few corrupted records. Possibly those files get centralized by
systems that don't know anything about the format they are supposed to be
in. In this case it makes sense to define the acceptable error threshold
along with the files rather than having to add it at the query level. The
default threshold will still be 0 - meaning fail at the first error - and
select with options will allow overriding it.
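
To make the threshold idea concrete, here is a minimal Python sketch under
stated assumptions. It is not Drill code: the threshold is modeled as a count
of bad records (it could just as well be a ratio), the key "max_errors" and
the function names are hypothetical, and JSON lines stand in for whatever
format the files are really in.

```python
import json


class TooManyBadRecords(Exception):
    """Raised when the number of bad records exceeds the threshold."""


def read_records(lines, max_errors=0):
    """Parse one JSON record per line, tolerating up to max_errors bad lines.

    max_errors=0 is the default described above: fail at the first error.
    Returns the list of good records and the number of skipped bad ones.
    """
    good, bad = [], 0
    for line in lines:
        try:
            good.append(json.loads(line))
        except json.JSONDecodeError:
            bad += 1
            if bad > max_errors:
                raise TooManyBadRecords(
                    f"{bad} bad records, threshold is {max_errors}"
                )
    return good, bad


def effective_threshold(dot_drill, query_options):
    """Pick the threshold: a query-level option overrides the .drill value.

    Both arguments are plain dicts; "max_errors" is a hypothetical key.
    """
    if "max_errors" in query_options:
        return query_options["max_errors"]
    return dot_drill.get("max_errors", 0)
```

The second function reflects the precedence suggested here: the value stored
with the files is the default for that dataset, and a query can still tighten
or loosen it.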



On Fri, Nov 6, 2015 at 1:30 PM, Jason Altekruse <altekrusejason@gmail.com>
wrote:

> Hey Parth,
>
> I think I can provide a little clarification on this point you mentioned:
>
> > BTW, I'm not convinced that record level error handling directives belong
> > in this. I know Jacques had some thoughts about that, but I wouldn't mind
> > if someone explained it to me again :)
>
> I believe the reason Jacques was proposing this be handled with a dot drill
> file was to handle
> his concern with the initial planning time proposal made by Sean. With the
> initial proposal we could have conflicting
> results based on where a project appeared in the plan. If the project with
> a cast appeared above
> a scan, the proposal was to change the behavior of the read itself.
> Unfortunately we don't have a concept of
> an operation pinned above a scan, and actually quite frequently where we
> cannot get benefits
> of pruning by pushing something down, we generally push projects up
> the tree, assuming that other operations like filters, joins and aggregates
> are all contracting.
> In these cases we will need to evaluate the expressions in the project on
> fewer rows if we wait
> for other operations to reduce the overall size.
>
> Dot drill files are meant to add additional information for the scan
> operation. Assigning schema to
> a format that otherwise lacks it, as well as defining behavior for when
> trying to materialize into the schema
> fails (warn, error, write corrupt rows to a log file) seem like good
> candidates for making use of this feature.
>
> On Fri, Nov 6, 2015 at 1:13 PM, Parth Chandra <parthc@apache.org> wrote:
>
> > Hi Julien,
> >
> >   In an earlier discussion, regarding 'insert into' we had discussed the
> > idea of keeping a merged schema (a common schema that applies to all the
> > files in the directory) in a .drill file.  The metadata cache file also has
> > the same information and, in addition, has stats.  We never did specify
> > what a merged schema contains.
> >
> >   My understanding was that the .drill file, when available, becomes the
> > source of schema information. I can see both the metadata cache and the
> > insert into functionality using a common format. For these two sets of
> > functionality, I don't see a need for the file to be human readable and if
> > a more efficient format is available, I think we should use that. This is
> > particularly true if we need to keep per file information.
> >
> >   Is that how we are thinking of the .drill file? Or are we talking about a
> > .drill.format (?) file. I guess this is similar to Ted's question.
> >
> >   BTW, I'm not convinced that record level error handling directives belong
> > in this. I know Jacques had some thoughts about that, but I wouldn't mind
> > if someone explained it to me again :) . To me record level error handling
> > is really a query level directive, not something that applies to all the
> > data (in a directory) all the time. Keeping an open mind on this though.
> >
> >   Something about the inheritance rules based on similar questions
> > regarding the metadata cache file - The metadata cache file is built based
> > on all the files in the hierarchy under the current directory. So if you
> > have a hierarchy
> >   A
> >    -- B
> >       -- C
> >    -- D
> > there is a metadata cache file in A, B, C and D. The cache file in A
> > contains info on all the files in B, C and D. If you update the directory C
> > and refresh metadata for C, then _only_ C will get updated and the changes
> > are not propagated upwards. If you refresh metadata for A, all the changes
> > are seen by A, B, and C. For the use case you're outlining, I would think
> > looking only at the directory the files are in should suffice.
> >
> >
> > Parth
> >
> >
> >
> >
> > On Sun, Nov 1, 2015 at 10:18 PM, Julien Le Dem <julien@dremio.com> wrote:
> >
> > > Hello,
> > > I'd like to capture the requirement for dot drill files.
> > > Here is my understanding:
> > > A ".drill" file is in JSON format and is a mechanism provided by the
> > > FileSystemPlugin to define the format plugin to use collocated with the
> > > files containing the data in a file system. It will override any extension
> > > or magic number header mapping.
> > > It will enable configuring the format plugin and record level error
> > > handling mechanism (bad record skipping, etc). It could be extended to
> > > support more in the future.
> > > Is this correct? Are there inheritance rules if more than one file is found
> > > in the hierarchy? Does drill look only at the dir containing the files or
> > > also all parent directories?
> > >
> > > --
> > > Julien
> > >
> >
>



-- 
Julien
