drill-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Should we make dir* columns only exist when requested?
Date Fri, 24 Apr 2015 00:12:10 GMT
Great point.

Having the file name itself is very handy.


For one thing, I can make a really slow version of [find] !

(seriously, I would love this)


On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
challapallirahul@gmail.com> wrote:

> I am also of the opinion that we should not assume knowledge on the user's
> part for data discovery. So we should either have 'dir' columns in 'select
> *' or support the variation that Ted suggested.
> Also, the folder names complement the actual data in some cases.
>
> - Rahul
>
> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <dbarclay@maprtech.com>
> wrote:
>
> > Regarding the use case in which the user stores information in pathnames:
> >
> > Since Drill supports that use case partially, shouldn't it do so more
> > completely?  In particular, since Drill provides access to subtree
> > pathname segments before the last one (the segments for directories),
> > should Drill provide access to the last one too (the simple file name)?
> >
> >
> > We support reading cases like this:
> > - root/
> > - root/2015/
> > - root/2015/01/
> > - root/2015/01/01/
> > - root/2015/01/01/log.json
> > - root/2015/02/
> > - root/2015/02/02/
> > - root/2015/02/02/log.json
> >
> > In particular, querying "select ... from `root` ..." includes the
> > date-portion segments of the pathnames in the dir0, etc, columns.
> >
> > Note that the user might not redundantly store the dates inside the
> > files themselves, since the dates are known to exist in the directory
> > names.
> >
> >
> > However, we don't support this variation of that case, right?:
> >
> > - root/
> > - root/2015/
> > - root/2015/01/
> > - root/2015/01/log_01.json
> > - root/2015/02/
> > - root/2015/02/log_02.json
> >
> > In particular, Drill includes several segments of the pathname after
> > the root of the subtree, but does not include the last segment--which
> > contains data just as the segments that _are_ included do.
> >
> > (Yes, the last segment usually contains artifacts besides the contained
> > data (e.g., the file extension) and the user would have to specify how
> > to interpret the file simple name segment as data, but the user has to
> > specify the interpretation for the other segments anyway.)
> >
> >
> > Daniel
> >
> >
> >
> > Ted Dunning wrote:
> >
> >> I would propose that dir be an array that contains all of the
> >> directories rather than having multiple values.
> >>
> >> The multiple names are particularly inconvenient if files are at
> >> different depths.
> >>
> >>
> >>
> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <jacques@apache.org>
> >> wrote:
> >>
> >>  I'm specifically arguing that SELECT * doesn't return the columns.
> >>>
> >>> Here is current behavior:
> >>>
> >>> /mytdir/mysdir/myfile.json
> >>> {a:1,b:2,c:3}
> >>> {a:4,b:5,c:6}
> >>>
> >>> select * from `myfile.json`
> >>>
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select * from `/mysdir/myfile.json`
> >>>
> >>> dir0, a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0, dir1, a, b, c
> >>> mytdir, mysdir, 1, 2, 3
> >>> mytdir, mysdir, 4, 5, 6
> >>>
> >>>
> >>> ====================================
> >>> My proposal:
> >>>
> >>> select * from `myfile.json`
> >>> select * from `/mysdir/myfile.json`
> >>> select * from `/mytdir/mysdir/myfile.json`
> >>> ::all produce::
> >>> a, b, c
> >>> 1, 2, 3
> >>> 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mysdir/myfile.json`
> >>>
> >>> dir0, a, b, c
> >>> mysdir, 1, 2, 3
> >>> mysdir, 4, 5, 6
> >>>
> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >>>
> >>> dir0, a, b, c
> >>> mytdir, 1, 2, 3
> >>> mytdir, 4, 5, 6
> >>>
> >>>
> >>>
> >>>
> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <asinha@maprtech.com>
> >>> wrote:
> >>>
> >>>  Seems reasonable, as long as SELECT * also returns the dir# columns.
> >>>>
> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <jacques@apache.org>
> >>>> wrote:
> >>>>
> >>>>  Hey guys,
> >>>>>
> >>>>> I've been thinking that always showing dir# columns seems to alter
> >>>>> data returned from Drill depending on how you select the directory.
> >>>>> I'd propose that we make it so that we only return dir# columns when
> >>>>> they are explicitly requested.
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> > --
> > Daniel Barclay
> > MapR Technologies
> >
>
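
To make the two ideas in this thread concrete (Ted's single array of directory names, and Daniel's request for the file's simple name as well), here is a minimal Python sketch. It is not Drill code: the function names path_columns and dir_columns are hypothetical, and it assumes only that directory segments are counted from the queried root, as in Daniel's example layout.

```python
from pathlib import PurePosixPath


def path_columns(root, file_path):
    """Split a file path, relative to the queried root, into an array of
    directory segments plus the simple file name (hypothetical helper)."""
    rel = PurePosixPath(file_path).relative_to(PurePosixPath(root))
    *dirs, filename = rel.parts
    return {"dirs": list(dirs), "filename": filename}


def dir_columns(root, file_path):
    """Expand the same directory segments into numbered dir0, dir1, ...
    columns, one per directory level below the root (hypothetical helper)."""
    dirs = path_columns(root, file_path)["dirs"]
    return {f"dir{i}": name for i, name in enumerate(dirs)}


# Daniel's second layout: the last segment carries data too.
cols = path_columns("root", "root/2015/01/log_01.json")
numbered = dir_columns("root", "root/2015/01/log_01.json")
```

With the array form, files at different depths simply yield arrays of different lengths, whereas the numbered form yields a different set of column names per depth.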
