Mailing-List: contact dev-help@drill.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@drill.apache.org
Received-SPF: pass (athena.apache.org: message received from 54.164.171.186
 which is an MX secondary for dev@drill.apache.org)
MIME-Version: 1.0
In-Reply-To: 
 <CAMCtme+OzVMkyVwvSO=b1X7rNOpmyss7X5XGQVTRF3eu1AfsPw@mail.gmail.com>
References: 
 <CAKa9qDkwZqHzowLa1gRP+XC=ABQe0329Z5wr+Z_z-ADVeW4XCw@mail.gmail.com>
	<CAFyDVvKWu5nFT3_UUWMFHUV7gZeAtySXsrVk2wUM7=AsZ9E4Rw@mail.gmail.com>
	<CAKa9qD=waiejb19R6rNig2v4AC4iqmop1bFdyiR-EwozUoBffQ@mail.gmail.com>
	<CAJwFCa2AFY3zcj-gHCi_2VGf8do22c=OHZrAtE-s3W-x8ZAXmQ@mail.gmail.com>
	<5539826C.7020408@maprtech.com>
	<CADKBNfcAmFDqCVt9F=BJvk44dLreRzZGJBLDs1GXaAR=8pGiQQ@mail.gmail.com>
	<CAJwFCa2kBXaQy7fvvOcAX_ZOJvY2POH-PkLvj5ETmkpBCpv3Jg@mail.gmail.com>
	<CAMCtmeLQCDpd2sVjQAn3z2kpvNHAoB_YrU4=tUs2nV5dYuLUOw@mail.gmail.com>
	<CAMCtme+OzVMkyVwvSO=b1X7rNOpmyss7X5XGQVTRF3eu1AfsPw@mail.gmail.com>
Date: Thu, 23 Apr 2015 19:00:49 -0700
Message-ID: 
 <CAA_-67eNTida5Z1HAp-ZFvvSuok5gv-bLrB+LCOL7vjLacH03Q@mail.gmail.com>
Subject: Re: Should we make dir* columns only exist when requested?
From: Steven Phillips <sphillips@maprtech.com>
To: dev@drill.apache.org
Content-Type: multipart/alternative; boundary=001a113db0faeffa1205146ec39e

--001a113db0faeffa1205146ec39e
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable

What you are showing for the current behavior seems wrong to me:

$ tree mytdir
mytdir
└── mysdir
    └── myFile.json

$ cat mytdir/mysdir/myFile.json
{a:1,b:2,c:3}
{a:4,b:5,c:6}

0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
+------------+------------+------------+
|     a      |     b      |     c      |
+------------+------------+------------+
| 1          | 2          | 3          |
| 4          | 5          | 6          |
+------------+------------+------------+
2 rows selected (0.274 seconds)
0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`;
+------------+------------+------------+
|     a      |     b      |     c      |
+------------+------------+------------+
| 1          | 2          | 3          |
| 4          | 5          | 6          |
+------------+------------+------------+
2 rows selected (0.152 seconds)
0: jdbc:drill:> select * from `/mytdir/mysdir`;
+------------+------------+------------+
|     a      |     b      |     c      |
+------------+------------+------------+
| 1          | 2          | 3          |
| 4          | 5          | 6          |
+------------+------------+------------+
2 rows selected (0.157 seconds)
0: jdbc:drill:> select * from `mytdir`;
+------------+------------+------------+------------+
|    dir0    |     a      |     b      |     c      |
+------------+------------+------------+------------+
| mysdir     | 1          | 2          | 3          |
| mysdir     | 4          | 5          | 6          |
+------------+------------+------------+------------+

I don't know why, in your example, you are getting a dir0 column when
selecting a specific file. These columns should only be included when
the specified table is a directory that contains subdirectories. Any query
against a specific file, or against a directory that contains only regular
files, should not return dir* columns.
I think the behavior shown above is the correct one.
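
To make the rule concrete, here is a rough Python sketch of how I would
expect the dir* columns to be derived. This is not Drill code; the function
name dir_columns is made up for illustration, and it only models the rule
described above (directory segments between the table root and the file).

```python
from pathlib import PurePosixPath


def dir_columns(table: str, file_path: str) -> dict:
    """Return the hypothetical dirN columns for one file under a table path.

    Only the directory segments strictly between the table root and the
    file become columns; the file name itself is not included.
    """
    table_path = PurePosixPath(table)
    path = PurePosixPath(file_path)
    if path == table_path:
        # The table is the file itself, so there are no directories between.
        return {}
    # relative_to raises ValueError if the file is not under the table.
    relative = path.relative_to(table_path)
    # Drop the last part (the file name); the rest are directory segments.
    return {f"dir{i}": seg for i, seg in enumerate(relative.parts[:-1])}


f = "mytdir/mysdir/myFile.json"
print(dir_columns("mytdir/mysdir/myFile.json", f))  # {}
print(dir_columns("mytdir/mysdir", f))              # {}
print(dir_columns("mytdir", f))                     # {'dir0': 'mysdir'}
```

Under this model a dir0 column shows up only for the `mytdir` query, which
matches the session output above.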

The fact that `mytdir` and `mytdir/mysdir` have different columns is not a
problem, because they are different tables.

I do think Daniel's idea of adding the file name as well makes sense. I'm
also open to Ted's idea of returning a dir array instead of individual
columns.
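
As a rough illustration of those two ideas together (again a Python sketch,
not Drill code), a single array-valued dir plus the file name could be
derived as below. The function name path_context and the key "filename" are
hypothetical; "dir" follows Ted's suggestion. The paths come from Daniel's
examples quoted below.

```python
from pathlib import PurePosixPath


def path_context(table: str, file_path: str) -> dict:
    """Return a hypothetical 'dir' array and 'filename' for one file.

    'dir' holds every directory segment between the table root and the
    file, so files at different depths still share one column.
    """
    path = PurePosixPath(file_path)
    # relative_to raises ValueError if the file is not under the table.
    relative = path.relative_to(PurePosixPath(table))
    return {"dir": list(relative.parts[:-1]), "filename": path.name}


print(path_context("root", "root/2015/01/01/log.json"))
# {'dir': ['2015', '01', '01'], 'filename': 'log.json'}
print(path_context("root", "root/2015/01/log_01.json"))
# {'dir': ['2015', '01'], 'filename': 'log_01.json'}
```

With an array, files at different depths produce the same set of columns,
just with arrays of different lengths.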

On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde <julianhyde@gmail.com> wrote:

> > Ted wrote:
> >
> > For one thing, I can make a really slow version of [find] !
>
> Why does it have to be slow? Seriously, so many of the tools we use
> daily have quasi-query facilities (find, git log, du, ps, netstat) and
> we cobble together queries using complex options and pipelines of unix
> commands. Relational algebra is potentially MORE efficient.
>
> I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily
> and wish I could write ' ... order by count(*) desc'.
>
> On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde <julianhyde@gmail.com> wrote:
> > +1 to returning directories as context. Very useful feature. Could be
> > used to return context for other adapters (e.g. an adapter that
> > concatenates all versions of versioned logfiles).
> >
> > +1 making dir an array, per Ted's suggestion
> >
> > I think dir should not appear in *; thus you'd have to write
> >
> >   select dir, * from `/mytdir/mysdir/myfile.json`
> >
> > This behavior is analogous to Oracle's ROWID. It is not a column as
> > such, but a system function that you can apply to a row.
> >
> > You need to allow qualifiers:
> >
> >   select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as
> > x, `/mytdir/mysdir/myfile2.json` as y
> >
> > and
> >
> >   select dir from `/mytdir/mysdir/myfile.json` as x,
> > `/mytdir/mysdir/myfile2.json` as y
> >
> > would be illegal because dir is ambiguous.
> >
> > You should make dir a reserved word (like ROWID).
> >
> > On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> >> Great point.
> >>
> >> Having the file name itself is very handy.
> >>
> >>
> >> For one thing, I can make a really slow version of [find] !
> >>
> >> (seriously, I would love this)
> >>
> >>
> >> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli <
> >> challapallirahul@gmail.com> wrote:
> >>
> >>> I am also of the opinion that we should not assume knowledge on the user
> >>> front for data discovery. So we should either have 'dir' columns in 'select
> >>> *' or support a variation that Ted suggested.
> >>> Also the folder names complement the actual data in some cases.
> >>>
> >>> - Rahul
> >>>
> >>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay <dbarclay@maprtech.com>
> >>> wrote:
> >>>
> >>> > Regarding the use case in which the user stores information in pathnames:
> >>> >
> >>> > Since Drill supports that use case partially, shouldn't it do so more
> >>> > completely?  In particular, since Drill provides access to subtree
> >>> > pathname segments before the last one (the segments for directories),
> >>> > should Drill provide access to the last one too (the simple file name)?
> >>> >
> >>> >
> >>> > We support reading cases like this:
> >>> > - root/
> >>> > - root/2015/
> >>> > - root/2015/01/
> >>> > - root/2015/01/01/
> >>> > - root/2015/01/01/log.json
> >>> > - root/2015/02/
> >>> > - root/2015/02/02/
> >>> > - root/2015/02/02/log.json
> >>> >
> >>> > In particular, querying "select ... from `root` ..." includes the
> >>> > date-portion segments of the pathnames in the dir0, etc, columns.
> >>> >
> >>> > Note that the user might not redundantly store the dates inside the
> >>> > files themselves, since the dates are known to exist in the directory
> >>> > names.
> >>> >
> >>> >
> >>> > However, we don't support this variation of that case, right?:
> >>> >
> >>> > - root/
> >>> > - root/2015/
> >>> > - root/2015/01/
> >>> > - root/2015/01/log_01.json
> >>> > - root/2015/02/
> >>> > - root/2015/02/log_02.json
> >>> >
> >>> > In particular, Drill includes several segments of the pathname after
> >>> > the root of the subtree, but does not include the last segment--which
> >>> > contains data just as the segments that _are_ included do.
> >>> >
> >>> > (Yes, the last segment usually contains artifacts besides the contained
> >>> > data (e.g., the file extension) and the user would have to specify how
> >>> > to interpret the file simple name segment as data, but the user has to
> >>> > specify the interpretation for the other segments anyway.)
> >>> >
> >>> >
> >>> > Daniel
> >>> >
> >>> >
> >>> >
> >>> > Ted Dunning wrote:
> >>> >
> >>> >> I would propose that dir be an array that contains all of the directories
> >>> >> rather than having multiple values.
> >>> >>
> >>> >> The multiple names are particularly inconvenient if files are at
> >>> >> different depths.
> >>> >>
> >>> >>
> >>> >>
> >>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau <jacques@apache.org>
> >>> >> wrote:
> >>> >>
> >>> >>  I'm specifically arguing that SELECT * doesn't return the columns.
> >>> >>>
> >>> >>> Here is current behavior:
> >>> >>>
> >>> >>> /mytdir/mysdir/myfile.json
> >>> >>> {a:1,b:2,c:3}
> >>> >>> {a:4,b:5,c:6}
> >>> >>>
> >>> >>> select * from `myfile.json`
> >>> >>>
> >>> >>> a, b, c
> >>> >>> 1, 2, 3
> >>> >>> 4, 5, 6
> >>> >>>
> >>> >>> select * from `/mysdir/myfile.json`
> >>> >>>
> >>> >>> dir0, a, b, c
> >>> >>> mysdir, 1, 2, 3
> >>> >>> mysdir, 4, 5, 6
> >>> >>>
> >>> >>> select * from `/mytdir/mysdir/myfile.json`
> >>> >>>
> >>> >>> dir0, dir1, a, b, c
> >>> >>> mytdir, mysdir, 1, 2, 3
> >>> >>> mytdir, mysdir, 4, 5, 6
> >>> >>>
> >>> >>>
> >>> >>> ====================================
> >>> >>> My proposal:
> >>> >>>
> >>> >>> select * from `myfile.json`
> >>> >>> select * from `/mysdir/myfile.json`
> >>> >>> select * from `/mytdir/mysdir/myfile.json`
> >>> >>> ::all produce::
> >>> >>> a, b, c
> >>> >>> 1, 2, 3
> >>> >>> 4, 5, 6
> >>> >>>
> >>> >>> select dir0, a, b, c from `/mysdir/myfile.json`
> >>> >>>
> >>> >>> dir0, a, b, c
> >>> >>> mysdir, 1, 2, 3
> >>> >>> mysdir, 4, 5, 6
> >>> >>>
> >>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json`
> >>> >>>
> >>> >>> dir0, a, b, c
> >>> >>> mytdir, 1, 2, 3
> >>> >>> mytdir, 4, 5, 6
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>>
> >>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha <asinha@maprtech.com> wrote:
> >>> >>>
> >>> >>>  Seems reasonable, as long as SELECT * also returns the dir# columns.
> >>> >>>>
> >>> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau <jacques@apache.org>
> >>> >>>> wrote:
> >>> >>>>
> >>> >>>>  Hey guys,
> >>> >>>>>
> >>> >>>>> I've been thinking that always showing dir# columns seems to alter data
> >>> >>>>> returned from Drill depending on how you select the directory. I'd propose
> >>> >>>>> that we make it so that we only return dir# columns when they are
> >>> >>>>> explicitly requested.
> >>> >>>>>
> >>> >>>>> Thoughts?
> >>> >>>>>
> >>> >>>>>
> >>> >>>>
> >>> >>>
> >>> >>
> >>> >
> >>> > --
> >>> > Daniel Barclay
> >>> > MapR Technologies
> >>> >
> >>>
>


-- 
 Steven Phillips
 Software Engineer

 mapr.com

--001a113db0faeffa1205146ec39e--