Return-Path: X-Original-To: apmail-drill-dev-archive@www.apache.org Delivered-To: apmail-drill-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 3A18F17F2C for ; Fri, 24 Apr 2015 02:02:07 +0000 (UTC) Received: (qmail 39568 invoked by uid 500); 24 Apr 2015 02:02:07 -0000 Delivered-To: apmail-drill-dev-archive@drill.apache.org Received: (qmail 39515 invoked by uid 500); 24 Apr 2015 02:02:07 -0000 Mailing-List: contact dev-help@drill.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@drill.apache.org Delivered-To: mailing list dev@drill.apache.org Received: (qmail 39504 invoked by uid 99); 24 Apr 2015 02:02:06 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Apr 2015 02:02:06 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: message received from 54.164.171.186 which is an MX secondary for dev@drill.apache.org) Received: from [54.164.171.186] (HELO mx1-us-east.apache.org) (54.164.171.186) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 24 Apr 2015 02:02:01 +0000 Received: from mail-qc0-f176.google.com (mail-qc0-f176.google.com [209.85.216.176]) by mx1-us-east.apache.org (ASF Mail Server at mx1-us-east.apache.org) with ESMTPS id 330C743BAB for ; Fri, 24 Apr 2015 02:01:41 +0000 (UTC) Received: by qcbii10 with SMTP id ii10so19251618qcb.2 for ; Thu, 23 Apr 2015 19:00:50 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20130820; h=x-gm-message-state:mime-version:in-reply-to:references:date :message-id:subject:from:to:content-type; bh=8iYrcCL86NOGkaldbQZrFUh1D8tItE/afu+OcBFgCAg=; b=AKCenmBJ8VAMnVD6G+yXqx5ShmPQ43bpWAWkZPe9MeVdQxH7vwGmTYhSk7W1hTTU/K CZjOcfonitOIf0PG5sfav8Dth1sYoYZ9O6ApiG1FaZ1eeu4ec5xY+XmddMNSdfOQWEOl qELLKWt7VSntfzde5h1/bu/ZFr8QM3eezgYmm2g4o2LhZTkeT++3cceFwe6/X5EUPi0G fOjduGBKjw/HHAIrLQf6OedD3QCN2cKVh5b6feIw1DVKtRV222wiz3j4rfRNYmyeMqip WTGamhO/u2KE8aZLcj0aPsqTH4ZsWjZfPSmt1B+XaXwubVpBE7/uNbdUm0Svtzk4M6Sh 3v8A== X-Gm-Message-State: ALoCoQm6T0MvZsrostA7UhRnWnYgR/4JuaVvSNytsQNiUwrarzETMfYvwsFnthapsi1S5RGOdN5k MIME-Version: 1.0 X-Received: by 10.55.40.215 with SMTP id o84mr10780554qko.93.1429840849789; Thu, 23 Apr 2015 19:00:49 -0700 (PDT) Received: by 10.140.108.246 with HTTP; Thu, 23 Apr 2015 19:00:49 -0700 (PDT) In-Reply-To: References: <5539826C.7020408@maprtech.com> Date: Thu, 23 Apr 2015 19:00:49 -0700 Message-ID: Subject: Re: Should we make dir* columns only exist when requested? From: Steven Phillips To: dev@drill.apache.org Content-Type: multipart/alternative; boundary=001a113db0faeffa1205146ec39e X-Virus-Checked: Checked by ClamAV on apache.org --001a113db0faeffa1205146ec39e Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable What you are showing for the current behavior seems wrong to me: $ tree mytdir mytdir =E2=94=94=E2=94=80=E2=94=80 mysdir =E2=94=94=E2=94=80=E2=94=80 myFile.json $ cat mytdir/mysdir/myFile.json {a:1,b:2,c:3} {a:4,b:5,c:6} 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`; +------------+------------+------------+ | a | b | c | +------------+------------+------------+ | 1 | 2 | 3 | | 4 | 5 | 6 | +------------+------------+------------+ 2 rows selected (0.274 seconds) 0: jdbc:drill:> select * from `mytdir/mysdir/myFile.json`; +------------+------------+------------+ | a | b | c | +------------+------------+------------+ | 1 | 2 | 3 | | 4 | 5 | 6 | +------------+------------+------------+ 2 rows selected (0.152 seconds) 0: jdbc:drill:> select * from `/mytdir/mysdir`; +------------+------------+------------+ | a | b | c | +------------+------------+------------+ | 1 | 2 | 3 | | 4 | 5 | 6 | +------------+------------+------------+ 2 rows selected (0.157 seconds) 0: jdbc:drill:> select * from `mytdir`; +------------+------------+------------+------------+ | dir0 | a | b | c | +------------+------------+------------+------------+ | mysdir | 1 | 2 | 3 | | mysdir | 4 | 5 | 6 | +------------+------------+------------+------------+ I don't know why in your example, you are getting a dir0 directory when selecting a specific file. These directories should only be included when the specified table is a directory which contains subdirectories. Any query to a specific file or to a directory that only contains regular files should not return dir* columns. I think this is the correct behavior. The fact that `mytidir` and `mytdir/mysdir` have different columns is not a problem, because they are different tables. I do think Daniel's idea of adding the file name as well makes sense. I'm also open to Ted's idea for return a dir array instead of individual columns. On Thu, Apr 23, 2015 at 6:36 PM, Julian Hyde wrote: > > Ted wrote: > > > > For one thing, I can make a really slow version of [find] ! > > Why does it have to be slow? Seriously, so many of the tools we use > daily have quasi-query facilities (find, git log, du, ps, netstat) and > we cobble together queries using complex options and pipelines of unix > commands. Relational algebra is a potentially MORE efficient. > > I find myself writing ' ... | sort | uniq -c | sort -nr' almost daily > and wish I could write ' ... order by count(*) desc'. > > On Thu, Apr 23, 2015 at 6:27 PM, Julian Hyde wrote= : > > +1 to returning directories as context. Very useful feature. Could be > > used to return context for other adapters (e.g. an adapter that > > concatenates all versions of versioned logfiles). > > > > +1 making dir an array, per Ted's suggestion > > > > I think dir should not appear in *; thus you'd have to write > > > > select dir, * from `/mytdir/mysdir/myfile.json` > > > > This behavior is analogous to Oracle's ROWID. It is not a column as > > such, but a system function that you can apply to a row. > > > > You need to allow qualifiers: > > > > select x.dir, x.*, y.dir, y.* from `/mytdir/mysdir/myfile.json` as > > x, `/mytdir/mysdir/myfile2.json` as y > > > > and > > > > select dir from `/mytdir/mysdir/myfile.json` as x, > > `/mytdir/mysdir/myfile2.json` as y > > > > would be illegal because dir is ambiguous. > > > > You should make dir a reserved word (like ROWID). > > > > On Thu, Apr 23, 2015 at 5:12 PM, Ted Dunning > wrote: > >> Great point. > >> > >> Having the file name itself is very handy. > >> > >> > >> For one thing, I can make a really slow version of [find] ! > >> > >> (seriously, I would love this) > >> > >> > >> On Thu, Apr 23, 2015 at 7:48 PM, rahul challapalli < > >> challapallirahul@gmail.com> wrote: > >> > >>> I am also under the opinion that we should not assume knowledge on th= e > user > >>> front for data discovery. So we should either have 'dir' columns in > 'select > >>> *' or support a variation that Ted suggested. > >>> Also the folder names compliment the actual data in some cases. > >>> > >>> - Rahul > >>> > >>> On Thu, Apr 23, 2015 at 4:38 PM, Daniel Barclay > > >>> wrote: > >>> > >>> > Regarding the use case in which the user stores information in > pathnames: > >>> > > >>> > Since Drill supports that use case partially, shouldn't it do so mo= re > >>> > completely? In particular, since Drill provides access to subtree > >>> > pathname segments before the last one (the segments for directories= ), > >>> > should Drill provide access to the last one too (the simple file > name)? > >>> > > >>> > > >>> > We support reading cases like this: > >>> > - root/ > >>> > - root/2015/ > >>> > - root/2015/01/ > >>> > - root/2015/01/01/ > >>> > - root/2015/01/01/log.json > >>> > - root/2015/02/ > >>> > - root/2015/02/02/ > >>> > - root/2015/02/02/log.json > >>> > > >>> > In particular, querying "select ... from `root` ..." includes the > >>> > date-portion segments of the pathnames in the dir0, etc, columns. > >>> > > >>> > Note that the user might not redundantly store the dates inside the > >>> > files themselves, since the dates are known to exist in the directo= ry > >>> > names. > >>> > > >>> > > >>> > However, we don't support this variation of that case, right?: > >>> > > >>> > - root/ > >>> > - root/2015 > >>> > - root/2015/01/ > >>> > - root/2015/01/log_01.json > >>> > - root/2015/02/ > >>> > - root/2015/02/log_02.json > >>> > > >>> > In particular, Drill includes several segments of the pathname afte= r > >>> > the root of the subtree, but does not include the last segment--whi= ch > >>> > contains data just as the segments that _are_ included do. > >>> > > >>> > (Yes, the last segment usually contains artifacts besides the > contained > >>> > data (e.g., the file extension) and the user would have to specify > how > >>> > to interpret the file simple name segment as data, but the user has > to > >>> > specify the interpretation for the other segments anyway.) > >>> > > >>> > > >>> > Daniel > >>> > > >>> > > >>> > > >>> > Ted Dunning wrote: > >>> > > >>> >> I would propose that dir be an array that contains all of the > >>> directories > >>> >> rather than having multiple values. > >>> >> > >>> >> The multiple names are particularly inconvenient if files are are > >>> >> different > >>> >> depths. > >>> >> > >>> >> > >>> >> > >>> >> On Thu, Apr 23, 2015 at 5:56 PM, Jacques Nadeau > > >>> >> wrote: > >>> >> > >>> >> I'm specifically arguing that SELECT * doesn't return the columns= . > >>> >>> > >>> >>> Here is current behavior: > >>> >>> > >>> >>> /mytdir/mysdir/myfile.json > >>> >>> {a:1,b:2,c:3} > >>> >>> {a:4,b:5,c:6} > >>> >>> > >>> >>> select * from `myfile.json` > >>> >>> > >>> >>> a, b, c > >>> >>> 1, 2, 3 > >>> >>> 4, 5, 6 > >>> >>> > >>> >>> select * from `/mysdir/myfile.json` > >>> >>> > >>> >>> dir0 a, b, c > >>> >>> mysdir, 1, 2, 3 > >>> >>> mysdir, 4, 5, 6 > >>> >>> > >>> >>> select * from `/mytdir/mysdir/myfile.json` > >>> >>> > >>> >>> dir0, dir1 a, b, c > >>> >>> mytdir, mysdir, 1, 2, 3 > >>> >>> mytdir, mysdir, 4, 5, 6 > >>> >>> > >>> >>> > >>> >>> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D > >>> >>> My proposal: > >>> >>> > >>> >>> select * from `myfile.json` > >>> >>> select * from `/mysdir/myfile.json` > >>> >>> select * from `/mytdir/mysdir/myfile.json` > >>> >>> ::all produce:: > >>> >>> a, b, c > >>> >>> 1, 2, 3 > >>> >>> 4, 5, 6 > >>> >>> > >>> >>> select dir0, a, b, c from `/mysdir/myfile.json` > >>> >>> > >>> >>> dir0 a, b, c > >>> >>> mysdir, 1, 2, 3 > >>> >>> mysdir, 4, 5, 6 > >>> >>> > >>> >>> select dir0, a, b, c from `/mytdir/mysdir/myfile.json` > >>> >>> > >>> >>> dir0 a, b, c > >>> >>> mytdir, 1, 2, 3 > >>> >>> mytdir, 4, 5, 6 > >>> >>> > >>> >>> > >>> >>> > >>> >>> > >>> >>> On Thu, Apr 23, 2015 at 5:42 PM, Aman Sinha > >>> wrote: > >>> >>> > >>> >>> Seems reasonable, as long as SELECT * also returns the dir# > columns. > >>> >>>> > >>> >>>> On Thu, Apr 23, 2015 at 2:34 PM, Jacques Nadeau < > jacques@apache.org> > >>> >>>> wrote: > >>> >>>> > >>> >>>> Hey guys, > >>> >>>>> > >>> >>>>> I've been thinking that always showing dir# columns seems to > alter > >>> data > >>> >>>>> returned from Drill depending on how you select the directory. > I'd > >>> >>>>> > >>> >>>> propose > >>> >>>> > >>> >>>>> that we make it so that we only return dir# columns when they a= re > >>> >>>>> explicitly requested. > >>> >>>>> > >>> >>>>> Thoughts? > >>> >>>>> > >>> >>>>> > >>> >>>> > >>> >>> > >>> >> > >>> > > >>> > -- > >>> > Daniel Barclay > >>> > MapR Technologies > >>> > > >>> > --=20 Steven Phillips Software Engineer mapr.com --001a113db0faeffa1205146ec39e--