incubator-drill-user mailing list archives

From Kyrill Alyoshin <kyrill...@gmail.com>
Subject Re: Querying wide rows with Drill
Date Tue, 11 Nov 2014 22:03:52 GMT
Yes, Ted. We will have at most a few dozen columns per record (maybe a
hundred), so flatten should work fine. Is there a scheduled Drill release
in which the "mappify" operator is going to be available?
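
For concreteness, the query pattern I have in mind is roughly the
following. This is only a sketch: it assumes the operator surfaces as a
SQL function (called KVGEN here purely for illustration), and the table
and column-family names are made up.

  SELECT t.row_key, t.kv.`key` AS col_name, t.kv.`value` AS col_value
  FROM (
    SELECT row_key, FLATTEN(KVGEN(f)) AS kv
    FROM hbase.`wide_table`
  ) t;

Each sparse row would come back as one record per stored (column, value)
pair, which flatten then lets us query relationally.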

And Steven and Ted, thank you very much for taking the time to reply!

-Kyrill

On Tue, Nov 11, 2014 at 3:54 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> From the context of the original post, I don't expect to see more than
> dozens to perhaps thousands of columns.  In fact, it is common for this
> wide table format to represent the minority of the data and for a blobish
> array format to make up the remainder.  I think that the approaches
> detailed in our book on time series [1] are quite similar to the OP's
> intent.
>
> The retrieval typically requires reading thousands to hundreds of thousands
> of rows and results in thousands to low millions of rows after flattening.
> There would likely be some performance boost if the data source drives all
> the way to the fully flattened form, but I am dubious about the scale of
> that improvement compared to the flatten approach.
>
> I base this on experience with the Java code base in OpenTSDB.  There,
> it is common for even large blob-format queries to be dominated by actual
> processing rather than by data marshalling from the database format.
> Drill is likely to do even better since it can fully parallelize the
> marshalling across many drillbits.
>
> [1]
>
> https://www.mapr.com/blog/it%E2%80%99s-about-time-time-Series-Databases-New-Ways-to-Store-and-Access-Data#.VGJ3cfTF8_M
>
>
> On Tue, Nov 11, 2014 at 2:45 PM, Steven Phillips <sphillips@maprtech.com>
> wrote:
>
> > To clarify, when I said a new HBaseRecordReader, I was referring to the
> > Drill class that reads data using the HBase client and writes into the
> > ValueVectors. In the current implementation, we have a vector for each
> > column, which would mean that for a sparse table we would end up with
> > potentially millions of vectors, which would not be very efficient at all.
> >
> > In the new implementation, we would simply have a RepeatedMapVector with
> > a key vector and a value vector nested inside. You are correct that this
> > will work without any special support from the DB layer.
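> >
> > As a sketch of the difference (row and field names here are purely
> > illustrative), the current per-column layout materializes a row as
> >
> >   { "row_key": "r1", "f": { "c0017": 1.5, "c9431": 2.5 } }
> >
> > with one vector per distinct column qualifier, while the key/value mode
> > would materialize the same row as
> >
> >   { "row_key": "r1", "f": [ { "key": "c0017", "value": 1.5 },
> >                             { "key": "c9431", "value": 2.5 } ] }
> >
> > so the vector count stays fixed no matter how many qualifiers exist
> > across the table.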
> >
> > On Tue, Nov 11, 2014 at 12:37 PM, Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > On Tue, Nov 11, 2014 at 1:46 PM, Steven Phillips <sphillips@maprtech.com>
> > > wrote:
> > >
> > > > For this to really work well in your case, I think we need to be able
> > > > to push the "mappify" operation into the scan. In other words, we need
> > > > the HBase scan to output the records in the desired key/value format.
> > > > Currently, the HBase scan will output in the normal, sparse column
> > > > schema, and then a separate operator would convert it.
> > > >
> > > > One way to do this would be to write a new HBaseRecordReader that
> > > > outputs in the key/value mode, and then have a system/session option
> > > > to set which mode to use.
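> > > >
> > > > Switching between the two could then look something like this (a
> > > > sketch only; the option name is made up):
> > > >
> > > >   ALTER SESSION SET `store.hbase.kv_mode` = true;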
> > > >
> > >
> > > Actually, I think that what you suggest would be plenty fast even
> > > without any special support in the DB layer.  The key limitation is
> > > rows per second retrieved from the DB, not rows per second processed
> > > by Drill.
> > >
> > > This is *very* exciting.
> > >
> >
> > --
> >  Steven Phillips
> >  Software Engineer
> >
> >  mapr.com
> >
>
