drill-user mailing list archives

From Neeraja Rentachintala <nrentachint...@maprtech.com>
Subject Re: Querying wide rows with Drill
Date Tue, 11 Nov 2014 22:59:02 GMT
Mappify is officially called the 'kvgen' function.
You can use this function today by building Drill from the master branch.
The 0.7 release is planned for around the end of the month.

-thanks
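
For readers following the thread, here is a minimal Python sketch of what the
kvgen and flatten steps discussed here do to a wide row. It is not Drill code:
the `kvgen` and `flatten` functions below are plain Python stand-ins, and the
sample rows, the `row_key`/`cols`/`kv` field names, and the values are all made
up for illustration.

```python
def kvgen(row_map):
    """Turn a map of column -> value into a list of {key, value} pairs."""
    return [{"key": k, "value": v} for k, v in row_map.items()]


def flatten(records, field):
    """Emit one output record per element of the repeated field."""
    out = []
    for rec in records:
        for element in rec[field]:
            flat = dict(rec)       # copy the non-repeated fields
            flat[field] = element  # replace the list with a single element
            out.append(flat)
    return out


# Hypothetical wide rows: each row has its own sparse set of columns.
rows = [
    {"row_key": "sensor1", "cols": {"t100": 1.5, "t101": 1.7}},
    {"row_key": "sensor2", "cols": {"t250": 0.3}},
]

# Step 1: convert each row's column map into a repeated key/value list.
with_pairs = [{"row_key": r["row_key"], "kv": kvgen(r["cols"])} for r in rows]

# Step 2: flatten the repeated list into one record per key/value pair.
flat_rows = flatten(with_pairs, "kv")

for r in flat_rows:
    print(r["row_key"], r["kv"]["key"], r["kv"]["value"])
```

The point of the two-step shape is that the schema after flattening is fixed
(a row key, a key, a value) no matter how many distinct columns the source
rows contain.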

On Tue, Nov 11, 2014 at 2:03 PM, Kyrill Alyoshin <kyrill007@gmail.com>
wrote:

> Yes, Ted. We will have at most a few dozen columns per record (maybe a
> hundred), so flatten should work fine. Is there a scheduled Drill release
> in which the "mappify" operator is going to be available?
>
> And Steven and Ted, thank you very much for taking the time to reply!
>
> -Kyrill
>
> On Tue, Nov 11, 2014 at 3:54 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
>
> > From the context of the original post, I don't expect to see more than
> > dozens to perhaps thousands of columns.  In fact, it is common for this
> > wide table format to represent the minority of the data and for a blobish
> > array format to make up the remainder.  I think that the approaches
> > detailed in our book on time series [1] are quite similar to the OP's
> > intent.
> >
> > The retrieval typically requires reading thousands to hundreds of
> > thousands of rows and results in thousands to low millions of rows after
> > flattening. There would likely be some performance boost if the data
> > source drives all the way to the fully flattened form, but I am dubious
> > about the scale of that improvement compared to the flatten approach.
> >
> > I base this on the experience from the Java code base in Open TSDB.
> > There, it is common for even large blob format queries to be dominated by
> > actual processing rather than data marshalling from the database format.
> > Drill is likely to do even better since it can fully parallelize the
> > marshalling across many drill bits.
> >
> > [1] https://www.mapr.com/blog/it%E2%80%99s-about-time-time-Series-Databases-New-Ways-to-Store-and-Access-Data#.VGJ3cfTF8_M
> >
> >
> > On Tue, Nov 11, 2014 at 2:45 PM, Steven Phillips <sphillips@maprtech.com>
> > wrote:
> >
> > > To clarify, when I said a new HBaseRecordReader, I was referring to the
> > > Drill class that reads data using the HBase client and writes into the
> > > ValueVectors. In the current implementation, we have a vector for each
> > > column, which would mean for a sparse table, we would end up with
> > > potentially millions of vectors, which would not be very efficient at
> > > all.
> > >
> > > In the new implementation, we would simply have a RepeatedMapVector,
> > > with a Key and Value vector nested inside. You are correct that this
> > > will work without any special support from the DB layer.
> > >
> > > On Tue, Nov 11, 2014 at 12:37 PM, Ted Dunning <ted.dunning@gmail.com>
> > > wrote:
> > >
> > > > On Tue, Nov 11, 2014 at 1:46 PM, Steven Phillips
> > > > <sphillips@maprtech.com> wrote:
> > > >
> > > > > For this to really work well in your case, I think we need to be
> > > > > able to push the "mappify" operation into the scan. In other words,
> > > > > we need the hbase scan to output the records in the desired
> > > > > key/value format. Currently, hbase scan will output in the normal,
> > > > > sparse column schema, and then a separate operator would convert it.
> > > > >
> > > > > One way to do this would be to write a new HBaseRecordReader that
> > > > > outputs in the key/value mode, and then have a System/session option
> > > > > to set which mode to use.
> > > > >
> > > >
> > > > Actually, I think that what you suggest would be plenty fast even
> > > > without any special support in the DB layer.  The key limitation is
> > > > rows per second retrieved from the DB, not rows per second processed
> > > > by drill.
> > > >
> > > > This is *very* exciting.
> > > >
> > >
> > >
> > >
> > > --
> > >  Steven Phillips
> > >  Software Engineer
> > >
> > >  mapr.com
> > >
> >
>
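
The difference between the two reader layouts described in the thread (one
vector per column versus a repeated key/value structure) can be sketched in
plain Python. This is only an illustration under stated assumptions: it does
not use Drill's ValueVector or RepeatedMapVector classes, the function names
are hypothetical, and the offsets-based layout is an assumed way of marking
row boundaries, not a description of Drill's internals.

```python
def per_column_vectors(rows):
    """One vector per distinct column; absent cells are stored as None."""
    columns = sorted({c for row in rows for c in row})
    return {c: [row.get(c) for row in rows] for c in columns}


def repeated_key_value(rows):
    """Two nested vectors (keys, values) plus offsets marking row boundaries."""
    offsets, keys, values = [0], [], []
    for row in rows:
        for k, v in row.items():
            keys.append(k)
            values.append(v)
        offsets.append(len(keys))  # end of this row's key/value run
    return {"offsets": offsets, "keys": keys, "values": values}


# Hypothetical sparse table: every row uses its own column names.
rows = [{"c1": 10}, {"c2": 20, "c3": 30}, {"c4": 40}]

wide = per_column_vectors(rows)
compact = repeated_key_value(rows)

# The per-column layout needs one vector for every distinct column name.
print(len(wide))
# In the key/value layout, row i spans keys[offsets[i]:offsets[i + 1]].
print(compact["offsets"])
```

In the per-column layout the number of vectors grows with the number of
distinct column names, and most cells are empty for a sparse table. In the
key/value layout the number of vectors stays constant and only populated
cells are stored.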
